> Since I seem to be the main regex hacker for Parrot, I'll respond to > this as best I can. > > Currently, we are using bitmaps for character classes. Well, sort of. > A Bitmap in Parrot is defined like this: > > typedef struct bitmap_t { > char* bmp; > STRING* bigchars; > } Bitmap; > > Characters <256 are stored as a bitmap in bmp; other characters are > stored in bigchars and linear-searched. This is a temporary measure,
This is similar to how Perl 5 does them: the low eight bits are in a 32-byte bitmap, the "wide characters" are stored after it (in a funky data structure, I won't go into more detail so that people won't lose their lunch/breakfast/meal) > since Parrot isn't yet dealing with many characters outside of ASCII. > Several schemes have been proposed for the final version; I'm currently > leaning towards an array of arrays of arrays of bitmaps (one level for > each byte of the character): > > INTVAL ch; > return > bmp->bmp[FIRST_BYTE(ch)][SECOND_BYTE(ch)][THIRD_BYTE(ch)][FORTH_BYTE(ch) > >>3] & (1<<(FORTH_BYTE(ch) & 7)); dup + dup * ... oh, you meant FOURTH. > Ungainly, but it works. It would actually be a bit more > complicated--only the arrays that we actually used would be allocated to > save space--but you get the idea. (However, I'm quite flexible on the > implementation chosen. I'll look at the ideas you propose in more > detail; if anyone else has any suggestions, suggest them.) Ungainly, yes. (1) There are 5.125 bytes in Unicode, not four. (2) I think the above would suffer from the same problem as one common suggestion, two-level bitmaps (though I think the above would suffer less, being of finer granularity): the problem is that a lot of space is wasted, since the "usage patterns" of Unicode character classes tend to be rather scattered and irregular. Yes, I see that you said: "only the arrays that we actually used would be allocated to save space"-- which reads to me: much complicated logic both in creation and access to make the data structure *look* simple. I'm a firm believer in getting the data structures right, after which the code to access them almost writes itself. I would suggest the inversion lists for the first try. As long as character classes are not very dynamic once they have been created (and at least traditionally that has been the case), inversion lists should work reasonably well. > As for character encodings, we're forcing everything to UTF-32 in > regular expressions. No exceptions. If you use a string in a regex, > it'll be transcoded. I honestly can't think of a better way to > guarantee efficient string indexing. I'm fine with that. The bloat is of course a shame, but as long as that's not a real problem for someone, let's not worry about it too much. > --Brent Dax > [EMAIL PROTECTED] > Parrot Configure pumpking and regex hacker > > <obra> mmmm. hawt sysadmin chx0rs > <lathos> This is sad. I know of *a* hawt sysamin chx0r. > <obra> I know more than a few. > <lathos> obra: There are two? Are you sure it's not the same one? -- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen