Jarkko Hietaniemi: <from attachment> About the implementation of character classes: since the Unicode code point range is big, a single big bitmap won't work any more: firstly, it would be big. Secondly, for most cases, it would be wastefully sparse. A balanced binary tree of (begin, end) points of ranges is suggested. That would seem to give the required flexibility and reasonable compromise betwen speed and space for implementing the operations required by both the traditional regular expression character classes (complement, case-ignorance) and the new Unicode character class semantics (difference, intersection) (see the Unicode Technical Report #18, I<Unicode Regular Expression Guidelines>, http://www.unicode.org/unicode/reports/tr18/ )
Another, possible simpler way would be to use inversion lists: 1-dimensional arrays where odd (starting from zero) indices store the beginnings of ranges belonging to the class, and and even indices store the beginnings of ranges not belonging to the class. Note "array" instead of (a linked) "list": with an array one can do binary search to determine membership (so an inversion list is in effect a flattened binary tree). Yet another way would be to use various two-level table schemes. The choice of the appropriate data structure, as always, depends on the expected operational (read vs modify) mix and the expected data distribution. ### Since I seem to be the main regex hacker for Parrot, I'll respond to this as best I can. Currently, we are using bitmaps for character classes. Well, sort of. A Bitmap in Parrot is defined like this: typedef struct bitmap_t { char* bmp; STRING* bigchars; } Bitmap; Characters <256 are stored as a bitmap in bmp; other characters are stored in bigchars and linear-searched. This is a temporary measure, since Parrot isn't yet dealing with many characters outside of ASCII. Several schemes have been proposed for the final version; I'm currently leaning towards an array of arrays of arrays of bitmaps (one level for each byte of the character): INTVAL ch; return bmp->bmp[FIRST_BYTE(ch)][SECOND_BYTE(ch)][THIRD_BYTE(ch)][FORTH_BYTE(ch) >>3] & (1<<(FORTH_BYTE(ch) & 7)); Ungainly, but it works. It would actually be a bit more complicated--only the arrays that we actually used would be allocated to save space--but you get the idea. (However, I'm quite flexible on the implementation chosen. I'll look at the ideas you propose in more detail; if anyone else has any suggestions, suggest them.) As for character encodings, we're forcing everything to UTF-32 in regular expressions. No exceptions. If you use a string in a regex, it'll be transcoded. I honestly can't think of a better way to guarantee efficient string indexing. --Brent Dax [EMAIL PROTECTED] Parrot Configure pumpking and regex hacker <obra> mmmm. hawt sysadmin chx0rs <lathos> This is sad. I know of *a* hawt sysamin chx0r. <obra> I know more than a few. <lathos> obra: There are two? Are you sure it's not the same one?