> Since I seem to be the main regex hacker for Parrot, I'll respond to
> this as best I can.
> 
> Currently, we are using bitmaps for character classes.  Well, sort of.
> A Bitmap in Parrot is defined like this:
> 
>       typedef struct bitmap_t {
>               char*           bmp;
>               STRING* bigchars;
>       } Bitmap;
> 
> Characters <256 are stored as a bitmap in bmp; other characters are
> stored in bigchars and linear-searched.  This is a temporary measure,

This is similar to how Perl 5 does them: the low eight bits are in a
32-byte bitmap, the "wide characters" are stored after it (in a funky
data structure, I won't go into more detail so that people won't lose
their lunch/breakfast/meal)

> since Parrot isn't yet dealing with many characters outside of ASCII.
> Several schemes have been proposed for the final version; I'm currently
> leaning towards an array of arrays of arrays of bitmaps (one level for
> each byte of the character):
> 
>       INTVAL ch;
>       return
> bmp->bmp[FIRST_BYTE(ch)][SECOND_BYTE(ch)][THIRD_BYTE(ch)][FORTH_BYTE(ch)
> >>3] & (1<<(FORTH_BYTE(ch) & 7));

dup + dup *  ... oh, you meant FOURTH.

> Ungainly, but it works.  It would actually be a bit more
> complicated--only the arrays that we actually used would be allocated to
> save space--but you get the idea.  (However, I'm quite flexible on the
> implementation chosen.  I'll look at the ideas you propose in more
> detail; if anyone else has any suggestions, suggest them.)

Ungainly, yes.

(1) There are 5.125 bytes in Unicode, not four.
(2) I think the above would suffer from the same problem as one common
    suggestion, two-level bitmaps (though I think the above would suffer
    less, being of finer granularity): the problem is that a lot of
    space is wasted, since the "usage patterns" of Unicode character
    classes tend to be rather scattered and irregular.  Yes, I see
    that you said: "only the arrays that we actually used would be
    allocated to save space"-- which reads to me: much complicated
    logic both in creation and access to make the data structure *look*
    simple.  I'm a firm believer in getting the data structures right,
    after which the code to access them almost writes itself.

I would suggest the inversion lists for the first try.  As long as
character classes are not very dynamic once they have been created
(and at least traditionally that has been the case), inversion lists
should work reasonably well.

> As for character encodings, we're forcing everything to UTF-32 in
> regular expressions.  No exceptions.  If you use a string in a regex,
> it'll be transcoded.  I honestly can't think of a better way to
> guarantee efficient string indexing.

I'm fine with that.  The bloat is of course a shame, but as long as
that's not a real problem for someone, let's not worry about it too
much.

> --Brent Dax
> [EMAIL PROTECTED]
> Parrot Configure pumpking and regex hacker
> 
> <obra> mmmm. hawt sysadmin chx0rs
> <lathos> This is sad. I know of *a* hawt sysamin chx0r.
> <obra> I know more than a few.
> <lathos> obra: There are two? Are you sure it's not the same one?

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen

Reply via email to