> (1) There are 5.125 bytes in Unicode, not four.
> (2) I think the above would suffer from the same problem as one common
>     suggestion, two-level bitmaps (though I think the above would suffer
>     less, being of finer granularity): the problem is that a lot of
>     space is wasted, since the "usage patterns" of Unicode character
>     classes tend to be rather scattered and irregular.  Yes, I see
>     that you said: "only the arrays that we actually used would be
>     allocated to save space"-- which reads to me: much complicated
>     logic both in creation and access to make the data structure *look*
>     simple.  I'm a firm believer in getting the data structures right,
>     after which the code to access them almost writes itself.
> 
> I would suggest the inversion lists for the first try.  As long as
> character classes are not very dynamic once they have been created
> (and at least traditionally that has been the case), inversion lists
> should work reasonably well.

My proposal is that we use a mixed method. The standard Unicode classes,
such as \p{IsLu}, can be handled by a standard splitbin table. Please
see Java's java.lang.Character or Python's unicodedata_db.h. I did some
measurements on this: to handle all Unicode categories, simple casing,
and decimal digit values, I need about a 23KB table for Unicode 3.1
(0x0 to 0x10FFFF), and about 15KB for 0x0 to 0xFFFF.
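
The lookup itself is just two array probes: the first index picks a
128-code-point block, the second picks the per-character record inside
it. A rough sketch of what I mean (the real tables would be
machine-generated; the tiny hand-filled ones below only cover a couple
of ASCII letters, purely for illustration):

#include <stdio.h>

#define SHIFT 7                       /* 128 code points per block */

typedef struct {
    unsigned char category;           /* 0 = unassigned, 1 = Lu, 2 = Ll, ... */
    signed char   lower_delta;        /* simple case map: cp + delta         */
    unsigned char decimal;            /* decimal digit value, 0xFF if none   */
} char_record;

/* Unique property records; most code points share a handful of these. */
static const char_record records[] = {
    { 0,  0, 0xFF },                  /* 0: default                */
    { 1, 32, 0xFF },                  /* 1: Lu, lowercases by +32  */
};

/* Stage 1: block number for each 128-code-point chunk (one block here). */
static const unsigned short index1[] = { 0 };

/* Stage 2: record number for each code point in each block.
 * Only 'A', 'B' and 'Z' are filled in for this sketch. */
static const unsigned char index2[128] = {
    [0x41] = 1, [0x42] = 1, [0x5A] = 1,
};

static const char_record *unicode_record(unsigned int cp)
{
    unsigned int block = index1[cp >> SHIFT];
    return &records[index2[(block << SHIFT) | (cp & ((1u << SHIFT) - 1))]];
}

int main(void)
{
    printf("category('B') = %u\n", unicode_record('B')->category);  /* 1 */
    return 0;
}

The size stays small because the stage-2 entries are bytes, identical
blocks can be shared, and the records themselves are deduplicated.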

For a simple character class, such as [\p{IsLu}\p{InGreek}], the regex
engine does not need to emit an optimized bitmap. Instead, it can just
generate a union: the first member uses the standard Unicode category
lookup, and the second is a simple range test.
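
Something like this (a minimal sketch: CAT_Lu and the 0x0370-0x03FF
range for the Greek block are my own placeholders, and the ASCII-only
category stub just stands in for the shared split-table lookup above):

#include <stdbool.h>

#define CAT_Lu 1

/* Stand-in for the shared split-table category lookup; ASCII-only so
 * this sketch compiles on its own. */
static unsigned int unicode_category(unsigned int cp)
{
    return (cp >= 'A' && cp <= 'Z') ? CAT_Lu : 0;
}

/* [\p{IsLu}\p{InGreek}] compiled as a union of two cheap tests instead
 * of a per-class bitmap. */
bool match_class(unsigned int cp)
{
    if (unicode_category(cp) == CAT_Lu)     /* \p{IsLu}: one table probe */
        return true;
    if (cp >= 0x0370 && cp <= 0x03FF)       /* \p{InGreek}: a range test */
        return true;
    return false;
}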

If the user mandates a fast bitmap, and the character class is not
extremely complicated, we will probably only need a few KB for each
char class.
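
For instance, a bitmap covering just the BMP is 8KB per class, and
anything above 0xFFFF can fall back to the slow union test. A sketch,
assuming a match_class() predicate like the one above:

#include <stdbool.h>
#include <string.h>

#define BMP_LIMIT 0x10000u

/* One bit per BMP code point: 0x10000 / 8 = 8KB per character class. */
typedef struct {
    unsigned char bits[BMP_LIMIT / 8];
} class_bitmap;

/* Fill the bitmap from any slow predicate (e.g. match_class() above). */
void bitmap_build(class_bitmap *bm, bool (*pred)(unsigned int))
{
    memset(bm->bits, 0, sizeof bm->bits);
    for (unsigned int cp = 0; cp < BMP_LIMIT; cp++)
        if (pred(cp))
            bm->bits[cp >> 3] |= (unsigned char)(1u << (cp & 7));
}

bool bitmap_test(const class_bitmap *bm, unsigned int cp)
{
    if (cp >= BMP_LIMIT)
        return false;        /* above the BMP, fall back to the slow test */
    return (bm->bits[cp >> 3] >> (cp & 7)) & 1;
}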

> > As for character encodings, we're forcing everything to UTF-32 in
> > regular expressions.  No exceptions.  If you use a string in a regex,
> > it'll be transcoded.  I honestly can't think of a better way to
> > guarantee efficient string indexing.

I don't think UTF-32 will save you much. The Unicode case map is
variable length, and combining characters, canonical equivalence, and
many other things require variable-length mappings. For example, if I
only want to match /[0-9]+/, why would you convert everything to
UTF-32? Most of the time, regcomp() can find out whether a regexp will
need complicated preprocessing. Another example: if I want to search
for /resume/e (equivalence matching), the regex engine can normalize
the case, fully decompose the input string, strip off any combining
characters, and do an 8-bit Boyer-Moore search. I bet it will be
simpler and faster than using UTF-32. (BTW, equivalence matching here
means matching English spelling against French spelling, disregarding
diacritics.)
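
Roughly, regcomp() sees the /e flag and arranges for the subject to be
folded before a plain byte search. A very rough sketch, where a toy
Latin-1 accent table stands in for real case folding plus canonical
decomposition, and a naive scan stands in for Boyer-Moore:

#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Toy fold for Latin-1: map a byte to a lowercase base letter.  This
 * stands in for case folding + decomposition + dropping combining
 * marks; only the 'a' and 'e' accents are handled, enough for /resume/e. */
static unsigned char fold_latin1(unsigned char c)
{
    if ((c >= 0xC0 && c <= 0xC5) || (c >= 0xE0 && c <= 0xE5)) return 'a';
    if ((c >= 0xC8 && c <= 0xCB) || (c >= 0xE8 && c <= 0xEB)) return 'e';
    return (unsigned char)tolower(c);
}

/* 8-bit search over folded bytes.  A real engine would normalize the
 * subject once and run Boyer-Moore; the naive scan keeps the sketch short. */
long equiv_search(const unsigned char *hay, size_t hay_len, const char *needle)
{
    size_t n = strlen(needle);

    for (size_t i = 0; i + n <= hay_len; i++) {
        size_t j = 0;
        while (j < n && fold_latin1(hay[i + j]) == (unsigned char)needle[j])
            j++;
        if (j == n)
            return (long)i;      /* "r\xE9sum\xE9" matches "resume" */
    }
    return -1;
}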

I think we should explore more choices and do some experiments.

Hong
