> > My proposal is we should use mix method. The Unicode standard class,
> > such as \p{IsLu}, can be handled by a standard splitbin table. Please
> > see Java java.lang.Character or Python unicodedata_db.h. I did 
> > measurement on it, to handle all unicode category, simple casing,
> > and decimal digit value, I need about 23KB table for Unicode 3.1
> > (0x0 to 0x10FFFF), about 15KB for (0x0 to 0xFFFF).
> 
> Don't try to compete with inversion lists on the size: their size is
> measured in bytes.  For example "Latin script", which consists of 22
> separate ranges sprinkled between U+0041 and U+FF5A, encodes into 44
> ints, or 176 bytes. Searching for membership in an inversion list is
> O(N log N) (binary search).  "Encoding the whole range" is a non-issue
> bordering on a joke: two ints, or 8 bytes.

When I said mixed method, I did intend to include binary search. Binary
search is a win for sparse character classes, but a bitmap is better for
large ones. Python uses a two-level bitmap for the first 64K characters.
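
To make the trade-off concrete, here is a rough sketch of both halves of
the mixed method. It is only an illustration under my own assumptions, not
the actual layout of Python's unicodedata_db.h or Java's
java.lang.Character: the function names, the 256-code-point block size,
and the use of a Python int as the per-block bitmap are all made up for
this example. The inversion list is searched with binary search, which is
O(log N) per lookup; the two-level table is two array reads and a bit test.

```python
from bisect import bisect_right

# Hypothetical block size for the sketch: 256 code points per block.
BLOCK_BITS = 8
BLOCK_SIZE = 1 << BLOCK_BITS


def in_inversion_list(inv, cp):
    """inv is a sorted list [start0, end0, start1, end1, ...] of half-open ranges."""
    # The count of boundaries <= cp is odd exactly when cp lies inside a range.
    return bisect_right(inv, cp) % 2 == 1


def expand(inv):
    """Turn an inversion list into an explicit set of code points."""
    members = set()
    for start, end in zip(inv[0::2], inv[1::2]):
        members.update(range(start, end))
    return members


def build_two_level(members, limit=0x10000):
    """Return (index, blocks); identical bitmap blocks are stored only once."""
    index, blocks, seen = [], [], {}
    for base in range(0, limit, BLOCK_SIZE):
        bits = 0
        for off in range(BLOCK_SIZE):
            if base + off in members:
                bits |= 1 << off
        if bits not in seen:
            seen[bits] = len(blocks)
            blocks.append(bits)
        index.append(seen[bits])
    return index, blocks


def in_two_level(index, blocks, cp):
    """Membership test for code points below the limit used when building."""
    block = blocks[index[cp >> BLOCK_BITS]]
    return (block >> (cp & (BLOCK_SIZE - 1))) & 1 == 1


# Usage: ASCII letters as a small, sparse class (two ranges, four ints).
ascii_letters = [0x41, 0x5B, 0x61, 0x7B]
index, blocks = build_two_level(expand(ascii_letters))
assert in_inversion_list(ascii_letters, ord("Q"))
assert in_two_level(index, blocks, ord("Q"))
assert not in_two_level(index, blocks, ord("5"))
```

For a sparse class like this one the inversion list is clearly smaller.
The two-level table costs more up front, but the lookup cost does not
depend on how many ranges the class has, and identical blocks are shared.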

Hong
