> > My proposal is we should use mix method. The Unicode standard class,
> > such as \p{IsLu}, can be handled by a standard splitbin table. Please
> > see Java java.lang.Character or Python unicodedata_db.h. I did
> > measurement on it, to handle all unicode category, simple casing,
> > and decimal digit value, I need about 23KB table for Unicode 3.1
> > (0x0 to 0x10FFFF), about 15KB for (0x0 to 0xFFFF).
>
> Don't try to compete with inversion lists on the size: their size is
> measured in bytes. For example "Latin script", which consists of 22
> separate ranges sprinkled between U+0041 and U+FF5A, encodes into 44
> ints, or 176 bytes. Searching for membership in an inversion list is
> O(N log N) (binary search). "Encoding the whole range" is a non-issue
> bordering on a joke: two ints, or 8 bytes.
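For reference, the inversion-list lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not code from any of the libraries mentioned: the ASCII_LETTERS list and the function name in_inversion_list are assumptions made for the example. The list holds the sorted boundaries at which membership flips, so a binary search over N boundaries (O(log N) comparisons) decides membership.

```python
from bisect import bisect_right

# Hypothetical inversion list: sorted code points at which membership flips.
# [0x41, 0x5B, 0x61, 0x7B] describes U+0041..U+005A and U+0061..U+007A.
ASCII_LETTERS = [0x41, 0x5B, 0x61, 0x7B]

# "The whole range" needs only two boundaries, as noted in the quoted text.
WHOLE_RANGE = [0x0, 0x110000]


def in_inversion_list(inv, cp):
    """Return True if code point cp belongs to the set described by inv."""
    # bisect_right counts the boundaries <= cp; an odd count means cp
    # falls inside one of the ranges.
    return bisect_right(inv, cp) % 2 == 1
```

Each range costs two ints no matter how many code points it covers, which is why sparse classes stay so small in this form.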
When I said mixed method, I did intend to include binary search. Binary search is a win for sparse character classes, but a bitmap is better for large ones. Python uses a two-level bitmap for the first 64K characters.

Hong
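
P.S. A rough sketch of the two-level bitmap idea in Python, for illustration. It shows the general technique only and is not the actual layout used in unicodedata_db.h; the 256-code-point block size and the names build_two_level and in_two_level are assumptions made for the example.

```python
BLOCK_BITS = 8                 # assumed block size: 256 code points per block
BLOCK_SIZE = 1 << BLOCK_BITS


def build_two_level(members, limit=0x10000):
    """Build (index, blocks) for the code points in the set `members`.

    Only code points below `limit` are covered.
    """
    index, blocks, seen = [], [], {}
    for start in range(0, limit, BLOCK_SIZE):
        # Pack one bit per code point of this block into a single int.
        bits = 0
        for offset in range(BLOCK_SIZE):
            if (start + offset) in members:
                bits |= 1 << offset
        # Identical blocks are stored once and shared through the index.
        if bits not in seen:
            seen[bits] = len(blocks)
            blocks.append(bits)
        index.append(seen[bits])
    return index, blocks


def in_two_level(index, blocks, cp):
    """Membership test for cp < limit: two lookups and one bit test."""
    block = blocks[index[cp >> BLOCK_BITS]]
    return (block >> (cp & (BLOCK_SIZE - 1))) & 1 == 1
```

The lookup cost does not depend on how many ranges the class has, which is the attraction for large classes; the table size depends on how many distinct blocks remain after sharing.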