Re: on parrot strings

Jarkko Hietaniemi Fri, 18 Jan 2002 12:31:03 -0800

On Fri, Jan 18, 2002 at 12:20:53PM -0800, Hong Zhang wrote:
> > > My proposal is we should use mix method. The Unicode standard class,
> > > such as \p{IsLu}, can be handled by a standard splitbin table. Please
> > > see Java java.lang.Character or Python unicodedata_db.h. I did 
> > > measurement on it, to handle all unicode category, simple casing,
> > > and decimal digit value, I need about 23KB table for Unicode 3.1
> > > (0x0 to 0x10FFFF), about 15KB for (0x0 to 0xFFFF).
> > 
> > Don't try to compete with inversion lists on the size: their size is
> > measured in bytes.  For example "Latin script", which consists of 22
> > separate ranges sprinkled between U+0041 and U+FF5A, encodes into 44
> > ints, or 176 bytes. Searching for membership in an inversion list is
> > O(N log N) (binary search).  "Encoding the whole range" is a non-issue
> > bordering on a joke: two ints, or 8 bytes.
> 
> When I said mixed method, I did intend to include binary search. The binary
> search is a win for sparse character class. But bitmap is better for large
> one.


"Better" in what sense?  Smaller?  Certainly not.  Faster?  Maybe, maybe not.
Yes, accessing the right bytes and doing the bit arithmetics is about as
fast as one can hope doing anything in CPUs.  But: the 15KB is quite a lot
of stuff to move around for, say, [0-9].  Yes, bitmaps win in pathological
cases where you, say, choose every other character of the Unicode.

I guess I agree with you that a combination of bitmaps and binary
searchable things (inversion lists or trees) is good, but I guess we
differ in that my gut feeling is that the latter should be the
default, not the bitmaps.

I also think this low-level detail should be completely hidden from,
say, the writers of the regex engine, all they should see is
code_point_in_class(cp, cc), and that the low-level "character class
engine" should dynamically pick whichever low-level implementation is
"best", and naturally that only one of the low-level implementations
is being used (for one character class) at a time: hybrids (meaning
dual book-keeping) sound to me like a fruitful breeding area for bugs.

> Python uses two level bitmap for first 64K character.

And their Unicode implementation is doing how well? :-)

> Hong

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen

Re: on parrot strings

Reply via email to