Re: on parrot strings

Jarkko Hietaniemi Fri, 18 Jan 2002 12:05:39 -0800

On Fri, Jan 18, 2002 at 11:44:00AM -0800, Hong Zhang wrote:
> > (1) There are 5.125 bytes in Unicode, not four.
> > (2) I think the above would suffer from the same problem as one common
> >     suggestion, two-level bitmaps (though I think the above would suffer
> >     less, being of finer granularity): the problem is that a lot of
> >     space is wasted, since the "usage patterns" of Unicode character
> >     classes tend to be rather scattered and irregular.  Yes, I see
> >     that you said: "only the arrays that we actually used would be
> >     allocated to save space"-- which reads to me: much complicated
> >     logic both in creation and access to make the data 
> > structure *look*
> >     simple.  I'm a firm believer in getting the data structures right,
> >     after which the code to access them almost writes itself.
> > 
> > I would suggest the inversion lists for the first try.  As long as
> > character classes are not very dynamic once they have been created
> > (and at least traditionally that has been the case), inversion lists
> > should work reasonably well.
> 
> My proposal is we should use mix method. The Unicode standard class,
> such as \p{IsLu}, can be handled by a standard splitbin table. Please
> see Java java.lang.Character or Python unicodedata_db.h. I did 
> measurement on it, to handle all unicode category, simple casing,
> and decimal digit value, I need about 23KB table for Unicode 3.1
> (0x0 to 0x10FFFF), about 15KB for (0x0 to 0xFFFF).


Don't try to compete with inversion lists on the size: their size is
measured in bytes.  For example "Latin script", which consists of 22
separate ranges sprinkled between U+0041 and U+FF5A, encodes into 44
ints, or 176 bytes. Searching for membership in an inversion list is
O(N log N) (binary search).  "Encoding the whole range" is a non-issue
bordering on a joke: two ints, or 8 bytes.

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen

Re: on parrot strings

Reply via email to