On Fri, Jan 18, 2002 at 11:44:00AM -0800, Hong Zhang wrote:
> > (1) There are 5.125 bytes in Unicode, not four.
> > (2) I think the above would suffer from the same problem as one common
> > suggestion, two-level bitmaps (though I think the above would suffer
> > less, being of finer granularity): the problem is that a lot of
> > space is wasted, since the "usage patterns" of Unicode character
> > classes tend to be rather scattered and irregular. Yes, I see
> > that you said: "only the arrays that we actually used would be
> > allocated to save space"-- which reads to me: much complicated
> > logic both in creation and access to make the data
> > structure *look*
> > simple. I'm a firm believer in getting the data structures right,
> > after which the code to access them almost writes itself.
> >
> > I would suggest the inversion lists for the first try. As long as
> > character classes are not very dynamic once they have been created
> > (and at least traditionally that has been the case), inversion lists
> > should work reasonably well.
>
> My proposal is we should use mix method. The Unicode standard class,
> such as \p{IsLu}, can be handled by a standard splitbin table. Please
> see Java java.lang.Character or Python unicodedata_db.h. I did
> measurement on it, to handle all unicode category, simple casing,
> and decimal digit value, I need about 23KB table for Unicode 3.1
> (0x0 to 0x10FFFF), about 15KB for (0x0 to 0xFFFF).
Don't try to compete with inversion lists on the size: their size is
measured in bytes. For example "Latin script", which consists of 22
separate ranges sprinkled between U+0041 and U+FF5A, encodes into 44
ints, or 176 bytes. Searching for membership in an inversion list is
O(N log N) (binary search). "Encoding the whole range" is a non-issue
bordering on a joke: two ints, or 8 bytes.
--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen