On Fri, Jan 18, 2002 at 11:44:00AM -0800, Hong Zhang wrote: > > (1) There are 5.125 bytes in Unicode, not four. > > (2) I think the above would suffer from the same problem as one common > > suggestion, two-level bitmaps (though I think the above would suffer > > less, being of finer granularity): the problem is that a lot of > > space is wasted, since the "usage patterns" of Unicode character > > classes tend to be rather scattered and irregular. Yes, I see > > that you said: "only the arrays that we actually used would be > > allocated to save space"-- which reads to me: much complicated > > logic both in creation and access to make the data > > structure *look* > > simple. I'm a firm believer in getting the data structures right, > > after which the code to access them almost writes itself. > > > > I would suggest the inversion lists for the first try. As long as > > character classes are not very dynamic once they have been created > > (and at least traditionally that has been the case), inversion lists > > should work reasonably well. > > My proposal is we should use mix method. The Unicode standard class, > such as \p{IsLu}, can be handled by a standard splitbin table. Please > see Java java.lang.Character or Python unicodedata_db.h. I did > measurement on it, to handle all unicode category, simple casing, > and decimal digit value, I need about 23KB table for Unicode 3.1 > (0x0 to 0x10FFFF), about 15KB for (0x0 to 0xFFFF).
Don't try to compete with inversion lists on the size: their size is measured in bytes. For example "Latin script", which consists of 22 separate ranges sprinkled between U+0041 and U+FF5A, encodes into 44 ints, or 176 bytes. Searching for membership in an inversion list is O(N log N) (binary search). "Encoding the whole range" is a non-issue bordering on a joke: two ints, or 8 bytes. -- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen