William D Clinger scripsit:

> In the reference implementation, all of the Unicode tables
> add up to a little over 85000 bytes (on a 32-bit machine).

Impressive, and thanks for pointing me there.  I see you mostly use
inversion lists (a sorted vector of codepoints at which the value of
a property changes), with associated values where required, and nicely
fast-path ASCII and the BMP (plane 0).
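
To make the idea concrete, here is a minimal Python sketch of an
inversion-list lookup.  It is not taken from the reference
implementation; the names (in_set, lookup) and the sample tables are
hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical inversion list for "is an ASCII letter": each entry is a
# codepoint at which the boolean property flips, starting from False.
ASCII_LETTERS = [0x41, 0x5B, 0x61, 0x7B]

def in_set(inv, cp):
    """True if cp lies in a range where the property is on."""
    # An odd number of boundaries at or below cp means we are inside.
    return bisect_right(inv, cp) % 2 == 1

# Variant with associated values: STARTS[i] begins a run whose value is
# VALUES[i].  Both tables are illustrative, not real Unicode data.
STARTS = [0x00, 0x30, 0x3A, 0x41, 0x5B]
VALUES = ["other", "digit", "other", "upper", "other"]

def lookup(starts, values, cp):
    """Value of the run containing cp (starts[0] must be 0)."""
    return values[bisect_right(starts, cp) - 1]
```

Each lookup is a binary search, so the cost grows with the log of the
number of boundaries; that is presumably why fast-pathing ASCII pays off.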

Inversion lists are compact, but in most cases ICU uses
tries, trading space for improved lookup speed.  Details are
at http://macchiato.com/slides/Bits_of_Unicode.ppt and
http://icu-project.org/docs/papers/foldedtrie_iuc21.ppt .
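
As a rough illustration of the space-for-speed trade, here is a Python
sketch of a plain two-stage table, the simplest relative of a trie.
This is not ICU's actual layout (see the slides for that); the block
size and the names build_two_stage and trie_lookup are my own
assumptions.

```python
BLOCK_BITS = 8                  # assumed block size: 256 codepoints
BLOCK_SIZE = 1 << BLOCK_BITS

def build_two_stage(values):
    """Split a flat per-codepoint table into an index plus shared blocks.

    len(values) is assumed to be a multiple of BLOCK_SIZE.
    """
    index, blocks, seen = [], [], {}
    for start in range(0, len(values), BLOCK_SIZE):
        block = tuple(values[start:start + BLOCK_SIZE])
        if block not in seen:           # identical blocks are stored once
            seen[block] = len(blocks)
            blocks.append(block)
        index.append(seen[block])
    return index, blocks

def trie_lookup(index, blocks, cp):
    """Two array reads and no searching."""
    return blocks[index[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)]

# Toy property over the BMP: "is an ASCII uppercase letter".
flat = [0x41 <= cp <= 0x5A for cp in range(0x10000)]
INDEX, BLOCKS = build_two_stage(flat)
```

With this toy property only two distinct blocks survive, which shows
where the space goes: the index is always full-size, but repeated
blocks are shared.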

Mozilla uses (unless it has changed) binary trees of SSGO records:
Start/Size/Gap/Offset, where Offset is used for mappings, and is the
delta between a codepoint and the codepoint it's mapped to.  Gap is a
flag that is set if this particular Start-Size range is gappy; that is,
if it only includes every other codepoint.  This comes up where Unicode
encodes alternating upper and lower case letters.  ASCII is fast-pathed.
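
Here is a small Python sketch of how such a record might be consulted.
It is not Mozilla's code: I use a sorted list with binary search in
place of a binary tree, I assume Size counts the codepoints spanned
starting at Start, and the names (SSGO, map_codepoint) and the two
sample records are mine.

```python
from bisect import bisect_right
from typing import NamedTuple

class SSGO(NamedTuple):
    start: int     # first codepoint of the range
    size: int      # assumed: number of codepoints spanned from start
    gap: bool      # True if only every other codepoint is included
    offset: int    # delta from a codepoint to its mapping

# Illustrative to-uppercase records, sorted by start:
#   a-z map 32 down; U+0101, U+0103, ... U+012F map 1 down.
TO_UPPER = [
    SSGO(0x61, 26, False, -32),
    SSGO(0x101, 47, True, -1),
]
TO_UPPER_STARTS = [r.start for r in TO_UPPER]

def map_codepoint(records, starts, cp):
    """Return the mapped codepoint, or cp itself if no record applies."""
    i = bisect_right(starts, cp) - 1    # last record starting at or below cp
    if i < 0:
        return cp
    r = records[i]
    d = cp - r.start
    if d >= r.size:                     # past the end of the range
        return cp
    if r.gap and d % 2 == 1:            # gappy range skips odd distances
        return cp
    return cp + r.offset
```

The gap flag lets one record cover the whole alternating run in Latin
Extended-A instead of needing one record per letter.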

-- 
John Cowan     http://ccil.org/~cowan    [email protected]
Monday we watch-a Firefly's house, but he no come out.  He wasn't home.
Tuesday we go to the ball game, but he fool us.  He no show up.  Wednesday he
go to the ball game, and we fool him.  We no show up.  Thursday was a
double-header.  Nobody show up.  Friday it rained all day.  There was no ball
game, so we stayed home and we listened to it on-a the radio.  --Chicolini

_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
