Ray Dillinger wrote:

> If you are appending and taking substrings, the codepoint level is one
> of several wrong choices to make about where to allow string divisions,
> for exactly this reason.

You understate the case: *every* level is a wrong choice for some purposes
(and a right choice for others).

> What human beings think of as characters, are represented in unicode
> by a base codepoint plus nondefective sequence of combining modifiers
> and variant selectors, each of which is also a codepoint.

The default grapheme cluster (DGC) level, which is what you are
describing, is also arbitrary; for some languages it works well, for
others not.

For example, in all (mainstream) Indic scripts the DGC is a consonant
with zero or one vowel added, and this is indeed right for Tamil, whose
users think of the script as a syllabary.  In Hindi, though, it's more
common to think of *all* the consonants before a vowel as being part of
the character, even though Unicode puts them in different DGCs, because
that's the way they (mostly) join into ligatures.
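
To make the Tamil/Hindi contrast concrete, here is a rough sketch in
Python.  It is not a UAX #29 implementation (those rules are more
involved and have changed across Unicode versions); it only
approximates the rule described above by attaching each combining mark
to the preceding non-mark codepoint.  The function name approx_clusters
and the two sample strings are my own choices for illustration.

```python
import unicodedata

def approx_clusters(text):
    """Approximate clusters: one non-mark codepoint plus the marks after it.

    Assumption: "mark" means any codepoint whose general category starts
    with "M" (Mn, Mc, Me).  This is a simplification, not UAX #29.
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch      # attach the mark to the current cluster
        else:
            clusters.append(ch)     # a non-mark codepoint starts a new cluster
    return clusters

# TAMIL LETTER KA + TAMIL VOWEL SIGN I
tamil_ki = "\u0B95\u0BBF"
# DEVANAGARI KA + VIRAMA + SSA + VOWEL SIGN I (a conjunct plus a vowel)
hindi_kssi = "\u0915\u094D\u0937\u093F"

for label, s in (("Tamil", tamil_ki), ("Hindi", hindi_kssi)):
    print(label, len(s), "codepoints,", len(approx_clusters(s)), "clusters")
```

Under this approximation the Tamil syllable comes out as one cluster,
while the Hindi conjunct, which a reader would likely treat as one
character, is split in two after the virama.  Neither the codepoint
count nor the cluster count matches the reader's notion in both cases.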

-- 
Income tax, if I may be pardoned for saying so,         John Cowan
is a tax on income.  --Lord Macnaghten (1901)           [email protected]

_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
