"Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> writes:
> "D. Starner" <[EMAIL PROTECTED]> writes:
>
> > You could hide combining characters, which would be extremely useful if we
> > were just using Latin and Cyrillic scripts.
> 
> It would need a separate API for examining the contents of a combining
> character. You can't avoid the sequence of code points completely.

Not a separate API; a function that takes a character and returns an array of
integers (its code points).
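
As a rough illustration of what such a function could look like, here is a
minimal Python sketch. The names split_clusters and code_points are
hypothetical, and a "character" is approximated as a base code point plus any
following code points with a non-zero combining class, which is simpler than
full grapheme cluster segmentation.

```python
import unicodedata

def split_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Assumption: a "character" is a base code point plus trailing code points
    whose canonical combining class is non-zero. This is a simplification,
    not full grapheme cluster segmentation.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new character
    return clusters

def code_points(cluster):
    """Return the code points of one character as a list of integers."""
    return [ord(ch) for ch in cluster]
```

With this, split_clusters("e\u0301x") yields two characters, and code_points
on the first one gives the pair of integers behind it.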

> It would yield to surprising semantics: for example if you concatenate
> a string with N+1 possible positions of an iterator with a string with
> M+1 positions, you don't necessarily get a string with N+M+1 positions
> because there can be combining characters at the border.

The semantics there are surprising, but that's true no matter what you
do. An NFC string + an NFC string may not be NFC; the resulting text
doesn't have N+M graphemes. Unless you're explicitly adding a combining
character, a combining character should never start a string. This could 
be fixed several ways, including by inserting a dummy character to hold 
the combining character, and "normalizing" the string by removing the dummy 
characters. That would, for the most part, only hurt pathological cases.
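
The NFC claim is easy to check. The following Python sketch uses the standard
unicodedata module; is_nfc is a hypothetical helper name, and the example
strings are my own choice of a pathological case.

```python
import unicodedata

def is_nfc(s):
    """True if s is already in Normalization Form C."""
    return unicodedata.normalize("NFC", s) == s

left = "e"
right = "\u0301"            # COMBINING ACUTE ACCENT, alone in its string
joined = left + right       # plain code point concatenation

# Each piece is NFC on its own, but the concatenation is not:
# NFC composes "e" + U+0301 into the single code point U+00E9.
pieces_are_nfc = is_nfc(left) and is_nfc(right)
joined_is_nfc = is_nfc(joined)
composed = unicodedata.normalize("NFC", joined)
```

The right-hand string starts with a combining character, which is exactly the
case that should not arise unless you are explicitly adding one.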

> It would impose complexity in cases where it's not needed. Most of the
> time you don't care which code points are combining and which are not,
> for example when you compose a text file from many pieces (constants
> and parts filled by users) or when parsing (if a string is specified
> as ending with a double quote, then programs will in general treat a
> double quote followed by a combining character as an end marker).

If you do so with a language that includes <, you violate the Unicode
standard, because < followed by U+0338 COMBINING LONG SOLIDUS OVERLAY (which
is not a bare <) and ≮ (U+226E) are canonically equivalent. You've either got
to decompose first or look at each character as a whole instead of looking at
individual code points.
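
To make the equivalence and the "decompose first" option concrete, here is a
small Python sketch. delimiter_positions is a hypothetical helper, not part of
any parser; it treats a < as a delimiter only when no combining mark follows
it in the decomposed text.

```python
import unicodedata

precomposed = "\u226e"      # NOT LESS-THAN
sequence = "<\u0338"        # "<" + COMBINING LONG SOLIDUS OVERLAY

# Canonical equivalence: both normalize to the same string.
same_nfd = unicodedata.normalize("NFD", precomposed) == sequence
same_nfc = unicodedata.normalize("NFC", sequence) == precomposed

def delimiter_positions(text, delim="<"):
    """Find delim where it stands as a whole character.

    The text is decomposed (NFD) first; a delim followed by a combining
    mark is part of a larger character and is skipped. Returned indices
    refer to the decomposed text, not the input.
    """
    decomposed = unicodedata.normalize("NFD", text)
    hits = []
    for i, ch in enumerate(decomposed):
        nxt = decomposed[i + 1] if i + 1 < len(decomposed) else ""
        if ch == delim and not (nxt and unicodedata.combining(nxt)):
            hits.append(i)
    return hits
```

A parser that scans raw code points would see a < in one spelling and not in
the other; this version gives the same answer for both.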

Has anyone considered this while defining a language? How about the official
standards bodies? Searching for XML in the archives is a bit unhelpful, and
UTR #20 doesn't mention the issue. Your solution is just fine if you're
considering the issue on the bit level, but it strikes me as the wrong answer,
and I would think that it would be surprising to a user who didn't understand
Unicode, especially in the ≮ (U+226E) case. A warning either way would be nice.

I'll see if I have time after finals to pound out a basic API that implements
this, in Ada or Lisp or something. It's not going to be the most efficient
thing, but I doubt it's going to make a big difference for most programs, and
if you want C, you know where to find it.
