"Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> writes:
> "D. Starner" <[EMAIL PROTECTED]> writes:
>
> > You could hide combining characters, which would be extremely useful if we
> > were just using Latin and Cyrillic scripts.
>
> It would need a separate API for examining the contents of a combining
> character. You can't avoid the sequence of code points completely.

Not a separate API; just a function that takes a character and returns an array of integers (its code points).

> It would yield to surprising semantics: for example if you concatenate
> a string with N+1 possible positions of an iterator with a string with
> M+1 positions, you don't necessarily get a string with N+M+1 positions
> because there can be combining characters at the border.

The semantics there are surprising, but that's true no matter what you do. An NFC string + an NFC string may not be NFC, and the resulting text doesn't necessarily have N+M graphemes. Unless you're explicitly adding a combining character, a combining character should never start a string. This could be fixed several ways, including by inserting a dummy character to hold the combining character, and "normalizing" the string by removing the dummy characters. That would, for the most part, only hurt pathological cases.

> It would impose complexity in cases where it's not needed. Most of the
> time you don't care which code points are combining and which are not,
> for example when you compose a text file from many pieces (constants
> and parts filled by users) or when parsing (if a string is specified
> as ending with a double quote, then programs will in general treat a
> double quote followed by a combining character as an end marker).

If you do so with a language that includes <, you violate the Unicode standard, because ≮ (U+226E, NOT LESS-THAN) and < followed by U+0338 (COMBINING LONG SOLIDUS OVERLAY) are canonically equivalent. You've either got to decompose first or look at each character as a whole instead of looking at code points. Has anyone considered this while defining a language? How about the official standards bodies? Searching for XML in the archives is a bit unhelpful, and UTR #20 doesn't mention the issue.

Your solution is just fine if you're considering the issue at the bit level, but it strikes me as the wrong answer, and I would think that it would be surprising to a user who didn't understand Unicode, especially in the ≮ case.

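To make the points above concrete, here is a rough Python sketch, not a proposed API. The names `code_points` and `clusters` are made up for illustration, and the "combining class is non-zero" test is an assumption that is much cruder than real grapheme cluster rules. It only relies on the standard `unicodedata` module.

```python
import unicodedata


def code_points(character: str) -> list[int]:
    """Hypothetical helper: one user-visible character in, its code points out."""
    return [ord(c) for c in character]


def clusters(text: str) -> list[str]:
    """Hypothetical helper: split text into base character + combining marks.

    Assumption: a code point with a non-zero canonical combining class
    attaches to the preceding cluster. A combining mark at the very start
    of the text has nothing to attach to, so it forms a cluster by itself.
    """
    out: list[str] = []
    for ch in text:
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch
        else:
            out.append(ch)
    return out


def is_nfc(text: str) -> bool:
    return unicodedata.normalize("NFC", text) == text


# The "function that returns an array of integers" for a combined character.
not_less_than = "<\u0338"           # '<' followed by COMBINING LONG SOLIDUS OVERLAY
print(code_points(not_less_than))   # [60, 824]

# NFC + NFC is not necessarily NFC, and cluster counts do not add up.
left, right = "e", "\u0301"         # 'e' and a lone COMBINING ACUTE ACCENT
joined = left + right
print(is_nfc(left), is_nfc(right), is_nfc(joined))
print(len(clusters(left)), len(clusters(right)), len(clusters(joined)))

# The parsing problem: two canonically equivalent strings, two answers.
decomposed = "a<\u0338b"
composed = unicodedata.normalize("NFC", decomposed)   # contains U+226E
print(decomposed.find("<"), composed.find("<"))

# Looking at whole clusters gives the same answer for both spellings.
print("<" in clusters(decomposed), "<" in clusters(composed))
```

A code-point search finds a `<` in the decomposed spelling but not in the composed one, while the cluster-level check treats both the same way, which is the behaviour I would expect a user to want.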
A warning either way would be nice.

I'll see if I have time after finals to pound out a basic API that implements this, in Ada or Lisp or something. It's not going to be the most efficient thing, but I doubt it's going to be a big difference for most programs, and if you want C, you know where to find it.

