On Thursday, August 07, 2003 1:13 AM, Kenneth Whistler <[EMAIL PROTECTED]> wrote:
> Well, yes, which is why I have been advocating it as the
> solution to the Biblical Hebrew text representation problem.
> I agree with you about that. But it need not be characterized
> as "legal" in opposition to the other examples I cited above.
> All of these sequences are "legal" and allowed by the
> standard.

Once again, sorry if I used the terms "ill-formed" or "well-formed" where I should have said "defective" or "non-defective" (normal?). That distinction in the standard does not help much when discussing the interoperability of text processing: neither ill-formed nor defective sequences should be used if interoperability is the main focus (and interoperability is normally also the design focus of Unicode).

Canonical equivalence (NFC, NFD, canonical ordering) is now needed for XML processing, and in fact it greatly reduces the number of ill-formed, invalid, or defective sequences (or whatever we call bad encodings of actual text), which simplifies processing. Still, these equivalences do not solve every issue, and they create some of their own; this is now a good reason to use CGJ to override the canonical ordering of combining diacritics.

Of course, many strings can be created with Unicode that are not "ill-formed" and not canonically equivalent (per NFC, NFD, canonical ordering), but I won't go into that area. What matters for XML is that it processes strings in NFC form and thus relies only on canonical equivalence; XML will still process "defective" sequences, handling their characters correctly according to their canonical combining sequences.

I'd like to see a more formal rule covering defective uses of CGJ when it is used to fix canonical ordering. What I suggested was to specify that only some sequences containing CGJ would be "non-defective": those where the CGJ appears before a base character or between two combining characters. The character model would then need to be refined to document precisely which uses are considered non-defective and which are not.
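To make the canonical-ordering point concrete, here is a minimal sketch, assuming Python's standard unicodedata module; the base letter "a" and the helper names nfd and codepoints are only my choices for illustration. It shows that two diacritics with distinct combining classes are reordered by normalization, while a CGJ (combining class 0) placed between them keeps the typed order.

```python
import unicodedata

RING_ABOVE = "\u030A"  # COMBINING RING ABOVE, combining class 230
CEDILLA = "\u0327"     # COMBINING CEDILLA, combining class 202
CGJ = "\u034F"         # COMBINING GRAPHEME JOINER, combining class 0


def nfd(text):
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)


def codepoints(text):
    """Format a string as a list of U+XXXX code points (illustrative helper)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


plain = "a" + RING_ABOVE + CEDILLA          # no CGJ: marks get reordered
joined = "a" + RING_ABOVE + CGJ + CEDILLA   # CGJ blocks the reordering

print(codepoints(nfd(plain)))   # U+0061 U+0327 U+030A
print(codepoints(nfd(joined)))  # U+0061 U+030A U+034F U+0327
```

Without the CGJ, the cedilla (class 202) is sorted before the ring above (class 230); with it, the sequence survives normalization unchanged.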
So a sequence <..., ring above, CGJ, cedilla, ...> would not be defective, since the CGJ fixes the canonical ordering, even if in this case the two diacritics do not interact graphically. (Note that this statement assumes the cedilla really appears below, which is wrong for some languages, where the cedilla in fact looks like an acute accent placed above right...)

The example of how diacritics are actually rendered, compared with the placement presupposed by their combining class, is significant: it shows that combining classes only capture some common placement rules, not every case. A particular language or renderer may need to place diacritics in other positions, in which case the canonical ordering would have an impact on rendering. That is a good enough reason to justify and document the use of CGJ as a combining-class override for diacritics, a usage that should be restricted for the sake of interoperability.

This has a consequence for input methods and editors. Users can type base characters and diacritics, and the editor will use canonical ordering by default. The user may then fix the order where a particular language needs it, with a control command that "swaps" two misplaced diacritics, automatically inserting a CGJ only if it is needed because the two diacritics have distinct combining classes. This editor command would have no other effect when executed after two diacritics with identical combining classes, or after a single diacritic, and the editor should make its best effort to keep users from entering ill-formed or defective sequences.

-- Philippe.
Spam not tolerated: any unsolicited message will be reported to your Internet service providers.