Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)

jcowan Tue, 09 Dec 2003 06:33:33 -0800

Philippe Verdy scripsit:

> When in doubt, don't perform any normalization of XML _files_ as they are
> NOT plain text: you need a XML parser to do it safely only in relevant
> sections of this file. All you could do safely is to possibly reencode XML
> files (for example from UTF-8 to UTF-16 encoding schemes).


This is wildly overstated.  XML files most certainly are plain text,
though they may be interpreted as fancy text in contexts that understand
XML.  With the insignificant exception of a markup ">" immediately
followed by a U+0338 character, it is entirely safe to normalize XML
files according to any normalization.  (It is true that NFK* normalization
forms may lose information, but XML document authors are discouraged
from using compatibility decomposables in any case.)
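
A minimal sketch of the point above, assuming Python's standard
unicodedata module; the helper name normalize_xml_text and the sample
strings are made up for illustration.  It normalizes an XML file as
plain text, then shows the one markup hazard (">" followed by U+0338)
and the kind of information an NFK* form can lose.

```python
import unicodedata

def normalize_xml_text(text: str, form: str = "NFC") -> str:
    """Normalize a whole XML file as plain text, markup included."""
    return unicodedata.normalize(form, text)

# Ordinary case: only the character data changes; the tags are ASCII
# and pass through untouched.
decomposed = "<p>e\u0301te\u0301</p>"
composed = normalize_xml_text(decomposed)

# The exception: a markup ">" immediately followed by U+0338
# (COMBINING LONG SOLIDUS OVERLAY).  NFC composes the pair into
# U+226F (NOT GREATER-THAN), so the start-tag loses its closing ">".
risky = "<p>\u0338</p>"
damaged = normalize_xml_text(risky)

# NFK* forms may lose information: U+FB01 (LATIN SMALL LIGATURE FI)
# becomes the two letters "fi".
lossy = normalize_xml_text("<p>\ufb01</p>", "NFKC")
```

The decomposing forms (NFD, NFKD) leave the ">" plus U+0338 sequence
alone; only the composing forms fuse it.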

What is not allowed, and this makes XML technically non-conformant to the
Unicode Standard, is to make arbitrary and unsystematic replacements of
one canonically equivalent form with another.  For example, if an element
name is "hétérogénéité" (a favorite word of mine), decomposing the
start-tag while leaving the end-tag composed would make the document no
longer well-formed XML.  In my opinion, this is a corner case that may
be safely ignored.
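
The element-name case can be sketched as follows, assuming Python's
standard xml.etree.ElementTree parser, which compares names code point
by code point and does no normalization of its own.  The helper
is_well_formed and the sample documents are hypothetical.

```python
import unicodedata
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

name_nfc = unicodedata.normalize("NFC", "h\u00e9t\u00e9rog\u00e9n\u00e9it\u00e9")
name_nfd = unicodedata.normalize("NFD", name_nfc)

# Systematic normalization: both tags composed, or both decomposed.
all_nfc = f"<{name_nfc}>x</{name_nfc}>"
all_nfd = unicodedata.normalize("NFD", all_nfc)

# Unsystematic replacement: start-tag decomposed, end-tag left composed.
# The two names are canonically equivalent but no longer identical.
mixed = f"<{name_nfd}>x</{name_nfc}>"

results = {
    "all_nfc": is_well_formed(all_nfc),
    "all_nfd": is_well_formed(all_nfd),
    "mixed": is_well_formed(mixed),
}
```

Normalizing the whole file at once never produces the mixed case,
which is why it only arises from piecemeal replacement.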

-- 
John Cowan  www.reutershealth.com  www.ccil.org/~cowan  [EMAIL PROTECTED]
'Tis the Linux rebellion / Let coders take their place,
The Linux-nationale / Shall Microsoft outpace,
We can write better programs / Our CPUs won't stall,
So raise the penguin banner of / The Linux-nationale.
