On 26/11/2003 06:17, Philippe Verdy wrote:

Peter Kirk [mailto:[EMAIL PROTECTED] writes:


Why is this a problem? Quotes and ">" with combining marks are presumably not legal HTML or XML;



You're wrong: it is legal in both HTML and XML. What is not specified correctly is how HTML and XML parsers should behave when faced with an XML or HTML document that claims to be encoded in a Unicode encoding scheme or any other Unicode-compatible CES (such as GB18030, but not fully MacRoman, which contains supplementary characters that are not part of the Unicode/ISO/IEC 10646 repertoire).



OK, I used the wrong words here. A sequence of a quote or ">" followed by combining characters is legal HTML/XML with the interpretation of a quote or ">" introducing a quoted string or terminating a tag, followed by a defective combining sequence which is part of the quoted string or of the text following the tag. The question is, does such a sequence have any other legal interpretation, within the context of an HTML/XML tag? If not, there is no ambiguity.
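To make that reading concrete, here is a minimal sketch in Python, assuming a parser that works purely on code points. The helper names `is_combining` and `split_quoted` and the sample string are hypothetical illustrations, not taken from any real HTML/XML parser.

```python
import unicodedata

def is_combining(ch: str) -> bool:
    # General categories Mn, Mc and Me cover the combining marks.
    return unicodedata.category(ch).startswith("M")

def split_quoted(text: str):
    """Split text that starts with '"' into (quoted value, remainder).

    Works on code points only: the quote is a delimiter even when the
    next code point is a combining mark.
    """
    if not text.startswith('"'):
        raise ValueError("expected an opening double quote")
    end = text.index('"', 1)          # position of the closing quote
    return text[1:end], text[end + 1:]

# Hypothetical input: an opening quote directly followed by U+0301.
sample = '"\u0301abc" rest'
value, rest = split_quoted(sample)

# The quoted string begins with a defective combining sequence.
starts_defective = is_combining(value[0])
# NFC leaves the quote alone, since no precomposed form exists for it.
nfc_first = unicodedata.normalize("NFC", sample)[0]
```

Under this interpretation the combining mark simply belongs to the quoted string, and normalising the document does not merge it with the quote.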

...

There could of course be problems if there were any precomposed combinations of quotes or ">" with combining characters, but I don't think there are any, are there?



There are such precomposed sequences in Unicode. Look in NormalizationTest.txt for the places where ">" and single and double quotes are used as part of a combining sequence. Notably, look at sequences made with the combining solidus overlay; consider also enclosing combining characters, and mathematical operators that can be created with a combining sequence starting with ">", "=", or a single or double quote, and modified by diacritics.



According to John Cowan there is just one such precomposed character, U+226F. As an HTML/XML document (the whole file, not just the parts between tags) is a Unicode string, the Unicode conformance rules would seem to mandate that an HTML/XML parser should parse U+226F exactly as if it were the sequence <">", U+0338>, i.e. as end of tag followed by a defective combining sequence. Normalisation stability implies that this precomposed character will always be the only such problem case, at least apart from composition exceptions, and so it is possible to write it into parsers as a special case. A bit messy, but less messy than using numeric entities or named entities.
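A rough way to check this claim, and to sketch the special case, using Python's standard `unicodedata` module. The function names `precomposed_starting_with` and `expand_special_case` are hypothetical, the scan looks at only one level of canonical decomposition, and the result depends on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata

# Characters that act as delimiters inside an HTML/XML tag.
SYNTAX_CHARS = {'"', "'", ">"}

def precomposed_starting_with(chars):
    """List code points whose canonical decomposition begins with one of chars."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        # Empty string: no decomposition. A leading "<tag>" marks a
        # compatibility decomposition, which is not relevant here.
        if not decomp or decomp.startswith("<"):
            continue
        first = chr(int(decomp.split()[0], 16))
        if first in chars:
            hits.append(cp)
    return hits

def expand_special_case(text: str) -> str:
    """Hypothetical pre-parse step: treat U+226F as '>' followed by U+0338."""
    return text.replace("\u226f", ">\u0338")

found = precomposed_starting_with(SYNTAX_CHARS)
print([f"U+{cp:04X}" for cp in found])
```

A parser could apply something like `expand_special_case` to markup before tokenising, so that the composed and decomposed spellings are handled identically. Whether that should apply to the whole file or only inside tags is a design decision this sketch does not settle.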

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




