On 26/11/2003 06:17, Philippe Verdy wrote:
Peter Kirk [mailto:[EMAIL PROTECTED]] writes:
Why is this a problem? Quotes and ">" with combining marks are
presumably not legal HTML or XML;
You're wrong: it is legal in both HTML and XML. What is not specified
correctly is the behavior of HTML and XML parsers when faced with an XML or
HTML document that claims to be coded with a Unicode encoding scheme or any
other Unicode-compatible CES (like GB18030, but not completely MacRoman, as
it contains supplementary characters that are not part of the Unicode/ISO/IEC
10646 repertoire).
OK, I used the wrong words here. A sequence of a quote or ">" followed
by combining characters is legal HTML/XML with the interpretation of a
quote or ">" introducing a quoted string or terminating a tag, followed
by a defective combining sequence which is part of the quoted string or
of the text following the tag. The question is, does such a sequence
have any other legal interpretation, within the context of an HTML/XML
tag? If not, there is no ambiguity.
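To make that reading concrete, here is a minimal Python sketch. It is not a
real HTML/XML parser: the helper name leading_combining_marks and the sample
string are made up for illustration. It treats the first ">" as the end of
the tag and then collects any combining marks that follow it, i.e. the
defective combining sequence that would belong to the text after the tag.

```python
import unicodedata


def leading_combining_marks(text):
    """Return the run of combining marks (general category M*) at the start of text."""
    i = 0
    while i < len(text) and unicodedata.category(text[i]).startswith("M"):
        i += 1
    return text[:i]


# Hypothetical input: a start tag immediately followed by U+0338
# (COMBINING LONG SOLIDUS OVERLAY) and then ordinary text.
sample = "<b>\u0338x</b>"

# Under the interpretation above, the first ">" still terminates the tag...
close = sample.index(">")
tag = sample[: close + 1]

# ...and the combining mark belongs to the character data that follows,
# where it forms a defective combining sequence (it has no base character).
marks = leading_combining_marks(sample[close + 1:])
```
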
...
There could of course be
problems if there were any precomposed combinations of quotes or ">"
with combining characters, but I don't think there are any, are there?
There are such precomposed sequences in Unicode. Look in
NormalizationTest.txt for the places where ">" and single and double quotes
are used as part of a combining sequence. Look notably at sequences made with
the combining solidus overlay; also consider the case of enclosing combining
characters, and of mathematical operators that can be created with a
combining sequence starting with ">" or "=" or single or double quotes, and
modified by diacritics.
According to John Cowan there is just one such precomposed character,
U+226F. As an HTML/XML document (the whole file, not just the parts
between tags) is a Unicode string, the Unicode conformance rules would
seem to mandate that an HTML/XML parser should parse U+226F exactly as
if it were the sequence <">", U+0338>, i.e. as end of tag followed by a
defective combining sequence. Normalisation stability implies that this
precomposed character will always be the only such problem case, at
least apart from composition exceptions, and so it is possible to write
it into parsers as a special case. A bit messy, but less messy than
using numeric entities or named entities.
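
The canonical equivalence in question can be checked with Python's standard
unicodedata module. The sketch below is only an illustration: the helper name
precomposed_starting_with is made up, and what the scan reports depends on
the Unicode data bundled with the interpreter. If John Cowan's count is
right, the scan should report only U+226F.

```python
import unicodedata

# U+226F NOT GREATER-THAN is canonically equivalent to ">" + U+0338.
decomposed = unicodedata.normalize("NFD", "\u226f")
recomposed = unicodedata.normalize("NFC", ">\u0338")


def precomposed_starting_with(chars):
    """Code points whose canonical decomposition begins with one of chars."""
    targets = {f"{ord(c):04X}" for c in chars}
    hits = []
    for cp in range(0x110000):
        fields = unicodedata.decomposition(chr(cp)).split()
        # Compatibility mappings start with a "<tag>" field, so comparing the
        # first field against a hex code point keeps canonical mappings only.
        if fields and fields[0] in targets:
            hits.append(cp)
    return hits


# Characters that are significant to a tag parser: ">", '"' and "'".
hits = precomposed_starting_with('>"\'')
print([f"U+{cp:04X}" for cp in hits])
```

A parser that special-cases U+226F would in effect be applying the NFD
mapping for that one character before looking for the end of a tag.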
--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/