Yung-Fong Tang wrote:
I see a hole here. What about UTF-8 representing a pair of surrogate code points as two 3-octet sequences instead of a single 4-octet UTF-8 sequence? That should be ill-formed too, since it is also a non-shortest form, right? But we really need to watch the language used there so that we don't create a new problem. I DO NOT want people to think that a single 3-octet UTF-8 sequence for a high or low surrogate is ill-formed, but that a 3-octet high-surrogate sequence followed by a 3-octet low-surrogate sequence is legal.

How would you infer that a pair of any ill-formed sequences is not also ill-formed, without any specific text allowing such?


Remember also that such pairs of 3-byte surrogate sequences were forbidden at the same time CESU-8 was created.
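
To illustrate the point, here is a minimal sketch that uses Python 3's built-in strict UTF-8 codec as a stand-in for a conformant decoder. The helper name is_well_formed_utf8 and the constants are made up for this example; they are not taken from the standard's text. The sketch assumes the codec rejects surrogate code points under its default strict error handling, as CPython 3 does.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if `data` decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")  # "strict" error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical names, used only for this illustration.
HIGH = b"\xed\xa0\x80"  # 3-byte sequence that would encode U+D800 (high surrogate)
LOW = b"\xed\xb0\x80"   # 3-byte sequence that would encode U+DC00 (low surrogate)
SUPPLEMENTARY = "\U00010000".encode("utf-8")  # the single 4-byte form of U+10000

for label, data in [
    ("high surrogate alone", HIGH),
    ("low surrogate alone", LOW),
    ("high followed by low", HIGH + LOW),
    ("4-byte form of U+10000", SUPPLEMENTARY),
]:
    print(f"{label}: {data.hex(' ')} -> well-formed: {is_well_formed_utf8(data)}")
```

Under that assumption, each 3-byte surrogate sequence is rejected on its own, and concatenating two of them does not make the result acceptable. Only the 4-byte form is accepted.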

markus

--
Opinions expressed here may not reflect my company's positions unless otherwise noted.



