Hi,

What is a conforming application supposed to do if, when decoding a UTF-8 stream (or indeed a UTF-32 stream, etc.), it encounters a sequence of bytes that decodes to U+D800, U+DF00?

Of course, if such a sequence were encountered during UTF-16 processing its meaning would be pretty obvious, but I'm not talking about UTF-16 any more. At least, not directly. Nonetheless, such a sequence could arise if Application A encodes text to a file using UTF-16, and that file is then read by Application B (a very old legacy application, unaware of the existence of code points above U+FFFF) and re-saved in UTF-8.
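For concreteness, here is a sketch in Python of the byte sequence such a round trip would produce, assuming the legacy application naively encodes each 16-bit code unit as its own three-byte UTF-8 sequence (this is essentially what the CESU-8 encoding does). It only shows what one widely used strict decoder (CPython's) happens to do; it isn't offered as the standard's answer to the question:

```python
# The surrogate pair U+D800 U+DF00 (which in UTF-16 encodes U+10300),
# with each surrogate code point naively encoded as a three-byte
# UTF-8-style sequence:
bad = b"\xed\xa0\x80\xed\xbc\x80"

# A strict UTF-8 decoder rejects this: surrogate code points are not
# permitted in a well-formed UTF-8 stream.
try:
    bad.decode("utf-8")
    print("accepted")
except UnicodeDecodeError:
    print("rejected")

# The well-formed UTF-8 for U+10300 is a single four-byte sequence:
good = "\U00010300".encode("utf-8")
print(good)  # b'\xf0\x90\x8c\x80'
```

So at least this decoder treats the six bytes as ill-formed rather than silently reassembling them into U+10300, which is one possible answer to "what should a conforming application do".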

This question generalises to... should all encoding schemes treat surrogate pairs as surrogate pairs, or just UTF-16?

This question generalises further still, to ... do the phrases "surrogate character" and "surrogate pair" have any meaning whatsoever outside UTF-16?

Jill
