Hi,
What is a conforming application supposed to do if, when decoding a
UTF-8 stream (or indeed a UTF-32 stream, etc.), it encounters a
sequence of bytes which decodes to U+D800 followed by U+DF00?
Of course, if such a sequence were encountered during UTF-16 processing
the answer would be pretty obvious, but I'm not talking about UTF-16 any
more. At least, not directly. Nonetheless, such a sequence could arise if
Application A encodes text to a file using UTF-16, which is then read
by Application B (a very old, legacy application, unaware of the
existence of codepoints above U+FFFF) and re-saved in UTF-8.
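
To make the scenario concrete, here is a minimal Python sketch of what
Application B's output might look like and how one decoder reacts to it.
The byte string, the constant name and both function names are my own
illustrative assumptions, not anything taken from a real application.
The "surrogatepass" error handler is a Python-specific facility; using
it here is not a claim about what a conforming application should do.

```python
# Bytes a legacy application might write if it treated each UTF-16 code
# unit as a separate character: U+D800 -> ED A0 80, U+DF00 -> ED BC 80.
# (Hypothetical sample data.)
LEGACY_BYTES = b"\xed\xa0\x80\xed\xbc\x80"


def strict_decode_fails(data: bytes) -> bool:
    """Return True if Python's default (strict) UTF-8 decoder rejects data."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def lenient_decode(data: bytes) -> str:
    """Hypothetical repair step: accept encoded surrogates, then pair them."""
    # 'surrogatepass' lets the UTF-8 codec produce surrogate code points
    # instead of raising an error.
    with_surrogates = data.decode("utf-8", errors="surrogatepass")
    # Round-tripping through UTF-16 joins each high/low pair into a single
    # supplementary code point. An unpaired surrogate would still raise
    # UnicodeDecodeError in the final strict decode.
    utf16 = with_surrogates.encode("utf-16-le", errors="surrogatepass")
    return utf16.decode("utf-16-le")


if __name__ == "__main__":
    print(strict_decode_fails(LEGACY_BYTES))       # True
    print(hex(ord(lenient_decode(LEGACY_BYTES))))  # 0x10300
```

So Python's strict decoder treats the six bytes as an error, while the
lenient path recovers U+10300, the code point that the pair U+D800,
U+DF00 represents in UTF-16. Whether the lenient path is ever legitimate
is exactly what I am asking.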
This question generalises to: should all encoding schemes
treat surrogate pairs as surrogate pairs, or just UTF-16?
It generalises further still, to: do the phrases
"surrogate character" and "surrogate pair" have any meaning whatsoever
outside UTF-16?
Jill