Markus Kuhn writes:
The appearance of U+FFFD is a kind of error message.
Agreed. And the appearance of a U+DCxx (which in UTF-16 is not preceded
by a high surrogate) is equally "a kind of error message". Just one that
contains a bit (well, seven :-) more information: the 128 possible stray
bytes 0x80..0xFF stay distinguishable instead of all collapsing into a
single U+FFFD.
The difference is ...
Markus Kuhn [EMAIL PROTECTED]:
I see valuable binary data (PDF, ZIP files, etc.) being destroyed
almost every day by accidentally applied stupid lossy CRLF -> LF -> CRLF
data conversion that supposedly smart software tries to perform on the
fly. I foresee similar non-recoverable data ...
Markus Kuhn's proposal D:
All the previous options for converting malformed UTF-8 sequences to
UTF-16 destroy information. ...
Malformed UTF-8 sequences consist exclusively of the bytes 0x80 -
0xff, and each of these bytes can be represented using a 16-bit
value ...
This way 100% binary ...
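A minimal sketch (in C) of how such a conversion could look, assuming,
as the U+DCxx remark above suggests, that a malformed byte 0xNN is
carried through UTF-16 as the unpaired low surrogate U+DCNN. This is an
illustration of the idea, not anyone's actual implementation:

    #include <stdint.h>

    /* Called when the UTF-8 decoder gives up on byte b (0x80..0xff):
     * represent it as U+DC00 + b, i.e. U+DC80 .. U+DCFF. */
    static uint16_t malformed_byte_to_utf16(uint8_t b)
    {
        return (uint16_t)(0xDC00 + b);
    }

    /* On re-encoding, an unpaired low surrogate in that range is
     * written back out as the original byte, giving a binary
     * round trip. */
    static int utf16_to_malformed_byte(uint16_t c, uint8_t *out)
    {
        if (c >= 0xDC80 && c <= 0xDCFF) {
            *out = (uint8_t)(c & 0xFF);
            return 1;               /* recovered the original byte */
        }
        return 0;                   /* ordinary UTF-16 code unit */
    }

The round trip is exact as long as U+DC80..U+DCFF never occurs in the
legitimate data, which holds because unpaired surrogates are not valid
scalar values on their own.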
Edmund GRIMLEY EVANS [EMAIL PROTECTED] writes:
B) Emit a U+FFFD for every byte in a malformed UTF-8 sequence
This is what I do in Mutt. It's easy to implement and works for any
multibyte encoding; the program doesn't have to know about UTF-8.
This is what I recommend at the moment, with ...
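For concreteness, a rough sketch of option B built on the standard
mbrtowc() interface. This is not Mutt's actual code, just one way a
program can emit U+FFFD per undecodable byte without knowing anything
about UTF-8 specifically (it assumes setlocale(LC_CTYPE, "") has been
called so the locale's multibyte encoding is in effect):

    #include <wchar.h>
    #include <string.h>

    static void decode_lossy(const char *in, size_t len,
                             void (*emit)(wchar_t))
    {
        mbstate_t st;
        memset(&st, 0, sizeof st);
        while (len > 0) {
            wchar_t wc;
            size_t n = mbrtowc(&wc, in, len, &st);
            if (n == (size_t)-1 || n == (size_t)-2) {
                emit(0xFFFD);               /* one U+FFFD per bad byte */
                memset(&st, 0, sizeof st);  /* resynchronize */
                n = 1;                      /* skip exactly one byte */
            } else if (n == 0) {
                emit(L'\0');                /* embedded NUL */
                n = 1;
            } else {
                emit(wc);                   /* valid character */
            }
            in += n;
            len -= n;
        }
    }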
Markus Kuhn [EMAIL PROTECTED]:
A) Emit a single U+FFFD per malformed sequence
We discussed this before. I can think of several ways of interpreting
the phrase "malformed sequence".
I think you probably mean either a single octet in the range 80..BF or
a single octet in the range FE..FF or an ...
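Under one possible reading of "malformed sequence" (a maximal run
starting at the offending byte and extending over any following
continuation bytes), a decoder's error path for option A might look
like this sketch; the helper name and calling convention are made up
for illustration, and other readings are possible, as noted above:

    #include <stddef.h>
    #include <stdint.h>

    /* Called after a decoding error at in[0]: consume the offending
     * byte plus any trailing continuation bytes, emit one U+FFFD for
     * the whole run, and return how many bytes were skipped. */
    static size_t skip_malformed(const uint8_t *in, size_t len,
                                 void (*emit)(uint32_t))
    {
        size_t i = 1;                       /* the offending byte */
        while (i < len && in[i] >= 0x80 && in[i] <= 0xBF)
            i++;                            /* swallow continuations */
        emit(0xFFFD);                       /* one replacement per run */
        return i;
    }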