Re: Substituting malformed UTF-8 sequences in a decoder

2000-07-28 Thread Bruno Haible
Markus Kuhn writes: The appearance of U+FFFD is a kind of error message. Agreed. And the appearance of a U+DCxx (which in UTF-16 is not preceded by a high surrogate) is equally "a kind of error message". Just one that contains a bit (well, seven :-) more information. The difference is
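
To make the "seven bits more information" point concrete, here is a minimal Python sketch. It assumes (the excerpt does not spell this out) that a malformed byte 0xNN in the range 0x80..0xFF is substituted by the lone low surrogate U+DCNN. The helper name `original_byte` is hypothetical.

```python
from typing import Optional


def original_byte(ch: str) -> Optional[int]:
    """Return the input byte a substitute character stands for, if any.

    Assumption: malformed byte 0xNN (0x80..0xFF) was decoded as U+DCNN.
    """
    cp = ord(ch)
    if 0xDC80 <= cp <= 0xDCFF:
        # 128 possible values: the seven extra bits identify the byte.
        return cp - 0xDC00
    # U+FFFD (or any ordinary character) carries no such information.
    return None
```

With this mapping, `original_byte("\udcff")` gives back 0xFF, while `original_byte("\ufffd")` gives `None`: both characters signal an error, but only the first says which byte caused it.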

Re: Substituting malformed UTF-8 sequences in a decoder

2000-07-28 Thread Edmund GRIMLEY EVANS
Markus Kuhn: I see valuable binary data (PDF, ZIP files, etc.) being destroyed almost every day by accidentally applied stupid lossy CRLF -> LF -> CRLF data conversion that supposedly smart software tries to perform on the fly. I foresee similar non-recoverable data

Re: Substituting malformed UTF-8 sequences in a decoder

2000-07-27 Thread Bruno Haible
Markus Kuhn's proposal D: All the previous options for converting malformed UTF-8 sequences to UTF-16 destroy information. ... Malformed UTF-8 sequences consist exclusively of the bytes 0x80 - 0xff, and each of these bytes can be represented using a 16-bit value ... This way 100% binary
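
A rough illustration of the idea behind proposal D, sketched in Python. It relies on Python's built-in `surrogateescape` error handler, which maps each undecodable byte 0xNN to U+DCNN and reverses the mapping on encoding. Whether this matches proposal D in every detail is an assumption, and Python strings are sequences of code points rather than UTF-16, so treat it as an analogy only. The function names are hypothetical.

```python
def decode_lossless(raw: bytes) -> str:
    """Decode UTF-8, keeping each malformed byte 0xNN as U+DCNN."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_lossless(text: str) -> bytes:
    """Inverse of decode_lossless: U+DCNN turns back into byte 0xNN."""
    return text.encode("utf-8", errors="surrogateescape")


# Sample input with stray bytes and an incomplete three-byte sequence.
raw = b"ok \xff\xfe \xe2\x82 end"
text = decode_lossless(raw)

# No information is destroyed: the original bytes come back unchanged.
assert encode_lossless(text) == raw
```

The trade-off is that the decoded text now contains lone surrogates, which a strict UTF-8 or UTF-16 encoder will reject unless it knows about the convention.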

Re: Substituting malformed UTF-8 sequences in a decoder

2000-07-25 Thread Florian Weimer
Edmund GRIMLEY EVANS writes: B) Emit a U+FFFD for every byte in a malformed UTF-8 sequence This is what I do in Mutt. It's easy to implement and works for any multibyte encoding; the program doesn't have to know about UTF-8. This is what I recommend at the moment, with
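
Option B can be sketched in Python with a custom codec error handler: on each decoding error, emit one U+FFFD, skip exactly one byte, and let the decoder retry from the next byte. This is only an illustration and not the Mutt code; the handler name `fffd_per_byte` and the function `decode_per_byte` are hypothetical.

```python
import codecs


def _fffd_per_byte(exc):
    """Error handler: one U+FFFD per offending byte, then resume."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    # Replace only the first bad byte and restart decoding right after
    # it, so the handler needs no knowledge of the encoding's structure.
    return ("\ufffd", exc.start + 1)


codecs.register_error("fffd_per_byte", _fffd_per_byte)


def decode_per_byte(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw, substituting U+FFFD for every undecodable byte."""
    return raw.decode(encoding, errors="fffd_per_byte")
```

Because the handler never inspects the bytes itself, the same function can be called with a different `encoding` argument, which mirrors the point that the program does not have to know about UTF-8.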

Re: Substituting malformed UTF-8 sequences in a decoder

2000-07-23 Thread Edmund GRIMLEY EVANS
Markus Kuhn: A) Emit a single U+FFFD per malformed sequence We discussed this before. I can think of several ways of interpreting the phrase "malformed sequence". I think you probably mean either a single octet in the range 80..BF or a single octet in the range FE..FF or an
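
For comparison, one possible reading of option A is what Python's built-in `replace` error handler does: it emits a single U+FFFD for each span the decoder reports as malformed. How Python delimits a "malformed sequence" is its own choice and not necessarily the interpretation meant in this thread. The function name `decode_per_sequence` is hypothetical.

```python
def decode_per_sequence(raw: bytes) -> str:
    """Decode UTF-8, one U+FFFD per malformed span as Python defines it."""
    return raw.decode("utf-8", errors="replace")


# An incomplete three-byte sequence (lead byte plus one continuation
# byte) is reported as one span and yields a single U+FFFD.
assert decode_per_sequence(b"a\xe2\x82b") == "a\ufffdb"

# Bytes that can never start a sequence are each a span of their own.
assert decode_per_sequence(b"\xff\xfe") == "\ufffd\ufffd"

# Lone continuation bytes are likewise replaced one by one.
assert decode_per_sequence(b"\x80\x80") == "\ufffd\ufffd"
```

The first example is where options A and B differ: replacing every byte would produce two U+FFFD characters for the same input.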