Re: Substituting malformed UTF-8 sequences in a decoder
Markus Kuhn writes:
> > The appearance of U+FFFD is a kind of error message.
>
> Agreed. And the appearance of a U+DCxx (which in UTF-16 is not
> preceded by a high surrogate) is equally "a kind of error message".
> Just one that contains a bit (well, seven :-) more information.

The difference is that application writers know how to deal with U+FFFD (hollow box, width 1, etc.). But if a byte 0xBB appears, carried as U+DCBB, applications don't know whether it represents an ISO-8859-1 0xBB (angle quotation mark) or an ISO-8859-6 0xBB (arabic semicolon).

> I see valuable binary data (PDF, ZIP files, etc.) being destroyed
> almost every day by accidentally applied stupid lossy
> CRLF -> LF -> CRLF data conversion that supposedly smart software
> tries to perform on the fly.

It's a problem of the applications. Some application writers think that "as many automatic conversions as possible" and "as many heuristics as possible" qualify as smart. Try and teach them.

> I foresee similar non-recoverable data conversion accidents if we try
> to establish software that wipes out malformed UTF-8 sequences
> without mercy and destroys all information that they might have
> contained.

I like the way Emacs deals with the problem of (sometimes necessary) conversions: when there is an ambiguity, it asks the user. When I take an ISO-8859-1 file with German umlauts, paste a few Chinese ideograms into it and then attempt to save it, it warns me that the new characters won't fit the existing file encoding and asks me to choose another one.

Bruno

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/
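As an aside for present-day readers: the U+DCxx scheme under discussion is essentially what later appeared in Python as the `surrogateescape` error handler (PEP 383). It can serve as an illustration of the lossless round trip being argued about here; a minimal sketch:

```python
# Sketch of the U+DCxx idea using Python's "surrogateescape" error
# handler (a later implementation of the same scheme, used here only
# as an illustration; it is not part of the original thread).

data = b"caf\xc3\xa9 \xbb end"   # valid UTF-8 plus one stray 0xBB byte

# Decoding maps the malformed byte 0xBB to the lone surrogate U+DCBB
# instead of discarding it or collapsing it to U+FFFD.
text = data.decode("utf-8", errors="surrogateescape")
assert text == "caf\u00e9 \udcbb end"

# Encoding turns U+DCBB back into the original byte: a lossless round trip.
assert text.encode("utf-8", errors="surrogateescape") == data
```

Note that the decoded string still tells an application nothing about what 0xBB *meant*, which is exactly Bruno's objection above.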
Re: Substituting malformed UTF-8 sequences in a decoder
Markus Kuhn [EMAIL PROTECTED]:
> I see valuable binary data (PDF, ZIP files, etc.) being destroyed
> almost every day by accidentally applied stupid lossy
> CRLF -> LF -> CRLF data conversion that supposedly smart software
> tries to perform on the fly. I foresee similar non-recoverable data
> conversion accidents if we try to establish software that wipes out
> malformed UTF-8 sequences without mercy and destroys all information
> that they might have contained.

Here the problem is that the program is misconverting on the fly and not giving an error. If the program stopped with an error halfway through, the user would know there was a problem and be able to do something about it. So I don't think a UTF-8 decoder, as implemented in a library, should do anything other than give an error when it encounters malformed UTF-8. The user should be told that something has gone wrong. Clever reversible conversion of malformed sequences is more likely to hide a real problem, causing a bigger one later, than to be helpful, I suspect.

Edmund
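Edmund's policy corresponds to what strict decoders do today. A small sketch using Python's default "strict" error handling, purely as an illustration of failing loudly with enough context for the user to repair the file:

```python
# A strict decoder, as Edmund suggests: refuse malformed input outright
# rather than converting it silently. Python's default "strict" error
# handler is used here only to illustrate the policy.

data = b"ok so far \xfe oops"
try:
    text = data.decode("utf-8")   # errors="strict" is the default
except UnicodeDecodeError as e:
    # The exception pinpoints the offending bytes, so the user (or the
    # sysadmin) can fix the file instead of losing data silently.
    print(f"malformed UTF-8 at byte offset {e.start}: {data[e.start:e.end]!r}")
```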
Re: Substituting malformed UTF-8 sequences in a decoder
Markus Kuhn's proposal D:
> All the previous options for converting malformed UTF-8 sequences to
> UTF-16 destroy information. ... Malformed UTF-8 sequences consist
> exclusively of the bytes 0x80 - 0xFF, and each of these bytes can be
> represented using a 16-bit value ... This way 100% binary-transparent
> UTF-8 -> UTF-16/32 -> UTF-8 round-trip compatibility can be achieved
> quite easily.

I don't like this proposal, for a few reasons:

* What interoperable and reliable software needs is a clear and standardized interchange format. It must say "this is allowed" and "that is forbidden". If after a few years a standard starts saying "this was forbidden but is now allowed", then older software will no longer accept output from newer programs. The result will be just like the mess we had around 1992, when some but not all Unix software was 8-bit clean.

* A program which does something halfway intelligent, like the "fmt" line-breaking program, needs to make assumptions about the characters it is treating. (In the case of fmt: recognize spaces and newlines, and know about their width.) The input is UTF-8 and is converted to UCS-4 via fgetwc. If this UCS-4 stream now contains characters which are only substitutes for *unknown* characters, the fmt program will never know their width. It will thus output (again in UTF-8) the original characters, but will not have done the correct line breaking. In summary, this leads to "garbage in - garbage out" behaviour of programs, whereas a central point of Unicode is that applications definitely know the behaviour of *all* characters. I much prefer the "garbage in - error message" way, because it enables the user or sysadmin to fix the problem (read: call recode on the data files). The appearance of U+FFFD is a kind of error message.

* One of your most prominent arguments for the adoption of UTF-8 is that in 99.99% of cases a UTF-8 encoded file can easily be distinguished from an ISO-8859-1 encoded one. If UTF-8 were extended so that lone bytes in the range 0x80..0xBF were considered valid, this argument would fall apart.

Bruno
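The 99.99% argument in the last point rests on the fact that non-ASCII ISO-8859-1 text is almost never well-formed UTF-8, so a validating decode doubles as a detector. A minimal sketch (the helper name is invented for illustration):

```python
def looks_like_utf8(data: bytes) -> bool:
    """The 99.99% heuristic: non-ASCII ISO-8859-1 text almost never
    happens to be well-formed UTF-8, so a strict decode is a detector."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# German text encoded both ways: the UTF-8 form validates, the
# ISO-8859-1 form does not (its lone 0xFC / 0xDF bytes are malformed UTF-8).
assert looks_like_utf8("Gr\u00fc\u00dfe".encode("utf-8"))
assert not looks_like_utf8("Gr\u00fc\u00dfe".encode("latin-1"))
```

If lone bytes in 0x80..0xBF were made valid, the `except` branch would fire far less often on legacy data, which is exactly how the argument falls apart.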
Re: Substituting malformed UTF-8 sequences in a decoder
Edmund GRIMLEY EVANS [EMAIL PROTECTED] writes:
> > B) Emit a U+FFFD for every byte in a malformed UTF-8 sequence
>
> This is what I do in Mutt. It's easy to implement and works for any
> multibyte encoding; the program doesn't have to know about UTF-8.

This is what I recommend at the moment, with two exceptions: for UTF-8-to-UTF-16 translation, a UCS-4 character which can't be represented in UTF-16 is replaced with a single replacement character. The same applies to syntactically correct UTF-8 sequences which are either overlong or encode code positions, such as surrogates, that are forbidden in UTF-8.

> > D) Emit a malformed UTF-16 sequence for every byte in a malformed
> > UTF-8 sequence
>
> Not much good if you're not converting to UTF-16.

Well, it works with UCS-4 as well (though I would use a private area for this kind of stuff until it's generally accepted practice to do such hacks with surrogates). I think D) could be yet another translation method (in addition to "error" and "replace"), but it shouldn't be the only one a UTF-8 library provides. With method D), your UTF-8 *encoder* might create an invalid UTF-8 stream, which is certainly not desirable for some applications.

> It's unfortunate that the current UTF-8 stuff for Emacs causes
> malformed UTF-8 files to be silently trashed.

Yes, that's quite annoying. But the whole MULE stuff is a big mess. In-band signalling everywhere. :-( (Some byte sequences in a single-byte buffer do very strange things.)
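For reference, here is how a present-day decoder treats the two exceptional cases mentioned above (overlong sequences and encoded surrogates). CPython's "replace" handler is used as the illustration; note that it follows the later Unicode "maximal subpart" practice and so emits several U+FFFDs where the text above recommends a single one, so this shows the problem inputs rather than the exact policy:

```python
# Two kinds of syntactically plausible but forbidden UTF-8, as seen by
# CPython's "replace" error handler (illustration only; the thread's
# recommendation would emit a single U+FFFD for each whole sequence).

# Overlong encoding of '/' (0xC0 0xAF): forbidden, never passed through.
assert b"\xc0\xaf".decode("utf-8", errors="replace") == "\ufffd\ufffd"

# A surrogate code point encoded in UTF-8 (0xED 0xA0 0x80 = U+D800):
# also forbidden, replaced byte by byte.
assert b"\xed\xa0\x80".decode("utf-8", errors="replace") == "\ufffd\ufffd\ufffd"
```

Rejecting overlong forms matters for security as well as consistency: passing 0xC0 0xAF through as '/' would let filters that scan for '/' be bypassed.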
Re: Substituting malformed UTF-8 sequences in a decoder
Markus Kuhn [EMAIL PROTECTED]:
> A) Emit a single U+FFFD per malformed sequence

We discussed this before. I can think of several ways of interpreting the phrase "malformed sequence". I think you probably mean either a single octet in the range 80..BF, or a single octet in the range FE..FF, or an octet in the range C0..FD followed by any number of octets in the range 80..BF, such that it isn't correct UTF-8 and isn't followed by another octet in the range 80..BF. This is probably quite hard to implement consistently, and, as with semantics C, the UTF-8/UTF-16 length ratio is unbounded, which means in particular that you can't decode from a fixed-size buffer in the manner of mbrtowc.

> B) Emit a U+FFFD for every byte in a malformed UTF-8 sequence

This is what I do in Mutt. It's easy to implement and works for any multibyte encoding; the program doesn't have to know about UTF-8. But you have to ask yourself: do I reset the mbstate_t when I replace a bad byte with U+FFFD? If you want consistency, you probably should, as otherwise the mbstate_t is undefined after mbrtowc gives EILSEQ.

> C) Emit a U+FFFD only for the first malformed sequence in a run of
> malformed UTF-8 sequences

I don't think anyone will recommend this.

> D) Emit a malformed UTF-16 sequence for every byte in a malformed
> UTF-8 sequence

Not much good if you're not converting to UTF-16. So perhaps B should be the generally recommended way. However, I agree that a UTF-8 editor should be able to remember malformed UTF-8 sequences, so that you can read in a file, edit part of it and write it out again without it all being rubbished. It's unfortunate that the current UTF-8 stuff for Emacs causes malformed UTF-8 files to be silently trashed.

Edmund
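Option B is simple enough to sketch directly. The function below is hypothetical (the name `decode_option_b` is invented, not from the thread), and it assumes the modern UTF-8 lead-byte ranges C2..DF / E0..EF / F0..F4 rather than the C0..FD ranges discussed above:

```python
def decode_option_b(data: bytes) -> str:
    """Option B sketch: emit one U+FFFD for every byte of a malformed
    sequence, resynchronizing on the very next byte."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                       # plain ASCII
            out.append(chr(b)); i += 1; continue
        # Expected sequence length from the lead byte (modern UTF-8 only;
        # 0xC0, 0xC1 and 0xF5..0xFF are never valid leads).
        if   0xC2 <= b <= 0xDF: n = 2
        elif 0xE0 <= b <= 0xEF: n = 3
        elif 0xF0 <= b <= 0xF4: n = 4
        else:
            out.append("\ufffd"); i += 1; continue
        chunk = data[i:i + n]
        try:
            out.append(chunk.decode("utf-8"))
            i += n
        except UnicodeDecodeError:
            # Malformed: replace just this byte; the following bytes are
            # re-examined and each yields its own U+FFFD if still bad.
            out.append("\ufffd"); i += 1
    return "".join(out)

assert decode_option_b(b"caf\xc3\xa9") == "caf\u00e9"   # valid input untouched
assert decode_option_b(b"\xe2\x82") == "\ufffd\ufffd"   # two bad bytes, two U+FFFDs
```

Because every byte advances the output by at most one code point, the UTF-8/UTF-16 length ratio stays bounded, which is precisely why B sidesteps the fixed-size-buffer problem raised for semantics A and C.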