Re: Substituting malformed UTF-8 sequences in a decoder

Bruno Haible Fri, 28 Jul 2000 07:56:53 -0700
Markus Kuhn writes:
> > The appearance of U+FFFD is a kind of error message.
> 
> Agreed. And the appearance of a U+DCxx (which in UTF-16 is not preceded
> by a high surrogate) is equally "a kind of error message". Just one that
> contains a bit (well, seven :-) more information.

The difference is that application writers know how to deal with
U+FFFD (hollow box, width 1, etc.). But if a byte 0xBB -> U+DCBB
appears, applications don't know whether it represents an ISO-8859-1
0xBB (angle quotation mark) or an ISO-8859-6 0xBB (Arabic semicolon).
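
To make the difference concrete, here is a small sketch in Python. It
is only an illustration, not code from this discussion: it assumes
Python's "surrogateescape" error handler, which maps an undecodable
byte 0xNN to U+DCNN, as a stand-in for the U+DCxx scheme, and the
sample bytes are made up.

```python
# A lone byte 0xBB is a stray continuation byte, so this is malformed UTF-8.
raw = b"A\xbbZ"

# Substitution: the malformed byte becomes U+FFFD and its value is lost.
replaced = raw.decode("utf-8", errors="replace")

# Escaping: the byte is kept as a lone low surrogate (0xBB -> U+DCBB).
escaped = raw.decode("utf-8", errors="surrogateescape")

# The escaped form can be turned back into the original bytes...
roundtrip = escaped.encode("utf-8", errors="surrogateescape")

# ...but it still does not say which character the byte stood for.
as_latin1 = bytes([0xBB]).decode("iso-8859-1")  # angle quotation mark
as_arabic = bytes([0xBB]).decode("iso-8859-6")  # Arabic semicolon

print(ascii(replaced), ascii(escaped), roundtrip == raw)
print(f"U+{ord(as_latin1):04X}", f"U+{ord(as_arabic):04X}")
```

The escaped string preserves the byte, but an application that wants
to display it still has to guess between the two interpretations.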

> I see valuable binary data (PDF & ZIP files, etc.) being destroyed
> almost every day by accidentally applied stupid lossy CRLF -> LF -> CRLF
> data conversion that supposedly smart software tries to perform on the
> fly.

That is a problem with the applications. Some application writers think
that "as many automatic conversions as possible" and "as many
heuristics as possible" qualify as smart. Try and teach them.
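
For illustration, a rough sketch of why such a round trip is lossy on
binary data. The helper names crlf_to_lf and lf_to_crlf and the sample
bytes are hypothetical, chosen only for this example.

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Naive 'text mode' conversion: collapse every CRLF pair to LF."""
    return data.replace(b"\r\n", b"\n")


def lf_to_crlf(data: bytes) -> bytes:
    """Naive reverse conversion: expand every LF to CRLF."""
    return data.replace(b"\n", b"\r\n")


# Made-up binary data containing both a bare LF and a CRLF pair.
binary = b"PK\x03\x04\n\x00\r\n"

roundtrip = lf_to_crlf(crlf_to_lf(binary))

# The bare LF comes back as CRLF, so the data is no longer the same.
print(len(binary), len(roundtrip), roundtrip == binary)
```

After the first step nothing records which LF bytes were originally
bare, so the second step cannot restore the input.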

> I foresee similar non-recoverable data conversion accidents if we
> try to establish software that wipes out malformed UTF-8 sequences
> without mercy and destroys all information that they might have
> contained.

I like the way Emacs deals with the problem of (sometimes necessary)
conversions: When there is an ambiguity, it asks the user. When I take
an ISO-8859-1 file with German umlauts and paste a few Chinese
ideograms into it and then attempt to save it, it warns me that the
new characters won't fit with the existing file encoding and asks me
to choose another file encoding.
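
The same kind of check is easy to sketch in Python. This is not how
Emacs implements it, just an assumption-laden illustration; the
function names can_save_as and encodings_that_fit and the sample text
are hypothetical.

```python
def can_save_as(text: str, encoding: str) -> bool:
    """Return True if every character of text fits the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def encodings_that_fit(text: str, candidates: list[str]) -> list[str]:
    """Return the candidate encodings that could be offered to the user."""
    return [enc for enc in candidates if can_save_as(text, enc)]


# German text with umlauts plus two Chinese ideograms pasted in.
text = "Gr\u00fc\u00dfe \u4e2d\u6587"

if not can_save_as(text, "iso-8859-1"):
    # Instead of silently dropping characters, list the safe choices.
    choices = encodings_that_fit(text, ["iso-8859-1", "iso-8859-15", "utf-8"])
    print("Cannot save as iso-8859-1; choose one of:", choices)
```

The point is that the program refuses to guess and leaves the decision
to the user.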

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/