Re: Substituting malformed UTF-8 sequences in a decoder

2000-07-28 Thread Bruno Haible

Markus Kuhn writes:
> > The appearance of U+FFFD is a kind of error message.
>
> Agreed. And the appearance of a U+DCxx (which in UTF-16 is not preceded
> by a high surrogate) is equally "a kind of error message". Just one that
> contains a bit (well, seven :-) more information.

The difference is that application writers know how to deal with
U+FFFD (hollow box, width 1, etc.). But if a byte 0xBB -> U+DCBB
appears, applications don't know whether it represents an ISO-8859-1
0xBB (right angle quotation mark) or an ISO-8859-6 0xBB (arabic
semicolon).
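(Python later adopted exactly this escaping scheme as the "surrogateescape"
error handler of PEP 383: each undecodable byte 0xXX becomes U+DCXX. A
minimal sketch of both the lossless round trip and the ambiguity Bruno
describes, with a made-up input byte string:)

```python
raw = b"Caf\xbb"  # trailing 0xBB is not valid UTF-8

# Each bad byte 0xXX is mapped to the lone low surrogate U+DCXX.
text = raw.decode("utf-8", errors="surrogateescape")
assert text == "Caf\udcbb"

# The byte value is preserved, but its meaning is not: U+DCBB says
# nothing about whether 0xBB was an ISO-8859-1 right angle quotation
# mark or an ISO-8859-6 arabic semicolon.

# The round trip back to bytes is lossless:
assert text.encode("utf-8", errors="surrogateescape") == raw
```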

> I see valuable binary data (PDF & ZIP files, etc.) being destroyed
> almost every day by accidentally applied stupid lossy CRLF -> LF -> CRLF
> data conversion that supposedly smart software tries to perform on the
> fly.

That is a problem with the applications. Some application writers think
that "as many automatic conversions as possible" and "as many
heuristics as possible" qualify as smart. Try and teach them.
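(A small sketch of why such a conversion is non-recoverable on binary
data. The eight-byte PNG signature deliberately contains both a CRLF and
a bare LF precisely so that this kind of corruption is detected:)

```python
blob = b"\x89PNG\r\n\x1a\n"  # real PNG file signature

# A "smart" on-the-fly CRLF -> LF pass:
unixified = blob.replace(b"\r\n", b"\n")

# An attempted LF -> CRLF undo:
restored = unixified.replace(b"\n", b"\r\n")

assert unixified != blob  # the data was changed...
assert restored != blob   # ...and cannot be recovered, because the
                          # original bare \n has become \r\n as well
```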

> I foresee similar non-recoverable data conversion accidents if we
> try to establish software that wipes out malformed UTF-8 sequences
> without mercy and destroys all information that they might have
> contained.

I like the way Emacs deals with the problem of (sometimes necessary)
conversions: When there is an ambiguity, it asks the user. When I take
an ISO-8859-1 file with German umlauts and paste a few Chinese
ideograms into it and then attempt to save it, it warns me that the
new characters won't fit with the existing file encoding and asks me
to choose another file encoding.

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/lists/



Re: Substituting malformed UTF-8 sequences in a decoder

2000-07-28 Thread Edmund GRIMLEY EVANS

Markus Kuhn [EMAIL PROTECTED]:

> I see valuable binary data (PDF & ZIP files, etc.) being destroyed
> almost every day by accidentally applied stupid lossy CRLF -> LF -> CRLF
> data conversion that supposedly smart software tries to perform on the
> fly. I foresee similar non-recoverable data conversion accidents if we
> try to establish software that wipes out malformed UTF-8 sequences
> without mercy and destroys all information that they might have
> contained.

Here the problem is that the program is misconverting on the fly and
not giving an error. If the program stopped with an error halfway
through, the user would know there was a problem and be able to do
something about it.

So, I don't think a UTF-8 decoder, as implemented in a library, should
do anything other than give an error if it encounters malformed UTF-8.
The user should be told that something has gone wrong. Clever
reversible conversion of malformed sequences is more likely to hide a
real problem, causing a bigger problem later, than to be helpful, I
suspect.
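(This strict behaviour is what Python's default error handler does: a
malformed sequence stops the decode with an error that names the
offending byte, rather than silently substituting anything. A minimal
sketch, using a made-up input:)

```python
raw = b"Caf\xbb"  # trailing 0xBB is not valid UTF-8

try:
    raw.decode("utf-8")  # errors="strict" is the default
except UnicodeDecodeError as e:
    # The exception pinpoints the failure instead of hiding it:
    # e.start is the offset of the bad byte, e.reason explains why.
    print(f"decode failed at byte {e.start}: {e.reason}")
```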

Edmund