Re: Substituting malformed UTF-8 sequences in a decoder

Bruno Haible Thu, 27 Jul 2000 14:44:45 -0700
Markus Kuhn's proposal D:
> All the previous options for converting malformed UTF-8 sequences to
> UTF-16 destroy information. ...
> Malformed UTF-8 sequences consist exclusively of the bytes 0x80 -
> 0xff, and each of these bytes can be represented using a 16-bit
> value ...
> This way 100% binary transparent UTF-8 -> UTF-16/32 -> UTF-8 round-trip
> compatibility can be achieved quite easily.
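
For illustration only, here is a minimal sketch of this kind of
byte-preserving round trip. It uses Python's "surrogateescape" error
handler (PEP 383), which maps each undecodable byte 0xNN to the lone
surrogate U+DCNN. The quoted text elides the exact 16-bit values of
proposal D, so this mapping is an assumption made for the example and
is not necessarily the one proposed.

```python
# Round-trip arbitrary bytes through a Unicode string using Python's
# "surrogateescape" error handler (PEP 383). The U+DCNN mapping is
# Python's choice, used here only as an example of the idea.
raw = b"abc\x80\xff\xc3\xa9"  # two stray bytes, then a valid U+00E9

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")
escaped = [hex(ord(ch)) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# Encoding with the same handler turns U+DCNN back into the byte 0xNN.
restored = text.encode("utf-8", errors="surrogateescape")

print(escaped)          # code points standing in for the stray bytes
print(restored == raw)  # whether the round trip preserved every byte
```

Note that a program working on `text` sees the substitutes as
ordinary code points and knows nothing about what they represent.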

I don't like this proposal for a few reasons:

* What interoperable and reliable software needs is a clear and
  standardized interchange format. It must say "this is allowed" and
  "that is forbidden". If after a few years a standard starts saying
  "this was forbidden but is now allowed", then older software will
  not accept output from newer programs any more. And the result will
  be just like the mess we had around 1992 when some but not all Unix
  software was 8-bit clean.

* A program which does something halfway intelligent, like the "fmt"
  line breaking program, needs to make assumptions about the
  characters it is treating. (In the case of fmt: recognize spaces and
  newlines, and know about their width.) The input is UTF-8 and is
  converted to UCS-4 via fgetwc. If this UCS-4 stream now contains
  characters which are only substitutes for *unknown* characters, the
  fmt program will never know the width of these. It will thus output
  (again in UTF-8) the original characters, but will not have done the
  correct line breaking.

  In summary, this leads to "garbage in - garbage out" behaviour of
  programs, whereas a central point of Unicode is that applications
  know the behaviour of *all* characters, definitely.

  I much prefer the "garbage in - error message" way, because it
  enables the user or sysadmin to fix the problem (read: call recode
  on the data files). The appearance of U+FFFD is a kind of error
  message.

* One of your most prominent arguments for the adoption of UTF-8 is
  that in 99.99% of cases a UTF-8 encoded file can easily be
  distinguished from an ISO-8859-1 encoded one. If UTF-8 were extended
  so that lone bytes in the range 0x80..0xBF were considered valid,
  this argument would fall apart.
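
The last two points can be sketched in a few lines of Python, under
the assumption that a strict decoder stands in for "garbage in - error
message" and that errors="replace" stands in for a decoder that emits
U+FFFD. The helper name looks_like_utf8 is hypothetical, chosen for
this example.

```python
# The same word in the two encodings being compared.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")  # b'caf\xe9'
utf8_bytes = "caf\u00e9".encode("utf-8")         # b'caf\xc3\xa9'


def looks_like_utf8(data: bytes) -> bool:
    """Return True only if data is well-formed UTF-8 (strict decoding)."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


# Strict decoding rejects ISO-8859-1 text containing non-ASCII bytes,
# which is what makes the two encodings distinguishable.
print(looks_like_utf8(utf8_bytes))
print(looks_like_utf8(latin1_bytes))

# A lone continuation byte (0x80..0xBF) is rejected as well; if such
# bytes were declared valid, this check would lose much of its power.
print(looks_like_utf8(b"\x80"))

# With errors="replace", the malformed byte shows up as U+FFFD, a
# visible marker that the input needs fixing.
replaced = latin1_bytes.decode("utf-8", errors="replace")
print(replaced == "caf\ufffd")
```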

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/