Markus Kuhn's proposal D:
> All the previous options for converting malformed UTF-8 sequences to
> UTF-16 destroy information. ...
> Malformed UTF-8 sequences consist exclusively of the bytes 0x80 -
> 0xff, and each of these bytes can be represented using a 16-bit
> value ...
> This way 100% binary transparent UTF-8 -> UTF-16/32 -> UTF-8 round-trip
> compatibility can be achieved quite easily.
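
For illustration, the round trip described in the quote can be sketched
as follows. This is only a sketch under an assumption: the quote elides
which 16-bit values are meant, so Python's built-in "surrogateescape"
error handler is used here as a stand-in for the mapping. The helper
names decode_lossless and encode_lossless are hypothetical.

```python
# Assumption: Python's "surrogateescape" handler stands in for the
# proposal's byte-to-16-bit-value mapping, which the quote does not spell out.

def decode_lossless(raw: bytes) -> str:
    """Decode UTF-8, keeping each malformed byte as a substitute code unit."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_lossless(text: str) -> bytes:
    """Encode back to bytes, restoring the original malformed bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# 0xff and a lone 0x80 are both malformed in UTF-8.
malformed = b"abc\xff\x80def"
text = decode_lossless(malformed)
roundtrip = encode_lossless(text)
```

The bytes survive the round trip unchanged, but the decoded string now
contains substitute values whose properties (width, class) no program
can know.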
I don't like this proposal for a few reasons:
* What interoperable and reliable software needs is a clear and
standardized interchange format. It must say "this is allowed" and
"that is forbidden". If after a few years a standard starts saying
"this was forbidden but is now allowed", then older software will
not accept output from newer programs any more. And the result will
be just like the mess we had around 1992 when some but not all Unix
software was 8-bit clean.
* A program which does something halfway intelligent, like the "fmt"
line breaking program, needs to make assumptions about the
characters it is treating. (In the case of fmt: recognize spaces and
newlines, and know about their width.) The input is UTF-8 and is
converted to UCS-4 via fgetwc. If this UCS-4 stream now contains
characters which are only substitutes for *unknown* characters, the
fmt program will never know the width of these. It will thus output
(again in UTF-8) the original characters, but will not have done the
correct line breaking.
In summary, this leads to "garbage in - garbage out" behaviour of
programs, whereas a central point of Unicode is that applications
know the behaviour of *all* characters, definitely.
I much prefer the "garbage in - error message" way, because it
enables the user or sysadmin to fix the problem (read: call recode
on the data files). The appearance of U+FFFD is a kind of error
message.
* One of your most prominent arguments for the adoption of UTF-8 is
that in 99.99% of the cases a UTF-8 encoded file can easily be
distinguished from an ISO-8859-1 encoded one. If UTF-8 were extended
so that lone bytes in the range 0x80..0xBF were considered valid,
this argument would fall apart.
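
A minimal sketch of the last two points, assuming Python's strict UTF-8
decoder as the "garbage in - error message" model and its
"surrogateescape" handler as a stand-in for a decoder that accepts lone
bytes. The function name looks_like_utf8 and the sample text are
hypothetical.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Return True only if raw is well-formed UTF-8 (strict decoding)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


sample = "Grüße"
latin1_bytes = sample.encode("iso-8859-1")  # contains lone bytes >= 0x80
utf8_bytes = sample.encode("utf-8")

# Strict decoding tells the two encodings apart.
strict_latin1 = looks_like_utf8(latin1_bytes)
strict_utf8 = looks_like_utf8(utf8_bytes)

# With errors="replace", the bad bytes show up as U+FFFD, a visible error.
replaced = latin1_bytes.decode("utf-8", errors="replace")

# A decoder that accepts lone bytes never fails, so it cannot be used
# to detect that the input was not UTF-8 in the first place.
accepted = latin1_bytes.decode("utf-8", errors="surrogateescape")
```

With strict decoding the ISO-8859-1 file is rejected and the user can
recode it. Once lone bytes are accepted, both files decode without any
complaint and the distinction is lost.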
Bruno
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/