Florian Weimer wrote on 2000-07-25 08:53 UTC:
> > > D) Emit a malformed UTF-16 sequence for every byte in a malformed
> > > UTF-8 sequence
> >
> > Not much good if you're not converting to UTF-16.
>
> Well, it works with UCS-4 as well (but I would use a private area for
> this kind of stuff until it's generally accepted practice to do such
> hacks with surrogates).

No, this way you would lose transparency for private-area characters:
a genuine private-area character in the input could no longer be
distinguished from an escaped malformed byte.
If you do in-band signalling of UTF-8 errors in UCS-4, then you must
use only characters that are forbidden to be encoded in UTF-8 anyway,
and these are only the surrogates plus U+FFFE and U+FFFF.
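
The transparency problem can be illustrated with a small Python sketch.
Everything named here is hypothetical: `decode_private` and the choice
of U+E000 as the base of the escape range are assumptions made only for
the demonstration, not an existing library interface. It relies on
Python 3's strict UTF-8 codec, which rejects encoded surrogates.

```python
PRIVATE_BASE = 0xE000  # hypothetical private-area block used for escapes


def decode_private(data: bytes) -> str:
    """Decode UTF-8, mapping each malformed byte 0xNN to U+E0NN (assumed scheme)."""
    out = []
    i = 0
    while i < len(data):
        for n in (1, 2, 3, 4):  # a UTF-8 sequence is 1 to 4 bytes long
            try:
                out.append(data[i:i + n].decode("utf-8"))
                i += n
                break
            except UnicodeDecodeError:
                continue
        else:
            # No valid sequence starts here: escape this single byte.
            out.append(chr(PRIVATE_BASE + data[i]))
            i += 1
    return "".join(out)


malformed = b"\x80"                 # a stray continuation byte
genuine = "\ue080".encode("utf-8")  # valid UTF-8 for a real private-area character

# Both inputs decode to U+E080, so an encoder cannot tell them apart.
collision = decode_private(malformed) == decode_private(genuine)

# A surrogate cannot arrive through well-formed UTF-8 in the first place.
try:
    b"\xed\xb2\x80".decode("utf-8")  # would-be encoding of U+DC80
    surrogate_is_decodable = True
except UnicodeDecodeError:
    surrogate_is_decodable = False
```

With the private-area mapping, `collision` is true and the original
bytes cannot be reconstructed; with surrogates as escape codes no such
ambiguity arises, because a strict decoder never produces them.
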
> I think D) could be yet another translation method (in addition to
> "error" and "replace"), but it shouldn't be the only one a UTF-8
> library provides. With method D), your UTF-8 *encoder* might create
> an invalid UTF-8 stream, which is certainly not desirable for some
> applications.

I agree that proper terminology should be introduced to avoid confusion
between the different decoder semantics. How about a "UTF-8B decoder",
which decodes a mixture of UTF-8 and arbitrary binary content to UCS-4
without loss of information? The exact byte sequence would be
reconstructed if the UCS-4 is fed into a "UTF-8B encoder". In contrast
to a UTF-8 decoder, a UTF-8B decoder never signals an error for a
malformed sequence. In contrast to a UTF-8 encoder, a UTF-8B encoder can
produce all possible malformed UTF-8 sequences. Binary-transparent
round-trip conversion would be possible both for UTF-8B <-> UTF-16 and
for UTF-8B <-> UCS-4.
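
A minimal sketch of such a decoder/encoder pair in Python follows. It
assumes that Python 3's `surrogateescape` error handler is an acceptable
stand-in for the proposal: that handler maps each undecodable byte
0x80..0xFF to U+DC80..U+DCFF on decoding and reverses the mapping on
encoding. The names `utf8b_decode` and `utf8b_encode` are hypothetical
wrappers chosen for this example.

```python
def utf8b_decode(data: bytes) -> str:
    """Decode UTF-8; malformed bytes become U+DC80..U+DCFF instead of errors."""
    return data.decode("utf-8", errors="surrogateescape")


def utf8b_encode(text: str) -> bytes:
    """Encode to UTF-8; U+DC80..U+DCFF turn back into the original raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# Well-formed UTF-8 mixed with arbitrary binary bytes.
mixed = b"gr\xc3\xbc\xc3\x9f \xff\xfe\x80 ok"

text = utf8b_decode(mixed)     # never raises for malformed input
restored = utf8b_encode(text)  # reproduces the exact input bytes

# Every possible byte value survives the round trip.
all_bytes = bytes(range(256))
all_restored = utf8b_encode(utf8b_decode(all_bytes))
```

Here `restored == mixed` and `all_restored == all_bytes`, while a strict
`text.encode("utf-8")` would raise an error on the escape code points,
which is the distinction between the two encoder semantics.
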

I am considering drafting a Unicode Technical Report on the entire idea
behind option D), which might help to sprinkle some "officially
recognized standard" magic over the idea of using U+DC80..U+DCFF as
byte error codes.

Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/