Re: Substituting malformed UTF-8 sequences in a decoder

Markus Kuhn Tue, 25 Jul 2000 02:47:36 -0700
Florian Weimer wrote on 2000-07-25 08:53 UTC:
> > > D) Emit a malformed UTF-16 sequence for every byte in a malformed
> > >    UTF-8 sequence
> > 
> > Not much good if you're not converting to UTF-16.
> 
> Well, it works with UCS-4 as well (but I would use a private area for
> this kind of stuff until it's generally accepted practice to do such
> hacks with surrogates).

No, that way you would lose transparency for private-area characters.
If you signal UTF-8 errors in-band in UCS-4, then you must use only
characters that are forbidden from being encoded in UTF-8 anyway, and
these are only the surrogates plus U+FFFE and U+FFFF.
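
To make the transparency argument concrete, here is a minimal Python
sketch. The helper escape_bad_byte and the base value 0xE000 are
hypothetical and chosen only for illustration; nobody in this thread
proposed that exact mapping. It shows that a private-area escape
collides with a genuine private-area character, while a strict UTF-8
decoder (here: Python's built-in one) can never produce a lone
surrogate from its input.

```python
# Hypothetical escape scheme: represent a malformed byte b as the
# code point (base + b). Both base values are assumptions made for
# illustration only.
PRIVATE_BASE = 0xE000    # somewhere in the private use area
SURROGATE_BASE = 0xDC00  # low-surrogate range, as in option D)


def escape_bad_byte(b, base):
    """Return the escape character for malformed byte b."""
    return chr(base + b)


# A perfectly valid UTF-8 encoding of the private-area character U+E080.
genuine = "\ue080".encode("utf-8").decode("utf-8")

# The escape for a stray byte 0x80 under the private-area scheme.
escaped = escape_bad_byte(0x80, PRIVATE_BASE)

# Both are U+E080, so an encoder cannot tell them apart any more.
collision = genuine == escaped

# The would-be UTF-8 form of U+DC80 is rejected by a strict decoder,
# so U+DC80 can only ever come from the escape mechanism itself.
try:
    b"\xed\xb2\x80".decode("utf-8")
    surrogate_decodable = True
except UnicodeDecodeError:
    surrogate_decodable = False
```

With the private-area base the two cases are indistinguishable; with
the surrogate base no such collision can arise from well-formed input.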

> I think D) could be yet another translation method (in addition to
> "error" and "replace"), but it shouldn't be the only one a UTF-8
> library provides.  With method D), your UTF-8 *encoder* might create
> an invalid UTF-8 stream, which is certainly not desirable for some
> applications.

I agree that proper terminology should be introduced to avoid confusion
between the different decoder semantics. How about a "UTF-8B decoder",
which decodes a mixture of UTF-8 and arbitrary binary content to UCS-4
without loss of information? The exact byte sequence would be
reconstructed if the UCS-4 is fed into a "UTF-8B encoder". In contrast
to UTF-8 decoders, a UTF-8B decoder will never signal an error for a
malformed sequence. In contrast to UTF-8 encoders, UTF-8B encoders can
produce all possible malformed UTF-8 sequences. Both UTF-8B <-> UTF-16
as well as UTF-8B <-> UCS-4 binary-transparent round-trip conversion
would be possible.
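
As a rough illustration of the round-trip property, the following
Python sketch uses the "surrogateescape" error handler of Python's
built-in UTF-8 codec. That handler is a later, independent design and
not a definition of UTF-8B; I use it here only because it behaves in
a closely related way, mapping each undecodable byte 0x80-0xFF to
U+DC80-U+DCFF and back. The sample byte strings are made up.

```python
# Valid UTF-8 ("caf" + e-acute) mixed with two stray bytes 0xFF 0xFE.
raw = b"caf\xc3\xa9 \xff\xfe ok"

# Decoder side: never signals an error; each malformed byte becomes
# one code point in the range U+DC80..U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoder side: the escape code points turn back into the original
# bytes, so the exact input byte sequence is reconstructed.
restored = text.encode("utf-8", errors="surrogateescape")

# The same holds for a string containing every possible byte value.
all_bytes = bytes(range(256))
all_restored = (
    all_bytes.decode("utf-8", errors="surrogateescape")
    .encode("utf-8", errors="surrogateescape")
)

# A strict UTF-8 encoder, in contrast, refuses the escape code points.
try:
    text.encode("utf-8")
    strict_accepts = True
except UnicodeEncodeError:
    strict_accepts = False
```

Keeping the lossless mode as an explicit, named option next to the
strict one matches the point made above: applications that must never
emit malformed UTF-8 simply do not select it.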

I am considering drafting a Unicode Technical Report on the entire idea
behind option D), which might help to sprinkle some "officially
recognized standard" magic over the idea of using U+DC80 to U+DCFF as
byte error codes.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/