Re: Substituting malformed UTF-8 sequences in a decoder

Florian Weimer Wed, 02 Aug 2000 04:17:39 -0700
  Markus Kuhn <[EMAIL PROTECTED]> writes:

> > > > D) Emit a malformed UTF-16 sequence for every byte in a malformed
> > > >    UTF-8 sequence
> > > 
> > > Not much good if you're not converting to UTF-16.
> > 
> > Well, it works with UCS-4 as well (but I would use a private area for
> > this kind of stuff until it's generally accepted practice to do such
> > hacks with surrogates).
> 
> No, this way, you would lose transparency for private area characters.

You would have to encode them anyway in order to avoid collisions, of
course.  (GNU Emacs 20 assigns some characters in ISO 8859 for private
use, without encoding.  That's certainly a big mess.)  I doubt that
you can live without special purpose private characters in a text
editor like GNU Emacs.

> I agree that proper terminology should be introduced to avoid confusion
> between the different decoder semantics. How about a "UTF-8B decoder",
> which decodes a mixture of UTF-8 and arbitrary binary content to UCS-4
> without loss of information.

IMHO, that's the correct approach.  It's not useful to specify one
decoder behavior -- in practice, you may need error signalling,
replacement characters *and* UTF-8B, depending on your application.
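
To make the three behaviors concrete, here is a rough sketch in Python.
It is not the decoder extension mentioned in this thread.  It assumes a
present-day Python 3, whose built-in "surrogateescape" error handler
(PEP 383) happens to map each undecodable byte 0xNN to U+DC00 + 0xNN,
i.e. into the DC80-DCFF range.  The function names are made up for
illustration only.

```python
# Sketch only: Python 3's built-in error handlers stand in for the three
# decoder behaviors.  "surrogateescape" is used as an approximation of
# the UTF-8B idea; the helper names below are hypothetical.

def decode_strict(data: bytes) -> str:
    """Error signalling: raises UnicodeDecodeError on malformed input."""
    return data.decode("utf-8", errors="strict")


def decode_replace(data: bytes) -> str:
    """Replacement characters: malformed bytes become U+FFFD (lossy)."""
    return data.decode("utf-8", errors="replace")


def decode_utf8b(data: bytes) -> str:
    """UTF-8B style: each malformed byte 0xNN becomes U+DC00 + 0xNN."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_utf8b(text: str) -> bytes:
    """Inverse of decode_utf8b: U+DC80..U+DCFF turn back into raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# 0xFF can never occur in well-formed UTF-8; 0xC3 0xA9 is a valid "e acute".
sample = b"abc\xff\xc3\xa9"

decoded = decode_utf8b(sample)
roundtrip = encode_utf8b(decoded)
```

With this input, decode_strict(sample) raises, decode_replace(sample)
loses the original byte, and only the UTF-8B style pair gives back the
exact input bytes.  Which of the three you want depends on whether the
caller has to reject, display, or preserve the malformed data.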

> I am considering drafting a Unicode Technical Report on the entire idea
> behind option D), which might help to sprinkle some "officially
> recognized standard" magic over the idea of using DC80-DCFF as byte
> error codes.

Okay, some day, I might extend Python's UTF-8 decoders and encoders
so that we've got a reference implementation (provided that the new
Python license turns out to be acceptable to me).
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/