Re: Substituting malformed UTF-8 sequences in a decoder

Edmund GRIMLEY EVANS Sun, 23 Jul 2000 15:29:31 -0700
Markus Kuhn <[EMAIL PROTECTED]>:

> A) Emit a single U+FFFD per malformed sequence

We discussed this before. I can think of several ways of interpreting
the phrase "malformed sequence".

I think you probably mean one of three things: a single octet in the
range 80..BF; a single octet in the range FE..FF; or an octet in the
range C0..FD followed by any number of octets in the range 80..BF,
where the whole run isn't correct UTF-8 and isn't followed by another
octet in the range 80..BF.

This is probably quite hard to implement consistently. Also, as with
option C, the UTF-8/UTF-16 length ratio is unbounded, because a single
U+FFFD can stand for arbitrarily many octets. In particular, that
means you can't decode from a fixed-size buffer in the manner of
mbrtowc.
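
To make that interpretation concrete, here is a minimal sketch in C.
The function name and the buffer interface are made up for
illustration. I am also assuming the original 31-bit definition of
UTF-8 (lead octets up to FD, sequences of up to six octets), with
"correct" meaning only "right length and not overlong"; a real decoder
would probably want to reject surrogates and similar as well.

```c
#include <stddef.h>
#include <stdint.h>

/* True for a continuation octet, i.e. one in the range 80..BF. */
static int is_cont(unsigned char b) { return (b & 0xC0) == 0x80; }

/* Total sequence length announced by a lead octet in C0..FD
 * (assumption: original 31-bit UTF-8, up to six octets). */
static size_t seq_len(unsigned char b)
{
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    if (b < 0xFC) return 5;
    return 6;
}

/* Hypothetical decoder for semantics A: one U+FFFD per malformed
 * sequence. Returns the number of code points written to dst. */
size_t decode_one_per_sequence(const unsigned char *s, size_t n,
                               uint32_t *dst, size_t cap)
{
    /* Smallest value that may legally use each sequence length. */
    static const uint32_t min_value[7] =
        {0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000};
    size_t i = 0, out = 0;

    while (i < n && out < cap) {
        unsigned char b = s[i];
        if (b < 0x80) {                 /* ASCII */
            dst[out++] = b;
            i++;
            continue;
        }
        if (b < 0xC0 || b >= 0xFE) {    /* stray 80..BF, or FE..FF */
            dst[out++] = 0xFFFD;
            i++;
            continue;
        }
        /* Lead octet: swallow every continuation octet that follows. */
        size_t j = i + 1;
        while (j < n && is_cont(s[j]))
            j++;
        size_t len = seq_len(b);
        uint32_t cp = 0xFFFD;           /* malformed unless proven valid */
        if (j - i == len) {
            uint32_t v = b & (0x7F >> len);
            for (size_t k = i + 1; k < j; k++)
                v = (v << 6) | (s[k] & 0x3F);
            if (v >= min_value[len])    /* reject overlong forms */
                cp = v;
        }
        dst[out++] = cp;
        i = j;
    }
    return out;
}
```

Note that C0 followed by a thousand octets in 80..BF comes out as one
U+FFFD, which is the unbounded ratio mentioned above. A valid sequence
followed by one extra continuation octet also collapses into a single
U+FFFD under this reading.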

> B) Emit a U+FFFD for every byte in a malformed UTF-8 sequence

This is what I do in Mutt. It's easy to implement and works for any
multibyte encoding; the program doesn't have to know about UTF-8.

But you have to ask yourself: do I reset the mbstate_t when I replace
a bad byte with U+FFFD? If you want consistency, you probably should,
as otherwise the mbstate_t is undefined after mbrtowc fails with
EILSEQ.
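
For illustration, here is a rough sketch of semantics B on top of
mbrtowc, with the state reset after every bad byte. This is not the
actual Mutt code; the names decode_lossy and select_utf8_locale are
hypothetical, the locale names tried are guesses that vary between
systems, and I am assuming that wchar_t holds Unicode code points so
that 0xFFFD means U+FFFD.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

#define REPLACEMENT_CHAR ((wchar_t)0xFFFD)

/* mbrtowc decodes according to LC_CTYPE, so a multibyte locale must
 * be selected first. The names below are assumptions; returns 1 if
 * one of them was accepted. */
int select_utf8_locale(void)
{
    return setlocale(LC_CTYPE, "C.UTF-8") != NULL
        || setlocale(LC_CTYPE, "en_US.UTF-8") != NULL;
}

/* Hypothetical decoder for semantics B: one U+FFFD per bad byte.
 * Decodes n bytes of src into at most cap wide characters and
 * returns the number written. Knows nothing about UTF-8 itself. */
size_t decode_lossy(const char *src, size_t n, wchar_t *dst, size_t cap)
{
    mbstate_t st;
    size_t out = 0;

    memset(&st, 0, sizeof st);          /* initial conversion state */
    while (n > 0 && out < cap) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, src, n, &st);

        if (r == (size_t)-1 || r == (size_t)-2) {
            /* Invalid (-1) or truncated at end of input (-2):
             * replace exactly one byte and reset the state, since
             * it is undefined after an EILSEQ failure. */
            memset(&st, 0, sizeof st);
            dst[out++] = REPLACEMENT_CHAR;
            src++;
            n--;
        } else {
            if (r == 0)                 /* decoded L'\0'; assume 1 byte */
                r = 1;
            dst[out++] = wc;
            src += r;
            n -= r;
        }
    }
    return out;
}
```

Each output character consumes at least one input byte, so the output
can never be longer than the input, and a caller can size the buffer
in advance.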

> C) Emit a U+FFFD only for every first malformed sequence in a sequence
>    of malformed UTF-8 sequences

I don't think anyone will recommend this.

> D) Emit a malformed UTF-16 sequence for every byte in a malformed
>    UTF-8 sequence

Not much good if you're not converting to UTF-16.

So perhaps B should be the generally recommended way.

However, I agree that a UTF-8 editor should be able to remember
malformed UTF-8 sequences so that you can read in a file, edit part of
it and write it out again without it all being rubbished.

It's unfortunate that the current UTF-8 stuff for Emacs causes
malformed UTF-8 files to be silently trashed.

Edmund
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/