Markus Kuhn <[EMAIL PROTECTED]>:

> If I read what you suggested correctly, then you
> mean that appending 80 to a valid UTF-8 sequence will make it invalid,

Er, no. I was talking about how to cope with invalid sequences once
ordinary decoding has failed. If the prefix is a valid UTF-8 sequence,
then it has already been eaten. The question is how to split up C0 80
or F0 80 80 80 80, say.

> No, it is not. A malformed sequence can never be longer than the longest
> correct sequence, namely 6 bytes.

In that case, perhaps I can guess what you mean by "malformed
sequence". You mean a single 80..BF or FE..FF, or a C0..FD followed by
the right number of 80..BF (but the sequence is over-long) or not
enough 80..BF.

So C0 80 is one malformed sequence, and F0 80 80 80 80 is two
malformed sequences, yes?
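If that is the rule, it can be sketched like this (a hypothetical `split_malformed` helper in Python, written only to illustrate the segmentation; it assumes its input is a run of bytes that ordinary decoding has already rejected):

```python
def split_malformed(data: bytes):
    # Split rejected bytes into "malformed sequences" under the rule
    # described above:
    #   - a lone 80..BF or FE..FF byte is one sequence;
    #   - a C0..FD lead byte plus up to the expected number of
    #     following 80..BF continuation bytes is one sequence.
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0xC0 <= b <= 0xFD:
            # expected total length implied by the lead byte
            if b >= 0xFC:
                n = 6
            elif b >= 0xF8:
                n = 5
            elif b >= 0xF0:
                n = 4
            elif b >= 0xE0:
                n = 3
            else:
                n = 2
            # eat at most n-1 continuation bytes
            j = i + 1
            while j < len(data) and j < i + n and 0x80 <= data[j] <= 0xBF:
                j += 1
            out.append(data[i:j])
            i = j
        else:
            # lone continuation byte, or FE/FF
            out.append(data[i:i + 1])
            i += 1
    return out
```

So `split_malformed(b"\xC0\x80")` yields one sequence, and `split_malformed(b"\xF0\x80\x80\x80\x80")` yields two: the over-long `F0 80 80 80` and a trailing lone `80`.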

But I still can't use library functions to do this, and I don't see
the advantage over plain and simple B. (Must remember to check that I
reset the mbstate_t, though.)

> > > D) Emit a malformed UTF-16 sequence for every byte in a malformed
> > >    UTF-8 sequence
> >
> > Not much good if you're not converting to UTF-16.
> 
> No. Note that you do not have to actually convert to UTF-16 to make use
> of this technique. The exact same trick works also with UCS-2, UCS-4,
> etc.! It is just more educational to explain it in terms of UTF-16,
> because then it becomes very clear why mapping bytes of malformed
> sequences onto U+DC80 .. U+DCFF is a particularly good choice of error
> codes, since it does not collide with anything that even a UTF-8 ->
> UTF-16 decoder could produce normally.
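For what it's worth, Python's "surrogateescape" error handler (PEP 383) implements exactly this mapping, which makes the technique easy to demonstrate:

```python
# Each byte of a malformed sequence comes back as U+DC80..U+DCFF,
# and re-encoding restores the original bytes.
raw = b"ok \xC0\x80 end"  # C0 80 is a malformed (over-long) sequence
s = raw.decode("utf-8", errors="surrogateescape")
print([hex(ord(c)) for c in s[3:5]])  # the two error codes
# the mapping is reversible:
assert s.encode("utf-8", errors="surrogateescape") == raw
```

The byte C0 comes back as U+DCC0 and the byte 80 as U+DC80, so nothing that a well-formed UTF-8 input could produce is ever confused with an error code.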

But aren't ED B2 80 and 80 both mapped to U+DC80, which is then mapped
back to 80 either way, turning correct UTF-8 into malformed UTF-8?
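A quick check suggests the two cases stay distinct, provided the decoder treats ED B2 80 itself as malformed (a strict UTF-8 decoder must, since it encodes a surrogate). Again using Python's surrogateescape handler as a stand-in for the technique:

```python
# A strict UTF-8 decoder rejects ED B2 80 (an encoded surrogate), so
# each of its three bytes becomes its own error code, not U+DC80:
a = b"\xED\xB2\x80".decode("utf-8", errors="surrogateescape")
b = b"\x80".decode("utf-8", errors="surrogateescape")
print([hex(ord(c)) for c in a])  # three codes
print([hex(ord(c)) for c in b])  # one code
# so the round trip restores each original byte string unchanged:
assert a.encode("utf-8", "surrogateescape") == b"\xED\xB2\x80"
assert b.encode("utf-8", "surrogateescape") == b"\x80"
```

That is, U+DC80 is only ever produced from a lone 80 byte, because a decoder using this scheme never produces U+DC80 from ED B2 80 in the first place.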

Edmund
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
