Re: Substituting malformed UTF-8 sequences in a decoder

Edmund GRIMLEY EVANS Tue, 25 Jul 2000 03:01:31 -0700

Markus Kuhn <[EMAIL PROTECTED]>:

> > > Not much good if you're not converting to UTF-16.
> > 
> > Well, it works with UCS-4 as well (but I would use a private area for
> > this kind of stuff until it's generally accepted practice to do such
> > hacks with surrogates).
> 
> No, this way, you would lose transparency for private area characters.
> If you do in-band signalling of UTF-8 errors in UCS-4, then you must
> only use characters that are forbidden to be encoded in UTF-8 anyway,
> and these are only the surrogates plus U+FFFE and U+FFFF.

So what should mbtowc(&wc, "\xED\xB2\x80", 3) return?

With the libutf8_plug I have here it returns 3 and sets wc to 0xDC80.

I really don't like the idea of a UTF-8 decoder having to know about
surrogates which have nothing to do with UTF-8. If that sort of thing
starts being imposed, I start to wonder whether Unicode really is too
complex to be secure ...
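
To make the question concrete, here is a minimal sketch of the two
possible behaviours. It is not the libutf8_plug code; decode3_raw and
decode3_strict are hypothetical helper names that handle only the
three-byte form. The first does the bit arithmetic and nothing else,
which is how "\xED\xB2\x80" comes out as 0xDC80. The second adds the
one range check that a surrogate-aware decoder would need.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode one three-byte UTF-8 sequence by bit
 * arithmetic alone. Returns 3 on success, or -1 if the bytes are not
 * of the form 1110xxxx 10xxxxxx 10xxxxxx or the encoding is overlong. */
static int decode3_raw(const unsigned char *s, unsigned long *out)
{
    assert(s != NULL && out != NULL);

    if ((s[0] & 0xF0) != 0xE0 ||
        (s[1] & 0xC0) != 0x80 ||
        (s[2] & 0xC0) != 0x80)
        return -1;

    unsigned long c = ((unsigned long)(s[0] & 0x0F) << 12)
                    | ((unsigned long)(s[1] & 0x3F) << 6)
                    |  (unsigned long)(s[2] & 0x3F);

    if (c < 0x800)          /* overlong: fits in fewer bytes */
        return -1;

    *out = c;
    return 3;
}

/* Hypothetical strict variant: same arithmetic, but additionally
 * rejects the surrogate range U+D800..U+DFFF. */
static int decode3_strict(const unsigned char *s, unsigned long *out)
{
    unsigned long c;

    if (decode3_raw(s, &c) < 0)
        return -1;
    if (c >= 0xD800 && c <= 0xDFFF)
        return -1;

    *out = c;
    return 3;
}
```

The only difference between the two is a single comparison against
0xD800..0xDFFF, so the cost is small. Whether a UTF-8 decoder ought to
carry that knowledge at all is the question being asked.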

Edmund
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
