Markus Kuhn <[EMAIL PROTECTED]>:
> A) Emit a single U+FFFD per malformed sequence
We discussed this before. I can think of several ways of interpreting
the phrase "malformed sequence".
I think you probably mean one of the following: a single octet in the
range 80..BF; a single octet in the range FE..FF; or an octet in the
range C0..FD followed by any number of octets in the range 80..BF,
such that the whole is not correct UTF-8 and is not followed by a
further octet in the range 80..BF.
This is probably quite hard to implement consistently, and, as with
semantics C, the UTF-8/UTF-16 length ratio is unbounded: an
arbitrarily long run of octets can collapse into a single U+FFFD. In
particular, you can't decode from a fixed-size buffer in the manner
of mbrtowc.
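To make that reading concrete, here is a rough sketch of it in Python.
The function name is made up for illustration, and "correct UTF-8" is
taken to mean whatever Python's strict decoder accepts, which is an
assumption: it is stricter than the older definition that allowed
5- and 6-octet sequences.

```python
REPLACEMENT = "\ufffd"


def is_continuation(octet: int) -> bool:
    """True for octets in the range 80..BF."""
    return 0x80 <= octet <= 0xBF


def decode_one_fffd_per_sequence(data: bytes) -> str:
    """Semantics A under the reading above (hypothetical helper).

    Assumption: "correct UTF-8" means "accepted by Python's strict
    UTF-8 decoder".
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        octet = data[i]
        if octet < 0x80:
            # Plain ASCII.
            out.append(chr(octet))
            i += 1
        elif is_continuation(octet) or octet >= 0xFE:
            # A single stray octet in 80..BF or FE..FF.
            out.append(REPLACEMENT)
            i += 1
        else:
            # Lead octet in C0..FD: take the maximal run of 80..BF
            # octets that follows, however long it is.
            j = i + 1
            while j < n and is_continuation(data[j]):
                j += 1
            try:
                out.append(data[i:j].decode("utf-8"))
            except UnicodeDecodeError:
                # The whole run counts as one malformed sequence.
                out.append(REPLACEMENT)
            i = j
    return "".join(out)
```

With this sketch, C3 followed by a thousand 80 octets still gives a
single U+FFFD, and the decoder cannot emit anything for the lead octet
until it has seen the end of the run. That is the unbounded ratio.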
> B) Emit a U+FFFD for every byte in a malformed UTF-8 sequence
This is what I do in Mutt. It's easy to implement and works for any
multibyte encoding; the program doesn't have to know about UTF-8.
But you have to ask yourself: do I reset the mbstate_t when I replace
a bad byte with U+FFFD? If you want consistency, you probably should,
as otherwise the mbstate_t is undefined after mbrtowc fails with
EILSEQ.
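As an illustration only (this is not the Mutt code), semantics B with
a reset after every bad byte might look like the sketch below. It uses
Python's incremental decoders as a stand-in for mbrtowc and mbstate_t;
that analogy and the function name are my assumptions. I have only
reasoned it through for UTF-8; stateful encodings with escape
sequences would need more care.

```python
import codecs


def decode_fffd_per_byte(data: bytes, encoding: str = "utf-8") -> str:
    """Semantics B: one U+FFFD per byte of a malformed sequence.

    Hypothetical helper. The incremental decoder plays the role of
    mbrtowc plus its mbstate_t.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    out = []
    start = 0  # index of the first byte of the character in progress
    i = 0
    while i < len(data):
        try:
            text = decoder.decode(data[i:i + 1])
        except UnicodeDecodeError:
            # Replace only the first byte of the failed character,
            # reset the conversion state, and retry from the next byte.
            out.append("\ufffd")
            decoder.reset()
            start += 1
            i = start
            continue
        i += 1
        if text:
            # A complete character was produced; nothing is pending.
            out.append(text)
            start = i
    # Bytes still pending at end of input are a truncated character.
    out.append("\ufffd" * (len(data) - start))
    return "".join(out)
```

The function itself never looks at byte values, so nothing in it is
specific to UTF-8. Resetting after each failure means the result does
not depend on whatever state the decoder was left in by the error.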
> C) Emit a U+FFFD only for every first malformed sequence in a sequence
> of malformed UTF-8 sequences
I don't think anyone will recommend this.
> D) Emit a malformed UTF-16 sequence for every byte in a malformed
> UTF-8 sequence
Not much good if you're not converting to UTF-16.
So perhaps B should be the generally recommended way.
However, I agree that a UTF-8 editor should be able to remember
malformed UTF-8 sequences so that you can read in a file, edit part of
it and write it out again without it all being rubbished.
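For what it's worth, here is a small sketch of that kind of
round-trip, using Python's "surrogateescape" error handler. This is
not what any editor mentioned here does; it is just one existing
mechanism, similar in spirit to D, that keeps each undecodable byte
as a lone surrogate so that it can be written back unchanged. The
helper names are made up.

```python
def load_text(raw: bytes) -> str:
    """Decode UTF-8, keeping malformed bytes as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def save_text(text: str) -> bytes:
    """Encode back to bytes, restoring the original malformed bytes."""
    return text.encode("utf-8", errors="surrogateescape")


def edit_preserving_bad_bytes(raw: bytes, old: str, new: str) -> bytes:
    """Replace `old` with `new` without touching malformed bytes."""
    return save_text(load_text(raw).replace(old, new))
```

For example, editing b"caf\xc3\xa9 \xff\xe2\x82 end" to change "end"
into "END" leaves the stray FF and the truncated E2 82 exactly as they
were in the output.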
It's unfortunate that the current UTF-8 stuff for Emacs causes
malformed UTF-8 files to be silently trashed.
Edmund
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/