Re: Substituting malformed UTF-8 sequences in a decoder

2000-07-23 Thread Markus Kuhn

Edmund GRIMLEY EVANS wrote on 2000-07-23 21:44 UTC:
> Markus Kuhn <[EMAIL PROTECTED]>:
> 
> > A) Emit a single U+FFFD per malformed sequence
> 
> We discussed this before. I can think of several ways of interpreting
> the phrase "malformed sequence".

ISO 10646-1 section R.7 is quite clear here.

> I think you probably mean either a single octet in the range 80..BF or
> a single octet in the range FE..FF

Yes.

> or an octet in the range C0..FD
> followed by any number of octets in the range 80..BF such that it
> isn't correct UTF-8 and isn't followed by another octet in the range
> 80..BF.

No. A lead octet with n leading 1 bits (n = 2..6, i.e. an octet in the
range C0..FD) announces n-1 continuation octets in the range 80..BF.
If fewer continuation octets follow, the whole truncated sequence is
one single malformed sequence (trailing bytes are obviously missing).
If more follow, the first n octets form a *correct* UTF-8 sequence
(except if it encodes a character for which there would be a shorter
code, or a surrogate), and each additional 80..BF byte is a single
malformed sequence of its own.
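
To make this counting rule concrete, here is a minimal sketch in C (my
own illustration, not taken from the standard; the helper names are
made up):

  #include <stddef.h>

  /* Number of octets announced by a UTF-8 lead byte (RFC 2279
     allows up to 6); 0 for a byte that cannot start a sequence. */
  static int utf8_expected_len(unsigned char c)
  {
      if (c < 0x80) return 1;   /* ASCII */
      if (c < 0xC0) return 0;   /* 80..BF: lone continuation byte */
      if (c < 0xE0) return 2;
      if (c < 0xF0) return 3;
      if (c < 0xF8) return 4;
      if (c < 0xFC) return 5;
      if (c < 0xFE) return 6;
      return 0;                 /* FE, FF: never start a sequence */
  }

  /* Length of the single malformed sequence starting at s[0],
     which is known not to start a correct sequence: the lead byte
     plus the continuation bytes actually present, but never more
     than the lead byte announced. */
  static size_t malformed_seq_len(const unsigned char *s, size_t n)
  {
      int want = utf8_expected_len(s[0]);
      size_t i = 1;
      while (i < n && i < (size_t)want && (s[i] & 0xC0) == 0x80)
          i++;
      return i;
  }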

Note that UTF-8 is self-terminating: you cannot make a valid UTF-8
sequence invalid by appending additional bytes, and your decoder must
respect this. If I read your suggestion correctly, you mean that
appending 80 to a valid UTF-8 sequence would make it invalid, which is
clearly not a sensible semantics for a decoder, as it does not honor
the self-terminating property of UTF-8.

> This is probably quite hard to implement consistently, and, as with
> semantics C, the UTF-8/UTF-16 length ratio is unbounded, which means
> in particular that you can't decode from a fixed-size buffer in the
> manner of mbrtowc.

No, it is not. A malformed sequence can never be longer than the
longest correct sequence, namely 6 bytes, so the length ratio is
bounded.

[Option B)]
> But you have to ask yourself: do I reset the mbstate_t when I replace
> a bad byte by U+FFFD? If you want consistency, you probably should, as
> otherwise the mbstate_t is undefined after mbrtowc gives EILSEQ.

Of course.

> > D) Emit a malformed UTF-16 sequence for every byte in a malformed
> >UTF-8 sequence
>
> Not much good if you're not converting to UTF-16.

No. Note that you do not have to actually convert to UTF-16 to make use
of this technique; the exact same trick also works with UCS-2, UCS-4,
etc.! It is just more instructive to explain it in terms of UTF-16,
because then it becomes very clear why mapping the bytes of malformed
sequences onto U+DC80 .. U+DCFF is a particularly good choice of error
codes: it does not collide with anything that even a UTF-8 ->
UTF-16 decoder could produce normally.

The entire technique itself is completely independent of UTF-16!
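
As a minimal sketch of this mapping (my illustration; the function
names are made up, and note that all bytes of a malformed sequence are
necessarily in the range 80..FF):

  /* Semantics D: map each byte of a malformed sequence onto
     U+DC80..U+DCFF, values that no correct UTF-8 decoder can
     ever produce. */
  unsigned int bad_byte_to_ucs(unsigned char b)
  {
      return 0xDC00 + b;             /* 80..FF -> U+DC80..U+DCFF */
  }

  /* The mapping is reversible, so an encoder can later write the
     original byte back out unchanged: */
  int ucs_to_bad_byte(unsigned int ucs, unsigned char *b)
  {
      if (ucs >= 0xDC80 && ucs <= 0xDCFF) {
          *b = (unsigned char)(ucs & 0xFF);
          return 1;                  /* was a malformed-byte marker */
      }
      return 0;                      /* an ordinary character */
  }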

> So perhaps B should be the generally recommended way.

What's wrong with D)? Think about it again and don't get confused just
because I mentioned UTF-16 ...

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: 




Re: Substituting malformed UTF-8 sequences in a decoder

2000-07-23 Thread Edmund GRIMLEY EVANS

Markus Kuhn <[EMAIL PROTECTED]>:

> A) Emit a single U+FFFD per malformed sequence

We discussed this before. I can think of several ways of interpreting
the phrase "malformed sequence".

I think you probably mean either a single octet in the range 80..BF or
a single octet in the range FE..FF or an octet in the range C0..FD
followed by any number of octets in the range 80..BF such that it
isn't correct UTF-8 and isn't followed by another octet in the range
80..BF.

This is probably quite hard to implement consistently, and, as with
semantics C, the UTF-8/UTF-16 length ratio is unbounded, which means
in particular that you can't decode from a fixed-size buffer in the
manner of mbrtowc.

> B) Emit a U+FFFD for every byte in a malformed UTF-8 sequence

This is what I do in Mutt. It's easy to implement and works for any
multibyte encoding; the program doesn't have to know about UTF-8.

But you have to ask yourself: do I reset the mbstate_t when I replace
a bad byte by U+FFFD? If you want consistency, you probably should, as
otherwise the mbstate_t is undefined after mbrtowc gives EILSEQ.
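
A minimal sketch of such a loop (illustration only; emit() is a
made-up consumer), resetting the mbstate_t after each EILSEQ as
suggested:

  #include <stdlib.h>
  #include <string.h>
  #include <wchar.h>

  extern void emit(wchar_t wc);          /* made-up output function */

  void decode_b(const char *s, size_t n)
  {
      mbstate_t st;
      wchar_t wc;
      size_t r;

      memset(&st, 0, sizeof st);         /* initial conversion state */
      while (n > 0) {
          r = mbrtowc(&wc, s, n, &st);
          if (r == (size_t)-1) {         /* EILSEQ */
              wc = 0xFFFD;               /* one U+FFFD per bad byte */
              memset(&st, 0, sizeof st); /* reset the undefined state */
              r = 1;                     /* resume one byte later */
          } else if (r == (size_t)-2) {
              break;                     /* incomplete sequence at end */
          } else if (r == 0) {
              r = 1;                     /* a null character was decoded */
          }
          emit(wc);
          s += r;
          n -= r;
      }
  }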

> C) Emit a U+FFFD only for every first malformed sequence in a sequence
>of malformed UTF-8 sequences

I don't think anyone will recommend this.

> D) Emit a malformed UTF-16 sequence for every byte in a malformed
>UTF-8 sequence

Not much good if you're not converting to UTF-16.

So perhaps B should be the generally recommended way.

However, I agree that a UTF-8 editor should be able to remember
malformed UTF-8 sequences so that you can read in a file, edit part of
it and write it out again without it all being rubbished.

It's unfortunate that the current UTF-8 stuff for Emacs causes
malformed UTF-8 files to be silently trashed.

Edmund



Re: utf-8 encoding scheme

2000-07-23 Thread Florian Weimer

Larry Wall <[EMAIL PROTECTED]> writes:

> [EMAIL PROTECTED] writes:
> :   "H. Peter Anvin" <[EMAIL PROTECTED]> writes:
> : 
> : > The alternate spelling
> : > 
> : >   11000001 10001011
> : > 
> : > ... is not the character K but INVALID SEQUENCE.  One
> : > possible thing to do in a decoder is to emit U+FFFD REPLACEMENT
> : > CHARACTER on encountering illegal sequences.
> : 
> : Is there any consensus whether to use one or two U+FFFD characters in
> : such situations? For example, what do Perl, Tcl and Java do here?

In the meantime, I've looked at Tcl: Invalid UTF-8 sequences are
treated as characters from ISO-8859-1, i.e. the sequence "c0 80" is
converted to "U+00C0 U+0080".  (Perhaps my test routine is wrong?
This behavior doesn't match the comments in the C source.)

Python is going to follow RFC 2279 strictly.  Invalid UTF-8 sequences
raise an exception or are replaced by U+FFFD characters (how many of
them is still subject to debate, that's why I asked).

Sun's Java documentation doesn't specify what happens if their
UTF-8 decoder is fed with invalid sequences.  It's probably
implementation-dependent.

> At the moment Perl does no input validation on UTF-8.

> That being said, we will certainly be having input disciplines that do
> validation and canonicalization, and I'd imagine that we'll allow the
> user to choose how picky to be.

Thanks for your explanation.  IOW, the Unicode/UTF-8 support in Perl
is still quite rudimentary.

Anyway, why do most UTF-8 decoders ignore the advice in RFC 2279?
Maybe Bruce Schneier is right after all when he claims UTF-8 is
inherently insecure.  Perhaps we would have been better off with
a slightly more complicated format in which there is exactly one
representation for each UCS character which can be encoded (like
UTF-16, for example).



Substituting malformed UTF-8 sequences in a decoder

2000-07-23 Thread Markus Kuhn

Florian Weimer wrote on 2000-07-22 18:23 UTC:
> "H. Peter Anvin" <[EMAIL PROTECTED]> writes:
> > The alternate spelling
> > 
> > 11000001 10001011
> > 
> > ... is not the character K but INVALID SEQUENCE.  One
> > possible thing to do in a decoder is to emit U+FFFD REPLACEMENT
> > CHARACTER on encountering illegal sequences.
> 
> Is there any consensus whether to use one or two U+FFFD characters in
> such situations?

I believe I have finally found the clear RightThingToDo[TM] variant
[see option D) below], but it is none of your options and is probably
not yet widely known.

It might be worth discussing the reasons for the various choices in
more detail before presenting the solution:

A) Emit a single U+FFFD per malformed sequence

This goes essentially back to the interpretation of some holy
scripture, namely the words of the Lord in the book ISO 10646-1:1993(E).
Section R.7 on "Incorrect sequences of octets: Interpretation by
receiving devices" says what a malformed UTF-8 sequence is
(unfortunately it does not yet define overlong sequences as malformed!)
and says that a receiving device "shall interpret that malformed
sequence in the same way that it interprets a character that is outside
the adopted subset that has been identified for the device (see 2.3c)".
Here a malformed sequence is suggested to be treated like a single
unknown character, irrespective of how many bytes there are in the
sequence. Section 2.3c just says the obvious thing:

  "[...] Any corresponding characters that are not within the adopted
   subset shall be indicated to the user in a way which need not allow
   them to be distinguished from each other.

   NOTES

   1  An indication to the user may consist of making available the same
   character to represent all characters not in the adopted subset, or
   providing a distinctive audible or visible signal when appropriate to
   the type of user.

   2  See also annex H for receiving devices with retransmission capability."

References:

  http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html

This is the approach that I have chosen for the UTF-8 decoder in xterm.
Xterm has a definition of what a malformed sequence is that is closely
aligned with ISO 10646-1 section R.7, except that xterm also treats
overlong sequences as malformed, as I hope a future version of the
standard will as well. The UTF-8 decoder in xterm does not keep track of
how many bytes it has already read for a character, so implementing
semantics B) would have required me to add an additional variable to the
decoder data structure. This suggests that semantics A) may also be
slightly simpler to implement if you write your own UTF-8 decoder
(though this is not really clear).
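
The shape of such a decoder might be roughly this (my reconstruction,
not xterm's actual code; overlong and surrogate checks are omitted for
brevity):

  /* Semantics-A decoder state: only the partially assembled
     character and the number of continuation bytes still
     expected are kept. */
  struct utf8_state {
      unsigned long ucs;  /* bits assembled so far */
      int remaining;      /* continuation bytes still expected */
  };

  /* Feed one byte; returns a decoded character, 0xFFFD for a
     malformed sequence, or -1 if more input is needed. */
  long utf8_feed(struct utf8_state *st, unsigned char c)
  {
      if (st->remaining > 0) {
          if ((c & 0xC0) == 0x80) {      /* expected continuation */
              st->ucs = (st->ucs << 6) | (c & 0x3F);
              return --st->remaining ? -1 : (long)st->ucs;
          }
          /* Truncated sequence: one U+FFFD for the whole sequence;
             the caller must then feed c again as a new start. */
          st->remaining = 0;
          return 0xFFFD;
      }
      if (c < 0x80)
          return c;                      /* ASCII */
      if (c < 0xC0 || c >= 0xFE)
          return 0xFFFD;                 /* single-byte malformed seq. */
      if (c < 0xE0)      { st->remaining = 1; st->ucs = c & 0x1F; }
      else if (c < 0xF0) { st->remaining = 2; st->ucs = c & 0x0F; }
      else if (c < 0xF8) { st->remaining = 3; st->ucs = c & 0x07; }
      else if (c < 0xFC) { st->remaining = 4; st->ucs = c & 0x03; }
      else               { st->remaining = 5; st->ucs = c & 0x01; }
      return -1;                         /* need more bytes */
  }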

The final column indicator bar in

  http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

should line up nicely if this test file, which is full of malformed
UTF-8 sequences, is sent to xterm or another monospaced output device
that follows semantics A).

B) Emit a U+FFFD for every byte in a malformed UTF-8 sequence

If you use an existing UTF-8 decoder, for example the one provided by
your C runtime environment in the form of the mbtowc() function (ISO C,
section 7.20.7.2), then the interface to this UTF-8 decoder might not
provide you with any way of finding out where a malformed sequence ends
(especially if it is followed immediately by another malformed
sequence). All you get back from mbtowc() and similar functions is the
information that the start of the byte sequence that you want to have
converted is the start of a malformed sequence. A very simple way of
using this information is to emit a U+FFFD character and then try again
to call mbtowc() one byte later. This results in a U+FFFD being emitted
for every byte in a malformed UTF-8 sequence.
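
A minimal sketch of this resynchronization loop (illustration only;
emit() is a made-up consumer, and note that an incomplete sequence at
the very end of the buffer is also reported as an error here):

  #include <stdlib.h>

  extern void emit(wchar_t wc);       /* made-up output function */

  void decode_with_mbtowc(const char *s, size_t n)
  {
      wchar_t wc;
      int r;

      mbtowc(NULL, NULL, 0);          /* reset the shift state */
      while (n > 0) {
          r = mbtowc(&wc, s, n);
          if (r < 0) {                /* start of a malformed sequence */
              emit(0xFFFD);           /* emit one U+FFFD ... */
              mbtowc(NULL, NULL, 0);  /* reset after the error */
              r = 1;                  /* ... and retry one byte later */
          } else if (r == 0) {
              emit(L'\0');            /* null character */
              r = 1;
          } else {
              emit(wc);
          }
          s += r;
          n -= r;
      }
  }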

I expect that a significant number of applications might delegate their
UTF-8 encoding/decoding to C's multibyte functions (which then will
automatically add support for various legacy multibyte encodings as
well), and will therefore most likely adopt this semantics.

In order to allow semantics A) to be used with the C multibyte
functions, a locale would have to be added that never signals a
malformed sequence by returning -1 in mbtowc(), but that decodes it
properly into U+FFFD instead. If no -1 is returned for a malformed
sequence, the return value will be the length of the malformed sequence,
such that it can be skipped like a regular sequence. Note that malformed
UTF-8 sequences cannot be longer than the longest normal UTF-8 sequence.

Glibc 2.2 does not at the moment provide such a UTF-8 locale with
in-band signalling of malformed sequences. Note also that in such a
locale, a correctly UTF-8 encoded U+FFFD value could not be
distinguished from a malformed sequence.

References:

  http://www.cl.cam.ac.uk/~mgk25/volatile/ISO-C-FDIS.1999-04.txt

C) Emit a U+FFFD only for every first malformed sequence in a sequence
   of malformed UTF-8 sequences

A user of mbtowc() co