Larry Wall <[EMAIL PROTECTED]> writes:

> [EMAIL PROTECTED] writes:
> :   "H. Peter Anvin" <[EMAIL PROTECTED]> writes:
> : 
> : > The alternate spelling
> : > 
> : >   11000001 10001011
> : > 
> : > ... is not the character K <U+004B> but INVALID SEQUENCE.  One
> : > possible thing to do in a decoder is to emit U+FFFD REPLACEMENT
> : > CHARACTER on encountering illegal sequences.
> : 
> : Is there any consensus on whether to use one or two U+FFFD
> : characters in such situations? For example, what do Perl, Tcl
> : and Java do here?
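
For concreteness, here is the arithmetic behind that overlong form,
as a small Python sketch (my own illustration of the two-byte bit
layout, not any particular decoder):

    # Decode a two-byte UTF-8 sequence 110xxxxx 10xxxxxx without any
    # validity checks -- exactly what a naive decoder does.
    def naive_decode_2byte(b1, b2):
        return ((b1 & 0x1F) << 6) | (b2 & 0x3F)

    # The overlong form c1 8b yields the same code point as the plain
    # byte 4b ('K'), which is why the shortest-form rule exists.
    assert naive_decode_2byte(0xC1, 0x8B) == 0x4B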

In the meantime, I've looked at Tcl: invalid UTF-8 sequences are
treated as ISO-8859-1 characters, i.e. the byte sequence "c0 80" is
converted to "U+00C0 U+0080".  (Perhaps my test routine is wrong?
This behavior doesn't match the comments in the C source.)
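
If my reading is right, that fallback amounts to reinterpreting the
offending bytes as ISO-8859-1, where every byte maps directly to
U+0000..U+00FF.  A rough Python sketch of the behavior I observed
(an assumption about what Tcl's handling amounts to, not its actual
C code; a real decoder would presumably fall back per sequence
rather than per string):

    def decode_like_tcl(data):
        # Try strict UTF-8 first; on failure, fall back to
        # ISO-8859-1 so that no input is ever rejected.
        try:
            return data.decode("utf-8")
        except UnicodeDecodeError:
            return data.decode("iso-8859-1")

    # "c0 80" is invalid UTF-8, so it comes back as U+00C0 U+0080.
    assert decode_like_tcl(b"\xc0\x80") == "\u00c0\u0080"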

Python is going to follow RFC 2279 strictly.  Invalid UTF-8 sequences
either raise an exception or are replaced by U+FFFD characters (how
many of them is still subject to debate; that's why I asked).
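
Sketched in Python, the two options look like this (how many U+FFFD
characters the replacement path produces per bad sequence is exactly
the open question):

    data = b"\xc0\x80"          # overlong encoding of U+0000

    # Option 1 -- strict: invalid sequences raise an exception.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        print("rejected:", exc.reason)

    # Option 2 -- replace: invalid bytes become U+FFFD.  (Today's
    # CPython happens to emit one U+FFFD per undecodable byte,
    # i.e. two of them for c0 80.)
    print(data.decode("utf-8", errors="replace"))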

Sun's Java documentation doesn't specify what happens when its
UTF-8 decoder is fed invalid sequences.  It's probably
implementation-dependent.

> At the moment Perl does no input validation on UTF-8.

> That being said, we will certainly be having input disciplines that do
> validation and canonicalization, and I'd imagine that we'll allow the
> user to choose how picky to be.

Thanks for your explanation.  IOW, the Unicode/UTF-8 support in Perl
is still quite rudimentary.

Anyway, why do most UTF-8 decoders ignore the advice in RFC 2279?
Maybe Bruce Schneier is right after all when he claims UTF-8 is
inherently insecure.  Perhaps we would have been better off with
a slightly more complicated format in which there is exactly one
representation for each UCS character that can be encoded (like
UTF-16, for example).
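
For what it's worth, enforcing uniqueness doesn't require a new
format: a decoder only has to reject any sequence that is longer
than the shortest encoding of its code point.  A minimal sketch of
that check in Python (my own illustration, using the RFC 2279
boundaries of up to six bytes):

    def shortest_utf8_length(cp):
        # Length in bytes of the shortest UTF-8 encoding of cp,
        # per RFC 2279 (which still allowed 5- and 6-byte forms).
        if cp < 0x80:      return 1
        if cp < 0x800:     return 2
        if cp < 0x10000:   return 3
        if cp < 0x200000:  return 4
        if cp < 0x4000000: return 5
        return 6

    def is_overlong(seq_len, cp):
        # Overlong = the sequence spends more bytes than needed.
        return seq_len > shortest_utf8_length(cp)

    # c1 8b is two bytes but encodes U+004B, which needs only one.
    assert is_overlong(2, 0x4B)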
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
