Larry Wall <[EMAIL PROTECTED]> writes:
> [EMAIL PROTECTED] writes:
> : "H. Peter Anvin" <[EMAIL PROTECTED]> writes:
> :
> : > The alternate spelling
> : >
> : > 11000001 10001011
> : >
> : > ... is not the character K <U+004B> but INVALID SEQUENCE. One
> : > possible thing to do in a decoder is to emit U+FFFD REPLACEMENT
> : > CHARACTER on encountering illegal sequences.
> :
> : Is there any consensus whether to use one or two U+FFFD characters in
> : such situations? For example, what do Perl, Tcl and Java do here?
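(As a side note on why those bytes "spell" K: a two-byte sequence
carries 5 payload bits in the lead byte and 6 in the continuation
byte, and for c1 8b those bits add up to 0x4B. A quick sanity check
in Python:

    lead, cont = 0xC1, 0x8B
    # 5 payload bits from the lead byte, 6 from the continuation byte
    code_point = ((lead & 0x1F) << 6) | (cont & 0x3F)
    print(hex(code_point), chr(code_point))    # 0x4b K

A correct decoder must still reject the sequence, since 0x4B also has
a shorter, one-byte encoding.)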
In the meantime, I've looked at Tcl: invalid UTF-8 sequences are
treated as ISO-8859-1 characters, e.g. the sequence "c0 80" is
converted to "U+00C0 U+0080". (Perhaps my test routine is wrong?
This behavior doesn't match the comments in the C source.)
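(If my reading is right, the observed mapping can be described with a
custom error handler; this is only a sketch of the behavior I saw,
written in Python, not Tcl's actual code:

    import codecs

    def latin1_fallback(exc):
        # Map each undecodable byte to the ISO-8859-1 character
        # with the same value, and resume after it.
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(b) for b in bad), exc.end

    codecs.register_error("latin1-fallback", latin1_fallback)
    print(ascii(b"\xc0\x80".decode("utf-8", "latin1-fallback")))
    # '\xc0\x80', i.e. U+00C0 U+0080

)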
Python is going to follow RFC 2279 strictly: invalid UTF-8 sequences
either raise an exception or are replaced by U+FFFD characters (how
many of them to emit is still subject to debate, which is why I asked).
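(The two modes side by side, in today's spelling of the API; whether
"replace" should yield one or two U+FFFD for a two-byte garbage
sequence is precisely the open point:

    >>> b"\xc1\x8b".decode("utf-8")
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in
    position 0: invalid start byte
    >>> b"\xc1\x8b".decode("utf-8", "replace")  # one U+FFFD per bad byte
    '\ufffd\ufffd'

)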
Sun's Java documentation doesn't specify what happens when its UTF-8
decoder is fed invalid sequences. It's probably
implementation-dependent.
> At the moment Perl does no input validation on UTF-8.
> That being said, we will certainly be having input disciplines that do
> validation and canonicalization, and I'd imagine that we'll allow the
> user to choose how picky to be.
Thanks for your explanation. IOW, the Unicode/UTF-8 support in Perl
is still quite rudimentary.
Anyway, why do most UTF-8 decoders ignore the advice in RFC 2279?
Maybe Bruce Schneier is right after all when he claims UTF-8 is
inherently insecure. Perhaps we would have been better off with a
slightly more complicated format in which each encodable UCS
character has exactly one representation (like UTF-16, for example).
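The canonical example of the danger: a filter that scans the raw
bytes for "/" never sees the overlong form c0 af, but a decoder that
forgets to reject overlong forms happily turns it back into "/". A
deliberately broken sketch, using the same payload arithmetic as
above:

    def careless_decode_2byte(lead, cont):
        # BUG (on purpose): no check that the value really needed
        # two bytes, so overlong forms are accepted.
        return chr(((lead & 0x1F) << 6) | (cont & 0x3F))

    evil = b"\xc0\xaf"                   # overlong encoding of "/"
    assert b"/" not in evil              # the byte-level filter passes it
    print(careless_decode_2byte(*evil))  # prints "/" -- filter bypassed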
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/