[EMAIL PROTECTED] writes:
:   "H. Peter Anvin" <[EMAIL PROTECTED]> writes:
: 
: > The alternate spelling
: > 
: >     11000001 10001011
: > 
: > ... is not the character K <U+004B> but INVALID SEQUENCE.  One
: > possible thing to do in a decoder is to emit U+FFFD SUBSTITUTION
: > CHARACTER on encountering illegal sequences.
: 
: Is there any consensus on whether to use one or two U+FFFD characters in
: such situations? For example, what do Perl, Tcl and Java do here?
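(For a concrete data point on the one-vs-two question, here is a small
sketch using Python 3's decoder, which follows the "maximal subpart"
convention: each byte of the overlong sequence is treated as a separate
error, so a lenient decode yields two U+FFFDs rather than one.)

```python
# The overlong two-byte sequence from the thread: 11000001 10001011.
# Decoded naively it would spell U+004B 'K', but UTF-8 forbids
# overlong forms, so a conforming decoder must treat it as invalid.
overlong_k = bytes([0b11000001, 0b10001011])

# A strict decoder rejects the sequence outright...
try:
    overlong_k.decode("utf-8")
except UnicodeDecodeError as err:
    print("strict decode failed:", err.reason)

# ...while a lenient decoder substitutes U+FFFD.  Python treats each
# invalid byte as its own error, producing two replacement characters.
replaced = overlong_k.decode("utf-8", errors="replace")
print("lenient decode:", repr(replaced), "->", len(replaced), "U+FFFDs")
```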

At the moment Perl does no input validation on UTF-8.  This is not as
big a problem as you might expect, since in high-security situations
Perl marks any input strings as "tainted", so you can't use them
directly in secure operations anyway.  And when "vetting" such strings
for use in secure operations, we always tell people to check for the
presence of "good" characters, not the absence of "bad" characters.
That's just good policy regardless of the character set.
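A minimal sketch of that allow-list policy (the pattern and function
name here are hypothetical, not anything Perl itself provides): accept a
string only if every character is explicitly permitted, rather than
scanning for characters known to be dangerous.

```python
import re

# Hypothetical allow list: plain filenames only.  Anything outside the
# class -- spaces, slashes, shell metacharacters, stray bytes -- fails.
ALLOWED = re.compile(r"[A-Za-z0-9_.-]+")

def vet(s: str) -> bool:
    # fullmatch requires the ENTIRE string to consist of allowed
    # characters; there is no list of "bad" characters to keep current.
    return ALLOWED.fullmatch(s) is not None

print(vet("report-2024.txt"))  # every character is on the allow list
print(vet("rm -rf /"))         # space and '/' are not allowed
```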

That being said, we will certainly be having input disciplines that do
validation and canonicalization, and I'd imagine that we'll allow the
user to choose how picky to be.  If they don't choose, how picky the
default discipline will be may depend on whether we're running in a
high-security situation (that is, whether taint mode is turned on).
One of the reasons we chose UTF-8 for the internal representation of
strings in Perl was so that we could slurp in a UTF-8 file very
efficiently.

As for whether a strict discipline ought to substitute one or two
U+FFFD characters for the sequence above, that'd probably depend on
whether you thought the author of the data was trying to sneak some
naughty bits in, or just screwed up by embedding Latin-1 in a UTF-8
file.  I expect that the latter is likelier in practice.

On the other hand, good security experts never attribute to stupidity
that which can adequately be explained by malice.

Larry
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/