On Oct 22, 2004, at 20:42, Bjoern Hoehrmann wrote:
No, you misread the bug report, I expect that

  perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rn))"
  perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rnx))"

behave the same in that the malformed sequence \xF6 gets replaced by
U+FFFD as documented in `perldoc Encode` for check = Encode::FB_DEFAULT.
Encode::utf8::decode_xs() fails to do that for the reason outlined in my
bug report so the current result is

"\xF6" ALONE does not mean that the sequence is malformed. Try

  perl -Mencoding=utf8 -le 'print "\x{180000}"' | hexdump -C

Though unicode.org has not assigned any character to U+180000 (yet), "\xF6\x80\x80\x80" is a valid UTF-8 character from perl's point of view. Perl only finds the sequence corrupted when it reaches the following 'r'.

In such cases, WHAT PART OF THE SEQUENCE IS CORRUPTED? \xF6? Or the following 'r'? Or 3 more octets? (FYI, that's what \xF6 suggests from UTF-8's point of view.)
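
To make the octet arithmetic concrete, here is a minimal sketch in Python (not Perl, and not what Encode does internally) of the original UTF-8 bit layout, without the U+10FFFF upper limit. The helper names expected_length and encode_loose are made up for this illustration.

```python
def expected_length(lead: int) -> int:
    """Return how many octets the lead octet announces (0 if it cannot lead)."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: ASCII
    if lead < 0xC0:
        return 0  # 10xxxxxx: continuation octet, cannot start a sequence
    if lead < 0xE0:
        return 2  # 110xxxxx
    if lead < 0xF0:
        return 3  # 1110xxxx
    if lead < 0xF8:
        return 4  # 11110xxx
    return 0      # 0xF8 and above: longer forms, ignored in this sketch


def encode_loose(cp: int) -> bytes:
    """Encode cp with the 4-octet layout, deliberately skipping the U+10FFFF check."""
    assert 0x10000 <= cp <= 0x1FFFFF, "sketch only covers the 4-octet range"
    return bytes([
        0xF0 | (cp >> 18),           # lead octet carries the top 3 bits
        0x80 | ((cp >> 12) & 0x3F),  # each continuation octet carries 6 bits
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(expected_length(0xF6))          # octets announced by \xF6
print(encode_loose(0x180000).hex())   # octets for U+180000 under this layout
```

Under this layout \xF6 announces a 4-octet sequence, and U+180000 comes out as f6 80 80 80. Note that in "Bj\xF6rn" only two octets follow the lead octet, while in "Bj\xF6rnx" three do, so only the second string is long enough to hold the whole announced sequence.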

  Bj
  Bj\x{FFFD}rnx

it should be

  Bj\x{FFFD}rn
  Bj\x{FFFD}rnx

So you can't really say which behavior is "correct".
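
As a rough illustration of why the answer depends on what you decide to blame, here is a small Python sketch (again not Perl, and not Encode's actual code). The helper replace_span is hypothetical; it simply swaps a chosen span of octets for one U+FFFD and assumes everything else is ASCII.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def replace_span(data: bytes, start: int, width: int) -> str:
    """Replace `width` octets at `start` with one U+FFFD; the rest must be ASCII."""
    head = data[:start].decode("ascii")
    tail = data[start + width:].decode("ascii")
    return head + REPLACEMENT + tail


for raw in (b"Bj\xf6rn", b"Bj\xf6rnx"):
    bad = raw.index(b"\xf6")
    lead_only = replace_span(raw, bad, 1)  # blame only the lead octet
    announced = replace_span(raw, bad, 4)  # blame all four announced octets
    print(ascii(lead_only), ascii(announced))
```

Blaming only the lead octet keeps "rn" and "rnx" after the replacement character; blaming the whole announced sequence swallows them in both strings. Either choice is a defensible reading of "malformed sequence".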

I fail to see what this has to do with how Perl treats the string as
from a Perl perspective there is no real difference here, Perl works
as expected, decode() does not.

(I've posted this to RT but it again does not show up there, see
http://lists.w3.org/Archives/Public/www-archive/2004Oct/0044.html).

IMHO the current implementation is correct, since you can't really tell whether the sequence is
corrupted just by looking at a given octet. At the same time, I believe this should be documented somewhere.


Dan the Encode Maintainer


