No, you misread the bug report. I expect that
perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rn))"
perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rnx))"
behave the same in that the malformed sequence \xF6 gets replaced by
U+FFFD as documented in `perldoc Encode` for check = Encode::FB_DEFAULT.
Encode::utf8::decode_xs() fails to do that for the reason outlined in my
bug report so the current result is
"\xF6" ALONE does not mean that the sequence is malformed. Try
perl -Mencoding=utf8 -le 'print "\x{180000}"' | hexdump -C
Though unicode.org does not assign any character to U+180000 (it lies above U+10FFFF, the top of the Unicode range), "\xF6\x80\x80\x80" is a valid UTF-8 character from perl's point of view. Perl only finds the sequence corrupted when it reaches the following 'r'.
In such cases, WHAT PART OF THE SEQUENCE IS CORRUPTED? \xF6? Or the following 'r'? Or 3 more octets? (FYI, that's what \xF6 suggests from UTF-8's point of view.)
Bj Bj\x{FFFD}rnx
it should be
Bj\x{FFFD}rn Bj\x{FFFD}rnx
So you can't really say which behavior is "correct".
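To make the lead-octet argument above concrete, here is a minimal sketch in plain Perl. It does not call Encode at all; `announced_length` and `is_continuation` are hypothetical helpers written only for this illustration, and the sketch assumes the classic 1- to 4-octet bit layout and ignores longer forms.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Encode): number of octets a lead
# octet announces, judged purely from its high bits.
sub announced_length {
    my ($octet) = @_;
    return 1 if $octet < 0x80;    # 0xxxxxxx: ASCII
    return 0 if $octet < 0xC0;    # 10xxxxxx: continuation octet, not a lead
    return 2 if $octet < 0xE0;    # 110xxxxx
    return 3 if $octet < 0xF0;    # 1110xxxx
    return 4 if $octet < 0xF8;    # 11110xxx
    return 0;                     # longer forms are ignored in this sketch
}

# Hypothetical helper: true if the octet has the form 10xxxxxx.
sub is_continuation {
    my ($octet) = @_;
    return ($octet & 0xC0) == 0x80;
}

# The octets of the string from the bug report.
my @octets = unpack 'C*', "Bj\xF6rn";
my ($lead, $next) = @octets[2, 3];

printf "lead 0x%02X announces %d octets\n", $lead, announced_length($lead);
printf "next 0x%02X is %s continuation octet\n",
    $next, is_continuation($next) ? 'a' : 'not a';

# Assemble F6 80 80 80 by hand: 3 payload bits from the lead octet,
# then 6 payload bits from each of the three continuation octets.
my $cp = (0xF6 & 0x07) << 18;
$cp |= (0x80 & 0x3F) << $_ for 12, 6, 0;
printf "F6 80 80 80 -> U+%X\n", $cp;
```

The point of the sketch is only that \xF6 on its own announces three more octets, and the problem becomes visible at 'r' (0x72), which is not a continuation octet. It says nothing about which octets a decoder ought to replace with U+FFFD.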
I fail to see what this has to do with how Perl treats the string. From a Perl perspective there is no real difference here: Perl works as expected, decode() does not.
(I've posted this to RT but it again does not show up there, see http://lists.w3.org/Archives/Public/www-archive/2004Oct/0044.html).
IMHO the current implementation is correct, since you can't really tell whether the sequence is
corrupted just by looking at a given octet. At the same time, I believe this behavior should be documented somewhere.
Dan the Encode Maintainer