Autrijus,

Thanks for the report :) -- Murphy's law strikes :(

On Friday, Sep 26, 2003, at 17:23 Asia/Tokyo, Autrijus Tang wrote:
$ perl -MEncode -e'print Encode::decode_utf8(1, 1)'
Too many arguments for Encode::decode_utf8 at -e line 1, at end of line

$ perldoc Encode |grep decode_utf8
       $string = decode_utf8($octets [, CHECK]);

A tricky bug you have found. Here is what the documentation says:


$string = decode_utf8($octets [, CHECK]);
equivalent to "$string = decode("utf8", $octets [, CHECK])". The
sequence of octets represented by $octets is decoded from UTF-8 into
a sequence of logical characters. Not all sequences of octets form
valid UTF-8 encodings, so it is possible for this call to fail. For
CHECK, see "Handling Malformed Data".

and here is how it is really implemented:


sub decode_utf8($)
{
    my ($str) = @_;
    return undef unless utf8::decode($str);
    return $str;
}

which is RIGHT so long as the prototype of utf8::decode() is '$'

% perl -e 'print utf8::decode()'
Usage: utf8::decode(sv) at -e line 1.
% perl -e 'print utf8::decode(1)'
1
% perl -le 'print utf8::decode(1,1)'
Usage: utf8::decode(sv) at -e line 1.

and utf8::decode is not designed to return status.


% perl -MEncode -e 'print decode_utf8("\xC2\x80")' | hexdump -C
00000000  80                                                |.|
00000001
% perl -MEncode -e 'print decode_utf8("\x80")' | hexdump -C
% perl -MEncode -e 'print decode_utf8("\x7f")' | hexdump -C
00000000  7f                                                |.|
00000001

I consider this more of a feature bug than a documentation bug, but I wonder how I should fix it. Fixing utf8::decode() involves tweaking the core, so it would be nice if it could be fixed on the Encode side. Fortunately, Encode::decode("utf8" => $str) works.


% perl -MEncode -e '$a="\xC2\x80"; print decode("utf8"=>$a, 1)' | hexdump -C
00000000  80                                                |.|
00000001
% perl -MEncode -e '$a="\x80"; print decode("utf8"=>$a, 1)' | hexdump -C
utf8 "\x80" does not map to Unicode at /usr/local/lib/perl5/5.8.0/i386-freebsd/Encode.pm line 164.
% perl -MEncode -e '$a="\x7f"; print decode("utf8"=>$a, 1)' | hexdump -C
00000000  7f                                                |.|
00000001

So we can implement decode_utf8() as follows:


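sub decode_utf8($;$)
{
    my ($str, $check) = @_;
    if ($check){
        # Pass the arguments explicitly: if decode() is declared with a
        # ($$;$) prototype, a bare @_ would be taken in scalar context.
        return decode("utf8", $str, $check);
    }else{
        return undef unless utf8::decode($str);
        return $str;
    }
}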
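
To make the two code paths concrete, here is a minimal sketch that runs a stand-alone copy of the proposed sub against the three inputs used above. The name proposed_decode_utf8 is made up for this illustration so that it does not clash with Encode's own decode_utf8(), and Encode::FB_CROAK is assumed as the CHECK value for the strict path. It is only an illustration of the idea, not the code that will ship.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-alone copy of the proposal (the name is an assumption).
sub proposed_decode_utf8($;$) {
    my ($str, $check) = @_;
    if ($check) {
        # Strict path: Encode::decode() handles malformed input per CHECK.
        return Encode::decode("utf8", $str, $check);
    }
    # Lenient path: return undef on malformed input instead of dying.
    return undef unless utf8::decode($str);
    return $str;
}

for my $octets ("\xC2\x80", "\x80", "\x7f") {
    my $lenient = proposed_decode_utf8($octets);
    # eval traps the exception that FB_CROAK raises on malformed input.
    my $strict  = eval { proposed_decode_utf8($octets, Encode::FB_CROAK()) };
    printf "%-6s lenient=%-5s strict=%s\n",
        unpack("H*", $octets),
        defined $lenient ? "ok" : "undef",
        defined $strict  ? "ok" : "died";
}
```

Without CHECK the caller has to test the return value for undef; with CHECK the failure surfaces as an exception, which matches what decode("utf8" => $a, 1) does in the transcript above.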

Dan the Encode Maintainer


