[Encode] 2.07 Released
Porters,

On Oct 22, 2004, at 15:31, Dan Kogai wrote:
> I just updated Encode to version 2.06.

Less than 24 hours later I have resorted to releasing version 2.07. What the heck; 5.8.6 is coming soon.

=head1 Availability

  http://www.dan.co.jp/~dankogai/cpan/Encode-2.07.tar.gz
  or CPAN near you

=head1 Changes

$Revision: 2.7 $ $Date: 2004/10/22 19:35:52 $
! lib/Encode/Encoding.pm
  "Remove Carp from warnings.pm" that influences Encode, by Tels.
  Message-Id: <[EMAIL PROTECTED]>
! Encode.xs AUTHORS t/fallback.t
  Now Encode::utf8's fallbacks are compliant to Encode standard.
  Thank Bjoern Hoehrmann for persistently convincing me.
  Message-Id: <[EMAIL PROTECTED]>
! Encode.pm
  POD further revised.

=head1 Signature

Dan the Encode Maintainer
Re: [cpan #8089] Encode::utf8::decode_xs does not check partial chars
On Oct 23, 2004, at 01:04, Bjoern Hoehrmann wrote:
> C12a in Unicode 4.0.1 notes
>
>   [...]
>   For example, in UTF-8 every code unit of the form 110xxxxx must be
>   followed by a code unit of the form 10xxxxxx. A sequence such as
>   110xxxxx 0xxxxxxx is ill-formed and must never be generated. When
>   faced with this ill-formed code unit sequence while transforming or
>   interpreting text, a conformant process must treat the first code
>   unit 110xxxxx as an illegally terminated code unit sequence--for
>   example, by signaling an error, filtering the code unit out, or
>   representing the code unit with a marker such as U+FFFD
>   [...]

[snip]

Okay, you win. You have convinced me that Encode::utf8 should behave the same as Encode::XS (UCM-based encodings). And the patch to make it behave that way is deceptively simple, as follows:

===
RCS file: Encode.xs,v
retrieving revision 2.0
diff -u -r2.0 Encode.xs
--- Encode.xs 2004/05/16 20:55:15 2.0
+++ Encode.xs 2004/10/22 18:00:29
@@ -297,7 +297,7 @@
             U8 skip = UTF8SKIP(s);
             if ((s + skip) > e) {
                 /* Partial character - done */
-                break;
+                goto decode_utf8_fallback;
             }
             else if (is_utf8_char(s)) {
                 /* Whole char is good */
@@ -313,6 +313,7 @@
             /* Invalid start byte */
         }
         /* If we get here there is something wrong with alleged UTF-8 */
+decode_utf8_fallback:
         if (check & ENCODE_DIE_ON_ERR){
             Perl_croak(aTHX_ ERR_DECODE_NOMAP, "utf8", (UV)*s);
             XSRETURN(0);
===

The most decisive comment of yours is this:

> holds true and I expect that
>
>   my $x = "Bj\xF6rn"; # as well as "Bj\xF6r" and "Bj\xF6"
>   decode("utf-8", $x, Encode::FB_CROAK);
>
> croaks.

Which apparently it did not.

Thank you for being so persistent on this problem. I'd be honored to add your name to the AUTHORS file for this. I will $Encode::VERSION++ as soon as I am done w/ the test suites and Tels' patch. This time I will be careful not to screw up (maint|blead)perl, so give me some time before the update is ready (but I won't keep you waiting for too long since the 5.8.6 deadline is soon).
> Your statement about \xF6\x80\x80\x80 is interesting, Encode::is_utf8
> is documented as
>
>   [...]
>   is_utf8(STRING [, CHECK])
>     [INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING.
>     If CHECK is true, also checks the data in STRING for being
>     well-formed UTF-8. Returns true if successful, false otherwise.
>   [...]
>
> And D36 in Unicode 4.0.1 is very clear that
>
>   [...]
>   As a consequence of the well-formedness conditions specified in
>   Table 3-6, the following byte values are disallowed in UTF-8:
>   C0–C1, F5–FF.
>   [...]

That's because perl's notion of Unicode is broader than that of unicode.org. So far unicode.org's mapping only spans from U+0000 to U+10FFFF, while perl's extends all the way up to MAX_UINT. See Camel 3 for details.

And I think we can leave this :)

Dan the Encode Maintainer
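
The effect of the patch quoted in this message, where a partial character now jumps to the fallback code instead of ending the loop, can be illustrated with a small standalone C sketch. This is not the Encode.xs code: `lead_len` and `decode_sketch` are hypothetical names, `lead_len` is a simplified stand-in for UTF8SKIP capped at 4 bytes, and resuming one byte after a malformed lead byte is an assumption chosen to match the `Bj\x{FFFD}rn` result asked for in the bug report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT 0xFFFDu  /* U+FFFD */

/* Sequence length implied by the lead byte alone. Simplified,
   hypothetical stand-in for perl's UTF8SKIP, capped at 4 bytes. */
size_t lead_len(unsigned char b)
{
    if (b < 0xC0 || b >= 0xF8)
        return 1;  /* ASCII, stray continuation, or unhandled lead byte */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    return 4;
}

/* Decode src[0..n) into out[], which must have room for n code points.
   strict_partial == 0 models the old "break" on a partial character;
   strict_partial != 0 models the patched jump to the fallback code.
   Returns the number of code points written. */
size_t decode_sketch(const unsigned char *src, size_t n,
                     uint32_t *out, int strict_partial)
{
    size_t i = 0, o = 0;

    assert(out != NULL && (src != NULL || n == 0));
    while (i < n) {
        size_t len = lead_len(src[i]), k;
        uint32_t cp;
        int ok = 1;

        if (i + len > n) {
            if (!strict_partial)
                break;          /* old: stop silently, drop the rest */
            ok = 0;             /* patched: treat as malformed */
        }
        if (len == 1 && src[i] >= 0x80)
            ok = 0;             /* this byte cannot start a character here */

        /* Payload bits of the lead byte: 7, 5, 4 or 3 of them. */
        cp = (len == 1) ? src[i] : (uint32_t)(src[i] & (0xFF >> (len + 1)));
        for (k = 1; ok && k < len; k++) {
            if ((src[i + k] & 0xC0) != 0x80)
                ok = 0;         /* not a continuation byte */
            else
                cp = (cp << 6) | (uint32_t)(src[i + k] & 0x3F);
        }

        if (ok) {
            out[o++] = cp;
            i += len;
        } else {
            out[o++] = REPLACEMENT; /* FB_DEFAULT-style substitution */
            i += 1;                 /* assumption: resume at the next byte */
        }
    }
    return o;
}
```

With `strict_partial` set to 0 the input `Bj\xF6rn` yields two code points (`Bj`); with 1 it yields five, the third being U+FFFD. `\xF6\x80\x80\x80` decodes to 0x180000 either way, which is the looser perl-side reading discussed in this thread.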
Re: [cpan #8089] Encode::utf8::decode_xs does not check partial chars
* Dan Kogai wrote:
>> perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rn))"
>> perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rnx))"
>
>Though unicode.org does not assign any character on U+180000 (yet),
>"\xF6\x80\x80\x80" is a valid UTF-8 character from perl's point of
>view. Perl only finds it corrupted when it reaches the following 'r'.
>
>In such cases, WHAT PART OF THE SEQUENCE IS CORRUPTED? \xF6 ? or the
>following 'r' ? or 3 more octets? (FYI that's what \xF6 suggests from
>UTF-8's point of view).

C12a in Unicode 4.0.1 notes

  [...]
  For example, in UTF-8 every code unit of the form 110xxxxx must be
  followed by a code unit of the form 10xxxxxx. A sequence such as
  110xxxxx 0xxxxxxx is ill-formed and must never be generated. When
  faced with this ill-formed code unit sequence while transforming or
  interpreting text, a conformant process must treat the first code
  unit 110xxxxx as an illegally terminated code unit sequence--for
  example, by signaling an error, filtering the code unit out, or
  representing the code unit with a marker such as U+FFFD
  [...]

IOW, the \xF6. According to `perldoc Encode`

  [...]
  *CHECK* = Encode::FB_DEFAULT ( == 0)
    If *CHECK* is 0, (en|de)code will put a *substitution character* in
    place of a malformed character. For UCM-based encodings, <subchar>
    will be used. For Unicode, the code point 0xFFFD is used. If the
    data is supposed to be UTF-8, an optional lexical warning (category
    utf8) is given.
  [...]

the module chooses the replacement character approach and I thus expect that none of

  decode("utf-8", "Bj") eq decode("utf-8", "Bj\xF6rn")
  decode("utf-8", "Bj") eq decode("utf-8", "Bj\xF6r")
  decode("utf-8", "Bj") eq decode("utf-8", "Bj\xF6")

holds true and I expect that

  my $x = "Bj\xF6rn"; # as well as "Bj\xF6r" and "Bj\xF6"
  decode("utf-8", $x, Encode::FB_CROAK);

croaks. The partial decoding approach is useful, but only if check is set to something where the remaining octets are made available to the script, and not for check == 0. Why would anyone want it to behave differently?
Your statement about \xF6\x80\x80\x80 is interesting, Encode::is_utf8 is documented as

  [...]
  is_utf8(STRING [, CHECK])
    [INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING.
    If CHECK is true, also checks the data in STRING for being
    well-formed UTF-8. Returns true if successful, false otherwise.
  [...]

And D36 in Unicode 4.0.1 is very clear that

  [...]
  As a consequence of the well-formedness conditions specified in
  Table 3-6, the following byte values are disallowed in UTF-8:
  C0–C1, F5–FF.
  [...]

I would thus never expect that

  Encode::is_utf8(decode(utf8 => qq(\xF6\x80\x80\x80)), 1)

returns true or that

  my $x = qq(\xF6\x80\x80\x80);
  decode(utf8 => $x, Encode::FB_CROAK);

does not croak. The byte string here is *not* well-formed UTF-8! I do not really understand why one would expect something different. If this is really intentional and kept unchanged, there should at least be highly visible warnings in the documentation on when malformed input is ignored silently (and/or where "UTF-8" does not mean UTF-8 as defined in Unicode or RFC 3629). Clearly, if "well-formed UTF-8" means something different in Perl and outside Perl, people necessarily get confused...

>>[perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rn))"]
>>[perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rnx))"]
>
>IMHO I believe the current implementation is correct since you can't
>really tell if the sequence is corrupted just by looking at a given octet.

Well, there is no need to look at just a single octet here; nothing stops the routine from checking the octets following 0xF6, so I would say there needs to be a better reason to consider this behavior correct. I do not think the implementation matches the documentation or what one would expect from the Unicode standard.
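
The strict reading argued for in this message, where only the byte sequences listed in the Unicode well-formedness table count as UTF-8, can be sketched in C as follows. This is a minimal illustration rather than code from Encode or perl; `utf8_seq_len` and `utf8_is_wellformed` are hypothetical names, and the ranges are the standard ones (lead bytes C2..F4, with the second byte narrowed after E0, ED, F0 and F4).

```c
#include <assert.h>
#include <stddef.h>

/* Length of the well-formed UTF-8 sequence starting at s[0], looking at
   no more than n bytes, or 0 if the bytes are ill-formed or truncated. */
size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of the 2nd byte */
    size_t len, i;

    if (n == 0) return 0;
    if (s[0] <= 0x7F) return 1;

    if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        len = 2;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        len = 3;
        if (s[0] == 0xE0) lo = 0xA0;     /* no overlong 3-byte forms */
        if (s[0] == 0xED) hi = 0x9F;     /* no surrogates */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        len = 4;
        if (s[0] == 0xF0) lo = 0x90;     /* no overlong 4-byte forms */
        if (s[0] == 0xF4) hi = 0x8F;     /* nothing above U+10FFFF */
    } else {
        return 0;                        /* 80..BF, C0..C1, F5..FF */
    }

    if (n < len) return 0;               /* truncated sequence */
    if (s[1] < lo || s[1] > hi) return 0;
    for (i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF) return 0;
    return len;
}

/* Returns 1 if s[0..n) consists only of well-formed sequences, else 0. */
int utf8_is_wellformed(const unsigned char *s, size_t n)
{
    size_t i = 0;

    assert(s != NULL || n == 0);
    while (i < n) {
        size_t len = utf8_seq_len(s + i, n - i);
        if (len == 0) return 0;
        i += len;
    }
    return 1;
}
```

Under this check `\xF6\x80\x80\x80` is rejected at its first byte, and so is `Bj\xF6rn`, without needing to look past the 0xF6.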
Re: [cpan #8089] Encode::utf8::decode_xs does not check partial chars
On Oct 22, 2004, at 20:42, Bjoern Hoehrmann wrote:
> No, you misread the bug report, I expect that
>
>   perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rn))"
>   perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rnx))"
>
> behave the same in that the malformed sequence \xF6 gets replaced by
> U+FFFD as documented in `perldoc Encode` for check =
> Encode::FB_DEFAULT. Encode::utf8::decode_xs() fails to do that for the
> reason outlined in my bug report so the current result is

"\xF6" ALONE does not mean that the sequence is malformed. Try

  perl -Mencoding=utf8 -le 'print "\x{180000}"' | hexdump -C

Though unicode.org does not assign any character on U+180000 (yet), "\xF6\x80\x80\x80" is a valid UTF-8 character from perl's point of view. Perl only finds it corrupted when it reaches the following 'r'.

In such cases, WHAT PART OF THE SEQUENCE IS CORRUPTED? \xF6 ? or the following 'r' ? or 3 more octets? (FYI that's what \xF6 suggests from UTF-8's point of view).

>   Bj
>   Bj\x{FFFD}rnx
>
> it should be
>
>   Bj\x{FFFD}rn
>   Bj\x{FFFD}rnx

So you can't really say which behavior is "correct".

> I fail to see what this has to do with how Perl treats the string as
> from a Perl perspective there is no real difference here, Perl works
> as expected, decode() does not. (I've posted this to RT but it again
> does not show up there, see
> http://lists.w3.org/Archives/Public/www-archive/2004Oct/0044.html).

IMHO I believe the current implementation is correct since you can't really tell if the sequence is corrupted just by looking at a given octet. At the same time I believe this should be documented somehow somewhere.

Dan the Encode Maintainer