Re: [cpan #8089] Encode::utf8::decode_xs does not check partial chars

Nick Ing-Simmons Sun, 24 Oct 2004 11:01:23 -0700

Dan Kogai <[EMAIL PROTECTED]> writes:
>On Oct 23, 2004, at 01:04, Bjoern Hoehrmann wrote:
>> C12a in Unicode 4.0.1 notes
>>
>> [...]
>>   For example, in UTF-8 every code unit of the form 110xxxxx must be
>>   followed by a code unit of the form 10xxxxxx. A sequence such as
>>   110xxxxx 0xxxxxxx is ill-formed and must never be generated. When
>>   faced with this ill-formed code unit sequence while transforming or
>>   interpreting text, a conformant process must treat the first code unit
>>   110xxxxx as an illegally terminated code unit sequence--for example,
>>   by signaling an error, filtering the code unit out, or representing
>>   the code unit with a marker such as U+FFFD
>> [...]
>> [snip]
>
>Okay, you win.  You have convinced me that Encode::utf8 should behave 
>the same as Encode::XS (UCM-based encodings).  And the patch to make 
>it that way is deceptively simple, as follows:


I think "\xF6r" is indeed wrong.

But as Dan said at the start, \xF6 on its own (say, as octet 1023 
of a 1024-octet buffer numbered 0..1023) is not a failure.
Changing that will cause problems for the :encoding() layer, as buffer 
boundaries can fall in the middle of characters.
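
To illustrate the distinction, here is a minimal sketch in C. It is not
taken from Encode.xs; the names seq_status, seq_len and classify are made
up for this example. It only checks structure (lead byte followed by the
right number of 10xxxxxx bytes) and assumes sequences of at most four
bytes; it does not check overlong forms, surrogates or code point range.

```c
#include <stddef.h>

/* Hypothetical names for illustration; not part of Encode.xs. */
typedef enum { SEQ_COMPLETE, SEQ_PARTIAL, SEQ_ILLFORMED } seq_status;

/* Expected sequence length implied by the lead byte, or 0 if the byte
 * cannot start a sequence (assumption: lengths 1..4 only). */
static size_t seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;  /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* continuation or other byte */
}

/* Classify the sequence starting at s, where e points one past the last
 * octet available in the buffer. */
static seq_status classify(const unsigned char *s, const unsigned char *e)
{
    size_t len, i;

    if (s >= e)
        return SEQ_PARTIAL;               /* nothing to look at yet */
    len = seq_len(*s);
    if (len == 0)
        return SEQ_ILLFORMED;             /* invalid start byte */
    for (i = 1; i < len; i++) {
        if (s + i >= e)
            return SEQ_PARTIAL;           /* buffer ended mid-character */
        if ((s[i] & 0xC0) != 0x80)
            return SEQ_ILLFORMED;         /* present, but not 10xxxxxx */
    }
    return SEQ_COMPLETE;
}
```

With this, "\xF6" alone at the end of a buffer comes out as SEQ_PARTIAL
(more octets may still arrive), while "\xF6r" comes out as SEQ_ILLFORMED,
because the octet after the lead byte is present and is not a
continuation byte.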


>
>===================================================================
>RCS file: Encode.xs,v
>retrieving revision 2.0
>diff -u -r2.0 Encode.xs
>--- Encode.xs   2004/05/16 20:55:15     2.0
>+++ Encode.xs   2004/10/22 18:00:29
>@@ -297,7 +297,7 @@
>             U8 skip = UTF8SKIP(s);
>             if ((s + skip) > e) {
>                 /* Partial character - done */
>-               break;
>+               goto decode_utf8_fallback;
>             }
>             else if (is_utf8_char(s)) {
>                 /* Whole char is good */
>@@ -313,6 +313,7 @@
>             /* Invalid start byte */
>         }
>         /* If we get here there is something wrong with alleged UTF-8 */
>+    decode_utf8_fallback:
>         if (check & ENCODE_DIE_ON_ERR){
>             Perl_croak(aTHX_ ERR_DECODE_NOMAP, "utf8", (UV)*s);
>             XSRETURN(0);
>
>===================================================================
>
>The most decisive comment of yours is this:
>
>> holds true and I expect that
>>
>>   my $x = "Bj\xF6rn"; # as well as "Bj\xF6r" and "Bj\xF6"
>>   decode("utf-8", $x, Encode::FB_CROAK);
>>
>> croaks.
>
>Which apparently did not.  Thank you for being so persistent on this 
>problem.  I'd be honored to add your name to the AUTHORS file for this.
>
>I will $Encode::VERSION++ as soon as I am done w/ the test suites and 
>Tel's patch.  This time I will be careful not to screw up 
>(maint|blead)perl so give me some time before the update is ready (but 
>I won't keep you waiting for too long since 5.8.6 deadline is soon).
>
>> Your statement about \xF6\x80\x80\x80 is interesting, Encode::is_utf8
>> is documented as
>>
>> [...]
>>   is_utf8(STRING [, CHECK])
>>     [INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING.
>>     If CHECK is true, also checks the data in STRING for being
>>     well-formed UTF-8. Returns true if successful, false otherwise.
>> [...]
>>
>> And D36 in Unicode 4.0.1 is very clear that
>>
>> [...]
>>   As a consequence of the well-formedness conditions specified in Table
>>   3-6, the following byte values are disallowed in UTF-8: C0–C1, F5–FF.
>> [...]
>
>That's because perl's notion of Unicode is broader than that of 
>unicode.org.  So far Unicode.org's mapping only spans from U+0000 to 
>U+10FFFF, while that of perl is U+ffffFFFF or even U+ffffFFFFffffFFFF 
>(in other words, MAX_UINT).  See Camel 3 for details.
>
>And I think we can leave this :)
>
>Dan the Encode Maintainer
