Dan Kogai <[EMAIL PROTECTED]> writes:
>On Oct 23, 2004, at 01:04, Bjoern Hoehrmann wrote:
>> C12a in Unicode 4.0.1 notes
>>
>> [...]
>> For example, in UTF-8 every code unit of the form 110xxxxx must be
>> followed by a code unit of the form 10xxxxxx. A sequence such as
>> 110xxxxx 0xxxxxxx is ill-formed and must never be generated. When
>> faced with this ill-formed code unit sequence while transforming or
>> interpreting text, a conformant process must treat the first code
>> unit 110xxxxx as an illegally terminated code unit sequence--for
>> example, by signaling an error, filtering the code unit out, or
>> representing the code unit with a marker such as U+FFFD
>> [...]
>> [snip]
>
>Okay, you win.  You have convinced me that Encode::utf8 should behave
>the same as Encode::XS (UCM-based encodings).  And the patch to make it
>behave that way is deceptively simple, as follows:

I think "\xF6r" is indeed wrong. But as Dan said at the start, \xF6 on
its own (say, as octet 1023 of a 1024-octet buffer indexed 0..1023) is
not a failure. Changing that will cause problems for the :encoding()
layer, because buffer boundaries can fall in the middle of characters.

>
>===================================================================
>RCS file: Encode.xs,v
>retrieving revision 2.0
>diff -u -r2.0 Encode.xs
>--- Encode.xs 2004/05/16 20:55:15 2.0
>+++ Encode.xs 2004/10/22 18:00:29
>@@ -297,7 +297,7 @@
>         U8 skip = UTF8SKIP(s);
>         if ((s + skip) > e) {
>             /* Partial character - done */
>-            break;
>+            goto decode_utf8_fallback;
>         }
>         else if (is_utf8_char(s)) {
>             /* Whole char is good */
>@@ -313,6 +313,7 @@
>             /* Invalid start byte */
>         }
>         /* If we get here there is something wrong with alleged UTF-8 */
>+    decode_utf8_fallback:
>         if (check & ENCODE_DIE_ON_ERR){
>             Perl_croak(aTHX_ ERR_DECODE_NOMAP, "utf8", (UV)*s);
>             XSRETURN(0);
>
>===================================================================
>
>The most decisive comment of yours is this:
>
>> holds true and I expect that
>>
>> my $x = "Bj\xF6rn"; # as well as "Bj\xF6r" and "Bj\xF6"
>> decode("utf-8", $x, Encode::FB_CROAK);
>>
>> croaks.
>
>Which apparently it did not.  Thank you for being so persistent on this
>problem.  I'd be honored to add your name to the AUTHORS file for this.
>
>I will $Encode::VERSION++ as soon as I am done w/ the test suites and
>Tel's patch.  This time I will be careful not to screw up
>(maint|blead)perl, so give me some time before the update is ready (but
>I won't keep you waiting for too long since the 5.8.6 deadline is soon).
>
>> Your statement about \xF6\x80\x80\x80 is interesting, Encode::is_utf8
>> is documented as
>>
>> [...]
>> is_utf8(STRING [, CHECK])
>> [INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING.
>> If CHECK is true, also checks the data in STRING for being
>> well-formed UTF-8. Returns true if successful, false otherwise.
>> [...]
>>
>> And D36 in Unicode 4.0.1 is very clear that
>>
>> [...]
>> As a consequence of the well-formedness conditions specified in Table
>> 3-6, the following byte values are disallowed in UTF-8: C0–C1, F5–FF.
>> [...]
>
>That's because perl's notion of Unicode is broader than that of
>unicode.org.  So far Unicode.org's mapping only spans from U+0000 to
>U+1fFFFF, while that of perl is U+ffffFFFF or even U+ffffFFFFffffFFFF
>(in other words, MAX_UINT).  See Camel 3 for details.
>
>And I think we can leave this :)
>
>Dan the Encode Maintainer
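
To make the buffer-boundary point above concrete, here is a minimal sketch in C of the three-way distinction I have in mind. It is not the Encode.xs code: `seq_status`, `seq_len` and `classify` are names invented for this illustration, and the lead-byte table is a simplified assumption (lax, perl-style acceptance of lead bytes up to 0xF7, no overlong or range checks). The idea is that a sequence cut off by the end of the buffer is only "partial" if every byte we can already see is a proper continuation byte; otherwise it is already known to be bad.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch. */
typedef enum { SEQ_COMPLETE, SEQ_PARTIAL, SEQ_INVALID } seq_status;

/* Expected sequence length for a lead byte, or 0 if the byte cannot
 * start a sequence.  Assumption: lead bytes 0xF0..0xF7 all mean a
 * 4-byte sequence (strict Unicode would also reject 0xC0-0xC1 and
 * 0xF5-0xF7); longer forms are simply rejected in this sketch. */
static size_t seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;   /* ASCII */
    if (lead < 0xC0) return 0;   /* stray continuation byte */
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    if (lead < 0xF8) return 4;
    return 0;
}

/* Classify the sequence starting at s in the buffer [s, e). */
static seq_status classify(const unsigned char *s, const unsigned char *e)
{
    size_t need, have, i;

    assert(s < e);               /* caller must pass at least one byte */

    need = seq_len(*s);
    if (need == 0)
        return SEQ_INVALID;

    have = (size_t)(e - s);

    /* Check every trailing byte that is actually present, even when
     * the sequence runs past the end of the buffer. */
    for (i = 1; i < need && i < have; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return SEQ_INVALID;
    }

    /* All visible bytes are fine; the only question left is whether
     * the buffer ended before the sequence did. */
    return have < need ? SEQ_PARTIAL : SEQ_COMPLETE;
}
```

With this, the bytes F6 followed by 'r' come out as SEQ_INVALID even though fewer than four bytes remain, while a lone F6 as the last octet of a buffer comes out as SEQ_PARTIAL, so an :encoding()-style layer could wait for the next buffer instead of failing.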