* Dan Kogai wrote:
>> perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rn))"
>> perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rnx))"
>Though unicode.org does not assign any character on U+180000 (yet),
>"\xF6\x80\x80\x80" is a valid UTF-8 character from perl's point of
>view. Perl only finds it corrupted when it reaches the following 'r'.
>
>In such cases, WHAT PART OF THE SEQUENCE IS CORRUPTED? \xF6 ? or the
>following 'r' ? or 3 more octets? (FYI that's what \F6 suggests from
>UTF-8's point of view).

C12a in Unicode 4.0.1 notes

  [...]
  For example, in UTF-8 every code unit of the form 110xxxxx must be
  followed by a code unit of the form 10xxxxxx. A sequence such as
  110xxxxx 0xxxxxxx is ill-formed and must never be generated. When
  faced with this ill-formed code unit sequence while transforming or
  interpreting text, a conformant process must treat the first code
  unit 110xxxxx as an illegally terminated code unit sequence--for
  example, by signaling an error, filtering the code unit out, or
  representing the code unit with a marker such as U+FFFD
  [...]

IOW, the \xF6. According to `perldoc Encode`

  [...]
  *CHECK* = Encode::FB_DEFAULT ( == 0)
    If *CHECK* is 0, (en|de)code will put a *substitution character*
    in place of a malformed character. For UCM-based encodings,
    <subchar> will be used. For Unicode, the code point 0xFFFD is
    used. If the data is supposed to be UTF-8, an optional lexical
    warning (category utf8) is given.
  [...]

the module chooses the replacement character approach and I thus expect
that none of

  decode("utf-8", "Bj") eq decode("utf-8", "Bj\xF6rn")
  decode("utf-8", "Bj") eq decode("utf-8", "Bj\xF6r")
  decode("utf-8", "Bj") eq decode("utf-8", "Bj\xF6")

holds true and I expect that

  my $x = "Bj\xF6rn"; # as well as "Bj\xF6r" and "Bj\xF6"
  decode("utf-8", $x, Encode::FB_CROAK);

croaks. The partial decoding approach is useful, but only if check is
set to something where the remaining octets are made available to the
script, and not for check == 0. Why would anyone want it to behave
differently?

Your statement about \xF6\x80\x80\x80 is interesting. Encode::is_utf8 is
documented as

  [...]
  is_utf8(STRING [, CHECK])
    [INTERNAL] Tests whether the UTF-8 flag is turned on in the
    STRING. If CHECK is true, also checks the data in STRING for
    being well-formed UTF-8. Returns true if successful, false
    otherwise.
  [...]

And D36 in Unicode 4.0.1 is very clear that

  [...]
  As a consequence of the well-formedness conditions specified in
  Table 3-6, the following byte values are disallowed in UTF-8:
  C0–C1, F5–FF.
  [...]

I would thus never expect that

  Encode::is_utf8(decode(utf8 => qq(\xF6\x80\x80\x80)), 1)

returns true or that

  my $x = qq(\xF6\x80\x80\x80);
  decode(utf8 => $x, Encode::FB_CROAK);

does not croak. The byte string here is *not* well-formed UTF-8! I do
not really understand why one would expect something different. If this
is really intentional and kept unchanged, there should at least be
highly visible warnings in the documentation on when malformed input is
ignored silently (and/or where "UTF-8" does not mean UTF-8 as defined in
Unicode or RFC 3629). Clearly, if "well-formed UTF-8" means something
different in Perl and outside Perl, people necessarily get confused...

>>[perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rn))"]
>>[perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rnx))"]
>
>IMHO I believe the current implementation is correct since you can't
>really tell if the sequence is corrupted just by looking at a given octet.

Well, there is no need to look at just a single octet here; nothing
stops the routine from checking the octets following 0xF6, so I would
say there needs to be a better reason to consider this behavior correct.
I do not think the implementation matches the documentation or what one
would expect from the Unicode standard.
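
To make the Table 3-6 argument concrete, here is a minimal pure-Perl
sketch of a well-formedness check that reports which octet is at fault.
It is only an illustration and not how Encode is implemented; the names
$WELL_FORMED_CHAR and first_ill_formed_offset are made up for this
example, and the byte ranges are my transcription of Table 3-6.

```perl
use strict;
use warnings;

# One well-formed UTF-8 byte sequence, transcribed from Table 3-6.
# Lead bytes C0-C1 and F5-FF appear in no row, so they can never match.
my $WELL_FORMED_CHAR = qr/
      [\x00-\x7F]                          # U+0000..U+007F
    | [\xC2-\xDF] [\x80-\xBF]              # U+0080..U+07FF
    | \xE0        [\xA0-\xBF] [\x80-\xBF]  # U+0800..U+0FFF
    | [\xE1-\xEC] [\x80-\xBF]{2}           # U+1000..U+CFFF
    | \xED        [\x80-\x9F] [\x80-\xBF]  # U+D000..U+D7FF (no surrogates)
    | [\xEE-\xEF] [\x80-\xBF]{2}           # U+E000..U+FFFF
    | \xF0        [\x90-\xBF] [\x80-\xBF]{2}   # U+10000..U+3FFFF
    | [\xF1-\xF3] [\x80-\xBF]{3}           # U+40000..U+FFFFF
    | \xF4        [\x80-\x8F] [\x80-\xBF]{2}   # U+100000..U+10FFFF
/x;

# Hypothetical helper: returns the offset of the first octet that does
# not start a well-formed sequence, or -1 if the byte string is
# well-formed throughout.
sub first_ill_formed_offset {
    my ($octets) = @_;
    pos($octets) = 0;
    # Consume one well-formed sequence at a time; the c modifier keeps
    # pos() at the point of failure instead of resetting it.
    1 while $octets =~ /\G$WELL_FORMED_CHAR/gc;
    my $pos = pos($octets);
    return $pos == length($octets) ? -1 : $pos;
}

print first_ill_formed_offset("Bj\xF6rn"), "\n";          # 2
print first_ill_formed_offset("\xF6\x80\x80\x80"), "\n";  # 0
print first_ill_formed_offset("Bj\xC3\xB6rn"), "\n";      # -1
```

In both malformed cases the reported offset is that of the \xF6 itself,
which is the octet C12a says to treat as the illegally terminated
sequence; no look-ahead to the 'r' is needed to decide that.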