[Encode] 2.07 Released

2004-10-22 Thread Dan Kogai
Porters
On Oct 22, 2004, at 15:31, Dan Kogai wrote:
> I just updated Encode to version 2.06.
Less than 24 hours later I had to release version 2.07.  What the
heck.  5.8.6 is due soon.

=head1 Availability
http://www.dan.co.jp/~dankogai/cpan/Encode-2.07.tar.gz
or CPAN near you
=head1 Changes
$Revision: 2.7 $ $Date: 2004/10/22 19:35:52 $
! lib/Encode/Encoding.pm
  "Remove Carp from warnings.pm" that influences Encode, by Tels.
  Message-Id: <[EMAIL PROTECTED]>
! Encode.xs AUTHORS t/fallback.t
  Now Encode::utf8's fallbacks are compliant to Encode standard.
  Thank Bjoern Hoehrmann for persistently convincing me.
  Message-Id: <[EMAIL PROTECTED]>
! Encode.pm
  POD further revised.
=head1 Signature
Dan the Encode Maintainer


Re: [cpan #8089] Encode::utf8::decode_xs does not check partial chars

2004-10-22 Thread Dan Kogai
On Oct 23, 2004, at 01:04, Bjoern Hoehrmann wrote:
> C12a in Unicode 4.0.1 notes
> [...]
>   For example, in UTF-8 every code unit of the form 110xxxxx must be
>   followed by a code unit of the form 10xxxxxx. A sequence such as
>   110xxxxx 0xxxxxxx is ill-formed and must never be generated. When
>   faced with this ill-formed code unit sequence while transforming or
>   interpreting text, a conformant process must treat the first code unit
>   110xxxxx as an illegally terminated code unit sequence--for example,
>   by signaling an error, filtering the code unit out, or representing
>   the code unit with a marker such as U+FFFD
> [...]
[snip]
Okay, you win.  You have convinced me that Encode::utf8 should behave
the same as Encode::XS (UCM-based encodings).  And the patch to make it
behave that way is deceptively simple, as follows:

===
RCS file: Encode.xs,v
retrieving revision 2.0
diff -u -r2.0 Encode.xs
--- Encode.xs   2004/05/16 20:55:15 2.0
+++ Encode.xs   2004/10/22 18:00:29
@@ -297,7 +297,7 @@
U8 skip = UTF8SKIP(s);
if ((s + skip) > e) {
/* Partial character - done */
-   break;
+   goto decode_utf8_fallback;
}
else if (is_utf8_char(s)) {
/* Whole char is good */
@@ -313,6 +313,7 @@
/* Invalid start byte */
}
/* If we get here there is something wrong with alleged UTF-8 */
+decode_utf8_fallback:
if (check & ENCODE_DIE_ON_ERR){
Perl_croak(aTHX_ ERR_DECODE_NOMAP, "utf8", (UV)*s);
XSRETURN(0);
===
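
The control flow changed by the patch can be modelled outside of XS. The following is only a sketch in Python, not the Encode.xs code: the names `utf8_skip`, `is_whole_char`, `decode_utf8`, `patched` and `die_on_err` are hypothetical, the lead-byte table is simplified, and the whole-character test uses Python's strict UTF-8 decoder rather than perl's `is_utf8_char`. With `patched=False` a partial character at the end of the buffer stops decoding silently (the old `break`); with `patched=True` it goes through the same fallback as any other malformed byte.

```python
# Hypothetical model of the decode loop around the patch. Function and
# parameter names are illustrative, not the identifiers in Encode.xs.

REPLACEMENT = "\ufffd"


def utf8_skip(lead):
    """Sequence length implied by the lead byte alone.

    Simplification: every lead byte from 0xF0 up is treated as a 4-byte
    lead, which is enough to model the 0xF6 case discussed in the thread.
    """
    if lead >= 0xF0:
        return 4
    if lead >= 0xE0:
        return 3
    if lead >= 0xC0:
        return 2
    return 1


def is_whole_char(chunk):
    """True if chunk is well-formed UTF-8 according to Python's strict codec."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def decode_utf8(data, patched=True, die_on_err=False):
    out = []
    i = 0
    while i < len(data):
        skip = utf8_skip(data[i])
        if i + skip > len(data):
            # Partial character at the end of the buffer.
            if not patched:
                break  # old behaviour: stop without reporting anything
            # patched behaviour: fall through to the fallback below
        elif is_whole_char(data[i:i + skip]):
            out.append(data[i:i + skip].decode("utf-8"))
            i += skip
            continue
        # Fallback: something is wrong with the alleged UTF-8.
        if die_on_err:
            raise ValueError("malformed byte 0x%02X at offset %d" % (data[i], i))
        out.append(REPLACEMENT)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    for raw in (b"Bj\xf6rn", b"Bj\xf6rnx"):
        print(ascii(decode_utf8(raw, patched=False)), ascii(decode_utf8(raw)))
```

In this model the unpatched loop drops everything from \xF6 onwards for "Bj\xF6rn" but not for "Bj\xF6rnx", which is the inconsistency reported in the ticket; the patched loop treats both inputs the same way.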
The most decisive comment of yours is this:
> holds true and I expect that
>   my $x = "Bj\xF6rn"; # as well as "Bj\xF6r" and "Bj\xF6"
>   decode("utf-8", $x, Encode::FB_CROAK);
> croaks.
Which apparently it did not.  Thank you for being so persistent on this
problem.  I'd be honored to add your name to the AUTHORS file for this.

I will $Encode::VERSION++ as soon as I am done w/ the test suites and
Tels's patch.  This time I will be careful not to screw up
(maint|blead)perl so give me some time before the update is ready (but
I won't keep you waiting for too long since the 5.8.6 deadline is soon).

> Your statement about \xF6\x80\x80\x80 is interesting, Encode::is_utf8 is
> documented as
>
> [...]
>   is_utf8(STRING [, CHECK])
> [INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING.
> If CHECK is true, also checks the data in STRING for being
> well-formed UTF-8. Returns true if successful, false otherwise.
> [...]
> And D36 in Unicode 4.0.1 is very clear that
> [...]
>   As a consequence of the well-formedness conditions specified in Table
>   3-6, the following byte values are disallowed in UTF-8: C0–C1, F5–FF.
> [...]
That's because perl's notion of Unicode is broader than that of
unicode.org.  So far unicode.org's code space only spans from U+0000 to
U+10FFFF, while that of perl goes up to U+FFFFFFFF or even
U+FFFFFFFFFFFFFFFF (in other words, MAX_UINT).  See Camel 3 for details.

And I think we can leave this :)
Dan the Encode Maintainer


Re: [cpan #8089] Encode::utf8::decode_xs does not check partial chars

2004-10-22 Thread Bjoern Hoehrmann
* Dan Kogai wrote:
>>   perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rn))"
>>   perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rnx))"

>Though unicode.org does not assign any character to U+180000 (yet),
>"\xF6\x80\x80\x80" is a valid UTF-8 character from perl's point of
>view.  Perl only finds it corrupted when it reaches the following 'r'.
>
>In such cases, WHAT PART OF THE SEQUENCE IS CORRUPTED? \xF6 ? or the
>following 'r' ? or 3 more octets? (FYI that's what \xF6 suggests from
>UTF-8's point of view).

C12a in Unicode 4.0.1 notes

[...]
  For example, in UTF-8 every code unit of the form 110xxxxx must be
  followed by a code unit of the form 10xxxxxx. A sequence such as
  110xxxxx 0xxxxxxx is ill-formed and must never be generated. When
  faced with this ill-formed code unit sequence while transforming or
  interpreting text, a conformant process must treat the first code unit
  110xxxxx as an illegally terminated code unit sequence--for example,
  by signaling an error, filtering the code unit out, or representing
  the code unit with a marker such as U+FFFD
[...]

IOW, the \xF6. According to `perldoc Encode`

[...]
  *CHECK* = Encode::FB_DEFAULT ( == 0)
If *CHECK* is 0, (en|de)code will put a *substitution character* in
place of a malformed character. For UCM-based encodings, <subchar>
will be used. For Unicode, the code point 0xFFFD is used. If the
data is supposed to be UTF-8, an optional lexical warning (category
utf8) is given.
[...]

the module chooses the replacement character approach and I thus expect
that none of

  decode("utf-8", "Bj") eq decode("utf-8", "Bj\xF6rn")
  decode("utf-8", "Bj") eq decode("utf-8", "Bj\xF6r")
  decode("utf-8", "Bj") eq decode("utf-8", "Bj\xF6")

holds true and I expect that

  my $x = "Bj\xF6rn"; # as well as "Bj\xF6r" and "Bj\xF6"
  decode("utf-8", $x, Encode::FB_CROAK);

croaks. The partial decoding approach is useful but only if check is set
to something where the remaining octets are made available to the script
and not for check == 0. Why would anyone want it to behave differently?
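
For comparison only, the expectations above can be checked against a decoder that follows the Unicode well-formedness rules. The sketch below uses Python's built-in UTF-8 codec, not Encode; its `errors="replace"` and `errors="strict"` handlers are used here as rough analogues of Encode::FB_DEFAULT and Encode::FB_CROAK, and the helper names `decode_replace` and `decode_strict_fails` are made up for this example.

```python
# Python's codec as a stand-in for a standard-conforming decoder.
# errors="replace" ~ substitute U+FFFD; errors="strict" ~ raise an error.

samples = [b"Bj\xf6rn", b"Bj\xf6r", b"Bj\xf6"]


def decode_replace(raw):
    """Decode, substituting U+FFFD for malformed input."""
    return raw.decode("utf-8", errors="replace")


def decode_strict_fails(raw):
    """True if strict decoding rejects the input."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    for raw in samples:
        print(ascii(raw), "->", ascii(decode_replace(raw)),
              "| strict decoding fails:", decode_strict_fails(raw))
```

None of the three samples decodes to plain "Bj" in replace mode, and all three are rejected in strict mode, which is the behaviour asked for here.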

Your statement about \xF6\x80\x80\x80 is interesting, Encode::is_utf8 is
documented as

[...]
  is_utf8(STRING [, CHECK])
[INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING.
If CHECK is true, also checks the data in STRING for being
well-formed UTF-8. Returns true if successful, false otherwise.
[...]

And D36 in Unicode 4.0.1 is very clear that

[...]
  As a consequence of the well-formedness conditions specified in Table
  3-6, the following byte values are disallowed in UTF-8: C0–C1, F5–FF.
[...]

I would thus never expect that

  Encode::is_utf8(decode(utf8 => qq(\xF6\x80\x80\x80)), 1)

returns true or that

  my $x = qq(\xF6\x80\x80\x80);
  decode(utf8 => $x, Encode::FB_CROAK);

does not croak. The byte string here is *not* well-formed UTF-8! I do
not really understand why one would expect something different.
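
The D36 byte restriction quoted above is easy to test mechanically. The following is a small sketch in Python rather than Perl; `disallowed_offsets` and `is_well_formed` are hypothetical helper names, and `is_well_formed` simply delegates to Python's strict UTF-8 codec as a stand-in for a decoder that follows the Unicode definition.

```python
def disallowed_offsets(data):
    """Offsets of bytes that can never appear in UTF-8: C0-C1 and F5-FF."""
    return [i for i, b in enumerate(data) if b in (0xC0, 0xC1) or b >= 0xF5]


def is_well_formed(data):
    """True if Python's strict UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for raw in (b"\xf6\x80\x80\x80", b"Bj\xc3\xb6rn"):
        print(ascii(raw), disallowed_offsets(raw), is_well_formed(raw))
```

Under these definitions the first byte of \xF6\x80\x80\x80 is already disallowed, so the sequence can be rejected without looking at what follows it.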

If this is really intentional and kept unchanged, there should at least
be highly visible warnings in the documentation on when malformed input
is ignored silently (and/or where "UTF-8" does not mean UTF-8 as defined
in Unicode or RFC 3629). Clearly, if "well-formed UTF-8" means something
different in Perl and outside Perl people necessarily get confused...

>>[perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rn))"]
>>[perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rnx))"]
>
>IMHO I believe the current implementation is correct since you can't
>really tell if the sequence is corrupted just by looking at a given octet.

Well, there is no need to look at just a single octet here, nothing
stops the routine from checking the octets following 0xF6, so I would
say there needs to be a better reason to consider this behavior correct.
I do not think the implementation matches the documentation or what one
would expect from the Unicode standard.


Re: [cpan #8089] Encode::utf8::decode_xs does not check partial chars

2004-10-22 Thread Dan Kogai
On Oct 22, 2004, at 20:42, Bjoern Hoehrmann wrote:
> No, you misread the bug report, I expect that
>   perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rn))"
>   perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rnx))"
> behave the same in that the malformed sequence \xF6 gets replaced by
> U+FFFD as documented in `perldoc Encode` for check = Encode::FB_DEFAULT.
> Encode::utf8::decode_xs() fails to do that for the reason outlined in my
> bug report so the current result is
"\xF6" ALONE does not mean that the sequence is malformed.  Try
  perl -Mencoding=utf8 -le 'print "\x{180000}"' | hexdump -C
Though unicode.org does not assign any character to U+180000 (yet),
"\xF6\x80\x80\x80" is a valid UTF-8 character from perl's point of
view.  Perl only finds it corrupted when it reaches the following 'r'.

In such cases, WHAT PART OF THE SEQUENCE IS CORRUPTED? \xF6 ? or the
following 'r' ? or 3 more octets? (FYI that's what \xF6 suggests from
UTF-8's point of view).
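
The relationship between \xF6\x80\x80\x80 and a code point beyond the Unicode range can be worked out by applying the 4-byte bit pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx without the standard's upper limit. The sketch below is in Python, and `encode_4byte` is a hypothetical helper written for this illustration; it is not how perl or Encode implement their encoder.

```python
UNICODE_MAX = 0x10FFFF  # highest code point defined by the Unicode standard


def encode_4byte(cp):
    """Apply the 4-byte UTF-8 bit pattern with no Unicode range check."""
    if not 0x10000 <= cp <= 0x1FFFFF:
        raise ValueError("code point does not fit the 4-byte pattern")
    return bytes([
        0xF0 | (cp >> 18),            # 11110xxx: top 3 bits
        0x80 | ((cp >> 12) & 0x3F),   # 10xxxxxx: next 6 bits
        0x80 | ((cp >> 6) & 0x3F),    # 10xxxxxx: next 6 bits
        0x80 | (cp & 0x3F),           # 10xxxxxx: low 6 bits
    ])


if __name__ == "__main__":
    for cp in (UNICODE_MAX, 0x180000):
        print("U+%06X" % cp, encode_4byte(cp).hex(), cp <= UNICODE_MAX)
```

The pattern maps 0x180000 to F6 80 80 80, and 0x180000 is above U+10FFFF, so whether the sequence counts as valid depends on whether the decoder enforces the Unicode range.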

>   Bj
>   Bj\x{FFFD}rnx
> it should be
>   Bj\x{FFFD}rn
>   Bj\x{FFFD}rnx
>> So you can't really say which behavior is "correct".
> I fail to see what this has to do with how Perl treats the string as
> from a Perl perspective there is no real difference here, Perl works
> as expected, decode() does not.
> (I've posted this to RT but it again does not show up there, see
> http://lists.w3.org/Archives/Public/www-archive/2004Oct/0044.html).
IMHO I believe the current implementation is correct since you can't
really tell if the sequence is corrupted just by looking at a given
octet.  At the same time I believe this should be documented somehow
somewhere.

Dan the Encode Maintainer