Re: real UTF-8 vs. utf8n_to_uvuni()

2004-12-09 Thread Gisle Aas
Dan Kogai [EMAIL PROTECTED] writes:

> --- perl-5.8.x/utf8.c   Wed Nov 17 23:11:04 2004
> +++ perl-5.8.x.dan/utf8.c   Sun Dec  5 11:38:52 2004
> @@ -429,6 +429,13 @@
>         }
>         else
>             uv = UTF8_ACCUMULATE(uv, *s);
> +       /* Checks if ord() > 0x10FFFF -- dankogai */
> +       if (uv > PERL_UNICODE_MAX){
> +           if (!(flags & UTF8_ALLOW_LONG)) {
> +               warning = UTF8_WARN_LONG;
> +               goto malformed;
> +           }
> +       }
>         if (!(uv > ouv)) {
>             /* These cannot be allowed. */
>             if (uv == ouv) {

I think this patch is wrong since UTF8_ALLOW_LONG is about allowing
overlong sequences.  What we need is a UTF8_ALLOW_SUPER flag (matching
UNICODE_ALLOW_SUPER) that would indicate that code points past 0x10FFFF
should be allowed.  This would be the flag that UTF8_ALLOW_ANYUV
should contain instead of UTF8_ALLOW_LONG.
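
To make the distinction concrete, here is a minimal sketch in C of a decoder that reports "overlong" and "above 0x10FFFF" as two separate conditions. It is not the code in utf8.c; every `demo_` / `DEMO_` name is hypothetical, it handles only 1 to 4 byte sequences, and it deliberately ignores surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_UNICODE_MAX 0x10FFFFUL /* hypothetical stand-in for PERL_UNICODE_MAX */

typedef enum { DEMO_OK, DEMO_MALFORMED, DEMO_OVERLONG, DEMO_SUPER } demo_status;

/* Fewest bytes that can encode cp (for values that fit in 4 bytes). */
static size_t demo_min_len(uint32_t cp)
{
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Decode one sequence of 1..4 bytes and classify the result. */
demo_status demo_classify(const unsigned char *s, size_t len, uint32_t *out)
{
    size_t need, i;
    uint32_t cp;

    assert(s != NULL && out != NULL);
    if (len == 0) return DEMO_MALFORMED;

    if (s[0] < 0x80)                { need = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; cp = s[0] & 0x07; }
    else return DEMO_MALFORMED;

    if (len < need) return DEMO_MALFORMED;
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return DEMO_MALFORMED;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *out = cp;

    /* Overlong: more bytes were used than the value needs. */
    if (need > demo_min_len(cp)) return DEMO_OVERLONG;
    /* Super: minimally encoded, but past the Unicode range. */
    if (cp > DEMO_UNICODE_MAX)   return DEMO_SUPER;
    return DEMO_OK;
}
```

With this sketch, the bytes C0 80 classify as overlong, while F4 90 80 80 (the value 0x110000) classifies as super, which is why one flag should not cover both.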

Unfortunately there is no more room for UTF8_ALLOW_* flags in the
UTF8_ALLOW_ANY space so we would have to add some bits to this mask,
which gives us binary incompatibility with extensions that use the old
UTF8_ALLOW_ANY value.
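
A rough sketch of the incompatibility, in C. The flag names and bit values below are made up for illustration and are not copied from utf8.h; the only assumption carried over from the text is that the old "any" mask is already full, so a new flag has to live outside it.

```c
#include <assert.h>

enum {
    DEMO_ALLOW_LONG    = 0x0080, /* hypothetical: last free bit of the old mask */
    DEMO_ALLOW_ANY_OLD = 0x00FF, /* hypothetical: mask baked into old extensions */
    DEMO_ALLOW_SUPER   = 0x0100, /* hypothetical: the new flag needs a new bit */
    DEMO_ALLOW_ANY_NEW = 0x01FF  /* hypothetical: widened mask */
};

/* Nonzero if the caller's flags permit code points past 0x10FFFF. */
int demo_super_allowed(unsigned flags)
{
    return (flags & DEMO_ALLOW_SUPER) != 0;
}

/* An extension compiled against the old header passes the old mask. */
void demo_self_check(void)
{
    assert(!demo_super_allowed(DEMO_ALLOW_ANY_OLD)); /* silently stricter */
    assert(demo_super_allowed(DEMO_ALLOW_ANY_NEW));
}
```

An already-compiled extension that passes the old mask would no longer mean "allow anything" once the core starts checking the new bit, which is the binary incompatibility in question.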

The UTF8_ALLOW_FFFF flag should also allow 0x1FFFF, 0x2FFFF as well as the
0xFFFE variants.  This matches the UNICODE_ALLOW_FFFF behaviour.
Currently it only allows 0xFFFF.

The UTF8_ALLOW_FDD0 flag to match UNICODE_ALLOW_FDD0 is also missing,
but instead of introducing UTF8_ALLOW_FDD0 it seems better to collapse
the *_ALLOW_FFFF and *_ALLOW_FDD0 flags into a single *_ALLOW_ILLEGAL
and then make UNICODE_IS_ILLEGAL() match this.
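
As a sketch of what such a combined check might test, here is a single C predicate covering both groups of Unicode noncharacters: the last two code points of every plane (nFFFE and nFFFF) and the range U+FDD0..U+FDEF. The function name is hypothetical and this is not the definition of UNICODE_IS_ILLEGAL().

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if cp is a Unicode noncharacter. */
int demo_is_illegal(uint32_t cp)
{
    /* 0xFFFE/0xFFFF, 0x1FFFE/0x1FFFF, ..., 0x10FFFE/0x10FFFF */
    if ((cp & 0xFFFE) == 0xFFFE)
        return 1;
    /* The contiguous block U+FDD0..U+FDEF */
    return cp >= 0xFDD0 && cp <= 0xFDEF;
}

void demo_self_check(void)
{
    assert(demo_is_illegal(0xFFFF));
    assert(demo_is_illegal(0x1FFFF));
    assert(demo_is_illegal(0x2FFFE));
    assert(demo_is_illegal(0xFDD0));
    assert(!demo_is_illegal(0xFFFD));
}
```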

--Gisle


Re: real UTF-8 vs. utf8n_to_uvuni()

2004-12-06 Thread Gisle Aas
Dan Kogai [EMAIL PROTECTED] writes:

> Since Gisle's patch makes use of utf8n_to_uvuni(), it seems to be a
> problem of perl core.  So I have checked utf8.c which defines that.
> Seems like it does not make use of PERL_UNICODE_MAX.
>
> The patch against utf8.c fixes that.

Seems like a good idea to have a workaround in Encode for this as
well.

Index: users/gisle/hacks/Encode/Encode.xs
--- Encode/Encode.xs.~1~   Mon Dec  6 10:44:31 2004
+++ Encode/Encode.xs   Mon Dec  6 10:44:31 2004
@@ -300,6 +300,10 @@
                        UTF8_CHECK_ONLY | (strict ? UTF8_ALLOW_STRICT :
                                                    UTF8_ALLOW_NONSTRICT)
                        );
+#if 1 /* perl-5.8.6 and older do not check UTF8_ALLOW_LONG */
+               if (strict && uv > PERL_UNICODE_MAX)
+                   ulen = -1;
+#endif
                if (ulen == -1) {
                    if (strict) {
                        uv = utf8n_to_uvuni(s, e - s, &ulen,
End of Patch.


> --- perl-5.8.x/utf8.c   Wed Nov 17 23:11:04 2004
> +++ perl-5.8.x.dan/utf8.c   Sun Dec  5 11:38:52 2004
> @@ -429,6 +429,13 @@
>         }
>         else
>             uv = UTF8_ACCUMULATE(uv, *s);
> +       /* Checks if ord() > 0x10FFFF -- dankogai */
> +       if (uv > PERL_UNICODE_MAX){
> +           if (!(flags & UTF8_ALLOW_LONG)) {
> +               warning = UTF8_WARN_LONG;
> +               goto malformed;
> +           }
> +       }
>         if (!(uv > ouv)) {
>             /* These cannot be allowed. */
>             if (uv == ouv) {


Re: real UTF-8 vs. utf8n_to_uvuni()

2004-12-06 Thread Tim Bunce
On Sun, Dec 05, 2004 at 11:58:54AM +0900, Dan Kogai wrote:
> % perl -Mblib -MEncode -le '$a="\x{ffff}"; print encode("UTF-8", $a, 1)'
> "\x{ffff}" does not map to utf8 at [...]

Shouldn't that (and similar messages) say "... does not map to UTF-8"?

Tim.


Re: real UTF-8 vs. utf8n_to_uvuni()

2004-12-06 Thread Gisle Aas
Tim Bunce [EMAIL PROTECTED] writes:

> On Sun, Dec 05, 2004 at 11:58:54AM +0900, Dan Kogai wrote:
> > % perl -Mblib -MEncode -le '$a="\x{ffff}"; print encode("UTF-8", $a, 1)'
> > "\x{ffff}" does not map to utf8 at [...]
>
> Shouldn't that (and similar messages) say "... does not map to UTF-8"?

Yes, I think so.  Should be an easy tweak.


real UTF-8 vs. utf8n_to_uvuni()

2004-12-04 Thread Dan Kogai
On Dec 05, 2004, at 10:56, Dan Kogai wrote:
> Thanks, applied in my repository.  New tests and documentation fix in
> progress.  When I am done w/ that, I will release Encode-2.0901 on my
> web (not CPAN yet).  When cross-checks by porters are done I will
> release Encode-2.10.
>
> Dan the Encode Maintainer

Now I am writing test suites and found that some of the strictures are
missing.

Surrogate -- OK
% perl -Mblib -MEncode -le '$a="\x{d801}"; print encode("UTF-8", $a, 1)'
"\x{d801}" does not map to utf8 at /gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

U+FFFF -- OK
% perl -Mblib -MEncode -le '$a="\x{ffff}"; print encode("UTF-8", $a, 1)'
"\x{ffff}" does not map to utf8 at /gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

Chars above U+10FFFF -- NOT OK
% perl -Mblib -MEncode -le '$a="\x{110000}"; print encode("UTF-8", $a, 1)'


Since Gisle's patch makes use of utf8n_to_uvuni(), it seems to be a
problem of perl core.  So I have checked utf8.c which defines that.
Seems like it does not make use of PERL_UNICODE_MAX.

The patch against utf8.c fixes that.
% ~/danperl/bin/perl5.8.6 -Mblib -MEncode -le '$a="\x{110000}"; print encode("UTF-8", $a, 1)'
"\x{00f4}" does not map to utf8 at /gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

As you can see, the warning is still funny.  In fact, the warning is
funny for any case that triggers UTF8_WARN_LONG, as follows:

% perl -Mblib -MEncode -le '$a="\x{7fff_ffff}"; print encode("UTF-8", $a, 1)'
??
% perl -Mblib -MEncode -le '$a="\x{8000_0000}"; print encode("UTF-8", $a, 1)'
"\x{00fe}" does not map to utf8 at /gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

I have tracked this down and found that the warning is generated by
Encode, so Gisle and I can fix that.
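
One plausible reading of the odd values in those warnings, offered as an assumption rather than something established in this thread, is that the message reports the first byte of the encoded sequence instead of the code point. The C sketch below computes that lead byte; `demo_lead_byte` is a hypothetical helper, and the 5, 6 and 7 byte forms (lead bytes F8, FC and FE) are assumed to follow perl's extended UTF-8 layout.

```c
#include <assert.h>
#include <stdint.h>

/* First byte of the (extended) UTF-8 encoding of cp; assumes cp < 2^36. */
unsigned demo_lead_byte(uint64_t cp)
{
    if (cp < 0x80)           return (unsigned)cp;
    if (cp < 0x800)          return 0xC0 | (unsigned)(cp >> 6);
    if (cp < 0x10000)        return 0xE0 | (unsigned)(cp >> 12);
    if (cp < 0x200000)       return 0xF0 | (unsigned)(cp >> 18);
    if (cp < 0x4000000)      return 0xF8 | (unsigned)(cp >> 24);
    if (cp < 0x80000000ULL)  return 0xFC | (unsigned)(cp >> 30);
    return 0xFE; /* 7-byte form: the lead byte carries no payload bits */
}

void demo_self_check(void)
{
    assert(demo_lead_byte(0x110000) == 0xF4);      /* matches "\x{00f4}" */
    assert(demo_lead_byte(0x80000000ULL) == 0xFE); /* matches "\x{00fe}" */
}
```

Both values line up with the warnings shown above, which is consistent with the message being built from a single byte of the malformed input.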

Dan the Encode Maintainer
--- perl-5.8.x/utf8.c   Wed Nov 17 23:11:04 2004
+++ perl-5.8.x.dan/utf8.c   Sun Dec  5 11:38:52 2004
@@ -429,6 +429,13 @@
        }
        else
            uv = UTF8_ACCUMULATE(uv, *s);
+       /* Checks if ord() > 0x10FFFF -- dankogai */
+       if (uv > PERL_UNICODE_MAX){
+           if (!(flags & UTF8_ALLOW_LONG)) {
+               warning = UTF8_WARN_LONG;
+               goto malformed;
+           }
+       }
        if (!(uv > ouv)) {
            /* These cannot be allowed. */
            if (uv == ouv) {