Re: real UTF-8 vs. utf8n_to_uvuni()
Dan Kogai [EMAIL PROTECTED] writes:

> --- perl-5.8.x/utf8.c   Wed Nov 17 23:11:04 2004
> +++ perl-5.8.x.dan/utf8.c   Sun Dec  5 11:38:52 2004
> @@ -429,6 +429,13 @@
>             }
>             else
>                 uv = UTF8_ACCUMULATE(uv, *s);
> +           /* Checks if ord() > 0x10FFFF -- dankogai */
> +           if (uv > PERL_UNICODE_MAX){
> +               if (!(flags & UTF8_ALLOW_LONG)) {
> +                   warning = UTF8_WARN_LONG;
> +                   goto malformed;
> +               }
> +           }
>             if (!(uv > ouv)) {
>                 /* These cannot be allowed. */
>                 if (uv == ouv) {

I think this patch is wrong, since UTF8_ALLOW_LONG is about allowing
overlong sequences. What we need is a UTF8_ALLOW_SUPER flag (matching
UNICODE_ALLOW_SUPER) that would indicate that code points past 0x10FFFF
should be allowed. This would be the flag that UTF8_ALLOW_ANYUV should
contain instead of UTF8_ALLOW_LONG.

Unfortunately there is no more room for UTF8_ALLOW_* flags in the
UTF8_ALLOW_ANY space, so we would have to add some bits to this mask,
which gives us binary incompatibility with extensions that use the old
UTF8_ALLOW_ANY value.

UTF8_ALLOW_FFFF should also allow 0x1FFFF, 0x2FFFF and so on, as well as
the 0xFFFE variants. This matches the UNICODE_ALLOW_FFFF behaviour.
Currently it only allows 0xFFFF.

The UTF8_ALLOW_FDD0 flag to match UNICODE_ALLOW_FDD0 is also missing,
but instead of introducing UTF8_ALLOW_FDD0 it seems better to collapse
the *_ALLOW_FFFF and *_ALLOW_FDD0 flags into a single *_ALLOW_ILLEGAL
and then make UNICODE_IS_ILLEGAL() match this.

--Gisle
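To make the distinction in this message concrete: an overlong sequence is a property of the byte encoding, whereas surrogates, noncharacters (U+FDD0..U+FDEF plus the xxFFFE and xxFFFF code points of every plane) and values above 0x10FFFF are properties of the decoded code point. The following is a minimal sketch of that second kind of check. The names `cp_class` and `classify_code_point` are hypothetical and are not part of perl's API; only the numeric ranges come from Unicode.

```c
#include <stdint.h>

#define UNICODE_MAX 0x10FFFFu   /* same value as PERL_UNICODE_MAX */

/* Hypothetical classification, for illustration only. */
typedef enum { CP_OK, CP_SURROGATE, CP_NONCHAR, CP_SUPER } cp_class;

cp_class classify_code_point(uint32_t uv)
{
    if (uv > UNICODE_MAX)
        return CP_SUPER;                    /* past the last Unicode code point */
    if (uv >= 0xD800 && uv <= 0xDFFF)
        return CP_SURROGATE;                /* UTF-16 surrogate range */
    if (uv >= 0xFDD0 && uv <= 0xFDEF)
        return CP_NONCHAR;                  /* the FDD0 block of noncharacters */
    if ((uv & 0xFFFE) == 0xFFFE)
        return CP_NONCHAR;                  /* xxFFFE and xxFFFF in any plane */
    return CP_OK;
}
```

A single "illegal" flag, as suggested above, would correspond to treating both `CP_NONCHAR` branches the same way, while a "super" flag would govern only the first branch.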
Re: real UTF-8 vs. utf8n_to_uvuni()
Dan Kogai [EMAIL PROTECTED] writes:

> Since Gisle's patch makes use of utf8n_to_uvuni(), it seems to be a
> problem of perl core. So I have checked utf8.c, which defines that.
> Seems like it does not make use of PERL_UNICODE_MAX. The patch against
> utf8.c fixes that.

Seems like a good idea to have a workaround in Encode for this as well.

Index: users/gisle/hacks/Encode/Encode.xs
--- Encode/Encode.xs.~1~    Mon Dec  6 10:44:31 2004
+++ Encode/Encode.xs    Mon Dec  6 10:44:31 2004
@@ -300,6 +300,10 @@
                                UTF8_CHECK_ONLY | (strict ? UTF8_ALLOW_STRICT :
                                                   UTF8_ALLOW_NONSTRICT)
                 );
+#if 1 /* perl-5.8.6 and older do not check UTF8_ALLOW_LONG */
+           if (strict && uv > PERL_UNICODE_MAX)
+               ulen = -1;
+#endif
            if (ulen == -1) {
                if (strict) {
                    uv = utf8n_to_uvuni(s, e - s, &ulen,
End of Patch.

> --- perl-5.8.x/utf8.c   Wed Nov 17 23:11:04 2004
> +++ perl-5.8.x.dan/utf8.c   Sun Dec  5 11:38:52 2004
> @@ -429,6 +429,13 @@
>             }
>             else
>                 uv = UTF8_ACCUMULATE(uv, *s);
> +           /* Checks if ord() > 0x10FFFF -- dankogai */
> +           if (uv > PERL_UNICODE_MAX){
> +               if (!(flags & UTF8_ALLOW_LONG)) {
> +                   warning = UTF8_WARN_LONG;
> +                   goto malformed;
> +               }
> +           }
>             if (!(uv > ouv)) {
>                 /* These cannot be allowed. */
>                 if (uv == ouv) {
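The idea behind the workaround (decode first, then reject the result in strict mode if it exceeds the Unicode maximum) can be sketched outside of perl as follows. This is an illustration under stated assumptions, not the Encode.xs code: `decode_utf8` is a hypothetical helper that handles sequences of up to four bytes only and, to stay short, does not check for overlong forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNICODE_MAX 0x10FFFFu   /* same value as PERL_UNICODE_MAX */

/* Hypothetical helper: decode one sequence of up to 4 bytes from s.
 * Returns the number of bytes consumed, or -1 if the input is malformed
 * or (in strict mode) decodes to a value above UNICODE_MAX. */
int decode_utf8(const unsigned char *s, size_t len, int strict, uint32_t *out)
{
    uint32_t uv;
    int need, i;

    assert(s != NULL && out != NULL);
    if (len == 0)
        return -1;

    if (s[0] < 0x80)                { uv = s[0];        need = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { uv = s[0] & 0x1F; need = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { uv = s[0] & 0x0F; need = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { uv = s[0] & 0x07; need = 3; }
    else
        return -1;                  /* continuation byte or longer form */

    if (len < (size_t)need + 1)
        return -1;                  /* truncated sequence */

    for (i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return -1;              /* not a continuation byte */
        uv = (uv << 6) | (s[i] & 0x3F);
    }

    /* The post-decode range check, mirroring the workaround's intent. */
    if (strict && uv > UNICODE_MAX)
        return -1;

    *out = uv;
    return need + 1;
}
```

With this sketch the bytes F4 90 80 80 decode to 0x110000 in non-strict mode and are rejected in strict mode, while F4 8F BF BF (0x10FFFF) is accepted in both.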
Re: real UTF-8 vs. utf8n_to_uvuni()
On Sun, Dec 05, 2004 at 11:58:54AM +0900, Dan Kogai wrote:

> % perl -Mblib -MEncode -le '$a="\x{ffff}"; print encode("UTF-8", $a, 1)'
> "\x{ffff}" does not map to utf8 at [...]

Shouldn't that (and similar messages) say "... does not map to UTF-8"?

Tim.
Re: real UTF-8 vs. utf8n_to_uvuni()
Tim Bunce [EMAIL PROTECTED] writes:

> On Sun, Dec 05, 2004 at 11:58:54AM +0900, Dan Kogai wrote:
> > % perl -Mblib -MEncode -le '$a="\x{ffff}"; print encode("UTF-8", $a, 1)'
> > "\x{ffff}" does not map to utf8 at [...]
>
> Shouldn't that (and similar messages) say "... does not map to UTF-8"?

Yes, I think so. Should be an easy tweak.
real UTF-8 vs. utf8n_to_uvuni()
On Dec 05, 2004, at 10:56, Dan Kogai wrote:

> Thanks, applied in my repository. New tests and documentation fix in
> progress. When I am done w/ that, I will release Encode-2.0901 on my
> web (not CPAN yet). When cross-checks by porters are done I will
> release Encode-2.10.
>
> Dan the Encode Maintainer

Now I am writing test suites and found that some of the strictures are
missing.

Surrogate -- OK

% perl -Mblib -MEncode -le '$a="\x{d801}"; print encode("UTF-8", $a, 1)'
"\x{d801}" does not map to utf8 at /gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

U+FFFF -- OK

% perl -Mblib -MEncode -le '$a="\x{ffff}"; print encode("UTF-8", $a, 1)'
"\x{ffff}" does not map to utf8 at /gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

Chars above U+10FFFF -- NOT OK

% perl -Mblib -MEncode -le '$a="\x{110000}"; print encode("UTF-8", $a, 1)'

Since Gisle's patch makes use of utf8n_to_uvuni(), it seems to be a
problem of perl core. So I have checked utf8.c, which defines that.
Seems like it does not make use of PERL_UNICODE_MAX. The patch against
utf8.c fixes that.

% ~/danperl/bin/perl5.8.6 -Mblib -MEncode -le '$a="\x{110000}"; print encode("UTF-8", $a, 1)'
"\x{00f4}" does not map to utf8 at /gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

As you see, the warning is still funny. But the warning is funny in any
case involving UTF8_WARN_LONG, as follows:

% perl -Mblib -MEncode -le '$a="\x{7fff_ffff}"; print encode("UTF-8", $a, 1)'
??
% perl -Mblib -MEncode -le '$a="\x{8000_0000}"; print encode("UTF-8", $a, 1)'
"\x{00fe}" does not map to utf8 at /gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

I have tracked this down and found that the warning is handled by
Encode, so Gisle and I can fix that.

Dan the Encode Maintainer

--- perl-5.8.x/utf8.c   Wed Nov 17 23:11:04 2004
+++ perl-5.8.x.dan/utf8.c   Sun Dec  5 11:38:52 2004
@@ -429,6 +429,13 @@
            }
            else
                uv = UTF8_ACCUMULATE(uv, *s);
+           /* Checks if ord() > 0x10FFFF -- dankogai */
+           if (uv > PERL_UNICODE_MAX){
+               if (!(flags & UTF8_ALLOW_LONG)) {
+                   warning = UTF8_WARN_LONG;
+                   goto malformed;
+               }
+           }
            if (!(uv > ouv)) {
                /* These cannot be allowed. */
                if (uv == ouv) {
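One plausible reading of the odd "\x{00f4}" and "\x{00fe}" in the warnings above is that they are the lead bytes of the encoded sequences rather than the code points themselves. The sketch below checks that arithmetic, assuming the code points in the examples are 0x110000, 0x7FFFFFFF and 0x80000000. `encode_extended_utf8` is a hypothetical helper, not a perl function; it follows the extended UTF-8 layout of up to seven bytes (lead bytes up to 0xFE) and does not handle values of 2^36 or more.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write the extended UTF-8 form of uv into out
 * (at least 7 bytes). Returns the length, or 0 if uv needs more than
 * 36 bits of payload. */
size_t encode_extended_utf8(uint64_t uv, unsigned char *out)
{
    size_t len, i;

    assert(out != NULL);
    if (uv < 0x80) {
        out[0] = (unsigned char)uv;
        return 1;
    }
    if (uv < 0x800)                  len = 2;
    else if (uv < 0x10000)           len = 3;
    else if (uv < 0x200000)          len = 4;
    else if (uv < 0x4000000)         len = 5;
    else if (uv < 0x80000000ULL)     len = 6;
    else if (uv < 0x1000000000ULL)   len = 7;
    else
        return 0;

    /* Fill continuation bytes from the end, 6 payload bits each. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (uv & 0x3F));
        uv >>= 6;
    }
    /* Lead byte: len high bits set, followed by the remaining payload. */
    out[0] = (unsigned char)((0xFF << (8 - len)) | uv);
    return len;
}
```

Under those assumptions, 0x110000 encodes with lead byte 0xF4 and 0x80000000 with lead byte 0xFE, which matches the values shown in the two warnings; 0x7FFFFFFF encodes as six bytes starting with 0xFD.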