real UTF-8 vs. utf8n_to_uvuni()
On Dec 05, 2004, at 10:56, Dan Kogai wrote:
> Thanks, applied in my repository. New tests and documentation fix in
> progress. When I am done w/ that, I will release Encode-2.0901 on my web
> (not CPAN yet). When cross-checks by porters are done I will release
> Encode-2.10.
>
> Dan the Encode Maintainer

Now I am writing test suites and have found that some of the strictures are missing.

Surrogate -- OK

  % perl -Mblib -MEncode -le '$a="\x{d801}"; print encode("UTF-8", $a, 1)'
  "\x{d801}" does not map to utf8 at /gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

U+ -- OK

  % perl -Mblib -MEncode -le '$a="\x{}"; print encode("UTF-8", $a, 1)'
  "\x{}" does not map to utf8 at /gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

Chars above U+10 -- NOT OK

  %> perl -Mblib -MEncode -le '$a="\x{11}"; print encode("UTF-8", $a, 1)'

Since Gisle's patch makes use of utf8n_to_uvuni(), this seems to be a problem in the perl core. So I checked utf8.c, which defines that function. It does not seem to make use of PERL_UNICODE_MAX. The patch against utf8.c below fixes that.

  > ~/danperl/bin/perl5.8.6 -Mblib -MEncode -le '$a="\x{11}"; print encode("UTF-8", $a, 1)'
  "\x{00f4}" does not map to utf8 at /gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

As you see, the warning is still funny. But the warning is funny in any case that involves UTF8_WARN_LONG, as follows:

  > perl -Mblib -MEncode -le '$a="\x{7fff_}"; print encode("UTF-8", $a, 1)'
  ??
  > perl -Mblib -MEncode -le '$a="\x{8000_}"; print encode("UTF-8", $a, 1)'
  "\x{00fe}" does not map to utf8 at /gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

I have tracked this down and found that the warning is handled by Encode, so Gisle and I can fix that.
Dan the Encode Maintainer

--- perl-5.8.x/utf8.c Wed Nov 17 23:11:04 2004
+++ perl-5.8.x.dan/utf8.c Sun Dec 5 11:38:52 2004
@@ -429,6 +429,13 @@
         }
         else
             uv = UTF8_ACCUMULATE(uv, *s);
+        /* Checks if ord() > 0x10 -- dankogai */
+        if (uv > PERL_UNICODE_MAX){
+            if (!(flags & UTF8_ALLOW_LONG)) {
+                warning = UTF8_WARN_LONG;
+                goto malformed;
+            }
+        }
         if (!(uv > ouv)) {
             /* These cannot be allowed. */
             if (uv == ouv) {
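
[Editorial illustration, not part of the original message.] The check the patch adds can be sketched outside the perl core. The block below is a minimal standalone sketch, not the utf8.c code: `sketch_accumulate4`, `sketch_above_unicode_max` and `SKETCH_UNICODE_MAX` are hypothetical names, and the limit is assumed to be U+10FFFF, the highest code point defined by the Unicode Standard. It shows that a four-byte sequence can carry a value past that limit unless the decoder compares the accumulated value against the maximum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the highest code point defined by Unicode is U+10FFFF. */
#define SKETCH_UNICODE_MAX 0x10FFFFu

/* Hypothetical helper: laxly accumulate the payload bits of a 4-byte
 * sequence (lead byte F0..F7 followed by three continuation bytes).
 * No range check is done here; the caller guarantees 4 readable bytes. */
uint32_t sketch_accumulate4(const unsigned char *s)
{
    assert(s != NULL);
    uint32_t uv = s[0] & 0x07u;            /* 3 payload bits in the lead byte */
    for (size_t i = 1; i < 4; i++)
        uv = (uv << 6) | (s[i] & 0x3Fu);   /* 6 payload bits per continuation */
    return uv;
}

/* The strict check: nonzero when the value lies beyond the Unicode range. */
int sketch_above_unicode_max(uint32_t uv)
{
    return uv > SKETCH_UNICODE_MAX;
}
```

For example, the bytes F4 90 80 80 accumulate to 0x110000, which a lax decoder would hand back as a code point and which the comparison above flags as out of range.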
Re: Make Encode.pm support the real UTF-8
On Dec 04, 2004, at 20:28, Gisle Aas wrote:
> Dan Kogai <[EMAIL PROTECTED]> writes:
> > 2.1. What will the canonical name of the strict version of "UTF-8" be ?
> > Gisle already submitted me a test patch and it uses 'utf-8-strict'.
> > If there is no objection, I would like to use that.
>
> This is the complete patch relative to Encode-2.09 that implements this.
> I would also be happy with just removing the alias entry and just
> declare 'UTF-8' as the strict version.
>
> We still pass all the old tests after this patch. What is left to do is
> to write some new tests that feed bad data to the strict encoder.

Thanks, applied in my repository. New tests and documentation fix in progress. When I am done w/ that, I will release Encode-2.0901 on my web (not CPAN yet). When cross-checks by porters are done I will release Encode-2.10.

Dan the Encode Maintainer
Re: Make Encode.pm support the real UTF-8
Dan Kogai <[EMAIL PROTECTED]> writes:

> 2.3. Degree of stricture. How strict are we going to make utf-8-strict?
>    a. simply make use of UTF8_ALLOW_* in utf8.h ?
>    b. unmapped codepoints banned as well?
>    IMHO a. is strict enough since mapped codepoints are subject to
>    increase as Unicode Standard updates.

It seems obvious to me that 'a' is what we want and this is what my proposed patch implements. The goal is to make sure that we never generate or accept code points that Unicode has explicitly declared as non-chars.

My goal is just to make the "UTF-8" encoder pass the test at http://smontagu.damowmow.com/utf8test.html.

Regards,
Gisle
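
[Editorial illustration, not part of the original message.] Option 'a' amounts to rejecting code points by class rather than by whether they are currently assigned. The sketch below is one possible reading of that rule, not the behaviour of the UTF8_ALLOW_* flags themselves: all function names are hypothetical, and the ranges are taken from the Unicode Standard (surrogates U+D800..U+DFFF, noncharacters U+FDD0..U+FDEF plus every code point whose low 16 bits are FFFE or FFFF, and an upper bound of U+10FFFF).

```c
#include <assert.h>
#include <stdint.h>

/* Surrogate code points are reserved for UTF-16 and never valid scalars. */
int sketch_is_surrogate(uint32_t cp)
{
    return cp >= 0xD800u && cp <= 0xDFFFu;
}

/* Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane. */
int sketch_is_nonchar(uint32_t cp)
{
    return (cp >= 0xFDD0u && cp <= 0xFDEFu) || ((cp & 0xFFFEu) == 0xFFFEu);
}

/* Hypothetical strict predicate: accept only code points in range that
 * are neither surrogates nor noncharacters. Unassigned code points are
 * deliberately accepted, since assignments grow with each Unicode release. */
int sketch_strict_accepts(uint32_t cp)
{
    return cp <= 0x10FFFFu && !sketch_is_surrogate(cp) && !sketch_is_nonchar(cp);
}

/* Usage example: a few values from the thread and from the standard. */
void sketch_check_examples(void)
{
    assert(!sketch_strict_accepts(0xD801u));    /* surrogate */
    assert(!sketch_strict_accepts(0xFFFFu));    /* noncharacter */
    assert(sketch_strict_accepts(0x0041u));     /* LATIN CAPITAL LETTER A */
}
```

The design point is that the predicate depends only on fixed ranges, so it does not need updating when new characters are assigned.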
Re: Make Encode.pm support the real UTF-8
Dan Kogai <[EMAIL PROTECTED]> writes:

> 2.1. What will the canonical name of the strict version of "UTF-8"
> be ? Gisle already submitted me a test patch and it uses
> 'utf-8-strict'. If there is no objection, I would like to use that.

This is the complete patch relative to Encode-2.09 that implements this. I would also be happy with just removing the alias entry and just declare 'UTF-8' as the strict version.

We still pass all the old tests after this patch. What is left to do is to write some new tests that feed bad data to the strict encoder.

Regards,
Gisle

diff -ru contrib/Encode-2/Encode.pm Encode/Encode.pm
--- contrib/Encode-2/Encode.pm 2004-12-03 11:17:01.0 -0800
+++ Encode/Encode.pm 2004-12-03 13:27:11.0 -0800
@@ -300,6 +300,8 @@
         };
         $Encode::Encoding{utf8} = bless {Name => "utf8"} => "Encode::utf8";
+        $Encode::Encoding{"utf-8-strict"} =
+            bless {Name => "utf-8-strict", strict_utf8 => 1 } => "Encode::utf8";
     }
 }
diff -ru contrib/Encode-2/Encode.xs Encode/Encode.xs
--- contrib/Encode-2/Encode.xs 2004-12-03 11:16:57.0 -0800
+++ Encode/Encode.xs 2004-12-04 03:17:32.0 -0800
@@ -29,6 +29,12 @@
 UNIMPLEMENTED(_encoded_utf8_to_bytes, I32)
 UNIMPLEMENTED(_encoded_bytes_to_utf8, I32)
 
+#define UTF8_ALLOW_STRICT 0
+#define UTF8_ALLOW_NONSTRICT (UTF8_ALLOW_ANY &                    \
+                              ~(UTF8_ALLOW_CONTINUATION |         \
+                                UTF8_ALLOW_NON_CONTINUATION |     \
+                                UTF8_ALLOW_LONG))
+
 void
 Encode_XSEncoding(pTHX_ encode_t * enc)
 {
@@ -247,6 +253,111 @@
     return dst;
 }
 
+static bool
+strict_utf8(pTHX_ SV* sv)
+{
+    HV* hv;
+    SV** svp;
+    sv = SvRV(sv);
+    if (!sv || SvTYPE(sv) != SVt_PVHV)
+        return 0;
+    hv = (HV*)sv;
+    svp = hv_fetch(hv, "strict_utf8", 11, 0);
+    if (!svp)
+        return 0;
+    return SvTRUE(*svp);
+}
+
+static U8*
+process_utf8(pTHX_ SV* dst, U8* s, U8* e, int check,
+             bool encode, bool strict, bool stop_at_partial)
+{
+    UV uv;
+    STRLEN ulen;
+
+    SvPOK_only(dst);
+    SvCUR_set(dst,0);
+
+    while (s < e) {
+        if (UTF8_IS_INVARIANT(*s)) {
+            sv_catpvn(dst, (char *)s, 1);
+            s++;
+            continue;
+        }
+
+        if (UTF8_IS_START(*s)) {
+            U8 skip = UTF8SKIP(s);
+            if ((s + skip) > e) {
+                /* Partial character */
+                /* XXX could check that rest of bytes are UTF8_IS_CONTINUATION(ch) */
+                if (stop_at_partial)
+                    break;
+
+                goto malformed_byte;
+            }
+
+            uv = utf8n_to_uvuni(s, e - s, &ulen,
+                                UTF8_CHECK_ONLY | (strict ? UTF8_ALLOW_STRICT :
+                                                            UTF8_ALLOW_NONSTRICT)
+                               );
+            if (ulen == -1) {
+                if (strict) {
+                    uv = utf8n_to_uvuni(s, e - s, &ulen,
+                                        UTF8_CHECK_ONLY | UTF8_ALLOW_NONSTRICT);
+                    if (ulen == -1)
+                        goto malformed_byte;
+                    goto malformed;
+                }
+                goto malformed_byte;
+            }
+
+
+            /* Whole char is good */
+            sv_catpvn(dst,(char *)s,skip);
+            s += skip;
+            continue;
+        }
+
+        /* If we get here there is something wrong with alleged UTF-8 */
+    malformed_byte:
+        uv = (UV)*s;
+        ulen = 1;
+
+    malformed:
+        if (check & ENCODE_DIE_ON_ERR){
+            if (encode)
+                Perl_croak(aTHX_ ERR_ENCODE_NOMAP, uv, "utf8");
+            else
+                Perl_croak(aTHX_ ERR_DECODE_NOMAP, "utf8", uv);
+        }
+        if (check & ENCODE_WARN_ON_ERR){
+            if (encode)
+                Perl_warner(aTHX_ packWARN(WARN_UTF8),
+                            ERR_ENCODE_NOMAP, uv, "utf8");
+            else
+                Perl_warner(aTHX_ packWARN(WARN_UTF8),
+                            ERR_DECODE_NOMAP, "utf8", uv);
+        }
+        if (check & ENCODE_RETURN_ON_ERR) {
+            break;
+        }
+        if (check & (ENCODE_PERLQQ|ENCODE_HTMLCREF|ENCODE_XMLCREF)){
+            SV* subchar = newSVpvf(check & ENCODE_PERLQQ ? (ulen == 1 ? "\\x%02" UVXf : "\\x{%04" UVXf "}"):
+                                   check & ENCODE_HTMLCREF ? "%" UVuf ";" :
+                                   "%" UVxf ";", uv);
+            sv_catsv(dst, subchar);
+            SvREFCNT_dec(subchar);
+        } else {
+            sv_catpv(dst, FBCHAR_UTF8);
+        }
+        s += ulen;
+    }
+    *SvEND(dst) = '\0';
+
+    return s;
+}
+
+
 MODULE = Encode PACKAGE = Encode::utf8 PREFIX = Method_
 
 PROTOTYPES: DISABLE
@@ -283,8 +394,6 @@
     FREETMPS; LEAVE;
     /* end PerlIO check *
Re: Make Encode.pm support the real UTF-8
On Sat, Dec 04, 2004 at 01:40:30PM +0900, Dan Kogai wrote:
> On Dec 04, 2004, at 11:51, Larry Wall wrote:
> >On Fri, Dec 03, 2004 at 10:12:12PM +, Tim Bunce wrote:
> >: I've no problem with 'utf8' being perl's unrestricted utf8 encoding,
> >: but "UTF-8" is the name of the standard and should give the
> >: corresponding behaviour.
> >
> >For what it's worth, that's how I've always kept them straight in my
> >head.
> >
> >Also for what it's worth, Perl 6 will mostly default to strict but make
> >it easy to switch back to lax.
> >
> >Larry
>
> Okay, Looks like the verdict is reached.
>
> 1. "utf8" will stay liberal
> 2. "UTF-8" will be strict
>
> The rest is mostly implementation.
>
> 2.1. What will the canonical name of the strict version of "UTF-8" be
> ? Gisle already submitted me a test patch and it uses 'utf-8-strict'.
> If there is no objection, I would like to use that.

"UTF-8" is the name of the standard and should give the corresponding behaviour.

Why not use "UTF-8" as the canonical name of the behaviour that matches the "UTF-8" standard? Strictness should be implied by the fact it's the official name of the encoding.

> 2.2. CAVEAT: "UTF8" will be "utf8", not "utf-8-strict", since Encode
> aliasing is case insensitive.
>
> 2.3. Degree of stricture. How strict are we going to make utf-8-strict?
>    a. simply make use of UTF8_ALLOW_* in utf8.h ?
>    b. unmapped codepoints banned as well?
>    IMHO a. is strict enough since mapped codepoints are subject to
>    increase as Unicode Standard updates.

Overlong sequences (ie security) are the only concern I have.

Tim.

> 2.4 We can always make "UTF-8" liberal by reapplying alias.
>
> Anything else missing?
>
> Dan the Encode Maintainer
>
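
[Editorial illustration, not part of the original message.] The overlong concern can be made concrete with a short sketch. The helper names below are hypothetical and this is not how utf8.c implements the check; it only illustrates the rule that a value decoded from an n-byte sequence is overlong when it would have fit in fewer bytes. The classic security example is C0 AF, a two-byte spelling of "/" (U+002F) that can slip past a filter which only looks for the single byte 2F.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: accumulate the payload of a 2-byte sequence
 * (lead byte C0..DF plus one continuation byte) without any checks. */
uint32_t sketch_accumulate2(unsigned char lead, unsigned char cont)
{
    return ((uint32_t)(lead & 0x1Fu) << 6) | (uint32_t)(cont & 0x3Fu);
}

/* A decoded value is overlong if it is smaller than the minimum value
 * that genuinely requires a sequence of this length (1 to 4 bytes). */
int sketch_is_overlong(uint32_t cp, int len)
{
    /* Index 0 is unused; minimum code point for 1, 2, 3 and 4 bytes. */
    static const uint32_t min_for_len[5] = { 0u, 0x0u, 0x80u, 0x800u, 0x10000u };
    assert(len >= 1 && len <= 4);
    return cp < min_for_len[len];
}
```

Here `sketch_accumulate2(0xC0, 0xAF)` yields 0x2F, and `sketch_is_overlong(0x2F, 2)` reports it as overlong, so a strict decoder would reject the sequence instead of returning "/".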