real UTF-8 vs. utf8n_to_uvuni()

2004-12-04 Thread Dan Kogai
On Dec 05, 2004, at 10:56, Dan Kogai wrote:
> Thanks, applied in my repository.  New tests and documentation fix in 
> progress.  When I am done w/ that, I will release Encode-2.0901 on my 
> web (not CPAN yet).  When cross-checks by porters are done I will 
> release Encode-2.10.
>
> Dan the Encode Maintainer

Now I am writing test suites and have found that some of the strictures 
are missing.

Surrogate -- OK
% perl -Mblib -MEncode -le '$a="\x{d801}"; print encode("UTF-8", $a, 1)'
"\x{d801}" does not map to utf8 at 
/gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

U+FFFF -- OK
% perl -Mblib -MEncode -le '$a="\x{ffff}"; print encode("UTF-8", $a, 1)'
"\x{ffff}" does not map to utf8 at 
/gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

Chars above U+10FFFF -- NOT OK
% perl -Mblib -MEncode -le '$a="\x{110000}"; print encode("UTF-8", $a, 1)'


Since Gisle's patch makes use of utf8n_to_uvuni(), this seems to be a 
problem in the perl core.  So I checked utf8.c, which defines that 
function.  It does not seem to make use of PERL_UNICODE_MAX.
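
A minimal sketch of the kind of range check being described, written as 
standalone C rather than as perl core code.  The names sketch_decode and 
SKETCH_UNICODE_MAX are hypothetical; the limit 0x10FFFF is assumed to be 
what PERL_UNICODE_MAX stands for.  Overlong forms and surrogates are 
deliberately not checked here.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed limit: the highest Unicode code point. */
#define SKETCH_UNICODE_MAX 0x10FFFFUL

/*
 * Accumulate a lead byte plus its continuation bytes into a code point.
 * Returns 0 and stores the value in *out on success.  Returns -1 if the
 * sequence is truncated, has a bad continuation byte, or accumulates to
 * a value above SKETCH_UNICODE_MAX.
 */
static int sketch_decode(const uint8_t *s, size_t len, uint32_t *out)
{
    uint32_t uv;
    size_t need, i;

    if (len == 0) return -1;
    if (s[0] < 0x80)                { uv = s[0];        need = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { uv = s[0] & 0x1F; need = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { uv = s[0] & 0x0F; need = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { uv = s[0] & 0x07; need = 3; }
    else return -1;                 /* stray continuation or 5+ byte lead */

    if (len < need + 1) return -1;  /* truncated sequence */

    for (i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80) return -1;
        uv = (uv << 6) | (s[i] & 0x3F);
        if (uv > SKETCH_UNICODE_MAX) return -1;  /* the range check */
    }
    *out = uv;
    return 0;
}
```

With this check in place, F4 8F BF BF (0x10FFFF) is accepted while 
F4 90 80 80 (0x110000) is rejected.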

The patch against utf8.c below fixes that.
% ~/danperl/bin/perl5.8.6 -Mblib -MEncode -le '$a="\x{110000}"; print encode("UTF-8", $a, 1)'
"\x{00f4}" does not map to utf8 at 
/gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

As you see, the warning is still odd: it reports "\x{00f4}" rather than 
the code point that was passed in.  Every case that raises 
UTF8_WARN_LONG produces an odd warning, as follows:

% perl -Mblib -MEncode -le '$a="\x{7fff_ffff}"; print encode("UTF-8", $a, 1)'
??
% perl -Mblib -MEncode -le '$a="\x{8000_0000}"; print encode("UTF-8", $a, 1)'
"\x{00fe}" does not map to utf8 at 
/gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

I have tracked this down and found that the warning is generated by 
Encode, so Gisle and I can fix that.
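
One plausible reading of the odd "\x{00f4}": the malformed_byte path of 
process_utf8() in Gisle's patch reports only the first byte of the 
offending sequence.  The sketch below lays out a code point in the plain 
UTF-8 bit pattern for up to four bytes, with no validity checks, to show 
what that first byte would be.  The function name sketch_encode is 
hypothetical, and the input 0x110000 is an assumption about the test 
above.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Write the UTF-8 style byte layout of cp into out and return the number
 * of bytes used.  Handles values up to 21 bits (four bytes); returns 0
 * for anything larger.  No surrogate or range checks are made.
 */
static size_t sketch_encode(uint32_t cp, uint8_t out[4])
{
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x200000) {
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Under this layout 0x110000 comes out as F4 90 80 80, so a handler that 
reports only the first byte would print \x{00f4}.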

Dan the Encode Maintainer
--- perl-5.8.x/utf8.c   Wed Nov 17 23:11:04 2004
+++ perl-5.8.x.dan/utf8.c   Sun Dec  5 11:38:52 2004
@@ -429,6 +429,13 @@
}
else
uv = UTF8_ACCUMULATE(uv, *s);
+   /* Checks if ord() > 0x10FFFF -- dankogai */
+   if (uv > PERL_UNICODE_MAX){
+   if (!(flags & UTF8_ALLOW_LONG)) {
+   warning = UTF8_WARN_LONG;
+   goto malformed;
+   }
+   }
if (!(uv > ouv)) {
/* These cannot be allowed. */
if (uv == ouv) {


Re: Make Encode.pm support the real UTF-8

2004-12-04 Thread Dan Kogai
On Dec 04, 2004, at 20:28, Gisle Aas wrote:
> Dan Kogai <[EMAIL PROTECTED]> writes:
> > 2.1.  What will the canonical name of the strict version of "UTF-8"
> > be? Gisle already submitted me a test patch and it uses
> > 'utf-8-strict'.  If there is no objection, I would like to use that.
>
> This is the complete patch relative to Encode-2.09 that implements
> this.  I would also be happy with just removing the alias entry and
> just declaring 'UTF-8' as the strict version.
>
> We still pass all the old tests after this patch.  What is left to do
> is to write some new tests that feed bad data to the strict encoder.

Thanks, applied in my repository.  New tests and documentation fix in 
progress.  When I am done w/ that, I will release Encode-2.0901 on my 
web (not CPAN yet).  When cross-checks by porters are done I will 
release Encode-2.10.

Dan the Encode Maintainer


Re: Make Encode.pm support the real UTF-8

2004-12-04 Thread Gisle Aas
Dan Kogai <[EMAIL PROTECTED]> writes:

> 2.3.  Degree of stricture. How strict are we going to make utf-8-strict?
> a. simply make use of UTF8_ALLOW_* in utf8.h ?
> b. unmapped codepoints banned as well?
> IMHO a. is strict enough since mapped codepoints are subject to
> increase as Unicode Standard updates.

It seems obvious to me that 'a' is what we want and this is what my
proposed patch implements.  The goal is to make sure that we never
generate or accept code points that Unicode has explicitly declared
as non-chars.
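
As a rough illustration of option 'a', here is a sketch of a code point 
filter.  It assumes the standard Unicode list of noncharacters 
(U+FDD0..U+FDEF and the last two code points of every plane), plus 
surrogates and values above 0x10FFFF; that list is an assumption and not 
necessarily the exact set covered by the UTF8_ALLOW_* flags.  The 
function names are hypothetical.  Nothing here consults a table of 
assigned characters, so unassigned code points pass.

```c
#include <stdbool.h>
#include <stdint.h>

/* UTF-16 surrogate range, never valid as a standalone code point. */
static bool sketch_is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Noncharacters: U+FDD0..U+FDEF, plus xxFFFE and xxFFFF in each plane. */
static bool sketch_is_nonchar(uint32_t cp)
{
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return true;
    return cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE;
}

/* True if a strict encoder in the sense sketched here would accept cp. */
static bool sketch_strict_ok(uint32_t cp)
{
    return cp <= 0x10FFFF
        && !sketch_is_surrogate(cp)
        && !sketch_is_nonchar(cp);
}
```

A check of this shape rejects 0xD801, 0xFFFF and 0x110000, the three 
kinds of input discussed in this thread, without needing an update when 
new characters are assigned.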

My goal is just to make the "UTF-8" encoder pass the test at
http://smontagu.damowmow.com/utf8test.html.

Regards,
Gisle


Re: Make Encode.pm support the real UTF-8

2004-12-04 Thread Gisle Aas
Dan Kogai <[EMAIL PROTECTED]> writes:

> 2.1.  What will the canonical name of the strict version of "UTF-8"
> be? Gisle already submitted me a test patch and it uses
> 'utf-8-strict'.  If there is no objection, I would like to use that.

This is the complete patch relative to Encode-2.09 that implements
this.  I would also be happy with just removing the alias entry and
just declaring 'UTF-8' as the strict version.

We still pass all the old tests after this patch.  What is left to do
is to write some new tests that feed bad data to the strict encoder.

Regards,
Gisle


diff -ru contrib/Encode-2/Encode.pm Encode/Encode.pm
--- contrib/Encode-2/Encode.pm  2004-12-03 11:17:01.0 -0800
+++ Encode/Encode.pm  2004-12-03 13:27:11.0 -0800
@@ -300,6 +300,8 @@
};
$Encode::Encoding{utf8} =
bless {Name => "utf8"} => "Encode::utf8";
+   $Encode::Encoding{"utf-8-strict"} =
+   bless {Name => "utf-8-strict", strict_utf8 => 1 } => "Encode::utf8";
 }
 }
 
diff -ru contrib/Encode-2/Encode.xs Encode/Encode.xs
--- contrib/Encode-2/Encode.xs  2004-12-03 11:16:57.0 -0800
+++ Encode/Encode.xs  2004-12-04 03:17:32.0 -0800
@@ -29,6 +29,12 @@
 UNIMPLEMENTED(_encoded_utf8_to_bytes, I32)
 UNIMPLEMENTED(_encoded_bytes_to_utf8, I32)
 
+#define UTF8_ALLOW_STRICT 0
+#define UTF8_ALLOW_NONSTRICT (UTF8_ALLOW_ANY &\
+  ~(UTF8_ALLOW_CONTINUATION | \
+UTF8_ALLOW_NON_CONTINUATION | \
+UTF8_ALLOW_LONG))
+
 void
 Encode_XSEncoding(pTHX_ encode_t * enc)
 {
@@ -247,6 +253,111 @@
 return dst;
 }
 
+static bool
+strict_utf8(pTHX_ SV* sv)
+{
+HV* hv;
+SV** svp;
+sv = SvRV(sv);
+if (!sv || SvTYPE(sv) != SVt_PVHV)
+return 0;
+hv = (HV*)sv;
+svp = hv_fetch(hv, "strict_utf8", 11, 0);
+if (!svp)
+return 0;
+return SvTRUE(*svp);
+}
+
+static U8*
+process_utf8(pTHX_ SV* dst, U8* s, U8* e, int check,
+ bool encode, bool strict, bool stop_at_partial)
+{
+UV uv;
+STRLEN ulen;
+
+SvPOK_only(dst);
+SvCUR_set(dst,0);
+
+while (s < e) {
+if (UTF8_IS_INVARIANT(*s)) {
+sv_catpvn(dst, (char *)s, 1);
+s++;
+continue;
+}
+
+if (UTF8_IS_START(*s)) {
+U8 skip = UTF8SKIP(s);
+if ((s + skip) > e) {
+/* Partial character */
+/* XXX could check that rest of bytes are UTF8_IS_CONTINUATION(ch) */
+if (stop_at_partial)
+break;
+
+goto malformed_byte;
+}
+
+uv = utf8n_to_uvuni(s, e - s, &ulen,
+UTF8_CHECK_ONLY | (strict ? UTF8_ALLOW_STRICT :
+UTF8_ALLOW_NONSTRICT)
+   );
+if (ulen == -1) {
+if (strict) {
+uv = utf8n_to_uvuni(s, e - s, &ulen,
+UTF8_CHECK_ONLY | UTF8_ALLOW_NONSTRICT);
+if (ulen == -1)
+goto malformed_byte;
+goto malformed;
+}
+goto malformed_byte;
+}
+
+
+ /* Whole char is good */
+ sv_catpvn(dst,(char *)s,skip);
+ s += skip;
+ continue;
+}
+
+/* If we get here there is something wrong with alleged UTF-8 */
+malformed_byte:
+uv = (UV)*s;
+ulen = 1;
+
+malformed:
+if (check & ENCODE_DIE_ON_ERR){
+if (encode)
+Perl_croak(aTHX_ ERR_ENCODE_NOMAP, uv, "utf8");
+else
+Perl_croak(aTHX_ ERR_DECODE_NOMAP, "utf8", uv);
+}
+if (check & ENCODE_WARN_ON_ERR){
+if (encode)
+Perl_warner(aTHX_ packWARN(WARN_UTF8),
+ERR_ENCODE_NOMAP, uv, "utf8");
+else
+Perl_warner(aTHX_ packWARN(WARN_UTF8),
+ERR_DECODE_NOMAP, "utf8", uv);
+}
+if (check & ENCODE_RETURN_ON_ERR) {
+break;
+}
+if (check & (ENCODE_PERLQQ|ENCODE_HTMLCREF|ENCODE_XMLCREF)){
+SV* subchar = newSVpvf(check & ENCODE_PERLQQ ? (ulen == 1 ? "\\x%02" UVXf : "\\x{%04" UVXf "}"):
+   check & ENCODE_HTMLCREF ? "&#%" UVuf ";" :
+   "&#x%" UVxf ";", uv);
+sv_catsv(dst, subchar);
+SvREFCNT_dec(subchar);
+} else {
+sv_catpv(dst, FBCHAR_UTF8);
+}
+s += ulen;
+}
+*SvEND(dst) = '\0';
+
+return s;
+}
+
+
 MODULE = Encode  PACKAGE = Encode::utf8  PREFIX = Method_
 
 PROTOTYPES: DISABLE
@@ -283,8 +394,6 @@
 FREETMPS; LEAVE;
 /* end PerlIO check *

Re: Make Encode.pm support the real UTF-8

2004-12-04 Thread Tim Bunce
On Sat, Dec 04, 2004 at 01:40:30PM +0900, Dan Kogai wrote:
> On Dec 04, 2004, at 11:51, Larry Wall wrote:
> >On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
> >: I've no problem with 'utf8' being perl's unrestricted utf8 encoding,
> >: but "UTF-8" is the name of the standard and should give the
> >: corresponding behaviour.
> >
> >For what it's worth, that's how I've always kept them straight in my 
> >head.
> >
> >Also for what it's worth, Perl 6 will mostly default to strict but make
> >it easy to switch back to lax.
> >
> >Larry
> 
> Okay, looks like the verdict is reached.
> 
> 1.  "utf8" will stay liberal
> 2.  "UTF-8" will be strict
> 
> The rest is mostly implementation.
> 
> 2.1.  What will the canonical name of the strict version of "UTF-8" be?
> Gisle already submitted me a test patch and it uses 'utf-8-strict'.  
> If there is no objection, I would like to use that.

"UTF-8" is the name of the standard and should give the corresponding
behaviour.  Why not use "UTF-8" as the canonical name of the
behaviour that matches the "UTF-8" standard? Strictness should be
implied by the fact it's the official name of the encoding.

> 2.2.  CAVEAT: "UTF8" will be "utf8", not "utf-8-strict", since Encode 
> aliasing is case insensitive.
> 
> 2.3.  Degree of stricture. How strict are we going to make utf-8-strict?
>a. simply make use of UTF8_ALLOW_* in utf8.h ?
>b. unmapped codepoints banned as well?
>IMHO a. is strict enough since mapped codepoints are subject to
>increase as Unicode Standard updates.

Overlong sequences (i.e. security) are the only concern I have.
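
To make the overlong concern concrete, here is a small sketch (the 
function name sketch_is_overlong is hypothetical) that flags two- to 
four-byte sequences whose value would have fit in fewer bytes.  It looks 
only at the lead byte and the second byte and does not otherwise 
validate the sequence.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if s starts a 2-, 3- or 4-byte form that encodes a value
 * small enough for a shorter form.  Only the lead byte and the second
 * byte are examined; continuation bytes are not validated.
 */
static bool sketch_is_overlong(const uint8_t *s, size_t len)
{
    if (len >= 2 && (s[0] & 0xE0) == 0xC0)
        return s[0] < 0xC2;               /* C0, C1: value below 0x80 */
    if (len >= 3 && s[0] == 0xE0)
        return (s[1] & 0xE0) == 0x80;     /* value below 0x800 */
    if (len >= 4 && s[0] == 0xF0)
        return (s[1] & 0xF0) == 0x80;     /* value below 0x10000 */
    return false;
}
```

The classic case is C0 AF, a two-byte spelling of "/" (0x2F) that can 
slip past a filter comparing raw bytes before decoding.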

Tim.

> 2.4   We can always make "UTF-8" liberal by reapplying alias.
> 
> Anything else missing?
> 
> Dan the Encode Maintainer
>