Re: Make Encode.pm support the real UTF-8

2004-12-03 Thread Dan Kogai
On Dec 04, 2004, at 11:51, Larry Wall wrote:
> On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
> : I've no problem with 'utf8' being perl's unrestricted utf8 encoding,
> : but "UTF-8" is the name of the standard and should give the
> : corresponding behaviour.
>
> For what it's worth, that's how I've always kept them straight in my
> head.
>
> Also for what it's worth, Perl 6 will mostly default to strict but make
> it easy to switch back to lax.
>
> Larry

Okay, looks like the verdict is reached.

1.  "utf8" will stay liberal
2.  "UTF-8" will be strict

The rest is mostly implementation.

2.1.  What will the canonical name of the strict version of "UTF-8" be?
Gisle already submitted a test patch to me and it uses 'utf-8-strict'.
If there is no objection, I would like to use that.

2.2.  CAVEAT: "UTF8" will be "utf8", not "utf-8-strict", since Encode
aliasing is case insensitive.

2.3.  Degree of stricture. How strict are we going to make utf-8-strict?
   a. simply make use of UTF8_ALLOW_* in utf8.h?
   b. ban unmapped codepoints as well?
   IMHO a. is strict enough, since the set of mapped codepoints keeps
   growing as the Unicode Standard is updated.

2.4.  We can always make "UTF-8" liberal by reapplying the alias.

Anything else missing?
Dan the Encode Maintainer
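
A minimal sketch of how the split agreed above can be observed, assuming an Encode release in which the strict encoding is registered under the proposed name 'utf-8-strict' (older releases may behave differently). The helper name encodable() and the sample code points are illustrative choices, not taken from the thread.

```perl
use strict;
use warnings;
no warnings 'utf8';    # ill-formed characters are built on purpose below
use Encode qw(encode find_encoding FB_CROAK);

# Name resolution: "UTF-8" should map to the strict encoding,
# while "utf8" and "UTF8" should map to the liberal one.
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;
my $upper_name  = find_encoding('UTF8')->name;
print "UTF-8 -> $strict_name, utf8 -> $lax_name, UTF8 -> $upper_name\n";

# Returns 1 if encoding $name can encode the single code point $cp, else 0.
sub encodable {
    my ($name, $cp) = @_;
    my $char = chr $cp;    # a copy: encode() with a CHECK value may modify its argument
    return eval { encode($name, $char, FB_CROAK); 1 } ? 1 : 0;
}

# A plain letter, a surrogate, and a value above the Unicode range.
for my $cp (0x41, 0xD800, 0x110000) {
    printf "U+%04X  utf8: %s  UTF-8: %s\n", $cp,
        encodable('utf8',  $cp) ? 'ok' : 'rejected',
        encodable('UTF-8', $cp) ? 'ok' : 'rejected';
}
```

Using FB_CROAK makes the difference visible as an exception; without a CHECK argument the strict encoder substitutes a replacement character instead of dying.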


Re: Make Encode.pm support the real UTF-8

2004-12-03 Thread Larry Wall
On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
: I've no problem with 'utf8' being perl's unrestricted utf8 encoding,
: but "UTF-8" is the name of the standard and should give the
: corresponding behaviour.

For what it's worth, that's how I've always kept them straight in my head.

Also for what it's worth, Perl 6 will mostly default to strict but make
it easy to switch back to lax.

Larry


Re: About HTML unicode

2004-12-03 Thread Ben Morrow

Quoth [EMAIL PROTECTED] (John Delacour):
> At 12:31 am +0800 3/12/04, He Zhiqiang wrote:
> 
> >Now i encountered another problem,  there are a few files contains 
> >not only one charset but also two or more, for example, file1 
> >contains japanese and chinese, if i use open() to  load the data 
> >into memory, ord and length etc.. can't correctly work! Perhaps i 
> >miss something to encode or decode the data ?
> >code:
> >#!/usr/bin/perl -w
> >use utf8;
> >open(FD, "< file1");
> >while(<FD>) {
> >chomp;
> >print "length = ".length($_);
> >}
> >close FD;
> >--
> >length() can not count the correct non-ASCII characters. :(
> 
> If the file is in UTF-8, then it may be in any number of _languages_ 
> but it uses only one character set -- Unicode.  So far as I know "use 
> utf8" is now redundant and ineffectual in Perl.

Both utf8.pm and encoding.pm alter the encoding Perl considers your
*source file* to be in. This is different from what utf8.pm did under
5.6.

> You will get the 
> correct character count (6 characters rather than 18 bytes) by 
> opening the file handle as utf-8 as below.
> 
> no warnings;
> my $f = "/tmp/cjk.txt";
> my $text = "\x{56d8}\x{56d9}\x{56da}\x{56db}\x{56dc}\x{56dd}\n";
> open F, ">$f";

binmode F;

both for portability and in case of some environment setting (PERLIO,
the locale variables with 5.8.0 or -C) having set some other encoding on
the data.

> print F $text; # writes $text to $f as UTF-8

utf8::encode $text; # make sure $text is a sequence of octets, not
# characters
print F $text;

> close F;
> open F, "<:utf8",  $f;
> for (<F>) {
>   chomp;
>   print "$_  -  Length = " . length() . $/;
> }

Ben

-- 
  Joy and Woe are woven fine,
  A Clothing for the Soul divine   William Blake
  Under every grief and pine  'Auguries of Innocence'
  Runs a joy with silken twine.[EMAIL PROTECTED]


Re: About HTML unicode

2004-12-03 Thread Ben Morrow

Quoth [EMAIL PROTECTED] (He Zhiqiang):
> 
>  I've a problem to convert a unicode character into its decimal or 
> Hexadecimal value. The following URL:
> http://code.cside.com/3rdpage/us/unicode/converter.html
>  can easily convert via Javascript function escape(), but i wonder that is 
> there some method or function
> or modules can do the same job?? If i can do it, then in one html page, i 
> can display not only chinese, but
> also japanese, korean etc... This is something like HTML unicode, am i right? 

I routinely use

use Encode qw/:fallbacks/;
use PerlIO::encoding ();

$PerlIO::encoding::fallback = FB_XMLCREF; # could use FB_HTMLCREF instead
binmode STDOUT, ':encoding(ascii)';

which will cause all non-ascii characters in the output to be converted
to HTML's numeric character reference form (&#xHHHH; with FB_XMLCREF,
&#NNNN; with FB_HTMLCREF). You need 5.8 for this to work, but you need 5.8
for Unicode support anyway.

(if this is wrong, could someone tell me where, and why... ? :)

>  The ord() function can't do the job because it return the incorrect decimal 

I presume you mean 'the ord function returns the decimal value, which is
not what I want (I want the hex value)' rather than 'the ord function
returns an incorrect decimal value (i.e. the value itself is wrong)'?
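
A small sketch of both points, assuming the same fallback constants can be passed directly to encode() as its CHECK argument instead of going through the PerlIO layer. The sample string is an illustrative choice; ord() supplies the code point, and sprintf formats it as decimal or hex.

```perl
use strict;
use warnings;
use Encode qw(encode FB_XMLCREF FB_HTMLCREF);

my $text = "CJK: \x{56d8}\x{56db}";

# encode() with a CHECK value may modify its argument, so work on copies.
my ($for_xml, $for_html) = ($text, $text);
my $xml  = encode('ascii', $for_xml,  FB_XMLCREF);     # hexadecimal references
my $html = encode('ascii', $for_html, FB_HTMLCREF);    # decimal references
print "$xml\n$html\n";

# ord() returns the code point as a number; formatting it as hex is a
# presentation step, not a different value.
my @report = map { sprintf 'U+%04X = %d', ord($_), ord($_) }
             split //, "\x{56d8}\x{56db}";
print "$_\n" for @report;
```

For ord() to see characters rather than octets, the input has to be decoded first, for example by reading it through an :encoding(UTF-8) layer.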

Ben

-- 
Every twenty-four hours about 34k children die from the effects of poverty.
Meanwhile, the latest estimate is that 2800 people died on 9/11, so it's like
that image, that ghastly, grey-billowing, double-barrelled fall, repeated
twelve times every day. Full of children. [Iain Banks] [EMAIL PROTECTED]


Re: Make Encode.pm support the real UTF-8

2004-12-03 Thread Tim Bunce
On Sat, Dec 04, 2004 at 04:06:46AM +0900, Dan Kogai wrote:
> On Dec 02, 2004, at 23:25, Tim Bunce wrote:
> >On Wed, Dec 01, 2004 at 01:28:05PM -0800, Gisle Aas wrote:
> >>As you probably know perl's version of UTF-8 is not the real thing.  I
> >>thought I would hack up a patch to support the encoding as defined by
> >>Unicode.  That involves rejecting illegal chars (like surrogates,
> >>"\x{FFFF}" and "\x{FDD0}"), chars above 0x10FFFF, overlong sequences
> >>and such.
> >
> >It's worth remembering that overlong sequences are a potential 
> >security risk.
> >
> >>Before I do this I would like to get some feedback on the interface.
> >>My preferred interface would be to make:
> >>
> >>   encode("UTF-8", $string)
> >>
> >>imply the official restricted form
> >
> >I think that would be best.
> 
> But to what extent?  Does it mean restricted, but unused codepoints 
> (i.e. U+10F000) to be illegal?  Does that mean we have to verify and if 
> necessary, patch perl anytime Unicode.org updates Unicode?
> 
> While I agree official UTF-8 be supported separately from "Perl" UTF-8, 

Okay.

> I would like perl to be independent from unicode.org.  Remember that 
> perl community does not have a vote in unicode.org (or does it?).  
> Making perl too compliant with the Unicode standard means that perl is at 
> the mercy thereof.

Whoa. We agree official UTF-8 be supported separately from "Perl" UTF-8.
So there must be two names.

Then this thread boils down to what to call them.

> >>This implies that encode("UTF-8", $string) can start failing while
> >>previously it could not.
> >
> >Anyone working with valid UTF-8 would not get failures.
> >Anyone who thinks they're using valid UTF-8 but aren't should be 
> >grateful!
> >Anyone not using valid UTF-8 (eg using it as a way to encode integers)
> >needs to be told in advance - but I doubt there are many and they're
> >likely to be clueful users who read release notes :)
> 
> There are many movements and implementations that "extend" Unicode by 
> making use of codepoints beyond 0x10FFFF.  Current perl can accept 
> them;  "Real", official unicode cannot.

Sure. I've used perl utf8 for packing large integers myself. That's
not the issue here. The issue is what to call the two encodings.

> >I'd say "UTF-8" should mean the official restricted form for perl 5.10.
> 
> Perl is a language where "use strict" is not default.  Why make its 
> default encoding strict then?  Perl should be liberal, not official.

I didn't actually say that perl's default encoding should be strict,
though I can see how it came across that way.

I'm only saying that the Unicode standard is called "UTF-8" and if
that's what a script explicitly asks for then that's what it should get.

> So my proposal is opposite;  Leave "utf8" and "UTF-8" as it is now and 
> define "UTF-8-official" or "UTF-8-pedantic" or whatever.

Security is for everyone, not just pedants. This is a bit dated but was
the best I could find http://www.izerv.net/idwg-public/archive/0181.html
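
To make the risk concrete: an overlong sequence spends more octets than necessary on a character, for example "/" (U+002F) written as C0 AF instead of 2F. A filter that scans raw octets for "/" never sees it, yet a permissive decoder downstream would turn it back into "/". The sketch below assumes a strict "UTF-8" decoder as discussed in this thread; decode_strict() is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Decodes $octets as strict UTF-8; returns the string, or undef if ill-formed.
sub decode_strict {
    my ($octets) = @_;    # a copy: decode() with a CHECK value may modify its argument
    return eval { decode('UTF-8', $octets, FB_CROAK) };
}

my $plain    = decode_strict("/etc");           # "/" as the single octet 0x2F
my $overlong = decode_strict("\xC0\xAFetc");    # "/" smuggled in as C0 AF

print defined $plain    ? "plain accepted\n"    : "plain rejected\n";
print defined $overlong ? "overlong accepted\n" : "overlong rejected\n";
```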

> >The only remaining issues are then what to do for 5.8.7
> >and what to call the unrestricted encoding.
> 
> I would like to keep calling that 'utf8'.

I've no problem with 'utf8' being perl's unrestricted utf8 encoding,
but "UTF-8" is the name of the standard and should give the
corresponding behaviour.

Tim.


[Encode] 2.09 released!

2004-12-03 Thread Dan Kogai
Porters,
I have just released Encode-2.09, AKA GAAS special.   Gisle, thank you 
for all the reports and patches.  Wish they were done before 5.8.6 :)

=head1 Availability

http://www.dan.co.jp/~dankogai/cpan/Encode-2.09.tar.gz
or CPAN near you

=head1 Changes

$Revision: 2.9 $ $Date: 2004/12/03 19:16:53 $
! Encode.pm Encode.xs
  Addressed " :encoding(utf8) broken in perl-5.8.6".
  Message-Id: <[EMAIL PROTECTED]>
! Encode.pm
  Addressed "(de|en)code($valid_encoding, undef) does not warn".
  http://rt.cpan.org/NoAuth/Bug.html?id=8723
! Encode.pm t/Encode.t
  Addressed "Can't encode URI".  When a reference is fed to (en|de)code,
  Encode now stringifies instead of returning undef.
  http://rt.cpan.org/NoAuth/Bug.html?id=8725
! Encode.xs t/fallback.t
  Addressed "FB_HTMLCREF and FB_XMLCREF for the UTF-8 decoder".
  http://rt.cpan.org/NoAuth/Bug.html?id=8694
! Encode.pm
  Addressed "s/digit/number/".
  http://rt.cpan.org/NoAuth/Bug.html?id=8695
! Encode.pm
  Addressed "while (defined(read )) { ... } is an infinite loop".
  http://rt.cpan.org/NoAuth/Bug.html?id=8696
! Encode.pm
  Addressed "What the heck is UCM?".
  Document fixed so that it no longer contains "UCM-Based Encodings".
  http://rt.cpan.org/NoAuth/Bug.html?id=8697

=head1 NOTE

I am in the middle of moving out and I will not settle till Dec. 10th 
or so.  So please be patient if you ever RT'd or mailed me on Encode.

=head1 Murphy's Law of Patches

* Patches come after the RC period is over.
* Patches come when you are hauling.

=head1 Signature

Dan the Encode Maintainer



Re: Make Encode.pm support the real UTF-8

2004-12-03 Thread Dan Kogai
On Dec 02, 2004, at 23:25, Tim Bunce wrote:
> On Wed, Dec 01, 2004 at 01:28:05PM -0800, Gisle Aas wrote:
>> As you probably know perl's version of UTF-8 is not the real thing.  I
>> thought I would hack up a patch to support the encoding as defined by
>> Unicode.  That involves rejecting illegal chars (like surrogates,
>> "\x{FFFF}" and "\x{FDD0}"), chars above 0x10FFFF, overlong sequences
>> and such.
>
> It's worth remembering that overlong sequences are a potential
> security risk.
>
>> Before I do this I would like to get some feedback on the interface.
>> My preferred interface would be to make:
>>
>>    encode("UTF-8", $string)
>>
>> imply the official restricted form
>
> I think that would be best.

But to what extent?  Does it mean restricted, but unused codepoints
(i.e. U+10F000) to be illegal?  Does that mean we have to verify and if
necessary, patch perl anytime Unicode.org updates Unicode?

While I agree official UTF-8 be supported separately from "Perl" UTF-8,
I would like perl to be independent from unicode.org.  Remember that
perl community does not have a vote in unicode.org (or does it?).
Making perl too compliant with the Unicode standard means that perl is at
the mercy thereof.

>> and then have
>>
>>    encode("UTF-8-Perl", $string)
>>
>> be used as the name for Perl's relaxed and extended version of the
>> encoding.  The encode_utf8($string) function would continue to be the
>> same as encode("UTF-8-Perl", $string).
>
> Isn't there a standard name for the 'unrestricted' encoding?
> (Might be an IETF RFC rather than a unicode standard.)

To my knowledge there are at least 3 flavors of UTF-8;

* Official -- officialized by unicode.org.
* RFC 2279 -- "unrestricted", U+0000 - U+7FFF_FFFF
* Perl -- "unrestricted", U+0000 - U+7FFF_FFFF_FFFF_FFFF w/ 64bitint

>> This implies that encode("UTF-8", $string) can start failing while
>> previously it could not.
>
> Anyone working with valid UTF-8 would not get failures.
> Anyone who thinks they're using valid UTF-8 but aren't should be
> grateful!
> Anyone not using valid UTF-8 (eg using it as a way to encode integers)
> needs to be told in advance - but I doubt there are many and they're
> likely to be clueful users who read release notes :)

There are many movements and implementations that "extend" Unicode by
making use of codepoints beyond 0x10FFFF.  Current perl can accept
them;  "Real", official unicode cannot.

> I'd say "UTF-8" should mean the official restricted form for perl 5.10.

Perl is a language where "use strict" is not default.  Why make its
default encoding strict then?

Perl should be liberal, not official.
Why make real when you already have something better than real?

So my proposal is opposite;  Leave "utf8" and "UTF-8" as it is now and
define "UTF-8-official" or "UTF-8-pedantic" or whatever.

> The only remaining issues are then what to do for 5.8.7
> and what to call the unrestricted encoding.

I would like to keep calling that 'utf8'.
Dan the Encode Maintainer
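
The three flavors listed in this message differ mainly in which code points they admit. The sketch below compares ranges only, under simplifying assumptions: "official" is taken as U+0000 to U+10FFFF minus the surrogates (noncharacters such as U+FDD0 are ignored), and the Perl flavor is treated as open-ended. flavors_accepting() is a hypothetical name.

```perl
use strict;
use warnings;

# Returns the names of the flavors whose code point range admits $cp.
# Range check only; this is not a validator for encoded octets.
sub flavors_accepting {
    my ($cp) = @_;
    my @ok;
    push @ok, 'official'
        if $cp <= 0x10FFFF && !($cp >= 0xD800 && $cp <= 0xDFFF);
    push @ok, 'rfc2279' if $cp <= 0x7FFF_FFFF;
    push @ok, 'perl';    # assumption: admits any value the platform's integers hold
    return @ok;
}

# A plain letter, a surrogate, and the first value above the Unicode range.
for my $cp (0x41, 0xD800, 0x110000) {
    printf "U+%04X: %s\n", $cp, join ', ', flavors_accepting($cp);
}
```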


Re: About HTML unicode

2004-12-03 Thread John Delacour
At 12:31 am +0800 3/12/04, He Zhiqiang wrote:

>Now i encountered another problem,  there are a few files contains 
>not only one charset but also two or more, for example, file1 
>contains japanese and chinese, if i use open() to  load the data 
>into memory, ord and length etc.. can't correctly work! Perhaps i 
>miss something to encode or decode the data ?
>code:
>#!/usr/bin/perl -w
>use utf8;
>open(FD, "< file1");
>while(<FD>) {
>chomp;
>print "length = ".length($_);
>}
>close FD;
>--
>length() can not count the correct non-ASCII characters. :(

If the file is in UTF-8, then it may be in any number of _languages_ 
but it uses only one character set -- Unicode.  So far as I know "use 
utf8" is now redundant and ineffectual in Perl.  You will get the 
correct character count (6 characters rather than 18 bytes) by 
opening the file handle as utf-8 as below.

no warnings;
my $f = "/tmp/cjk.txt";
my $text = "\x{56d8}\x{56d9}\x{56da}\x{56db}\x{56dc}\x{56dd}\n";
open F, ">$f";
print F $text; # writes $text to $f as UTF-8
close F;
open F, "<:utf8",  $f;
for (<F>) {
  chomp;
  print "$_  -  Length = " . length() . $/;
}
JD
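
The 6-versus-18 count can be checked with a variant that names the encoding on both the write and the read side, so that no environment setting decides the layers. This is a sketch using lexical filehandles and a temporary file (File::Temp) rather than the fixed /tmp path above; first_line_length() is a hypothetical helper.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $text = "\x{56d8}\x{56d9}\x{56da}\x{56db}\x{56dc}\x{56dd}";

# Scratch file, removed automatically when the program exits.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Write through an explicit encoding layer.
open my $out, '>:encoding(UTF-8)', $path or die "open for write: $!";
print {$out} $text, "\n";
close $out or die "close: $!";

# Reads the first line through the given layer and returns its length
# without the line ending.
sub first_line_length {
    my ($layer) = @_;
    open my $in, "<$layer", $path or die "open for read: $!";
    my $line = <$in>;
    close $in;
    $line =~ s/\r?\n\z//;
    return length $line;
}

my $byte_len = first_line_length(':raw');                # counts octets
my $char_len = first_line_length(':encoding(UTF-8)');    # counts characters
print "octets = $byte_len, characters = $char_len\n";
```

The same file yields two different lengths depending only on the layer used to read it, which is the effect described in the original question.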