range operator vs. unicode

2006-06-08 Thread Dan Kogai

Porters,

I found that ('a'..'z') works only for alphanumerics.  Try the code
below:


use strict;
use warnings;
#use utf8;
use charnames ':full';
binmode STDOUT, ':utf8';
# works
print "$_\n" for ("\N{LATIN CAPITAL LETTER A}" .. "\N{LATIN CAPITAL LETTER Z}");

# (0..9, 'A'..'Z', 'a'..'z'); symbols skipped
print "$_\n" for ("\N{DIGIT ZERO}" .. "\N{LATIN SMALL LETTER Z}");
# does not work
print "$_\n" for ("\N{LATIN SMALL LETTER A}" .. "\N{LEFT CURLY BRACKET}");
print "$_\n" for ("\N{NO-BREAK SPACE}" .. "\N{LATIN SMALL LETTER Y WITH DIAERESIS}");
print "$_\n" for ("\N{GREEK CAPITAL LETTER ALPHA}" .. "\N{GREEK CAPITAL LETTER OMEGA}");
print "$_\n" for ("\N{KATAKANA LETTER SMALL A}" .. "\N{KATAKANA LETTER VO}");

__END__

There is an easy workaround, however.

my @katakana = map { chr }
    (ord("\N{KATAKANA LETTER SMALL A}") .. ord("\N{KATAKANA LETTER VO}"));



Since we have the workaround above, I don't consider this range
implementation a bug -- after all, we would be rather surprised if
('\x0' .. '\x{10}') worked.  But the following documentation should be
fixed so that Greeks are not confused by the result of
("\N{GREEK CAPITAL LETTER ALPHA}" .. "\N{GREEK CAPITAL LETTER OMEGA}"),
Japanese are not confused by
("\N{KATAKANA LETTER SMALL A}" .. "\N{KATAKANA LETTER VO}"), and so forth.


perldoc perlop
   The range operator (in list context) makes use of the magical
   auto-increment algorithm if the operands are strings.  You can say

       @alphabet = ('A' .. 'Z');

   to get all normal letters of the English alphabet, or

       $hexdigit = (0 .. 9, 'a' .. 'f')[$num & 15];

   to get a hexadecimal digit, or

       @z2 = ('01' .. '31');  print $z2[$mday];

   to get dates with leading zeros.  If the final value specified is
   not in the sequence that the magical increment would produce, the
   sequence goes until the next value would be longer than the final
   value specified.


Dan the Man with Too Many Characters to Squeeze in the Range


Re: range operator vs. unicode

2006-06-08 Thread Dan Kogai

On Jun 08, 2006, at 17:34 , Yitzchak Scott-Thoennes wrote:

> Which part should be fixed?


The limitation of the magic, namely


> The key part is that magical auto-increment is defined earlier as
> only working for strings matching /^[a-zA-Z]*[0-9]*\z/.


That is described under "Auto-increment and Auto-decrement", though
"Range Operators" does mention it:


perldoc perlop
   The range operator (in list context) makes use of the magical
   auto-increment algorithm if the operands are strings.


This would make lawyers happy enough but not (Uni)?coders like
myself.  With the advent of Unicode support, more people will attempt
things like ("\N{alpha}" .. "\N{omega}") and wonder why it does not
work like ("a" .. "z").  So we should add something like:


=head2 CAVEAT

Note that the range operator cannot apply its magic beyond
C<[a-zA-Z0-9]>.  Therefore

  use charnames 'greek';
  my @greek_small = ("\N{alpha}" .. "\N{omega}");

does not work.  If you want non-ASCII ranges, try

  my @greek_small = map { chr } ( ord("\N{alpha}") .. ord("\N{omega}") );

On the other hand, ranges in regexps and C<tr///> work.  You may
consider this inconsistent, but the range operator must accept variables
like C<($start .. $end)>, while character ranges in a regexp are
constant.

=cut
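
A minimal sketch of the ord/chr workaround described in the CAVEAT, wrapped in a helper.  The name char_range is hypothetical (it is not part of Perl or of any module), and the endpoints are written as \x{...} escapes so the example does not depend on charnames.

```perl
use strict;
use warnings;

# Hypothetical helper: every character from $from to $to, inclusive,
# ordered by code point (no magical auto-increment involved).
sub char_range {
    my ($from, $to) = @_;
    return map { chr } ord($from) .. ord($to);
}

# U+03B1 GREEK SMALL LETTER ALPHA .. U+03C9 GREEK SMALL LETTER OMEGA
my @greek_small = char_range("\x{3B1}", "\x{3C9}");

binmode STDOUT, ':encoding(UTF-8)';
print scalar(@greek_small), " characters: @greek_small\n";
```

The list has 25 elements rather than 24 because the range runs by code point and therefore includes U+03C2 (final sigma).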

Dan the Range (?:Ar)ranger



[Encode] 2.16 released!

2006-05-03 Thread Dan Kogai

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Porters,

I just released Encode 2.16 as follows.  In terms of code it is
virtually no different from 2.15, but it contains two important
non-code fixes.


First, it addresses the absence of a COPYRIGHT section.  Since Encode
is part of the core and I felt I owed my fellows too much credit to
claim copyright myself, I kept leaving that part intentionally blank --
till I got the ticket quoted below.


Second, I perltidy-ed all *.pm files.  Encode has accepted so many
patches that, in the course of doing so, it kinda turned into a
grab-bag of coding styles.  I reckoned it was high time I applied good
practices.  The only difference from the perltidy default is -l=76; I
did so because that's what MIME headers use.


Ticket URL: http://rt.cpan.org/Ticket/Display.html?id=19056 

There is no license in the Encode package.  It is not clear under what
basis CPAN has permission to distribute the module.


So I finally put my name on COPYRIGHT while adding this disclaimer to
the MAINTAINER section.


  While Dan Kogai retains the copyright as a maintainer, the credit
  should go to all those involved.  See AUTHORS for those who submitted
  code.

If you are listed in the AUTHORS section and want your name added to
COPYRIGHT, you are welcome.


=head1 Availability

http://www.dan.co.jp/~dankogai/cpan/Encode-2.16.tar.gz
and CPAN near you.

=head1 Changes

$Revision: 2.16 $ $Date: 2006/05/03 18:24:10 $
! bin/piconv
  --xmlcref and --htmlcref added.
! Encode.pm
  Copyright Notice Added.
  http://rt.cpan.org/NoAuth/Bug.html?id=#19056
! *
  Replaced remaining ^\t with q( ) x 4. -- Perl Best Practice pp. 20
  And all .pm's are now perltidy-ed.

=for Maintperl

Encode remains at 2.12 there, but I consider the current version mature
enough for maint.  Nicholas, would you consider doing so?


Dan the Encode Maintainer


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (Darwin)

iD8DBQFEWPxgErJia/WXtBsRAi9qAJ4/+Bye2aZbCScV3rIBFvzYoJpD7ACgoMd9
bM43uLFvZ6q7yOsnO0/pGcw=
=jRwd
-----END PGP SIGNATURE-----


Re: \p{IsBogus} vs. exception

2006-03-06 Thread Dan Kogai

On Mar 07, 2006, at 01:45 , Yitzchak Scott-Thoennes wrote:

> So the property is only checked for validity at the point when it is
> actually used.  I'm not sure it would even be desirable to check it
> before then (that is, at regcomp-time), remembering that Perl is a
> dynamic language.


Maybe too dynamic :)  Not many people would expect \p{IsBogus} to be
completely ignored in a case like


'str' =~ / \p{IsBogus}/;

But in cases like

  $str = $ARGV[0];
  $str =~ / \p{IsBogus}/;

the code may or may not raise an exception and that's somewhat tricky.

On Mar 07, 2006, at 01:45 , [EMAIL PROTECTED] wrote:

> This all looks perfectly consistent to me: the expensive work of
> looking up the property is not done until matching actually gets
> to that point.


Thanks.  Sounds reasonable to me, too.


> I would not call this a bug (not sure if you were suggesting that it
> is) - if you need to check whether a property is bogus, your example
> has_unicode_property is fine. But it would not be unreasonable for
> utf8.pm (or something) to provide a function that delivers the same
> information.


I agree.
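
The has_unicode_property example mentioned in the quote is not reproduced in this message, so the following is only a guess at its shape, not the original code.  It relies on the behaviour discussed in this thread: a bogus property only raises an exception once the matcher actually has to consult it, so the probe forces a real match inside eval.

```perl
use strict;
use warnings;

# Hypothetical helper: true if \p{$name} can actually be used in a match.
# The pattern is built at run time and matched against a real character,
# so a failure at regcomp time or at match time both end up in eval.
sub has_unicode_property {
    my ($name) = @_;
    my $pattern = '\p{' . $name . '}';
    return eval { 'a' =~ /$pattern/; 1 } ? 1 : 0;
}

for my $name (qw(Latin IsBogus)) {
    printf "%-8s %s\n", $name,
        has_unicode_property($name) ? 'known' : 'unknown';
}
```

The sketch assumes $name contains nothing but a property name; it does no escaping of regexp metacharacters.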

Dan the Perl5 Porter



Re: [Encode::Guess] ambiguous result !?

2005-10-24 Thread Dan Kogai

Christophe,

Thanks for your mail.  So far Encode::Guess does not guess the
encoding for a filehandle, not because it is impossible but because
Encode::Guess takes a very conservative -- even paranoid --
approach.  For example, ambiguity is reported as an error rather than
resolved to a preferred encoding.


One way to enable guessing on a filehandle goes like this.

use Encode::Guess;   # exports guess_encoding()

sub open_and_guess {
    my $filename = shift;
    open my $fh, '<:raw', $filename or return;   # or die if you like
    my $head = <$fh>;              # may not work for UTF-(16|32)
    my $enc  = guess_encoding($head);
    ref $enc or die $enc;          # or return $enc
    my $encname = $enc->name;
    seek $fh, 0, 0;
    binmode $fh, ":encoding($encname)";
    return $fh;
}

Here we open, guess, and rewind, but this does not work in the general
case: not all files are seekable and reopenable (e.g. pipes and
sockets).
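
For the narrower question of telling UTF-16LE, UTF-16BE and UTF-32 apart, a byte order mark can be inspected directly.  This is a plain BOM sniffer, not what Encode::Guess does internally; the name bom_encoding is hypothetical, and a file without a BOM simply yields no answer.

```perl
use strict;
use warnings;

# Hypothetical helper: name the Unicode encoding announced by a byte
# order mark at the start of $head, or return nothing if there is no BOM.
# UTF-32LE must be tested before UTF-16LE: both start with FF FE.
sub bom_encoding {
    my ($head) = @_;
    return 'UTF-32LE' if $head =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-32BE' if $head =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-8'    if $head =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $head =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $head =~ /\A\xFE\xFF/;
    return;
}

# "hi" in UTF-16LE, preceded by its BOM
print bom_encoding("\xFF\xFEh\x00i\x00") // 'no BOM', "\n";
```

One known ambiguity: UTF-16LE text whose first character after the BOM is U+0000 is indistinguishable from UTF-32LE by this test.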


To know more about Encode::Guess, try

http://search.cpan.org/~dankogai/Encode-2.12/lib/Encode/Guess.pm

Yours,

Dan the Encode Maintainer


On Oct 24, 2005, at 19:12 , HERMIER Christophe wrote:


Hello,

I am using the Encode::Guess module to detect the encoding of a
file before opening it.
Basically I believe that I can have various sorts of Unicode
encodings or Latin-1.


What I want to get is an encoding string to give back to open.
My code goes like this:

my $codage  = guess_encoding ( $debut );

if ( UNIVERSAL::isa ( $codage, "Encode::utf8" ) )
{
    $codage = ":utf8";
}
elsif ( UNIVERSAL::isa ( $codage, "Encode::Unicode" ) )
{
    $codage = ":encoding(utf-16)";
}
else
{
    $codage = ":encoding(iso-8859-1)";
}


The problem is with the Encode::Unicode case: I don't know if it
is UTF16-LE or UTF16-BE, and it could even be UTF32.

Is there a way to know that?


BTW, I checked your homepage (http://www.dan.co.jp/) first but it
does not seem to work?


Regards,
Christophe.







Re: UTF-16LE fails in substitution

2005-09-15 Thread Dan Kogai

On Sep 15, 2005, at 07:05 , Steve Larson wrote:

> What I want to do is add a version string comment at the beginning of
> .xml files.  I test to see if the file is UNICODE (Encode::Unicode) or
> ASCII (Encode::XS) using guess_encoding.  My ASCII case works fine but
> the regexp for the UNICODE case fails.  Below snippet is the code for
> the UNICODE case.


The answer is that PerlIO does not go well with BOMed UTFs.  What you
should do instead is read the whole file first, like this:


use Encode;   # exports decode() and encode()

open my $in, '<:raw', $filename or die "$filename : $!";
read $in, my $buf, -s $filename;   # one of many ways to slurp a file
close $in;
my $content = decode("UTF-16", $buf);   # LE or BE is not required
#
# do whatever you want to $content and
#
open my $out, '>:raw', $filename or die "$filename : $!";
print $out encode("UTF-16LE", $content);   # now be explicit on endianness
close $out;

Remember UTF-(16|32) does not go well with stream models.  Treat it  
as a binary file.


Dan the Encode Maintainer



[Encode] 2.12 Released!

2005-09-08 Thread Dan Kogai

Porters,

I am pleased to release Encode Version 2.12 as follows;

=head1 Availability

http://www.dan.co.jp/~dankogai/cpan/Encode-2.12.tar.gz
and CPAN near you.

=head1 Highlight

You can finally use coderef to CHECK.

   coderef for CHECK

   As of Encode 2.12 CHECK can also be a code reference which takes the
   ord value of the unmapped character as an argument and returns a
   string that represents the fallback character.  For instance,

     $ascii = encode("ascii", $utf8, sub{ sprintf "<U+%04X>", shift });

   Acts like FB_PERLQQ but <U+XXXX> is used instead of \x{XXXX}.
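
A small runnable version of the coderef CHECK shown above, as a sketch.  The wrapper name to_ascii_with_marker is hypothetical; encode() and the coderef form of CHECK are as described in this announcement.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode $text as ASCII; any character ASCII cannot represent is handed to
# the callback as its ordinal and replaced by whatever the callback returns.
sub to_ascii_with_marker {
    my ($text) = @_;
    return encode('ascii', $text, sub { sprintf '<U+%04X>', shift });
}

print to_ascii_with_marker("caf\x{E9} \x{263A}"), "\n";   # caf<U+00E9> <U+263A>
```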

=head1 Changes

$Revision: 2.12 $ $Date: 2005/09/08 14:17:17 $
! Encode.xs Encode.pm t/fallback.t
  Now accepts coderef for CHECK!
! ucm/8859-7.ucm
  Updated to newer version at unicode.org
  http://rt.cpan.org/NoAuth/Bug.html?id=14222
! lib/Encode/Supported.pod
  More POD typo fixed.
  [EMAIL PROTECTED]
! encoding.pm
  More POD typo leftover fixed.
  Message-Id: [EMAIL PROTECTED]

=head1 Signature

Dan the Encode Maintainer



Re: intelligent lexically encoding

2005-09-07 Thread Dan Kogai

On Sep 08, 2005, at 11:22 , Jerzy Giergiel wrote:
> sorry for bugging people here with a trivial question. I need to
> convert from MacRoman encoding to ASCII (7-bit). Encode package
> simply replaces out of range characters with a question mark. I
> need something intelligent lexically speaking. For example aacute
> should be converted to a. Any suggestions?


Maybe you need to implement your own fallback method.
FYI Encode already has fallback methods as follows.

  $ascii = encode("ascii", $utf8, $fallback);

  where

  $fallback is         á (U+00E1) will be
  ---------------------------------------
  Encode::FB_PERLQQ    \x{00E1}
  Encode::HTMLCREF     &#225;
  Encode::XMLCREF      &#xe1;
  ---------------------------------------

If any of those will suffice, go ahead and use it.  If not, you
have to go like this:


$ascii = $utf8;
$ascii =~ s/([^\x00-\x7f])/your_own_fallback($1)/eg;
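
One possible your_own_fallback(), as a sketch only: it uses Unicode::Normalize to decompose each character and drops the combining marks, which is my suggestion rather than anything Encode provides.  The wrapper to_plain_ascii is a hypothetical name, and the input is assumed to be already decoded text (for MacRoman input, the result of decode("MacRoman", $octets)).

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Decompose the character, drop the combining marks, and keep the
# result only if pure ASCII is left; otherwise fall back to '?'.
sub your_own_fallback {
    my ($char) = @_;
    (my $base = NFD($char)) =~ s/\p{Mn}+//g;
    return $base =~ /\A[\x00-\x7f]+\z/ ? $base : '?';
}

sub to_plain_ascii {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7f])/your_own_fallback($1)/eg;
    return $text;
}

# U+00E1 decomposes to "a" + combining acute; U+03A9 has no ASCII base.
print to_plain_ascii("\x{E1}rbol \x{3A9}"), "\n";   # arbol ?
```

Characters without a canonical decomposition (for example U+00F8, o with stroke) still come out as '?', so a real converter would need an extra lookup table for those.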

Hope that helps.

Dan the Encode Maintainer




Re: IRI support in URI and URI::Escape modules

2005-01-31 Thread Dan Kogai
On Jan 31, 2005, at 18:19, Martin Duerst wrote:
> I started with some very simple (I thought) tests, but got
> completely confused very quickly. Here is the short program
> that I was using:
>
>  test.pl
> use utf8;
> use URI;
> use URI::Escape;
> print (uri_escape("\xFD")
> [snip]
> With this, on perl, v5.6.1 built for MSWin32-x86-multi-thread
> (with 1 registered patch, see perl -V for more detail), I get
>
> %FD
> %C3%BD
> [snip]
> However, on perl, v5.8.4 built for i386-linux-thread-multi,
> I get:
>
> %FD
> [snip]
> Nothing seems to work anymore, although (or because?) 5.8
> has better Unicode support.

The (easiest|new canonical) way to go is to use uri_escape_utf8()
instead of uri_escape().  Note that as of version 3.28
uri_escape_utf8() is NOT exported automatically.

% perl -MURI::Escape -le 'print uri_escape("\xFD")'
%FD
% perl -MURI::Escape=uri_escape_utf8 -le 'print uri_escape_utf8("\xFD")'
%C3%BD

perldoc URI::Escape
   uri_escape_utf8( $string )
   uri_escape_utf8( $string, $unsafe )
       Works like uri_escape(), but will encode chars as UTF-8 before
       escaping them.  This makes this function able to deal with
       characters with code above 255 in $string.  Note that chars in
       the 128 .. 255 range will be escaped differently by this
       function compared to what uri_escape() would.  For chars in the
       0 .. 127 range there is no difference.

       The call:

           $uri = uri_escape_utf8($string);

       will be the same as:

           use Encode qw(encode);
           $uri = uri_escape(encode("UTF-8", $string));

       but will even work for perl-5.6 for chars in the 128 .. 255
       range.

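
To make the difference between %FD and %C3%BD concrete, here is a stand-in written with core modules only.  escape_utf8 is a hypothetical name and is not the URI::Escape implementation; it merely follows the recipe in the perldoc excerpt (encode to UTF-8 first, then percent-escape), with the set of characters left unescaped chosen by me.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-in for uri_escape_utf8(): encode to UTF-8 octets
# first, then percent-escape every octet that is not an RFC 3986
# "unreserved" character.
sub escape_utf8 {
    my ($string) = @_;
    my $octets = encode('UTF-8', $string);
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $octets;
}

print escape_utf8("\xFD"), "\n";        # %C3%BD
print escape_utf8("\x{263A}"), "\n";    # %E2%98%BA
```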
Dan the Encode Maintainer


real UTF-8 vs. utf8n_to_uvuni()

2004-12-04 Thread Dan Kogai
On Dec 05, 2004, at 10:56, Dan Kogai wrote:
> Thanks, applied in my repository.  New tests and documentation fix in
> progress.  When I am done w/ that, I will release Encode-2.0901 on my
> web (not CPAN yet).  When cross-checks by porters are done I will
> release Encode-2.10.
>
> Dan the Encode Maintainer

Now I am writing test suites and found that some of the strictures are
missing.

Surrogate -- OK
% perl -Mblib -MEncode -le '$a="\x{d801}"; print encode("UTF-8", $a, 1)'
"\x{d801}" does not map to utf8 at
/gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

U+FFFF -- OK
% perl -Mblib -MEncode -le '$a="\x{ffff}"; print encode("UTF-8", $a, 1)'
"\x{ffff}" does not map to utf8 at
/gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

Chars above U+10FFFF -- NOT OK
% perl -Mblib -MEncode -le '$a="\x{110000}"; print encode("UTF-8", $a, 1)'


Since Gisle's patch makes use of utf8n_to_uvuni(), it seems to be a
problem of the perl core.  So I have checked utf8.c, which defines that.
It seems it does not make use of PERL_UNICODE_MAX.

The patch against utf8.c below fixes that.
> ~/danperl/bin/perl5.8.6 -Mblib -MEncode -le '$a="\x{110000}"; print encode("UTF-8", $a, 1)'
"\x{00f4}" does not map to utf8 at
/gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

As you see, the warning is still funny.  But every case w/
UTF8_WARN_LONG is funny, as follows:

> perl -Mblib -MEncode -le '$a="\x{7fff_ffff}"; print encode("UTF-8", $a, 1)'
??
> perl -Mblib -MEncode -le '$a="\x{8000_0000}"; print encode("UTF-8", $a, 1)'
"\x{00fe}" does not map to utf8 at
/gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

I have tracked it down and found this warning is handled by Encode, so
Gisle and I can fix that.
Dan the Encode Maintainer
--- perl-5.8.x/utf8.c   Wed Nov 17 23:11:04 2004
+++ perl-5.8.x.dan/utf8.c   Sun Dec  5 11:38:52 2004
@@ -429,6 +429,13 @@
        }
        else
            uv = UTF8_ACCUMULATE(uv, *s);
+       /* Checks if ord() > 0x10FFFF -- dankogai */
+       if (uv > PERL_UNICODE_MAX){
+           if (!(flags & UTF8_ALLOW_LONG)) {
+               warning = UTF8_WARN_LONG;
+               goto malformed;
+           }
+       }
        if (!(uv > ouv)) {
            /* These cannot be allowed. */
            if (uv == ouv) {


Re: Make Encode.pm support the real UTF-8

2004-12-03 Thread Dan Kogai
On Dec 04, 2004, at 11:51, Larry Wall wrote:
> On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
> : I've no problem with 'utf8' being perl's unrestricted utf8 encoding,
> : but "UTF-8" is the name of the standard and should give the
> : corresponding behaviour.
>
> For what it's worth, that's how I've always kept them straight in my
> head.
>
> Also for what it's worth, Perl 6 will mostly default to strict but make
> it easy to switch back to lax.
>
> Larry

Okay, looks like the verdict is reached.

1.  utf8 will stay liberal
2.  UTF-8 will be strict

The rest is mostly implementation.

2.1.  What will the canonical name of the strict version of UTF-8 be?
      Gisle already submitted me a test patch and it uses 'utf-8-strict'.
      If there is no objection, I would like to use that.

2.2.  CAVEAT: "UTF8" will be utf8, not utf-8-strict, since Encode
      aliasing is case insensitive.

2.3.  Degree of stricture.  How strict are we going to make utf-8-strict?
      a. simply make use of UTF8_ALLOW_* in utf8.h?
      b. unmapped codepoints banned as well?
      IMHO a. is strict enough since mapped codepoints are subject to
      increase as the Unicode Standard updates.

2.4.  We can always make UTF-8 liberal by reapplying the alias.
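
On an Encode that implements the naming above, the mapping can be inspected with find_encoding().  This is a sketch for checking your own installation; the list of spellings is my choice.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Print the canonical encoding name that each spelling resolves to.
for my $alias (qw(utf8 UTF8 UTF-8 utf-8-strict)) {
    my $enc = find_encoding($alias);
    printf "%-12s => %s\n", $alias, $enc ? $enc->name : '(not found)';
}
```

Per points 2.1 and 2.2, "utf8" and "UTF8" should resolve to utf8, while "UTF-8" should resolve to utf-8-strict.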
Anything else missing?
Dan the Encode Maintainer


Re: clearing the utf8 flag

2004-11-09 Thread Dan Kogai
On Nov 10, 2004, at 01:30, Paul Bijnens wrote:
> I have a program that reads and writes (among others) strings that
> should be utf8 encoded.  I say "should", because somewhere deep
> inside the dark corners of that program, sometimes, the utf8 flag on
> a string is lost. (I'm still investigating where, tips to attack
> such a problem are welcome.)

Even when you try to set the UTF-8 flag on a string which consists
entirely of ASCII ( /^[\x00-\x7f]*$/ ), the UTF-8 flag will not be on.
See "The UTF-8 flag" section of 'perldoc Encode'.  Here is the short
summary.

perldoc Encode
   o When you decode, the resulting utf8 flag is on unless you can
     unambiguously represent data.  Here is the definition of
     dis-ambiguity.

     After $utf8 = decode('foo', $octet);,

       When $octet is...   The utf8 flag in $utf8 is
       ---------------------------------------------
       In ASCII only (or EBCDIC only)            OFF
       In ISO-8859-1                              ON
       In any other Encoding                      ON
       ---------------------------------------------

     As you see, there is one exception, In ASCII.  That way you can
     assume Goal #1.  And with Encode Goal #2 is assumed but you still
     have to be careful in such cases mentioned in CAVEAT paragraphs.

> When writing the string, the program clears the utf8 flag
> and writes a simple string of octets using:
>
>     $s = encode("utf8", $s) if $s =~ /[^\x00-\x7f]/;
>     $n = length($s);   # yes, we need length in bytes
>     ...
>     print $s;

If what you need is the byte length, you can simply use bytes as follows.
binmode is for print().

use bytes ();   # avoid imports
binmode STDOUT => ":utf8";
my $s = "\x{5c0f}\x{98fc} \x{5f3e}";
# ...
my $n = length($s);          # length in characters
my $l = bytes::length($s);   # length in bytes
# ...
print $s;
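
An alternative sketch that does not depend on the internal representation: encode explicitly and measure the result.  This is a different technique from bytes::length above, and utf8_byte_length is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: number of octets $string occupies once encoded
# as UTF-8, whatever the state of its internal utf8 flag.
sub utf8_byte_length {
    my ($string) = @_;
    return length encode('UTF-8', $string);
}

my $s = "\x{5c0f}\x{98fc} \x{5f3e}";
printf "%d characters, %d octets\n",
    length($s), utf8_byte_length($s);   # 4 characters, 10 octets
```

For a string such as "\xE9" that is stored internally as a single byte, this reports 2 (its size as UTF-8), whereas bytes::length reports the size of whatever Perl happens to hold internally.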
> Why would someone test for pure 7-bit strings instead of:
>
>     $s = encode("utf8", $s) if Encode::is_utf8($s);

For most cases you don't have to and you should not have to (unless you
maintain Encode and/or perl :).  Complex as it may be, the internal
UTF-8 flag was the best way to harness UTF-8 while keeping legacy,
byte-oriented scripts compatible.

> which seems superior to avoid double utf8 encodings,
> should the utf8-flag be lost.  And it's faster.
> Or even simply: Encode::_utf8_off($s)
> The problem is that I'm usually wrong.  Am I this time?
> Am I missing something?  Or do I need more coffee?

I have to admit that Encode and the Perl 5.8 way of handling Unicode
need more recipes (Perl Cookbook 2nd Ed. does cover that issue in Ch. 8
but it was hardly enough).

Dan the Encode Maintainer


Re: Help with uc and lc and utf8

2004-11-05 Thread Dan Kogai
On Nov 06, 2004, at 15:21, Robert D Oden wrote:
> I am not able to lower case a . I am sure I am missing something
> simple but I have spent many hours researching and trying different
> things to no avail.
>
> Any help would be appreciated!!

Make sure:

* You have saved your script in UTF-8, not Latin-1.
* You "use utf8" so that string literals are treated as UTF-8 strings.
* If you print, you set the filehandle layer to :utf8.

Try the script below (be sure to save it in UTF-8).  I got KThe =>
kthe.

#
use strict;
use utf8;
my $fname = 'KThe';
my $lc_fname = lc($fname);
binmode STDOUT => ":utf8";
print "$fname => $lc_fname\n";
__END__
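
The same principle applies when the text arrives as octets rather than as a literal in the source file: decode first, then change case.  A sketch with a hypothetical helper lc_utf8_octets; the sample word is mine, since the non-ASCII characters of the original example did not survive in this archive.

```perl
use strict;
use warnings;
use Encode qw(decode);

# lc() on decoded text: the octets are turned into characters first,
# so case mapping sees U+00C9 rather than the two bytes C3 89.
sub lc_utf8_octets {
    my ($octets) = @_;
    return lc decode('UTF-8', $octets);
}

my $lower = lc_utf8_octets("\xC3\x89COLE");   # UTF-8 octets for E-acute + "COLE"
binmode STDOUT, ':encoding(UTF-8)';
print "$lower\n";
```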
> Summary of my perl5 (revision 5.0 version 8 subversion 0)
> configuration:

You should upgrade it to 5.8.1 or above (5.8.5 being the latest).
5.8.0 was still premature Unicode-wise.

Dan the Encode Maintainer


Re: Encode-2.07 vs. PerlIO::encoding

2004-10-24 Thread Dan Kogai
On Oct 24, 2004, at 18:34, Rafael Garcia-Suarez wrote:
> Welcome to backward compatibility hell :)

Hell it was, but it seems I came up with a way out (yay).

> > I just want Encode::utf8->decode() to make sure Encode::RETURN_ON_ERR
> > is on when the caller is PerlIO::encoding...
>
> Or, one could backport PerlIO::encoding (with your patch) to CPAN and
> require this latest version for Encode 2.08.

That was what came across my mind first, but I found it was not good
enough to coerce Encode::RETURN_ON_ERR since $PerlIO::encoding::fallback
is open to the public (even documented!).

So far ->renew() is only used by PerlIO (and is meaningful only when
the object is Encode::Unicode).  In other words, you can tell it's
PerlIO that is calling you if the object is renewed.

The following patch does that.  The new Encode::utf8->decode() checks
$self->renewed and if so it sets Encode::RETURN_ON_ERR.  Here is the
patch, or you can wait for Encode-2.08.

Thankfully Encode::XS needs no real ->renew so it is left as is
(a dummy ->renewed() was introduced just to be safe).

Dan the Encode Maintainer
diff -ruN ext/Encode-2.07/Encode.xs ext/Encode/Encode.xs
--- ext/Encode-2.07/Encode.xs   Sat Oct 23 04:37:13 2004
+++ ext/Encode/Encode.xsSun Oct 24 20:31:06 2004
@@ -252,14 +252,6 @@
 PROTOTYPES: DISABLE
 void
-Method_renew(obj)
-SV *   obj
-CODE:
-{
-XSRETURN(1);
-}
-
-void
 Method_decode_xs(obj,src,check = 0)
 SV *   obj
 SV *   src
@@ -270,6 +262,28 @@
 U8 *s = (U8 *) SvPV(src, slen);
 U8 *e = (U8 *) SvEND(src);
 SV *dst = newSV(slen>0?slen:1); /* newSV() abhors 0 -- inaba */
+
+/*
+ * PerlO check -- we assume the object is of PerlIO if renewed
+ * and if so, we set RETURN_ON_ERR for partial character
+ */
+int renewed = 0;
+dSP; ENTER; SAVETMPS;
+PUSHMARK(sp);
+XPUSHs(obj);
+PUTBACK;
+if (call_method("renewed",G_SCALAR) == 1) {
+   SPAGAIN;
+   renewed = POPi;
+   PUTBACK;
+#if 0
+   fprintf(stderr, "renewed == %d\n", renewed);
+#endif
+   if (renewed){ check |= ENCODE_RETURN_ON_ERR; }
+}
+FREETMPS; LEAVE;
+/* end PerlIO check */
+
 SvPOK_only(dst);
 SvCUR_set(dst,0);
 if (SvUTF8(src)) {
@@ -397,6 +411,14 @@
 {
 XSRETURN(1);
 }
+
+int
+Method_renewed(obj)
+SV *obj
+CODE:
+RETVAL = 0;
+OUTPUT:
+RETVAL
 void
 Method_name(obj)
diff -ruN ext/Encode-2.07/Unicode/Unicode.pm 
ext/Encode/Unicode/Unicode.pm
--- ext/Encode-2.07/Unicode/Unicode.pm  Sat Oct 23 04:37:17 2004
+++ ext/Encode/Unicode/Unicode.pm   Sun Oct 24 20:38:16 2004
@@ -46,7 +46,7 @@
 my $self = shift;
 $BOM_Unknown{$self->name} or return $self;
 my $clone = bless { %$self } => ref($self);
-$clone->{clone} = 1; # so the caller knows it is renewed.
+$clone->{clone}++ # so the caller knows it is renewed.
 return $clone;
 }

diff -ruN ext/Encode-2.07/lib/Encode/Encoding.pm 
ext/Encode/lib/Encode/Encoding.pm
--- ext/Encode-2.07/lib/Encode/Encoding.pm  Sat Oct 23 04:37:13 2004
+++ ext/Encode/lib/Encode/Encoding.pm   Sun Oct 24 20:25:13 2004
@@ -5,6 +5,7 @@

 require Encode;
+sub DEBUG { 0 }
 sub Define
 {
 my $obj = shift;
@@ -16,7 +17,18 @@
 sub name  { return shift->{'Name'} }
-sub renew { return $_[0] }
+# sub renew { return $_[0] }
+
+sub renew {
+my $self = shift;
+my $clone = bless { %$self } => ref($self);
+$clone->{renewed}++; # so the caller can see it
+DEBUG and warn $clone->{renewed};
+return $clone;
+}
+
+sub renewed{ return $_[0]->{renewed} || 0 }
+
 *new_sequence = \&renew;
 sub needs_lines { 0 };
@@ -167,24 +179,28 @@
 Predefined As:
-  sub renew { return $_[0] }
+  sub renew {
+my $self = shift;
+my $clone = bless { %$self } => ref($self);
+$clone->{renewed}++;
+return $clone;
+  }
 This method reconstructs the encoding object if necessary.  If you need
 to store the state during encoding, this is where you clone your 
object.
-Here is an example:
-
-  sub renew {
-  my $self = shift;
-  my $clone = bless { %$self } => ref($self);
-  $clone->{clone} = 1; # so the caller can see it
-  return $clone;
-  }
-
-Since most encodings are stateless the default behavior is just return
-itself as shown above.

 PerlIO ALWAYS calls this method to make sure it has its own private
 encoding object.
+
+=item -E<gt>renewed
+
+Predefined As:
+
+  sub renewed { $_[0]->{renewed} || 0 }
+
+Tells whether the object is renewed (and how many times).  Some
+modules emit C<Use of uninitialized value in null operation> warning
+unless the value is numeric so return 0 for false.
 =item -E<gt>perlio_ok()


Re: Encode-2.07 vs. PerlIO::encoding

2004-10-24 Thread Dan Kogai
On Oct 24, 2004, at 20:50, Dan Kogai wrote:
> The following patch does that.  The new Encode::utf8->decode() checks
> $self->renewed and if so it sets Encode::RETURN_ON_ERR.  Here is the
> patch or you can wait for Encode-2.08.

One patch to Unicode/Unicode.xs was missing and Unicode/Unicode.pm was
garbled.  Here we go again, the patch against 2.07.  Forget the
previous patch.

Or wait for Encode-2.08.
Dan the Encode Maintainer
diff -ruN ext/Encode-2.07/Encode.xs ext/Encode/Encode.xs
--- ext/Encode-2.07/Encode.xs   Sat Oct 23 04:37:13 2004
+++ ext/Encode/Encode.xsSun Oct 24 20:31:06 2004
@@ -252,14 +252,6 @@
 PROTOTYPES: DISABLE
 void
-Method_renew(obj)
-SV *   obj
-CODE:
-{
-XSRETURN(1);
-}
-
-void
 Method_decode_xs(obj,src,check = 0)
 SV *   obj
 SV *   src
@@ -270,6 +262,28 @@
 U8 *s = (U8 *) SvPV(src, slen);
 U8 *e = (U8 *) SvEND(src);
 SV *dst = newSV(slen>0?slen:1); /* newSV() abhors 0 -- inaba */
+
+/*
+ * PerlO check -- we assume the object is of PerlIO if renewed
+ * and if so, we set RETURN_ON_ERR for partial character
+ */
+int renewed = 0;
+dSP; ENTER; SAVETMPS;
+PUSHMARK(sp);
+XPUSHs(obj);
+PUTBACK;
+if (call_method("renewed",G_SCALAR) == 1) {
+   SPAGAIN;
+   renewed = POPi;
+   PUTBACK;
+#if 0
+   fprintf(stderr, "renewed == %d\n", renewed);
+#endif
+   if (renewed){ check |= ENCODE_RETURN_ON_ERR; }
+}
+FREETMPS; LEAVE;
+/* end PerlIO check */
+
 SvPOK_only(dst);
 SvCUR_set(dst,0);
 if (SvUTF8(src)) {
@@ -397,6 +411,14 @@
 {
 XSRETURN(1);
 }
+
+int
+Method_renewed(obj)
+SV *obj
+CODE:
+RETVAL = 0;
+OUTPUT:
+RETVAL
 void
 Method_name(obj)
diff -ruN ext/Encode-2.07/Unicode/Unicode.pm 
ext/Encode/Unicode/Unicode.pm
--- ext/Encode-2.07/Unicode/Unicode.pm  Sat Oct 23 04:37:17 2004
+++ ext/Encode/Unicode/Unicode.pm   Sun Oct 24 21:20:22 2004
@@ -46,7 +46,7 @@
 my $self = shift;
 $BOM_Unknown{$self->name} or return $self;
 my $clone = bless { %$self } => ref($self);
-$clone->{clone} = 1; # so the caller knows it is renewed.
+$clone->{renewed}++; # so the caller knows it is renewed.
 return $clone;
 }

diff -ruN ext/Encode-2.07/Unicode/Unicode.xs 
ext/Encode/Unicode/Unicode.xs
--- ext/Encode-2.07/Unicode/Unicode.xs  Sat Oct 23 04:37:21 2004
+++ ext/Encode/Unicode/Unicode.xs   Sun Oct 24 21:20:22 2004
@@ -1,5 +1,5 @@
 /*
- $Id: Unicode.xs,v 2.0 2004/05/16 20:55:16 dankogai Exp $
+ $Id: Unicode.xs,v 2.0 2004/05/16 20:55:16 dankogai Exp dankogai $
  */

 #define PERL_NO_GET_CONTEXT
@@ -97,7 +97,7 @@
 U8 endian   = *((U8 *)SvPV_nolen(attr("endian", 6)));
 int size=   SvIV(attr("size",   4));
 int ucs2= SvTRUE(attr("ucs2",   4));
-int clone   = SvTRUE(attr("clone",  5));
+int renewed = SvTRUE(attr("renewed",  7));
 SV *result  = newSVpvn("",0);
 STRLEN ulen;
 U8 *s = (U8 *)SvPVbyte(str,ulen);
@@ -124,7 +124,7 @@
}
 #if 1
/* Update endian for next sequence */
-   if (clone) {
+   if (renewed) {
        hv_store((HV *)SvRV(obj),"endian",6,newSVpv((char *)&endian,1),0);
}
 #endif
@@ -200,7 +200,7 @@
 U8 endian   = *((U8 *)SvPV_nolen(attr("endian", 6)));
 int size=   SvIV(attr("size",   4));
 int ucs2= SvTRUE(attr("ucs2",   4));
-int clone   = SvTRUE(attr("clone",  5));
+int renewed = SvTRUE(attr("renewed",  7));
 SV *result  = newSVpvn("",0);
 STRLEN ulen;
 U8 *s = (U8 *)SvPVutf8(utf8,ulen);
@@ -211,7 +211,7 @@
enc_pack(aTHX_ result,size,endian,BOM_BE);
 #if 1
/* Update endian for next sequence */
-   if (clone){
+   if (renewed){
        hv_store((HV *)SvRV(obj),"endian",6,newSVpv((char *)&endian,1),0);
}
 #endif
diff -ruN ext/Encode-2.07/lib/Encode/Encoding.pm 
ext/Encode/lib/Encode/Encoding.pm
--- ext/Encode-2.07/lib/Encode/Encoding.pm  Sat Oct 23 04:37:13 2004
+++ ext/Encode/lib/Encode/Encoding.pm   Sun Oct 24 20:25:13 2004
@@ -5,6 +5,7 @@

 require Encode;
+sub DEBUG { 0 }
 sub Define
 {
 my $obj = shift;
@@ -16,7 +17,18 @@
 sub name  { return shift->{'Name'} }
-sub renew { return $_[0] }
+# sub renew { return $_[0] }
+
+sub renew {
+my $self = shift;
+my $clone = bless { %$self } => ref($self);
+$clone->{renewed}++; # so the caller can see it
+DEBUG and warn $clone->{renewed};
+return $clone;
+}
+
+sub renewed{ return $_[0]->{renewed} || 0 }
+
 *new_sequence = \&renew;
 sub needs_lines { 0 };
@@ -167,24 +179,28 @@
 Predefined As:
-  sub renew { return $_[0] }
+  sub renew {
+my $self = shift;
+my $clone = bless { %$self } => ref($self);
+$clone->{renewed}++;
+return $clone;
+  }
 This method reconstructs the encoding object if necessary.  If you need
 to store the state during encoding, this is where you clone your 
object.
-Here is an example:
-
-  sub renew {
-  my $self = shift;
-  my $clone = bless { %$self } => ref($self);
-  $clone->{clone} = 1; # so the caller can see it
-  return

[Encode] 2.08 released

2004-10-24 Thread Dan Kogai
Porters,
On Oct 24, 2004, at 20:50, Dan Kogai wrote:
> > The following patch does that.  The new Encode::utf8->decode() checks
> > $self->renewed and if so it sets Encode::RETURN_ON_ERR.  Here is the
> > patch or you can wait for Encode-2.08.
>
> One patch to Unicode/Unicode.xs was missing and Unicode/Unicode.pm was
> garbled.  Here we go again, the patch against 2.07.  Forget the
> previous patch.
>
> Or wait for Encode-2.08
And here comes Encode-2.08.  If you are by any chance using 
Encode-2.07, upgrade RIGHT NOW!

=head1 Tested
As follows:
 Perl 5.8.3 on Mac OS X v10.3.5 (/usr/bin/perl, post-built as in CPAN)
 Perl 5.8.5 on Mac OS X v10.3.5 (post-built)
on FreeBSD 4.10-STABLE (post-built)
 bleedperl  on Mac OS X v10.3.5 (integrally built w/ whole perl dist)
on FreeBSD 4.10-STABLE (integrally built)
=head1 Availability
http://www.dan.co.jp/~dankogai/cpan/Encode-2.08.tar.gz
or CPAN near you
=head1 Changes
$Revision: 2.8 $ $Date: 2004/10/24 13:00:29 $
! Encode.xs lib/Encode/Encoding.pm  Unicode/Unicode.{pm,xs}
  Resolved the issue that was raised by the Encode::utf8 fallbacks vs.
  PerlIO::encoding issue that was introduced in 2.07.  This is done by
  making use of ->renew() method that used to be used only by
  Encode::Unicode.  ->renewed() method was also introduced to fetch
  the value thereof.
  Message-Id: [EMAIL PROTECTED]
=head1 Epilogue
Enjoy!
Dan the Encode Maintainer


Re: [cpan #8089] Encode::utf8::decode_xs does not check partial chars

2004-10-24 Thread Dan Kogai
On Oct 25, 2004, at 03:01, Nick Ing-Simmons wrote:
> But as Dan said at the start, \xF6 on its own (say, as octet 1023
> in a 0..1023 1024-octet buffer) is not a fail.
> Changing that will make the :encoding() layer have problems, as buffer
> boundaries can occur in the middle of characters.

Right.  Encode-2.07 indeed had the problem, causing bleedperl to fail
ext/PerlIO/t/encoding.t, test 14.  Encode-2.08 corrected the problem
by checking if the caller is PerlIO and, if so, setting
Encode::RETURN_ON_ERR so it breaks out of the loop in the partial
character case.

I believe I have checked & tested enough but I would appreciate it if you
guys take a look, especially at Encode.xs and t/fallback.t.
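
The buffer-boundary situation can be reproduced outside PerlIO.  The sketch below is not the PerlIO::encoding code; decode_chunks is a hypothetical helper that uses Encode::FB_QUIET, for which decode() returns what it could process and leaves the unprocessed octets in its source argument.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode UTF-8 that arrives in arbitrary chunks.
# Octets that do not yet form a complete character stay in $pending and
# are retried once the next chunk has been appended.
sub decode_chunks {
    my @chunks = @_;
    my ($pending, $text) = ('', '');
    for my $chunk (@chunks) {
        $pending .= $chunk;
        $text .= decode('UTF-8', $pending, Encode::FB_QUIET);
        # $pending now holds only the octets decode() could not use yet
    }
    return ($text, $pending);
}

# U+00F6 is C3 B6 in UTF-8; the chunk boundary falls between the two.
my ($text, $left) = decode_chunks("Bj\xC3", "\xB6rn");
binmode STDOUT, ':encoding(UTF-8)';
print "$text (", length($left), " octets left over)\n";
```

A genuinely malformed octet in the middle of the stream would remain stuck in $pending forever with this sketch; real code has to decide at end of input whether leftover octets are an error.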

Dan the Encode Maintainer


[Encode] 2.06 Released

2004-10-22 Thread Dan Kogai
Porters,
I just updated Encode to version 2.06.
=head1 Availability
http://www.dan.co.jp/~dankogai/cpan/Encode-2.06.tar.gz
or CPAN near you
=head1 Changes
$Revision: 2.6 $ $Date: 2004/10/22 06:23:11 $
! ucm/mac*
  RT #8083 reports that MacThai mapping was obsolete
  Updated all mac* encodings accordingly to the URI below.
  One remaining mystery is that MacRomanian vs. MacRumanian.
  MacRumanian is not found in unicode.org...
  http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/
! Encode.pm t/Encode.t
  Fixed RT #8081: decode(..., bless{},'x') segfault
  Two more tests added to test that.
  http://rt.cpan.org/NoAuth/Bug.html?id=8081
! Encode.pm
  POD revised accordingly to RT #7966
  http://rt.cpan.org/NoAuth/Bug.html?id=7966
! Unicode/Unicode.pm
  POD updated explaining why Encode::Unicode always croaks on error
  rather than giving users choices.
  http://rt.cpan.org/NoAuth/Bug.html?id=7892
=head1 Signature
Dan the Encode Maintainer


Re: [cpan #8089] Encode::utf8::decode_xs does not check partial chars

2004-10-22 Thread Dan Kogai
On Oct 22, 2004, at 20:42, Bjoern Hoehrmann wrote:
No, you misread the bug report, I expect that
  perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rn))"
  perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rnx))"
behave the same in that the malformed sequence \xF6 gets replaced by
U+FFFD as documented in `perldoc Encode` for check = Encode::FB_DEFAULT.
Encode::utf8::decode_xs() fails to do that for the reason outlined in my
bug report so the current result is
\xF6 ALONE does not mean that the sequence is malformed.  Try
  perl -Mencoding=utf8 -le 'print "\x{180000}"' | hexdump -C
Though unicode.org does not assign any character to U+180000 (yet), 
\xF6\x80\x80\x80 is a valid UTF-8 character from perl's point of 
view.  Perl only finds it corrupted when it reaches the following 'r'.

In such cases, WHAT PART OF THE SEQUENCE IS CORRUPTED? \xF6? Or the 
following 'r'? Or 3 more octets? (FYI that's what \xF6 suggests from 
UTF-8's point of view.)

  Bj
  Bj\x{FFFD}rnx
it should be
  Bj\x{FFFD}rn
  Bj\x{FFFD}rnx
So you can't really say which behavior is correct.
I fail to see what this has to do with how Perl treats the string as
from a Perl perspective there is no real difference here, Perl works
as expected, decode() does not.
(I've posted this to RT but it again does not show up there, see
http://lists.w3.org/Archives/Public/www-archive/2004Oct/0044.html).
IMHO the current implementation is correct since you can't really tell 
if the sequence is corrupted just by looking at a given octet.  At the 
same time I believe this should be documented somehow somewhere.

Dan the Encode Maintainer


Re: [cpan #8089] Encode::utf8::decode_xs does not check partial chars

2004-10-22 Thread Dan Kogai
On Oct 23, 2004, at 01:04, Bjoern Hoehrmann wrote:
C12a in Unicode 4.0.1 notes
[...]
  For example, in UTF-8 every code unit of the form 110xxxxx must be
  followed by a code unit of the form 10xxxxxx. A sequence such as
  110xxxxx 0xxxxxxx is ill-formed and must never be generated. When
  faced with this ill-formed code unit sequence while transforming or
  interpreting text, a conformant process must treat the first code
  unit 110xxxxx as an illegally terminated code unit sequence--for
  example, by signaling an error, filtering the code unit out, or
  representing the code unit with a marker such as U+FFFD
[...]
[snip]
Okay, you win.  You have convinced me that Encode::utf8 should behave 
the same as Encode::XS (UCM-based encodings).  And the patch to make it 
that way is deceptively simple, as follows:

===
RCS file: Encode.xs,v
retrieving revision 2.0
diff -u -r2.0 Encode.xs
--- Encode.xs   2004/05/16 20:55:15 2.0
+++ Encode.xs   2004/10/22 18:00:29
@@ -297,7 +297,7 @@
             U8 skip = UTF8SKIP(s);
             if ((s + skip) > e) {
                 /* Partial character - done */
-                break;
+                goto decode_utf8_fallback;
             }
             else if (is_utf8_char(s)) {
                 /* Whole char is good */
@@ -313,6 +313,7 @@
             /* Invalid start byte */
         }
         /* If we get here there is something wrong with alleged UTF-8 */
+    decode_utf8_fallback:
         if (check & ENCODE_DIE_ON_ERR){
             Perl_croak(aTHX_ ERR_DECODE_NOMAP, "utf8", (UV)*s);
             XSRETURN(0);
===
The most decisive comment of yours is this:
holds true and I expect that
  my $x = "Bj\xF6rn"; # as well as "Bj\xF6r" and "Bj\xF6"
  decode("utf-8", $x, Encode::FB_CROAK);
croaks.
Which apparently it did not.  Thank you for being so persistent on this 
problem.  I'd be honored to add your name to the AUTHORS file for this.
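
The expectation quoted above can be checked with a short script.  This is
only a sketch for an Encode release that includes the fix; the helper name
croaks_on is made up for this note, and the input is copied before
decoding because decode() may modify its source when CHECK is set.

```perl
use strict;
use warnings;
use Encode ();

# Return 1 if decoding $octets as UTF-8 croaks under FB_CROAK, else 0.
sub croaks_on {
    my ($octets) = @_;
    my $copy = $octets;    # decode() may modify its source when CHECK is set
    my $ok = eval { Encode::decode('utf-8', $copy, Encode::FB_CROAK); 1 };
    return $ok ? 0 : 1;
}

for my $octets ("Bj\xF6rn", "Bj\xF6r", "Bj\xF6") {
    my $shown = join ' ', map { sprintf '%02X', ord } split //, $octets;
    printf "%-16s %s\n", $shown, croaks_on($octets) ? 'croaks' : 'decodes';
}

# With the default CHECK (FB_DEFAULT) the malformed octet becomes U+FFFD.
my $replaced = Encode::decode('utf-8', "Bj\xF6rn");
```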

I will $Encode::VERSION++ as soon as I am done w/ the test suites and 
Tels's patch.  This time I will be careful not to screw up 
(maint|blead)perl so give me some time before the update is ready (but 
I won't keep you waiting for too long since the 5.8.6 deadline is soon).

Your statement about \xF6\x80\x80\x80 is interesting, Encode::is_utf8 is
documented as

[...]
  is_utf8(STRING [, CHECK])
[INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING.
If CHECK is true, also checks the data in STRING for being
well-formed UTF-8. Returns true if successful, false otherwise.
[...]
And D36 in Unicode 4.0.1 is very clear that
[...]
  As a consequence of the well-formedness conditions specified in Table
  3-6, the following byte values are disallowed in UTF-8: C0-C1, F5-FF.
[...]
That's because perl's notion of Unicode is broader than that of 
unicode.org.  So far unicode.org's mapping only spans from U+0000 to 
U+10FFFF, while that of perl goes up to U+FFFFFFFF or even 
U+FFFFFFFFFFFFFFFF (in other words, MAX_UINT).  See Camel 3 for details.
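
A small sketch of that broader notion: perl's own (lax) UTF-8 will happily
encode a code point beyond U+10FFFF, which is where the \xF6\x80\x80\x80
sequence discussed in this thread comes from.  The helper name
perl_utf8_hex is an assumption for this note; utf8::encode() is the core
function that does the work.

```perl
use strict;
use warnings;

# Return the octets perl uses internally for code point $cp, as hex.
sub perl_utf8_hex {
    my ($cp) = @_;
    my $octets = chr $cp;
    utf8::encode($octets);    # character string -> perl's (lax) UTF-8 octets
    return join ' ', map { sprintf '%02X', ord } split //, $octets;
}

print perl_utf8_hex(0x3042),   "\n";    # a code point inside Unicode
print perl_utf8_hex(0x180000), "\n";    # beyond U+10FFFF, still encodable
```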

And I think we can leave this :)
Dan the Encode Maintainer


[Encode] 2.07 Released

2004-10-22 Thread Dan Kogai
Porters
On Oct 22, 2004, at 15:31, Dan Kogai wrote:
I just updated Encode to version 2.06.
Within less than 24hrs I resorted to releasing version 2.07.  What the 
heck.  5.8.6 is coming soon.

=head1 Availability
http://www.dan.co.jp/~dankogai/cpan/Encode-2.07.tar.gz
or CPAN near you
=head1 Changes
$Revision: 2.7 $ $Date: 2004/10/22 19:35:52 $
! lib/Encode/Encoding.pm
  Remove Carp from warnings.pm that influences Encode, by Tels.
  Message-Id: [EMAIL PROTECTED]
! Encode.xs AUTHORS t/fallback.t
  Now Encode::utf8's fallbacks are compliant to Encode standard.
  Thank Bjoern Hoehrmann for persistently convincing me.
  Message-Id: [EMAIL PROTECTED]
! Encode.pm
  POD further revised.
=head1 Signature
Dan the Encode Maintainer


[Encode] 2.03 released

2004-10-06 Thread Dan Kogai
Porters,
I have released Encode 2.03 at last;  The code was done about 10 days 
ago but I needed a tester to verify 'piconv vs. Win32' case.  Thanks, 
Steve.

=head1 Availability
http://www.dan.co.jp/~dankogai/Encode-2.03.tar.gz
or CPAN near you (as soon as CPAN is fixed)
=head1 Changes
$Revision: 2.3 $ $Date: 2004/10/06 05:07:20 $
  lib/Encode/Alias.pm
Resolved some alias case sensitivity glitches reported via RT.
http://rt.cpan.org/NoAuth/Bug.html?id=7835
  bin/piconv
Resolved Win32 glitches reported via RT.
(Fixed by dankogai and tested by Steve Hay)
http://rt.cpan.org/Ticket/Display.html?id=7831
  JP/JP.pm lib/Encode/Alias.pm lib/Encode/Supported.pod AUTHORS
/\bwindows-31j$/i is now an alias of CP932, by Steve Hay.
http://rt.cpan.org/NoAuth/Bug.html?id=6695
Yours,
Dan the Encode Maintainer


Re: [Encode] Request for testing: piconv vs. Win32

2004-10-05 Thread Dan Kogai
On Oct 06, 2004, at 02:12, Steve Hay wrote:
I created the attached "in" file (utf-8) and ran the command-line in the
original bug report:

piconv -f utf-8 -t UTF-16LE in > out
This produces the attached out file, which looks right to me.
I also saved the in file to another name in UTF-16LE format using
Windows' Notepad program and the file that it output was identical to
the out file produced by piconv.
So that's a thumbs-up from me.  I'm running Encode 2.02 with perl 5.8.5
on Windows XP.
Thanks a meg.  I'll $Encode::VERSION++ right away.
Dan the Encode Maintainer


[Encode] Request for testing: piconv vs. Win32

2004-10-05 Thread Dan Kogai
Porters,
Can somebody w/ Win32 access test the patch below, mentioned in RT#7831: 
"piconv with ascii-incompatible output breaks on Win32"?

http://rt.cpan.org/NoAuth/Bug.html?id=7831
I have submitted the patch but there was no response from the reporter 
and I do not have access to Win32 platforms right now.

Dan the Encode Maintainer

 perl -MEncode -e "local$/;$_=<>;binmode(STDOUT);Encode::from_to($_,'utf-8','UTF-16LE');print" in > out
This should be resolved by applying binmode to the input filehandle.  
The patch below
should fix that.  Would you try and tell me what happens?

Dan the Maintainer Thereof
--- bin/piconv  2004/05/16 20:55:16 2.0
+++ bin/piconv  2004/09/30 18:40:17
@@ -1,5 +1,5 @@
 #!./perl
-# $Id: piconv,v 2.0 2004/05/16 20:55:16 dankogai Exp dankogai $
+# $Id: piconv,v 2.0 2004/05/16 20:55:16 dankogai Exp $
 #
 use 5.8.0;
 use strict;
@@ -52,25 +52,39 @@
 EOT
 }
-# default
-if ($scheme eq 'from_to'){
-    while(<>){
-        Encode::from_to($_, $from, $to, $Opt{check}); print;
-    };
-# step-by-step
-}elsif ($scheme eq 'decode_encode'){
-    while(<>){
-        my $decoded = decode($from, $_, $Opt{check});
-        my $encoded = encode($to, $decoded);
-        print $encoded;
-    };
-# NI-S favorite
-}elsif ($scheme eq 'perlio'){
-    binmode(STDIN,  ":encoding($from)");
-    binmode(STDOUT, ":encoding($to)");
-    while(<>){ print; }
-} else { # won't reach
-    die "$name: unknown scheme: $scheme";
+# we do not use <> (or ARGV) for the sake of binmode()
+@ARGV or push @ARGV, \*STDIN;
+
+unless ($scheme eq 'perlio'){
+    binmode STDOUT;
+    for my $argv (@ARGV){
+        my $ifh = ref $argv ? $argv : undef;
+        $ifh or open $ifh, "<", $argv or next;
+        binmode $ifh;
+        if ($scheme eq 'from_to'){  # default
+            while(<$ifh>){
+                Encode::from_to($_, $from, $to, $Opt{check});
+                print;
+            }
+        }elsif ($scheme eq 'decode_encode'){ # step-by-step
+            while(<$ifh>){
+                my $decoded = decode($from, $_, $Opt{check});
+                my $encoded = encode($to, $decoded);
+                print $encoded;
+            }
+        } else { # won't reach
+            die "$name: unknown scheme: $scheme";
+        }
+    }
+}else{
+    # NI-S favorite
+    binmode STDOUT => "raw:encoding($to)";
+    for my $argv (@ARGV){
+        my $ifh = ref $argv ? $argv : undef;
+        $ifh or open $ifh, "<", $argv or next;
+        binmode $ifh => "raw:encoding($from)";
+        print while(<$ifh>);
+    }
 }
 sub list_encodings{



[Encode] 2.00 released!

2004-05-16 Thread Dan Kogai
Porters,
I have just released Encode version 2.00.  Though major version has 
been incremented, there is no big feature (addition|change)s.

=head1 AVAILABILITY
http://www.dan.co.jp/~dankogai/Encode-2.00.tar.gz
or CPAN near you
=head1 CHANGES
$Revision: 2.0 $ $Date: 2004/05/16 20:55:15 $
* version updated to 2.00
   -- sorry, no big feature change.  I just hate version 1.100 :)
! lib/Encode/Guess.pm
  Unicode/Unicode.pm
  addressed  UTF-(8|32LE) + BOM misguessing
  https://rt.cpan.org/Ticket/Display.html?id=6279
! Encode.pm
  s/is_utif8/is_utf8/ in POD
! Encode/lib/Encode/CN/HZ.pm
  Fixes make test failure after the patch to pp_hot.c
  by Sadahiro-san
  Message-Id: [EMAIL PROTECTED]
! bin/piconv
  From:   [EMAIL PROTECTED]
  Subject: [PATCH] piconv -C 512 badly broken
  Message-Id: [EMAIL PROTECTED]
Some of the changes are already committed in Perl 5.8.[34] and 
maintperl, but without a new release older perls are left behind, so I 
released it.

Enjoy!
Dan the Encode Maintainer


Re: UTF8 behavior under -T (Taint) mode

2004-01-01 Thread Dan Kogai
On Jan 01, 2004, at 12:32, Masanori HATA wrote:
Hello,

I have a simple question:

It seems that utf8::decode() does not work for
any tainted variables under the -T (Taint) mode.
Is it right?
Wrong.

What drove you to such a conclusion?  It does work.  Try something like

  perl -T -le 'utf8::decode($ARGV[0])' something

and see it for yourself.  Did perl die with an "Insecure ..." message?

Dan the Perl5 Porter



Re: UTF8 behavior under -T (Taint) mode

2004-01-01 Thread Dan Kogai
On Jan 01, 2004, at 21:49, Masanori HATA wrote:
Sorry, no. Since the case which I would like to suggest
seems not to be fatal. Perl would not die, but it would
take the tainted value as a Non-UTF8 string.
My sample code is like below (test.pl):
-
utf8::decode(my $text0 = "\xE3\x81\x82"); # clean (UTF-8 octets of U+3042)
utf8::decode(my $arg   = $ARGV[0]); # tainted
utf8::decode(my $text1 = "$arg$text0"); # tainted
utf8::decode(my $text2 = "$text0$arg"); # tainted
print length($text1), "\n";
print length($text2), "\n";
-
Aha!  I see your point at last.  And I found your argument was correct.

When I run this code with 'perl -T test.pl a', the result is:
To make your point clear, I have modified your script with Devel::Peek.  
Pay attention to the $text1 result.

without -T
% perl test.pl a
SV = PV(0x812354) at 0x80a960
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x428090 "a\343\201\202"\0 [UTF8 "a\x{3042}"]
  CUR = 4
  LEN = 5
2
SV = PV(0x812e10) at 0x80f2a8
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x405150 "\343\201\202a"\0 [UTF8 "\x{3042}a"]
  CUR = 4
  LEN = 5
2
with -T
% perl -T test.pl a
SV = PVMG(0x819a88) at 0x80a954
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,GMG,SMG,pPOK)
  IV = 0
  NV = 0
  PV = 0x428540 "a\343\201\202"\0
  CUR = 4
  LEN = 5
  MAGIC = 0x405480
MG_VIRTUAL = PL_vtbl_taint
MG_TYPE = PERL_MAGIC_taint(t)
MG_LEN = 1
4
SV = PVMG(0x819af4) at 0x80f69c
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,GMG,SMG,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x4054e0 "\343\201\202a"\0 [UTF8 "\x{3042}a"]
  CUR = 4
  LEN = 5
  MAGIC = 0x4010d0
MG_VIRTUAL = PL_vtbl_taint
MG_TYPE = PERL_MAGIC_taint(t)
MG_LEN = 1
2
I am not sure how severe it is but this is a bug indeed.

(My system is perl5.8.1 MSWin32-X86-multi-thread)
I have duplicated the result with Perl 5.8.2 on Mac OS X as well as 
[EMAIL PROTECTED] on FreeBSD.  Using Encode::decode_utf8 does not 
help either because it simply calls utf8::decode.  And you can't use 
Encode::decode("utf8", ...) in this particular case because 
Encode::decode() checks and croaks with "Cannot decode string with wide 
characters".  Hmm
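
For anyone who wants to check a given perl without reading Devel::Peek
dumps, here is a rough diagnostic sketch.  The helper name decode_report
is hypothetical; utf8::decode() and utf8::is_utf8() are the core functions
involved.  Run it under -T with a tainted argument to see whether the
flag is lost as described above.

```perl
use strict;
use warnings;

# Summarise what utf8::decode() did to a copy of $octets.
sub decode_report {
    my ($octets) = @_;
    my $copy    = $octets;
    my $success = utf8::decode($copy);    # true if $copy was valid UTF-8
    return {
        success => $success ? 1 : 0,
        flagged => utf8::is_utf8($copy) ? 1 : 0,
        length  => length $copy,
    };
}

my $report = decode_report("a\xE3\x81\x82");    # "a" followed by U+3042
printf "success=%d flagged=%d length=%d\n",
    @{$report}{qw(success flagged length)};
```

On an untainted string the length should drop from 4 octets to 2
characters; a length of 4 after decoding is the symptom reported in this
thread.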

Dan the Perl5 Porter



Re: [PATCH] piconv -C 512 doesn't work.

2003-09-26 Thread Dan Kogai
On Friday, Sep 26, 2003, at 23:18 Asia/Tokyo, Autrijus Tang wrote:
It is unfortunate that this very handy utility breaks:

$ piconv -f utf8 -t ascii -C 1024
Can't open 1024: No such file or directory at piconv line 60.
Thanks, applied in my repository.

Dan the Encode Maintainer




Re: Inverse of /\p{script}/

2003-08-29 Thread Dan Kogai
On Thursday, Aug 28, 2003, at 23:16 Asia/Tokyo, [EMAIL PROTECTED]  
wrote:
Does the existing perl5.8.* Unicode support have a way to efficiently
determine which script(s) or block (in unicode sense) a code point
belongs to?

In Unicode-aware Tk I am still doing battle with mechanism to select
X11 font to display a particular codepoint (for now glossing over
glyph vs character issues).
The present code is still rather dumb.
That's what Encode::InCharset is for.  Available via CPAN.

http://search.cpan.org/author/DANKOGAI/Encode-InCharset-0.03/

It seems to make sense to have a hash which maps script names to
probable (font) encodings
 (Hiragana | Katakana | Han) => 'jisx0208.1990-0'
The module makes it \p{InJIS0208} ...

 (Greek) => 'iso8859-7',
And \p{InISO_8859_7}, respectively.

So given a (1 character) string how do I get the Unicode script/block it 
is in?
One caveat, however.  It is slightly out of sync w/ the latest Encode.  
You should stay away from vendor encodings that were thoroughly revised 
in Encode 1.75 - 1.98 (FYI Encode::InCharset is still based upon 1.75).
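
As a side note, the script/block question itself can also be answered with
the core Unicode::UCD module, which is a different technique from the
Encode::InCharset properties recommended above.  The sketch below assumes
the script-to-font-encoding hash proposed in the question; the helper name
classify_char and the exact hash layout are made up for this note.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript charblock);

# Script name => X11 font encoding; the entries come from the thread.
my %font_encoding_for = (
    Hiragana => 'jisx0208.1990-0',
    Katakana => 'jisx0208.1990-0',
    Han      => 'jisx0208.1990-0',
    Greek    => 'iso8859-7',
);

# Return (script, block, font encoding or undef) for a one-character string.
sub classify_char {
    my ($char) = @_;
    my $code   = ord $char;
    my $script = charscript($code);
    my $block  = charblock($code);
    my $font   = defined $script ? $font_encoding_for{$script} : undef;
    return ($script, $block, $font);
}

for my $char ("\x{3042}", "\x{03B1}") {
    my ($script, $block, $font) = classify_char($char);
    printf "U+%04X script=%s block=%s font=%s\n",
        ord $char, $script, $block, defined $font ? $font : '(none)';
}
```

Unicode::UCD reads the Unicode database files on demand, so for a hot path
in a GUI toolkit the results would probably need caching.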

Dan the Encode Maintainer






Re: [Patch] Encode.pm : euro sign missing in cp936.ucm

2003-03-26 Thread Dan Kogai
SADAHIRO-san and cp9?? experts,

On Thursday, Mar 27, 2003, at 00:44 Asia/Tokyo, SADAHIRO Tomoyuki wrote:
+U20AC \x80 |0 # EURO SIGN
Is this right?  Yes, U20AC is indeed missing from cp936.ucm but see 
this;

grep U20AC ucm/cp*.ucm
/Users/dankogai/work/Encode/ucm/cp1250.ucm:U20AC \x80 |0 # EURO SIGN
/Users/dankogai/work/Encode/ucm/cp1251.ucm:U20AC \x88 |0 # EURO SIGN
/Users/dankogai/work/Encode/ucm/cp1252.ucm:U20AC \x80 |0 # EURO SIGN
/Users/dankogai/work/Encode/ucm/cp1253.ucm:U20AC \x80 |0 # EURO SIGN
/Users/dankogai/work/Encode/ucm/cp1254.ucm:U20AC \x80 |0 # EURO SIGN
/Users/dankogai/work/Encode/ucm/cp1255.ucm:U20AC \x80 |0 # EURO SIGN
/Users/dankogai/work/Encode/ucm/cp1256.ucm:U20AC \x80 |0 # EURO SIGN
/Users/dankogai/work/Encode/ucm/cp1257.ucm:U20AC \x80 |0 # EURO SIGN
/Users/dankogai/work/Encode/ucm/cp1258.ucm:U20AC \x80 |0 # EURO SIGN
/Users/dankogai/work/Encode/ucm/cp874.ucm:U20AC \x80 |0 # EURO SIGN
/Users/dankogai/work/Encode/ucm/cp949.ucm:U20AC \xA2\xE6 |0 # EURO SIGN
/Users/dankogai/work/Encode/ucm/cp950.ucm:U20AC \xA3\xE1 |0 # EURO SIGN
\x80 SEEMS right for single-byte CPs but they are mapped differently in 
CP949 and CP950.
As far as I can tell from Microsoft's pages

http://www.microsoft.com/typography/unicode/cscp.htm -
http://www.microsoft.com/globaldev/reference/wincp.mspx -
http://www.microsoft.com/globaldev/reference/dbcs/936.htm
it indeed does use \x80 (though only \x00-\xFF are covered;  Where the 
heck is the FULL MAP!?).  But it seems this only applies to 936.  932 
(Japanese; Shift_JIS based), 949 (Korean; euc-kr based) and 950 
(Traditional Chinese; Big5-based) all leave \x80 blank.

I would like more confirmation from experts;  cp936.ucm has been 
overhauled with the help of MORIYAMA-san, and at that time the FULL map 
was available from the URIs above.  And I think \x80 was not used for 
EURO SIGN back then.

Oh, I still have a copy of the full mapping that was once available via 
the URI above.  Let's see...

cp936.txt says...
CODEPAGE 936; PRC GBK (XGB) - ANSI, OEM

CPINFO 2 0x3f 0x003f; DBCS CP, Default Char = Question Mark

MBTABLE 130

0x00    0x0000  ;Null
[snip]
0x20    0x0020  ;Space
[snip]
0x7f    0x007f  ;^?
0x80    0x0080  ;80
0xff    0xf8f5  ;FF
\x80 is mentioned but not mapped to EURO SIGN.

Please somebody tell me where to find the FULL map.

Dan the Encode Maintainer with Too Many (Dead) Links to Follow



Re: Warning messages for ill-formed data

2003-03-25 Thread Dan Kogai
Autrijus (and Porters),

  I think you are following this thread but in case you are not, 
Sadahiro-san proposes that some extraneous (and presumably unneeded) 
control characters in \x80-\xA0 in big5-eten map be removed to solve 
problems that arise in certain circumstances.
  Since these control characters are just duplicates of those at 
\x00-\x20, I think it is a good idea to go for it (and do the same to 
big5-hkscs.ucm).  But I am not as sure of Big5 as you are, so please 
check if the proposal is right.
  If you affirm the idea, I'll $Encode::VERSION++.

Dan the Encode Maintainer

On Tuesday, Mar 25, 2003, at 21:53 Asia/Tokyo, SADAHIRO Tomoyuki wrote:
Well, is it right?

I'm not sure of the status and the single byte-range
for Big-5, though.
diff -urN ucm~/big5-eten.ucm ucm/big5-eten.ucm
--- ucm~/big5-eten.ucm  Thu Jan 23 23:21:00 2003
+++ ucm/big5-eten.ucm   Tue Mar 25 21:43:00 2003
@@ -137,38 +137,6 @@
 U007E \x7E |0 # TILDE
 U007F \x7F |0 # DELETE
 U0080 \x80 |0 # control
-U0081 \x81 |0 # control
-U0082 \x82 |0 # BREAK PERMITTED HERE
-U0083 \x83 |0 # NO BREAK HERE
-U0084 \x84 |0 # control
-U0085 \x85 |0 # NEXT LINE
-U0086 \x86 |0 # START OF SELECTED AREA
-U0087 \x87 |0 # END OF SELECTED AREA
-U0088 \x88 |0 # CHARACTER TABULATION SET
-U0089 \x89 |0 # CHARACTER TABULATION WITH JUSTIFICATION
-U008A \x8A |0 # LINE TABULATION SET
-U008B \x8B |0 # PARTIAL LINE DOWN
-U008C \x8C |0 # PARTIAL LINE UP
-U008D \x8D |0 # REVERSE LINE FEED
-U008E \x8E |0 # SINGLE SHIFT TWO
-U008F \x8F |0 # SINGLE SHIFT THREE
-U0090 \x90 |0 # DEVICE CONTROL STRING
-U0091 \x91 |0 # PRIVATE USE ONE
-U0092 \x92 |0 # PRIVATE USE TWO
-U0093 \x93 |0 # SET TRANSMIT STATE
-U0094 \x94 |0 # CANCEL CHARACTER
-U0095 \x95 |0 # MESSAGE WAITING
-U0096 \x96 |0 # START OF GUARDED AREA
-U0097 \x97 |0 # END OF GUARDED AREA
-U0098 \x98 |0 # START OF STRING
-U0099 \x99 |0 # control
-U009A \x9A |0 # SINGLE CHARACTER INTRODUCER
-U009B \x9B |0 # CONTROL SEQUENCE INTRODUCER
-U009C \x9C |0 # STRING TERMINATOR
-U009D \x9D |0 # OPERATING SYSTEM COMMAND
-U009E \x9E |0 # PRIVACY MESSAGE
-U009F \x9F |0 # APPLICATION PROGRAM COMMAND
-U00A0 \xA0 |0 # NO-BREAK SPACE
 U00A7 \xA1\xB1 |0
 U00A8 \xC6\xD8 |0
 U00AF \xA1\xC2 |0
@@ -178,11 +146,6 @@
 U00D7 \xA1\xD1 |0
 U00F7 \xA1\xD2 |0
 U00F8 \xC8\xFB |0
-U00FA \xFA |0 # LATIN SMALL LETTER U WITH ACUTE
-U00FB \xFC |0 # LATIN SMALL LETTER U WITH CIRCUMFLEX
-U00FD \xFD |0 # LATIN SMALL LETTER Y WITH ACUTE
-U00FE \xFE |0 # LATIN SMALL LETTER THORN
-U00FF \xFF |0 # LATIN SMALL LETTER Y WITH DIAERESIS
 U014B \xC8\xFC |0
 U0153 \xC8\xFA |0
 U0250 \xC8\xF6 |0
diff -urN ucm~/big5-hkscs.ucm ucm/big5-hkscs.ucm
--- ucm~/big5-hkscs.ucm Thu Jan 23 23:21:02 2003
+++ ucm/big5-hkscs.ucm  Tue Mar 25 21:37:10 2003
@@ -136,13 +136,6 @@
 U007E \x7E |0 # TILDE
 U007F \x7F |0 # DELETE
 U0080 \x80 |0 # control
-U0081 \x81 |0 # control
-U0082 \x82 |0 # BREAK PERMITTED HERE
-U0083 \x83 |0 # NO BREAK HERE
-U0084 \x84 |0 # control
-U0085 \x85 |0 # NEXT LINE
-U0086 \x86 |0 # START OF SELECTED AREA
-U0087 \x87 |0 # END OF SELECTED AREA
 U00A7 \xA1\xB1 |0
 U00A8 \xC6\xD8 |0
 U00AF \xA1\xC2 |0
@@ -171,7 +164,6 @@
 U00F9 \x88\x7B |0
 U00FA \x88\x79 |0
 U00FC \x88\xA2 |0
-U00FF \xFF |0 # LATIN SMALL LETTER Y WITH DIAERESIS
 U0100 \x88\x56 |0
 U0101 \x88\x67 |0
 U0112 \x88\x5A |0
Regards,
SADAHIRO Tomoyuki
I often encounter lower-ascii codes mixed in with Big5 text, which is fine
and straightforward to handle.  However, a problem arises when upper
ascii occasionally occurs outside of the Big5 range.  When such a
character occurs, this is probably an error or part of a user-defined
character.
However, it appears that Encode DOES NOT display warnings for these but
rather maps individual upper ascii to conventional characters such as
Roman letters with diacritics commonly found in European languages.
(It appears that Encode displays warnings for characters that are within
the Big5 range, but do not have a mapping to Unicode, perhaps because
these code points are not used in Big5 itself.)

Is there a way to cause Encode to display warnings for upper ascii
outside of the Big5 range when converting from Big5 to Unicode?  If not,
could the developers consider this for a future fix?
Mark





Re: Warning messages for ill-formed data

2003-03-24 Thread Dan Kogai
On Tuesday, Mar 25, 2003, at 13:59 Asia/Tokyo, Mark Lewellen wrote:
Is there a way to cause Encode to display warnings for upper ascii
outside of the Big5 range when converting from Big5 to Unicode?  If 
not, could
the developers consider this for a future fix?
Use the optional 3rd argument to decode().

$utf8 = decode(Big5 => $big5); # ill-formed chars are mapped to U+FFFD
$utf8 = decode(Big5 => $big5, Encode::FB_WARN); # warns and stops at the 
first ill-formed char

See "Handling Malformed Data" in perldoc Encode for how to use the 
3rd argument.
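
perldoc Encode describes FB_WARN as FB_QUIET plus a warning, so decoding
stops at the first ill-formed character.  If replacement with U+FFFD and a
warning are both wanted, one option is to pass the WARN_ON_ERR bit on its
own.  The sketch below is an assumption-laden illustration: the helper
name decode_collecting_warnings is made up, and it uses 'ascii' so that
the ill-formed octet is unambiguous; for the case in this thread the
encoding would be Big5 instead.

```perl
use strict;
use warnings;
use Encode ();

# Decode $octets, replacing ill-formed input with U+FFFD and collecting
# the warnings instead of printing them to STDERR.
sub decode_collecting_warnings {
    my ($encoding, $octets) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    # WARN_ON_ERR without RETURN_ON_ERR: warn, substitute, keep going.
    my $check = Encode::WARN_ON_ERR | Encode::LEAVE_SRC;
    my $text  = Encode::decode($encoding, $octets, $check);
    return ($text, \@warnings);
}

# \xE9 is not ASCII, so it is reported and replaced.
my ($text, $warnings) = decode_collecting_warnings('ascii', "caf\xE9");
print scalar(@$warnings), " warning(s)\n";
```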

Dan the Encode Maintainer



Re: Encode 1.87 and later don't pass make test on static perl

2003-03-06 Thread Dan Kogai
Sorry, my finger has slipped.

On Thursday, Mar 6, 2003, at 22:48 Asia/Tokyo, Dan Kogai wrote:

On Thursday, Mar 6, 2003, at 04:10 Asia/Tokyo, Blair Zajac wrote:
Hello,

I have several self compiled copies of Perl 5.8.0, one of which is
compiled to be statically linked (for Perl modules that is, not libc
and other system libraries) so I can profile the code using gcc -pg.
How EXACTLY did you do so?  Here is a recommended way.

0) rm -rf perl-5.8.0/ext/Encode
1) untargzip Encode-1.xx and mv that perl-5.8.0/ext/Encode
2) update perl-5.8.0/MANIFEST
3) configure perl
4) make and make test
Maybe you should start over w/ a fresh copy of perl-5.8.0 as well.

I looked for Encode 1.87 on CPAN but couldn't find it, so here's the
error output on with 1.89, but the problem was introduced in 1.87.
Perl -V output is below.  If I can find tar.gz's of Encode 1.86 and
1.87, I could check again to ensure that 1.86 and 1.87 work and fail
respectively.
Andreas has already answered this one (Thanks, Andreas).  The reason I  
am asking if you are following the right procedure is this;

Encode::KR object version 1.22 does not match bootstrap parameter 1.23  
at  
/opt/i386-linux/installed/perl-5.8.0-g-pg/lib/5.8.0/i686-linux/ 
XSLoader.pm line 44.
Here the module mismatch is happening, suggesting that old symbol(s)  
still exist somewhere...

Dan the Encode Maintainer




Re: [PATCH] viscii.ucm

2003-02-16 Thread Dan Kogai
SADAHIRO Tomoyuki [EMAIL PROTECTED]

I doubt whether the Unicode consortium had provided
any viscii-Unicode mapping table
under www.unicode.org/Public/MAPPINGS/.

I could suppose the table shipped on Perl was borrowed
from czyborra.com. The table there
  ( http://czyborra.com/charsets/vietnamese.html )
has wrongly the duplicated A^? and no a^?.

The following site provides a correct table.

   http://www.vietstd.org/document/unicode.html


Okay.  I'll replace viscii.ucm in the next release.

On Sunday, Feb 16, 2003, at 20:10 Asia/Tokyo, Jarkko Hietaniemi wrote:

...or it could come from the Tcl/Tk mapping tables?


I think this is it.  When I took over Encode maintenance, I rebuilt 
whatever mappings unicode.org did have, but I don't recall doing so for 
viscii, so it must have come from viscii.enc.

Dan the Encode Maintainer



Re: [PATCH] viscii.ucm

2003-02-16 Thread Dan Kogai
On Sunday, Feb 16, 2003, at 12:26 Asia/Tokyo, SADAHIRO Tomoyuki wrote:

Hello. I've found that the mapping in the present viscii.ucm
differs slightly from that in RFC 1456.

U+1EA8 (A^? in VIQR) should be mapped for 0x86,
and U+1EA9 (a^? in VIQR) for 0xA6.

Here is a test scratch to check whether VISCII indeed
supports all the Latin extensions for Vietnamese,
i.e. U+1EA0..U+1EF9.

#!perl
use Encode;
# FB_WARN reports every code point that VISCII cannot encode.
for (my $u = 0x1EA0; $u <= 0x1EF9; $u++) {
    my $e = encode("VISCII", chr $u, Encode::FB_WARN);
}
warn "End.";
__END__



Thanks. Applied in my repository.

Dan the Encode Maintainer

P.S.  Is ftp.funet.fi still down?  I think I am ready to 
$Encode::VERSION++ with this patch applied.



Re: Handling MacArabic in perl 5.8.0

2003-01-29 Thread Dan Kogai
David and Sadahiro-san,


On Wednesday, January 29, 2003, at 11:58 PM, SADAHIRO Tomoyuki wrote:

On Tue, 28 Jan 2003 01:48:42 -0500
David Graff [EMAIL PROTECTED] wrote:

BTW, I just noticed that the unicode web site now has a more recent
version of the APPLE/ARABIC.TXT mapping page than the one I cited
earlier, and the new version offers improved/expanded commentary:
http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ARABIC.TXT
(dated Dec. 19, 2002).


Oh shoot.  Should I rebuild Encode/ucm/macArabic.ucm?
FYI I am a Mac user but I am on OS X so I don't have many chances to 
come across mac* encodings.

Dan the Encode Maintainer



[Encode] HEADS-UP; $Encode::VERSION++ to enhance filter option

2003-01-25 Thread Dan Kogai
Porters,

  In a recent discussion on various perl-related MLs in Japanese, I 
discovered that the encoding pragma does not work with such multibyte 
encodings as Shift_JIS, which use the 0x00-0x7f range in the 2nd byte.  
Though not tested, I am pretty sure big5 is also prone to this.

  To understand this problem please have a look at the hexdump below;

% hexdump -C enc-sjis.pl
00000000  23 2f 75 73 72 2f 6c 6f  63 61 6c 2f 62 69 6e 2f  |#/usr/local/bin/|
00000010  70 65 72 6c 20 2d 77 0a  75 73 65 20 73 74 72 69  |perl -w.use stri|
00000020  63 74 3b 0a 75 73 65 20  65 6e 63 6f 64 69 6e 67  |ct;.use encoding|
00000030  20 27 73 68 69 66 74 2d  6a 69 73 27 3b 0a 0a 6d  | 'shift-jis';..m|
00000040  79 20 24 6e 61 6d 65 20  3d 20 22 94 5c 22 3b 0a  |y $name = ".\";.|
00000050  70 72 69 6e 74 20 24 6e  61 6d 65 3b 0a 77 72 69  |print $name;.wri|
00000060  74 65 3b 0a 0a 66 6f 72  6d 61 74 20 53 54 44 4f  |te;..format STDO|
00000070  55 54 20 3d 0a 94 5c 97  cd 3a 40 3c 3c 3c 0a 24  |UT =..\..:@<<<.$|
00000080  6e 61 6d 65 0a 2e 0a                              |name...|

  The perl script is a valid perl script in Shift_JIS but the quoted 
character (U+80fd, \x94\x5c in Shift_JIS) uses \x5c in the 2nd byte, 
mangling the script.  With the encoding pragma, the script needs to be 
parsable ASCII-wise.
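
To see which characters are affected, one can scan a string for characters
whose Shift_JIS form ends in 0x5C.  This is only a sketch; the helper name
backslash_trail_chars is made up for this note, and it checks for the
backslash octet only, not for every ASCII-range trail byte.

```perl
use strict;
use warnings;
use Encode ();

# Return the characters of $text whose Shift_JIS form ends in 0x5C ("\").
sub backslash_trail_chars {
    my ($text) = @_;
    my @hits;
    for my $char (split //, $text) {
        my $octets = Encode::encode('shiftjis', $char);
        push @hits, $char
            if length($octets) == 2 && substr($octets, 1, 1) eq "\x5C";
    }
    return @hits;
}

# U+80FD is the character from the hexdump; U+3042 is harmless.
my @hits = backslash_trail_chars("\x{80FD}\x{3042}");
printf "U+%04X\n", ord($_) for @hits;
```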
  Fortunately, the encoding pragma offers a different approach via 
Filter=>1.  The problem is that the Filter option was incomplete in two 
ways.

0.  Filter=>1 leaves STD(IN|OUT) untouched.  Not only does it leave 
STD* untouched, it completely ignores the STD*=> hooks that the 
non-filter version offers.

1.  In order to touch STD(IN|OUT) sensibly you have to 'use utf8' in 
the script to make sure the literals therein are utf8-flagged but that 
makes the code too counterintuitive.

The following patch fixes that so the filter option is more useful.  I 
am planning to apply this patch to the next version of Encode but I 
still need to fix the POD and write test suites.  So I decided to issue 
a warning before committing a release.

Dan the Encode Maintainer

--- encoding.pm 2003/01/22 03:29:07 1.40
+++ encoding.pm 2003/01/26 07:03:59
@@ -35,33 +35,11 @@
     unless ($arg{Filter}) {
         ${^ENCODING} = $enc unless $] <= 5.008 and $utfs{$name};
         $HAS_PERLIO or return 1;
-        for my $h (qw(STDIN STDOUT)){
-            if ($arg{$h}){
-                unless (defined find_encoding($arg{$h})) {
-                    require Carp;
-                    Carp::croak("Unknown encoding for $h, '$arg{$h}'");
-                }
-                eval { binmode($h, ":encoding($arg{$h})") };
-            }else{
-                unless (exists $arg{$h}){
-                    eval {
-                        no warnings 'uninitialized';
-                        binmode($h, ":encoding($name)");
-                    };
-                }
-            }
-            if ($@){
-                require Carp;
-                Carp::croak($@);
-            }
-        }
     }else{
         defined(${^ENCODING}) and undef ${^ENCODING};
         eval {
             require Filter::Util::Call ;
             Filter::Util::Call->import ;
-            binmode(STDIN);
-            binmode(STDOUT);
             filter_add(sub{
                 my $status;
                 if (($status = filter_read()) > 0){
@@ -71,7 +49,31 @@
                 $status ;
             });
         };
+        # internally use utf8 to make sure utf8 flags are set
+        # for literals.
+        use utf8 (); # to fetch $utf8::hint_bits;
+        $^H |= $utf8::hint_bits;
         # warn "Filter installed";
+    }
+    for my $h (qw(STDIN STDOUT)){
+        if ($arg{$h}){
+            unless (defined find_encoding($arg{$h})) {
+                require Carp;
+                Carp::croak("Unknown encoding for $h, '$arg{$h}'");
+            }
+            eval { binmode($h, ":encoding($arg{$h})") };
+        }else{
+            unless (exists $arg{$h}){
+                eval {
+                    no warnings 'uninitialized';
+                    binmode($h, ":encoding($name)");
+                };
+            }
+        }
+        if ($@){
+            require Carp;
+            Carp::croak($@);
+        }
     }
     return 1; # I doubt if we need it, though
 }



Re: Encode utf-16 problem

2002-12-02 Thread Dan Kogai
On Tuesday, Dec 3, 2002, at 11:12 Asia/Tokyo, Jarkko Hietaniemi wrote:

Why the 'Partial character' warnings?  I would have though the input
files are just right.  Also, the warnings are given to stderr
unconditionally, I would have to redirect stderr to /dev/null to get
rid of the warnings.

$ perl -le 'print pack("v*", 0xFEFF, unpack("C*", "test"))' >! utf16
$ hex utf16
ff fe 74 00 65 00 73 00 74 00 0a        ..t.e.s.t..
$ ./perl -Ilib -e 'open(FH, "<:encoding(utf16)", "utf16");$a=<FH>;print $a'|hex
UTF-16:Partial character at -e line 1.
UTF-16:Partial character at -e line 1.
74 65 73 74                             test
$ perl -le 'print pack("n*", 0xFEFF, unpack("C*", "test"))' >! utf16
$ hex utf16
fe ff 00 74 00 65 00 73 00 74 0a        ...t.e.s.t.
$ ./perl -Ilib -e 'open(FH, "<:encoding(utf16)", "utf16");$a=<FH>;print $a'|hex
UTF-16:Partial character at -e line 1.
UTF-16:Partial character at -e line 1.
74 65 73 74                             test
$

Aw.  You can't use 'utf16' for 'use encoding' or PerlIO.  You have to 
specify the endianness.  Because of the BOM you can't use it for a 
PerlIO stream.

I'll tweak Unicode.pm so that perlio_ok returns 0 for BOMless UTF's in 
the next version.
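
One way around the stream problem is to decode the whole buffer at once,
so that the BOM is seen exactly once.  The following is a minimal sketch
under that assumption; the helper name decode_utf16 and its fallback
argument are made up for this note.  For a PerlIO layer, an explicit
byte order such as ":encoding(UTF-16LE)" avoids the BOM question
altogether.

```perl
use strict;
use warnings;
use Encode ();

# Decode UTF-16 octets.  If a BOM is present, let the BOM-aware "UTF-16"
# decoder consume it; otherwise fall back to an explicit byte order.
sub decode_utf16 {
    my ($octets, $default) = @_;
    $default = 'UTF-16BE' unless defined $default;
    my $has_bom = $octets =~ /\A(?:\xFF\xFE|\xFE\xFF)/;
    return Encode::decode($has_bom ? 'UTF-16' : $default, $octets);
}

my $le = pack 'v*', 0xFEFF, unpack 'C*', 'test';    # BOM + little-endian
my $be = pack 'n*', 0xFEFF, unpack 'C*', 'test';    # BOM + big-endian
print decode_utf16($le), "\n";
print decode_utf16($be), "\n";
```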

Dan the Encode Maintainer



[Encode] HEADS-UP: NC patch will be in

2002-11-04 Thread Dan Kogai
NC and porters,

  First of all, this is a great patch.  Not only does it optimize the 
resulting shlibs, it seems to consume less memory during compilation.

On Monday, Nov 4, 2002, at 12:26 Asia/Tokyo, [EMAIL PROTECTED] wrote:
Nicholas Clark [EMAIL PROTECTED] wrote:
:I've been experimenting with how enc2xs builds the C tables that turn 
into the
:shared objects. enc2xs is building tables (arrays of struct 
encpage_t) which
:in turn have pointers to blocks of bytes.

Great, you seem to be getting some excellent results.

Worked absolutely fine on my PowerBook G4, too.

Before:
  208948 Encode/Byte/Byte.bundle
 1984416 Encode/CN/CN.bundle
   30076 Encode/EBCDIC/EBCDIC.bundle
   33728 Encode/Encode.bundle
 2590420 Encode/JP/JP.bundle
 2208996 Encode/KR/KR.bundle
   39720 Encode/Symbol/Symbol.bundle
 1940288 Encode/TW/TW.bundle
   17892 Encode/Unicode/Unicode.bundle

After:
  178220 Encode/Byte/Byte.bundle
 1085116 Encode/CN/CN.bundle
   25336 Encode/EBCDIC/EBCDIC.bundle
   33604 Encode/Encode.bundle
 1308568 Encode/JP/JP.bundle
 1209804 Encode/KR/KR.bundle
   34896 Encode/Symbol/Symbol.bundle
 1059040 Encode/TW/TW.bundle
   17892 Encode/Unicode/Unicode.bundle


I have also wondered whether the .ucm files are needed after these
have been built; if not, we should consider supplying with perl only
the optimised table data if that could give us a space saving in the
distribution - it would cut build time significantly as well as
allowing us to consider algorithms that take much longer over the
table optimisation, since they need be run only once when we
integrate updated .ucm files.


A trivial yet effective patch is to strip all comments therein.  That 
should save space dramatically, but *.ucm is, in a way, source, 
so I am not sure if I should go for it.

Anyway, I am pretty much for integrating the NC patch, not just because 
it reduces shlib sizes but because it also appears to be safer for 
compilers (one of the optimizer features, AGGREGATE_TABLES, was dropped 
during the dev phase of perl 5.8 for the sake of djgpp and other 
low-memory platforms).  Unfortunately I am at my parents' place this week 
(to finish the book I am writing -- away from kids), so I do not have as 
many resources for extensive tests (the FreeBSD box I was using here at 
my parents' just died (physically) the day before I came :-( ).

Another concern is that since it changes the internal structure of 
shlibs, CPANized Encode::* modules need to be rebuilt as well, so the 
released version needs to print a warning about that -- oh wait!  
Encode.xs remains unchanged, so Encode::* may still work.

Thank you, NC.

Dan the Encode Maintainer



Re: [Encode] HEADS-UP: NC patch will be in

2002-11-04 Thread Dan Kogai
On Monday, Nov 4, 2002, at 20:11 Asia/Tokyo, Dan Kogai wrote:

oh wait!  Encode.xs remains unchanged so Encode::* may still work


Confirmed.  The NC patch works w/ preexisting shlibs.


perl -MEncode -e 'print Encode->VERSION, "\n"'
1.81 # not released, of course!
perl -MEncode::HanExtra -e 1



Dan the Encode Maintainer




How to name CJK ideographs

2002-10-25 Thread Dan Kogai
On Saturday, Oct 26, 2002, at 03:55 Asia/Tokyo, Jungshik Shin wrote:

  Another possibility is 'meaning-pronunciation' index. I believe
this is one of a few ways to refer to CJK characters (say, over the 
phone)
in all CJK countries. However, to do this, we need much more raw data
(more or less like a small dictionary) than UniHan DB provides because
it lists meanings of characters in English only.

That's one thing I wish I could do -- "Dan as in Bomb" because I 
can't go like "YOU five ef three ee" :)  I know that's difficult but it 
strikes me to find out we still have no way to canonically specify 
(Hanzi|Kanji|Hanja) after all these years (besides Unicode code points 
but who the heck wants to do so ?).

perl -e 'print "\x{5c0f}\x{98fc} \x{5f3e}\n"'



Re: [Encode] 1.80 released

2002-10-24 Thread Dan Kogai
On Friday, Oct 25, 2002, at 09:29 Asia/Tokyo, [EMAIL PROTECTED] wrote:

I'd recommend the small patch below, which will make it possible to
run the new rt.pl in any of the standard manners under the core:
  ( cd t ; ./perl TEST ../ext/Encode/t/rt.pl )
  ( cd t ; ./perl harness ../ext/Encode/t/rt.pl )
  PERL_CORE=1 ./perl -Ilib ext/Encode/t/rt.pl

With this patch, those tests also pass (eventually :).


Thanks, applied back :)  I feel relieved now.  And I am doubly relieved 
to find how meticulous a pumpking you are.  I wonder why Net::Ping's 
(rather obvious bug (for *BSD users)) slipped thru :)  And I was 
surprised to find your name was not on ext/Encode/AUTHORS.  Now added.

With that done, please proceed to the next patch to fix tr///

From: Dan Kogai [EMAIL PROTECTED]
Date: Mon Oct 21, 2002  17:36:02 Asia/Tokyo
To: hv [EMAIL PROTECTED], Inaba Hiroto [EMAIL PROTECTED], Jarkko 
Hietaniemi [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: The Inaba patch for tr/// vs. use encoding
Message-Id: [EMAIL PROTECTED]

I KNOW you are working on it (at least reviewing it), but this is just 
a reminder.

Dan the Perl5 Porter



Re: Unicode. Perl does the right thing?

2002-10-24 Thread Dan Kogai
On Friday, Oct 25, 2002, at 14:10 Asia/Tokyo, Philip Newton wrote:
 Well, partially because there's no "good" names for many of the
 characters. What do you call "生"? "CJK UNIFIED IDEOGRAPH-751F"? (That's
 the current Unicode "name", but it's not particularly useful.) "CJK
 shou"? "CJK sei"? "CJK sheng1"? "CJK saeng"? "CJK ikiru"? ikasu, ikeru,
 umareru, umu, ou, haeru, hayasu, ki, nama, naru, nasu, musu,  which
 one do you pick?

If we are stuck with de jure, ex officio names from the Unicode 
Consortium, we are out of luck, but this is perl; if there is more than 
one way to do it, why not more than one way to name it?  I am kind of 
wondering about a charnames extension that goes like

use charnames ":ja"; # Japanese
print "\N{sei-ikiru}";
#
use charnames ":ko";
print "\N{saeng}";
#
use charnames ":zh";
print "\N{sheng1}";

Since the pragmatic approach is rather inflexible, I would prefer an OO 
approach, like

use Char::Name;

my $char = Char::Name->new;

print $char->jp("sei-ikiru");

I know Japanese is the biggest nightmare for naming characters because in 
Japanese we give too many "names" to each character; it's really hard 
to disambiguate these.

I may come up with something as I look through the Unihan DB, now 
accessible via CPAN (Unicode::Unihan).

 Cheers,
 Philip Newton ($BIT0aN'ITF~FZ(B)

\x{5c0f}\x{98fc} \x{5f3e}
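
For reference, the official name quoted in this message can be looked up in both directions with the core charnames module. This is only a sketch of what exists today, assuming a reasonably recent Perl; it is not the per-language naming scheme proposed above, which remains hypothetical.

```perl
use strict;
use warnings;
use charnames ':full';

# Code point -> official Unicode name.
my $name = charnames::viacode(0x751F);

# Official Unicode name -> code point.
my $code = charnames::vianame('CJK UNIFIED IDEOGRAPH-751F');

printf "U+%04X %s\n", $code, $name;
```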


[Encode] 1.80 released

2002-10-21 Thread Dan Kogai
Hugo and porters,

  I have released Encode 1.80 despite the fact I just released 1.79 
less than 24 hours ago.

Whole:
	http://www.dan.co.jp/~dankogai/Encode-1.80.tar.gz
	and CPAN

Change is very small; it just includes a patch from NI-XS.

$Revision: 1.80 $ $Date: 2002/10/21 20:39:09 $
! Encode.xs t/mime-header.t
  Even more patches from NI-XS regarding Encode::utf8->decode().
  And one more test to t/mime-header.t to prove it
  Message-Id: [EMAIL PROTECTED]

Still I decided to go for 1.80 because I reckon you are yet to commit 
the latest Encode to bleedperl.  If you haven't worked on it that's fine; 
just skip 1.7? and go straight to 1.80.  Apologies and thanks.

Dan the Encode Maintainer.



[Encode] 1.79 released

2002-10-21 Thread Dan Kogai
porters,

I have decided to release Encode 1.79 so soon after 1.78 for two 
reasons.

Whole:
	http://www.dan.co.jp/~dankogai/Encode-1.79.tar.gz
	and CPAN

=head1 reasons

=over

=item 1

The latest patch to Encode.(pm|xs) by Nick Ing-XS to relocate 
Encode::utf8 from .pm to .xs introduced a minor bug that was 
revealed in t/mime-header.t.  It was due to the fact that 
Encode::utf8->decode() attempts to decode even when the argument is 
already flagged as a utf8 string.  Encode 1.78 fixed the problem by 
mending lib/Encode/MIME/Header.pm but Nick Ing-XS has sent me a patch to 
Encode.xs.

=item 2

The M$ version of the mapping in cp949 (Korean) and cp950 (Trad. Chinese) 
was obsolete, resulting in U+20AC (EURO SIGN) and U+00AE (REGISTERED 
SIGN) going missing.  This time Moriyama-san has tested them against 
conversions via the Win32 API and verified that they all match now (at 
least those marked as round-trippable).

=back

=head2 grumbles

Frankly, I am f.*ing tired of hearing about any M$-related char map 
issues.  This is to close the cp9?? cases altogether.  From now on I 
will happily ignore any claims saying 'cp??? seems to be wrong' unless 
M$ fixes their web pages ( 
http://www.microsoft.com/typography/unicode/cscp.htm -- it's gone!) and 
THE ATTITUDE (no news on the shutdown of the page above was released to 
the community).  I have just had enough, m'kay?

=head1 Changes

$Revision: 1.79 $ $Date: 2002/10/21 06:05:37 $
! Encode.xs
  Further patches from NI-XS.  Encode::utf8->decode() now checks the
  value of utf8 flag of the argument.  As a result, the fix to
  lib/Encode/MIME/Header.pm is no longer necessary but since it did
  no harm (even speedwise) I'll leave it unreverted.
! ucm/cp949.ucm ucm/cp950.ucm
  U+20AC EURO SIGN
  U+00AE REGISTERED SIGN
  were missing as a result of 1.78. Discovered by Moriyama-san.
  Moriyama-san has also developed a test script that compares
  (en|de)coded results to the corresponding Win32 API result and
  all cp9?? maps are now verified.
  Message-Id: [EMAIL PROTECTED]
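
A rough way to spot-check the kind of gap described above is a round-trip test. The sketch below is not Moriyama-san's script (which compares against the Win32 API); round_trips is a hypothetical helper that only checks whether a character survives encode followed by decode in a given code page.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: true if $char survives encode + decode in $enc.
# FB_CROAK makes unmappable characters die instead of becoming "?".
sub round_trips {
    my ($enc, $char) = @_;
    my $back = eval {
        my $octets = encode($enc, $char, Encode::FB_CROAK | Encode::LEAVE_SRC);
        decode($enc, $octets, Encode::FB_CROAK);
    };
    return defined $back && $back eq $char;
}

for my $enc (qw(cp949 cp950)) {
    for my $cp (0x20AC, 0x00AE) {
        printf "%s U+%04X %s\n", $enc, $cp,
            round_trips($enc, chr $cp) ? 'ok' : 'MISSING';
    }
}
```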

=head1 AUTHOR

Dan the Encode Maintainer



[Encode] HEADS-UP: ucm/cp932.ucm will be updated

2002-10-18 Thread Dan Kogai
Porters (especially Nick Ing-XS),

  I would like to release Encode 1.78 soon to address the problem in  
CP932 (MS version of Shift_JIS) which MORIYAMA Masayuki  
[EMAIL PROTECTED] has discovered.  Not only has he addressed the  
problem, he has also supplied me with a patch.  Though he was reluctant  
to come to perl(5-porters|unicode)@perl.org (I invited him but he was  
too shy to talk to us in English), the problem and solution he has  
raised were too good to ignore, so I would like to update Encode on his  
behalf.  Here is the summary of his points.

* ucm/cp932.ucm was based on the mapping file at unicode.org [0] but  
that mapping is obsolete;  it works on Windows 3.1 but not in the era  
of Win32.
* as a result, cp932 is rendered almost useless, at least too  
impractical
* patch was made available [1]

My first suggestion was "Ask MS to update the data at unicode.org,  
and if you are unsatisfied w/ the one that comes w/ Encode you are free  
to CPANize your version."  But he has raised even more points and I was  
finally convinced.

* Though not in unicode.org, MS has already made the mapping available  
in their web [2][3]
* Python and Ruby will be using the MS version, not the one at  
unicode.org
* Java has been known to suffer badly for confusing Shift_JIS and CP932  
but Encode is already free of this problem by supplying different  
mappings for Shift_JIS and CP932.

[0] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
[1] http://www2d.biglobe.ne.jp/~msyk/perl/cp932.html
[2] http://www.microsoft.com/typography/unicode/cscp.htm
[3] http://www.microsoft.com/typography/unicode/932.txt
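
The Shift_JIS vs. CP932 distinction can be seen on a single well-known code: the two bytes 0x81 0x60 decode to different code points under the two mappings. A minimal sketch, assuming the encoding names 'shiftjis' and 'cp932' as shipped with Encode:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\x81\x60";    # the same two bytes, read under both mappings

my $sjis  = decode('shiftjis', $octets);   # JIS-style mapping
my $cp932 = decode('cp932',    $octets);   # Microsoft mapping

printf "shiftjis: U+%04X\n", ord $sjis;    # U+301C WAVE DASH
printf "cp932:    U+%04X\n", ord $cp932;   # U+FF5E FULLWIDTH TILDE
```

Data that is really CP932 but gets decoded as Shift_JIS (or the other way round) will therefore fail to round-trip on such characters.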

One small but significant concern is Tcl/Tk;  So far Encode's CP932  
does match that of Tcl but not after my next release of Encode.  So I  
decided to call for opinion before I commit the release.

AFAIK, CP\d+ should be avoided for any data exchanged on the Net, so you  
should not use it on the web or in mail, and it's perfectly all right if  
Tk(Web|Mail) has a problem handling them.  At the same time Win32 Perl  
users would be much happier if CP\d+ are made more practical.

The URI [2] also has links to other code pages so I would also like to  
review them and, if necessary, update them.  8 bit code pages (CP12??)  
seem OK but the other CJK ones (CP9??) need review.

Dan the Encode Maintainer



Re: FW: ISO 8859-11 (Thai) cross-mapping table

2002-10-07 Thread Dan Kogai

On Tuesday, Oct 8, 2002, at 01:24 Asia/Tokyo, 
[EMAIL PROTECTED] wrote:
 I'll fix it but withhold from $Encode::VERSION++ since the table 
 itself
 appears correct.

 But now we have no TIS620, then, so that needs to be added?

Well, unless I hear requests from Thai native users, I'll abstain since 
TIS620 did not exist in http://www.unicode.org/Public/.  So far as I 
see ISO-8859-11 suffices.

But once again I am only human so correct me if I am wrong.

Dan the Man with Too Many Encodings to Support; Too Many Typos Generated




[Encode] 1.77 Released

2002-10-05 Thread Dan Kogai

Porters,

   I am releasing Encode 1.77 to accommodate the up-and-coming changes 
to bleedperl that make tr/// free of eval qq{} under the use encoding 
pragma.  I raised this problem in:

 From: Dan Kogai [EMAIL PROTECTED]
 Date: Thu Oct 3, 2002  20:31:13 Asia/Tokyo
 To: [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]
 Subject: tr/// and use encoding
 Message-Id: [EMAIL PROTECTED]


Whole Package -- to be uploaded to CPAN soon:
http://www.dan.co.jp/~dankogai/Encode-1.77.tar.gz

Patch against bleedperl (242 lines) for Hugo:
http://www.dan.co.jp/~dankogai/current-1.77.diff.gz

And here is the Changes.  This one also includes minor alias fix by 
Autrijus.

$Revision: 1.77 $ $Date: 2002/10/06 03:27:02 $
! t/jperl.t
   * Modified to accommodate the upcoming patch by Inaba-san that
 will fix tr/// needing eval qq{}
 Message-Id: [EMAIL PROTECTED]
! encoding.pm
   * pod fixes/enhancements to reflect the changes above
! lib/Encode/Alias.pm
   Encode::TW is correct, Encode::Alias not. - /Autrijus/
   Message-Id: [EMAIL PROTECTED]

Note that this update alone will NOT fix the tr/// vs. use encoding 
problem noted above; the real fix needs to be applied to bleedperl 
(regcomp.c and such).  The patch was already made available by Inaba 
Hiroto [EMAIL PROTECTED] (IsP for short) and my preliminary tests already 
show that it works with a minor fix to ext/Encode/t/jperl.t.  Hence the 
update to Encode prior to the bleedperl patch.

Here is the proposed schedule before IsP goes into bleedperl;

On Sunday, Oct 6, 2002, at 12:10 Asia/Tokyo, Jarkko Hietaniemi wrote:
 Okay, how about the schedule below?

 0)   I release IsP-safe Encode 1.77 to CPAN.  All I have to do is to
 comment out the local(${^ENCODE}) black magic so it is still test-safe
 under perl 5.8.0

 1)  Hugo to sync bleedperl w/ Encode 1.77

 2)  New *.t for IsP under somewhere OTHER THAN ext/Encode/t.
 lib/nihongo.t, maybe?  that is to test thoroughly what IsP brings.
 That test suite does not have to run on stock perl 5.8.0

 3)  With enough assurance w/ the test suite above, IsP goes into
 bleedperl.

 Sounds good to me.

The Encode update complies with step 0.  Though this update is primarily 
for bleedperl, it is perl 5.8.0-safe.

Please allow some time before steps 2-3.  I would like Inaba-san's help 
on 2 and I have been rather busy recently.

Dan the Encode Maintainer




tr/// and use encoding

2002-10-03 Thread Dan Kogai

On Thursday, Oct 3, 2002, at 11:29 Asia/Tokyo, Jarkko Hietaniemi wrote:
 On Wed, Oct 02, 2002 at 10:44:06PM +0900, Dan Kogai wrote:
 On Wednesday, Oct 2, 2002, at 22:34 Asia/Tokyo, Jarkko Hietaniemi 
 wrote:
 Both.  I think the operation needed is straight-forward.  When you get
 tr[LHS][RHS], decode'em then
 feed it to the naked tr// .

 Urk...  That means a dip into the toke.c, how the tr/// ranges are
 implemented is... tricky.  sv_recode_to_utf8() is needed somewhere...
 but I'm a little bit pressed for time right now.  I suggest you
 perlbug this and move the process to perl5-porters.  (Inaba Hiroto
 also might have insight on this; he's the tr///-with-Unicode sensei,
 really-- he practically implemented all of it.  And he might read
 *[gk]ana much better than me :-)

So now this thread is in perl5-porters.  Since this undocumented (lack 
of) feature has a very easy workaround, I am yet to perlbug this.

=head1 PROBLEM

C<use encoding 'foo-encoding'> nicely converts string literals and 
regexes into UTF-8 so you can get the power of perl 5.8.0 even when your 
source code is in a text encoding other than UTF-8.  But tr/// does not 
embrace this magic.

=head1 WORKAROUND

Suppose your script is in EUC-JP and your source contains this:

   $kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/;
      

And you want perl to do the following;

   $kana =~ tr/\x{3041}-\x{3093}/\x{30a1}-\x{30f3}/

All you have to do is:

   use encoding 'euc-jp';
   # 
   eval qq{ \$kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/ };

=over

=item chars in this example

   utf8     euc-jp   charnames::viacode()
   -----------------------------------------
   \x{3041} \xA4\xA1 HIRAGANA LETTER SMALL A
   \x{3093} \xA4\xF3 HIRAGANA LETTER N
   \x{30a1} \xA5\xA1 KATAKANA LETTER SMALL A
   \x{30f3} \xA5\xF3 KATAKANA LETTER N

=back
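
On later perls, where the encoding pragma is deprecated, the same conversion can be sketched without any source-encoding magic by decoding explicitly, running tr/// on characters, and encoding back. hiragana_to_katakana is a hypothetical helper name, not part of Encode; the ranges are the ones from the table above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Takes EUC-JP octets, returns EUC-JP octets with hiragana mapped to katakana.
sub hiragana_to_katakana {
    my ($octets) = @_;
    my $kana = decode('euc-jp', $octets);               # octets -> characters
    $kana =~ tr/\x{3041}-\x{3093}/\x{30a1}-\x{30f3}/;   # ranges from the table
    return encode('euc-jp', $kana);                     # characters -> octets
}

my $out = hiragana_to_katakana("\xA4\xA1\xA4\xF3");
print join(' ', map { sprintf '%02X', ord } split //, $out), "\n";
```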

=head1 DISCUSSION

I found this when I was writing a CGI book and wanted form 
validation/correction.  The example above converts all Hiragana to 
Katakana, which is a common task in Japan.  Traditionally this kind of 
operation was done via jcode::tr() (require jcode.pl;) or Jcode::tr() 
(use Jcode;).  But as of perl 5.6.0 you can apply Japanese directly 
in regex and tr/// -- so long as your script is in UTF-8.

With perl 5.8.0, the direct application of multibyte regex was made 
possible via the C<use encoding> pragma.  The pragma applies its 
magic as follows.  Suppose you C<use encoding 'foo';>

=over

=item 0.

${^ENCODING}, a special, non-scoped variable, is set to 
C<Encode::find_encoding('foo')>.  If 'foo' is a supported encoding by 
Encode, ${^ENCODING} is now a transcoder object.

=item 1.

All string literals in q//, qq//, qw// and qr// (not sure of qx//) are 
first fed to ${^ENCODING}->decode().  So from perl's point of view, 
it's the same as literals written in UTF-8.

=item 2.

C<binmode STDIN, ":encoding(foo)";> and C<binmode STDOUT, 
":encoding(foo)";> are implicitly applied, so you can feed STDIN in 
encoding 'foo' and get STDOUT in encoding 'foo'.

=back

Very clever and powerful.  But 1. is not done to tr///.  qq{} is under 
control of C<use encoding>, so eval qq{} works as expected.

Though the workaround is simple, easy and clever, it still leaves an 
inconsistency in how ${^ENCODING} gets used; it does indeed work on 
non-interpolated literals already.
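
The transcoder object from step 0 can also be obtained and used by hand. A minimal sketch, assuming only the documented Encode::find_encoding interface; it does explicitly what the pragma does implicitly for literals in step 1:

```perl
use strict;
use warnings;
use Encode ();

# The same kind of object that step 0 stores in ${^ENCODING}.
my $enc = Encode::find_encoding('euc-jp')
    or die "euc-jp is not supported by this Encode";

# Step 1, done by hand: feed the raw octets to ->decode().
my $octets = "\xA4\xA1\xA4\xF3";
my $chars  = $enc->decode($octets);

printf "%s: %s\n", $enc->name,
    join ' ', map { sprintf 'U+%04X', ord } split //, $chars;
```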

=head1 REPORTED BY

Dan the Encode Maintainer E<lt>[EMAIL PROTECTED]E<gt>




[FYI] use encoding 'non-utf8-encoding'; use CGI;

2002-10-02 Thread Dan Kogai
I am currently writing yet another CGI book.  That is for the Japanese 
market and written in Japanese.  So it is inevitable that you have to 
face the labyrinth of character encoding.

Before perl 5.8.0, most books that teach how to handle Japanese in CGI 
went as follows:

* stick with EUC-JP.  it does not poison perl like Shift_JIS.
* use jcode.pl or Jcode.pm when you have to convert encoding.
* you can use jcode::tr or Jcode->tr when you have to convert between 
Hiragana and Katakana

fine, so far.  But

* totally forget regex unless you are happy with a very 
counter-intuitive measure illustrated in 6.18 of the Cookbook
* if you are desperate in Kanji regex, use jperl instead.

That has now changed with 'use encoding'.  But when it comes to CGI, 
'use encoding' alone will not cut it.  CGI.pm, however, can handle 
multipart/form-data.  Together they let you use regex safely and 
intuitively without resorting to converting your CGI script to UTF-8.

The 120-line script right after my signature illustrates that.  Sorry, 
it contains some Japanese (or my point gets blurred).

As you see, tr/// is not subject to the magic of 'use encoding'.  jhi, 
have we made it so deliberately?  I am beginning to think tr/// would be 
happier to embrace the power thereof.

Still, it can be overcome by a simple eval qq{} as illustrated.  This 
much idiom would not hurt much, at least not as much as the Cookbook 
sample.

Dan the Transcoded Man

#!/usr/local/bin/perl
#
# Save me in EUC-JP!

use 5.008;
use strict;
use CGI;
use CGI::Carp qw(fatalsToBrowser);
our $Method  = 'POST';
#our $Method  = 'GET';
our $Enctype = 'multipart/form-data';
#our $Enctype = 'application/x-www-form-urlencoded';
our $Charset = 'euc-jp';
use encoding 'euc-jp';

my $cgi = CGI->new();

my %Label =
    (
     name    => '名前',
     kana    => 'フリガナ',
     mailto  => '電子メール',
     mailto2 => '電子メール(確認)',
     tel     => '電話',
     fax     => 'ファックス',
     zip     => '〒',
     address => '住所',
     comment => 'ご意見',
     );


unless ($cgi->param()){
    print_input($cgi);
}else{
    my $kana = $cgi->param('kana');
    $kana =~ s/[\s　]+//g; # beware of zenkaku space!
    eval qq{ \$kana =~ tr/ぁ-ん/ァ-ン/ };
    # $kana =~ tr/ぁ-ん/ァ-ン/; # will not work but do you know why?
    $cgi->param(kana => $kana);
    print_output($cgi);
}

sub print_input{
    my $c = shift;
    print_html(
        $c,
        title   => "Form:入力",
        name    => $c->textfield(-name => 'name'),
        kana    => $c->textfield(-name => 'kana'),
        mailto  => $c->textfield(-name => 'mailto'),
        mailto2 => $c->textfield(-name => 'mailto2'),
        tel     => $c->textfield(-name => 'tel'),
        fax     => $c->textfield(-name => 'fax'),
        zip     => $c->textfield(-name => 'zip'),
        address => $c->textfield(-name => 'address'),
        comment => $c->textarea(-name => 'comment'),
    );
}

sub print_output{
    my $c = shift;
    print_html(
        $c,
        title   => "Form:出力",
        name    => $c->param('name'),
        kana    => $c->param('kana'),
        mailto  => $c->param('mailto'),
        mailto2 => $c->param('mailto2'),
        tel     => $c->param('tel'),
        fax     => $c->param('fax'),
        zip     => $c->param('zip'),
        address => $c->param('address'),
        comment => $c->param('comment'),
    );
};

sub print_html{
    my $c = shift;
    my %arg = @_;
    print
        $c->header(-charset   => $Charset),
        $c->start_html(-title => $arg{title}),
        $c->h1($arg{title});
    $c->param() or print
        $c->start_form(-method => $Method, -enctype => $Enctype);
    print
        $c->start_table({border => 1}),
        $c->Tr([
            $c->td([ $Label{name}    => $arg{name} ]),
            $c->td([ $Label{kana}    => $arg{kana} ]),
            $c->td([ $Label{mailto}  => $arg{mailto} ]),
            $c->td([ $Label{mailto2} => $arg{mailto2} ]),
            $c->td([ $Label{tel}     => $arg{tel} ]),
            $c->td([ $Label{fax}     => $arg{fax} ]),
            $c->td([ $Label{zip}     => $arg{zip} ]),
            $c->td([ $Label{address} => $arg{address} ]),
            $c->td([ $Label{comment} => $arg{comment} ]),
        ]);
    if ($c->param()){
        # SCRIPT_NAME is the standard CGI variable for this script's URL path
        print
            $c->td($c->a({href=>$ENV{SCRIPT_NAME}}, "Retry"));
    }else{
        print
            $c->td([$c->reset(), $c->submit()]),
    };
    print $c->end_form() unless $c->param();
    print
        $c->end_table(),
        $c->end_html();
}
__END__
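
The whitespace cleanup in the script above can be sketched on its own, assuming the form value has already been decoded into characters. strip_spaces is a hypothetical helper; U+3000 is spelled out so the intent survives even if the source file is re-encoded.

```perl
use strict;
use warnings;

# Removes ASCII whitespace and the zenkaku (ideographic) space U+3000.
sub strip_spaces {
    my ($text) = @_;                 # assumed: already-decoded characters
    $text =~ s/[\s\x{3000}]+//g;
    return $text;
}

# KATAKANA LETTER A and KATAKANA LETTER I, padded with both kinds of space.
my $clean = strip_spaces(" \x{30A2}\x{3000}\x{30A4} ");
print length($clean), "\n";
```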


Re: [FYI] use encoding 'non-utf8-encoding'; use CGI;

2002-10-02 Thread Dan Kogai

On Wednesday, Oct 2, 2002, at 22:15 Asia/Tokyo, Jarkko Hietaniemi wrote:
 (Hi, it's me again...)

 Are you doing character ranges in the tr/// under 'use encoding'?
 (I'm asking because I see a - in the middle of what I assume is
 mangled EUC-JP)

Yes, that's where hiragana -> katakana conversion is attempted; the 
English equivalent of tr/A-Z/a-z/.

Dan




Re: [FYI] use encoding 'non-utf8-encoding'; use CGI;

2002-10-02 Thread Dan Kogai

On Wednesday, Oct 2, 2002, at 21:51 Asia/Tokyo, Jarkko Hietaniemi wrote:
 However, I will need to stare at your example some more, since
 for simpler cases I think tr/// *is* obeying the 'use encoding':

 use encoding 'greek';
 ($a = "\x{3af}bc\x{3af}de") =~ tr/\xdf/a/;
 print $a, "\n";

 This does print "abcade\n", and it also works when I replace the \xdf
 with the literal \xdf.

I can explain that.  "\x{3af}bc\x{3af}de" is a string literal so it 
gets encoded.  However, my example in escaped form is:

   $kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/

   which does not get encoded.  the intention was;

   $kana =~ tr/\x{3041}-\x{3093}/\x{30a1}-\x{30f3}/

   That's why

   eval qq{ \$kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/ }

works: \xA4\xA1-\xA4\xF3 and \xA5\xA1-\xA5\xF3 are converted to 
\x{3041}-\x{3093} and \x{30a1}-\x{30f3}, respectively.

Dan




Re: Encode functionality for Perl 5.6.1

2002-09-21 Thread Dan Kogai

On Saturday, Sep 21, 2002, at 22:38 Asia/Tokyo, Robert Allerstorfer 
wrote:
 Hi,

 the great Encode module requires perl 5.8. Are there
 any backports existing yet that may work with 5.6.1? I am trying to
 find a solution to encode Japanese (shiftjis) and Chinese (gb2312
 and big5) into utf8 with that perl version since 5.8 is not yet used
 widely, unfortunately.

Okay, let me repeat what I have said in this mailing list before.

0)  Backporting Encode to 5.6.1 and perhaps 5.00503 was my first 
    intention when I joined (and later took over) the development thereof.
1)  Then I found the Unicode stuff in 5.6.1 is very kaputt.  At the same 
    time Encode was made very perl-5.8.0 dependent, especially in its 
    Unicode handling.
2)  So I concluded I would rather advocate perl 5.8 than spend effort 
    backporting Encode.
3)  Efforts by others to backport are welcome, provided:
   a) if it uses 'Encode' as the module name, it needs to work in both 
      5.8 and 5.6.1.  The bottom line is that the backported version 
      must not break what works now.  If it ain't broke, don't fix it 
      (and 5.6.1 was broke Unicode-wise).
   b) if you just implemented Encode functionality in perl 5.6.1 but it 
      is incompatible w/ 5.8, give it a different name, e.g. 
      Encode::Compat.
   c) at any rate, don't forget to share your idea and work here at 
      [EMAIL PROTECTED]
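
For comparison, this is roughly what the requested conversion looks like on perl 5.8 and later with the stock Encode; it is not a backport and will not run on 5.6.1. The sample bytes are an assumption chosen for illustration (two hiragana in Shift_JIS).

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Two hiragana (A, I) as Shift_JIS octets.
my $text = "\x82\xA0\x82\xA2";

# from_to() converts in place and returns the new length in octets.
my $len = from_to($text, 'shiftjis', 'utf8');

print join(' ', map { sprintf '%02X', ord } split //, $text), "\n";
```

The same call shape applies to the Chinese encodings mentioned in the question, with the encoding name swapped for whatever name Encode lists for them on the installed version.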

Dan the Encode Maintainer




[Encode] enc2xs fixes

2002-09-02 Thread Dan Kogai

On Sunday, Sep 1, 2002, at 18:10 Asia/Tokyo, Andreas J. Koenig wrote:
 Apparently I'm missing something. 'make manifest' is only good for the
 expert, because you need to know which lines you have to delete from
 MANIFEST after running 'make manifest'. That's not obvious. If you
 know a trivial way to write a correct MANIFEST file, I'd be grateful
 if you could add it. Maybe then some of my documentation changes are
 not needed.

Right.  MANIFEST generation is not as trivial as it looks and 'make 
manifest' does it trivially -- too trivially.

[snip]

 But I am not so happy if enc2xs becomes behemoth like h2xs.  Man, I
 thought I had soothed all the feeping creaturism before the release of
 5.8.0

 Please see if the patch below makes it a behemoth. It does just the
 following:

 Add Usage().
 Add Version().
 Add -h and -v.
 Die on an insufficient commandline.
 Removed ARGV from the arguments in the call to make_configlocal_pm().
 Document what find_e2x does in a comment.
 Speed up find_e2x considerably.
 Tweak documentation.

Thank you.  Now your patch is in my repository.  But since there is no 
urgency, I would like to wait for NI-XS for his tweaks/fix for the 'use 
encoding utf8' workaround.  Besides, I am a little busy taming Jaguar 
right now...

Dan the Encode Maintainer.




Re: Encode 1.76 Released

2002-08-31 Thread Dan Kogai

On Friday, August 30, 2002, at 08:48 , Andreas J. Koenig wrote:
 Hi Dan,

 today I revisited enc2xs and found three things missing:

Okay

 - enc2xs doesn't write a MANIFEST file: this would be handy as the
   innocent user doesn't know which files need to be included in a
   distribution

I reckon your suggestion is to 'let enc2xs generate any missing files 
that are enough to CPANize the encoding'.  is there any other file 
missing?  MANIFEST autogeneration is trivial but I still prefer 'make 
manifest'

 - no -h or --help option

I'm not sure enc2xs is used frequently enough to call for -h, but there 
is no reason not to add one.  Maybe detailed info on -M and a very, very 
brief description of -o and such that are not supposed to be invoked by 
humans (even I don't do that except for debugging).

 - no -v or --version option

This should return the version of Encode.pm as well as enc2xs itself.

 I'd volunteer to add all that if you'd be inclined to accept (and
 proofread) it. Please let me know what you think.

I am glad you help me out (well, your trust level is so high that you 
can do all the work and all I do is put your new version to my 
repository -- and claim the credit (c) ams).

But I am not so happy if enc2xs becomes behemoth like h2xs.  Man, I 
thought I had soothed all the feeping creaturism before the release of 
5.8.0

Dan the Patient of Feeping Creaturism




Re: translating the Perl 5.8.0 announcement to CJK

2002-07-19 Thread Dan Kogai

On Friday, July 19, 2002, at 04:27 AM, Jarkko Hietaniemi wrote:
 Final round of proofreadings, if I may:

 http://www.iki.fi/jhi/pl580.txt.big5.tw
 http://www.iki.fi/jhi/pl580.txt.euc.cn
 http://www.iki.fi/jhi/pl580.txt.euc.jp
 http://www.iki.fi/jhi/pl580.txt.euc.kr

Looks like miyagawa-kun's patch is already in for .jp.  Looks good.  
Let it be the final version.

Dan




Re: translating the Perl 5.8.0 announcement to CJK

2002-07-17 Thread Dan Kogai

jhi and any that grok Nihongo,

On Wednesday, July 17, 2002, at 02:41 AM, Jarkko Hietaniemi wrote:
 Neat.

 I now have the Chinese and Korean ones online:

   http://www.iki.fi/jhi/pl580.txt.cn
   http://www.iki.fi/jhi/pl580.txt.tw
   http://www.iki.fi/jhi/pl580.txt.kr

Here is the Japanese version at last.

http://www.dan.co.jp/~dankogai/bleedperl/pl580.txt.utf8.jp
http://www.dan.co.jp/~dankogai/bleedperl/pl580.txt.euc.jp

I decided to post this to [EMAIL PROTECTED] because I have a fever and 
my brain runs at a single-digit percentage of its capacity, so I am 
not as sure of the quality of the document as usual.  If you are able to 
grok Japanese please feel free to polish the doc.  Yoroshiku 
Onegaishimasu!

Dan the Sneezing Translator of Yours




Re: translating the Perl 5.8.0 announcement to CJK

2002-07-17 Thread Dan Kogai

On Thursday, July 18, 2002, at 02:31 AM, Jarkko Hietaniemi wrote:

 Notice the name changes.  I also edited away the DJGPP broken entry
 since that's now fixed.  I notice that the VMS section in the .jp one
 is untranslated, and the .kr is still missing the new Unicode section.

 http://www.iki.fi/jhi/pl580.txt.big5.tw
 http://www.iki.fi/jhi/pl580.txt.euc.cn
 http://www.iki.fi/jhi/pl580.txt.euc.jp
 http://www.iki.fi/jhi/pl580.txt.euc.kr

Thus fixed.  I've also corrected linefeeds so it looks better via web 
browsers.  Get the newer version via

http://www.dan.co.jp/~dankogai/bleedperl/pl580.txt.euc.jp
http://www.dan.co.jp/~dankogai/bleedperl/pl580.txt.utf8.jp

Dan the Translator of Yours




Re: translating the Perl 5.8.0 announcement to CJK

2002-07-17 Thread Dan Kogai

On Thursday, July 18, 2002, at 01:34 PM, Autrijus Tang wrote:
 I'm extremely sorry to bother you again with nitpicking, but that
 version did not contain the TraditionalSimplified Chinese fix I
 posted earlier.

 http://www.autrijus.org/tmp/pl580.txt.euc.jp
 http://www.autrijus.org/tmp/pl580.txt.utf8.jp

 Has it corrected.

now

http://www.dan.co.jp/~dankogai/bleedperl/pl580.txt.euc.jp
http://www.dan.co.jp/~dankogai/bleedperl/pl580.txt.utf8.jp

are identical to the ones above (I've lwp-downloaded'em :).  Xiexie and 
Kiitos.

Dan the Corrected Man




[Encode] 1.75 Released

2002-06-01 Thread Dan Kogai

On Sunday, June 2, 2002, at 02:24 AM, Jarkko Hietaniemi wrote:
 I say GRRR.  Be quick about it.  I'm already wrapping things up,
 but I guess a few aliases won't break this camel's back too much.

Ignorance is Bliss, the phrase hit me when I released Encode-1.75, 
available as follows;

Whole:
http://www.dan.co.jp/~dankogai/Encode-1.75.tar.gz and CPAN
Diff against current: (191 lines)
http://www.dan.co.jp/~dankogai/current-1.75.diff.gz

And Changes

$Revision: 1.75 $ $Date: 2002/06/01 18:07:49 $
! lib/Encode/Alias.pm t/Alias.t lib/Encode/Supported.pod TW/TW.pm
   glibc compliance cited by Autrijus.
   http://www.li18nux.org/docs/html/CodesetAliasTable-V10.html
! bin/enc2xs bin/piconv
   Subject: Re: forewarning: usedevel and versiononly
   Message-Id: [EMAIL PROTECTED]

Autrijus, please for pumpkin's sake don't find anything too irresistible 
to leave it unfixed :-P

Dan the Encode Maintainer




Re: ICU and Parrot

2002-05-31 Thread Dan Kogai

On Saturday, June 1, 2002, at 12:34 AM, Autrijus Tang wrote:
 On Fri, May 31, 2002 at 06:18:55AM +0900, Dan Kogai wrote:
 As a matter of fact GB18030 is ALREADY supported via Encode::HanExtra 
 by
 Autrijus Tang.  The only reason GB18030 was not included in Encode main
 is sheer size of the map.

 Yes, partly because it was not implemented algorithmically. :)

 I was browsing http://www-124.ibm.com/cvs/icu/charset/data/ucm/ and 
 toying
 with uconv, and wondered:

 1) Does Encode have (or intend to have) them all covered?

No, unless they appear at www.unicode.org, though some of them are 
actually adopted.  Useful as it may be, I found raw ICM too Big and too 
Blue :)

 2) If not, would a Encode::ICU be wise?

I'm not so sure.  But if I were the one to implement Encode::ICU, it 
would not be just a compiled collection of UCM files but a wrapper to all 
the library functions that ICU has to offer.  I, for one, am too lazy for 
that.

 3) A number of encodings are in HanExtra but not their ucm repository,
namedly big5plus, big5ext and cccii. Is is wise to feed back to them
under the name of e.g. perl-big5plus.ucm?

You should in time and I should, too, because I have expanded UCM a 
little so that you can define combined characters commonly seen in 
Mac*.  But I don't see any reason to be in hurry for the time being.

If any of you are a member of team ICU you may redirect this dialogue to 
your team so we can work together in future (after 5.8.0, that is).

Dan the Encode Maintainer




Re: [PATCH] Encode::MIME::Header

2002-05-20 Thread Dan Kogai

On Monday, May 20, 2002, at 11:39 AM, Tatsuhiko Miyagawa wrote:
 charsets can include _ in its name. Here's a patch.

Thanks, applied.  With patches from Autrijus and you I think I now 
have enough diff to justify the version increment of Encode.  Next 
version within 24 hours.  Oh, VMS is still on the to-do list...

Dan the Encode Maintainer

 --
 Tatsuhiko Miyagawa [EMAIL PROTECTED]


 --- Header.pm~Sun May  5 01:41:30 2002
 +++ Header.pm Mon May 20 11:34:39 2002
 @@ -51,7 +51,7 @@
  $str =~
   s{
   =\?  # begin encoded word
 - ([0-9A-Za-z\-]+) # charset (encoding)
 + ([0-9A-Za-z\-_]+) # charset (encoding)
   \?([QqBb])\? # delimiter
   (.*?)# Base64-encodede contents
   \?=  # end encoded word
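
To see what the one-character change buys, here is a small sketch that runs the old and the new charset pattern against a header whose charset name contains an underscore. The helper name first_encoded_word and the sample header are made up for illustration; only the two character classes come from the patch.

```perl
use strict;
use warnings;

# Charset pattern before and after the patch (variable names are mine).
my $old_charset = qr/[0-9A-Za-z\-]+/;
my $new_charset = qr/[0-9A-Za-z\-_]+/;

# Return (charset, encoding) of the first encoded word, or an empty list.
sub first_encoded_word {
    my ($charset_re, $header) = @_;
    return $header =~ /=\?($charset_re)\?([QqBb])\?(.*?)\?=/ ? ($1, $2) : ();
}

my $header = 'Subject: =?Shift_JIS?B?k/qWe4zq?=';
my @old = first_encoded_word($old_charset, $header);   # empty: "_" stops the match
my @new = first_encoded_word($new_charset, $header);   # charset and encoding
print "old: ", scalar(@old), " fields; new: @new\n";
```

With the old class the charset has to end at "Shift", which is not followed by "?", so the whole encoded word is skipped.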




Re: Acceptance of Unicode (UTF8) in Far East

2002-05-15 Thread Dan Kogai

On Thursday, May 16, 2002, at 03:04 AM, Mark Lewellen wrote:
 Hi all-
   I have a question directed mostly at those involved in the Far East.
 Since Unicode is often implemented in UTF8, and UTF8 uses 3 bytes
 for Chinese characters (instead of the 2 bytes in Chinese and Japanese
 GB, Big5, JIS), UTF8 documents solely in these languages will be 50%
 larger.  This appears to be a large stumbling block to universal
 acceptance of UTF8.  Is there much resistance to UTF8 in the
 Far East, are there work-arounds to the problem, and are many
 people even aware of the problem?
 Mark

Size of data is not a big deal these days with data compression and 
faster networks.  So far as I can see there are very few who dislike UTF-8 
because of the size bloat.  Most objections to and dislikes of Unicode 
are more about politics and culture.
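
For reference, the 50% figure in the question is easy to reproduce with Encode. This is only a sketch with a three-kanji sample string chosen for illustration; real documents mix in ASCII and punctuation, so the actual ratio varies.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Three kanji, all within JIS X 0208 (sample text is an assumption).
my $text = "\x{5c0f}\x{98fc}\x{5f3e}";

my $euc_bytes  = length encode('euc-jp', $text);   # 2 bytes per character
my $utf8_bytes = length encode('UTF-8',  $text);   # 3 bytes per character

printf "euc-jp: %d bytes, UTF-8: %d bytes (%.0f%% larger)\n",
    $euc_bytes, $utf8_bytes, 100 * ($utf8_bytes / $euc_bytes - 1);
```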

Whether you like it or not, Unicodization is steadily advancing because 
it is already blessed by Windows and MacOS (X).  And you have virtually no 
choice but to use Unicode when you program in Java.  But the 
Unicodization of applications has only begun.  UTF-8 mails and web 
pages are still rare mainly because of lack of tools (well, as a matter 
of fact many of these tools do support Unicode but simply don't make 
UTF-8 the default when they send or save data).

And even if the tools are there it may still take a long time before data 
get converted to UTF-8.  Unless you need to save more than 3 languages, 
legacy encodings do suffice, and many may still choose to save new data 
in legacy encodings for legacy applications.

To me it is okay to save your data in whichever encoding you choose, 
so long as I can read it.  That's why I became the maintainer of the 
Encode module, a standard part of Perl 5.8 that enables you to do so.

Dan the Encode Maintainer




Re: use encoding in both scripts and modules

2002-05-06 Thread Dan Kogai

On Monday, May 6, 2002, at 05:16 , Tatsuhiko Miyagawa wrote:
 panic happens while hacking with encoding pragma.

It seems use encoding is still in effect after you 'use EncBar'.  
Simply commenting out 'use encoding 'euc-jp'' in encoding-test.pl makes 
the program work as expected.

Dan




Re: [PATCH] Encode::Encoding

2002-05-06 Thread Dan Kogai

On Monday, May 6, 2002, at 06:51 , Tatsuhiko Miyagawa wrote:

   package Encode::MyEncoding;
   use base qw(Encode::Encoding);

   __PACKAGE__->Define(qw(myCanonical myAlias));

 dies saying:

Error:  Undefined subroutine Encode::define_encoding called at ...

 Patch follows after sig.

Thanx.  Applied.
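
For readers who have not subclassed Encode::Encoding before, the following is a minimal sketch of the pattern being discussed, modelled on the ROT13 example in the Encode::Encoding documentation. The package name Encode::MyRot13 and the encoding name x-my-rot13 are hypothetical. Loading Encode before calling Define sidesteps the "Undefined subroutine" error quoted above.

```perl
use strict;
use warnings;
use Encode qw(encode decode);   # loads Encode, so define_encoding() exists

{
    package Encode::MyRot13;    # hypothetical module name
    use parent qw(Encode::Encoding);

    # Register the (made-up) canonical name with Encode.
    __PACKAGE__->Define('x-my-rot13');

    sub encode {
        my ($obj, $str, $chk) = @_;
        $str =~ tr/A-Za-z/N-ZA-Mn-za-m/;
        $_[1] = '' if $chk;     # consume the source when CHECK is set
        return $str;
    }

    # ROT13 is its own inverse, so decoding reuses encode().
    sub decode { my $obj = shift; return $obj->encode(@_) }
}

my $octets = encode('x-my-rot13', 'Hello');
my $back   = decode('x-my-rot13', $octets);
print "$octets $back\n";
```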

Dan the Encode Maintainer




Re: [preannounce] Encode::Punycode

2002-05-06 Thread Dan Kogai

On Monday, May 6, 2002, at 07:11 , Tatsuhiko Miyagawa wrote:
 I've just made Encode implementation for Punycode[1]. (Does it make
 any sense to make such an encodings as subclass of Encode::Encoding? I
 think it's reasonable, as there's Encode::MIME::Header!)

I bet you do that sooner or later.  Thanks!  As for module hierarchy, 
your choice is perfectly valid and it is even NI-XS recommendation.

Dan the Encode Maintainer




[Encode] 1.69 Released

2002-05-04 Thread Dan Kogai

I hope it was not too premature to release Encode-1.69, now available as 
follows.

Whole:
http://www.dan.co.jp/~dankogai/Encode-1.69.tar.gz and CPAN
Diff against current: 180 lines
http://www.dan.co.jp/~dankogai/current-1.69.diff.gz

And here are Changes

$Revision: 1.69 $ $Date: 2002/05/04 16:41:18 $
! lib/Encode/MIME/Header.pm
   Floating-point coerced for UNICOS (in integer arithmetics it folds
   line one character too early).  Verification by Mark is pending.
   Message-Id: [EMAIL PROTECTED]
! Unicode/Unicode.pm
   more doc patch from Elizabeth
   Message-Id: [EMAIL PROTECTED]
! Encode/Makefile_PL.e2x
   More platform-independent patch from Benjamin
   Message-Id: [EMAIL PROTECTED]
! lib/Encode/Guess AUTHORS
   split regex fix by Graham Barr.  Adds him to AUTHORS.
   Message-Id: [EMAIL PROTECTED]
! Encode/Makefile_PL.e2x
   enc2xs script discovery made smarter and more sensible, first cited
   by Miyagawa-kun and further suggestions by Rafael and Andreas
! Encode.pm lib/Encode/Guess.pm t/fallback.t t/guess.t t/mime-header.t
   The EBCDIC remapping of the low 256 bytes again #16372 by jhi

UNICOS needs verification by Mark or others.  I am fairly sure of the 
cause of failure and equally sure of the fix.  But I am not sure if 
UNICOS groks what I mean...

Dan the Encode Maintainer

P.S.  I think we are very, very close to 5.8.0-RC1 but will we make it 
on May 8th so we can claim we are only a month behind ?




[Encode] 1.68 Released

2002-05-03 Thread Dan Kogai

I was delighted to add the first female to AUTHORS when I released 
Encode 1.68, available as follows;

Whole:
http://www.dan.co.jp/~dankogai/Encode-1.68.tar.gz
Diff against current: 106 lines
http://www.dan.co.jp/~dankogai/current-1.68.diff.gz

Changes is just one paragraph long.

$Revision: 1.68 $ $Date: 2002/05/03 12:20:13 $
! lib/Encode/Alias.pm lib/Encode/Supported.pod t/Alias.t AUTHORS
   UCS-4 added to aliases of UTF-32 by Elizabeth Mattijsen.  Alias.t
   and Supported.pod modified to reflect the change.  Elizabeth added
   to Authors.  And H.M. is also added for forwarding her patch among
   other contributions (I was rather surprised to find his name was not
   there yet!)
   Message-Id: [EMAIL PROTECTED]
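
A quick way to check an alias like this one is to ask Encode which canonical name it resolves to. A minimal sketch, assuming an Encode that includes this alias:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# find_encoding() resolves aliases and returns the encoding object.
my $enc  = find_encoding('UCS-4');
my $name = $enc ? $enc->name : '(not found)';
print "UCS-4 resolves to $name\n";
```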

...If there is one kind of diversity that is lacking in Perl, it is 
definitely sex ratio.  In terms of the sheer number of sexes it is already 
more diverse than the ordinary world, for Perl mongers have female, male, 
and the Borg :P

Dan the Encode Maintainer / the Equal Opportunity Whippee




[Encode] 1.67 released

2002-05-02 Thread Dan Kogai

I was wondering how Laszlo's doing when I released Encode 1.67, available as 
follows.

Whole:
http://www.dan.co.jp/~dankogai/Encode-1.67.tar.gz and CPAN
Diff against current: 147 lines
http://www.dan.co.jp/~dankogai/current-1.67.diff.gz

And Changes.  As you see changes are just cosmetic.

$Revision: 1.67 $ $Date: 2002/05/02 07:33:09 $
! Encode.xs
   Error message now consistent w/ perlqq (\N{U+} -> \x{})
   done in perl@16308 but Philip linted me further.  Now the error
   messages are macronized as ERR_ENCODE_NOMAP and ERR_DECODE_NOMAP
! lib/Encode/Guess.pm
   Sanity check for happier -w by Autrijus

Dan the Encode Maintainer




Encode-InCharset-0.01 Released

2002-05-02 Thread Dan Kogai

I have just released Encode-InCharset-0.01, available as

  http://www.dan.co.jp/~dankogai/Encode-InCharset-0.01.tar.gz and CPAN.

I have developed this module primarily to implement ISO-2022-JP-3 and 
ISO-2022-CN in future.  To implement encode() in these, you have to know 
which character set a given character belongs to.  But this module can also 
be used to check if a string can safely be encoded
(though fallback is much faster).

Dan the Encode Maintainer

NAME
 Encode::InCharset - defines \p{InCharset}

INSTALL
 perl Makefile.PL
 make test && make install

SYNOPSIS
   use Encode::InCharset qw(InJIS0208);
   "I am \x{5c0f}\x{98fc}\x{3000}\x{5f3e}" =~ /(\p{InJIS0208})+/o;
   # guess what is in $1

ABSTRACT
 This module provides the In*Charset* Unicode property that matches
 characters in *Charset*.

 As of this writing, property-matching functions are auto-generated
 out of ucm files in Encode, Encode::HanExtra, and Encode::JIS2K.

DESCRIPTION
 As of this writing, this module supports character properties shown
 below. Since names are self-explanatory I am not going to discuss in
 details.

   InASCII InAdobeStandardEncoding InAdobeSymbol InAdobeZdingbat
   InBIG5EXT InBIG5PLUS InBIG5_ETEN InBIG5_HKSCS InCCCII InCP1006
   InCP1026 InCP1047 InCP1250 InCP1251 InCP1252 InCP1253 InCP1254
   InCP1255 InCP1256 InCP1257 InCP1258 InCP37 InCP424 InCP437 InCP500
   InCP737 InCP775 InCP850 InCP852 InCP855 InCP856 InCP857 InCP860
   InCP861 InCP862 InCP863 InCP864 InCP865 InCP866 InCP869 InCP874
   InCP875 InCP932 InCP936 InCP949 InCP950 InDingbats InEUC_CN
   InEUC_JISX0213 InEUC_JP InEUC_KR InEUC_TW InGB12345 InGB18030
   InGB2312 InGSM0338 InHp_Roman8 InISO_8859_1 InISO_8859_10
   InISO_8859_11 InISO_8859_13 InISO_8859_14 InISO_8859_15
   InISO_8859_16 InISO_8859_2 InISO_8859_3 InISO_8859_4 InISO_8859_5
   InISO_8859_6 InISO_8859_7 InISO_8859_8 InISO_8859_9 InISO_IR_165
   InJIS0201 InJIS0208 InJIS0212 InJIS0213_1 InJIS0213_2 InJohab
   InKOI8_F InKOI8_R InKOI8_U InKSC5601
   InMacArabic InMacCentralEurRoman InMacChineseSimp InMacChineseTrad
   InMacCroatian InMacCyrillic InMacDingbats InMacFarsi InMacGreek
   InMacHebrew InMacIcelandic InMacJapanese InMacKorean InMacRoman
   InMacRomanian InMacRumanian InMacSami InMacSymbol InMacThai
   InMacTurkish InMacUkrainian InNextstep InPOSIX_BC InShift_JIS
   InShift_JISX0213 InSymbol InVISCII

   EXPORT

   # will import all of them
   use Encode::InCharset;
   # will import only properties in qw()
   use Encode::InCharset qw(InCharset...)

SEE ALSO
 the Encode manpage, the perlunicode manpage

AUTHOR
 Dan Kogai [EMAIL PROTECTED]

COPYRIGHT AND LICENSE
 Copyright 2002 by Dan Kogai

 This library is free software; you can redistribute it and/or modify
 it under the same terms as Perl itself.

 See http://www.perl.com/perl/misc/Artistic.html




Re: [PATCH] Let Guess.pm handles uninitialized argument.

2002-05-01 Thread Dan Kogai

On Wednesday, May 1, 2002, at 09:19 , Autrijus Tang wrote:
 This way is self-descriptory; it makes -w happier. :)

 /Autrijus/

XieXie.  Applied.

Dan the Encode Maintainer




Encode, charnames and utf8heavy

2002-05-01 Thread Dan Kogai

On Wednesday, May 1, 2002, at 10:30 , Jarkko Hietaniemi wrote:
 Thanks, upgraded.

 A bit of noise from ext/PerlIO/t/fallback.t:

 ./perl -Ilib ext/PerlIO/t/fallback.t
 1..8
 ok 1 - opened iso-8859-1 file
 \N{U+20ac} does not map to iso-8859-1 at ext/PerlIO/t/fallback.t line 
 21.
 ok 2 - perlqq escapes
 ok 3 - opened iso-8859-1 file
 ok 4 - HTML escapes
 ok 5 - Opened as ASCII
 # 5c
 ok 6 - Escaped non-mapped char
 ok 7 - Opened as ASCII
 # fffd
 ok 8 - Unicode replacement char

 Also, is it intentional that there is no \N{U+} syntax...?
 That was planned at some point but as of there is no such thing

Okay,  I'll change the error message in the next one so it would say

\x{abcd} does not map to iso-8859-1 at ext/PerlIO/t/fallback.t line 21.

Autrijus just sent me a patch so it won't take long.

 ./perl -Ilib -Ilib -Mcharnames=:full -e '\N{U+20ac}'
 Unknown charname 'U+20ac' at lib/unicore/Name.pl line 1

 Why not just use \x{...}?  If that's PERLQQ, that's what
 I would expect?

Speaking of charnames and utf8heavy, charnames::viacode() is incredibly 
slow (I tried to use it extensively to pretty-comment ucm files.  I gave 
up and used a quicker and dirtier approach originally by NI-XS) and I 
don't really like how unicore/ is laid out.  We can at least make use of 
AnyDBM_File (the key-value pairs needed there are totally SDBM_File safe 
so we can safely use it!) or if we can spend more memory, Storable.

return <<'END'
0   
END

is totally counterintuitive and the whitespace in between must be 
exactly a single '\t' and that sucks (I was left wondering why my test 
script on InMyOwnDefinition didn't work as expected).
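
To make the format concrete, here is a minimal sketch of a user-defined property in the style perlunicode describes: a sub whose name starts with "In" and which returns one "start TAB end" range per line, in hex. The name InMyKana and its ranges are my own illustration. Spelling the tab as "\t" in a double-quoted string avoids the invisible-whitespace trap.

```perl
use strict;
use warnings;

# A user-defined property: one "start<TAB>end" range per line, in hex.
# InMyKana is a made-up name covering the Hiragana and Katakana blocks.
sub InMyKana {
    return "3040\t309F\n" .
           "30A0\t30FF\n";
}

# Keep only the strings made up entirely of characters in the property.
my @hits = grep { /^\p{InMyKana}+$/ } ("\x{30a2}\x{3044}", "abc");
print scalar(@hits), " string(s) matched\n";
```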

I would like to make this a 5.8.1 todo of mine.

Dan the Encode Maintainer




Re: Encode, charnames and utf8heavy

2002-05-01 Thread Dan Kogai

On Wednesday, May 1, 2002, at 10:57 , Dan Kogai wrote:
 Okay,  I'll change the error message in the next one so it would say

 \x{abcd} does not map to iso-8859-1 at ext/PerlIO/t/fallback.t line 
 21.

 Autrijus just sent me a patch so it won't take long.

Done in my repository.

Was
  piconv5.7.3 -c -f utf8 -t ascii t/jisx0201.utf
 \N{U+ff61} does not map to ascii, 134 at 
 /home/dankogai/lib/perl5/5.7.3/i386-freebsd/Encode.pm line 175,  line 
 1.

Is
  bleedperl -Mblib `which piconv5.7.3` -c -f utf8 -t ascii 
 t/jisx0201.utf
 \x{ff61} does not map to ascii at 
 /usr/home/dankogai/work/Encode/blib/lib/Encode.pm line 175,  line 1.

Is there anything I should fix before Encode 1.67 ? (ahem, besides djgpp 
which I am still waiting for the news from Laszlo)

Dan the Encode Maintainer




Re: [Encode] euc-jp vs euc-jisx0213

2002-04-30 Thread Dan Kogai

On Monday, April 29, 2002, at 07:38 , SADAHIRO Tomoyuki wrote:
 I doubt whether users of 'euc-jp' will
 assume it to be a combination with JIS X 0213.

They don't have to because 'euc-jp' behaves exactly the same as before 
so long as the charset is in ASCII/JISX(0201|0208|0212).

 Such a mixing would prevent warning/croaking
 for appearance of code points that are not defined
 originally (meaning w/o X 0213), wouldn't it?

That was my biggest concern but I have decided to go ahead with euc-jp 
to (partially) support JIS X 0213 and the reason is simple;  Encode::JP 
is already too big to differentiate between various euc-jp.  In such 
cases, we should settle for the most 'comprehensive' version.

Even the term 'euc-jp' is too ambiguous for many;  At first it didn't 
include G3 and some say they must be clearly marked as something like 
'euc-jp-classic' (no 0212 support) vs 'euc-jp-modern' and so forth (then 
our current euc-jp should be marked as 'euc-jp-postmodern' :).  It would 
be nice if we can go that way like 7bit-JIS/ISO-2022-JP/ISO-2022-JP-1 
but for euc-jp we have to have a whole ucm for each.

This is definitely a todo for Perl 5.8.1 and up and I have already come 
up with a solution;  the future Encode (Encode II) will support 
CES-generator;  that is, you can express euc-jp not as a whole big 
table but a combination of tables.  That will also reduce the duplicates 
found in vendor mappings.  It will be a complete rewrite of encengine.c

But that requires not only codes but the expansion of UCM format so give 
me more time (and Perl 5.8.0!)

Dan the Encode Maintainer




[Encode] Encode-JIS2K-0.01 uploaded to CPAN

2002-04-30 Thread Dan Kogai

Folks,

   I gotta go in 5 minutes so I just dump the README file after the sig.

Dan the Encode Maintainer

NAME
Encode::JIS2K - JIS X 0213 (aka JIS 2000) Encodings

INSTALLATION

To install this module type the following:

perl Makefile.PL
make
make test
make install

SYNOPSIS
  use Encode::JIS2K;
  use Encode qw/encode decode/;
  $euc_2k = encode("euc-jisx0213", $utf8);
  $utf8   = decode("euc-jisx0213", $euc_jp);

ABSTRACT
This module implements encodings that cover the JIS X 0213
charset (AKA JIS 2000, hence the module name).  Encodings
supported are as follows.

  Canonical      Alias                              Description
  --------------------------------------------------------------------
  euc-jisx0213   qr/\beuc.*jp[ \-]?(?:2000|2k)$/i   EUC-JISX0213
                 qr/\bjp.*euc[ \-]?(2000|2k)$/i
                 qr/\bujis[ \-]?(?:2000|2k)$/i
  shiftjisx0213  qr/\bshift.*jis(?:2000|2k)$/i      Shift_JISX0213
                 qr/\bsjis[ \-]?(?:2000|2k)$/i
  iso-2022-jp-3
  jis0213-1-raw                                     JIS X 0213 plane 1, raw format
  jis0213-2-raw                                     JIS X 0213 plane 2, raw format
  --------------------------------------------------------------------


DESCRIPTION
To find out how to use this module in detail, see the
Encode manpage.

what is JIS X 0213 anyway?
Simply put, JIS X 0213 is a rework and reorganization of
JIS X 0208 and JIS X 0212.  It consists of two 94x94
planes which roughly correspond as follows;

  JIS X 0213 Plane 1 = JIS X 0208 + extension
  JIS X 0213 Plane 2 = JIS X 0212 reorganized + extension

And here is the character repertoire there of at a glance.

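                    # of codepoints   Kuten Ku (rows) used
  ----------------------------------------------------------------
  JIS X 0208        6,879             1..8,16..83
  JIS X 0213-1      8,762             1..94 (all!)
  JIS X 0212        6,067             2,6..7,9..11,16..77
  JIS X 0213-2      2,436             1,3..5,8,12..15,78..94
  ----------------------------------------------------------------
  (JIS X0213 Total) 11,197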

JIS X 0213 was designed to extend JIS X 0208 and JIS X
0212 without being incompatible with (classic) EUC-JP and
Shift_JIS.  The following characteristics are a result
thereof.

o JIS X 0213 plane 1 is (almost) a superset of JIS X 0208.
  However, with Unicode 3.2.0 the mappings differ in 3
  codepoints.

    Kuten   JIS X 0208 -> Unicode          JIS X 0213 -> Unicode
    ----------------------------------------------------------------
    1-1-17  UFFE3 # FULLWIDTH MACRON       U203E # OVERLINE
    1-1-29  U2014 # EM DASH                U2015 # HORIZONTAL BAR
    1-1-79  UFFE5 # FULLWIDTH YEN SIGN     U00A5 # YEN SIGN

o By the same token, JIS X 0213 plane 2 contains JIS Dai-4
  Suijun Kanji (JIS Kanji Repertoire Level 4).  This
  allows EUC-JP's G3 to contain both JIS X 0212 and JIS X
  0213 plane 2.

  However, JIS X 0212:1990 already contains many of Dai-4
  Suijun Kanji so EUC's G3 is subject to containing
  duplicate mappings.

o Because of Halfwidth Katakana, Shift_JIS mapping has
  been tricky and it is even trickier.  Here is a regex
  that matches Shift_JISX0213 sequence (note: you have to
  use bytes to make it work!)

    $re_valid_shifjisx0213 =
      qr/^(?:
         [\x00-\x7f] |                             # ASCII or
         [\xa1-\xdf] |                             # JIS X 0201 KANA or
         [\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc]  # JIS X 0213
         )+$/xo;
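
A short usage sketch of the pattern above, applied to raw octets. To keep it runnable without Encode::JIS2K it builds its test data with the plain 'shiftjis' encoding from Encode::JP, which is an assumption of this example; the variable names are mine.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The README's pattern, applied to raw octets (not decoded characters).
my $re_valid = qr/^(?:
    [\x00-\x7f] |                              # ASCII
    [\xa1-\xdf] |                              # JIS X 0201 kana
    [\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc]   # double-byte sequence
)+$/x;

my $good = encode('shiftjis', "abc\x{5c0f}\x{98fc}");   # ASCII plus two kanji
my $bad  = "\x81";                                      # lead byte with no trail byte

print "good: ", ($good =~ $re_valid ? "valid" : "invalid"), "\n";
print "bad:  ", ($bad  =~ $re_valid ? "valid" : "invalid"), "\n";
```

The check is structural only: it confirms that every byte falls into a legal position, not that each double-byte sequence is actually assigned a character.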

Note on EUC-JISX0213 (vs. EUC-JP)

As of Encode-1.64, 'euc-jp' does support euc-jisx0213 for
decoding.  However, 'euc-jp' in Encode and 'euc-jisx0213'
differ as follows;

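                      euc-jp                    euc-jisx0213
  ----------------------------------------------------------------
  Decodes             (0201-K|0208|0212|0213)   ditto
  Round-Trip  (|0)    (0201-K|0208|0212)        JIS X (0201-K|0213)
  Decode Only (|3)    those only found in 0213  those only found in 0212
  ----------------------------------------------------------------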

AUTHORS
Dan Kogai [EMAIL PROTECTED]

COPYRIGHT
Copyright 2002 by Dan Kogai [EMAIL PROTECTED].

This program is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.

See http://www.perl.com/perl/misc/Artistic.html

SEE ALSO
the Encode manpage, the Encode::JP

Re: Encode doesn't like undef

2002-04-30 Thread Dan Kogai

On Tuesday, April 30, 2002, at 07:14 , Paul Marquess wrote:
 This is with Encode 1.64

 $ perl5.7.3 -w -MEncode -e 'Encode::encode_utf8(undef)'
 Use of uninitialized value in subroutine entry at
 /tmp/bleed/lib/perl5/5.7.3/sun4-solaris/Encode.pm line 183.

 I don't know Encode well enough to check if there are any other places 
 this
 will strike.

I think we'd better leave it that way;  It needs a PV to (en|de)code so 
consider this a feature.  Of course

perl5.7.3 -w -MEncode -e 'Encode::encode_utf8()'

is perfectly safe and legal.

Dan the Encode Maintainer




Re: Encode doesn't like undef

2002-04-30 Thread Dan Kogai

On Tuesday, April 30, 2002, at 11:42 , Paul Marquess wrote:
 I agree that passing undef() to one of the encoding functions may be an 
 edge
 condition too far, but passing a variable that contains undef is more
 common.

 $ perl5.7.3 -w -MEncode -e 'Encode::encode_utf8($a)'
 Name main::a used only once: possible typo at -e line 1.
 Use of uninitialized value in subroutine entry at
 /tmp/bleed/lib/perl5/5.7.3/sun4-solaris/Encode.pm line 183.

 Can this be detected & silenced?

You've got a point.  Warning should warn when and only when there is a 
danger therein and passing undef itself is harmless.  And this can be 
done easily by adding defined $str or return; for each sub concerned.  
Okay, I'll go for that.
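
The guard I have in mind looks roughly like this when written as a wrapper outside Encode. The sub name encode_utf8_if_defined is hypothetical; this is a sketch of the idiom, not the actual change to Encode.pm.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper: pass undef through quietly instead of handing
# it to the encoder.
sub encode_utf8_if_defined {
    my ($str) = @_;
    defined $str or return;
    return Encode::encode_utf8($str);
}

my $octets = encode_utf8_if_defined("\x{e9}");   # e-acute becomes two octets
my $none   = encode_utf8_if_defined(undef);      # stays undef
printf "%d octets, %s\n", length $octets, defined $none ? "defined" : "undef";
```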

Dan the Encode Maintainer




Encode should stay undefphobia

2002-04-30 Thread Dan Kogai

On Wednesday, May 1, 2002, at 02:10 , Nick Ing-Simmons wrote:
 Dan Kogai [EMAIL PROTECTED] writes:

 Please don't.

 $a =~ tr/A/a/;

 gives a warning so should encode/decode.

How could I be so dumb as not to anticipate you saying that! (Blame it on 
the fever).  Paul, I now think Nick's got more points than yours so I will 
revert it in the next version.  Maybe I will document this undef-phobia 
of Encode subs in the POD

Dan the Warned Man




[Encode] 1.66 Released

2002-04-30 Thread Dan Kogai

My fever is down at last when I released Encode-1.66, available as 
follows;

Whole:
http://www.dan.co.jp/~dankogai/Encode-1.66.tar.gz or CPAN
Diff against current: 264 lines
http://www.dan.co.jp/~dankogai/current-1.66.diff.gz

And Changes.

$Revision: 1.66 $ $Date: 2002/05/01 05:41:06 $
! Encode.xs t/fallback.t
   WARN_ON_ERR no longer assumes RETURN_ON_ERR so you can issue a warning
   while fallback is in effect.  This even came with a welcome side-effect
   of cleaner code with less nests!  Thank you, NI-XS.  t/fallback.t is
   also modified to test this.
   And of course, the corresponding variables to UV[Xx]f are
   appropriately cast.  This should've concluded NI-XS homework.
! Encode.pm
   encode(undef) does warn again!  Repented upon suggestion by NI-XS.
   Document for unless vs. '' added
   Message-Id: [EMAIL PROTECTED]

As you see, this is a NI-XS homework issue.  Now I have only djgpp 
left (I think; djgpp is just so slow in my environment).

Dan the Encode Maintainer




http://bleedperl.dan.co.jp:8080/

2002-04-27 Thread Dan Kogai

I have set up an experimental mod_bleedperl server whose URI is shown in 
the subject.
To demonstrate the power of Perl 5.8, I have written a small cgi/pl (.pl 
runs on Apache::Registry) called piconv.pl, a web version of piconv(1).

http://bleedperl.dan.co.jp:8080/piconv/
(Don't forget :8080; it's not run as root!)

What's so funny is that this service can be used to 'asciify' non-ascii 
web pages.  Bart's idea of HTMLCREF is fully exploited here.  To find it 
out, try

http://bleedperl.dan.co.jp:8080/piconv/piconv.pl?f=euc-jp&t=ascii&u=www.yahoo.co.jp

Then

http://bleedperl.dan.co.jp:8080/piconv/piconv.pl?f=euc-jp&t=ascii&o=plain&u=www.yahoo.co.jp

Dan the Network Consultant by Trade




Unicode::Unihan 0.01 uploaded to CPAN

2002-04-26 Thread Dan Kogai

I have made a perl module called Unicode::Unihan, a module which makes 
accessing the Unihan DB very easy.  Readme after my sig.

As for the copyright and such I've read thru the original 
Unicode-Unihan-3.2.0 and I concluded I have no problem publicizing this 
but if it does infringe any of such, tell me and I'll remove it from 
CPAN.

Dan the Open Source Developer
--
_  Dan Kogai
   __/    CEO, DAN co. ltd.
  /__ /-+-/  2-8-14-418 Shiomi Koto-ku Tokyo 135-0052 Japan
/--/--- mailto: [EMAIL PROTECTED] / http://www.dan.co.jp/ -
__/  /Tel:+81 3-5665-6131   Fax:+81 3-5665-6132
  GPG Key: http://www.dan.co.jp/~dankogai/dankogai.gpg.asc

Unicode::Unihan
===

INSTALLATION

To install this module type the following:

perl Makefile.PL
make
make test
make install

DEPENDENCIES

This module requires perl 5.6 or better.

NAME
Unicode::Unihan - The Unihan Data Base 3.2

SYNOPSIS
  use Unicode::Unihan;
  my $db = new Unicode::Unihan;
  print join("," => $db->Mandarin("\x{5c0f}\x{98fc}\x{5f3e}")), "\n";

ABSTRACT
This module provides a user-friendly interface to the
Unicode Unihan Database 3.2.  With this module, accessing the
Unihan database is as easy as shown in the SYNOPSIS above.

DESCRIPTION
The first thing you do is make the database available.
Just say

  use Unicode::Unihan;
  my $db = new Unicode::Unihan;

That's all you have to say.  After that, you can access
the database via $db->tag($string) where tag is the tag in
the Unihan Database, without 'k' prefix.

$data = $db->tag($string)
@data = $db->tag($string)
The first form (scalar context) returns the Unihan
Database entry of the first character in $string.  The
second form (array context) checks the entry for each
character in $string.

  @data = $db->Mandarin("\x{5c0f}\x{98fc}\x{5f3e}");
  # @data is now ('SHAO4 XIAO3','SI4','DAN4')

  @data = $db->JapaneseKun("\x{5c0f}\x{98fc}\x{5f3e}");
  # @data is now ('CHIISAI KO O','KAU YASHINAU','TAMA HAZUMU HIKU')

SEE ALSO
the perluniintro manpage
the perlunicode manpage
The Unihan Database, in Text
  http://www.unicode.org/Public/3.2-Update/Unihan-3.2.0.txt.gz

AUTHOR
For the Module: Dan Kogai [EMAIL PROTECTED]

For the Source Data: Unicode, Inc.

COPYRIGHT AND LICENSE
For the Module:
 Copyright 2002 by Dan Kogai, All rights reserved.

This library is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.

For the Source Data:

Copyright (c) 1996-2002 Unicode, Inc. All Rights reserved.

 Name: Unihan database
 Unicode version: 3.2.0
 Table version: 1.1
 Date: 15 March 2002




[Encode] 1.61 released

2002-04-25 Thread Dan Kogai

I know we are one more step closer to 5.8 when I released Encode 1.61, 
available as follows;

Whole:
http://www.dan.co.jp/~dankogai/Encode-1.61.tar.gz
and CPAN
Diff against current (840 lines)
http://www.dan.co.jp/~dankogai/current-1.61.diff.gz

And changes.

$Revision: 1.61 $ $Date: 2002/04/26 03:02:04 $
! t/mime-header.t
   Now does decent tests besides use_ok()
! lib/Encode/Guess.pm t/guess.t
   UI streamlined, document added
! Unicode/Unicode.xs
   various signed/unsigned mismatch nits (#16173)
   http://public.activestate.com/cgi-bin/perlbrowse?patch=16173
! Encode.pm
   POD:  utf8-flag-related caveats added.  A few sections completely
   rewritten.
! Encode.xs
! AUTHORS
   Thou shalt not assume %d works, either!
   Robin Baker added to AUTHORS for this
   Message-Id: [EMAIL PROTECTED]
! t/CJKT.t
   Change 16144 by gsar@onru on 2002/04/24 18:59:05

Dan the Encode Maintainer




Re: Practical problems with custom .ucm based encoding

2002-04-24 Thread Dan Kogai

On Wednesday, April 24, 2002, at 09:25 , Bart Schuller wrote:
 Hello,

 The cool Encoding support in 5.8 to be enables me to properly solve a
 very common task: making HTML entities out of utf-8 data.

 I generated a ucm file with entries like this:

 U00A0 \x26\x6E\x62\x73\x70\x3B |0 # nbsp

 The resulting Encode::HTMLEntities encoding works perfectly. However, I
 want it to do more.

 Not every unicode character has a corresponding entity. Unknown ones can
 be encoded like &#8364;, so I would like my Encoding to use a simple
 function as a fallback. This proves hard. With CHECK == Encode::FB_WARN
 it looks like the whole string is left untouched, so my plan to just
 substr() off the first character, handle it by hand and repeat is not
 going to work.

 I'd be very happy with a CHECK mode which would allow me to handle a
 single problematic character in perl. Having to find it in a longer
 string is very hard in this case, because it's every character > 0x{7f}
 which is not in my .ucm file.

As a matter of fact, I was thinking of adding FB_HTMLENT or something 
like that.  It seems trivial;  Unless jhi whips me for the sin of 
Feeping Creaturism, I'll do so.

CAVEAT;  This will be done via fallback so  will not turn into 
entities!

Dan the Encode Maintainer




Re: Practical problems with custom .ucm based encoding

2002-04-24 Thread Dan Kogai

On Wednesday, April 24, 2002, at 09:43 , Bart Schuller wrote:
 Character Reference is the proper term, for entities you'd need my whole
 module.
 Please go completely overboard and have FB_XMLCHARREF in addition to
 FB_HTMLCHARREF, the difference being that the XML version would make it
 &#x20ac;

Shoot!  I've just implemented FB_HTMLENT ! (quick, wasn't it?)  Okay, be 
it CHARREF (or isn't there a good short abbreviation for that?).  Let me 
check to make sure... wait,

HTML Char Ref;  &#1234;
XML  Char Ref;  &#xabcd;

(just checked http://www.w3.org/TR/html4/charset.html).
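
For the record, here is how the two fallbacks behave in Encode as eventually released, where they are named FB_HTMLCREF and FB_XMLCREF. A minimal sketch; the sample string is mine. LEAVE_SRC is added because encode() with a CHECK value otherwise empties its source argument, which would spoil the second call.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $euro = "price: \x{20ac}";   # EURO SIGN has no ASCII mapping

# Replace unmappable characters with character references; keep $euro intact.
my $html = encode('ascii', $euro, Encode::FB_HTMLCREF() | Encode::LEAVE_SRC());
my $xml  = encode('ascii', $euro, Encode::FB_XMLCREF()  | Encode::LEAVE_SRC());

print "$html\n$xml\n";
```

Characters that do map to the target encoding are left alone, so markup characters in the input are not escaped by this mechanism.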

Dan the Encode Maintainer




Re: Practical problems with custom .ucm based encoding

2002-04-24 Thread Dan Kogai

On Wednesday, April 24, 2002, at 10:07 , Bart Schuller wrote:
 On Wed, Apr 24, 2002 at 09:56:29PM +0900, Dan Kogai wrote:
 Shoot!  I've just implemented FB_HTMLENT ! (quick, wasn't it?)  Okay, 
 be
 it CHARREF (or isn't there a good short abbreviation for that?).  Let 
 me

 CHARREF is as short as I can make it. You don't happen to have an Encode
 interim release that I can test, do you?

Sorry.  I have picked CREF.  Well, I have to prepend ENCODE_ for those 
macros so even 7 letters is too long ;).  Well, at least they are 
documented.  They are all implemented, with a test suite in 
t/fallback.t, and documented now.  Stay tuned!

 Perhaps my encoding is also small enough to be included with Encode, if
 you think it makes sense.

Please send me your UCM and I will review it.  Well... maybe not.  
enc2xs is trivially easy and I should give CPAN modules a room to 
explore :)

 Isn't open source fun, instant response. Many thanks for your work on
 Encode, you are putting a lot of time in it.

It definitely is.  You're welcome.  And as for the time I have spent, 
heck, that's what The First Great Virtue of a Programmer dictates.  I 
hope, no, DEMAND, that users save billions of hours w/ Perl 5.8.0

Dan the Lazy Man




FYI: Encode performance on Japanese encodings

2002-04-23 Thread Dan Kogai

I was curious to find how fast or slow Encode is against popular 
Japanese transcoder modules.  So I benchmarked it and was relieved to 
find that Encode's performance was good!

I benchmarked it against Jcode.pm (mine, too) and jcode.pl (the first 
and still popular transcoder available since Perl4 by Utashiro-san) and 
here is the result.  Except for 7bit-jis (ISO-2022) <-> euc-jp Encode 
performed the best.  And even for those losing against Jcode.pm, 
the performance loss is not that big.  Note Unicode conversion tests are 
missing from jcode.pl because they are unimplemented therein.
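
As an illustration of what one of the measured conversions looks like on the Encode side (euc-jp to shiftjis), here is a minimal sketch using from_to(). The benchmark script itself is not part of this message, so this shows the operation only, not the harness; the sample text is an assumption.

```perl
use strict;
use warnings;
use Encode qw(encode from_to);

my $text   = "\x{5c0f}\x{98fc}\x{5f3e}";     # sample kanji
my $octets = encode('euc-jp', $text);        # start from EUC-JP octets

# from_to() converts the octets in place and returns the new length.
my $len = from_to($octets, 'euc-jp', 'shiftjis');
print "converted to $len octets of shiftjis\n";
```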

The Japanese have one more good reason to switch to Perl 5.8.

Dan the Encode Maintainer.

7bit-jis -> euc-jp
            Rate   Encode jcode.pl Jcode.pm
Encode    69.6/s       --     -16%     -40%
jcode.pl  83.1/s      19%       --     -29%
Jcode.pm   116/s      67%      40%       --
7bit-jis -> shiftjis
            Rate Jcode.pm jcode.pl   Encode
Jcode.pm  14.6/s       --      -9%     -80%
jcode.pl  16.0/s       9%       --     -78%
Encode    71.9/s     393%     351%       --
7bit-jis -> ucs2
            Rate Jcode.pm   Encode
Jcode.pm  48.3/s       --     -26%
Encode    65.3/s      35%       --
7bit-jis -> utf8
            Rate Jcode.pm   Encode
Jcode.pm  31.9/s       --     -63%
Encode    86.5/s     171%       --
euc-jp -> 7bit-jis
            Rate jcode.pl   Encode Jcode.pm
jcode.pl  45.9/s       --     -32%     -60%
Encode    67.7/s      48%       --     -41%
Jcode.pm   114/s     149%      69%       --
euc-jp -> shiftjis
            Rate Jcode.pm jcode.pl   Encode
Jcode.pm  16.8/s       --     -16%     -92%
jcode.pl  20.0/s      19%       --     -90%
Encode     206/s    1129%     931%       --
euc-jp -> ucs2
            Rate Jcode.pm   Encode
Jcode.pm  85.3/s       --     -47%
Encode     160/s      87%       --
euc-jp -> utf8
            Rate Jcode.pm   Encode
Jcode.pm  44.9/s       --     -89%
Encode     400/s     791%       --
shiftjis -> 7bit-jis
            Rate Jcode.pm jcode.pl   Encode
Jcode.pm  13.3/s       --      -7%     -81%
jcode.pl  14.2/s       7%       --     -80%
Encode    70.7/s     434%     397%       --
shiftjis -> euc-jp
            Rate Jcode.pm jcode.pl   Encode
Jcode.pm  15.1/s       --     -25%     -93%
jcode.pl  20.1/s      33%       --     -90%
Encode     210/s    1285%     943%       --
shiftjis -> ucs2
            Rate Jcode.pm   Encode
Jcode.pm  12.8/s       --     -93%
Encode     175/s    1270%       --
shiftjis -> utf8
            Rate Jcode.pm   Encode
Jcode.pm  11.2/s       --     -98%
Encode     512/s    4456%       --
ucs2 -> 7bit-jis
            Rate Jcode.pm   Encode
Jcode.pm  61.8/s       --      -1%
Encode    62.4/s       1%       --
ucs2 -> euc-jp
            Rate Jcode.pm   Encode
Jcode.pm   138/s       --      -9%
Encode     151/s       9%       --
ucs2 -> shiftjis
            Rate Jcode.pm   Encode
Jcode.pm  14.9/s       --     -91%
Encode     162/s     989%       --
ucs2 -> utf8
            Rate Jcode.pm   Encode
Jcode.pm  33.3/s       --     -87%
Encode     267/s     700%       --
utf8 -> 7bit-jis
            Rate Jcode.pm   Encode
Jcode.pm  59.5/s       --     -18%
Encode    72.7/s      22%       --
utf8 -> euc-jp
            Rate Jcode.pm   Encode
Jcode.pm   129/s       --     -44%
Encode     233/s      80%       --
utf8 -> shiftjis
            Rate Jcode.pm   Encode
Jcode.pm  14.7/s       --     -94%
Encode     261/s    1673%       --
utf8 -> ucs2
            Rate Jcode.pm   Encode
Jcode.pm  50.6/s       --     -74%
Encode     191/s     278%       --




[Encode] 1.51 Released

2002-04-20 Thread Dan Kogai

I was anticipating the release of 1.51 AFTER I get to bed and back.  But 
my insomnia and earlier-than-expected responses from NI-XS and Autrijus 
have accelerated the release by at least 6 hours :)  Get it via

http://www.dan.co.jp/~dankogai/Encode-1.51.tar.gz
or CPAN.

Though the changes are small codewise, this release includes updates to 
two giant ucm files, so a diff will not be supplied.  Please get the whole 
thing.

1.51 $Date: 2002/04/20 09:58:23 $
! t/TW.t
   Updated test suite by Autrijus so make test is happy again
   Message-Id: [EMAIL PROTECTED]
+ ucm/big5-eten.ucm
! ucm/big5-hkscs.ucm lib/Encode/Alias.pm
- ucm/big5.ucm
   TW/TW.pm TW/Makefile.PL
   Updates by Autrijus.  'big5' is no longer a canonical name but an
   alias to 'big5-eten'. big5-hkscs is now the 2001 edition.
   Message-Id: [EMAIL PROTECTED]
! Encode.xs
   Fix by NI-XS for a fallback that could cause SEGV w/ Perl/Tk
   Message-Id: [EMAIL PROTECTED]
! Encode.pm
   PerlIO detection a little bit smarter; no longer uses eval qq{}
   but eval {}.

Dan the Encode Maintainer




Re: Encode-1.50 +

2002-04-20 Thread Dan Kogai

On Sunday, April 21, 2002, at 04:50 , Nick Ing-Simmons wrote:
 I just checked in these changes to ext/Encode/... as change 16022
 on perlio branch.

To honor whitespace, I usually rsync perl-core first, then copy 
files back to my repository for NI-XS (this works only for patches from 
those w/ commit rights to the perl repository, however).  But it seems like AS 
is not new enough so...

  - switch to XSLoader
  - spelling & trailing whitespace removal.
  - remove a use loop (Encode loaded PerlIO::encoding, loaded Encode)
    it never loops, but such things cause problems for imports.
  - Changed how LEAVE_SRC was tested
      x & ~y is not same as !(x & y)
  - Moved Unicode.xs towards supporting same check values.
  - Set @Encode::XS::ISA to Encode::Encoding
  - added ->needs_lines method with my best guess at which ones do.

I did this;

* Copy the patch chunk
* perl -i.bak -pe 's/\s+\n/\n/o' patch.file to make sure there is no 
trailing space before LF
* patch -l so patch ignores differences in the amount of whitespace

And the resulting patch worked pretty well.  Of the 18 hunks, one failed at 
Encode.pm and that was trivial to mend manually, and make distclean -> 
breadperl Makefile.PL -> make test
works beautifully.

 I still cannot get TODO tests in t/perlio.t despite some work
 on PerlIO::encoding to honour ->needs_lines. I need to study it some
 more.
 What I really want to do is have PerlIO::encoding use fallback
 schemes. Which ENCODE_FB_XXX flag bit(s) give me fallback characters
 but still remove translated stuff from the src buffer?

 Perhaps update src should be an active rather than a passive bit?

Please wait till caffeine runs in my bloodstream.   I just woke up 
(because of insomnia or whatever I was not quite nocturnal last night; 
it is 5 minutes before 06:00 AM JST).

Dan the Encode Maintainer




Re: [big5-*.ucm] please revise if possible

2002-04-20 Thread Dan Kogai

On Sunday, April 21, 2002, at 02:32 , Autrijus Tang wrote:
 Updated maps and test:
 http://egb.elixus.org/~autrijus/big5-1.52.tgz

 Ucmlint still complains, due to the order issue outlined in the
 previous mail.

As you have intelligently found, the order for duplicate map DOES 
matter; |1 or |3 have to come AFTER |0.

So I wrote a quick and dirty sort program that just does this as follows;

#!
use strict;
my @lines;
while (<>){
    chomp;
    m/^<U/o or next;        # keep only mapping lines such as <U2550> ...
    push @lines,[ split ];  # [ Unicode, encoded bytes, fallback flag ]
}

for (sort {
    $a->[0] cmp $b->[0]     # Unicode, ascending order
    or $a->[2] cmp $b->[2]  # fallback flag, ascending order (|0 first)
    or $a->[1] cmp $b->[1]  # encoded bytes, ascending order
    }
    @lines) {
    print join(" " => @$_), "\n";
}
__END__

I put the sorted text back into the ucm files and now they all round-trip.  
This is easy to understand once you know how enc2xs works.  It has two 
hashes, %e2u and %u2e.  For |0, it updates both.  For |1 or |3, it updates 
only one of them.  And on each update, the old hash entry is overwritten, 
so a |3 entry is in vain if it is followed by |0.

Maybe I should document this in the enc2xs pod.

XinKeLe!

Dan the Encode Maintainer

perl5.7.3 -Mblib bin/ucmlint -e ucm/big5-eten.ucm
ucm/big5-eten.ucm:warning in line 421: dupe encode map: U2550 = F9,F9 
and A2,A4
ucm/big5-eten.ucm:warning in line 436: dupe encode map: U255E = F9,E9 
and A2,A5
ucm/big5-eten.ucm:warning in line 440: dupe encode map: U2561 = F9,EB 
and A2,A7
ucm/big5-eten.ucm:warning in line 450: dupe encode map: U256A = F9,EA 
and A2,A6
ucm/big5-eten.ucm:warning in line 454: dupe encode map: U256D = A2,7E 
and F9,FA
ucm/big5-eten.ucm:warning in line 456: dupe encode map: U256E = A2,A1 
and F9,FB
ucm/big5-eten.ucm:warning in line 458: dupe encode map: U256F = A2,A3 
and F9,FD
ucm/big5-eten.ucm:warning in line 460: dupe encode map: U2570 = A2,A2 
and F9,FC
ucm/big5-eten.ucm: no error found
dankogai@dan-attic[6276]:~/work/Encode perl5.7.3 -Mblib bin/ucmlint -e 
ucm/big5-hkscs.ucm
ucm/big5-hkscs.ucm:warning in line 1900: dupe encode map: U301E = A1,AA 
and C6,DE
ucm/big5-hkscs.ucm:warning in line 2710: dupe encode map: U4EDD = C9,69 
and C6,DF
ucm/big5-hkscs.ucm:warning in line 2932: dupe encode map: U50ED = B9,B0 
and 9F,CB
ucm/big5-hkscs.ucm:warning in line 2981: dupe encode map: U5159 = A2,59 
and 92,AF
ucm/big5-hkscs.ucm:warning in line 2983: dupe encode map: U515B = A2,5A 
and 92,B0
ucm/big5-hkscs.ucm:warning in line 2986: dupe encode map: U515D = A2,5C 
and 92,B1
ucm/big5-hkscs.ucm:warning in line 2988: dupe encode map: U515E = A2,5B 
and 92,B2
ucm/big5-hkscs.ucm:warning in line 4137: dupe encode map: U5C10 = C9,5C 
and 9C,BC
ucm/big5-hkscs.ucm:warning in line 4384: dupe encode map: U5F0C = 93,61 
and 9F,D8
ucm/big5-hkscs.ucm:warning in line 4509: dupe encode map: U6062 = AB,EC 
and 9E,A9
ucm/big5-hkscs.ucm:warning in line 4765: dupe encode map: U62CE = A9,F0 
and A0,77
ucm/big5-hkscs.ucm:warning in line 4767: dupe encode map: U62D0 = A9,E4 
and 9D,C4
ucm/big5-hkscs.ucm:warning in line 5935: dupe encode map: U6FB6 = BF,47 
and 9B,F6
ucm/big5-hkscs.ucm:warning in line 5974: dupe encode map: U701E = 96,EE 
and 96,ED
ucm/big5-hkscs.ucm:warning in line 6119: dupe encode map: U71DF = C0,E7 
and 9C,62
ucm/big5-hkscs.ucm:warning in line 6165: dupe encode map: U7250 = 94,55 
and A0,E4
ucm/big5-hkscs.ucm:warning in line 6337: dupe encode map: U7468 = 94,7A 
and A0,D5
ucm/big5-hkscs.ucm:warning in line 6659: dupe encode map: U77D7 = C5,F7 
and 9B,78
ucm/big5-hkscs.ucm:warning in line 6825: dupe encode map: U79E3 = AF,B0 
and 9C,BD
ucm/big5-hkscs.ucm:warning in line 6958: dupe encode map: U7B51 = B5,AE 
and 9D,5A
ucm/big5-hkscs.ucm:warning in line 6990: dupe encode map: U7BB8 = BA,E6 
and 8E,69
ucm/big5-hkscs.ucm:warning in line 7084: dupe encode map: U7CCE = A2,61 
and 8E,7E
ucm/big5-hkscs.ucm:warning in line 7195: dupe encode map: U7DD2 = BA,FC 
and 8E,AB
ucm/big5-hkscs.ucm:warning in line 7227: dupe encode map: U7E1D = BF,A6 
and 8E,B4
ucm/big5-hkscs.ucm:warning in line 7368: dupe encode map: U8005 = AA,CC 
and 8E,CD
ucm/big5-hkscs.ucm:warning in line 7387: dupe encode map: U8028 = BF,AE 
and 8E,D0
ucm/big5-hkscs.ucm:warning in line 7736: dupe encode map: U83C1 = B5,D7 
and 8F,57
ucm/big5-hkscs.ucm:warning in line 7839: dupe encode map: U8503 = 92,42 
and 92,44
ucm/big5-hkscs.ucm:warning in line 8047: dupe encode map: U880F = 8F,B6 
and A0,63
ucm/big5-hkscs.ucm:warning in line 8181: dupe encode map: U89A6 = BF,CC 
and 8F,CB
ucm/big5-hkscs.ucm:warning in line 8184: dupe encode map: U89A9 = A0,D4 
and 8F,CC
ucm/big5-hkscs.ucm:warning in line 8494: dupe encode map: U8D77 = B0,5F 
and 8F,FE
ucm/big5-hkscs.ucm:warning in line 8825: dupe encode map: U90FD = B3,A3 
and 90,6D
ucm/big5-hkscs.ucm:warning in line 9045: dupe encode map: U936E = A0,5F 
and 92,C8
ucm/big5-hkscs.ucm:warning in line 9335: dupe encode map: U975C = C0,52 
and 90,DC
ucm/big5-hkscs.ucm:warning in line 9337: dupe 

Encode-1.50 and PerlIO::encoding 0.02 released

2002-04-19 Thread Dan Kogai

I was daydreaming that I was a caravan member, driving a herd of 
disobedient camels across the never-ending desert to an oasis called 5.8.0, 
when I released the new Encode and PerlIO::encoding.  You can get them as 
follows.

Whole:
  Encode
    http://www.dan.co.jp/~dankogai/Encode-1.50.tar.gz
    and CPAN
  PerlIO::encoding
    http://www.dan.co.jp/~dankogai/PerlIO-encoding-0.02.tar.gz
Diff:
  Encode
    http://www.dan.co.jp/~dankogai/current-1.50.diff.gz
  PerlIO::encoding
    [ none ]

The diff is pretty big (> 3000 lines) so you should get the whole thing 
instead.

The biggest and foremost change is the fallback API, which is greatly 
enhanced.  NI-XS's request of

On Friday, April 19, 2002, at 05:01 , Nick Ing-Simmons wrote:
   check == 11 - silent fail with $string updated (What Tk wants)

is implemented as FB_QUIET.  See below;


Handling Malformed Data
The CHECK argument is used as follows.  When you omit it,
it is identical to CHECK = 0.

CHECK = Encode::FB_DEFAULT ( == 0)
If CHECK is 0, (en|de)code will put a substitution
character in place of the malformed character.  For
UCM-based encodings, subchar will be used.  For Unicode,
U+FFFD is used.  If the data is supposed to be UTF-8,
an optional lexical warning (category utf8) is given.

CHECK = Encode::FB_CROAK ( == 1)
If CHECK is 1, methods will die immediately with an
error message.  So when CHECK is set, you should trap
the fatal error with eval{} unless you really want to
let it die on error.

CHECK = Encode::FB_QUIET
If CHECK is set to Encode::FB_QUIET, (en|de)code will
immediately return the processed part on error, with the
data passed via argument overwritten with the unprocessed
part.  This is handy when you have to call it repeatedly
because the source data is chopped in the middle for
some reason, such as a fixed-width buffer.  Here is
sample code that does just this.

  my $data = '';
  while (read $fh, $buffer, 256) {   # read returns 0 at EOF
      # buffer may end in a partial character so we append
      $data .= $buffer;
      $utf8 .= decode($encoding, $data, Encode::FB_QUIET);
      # $data now contains the unprocessed partial character
  }

CHECK = Encode::FB_WARN
This is the same as above, except it warns on error.
Handy when you are debugging the mode above.

perlqq mode (CHECK = Encode::FB_PERLQQ)
For encodings that are implemented by Encode::XS,
CHECK == Encode::FB_PERLQQ turns (en|de)code into
perlqq fallback mode.

When you decode, '\xXX' will be placed where XX is the
hex representation of the octet that could not be
decoded to utf8.  And when you encode, '\x{XXXX}' will
be placed where XXXX is the Unicode ID of the character
that cannot be found in the character repertoire
of the encoding.
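
As a rough illustration of perlqq mode (a sketch only, assuming a reasonably recent Encode; the 'ascii' encoding and the sample strings are my own choices, not part of the documentation quoted above), the following decodes an octet and encodes a character that US-ASCII cannot represent:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decoding: octet 0x80 has no mapping in US-ASCII, so perlqq mode
# should leave a backslash escape for that octet in the result.
my $octets  = "abc\x80def";
my $decoded = decode('ascii', $octets, Encode::FB_PERLQQ);

# Encoding: HIRAGANA LETTER A (U+3042) is not in the ASCII repertoire,
# so perlqq mode should leave a \x{...} escape for that character.
my $string  = "A\x{3042}B";
my $encoded = encode('ascii', $string, Encode::FB_PERLQQ);

print "decoded: $decoded\n";
print "encoded: $encoded\n";
```

Both results are plain ASCII strings, which is why this mode is convenient for logging data that would otherwise be lost to a substitution character.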

The bitmask
These modes are actually set via a bitmask.  Here is how
the FB_XX constants are laid out.  You can import the
FB_XX constants via use Encode qw(:fallbacks); the
generic bitmask constants you can import via
 use Encode qw(:fallback_all).

                      FB_DEFAULT FB_CROAK FB_QUIET FB_WARN FB_PERLQQ
 DIE_ON_ERR    0x0001                X
 WARN_ON_ERR   0x0002                                  X
 RETURN_ON_ERR 0x0004                         X        X
 LEAVE_SRC     0x0008
 PERLQQ        0x0100                                          X
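
To make the layout concrete, here is a small sketch that rebuilds the modes from the generic bits. The bit values are copied from the table above; the MY_FB_* names and the bits_of() helper are hypothetical and deliberately do not use the constants exported by Encode, whose values may differ between releases.

```perl
use strict;
use warnings;

# Generic bits, values copied from the table above.
use constant {
    DIE_ON_ERR    => 0x0001,
    WARN_ON_ERR   => 0x0002,
    RETURN_ON_ERR => 0x0004,
    LEAVE_SRC     => 0x0008,
    PERLQQ        => 0x0100,
};

# Fallback modes as combinations of those bits (hypothetical names).
my %mode = (
    MY_FB_DEFAULT => 0,
    MY_FB_CROAK   => DIE_ON_ERR,
    MY_FB_QUIET   => RETURN_ON_ERR,
    MY_FB_WARN    => RETURN_ON_ERR | WARN_ON_ERR,
    MY_FB_PERLQQ  => PERLQQ,
);

# Hypothetical helper: name the generic bits set in a CHECK value.
sub bits_of {
    my ($check) = @_;
    my %bit = (
        DIE_ON_ERR    => DIE_ON_ERR,
        WARN_ON_ERR   => WARN_ON_ERR,
        RETURN_ON_ERR => RETURN_ON_ERR,
        LEAVE_SRC     => LEAVE_SRC,
        PERLQQ        => PERLQQ,
    );
    return join '|', grep { $check & $bit{$_} } sort keys %bit;
}

for my $name (sort keys %mode) {
    printf "%-13s 0x%04x  %s\n", $name, $mode{$name},
        bits_of($mode{$name}) || '(none)';
}
```

Testing a single behaviour is then a matter of masking, e.g. $check & RETURN_ON_ERR, rather than comparing CHECK against each mode in turn.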

Unimplemented fallback schemes

In the future you will be able to use a code reference to a
callback function for the value of CHECK, but its API is
still undecided.


Since PerlIO::encoding was incapable of using this new feature, I have 
updated PerlIO::encoding as well;  instead of pushing PL_sv_yes onto the 
stack, struct PerlIOEncode now has one more member, chk, that is 
initialized with Encode::FB_QUIET.

typedef struct {
    PerlIOBuf base;   /* PerlIOBuf stuff */
    SV *bufsv;        /* buffer seen by layers above */
    SV *dataSV;       /* data we have read from layer below */
    SV *enc;          /* the encoding object */
    SV *chk;          /* CHECK in Encode methods */
} PerlIOEncode;

Encode now checks the version of PerlIO::encoding and refuses to use an 
obsolete version.  See t/perlio.t for details.

That way PerlIO::encoding has no trouble should Encode change the value 
of FB_QUIET.
As for the partial character problem, I have found it is nearly 
impossible for escape-based encodings to 

Re: [PATCH] Big5-related changes.

2002-04-19 Thread Dan Kogai

On Saturday, April 20, 2002, at 04:53 , Autrijus Tang wrote:
 I've been immersed in Big5-related issues in the past few days, and
 came back with these last-minute (err, week?) changes before 5.8-RC1.

 The Diff contains fixes to TW.pm, Alias.pm, and README.(tw|cn).

Excellent!

 (For dan) big5-hkscs should be upgraded to the 2001 edition, as per
 Hong Kong government's decree. It's available separately at:

 http://egb.elixus.org/~autrijus/big5-hkscs.ucm.gz

 Also, please delete big5.ucm and replace it with big5-eten, at:

 http://egb.elixus.org/~autrijus/big5-eten.ucm.gz

Thus updated.  I needed to update TW/Makefile.PL and 
lib/Encode/Config.pm (so it loads on 'big5-eten' instead of just 
'big5'), but that's not a big deal at all.

 I've fixed Alias.pm so big5 aliases to big5-eten. The reason is that
 the 'Big5' as originally defined isn't used anywhere on earth; non-
 Microsoft systems uses 'big5' to mean 'big5-eten', and Microsoft
 uses 'big5' to mean 'cp950'.

 It is therefore unwise to have a canonical 'big5' encoding, much like
 there should not be a 'gb2312' encoding. Since gb2312 is now aliased
 to euc-cn and not cp936, I think big5 should alias to big5-eten and
 not cp950.

I agree.  AFAIK, Big5 is the only major CJK encoding not endorsed by the 
government.  What's funny is that there seems to be less confusion between 
encodings in Taiwan than in Japan or Korea.  Japan is the worst, 
using Shift_JIS, EUC-JP, ISO-2022-JP(-[12])? and now Unicode (IMHO, 
however, the Japanese people should be proud of making multibyte 
character encoding a reality.  But I can't help wondering whether this mess 
is way too high a price to pay :)

 Oh, I just noticed that Dan retained the 'gb2312.ucm' name, although
 the encoding is called 'gb2312-raw'. I admit that I don't fully
 understand the reason, but if that's to stand, then big5-eten could also
 be named 'big5.ucm', and still say 'code_set_name big5-eten', for
 consistency's sake.

I renamed big5.ucm to big5-eten.ucm.  The -raw suffix is missing from the 
*.ucm filenames just because they would look too funny on 8.3 filesystems, 
nothing more :)

 Thanks,
 /Autrijus/

Xin Ku  Le  !
\x{8f9b}\x{82e6}\x{4e86}

XiaoSi   Dan
\x{5c0f}\x{98fc} \x{5f3e}\n




Re: Tk804 + Encode-1.50 :-) again

2002-04-19 Thread Dan Kogai

On Saturday, April 20, 2002, at 03:45 , Nick Ing-Simmons wrote:
 Dan Kogai [EMAIL PROTECTED] writes:
 I am daydreaming that I am a caravan member, driving a herd of
 disobedient camels on the never-ending desert to an oasis called 5.8.0
 when I released new Encode and PerlIO::encoding.  You can get one as
 follows.

 p4 integrated to //depot/perlio for testing.

 Without any changes to Tk804 things improved a bit - only the JP.t and 
 KR.t
 tests were failing, and those not failing as badly.

I thought I had relocated the perlio-related tests in them to t/perlio.t.  
Are there any left?

 Adding ENCODE_FB_QUIET to Tk's encode glue makes those pass as well.

That was my biggest concern.  So glad to hear that.

 Suggest one small tweak as in attached patch.

 The patch turns off utf8_to_uvuni's warning and checks as only
 thing we are using the UV for is an error message (which in my case
 isn't going to be printed as I am in FB_QUIET). Otherwise I get noise
 when Tk is groping about in U+FFXX page.

Applied, thanks.

 The indent looks better - but has cuddled else - no big deal.

 I was a little surprised that Encode/encode.h gets installed in lib
 rather than archlib/CORE but can live with that (makes a kind of sense
 it is architecture neutral - but perl.h et. al. go elsewhere).
 The snag here is that Makefile.PL has added -I to find perl.h, so I
 have to
 #include ../../Encode/encode.h
 which is portability issue as there is no certainty that lib / archlib
 relative paths work like that. Will tweak Tk's Makefile.PL configure
 to hunt down encode.h.

I wonder if there is a more sensible way to install NON-PM files to 
PERL5LIB.  For the time being it is at the mercy of MM.  Though it is not a 
show stopper, I would like Encode to be as clean and standard-compliant 
as possible.  MM is so vast I don't even know how many more features are 
hidden...

 Will do a spelling patch on the pod(s) when I get a chance.

Yes, please.  Emacs doesn't do spellcheck-as-you-type like recent 
mailers in MacOS and Windows :)  (I know you can spellcheck in Emacs but 
I am not sure if it is a good idea to do so in .pm).

Dan the Encode Maintainer




[Encode] Dark Side of the Emacs Modes [Was: Re: Tk804 ...]

2002-04-19 Thread Dan Kogai

On Saturday, April 20, 2002, at 05:38 , Nicholas Clark wrote:
 On Sat, Apr 20, 2002 at 04:27:15AM +0900, Dan Kogai wrote:
 Yes, please.  Emacs doesn't do spellcheck-as-you-type like recent
 mailers in MacOS and Windows :)  (I know you can spellcheck in Emacs
 but I am not sure if it is a good idea to do so in .pm).

 You underestimate the power of the dark side.

 M-x flyspell-mode

I knew something like this existed but never checked the mode name :)
Hmm.  Requires ispell...  Piece of cake with portupgrade (could be 
the most widely used ruby program in the (Free)BSD world).  Oh man! you're 
right!  It even supports the mouse (but I usually use emacs only via tty).  
But how about perl jargon?  automagicalNi!  
barewordsNi!  Hmm.  This mode needs some more education :)  
Thanks.  More than 10 years w/ Emacs and still lost in modes

 Definitely part of the dark side because here it defaults to American.

Does it correct the pronunciation of the Britons so "CAN'T do that" sounds 
less obscene :?

 And then refuses to start because I don't have American dictionaries
 installed. ispell has no problem just running and finding the correct
 dictionaries.

Dan the Emacs User, not Elisp Hacker
 ^pretty funny.  MacOS X Mail underline this 
but not
 Emacs.  Is it smart enough to scan $PATH and 
make them
correct?




Re: Please update Encode::HanExtra

2002-04-18 Thread Dan Kogai

On Thursday, April 18, 2002, at 04:40 , Autrijus Tang wrote:
 On Thu, Apr 18, 2002 at 11:41:48AM +0900, Dan Kogai wrote:
 http://www.dan.co.jp/~dankogai/Encode-HanExtra-0.04.tar.gz
   Please pick it up, add necessary changes and upload YOUR version to
 CPAN.

 Okay, will do.

XieXieGeZuo

 But.

 Do we want optional compatibility with 5.7.[23]? i.e. only use
 enc2xs where it's available.

I don't.  So far we have no 5.8.0, meaning we don't have to think about 
backward compatibility at all -- yet.

Dan the Encode Maintainer




Encode-1.42 PerlIO-encoding-0.01 now available

2002-04-16 Thread Dan Kogai

NI-XS, jhi and porters,

The surgical operation is finished.  The PerlIO layer functions in Encode.xs 
have been successfully detached.  Now the PerlIO part is in 
PerlIO::encoding.  They are now more interdependent than 
dependent.  You can get them via the URLs below;

http://www.dan.co.jp/~dankogai/PerlIO-encoding-0.01.tar.gz
http://www.dan.co.jp/~dankogai/Encode-1.42.tar.gz
http://www.dan.co.jp/~dankogai/perl-dan.tar.bz2

The last one is the whole perl with interdependent versions of Encode 
and PerlIO.  As a matter of fact, just replace Encode with 1.42 above, 
untargzip PerlIO-encoding-0.01 at ext/PerlIO/, rename the thawed 
directory to encoding, fix the toplevel MANIFEST, and it will work 
perfectly.  The Configure file needed no modification.

Here is how Encode tests as a module.

 t/Aliases.....ok
 t/CN..........ok
 t/Encode......ok
 t/Encoder.....ok
 t/JP..........ok, 6/27 skipped: PerlIO Encoding Needed
 t/KR..........ok, 6/22 skipped: PerlIO Encoding Needed
 t/TW..........ok
 t/Unicode.....ok
 t/encoding....ok
 t/grow........ok
 t/jperl.......ok
 All tests successful, 12 subtests skipped.
 Files=11, Tests=4616, 11 wallclock secs ( 7.52 cusr +  0.50 csys =  8.02 CPU)

And with the whole perl and PerlIO:
 ext/Encode/t/CN..........ok
 ext/Encode/t/Encode......ok
 ext/Encode/t/Encoder.....ok
 ext/Encode/t/JP..........ok
 ext/Encode/t/KR..........ok
 ext/Encode/t/TW..........ok
 ext/Encode/t/Unicode.....ok
 ext/Encode/t/encoding....ok
 ext/Encode/t/grow........ok
 ext/Encode/t/jperl.......ok
 []
 ext/PerlIO/PerlIO........ok
 ext/PerlIO/t/encoding....ok
 ext/PerlIO/t/scalar......ok
 ext/PerlIO/t/via.........ok

Note that ext/PerlIO/t/encoding.t was never modified, so it is 100% 
compatible with the prior version.

FYI those will not be uploaded to CPAN;  I'll wait until perl-current 
catches up.  And PerlIO::encoding is not mine but NI-XS's.  So if it is to 
be CPANized, it must be done by NI-XS (I pretty much doubt he will, 
however).

Man, I'm exhausted.  Autrijus, Jungshik, sorry for not responding 
sooner.  Please let me take a nap before I process your new READMEs.

Dan the Encode Maintainer.




Re: iso-2022-jp problem

2002-04-15 Thread Dan Kogai

On Monday, April 15, 2002, at 07:29 , Nick Ing-Simmons wrote:
 I tracked down the problem tkmail was/is having with iso-2022-jp.
 The snag is I am using the API the way I designed it, not the way
 it is reliably implemented.

 When called thus:

 my $decoded = $enc->decode($encoded,1);

 decode is supposed to return portion it can decode, and set $encoded
 to what remains.

Ah, I see.  But it is a pain in the arse for doubly-encoded encodings 
like ISO-2022-JP.

Here is the problem.  As you see, to decode ISO-2022-JP, we first have 
to decode it into EUC-JP.  And ISO-2022-JP -> EUC-JP is treated (and 
should be treated) purely as a CES so there is no chance for error 
(unless there is a bogus escape sequence).  However, errors may arise 
when you try to convert the resulting EUC-JP stream to UTF-8.

The problem is that not all of the possible code points in JIS X 0208 
and JIS X 0212 are actually used (94x94 = 8836), of which only 6884 are 
used in 0208 and 6072 are used in 0212.  So the remainder won't map to 
Unicode.

It was possible to use jis02*-raw instead of EUC-JP but that 
implementation was too slow because you have to invoke encode() chunk by 
chunk.  In fact I tried and it got 3 times as slow.

And the sense of "what remains" gets moot when it comes to 
ISO-2022.  Suppose you got a string like this;

abcd<ESC-to-jis0208>cdefghijklmn<ESC-to-ascii>opqrstu
                      ^^error occurs here.

What's the remaining stream?

ghijklmn<ESC-to-ascii>opqrstu


is WRONG because we are now in a jis0208 chunk and the escape sequence is 
already stripped.  Do we have to go like

<ESC-to-jis0208>ghijklmn<ESC-to-ascii>opqrstu

?  But that slows down the encoder too much.   I just woke up.  Let me 
think about this a little bit more.

Dan the Encode Maintainer




Re: iso-2022-jp problem

2002-04-15 Thread Dan Kogai

On Tuesday, April 16, 2002, at 12:00 , Nick Ing-Simmons wrote:
 abcd<ESC-to-jis0208>cdefghijklmn<ESC-to-ascii>opqrstu
                       ^^error occurs here.

 What's the remaining stream?

 ghijklmn<ESC-to-ascii>opqrstu

 Does not matter for that case.
 "does not map" is a fatal error with $chk true (and would have
 become a replacement char if $chk was false).

 What matters is being able to tell the complete case, from partial case.

  A. When you have converted whole thing set remains to ''.
  B. When you have a partial encoding consume as much as you can
 and leave string with what is partial.

 e.g.

 abcd<ESC-to-jis0208>cdefghijklmn<ESC-to  -ascii>opqrstu
                                        ^- buffer boundary

 Then you return translation of
 abcd<ESC-to-jis0208>cdefghijklmn
 and set remains to <Esc-to
 so that :encoding can append -ascii>opqrstu
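
Cases A and B can be sketched with UTF-8 instead of ISO-2022-JP. This is an assumption made for simplicity: UTF-8 has no escape state, so cutting it by length is harmless, and the 4-byte chunk size and sample text are arbitrary. With Encode::FB_QUIET, decode returns what it could convert and leaves the partial character in the source buffer:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Three 3-byte UTF-8 characters, fed in 4-byte chunks so that chunk
# boundaries fall in the middle of characters.
my $original = "\x{3042}\x{3044}\x{3046}";    # hiragana a, i, u
my $octets   = encode('UTF-8', $original);

my ($pending, $text) = ('', '');
while (length $octets) {
    $pending .= substr($octets, 0, 4, '');    # take the next chunk
    # Case B: convert the complete part; FB_QUIET leaves the partial
    # character in $pending so the next chunk can be appended to it.
    $text .= decode('UTF-8', $pending, Encode::FB_QUIET);
}

# Case A: the whole input was converted, so nothing remains.
print "complete\n" if $text eq $original && $pending eq '';
```

For an escape-based encoding the same loop would also have to carry the current shift state across chunks, which is exactly the difficulty discussed in this thread.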

One of many reasons that programmers dislike 7bit ISO-2022 is exactly 
how to handle case B -- how to split the buffer in the middle.  When 
handling 7bit ISO-2022, YOU ARE NOT SUPPOSED TO SPLIT THE BUFFER BY 
LENGTH.  Of course that causes problems for large files and, even 
worse, network streams.  But fortunately, 7bit ISO-2022 has one safety 
net for that problem;  IT ALWAYS REVERTS TO ASCII BEFORE CONTROL 
CHARACTERS, including CRLF.  So if you need to, you can safely split the 
buffer line by line.  A script

use Encode;
binmode(STDOUT, ":utf8");
while(<>){
    print Encode::decode("iso-2022-jp", $_);
}

is completely safe because $_ is guaranteed to start in ASCII and end in 
ASCII.

Check RFC 1468  (http://www.ietf.org/rfc/rfc1468.txt and others).  It is 
not as complicated as it sounds.
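
The line-by-line approach can be checked with a short round trip. This is only a sketch: the sample text is made up, and the assumption that each JIS X 0208 run is closed with the escape back to ASCII before the newline is what the script counts.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two lines mixing ASCII and kanji (all characters are in JIS X 0208).
my $original = "abc \x{65e5}\x{672c}\x{8a9e} def\nxyz \x{6f22}\x{5b57}\n";
my $jis      = encode('iso-2022-jp', $original);

my ($text, $ends_in_ascii) = ('', 0);
for my $line (split /(?<=\n)/, $jis) {        # keep the newline on each line
    # Find the escape sequences in this line: ESC $ B switches to
    # JIS X 0208, ESC ( B switches back to ASCII.
    my @esc = $line =~ /(\e\$B|\e\(B)/g;
    $ends_in_ascii++ if !@esc || $esc[-1] eq "\e(B";
    # Each line is decoded on its own, with no state carried over.
    $text .= decode('iso-2022-jp', $line);
}

binmode(STDOUT, ':utf8');
print $text;
print "lines ending in ASCII: $ends_in_ascii\n";
```

Because no shift state has to survive from one line to the next, the memory needed is bounded by the longest line rather than by the size of the file.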

 If you cannot do that then don't return or consume anything
 so :encoding can keep appending till you have whole file but that
 is going to be very memory hungry.

As I said, if you are worried about memory, "just use a line buffer" is 
the answer.

Other encodings are subject to this boundary problem -- and solution.  
Arabic and Hebrew (BIDI boundary), Thai (word boundary), Hangul (for 
decomposed form), you name it.  But very fortunately for all these, 
legacy encodings for those are all designed so that you can rely on CRLF 
to split the stream.

Dan the Encode Maintainer.




Re: iso-2022-jp problem

2002-04-15 Thread Dan Kogai

On Tuesday, April 16, 2002, at 01:06 , Nick Ing-Simmons wrote:
 So we need some way of telling from an encoding object (e.g.
 an attribute or a method call) that it needs line buffering
 so that :encoding layer can take the appropriate steps.

Okay, which way do you like, attribute or method ?  I think method is 
more elegant but attribute seems easier to fetch.  Since this is more 
for PerlIO than Encode itself, I would appreciate if you gave me the API 
(just name would be enough) and I will add them to ISO-2022 stuff (not 
just JP but KR has one, too).

Dan




Re: README.jp (or README.jp?)

2002-04-15 Thread Dan Kogai

On Tuesday, April 16, 2002, at 08:14 , Jarkko Hietaniemi wrote:
 Could I ask for the Japanese translation?  (Check out Autrijus' latest
 message about the subject, they had a useful additional section.)

Sorry.  I was too preoccupied w/ the module itself.  Will be submitted 
before I go to bed.

Dan




[Encode] 1.40 will be released in a few hours!

2002-04-14 Thread Dan Kogai

Folks,

   I will release ver. 1.40 of Encode after the smoke tests are done.  
With In-XSimmons'  XS version of Unicode transcoders, encoding.pm 
enhancements and fixes (that led to the discovery of the "child gets 
croaked before born" bug), and other nits picked, a simple version 
increment is not enough.

* With all modules loaded, it can transcode some 113 encodings and it is 
easy to add more via enc2xs.
* With encoding pragma, you can emulate Jperl and more
* Though Encode accounts for some 30% of PERL5LIB in size, its memory 
consumption is not that big.  Here is a list of core file sizes via 
dump immediately after the modules are loaded on my FreeBSD box.

perl alone        774,144 bytes
No Encode::XX   1,171,456
With All        2,990,080
All+HanExtra    3,534,848

* I decided not to include Indics.  It is MY obsession to include all 
encodings that are available at unicode.org but, come to think of it, 
HanExtra is already 'external', and for other encodings there are always 
others who have them as their 'obsession'.  So I decided to wait till my 
obsession becomes 'ours'.  And I already added the '-C' option to enc2xs so 
postinstalled modules can also join the demand-loading list.  Better to 
take time to let it mature enough for production quality.

Detailed Changes right after my signature.

Dan the Encode Maintainer

1.40
+ Encode/ConfigLocal_PM.e2x
! lib/Encode/Config.pm
! bin/enc2xs
   enc2xs -C now generates/updates Encode::ConfigLocal.
   ConfigLocal_PM.e2x is a skeleton thereof.
! lib/Encode/Config.pm
! CN/CN.pm
   use Encode::CN::HZ; was missing.
! t/Unicode.t
! t/unibench.t
   More rigorous tests added to test XS, especially on memory allocation.
! Encode.xs
! lib/Encode/Unicode.pm
   NI-S implemented an XS version -- merged
   Message-Id: [EMAIL PROTECTED]
! encoding.pm
! t/jperl.t
   Source filter option added.  With this option on, you can write
   perl 5.8-savvy scripts (such as UTF-8 identifiers) in legacy
   encodings.  t/jperl.t enhanced to test this feature.
! t/Unicode.t
   ok() gotcha pointed out by Benjamin fixed.  Though I didn't exactly
   apply his suggestion, this degree of nitpicking is enough to add him
   to the AUTHORS list.
   Message-Id: [EMAIL PROTECTED]
! JP/JP.pm
+ lib/Encode/JP/JIS7.pm
- lib/Encode/JP/JIS.pm
- lib/Encode/JP/2022_JP.pm
- lib/Encode/JP/2022_JP1.pm
   7bit-jis, iso-2022-jp and iso-2022-jp1 are all aggregated to
   JIS7.pm for better maintainability and performance
! encoding.pm
   Added caveat for non-ascii identifiers.
! encoding.pm
   fixes by jhi, the original author of this pragmatic module.
   Message-Id: [EMAIL PROTECTED]



