Re: [PATCH] Big5-related changes.

2002-04-20 Thread Autrijus Tang

On Sat, Apr 20, 2002 at 04:21:04PM +0800, Autrijus Tang wrote:
> Yes, as attached, thanks.

Oops, please drop the "use Encode::Big5z" line from that attachment. :-)

I've been using enc2xs -M rather than recompiling the whole Encode suite.

/Autrijus/





Re: [PATCH] Big5-related changes.

2002-04-20 Thread Autrijus Tang

On Sat, Apr 20, 2002 at 08:00:04AM +0900, Dan Kogai wrote:
> Is this okay?  I think this is due to the edition difference.  If so, 
> please submit a fixed version of TW.t

Yes, as attached, thanks.

> FYI here is what ucmlint is saying.
>
> Was:
> > > perl5.7.3 -Mblib bin/ucmlint -e ucm/big5-*.ucm
> > ucm/big5-eten.ucm: no error found
> > ucm/big5-hkscs.ucm: no error found
> Is:
> > ucm/big5-eten.ucm: 16 errors found
> > ucm/big5-hkscs.ucm: 1319 errors found

That's right. The old ucm files were lacking these codepoints.

/Autrijus/



TW.t.gz
Description: application/gunzip




Re: [PATCH] Big5-related changes.

2002-04-19 Thread Dan Kogai

Autrijus,

I just found that your new *.ucm files fail in t/TW.t:

> 1..17
> ok 1 - use Encode::TW;
> ok 2 - [big5] decode - Basic Big5 range
> ok 3 - [big5] encode - Basic Big5 range
> ok 4 - [big5] from_to => utf8 - Basic Big5 range
> ok 5 - [big5] utf8 => from_to - Basic Big5 range
> ok 6 - [big5-hkscs] decode - Basic Big5 range
> ok 7 - [big5-hkscs] encode - Basic Big5 range
> ok 8 - [big5-hkscs] from_to => utf8 - Basic Big5 range
> ok 9 - [big5-hkscs] utf8 => from_to - Basic Big5 range
> ok 10 - [cp950] decode - Basic Big5 range
> ok 11 - [cp950] encode - Basic Big5 range
> ok 12 - [cp950] from_to => utf8 - Basic Big5 range
> ok 13 - [cp950] utf8 => from_to - Basic Big5 range
> ok 14 - [big5-hkscs] decode - Hong Kong Extensions
> not ok 15 - [big5-hkscs] encode - Hong Kong Extensions
> # Failed test (t/TW.t at line 81)
> #  got: '·P?©??¨??Perl 
> ??Aµ?§?]??¨£©M?ª??pªG?s?X??¥i¶D§?]¡C'
> # expected: '·P?©??¨??Perl 
> Aµ?§?]??«?¨£©M?ª??pªG?s?X??¥i¶D§?]¡C'
> ok 16 - [big5-hkscs] from_to => utf8 - Hong Kong Extensions
> not ok 17 - [big5-hkscs] utf8 => from_to - Hong Kong Extensions
> # Failed test (t/TW.t at line 90)
> #  got: '·P?©??¨??Perl 
> ??Aµ?§?]??¨£©M?ª??pªG?s?X??¥i¶D§?]¡C'
> # expected: '·P?©??¨??Perl 
> Aµ?§?]??«?¨£©M?ª??pªG?s?X??¥i¶D§?]¡C'
> # Looks like you failed 2 tests of 17.

Is this okay?  I think this is due to the difference between big5-hkscs
editions.  If so, please submit a fixed version of TW.t.

FYI here is what ucmlint is saying.

Was:
> > perl5.7.3 -Mblib bin/ucmlint -e ucm/big5-*.ucm
> ucm/big5-eten.ucm: no error found
> ucm/big5-hkscs.ucm: no error found
Is:
> ucm/big5-eten.ucm: 16 errors found
> ucm/big5-hkscs.ucm: 1319 errors found
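
For readers unfamiliar with the test names in the output above, the four kinds of check (decode, encode, and from_to in both directions) can be sketched roughly as follows. This is not the contents of t/TW.t; the sample bytes and variable names are assumptions chosen for illustration, and the sketch uses only the documented Encode functions decode, encode and from_to.

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to);

# Assumed sample data: two Big5 characters (four bytes), written as hex
# escapes so that this file stays plain ASCII.
my $big5_bytes = "\xC0\x64\xBE\x6D";

# decode: Big5 bytes -> Perl character string
my $chars = decode('big5-eten', $big5_bytes);

# encode: Perl character string -> Big5 bytes
my $round_trip = encode('big5-eten', $chars);

# from_to: in-place byte conversion, Big5 -> UTF-8 and back again
my $utf8_bytes = $big5_bytes;
from_to($utf8_bytes, 'big5-eten', 'utf8');
my $back = $utf8_bytes;
from_to($back, 'utf8', 'big5-eten');

printf "chars=%d big5=%d utf8=%d\n",
    length($chars), length($big5_bytes), length($utf8_bytes);
print "round trip ok\n"
    if $round_trip eq $big5_bytes && $back eq $big5_bytes;
```

A test fails in this scheme when the round trip does not reproduce the original bytes, which is what an edition change in the mapping table would cause for the affected code points.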

Dan the Encode Maintainer





Re: [PATCH] Big5-related changes.

2002-04-19 Thread Dan Kogai

On Saturday, April 20, 2002, at 04:53, Autrijus Tang wrote:
> I've been immersed in Big5-related issues in the past few days, and
> came back with these last-minute (err, week?) changes before 5.8-RC1.
>
> The Diff contains fixes to TW.pm, Alias.pm, and README.(tw|cn).

Excellent!

> (For dan) big5-hkscs should be upgraded to the 2001 edition, as per the
> Hong Kong government's decree. It's available separately at:
>
> http://egb.elixus.org/~autrijus/big5-hkscs.ucm.gz
>
> Also, please delete big5.ucm and replace it with big5-eten, at:
>
> http://egb.elixus.org/~autrijus/big5-eten.ucm.gz

Updated accordingly.  I needed to update TW/Makefile.PL and
lib/Encode/Config.pm (so that it loads on 'big5-eten' instead of just
'big5'), but that's not a big deal at all.

> I've fixed Alias.pm so big5 aliases to big5-eten. The reason is that
> the 'Big5' as originally defined isn't used anywhere on earth;
> non-Microsoft systems use 'big5' to mean 'big5-eten', and Microsoft
> uses 'big5' to mean 'cp950'.
>
> It is therefore unwise to have a canonical 'big5' encoding, much like
> there should not be a 'gb2312' encoding. Since gb2312 is now aliased
> to euc-cn and not cp936, I think big5 should alias to big5-eten and
> not cp950.

I agree.  AFAIK, Big5 is the only major CJK encoding not endorsed by a
government.  The funny thing is that there seems to be less confusion
between encodings in Taiwan than in Japan or Korea.  Japan is the worst,
using Shift_JIS, EUC-JP, ISO-2022-JP(-[12])? and now Unicode.  Still,
IMHO the Japanese people should be proud of having made multibyte
character encoding a reality, though I can't help wondering whether this
mess is too high a price to pay :)

> Oh, I just noticed that Dan retained the 'gb2312.ucm' name, although
> the encoding is called 'gb2312-raw'. I admit that I don't fully
> understand the reason, but if that's to stand, then big5-eten could also
> be named 'big5.ucm', and still say '<code_set_name> "big5-eten"', for
> consistency's sake.

I renamed big5.ucm to big5-eten.ucm.  The "-raw" suffix is missing from
the *.ucm filenames only because such names look too funny on 8.3
filesystems, nothing more :)

> Thanks,
> /Autrijus/

Xin Ku  Le  !
\x{8f9b}\x{82e6}\x{4e86}

XiaoSi   Dan
\x{5c0f}\x{98fc} \x{5f3e}\n




Re: [PATCH] Big5-related changes.

2002-04-19 Thread Autrijus Tang

On Sat, Apr 20, 2002 at 03:53:46AM +0800, Autrijus Tang wrote:
> The Diff contains fixes to TW.pm, Alias.pm, and README.(tw|cn).
> (For jhi) README fixes are trivial -- they mention the new HanExtra
> encodings, fix some China-specific word usage, and add my Latin-1 name.

Err, forget the patch chunks, please use the attachments verbatim. Sorry.

/Autrijus/


If you read this file _as_is_, just ignore the funny characters you
see. It is written in the POD format (see perlpod manpage) which is
specially designed to be readable as is.

The following documentation is written in Big5 encoding.

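If you are reading this document in an ordinary text editor, please
ignore the odd markup characters in it. This document is written in POD
(a concise documentation format); the format is specially designed so
that people can read it directly. For further information about this
format, please see the perlpod online documentation.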

=head1 NAME

perltw - Perl guide in Traditional Chinese

=head1 DESCRIPTION

Welcome to the world of Perl!

Starting with version 5.8.0, Perl has comprehensive Unicode support,
and with it support for many encodings outside the Latin family; CJK
(Chinese, Japanese, Korean) is one part of that. Unicode is an
international standard that attempts to cover all the characters in the
world: the Western world, the Eastern world, and everything in between
(Greek, Syriac, Arabic, Hebrew, Indic, Native American scripts, and so
on). It also accommodates multiple operating systems and platforms (such
as the PC and the Macintosh).

Perl itself operates in Unicode. This means that string data inside Perl
can be represented in Unicode; Perl's functions and operators (for
example, regular expression matching) can also operate on Unicode. For
input and output, to handle data stored in pre-Unicode encodings, Perl
provides the Encode module, which lets you easily read and write data in
legacy encodings.

The Encode extension supports the following Traditional Chinese
encodings ('big5' means 'big5-eten'):

    big5-eten   Big5 encoding (with ETen extension glyphs)
    big5-hkscs  Big5 + Hong Kong Supplementary Character Set, 2001 edition
    cp950       Code Page 950 (Big5 + characters added by Microsoft)

For example, to convert a Big5-encoded file to Unicode, just type the
following command:

    perl -Mencoding=big5,STDOUT,utf8 -pe1 < file.big5 > file.utf8

Perl also ships with "piconv", a character conversion utility written
entirely in Perl. It is used as follows:

    piconv -f big5 -t utf8 < file.big5 > file.utf8
    piconv -f utf8 -t big5 < file.utf8 > file.big5

In addition, with the encoding module you can easily write code that
works in units of characters, as shown below:

    #!/usr/bin/env perl
    # Enable big5 string parsing; standard input/output and standard
    # error are all set to the big5 encoding
    use encoding 'big5', STDIN => 'big5', STDOUT => 'big5';
    print length("駱駝");            #  2 (double quotes mean characters)
    print length('駱駝');            #  4 (single quotes mean bytes)
    print index("諄諄教誨", "彖帢"); # -1 (does not contain this substring)
    print index('諄諄教誨', '彖帢'); #  1 (starts at the second byte)

In the last example, the second byte of the first "諄" combines with the
first byte of the second "諄" to form the Big5 character "彖"; the second
byte of the second "諄" in turn combines with the first byte of "教" to
form "帢". Character-level matching solves this problem, which used to be
common when matching Big5 text.
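
The byte-overlap problem described in the README example can also be reproduced without the encoding pragma, by decoding explicitly with Encode. The following is a minimal sketch, not part of the README: the byte values are those of the README's two strings, written as hex escapes, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw Big5 bytes of the README's example strings, written as hex escapes
# so that this file itself stays plain ASCII.
my $haystack = "\xBD\xCE\xBD\xCE\xB1\xD0\xBB\xA3";   # four 2-byte characters
my $needle   = "\xCE\xBD\xCE\xB1";                   # two 2-byte characters

# Byte-level search: finds a false match that starts at the second byte,
# in the middle of the first character.
my $byte_pos = index($haystack, $needle);

# Character-level search: decode both strings first, then search.
my $char_pos = index(decode('big5-eten', $haystack),
                     decode('big5-eten', $needle));

print "byte-level index: $byte_pos\n";   # prints 1
print "char-level index: $char_pos\n";   # prints -1
```

Decoding before searching is the same idea the README expresses with double-quoted strings under the encoding pragma: comparisons happen on whole characters, so a needle can never match across a character boundary.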

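=head2 Extra Chinese encodings

If you need more Chinese encodings, you can download the
Encode::HanExtra module from CPAN (L). It currently provides the
following encodings:

    cccii       The 1980 Chinese Character Code for Information
                Interchange from the Council for Cultural Affairs
    euc-tw      Extended Unix Character set, containing CNS11643 planes 1-7
    big5plus    Big5+ from the Chinese Foundation for Digitization
                Technology (CMEX)
    big5ext     Big5e from the Chinese Foundation for Digitization
                Technology (CMEX)

In addition, the Encode::HanConvert module provides two encodings for
converting between Simplified and Traditional Chinese:

    big5-simp   Big5 Traditional Chinese <=> Unicode Simplified Chinese
    gbk-trad    GBK Simplified Chinese <=> Unicode Traditional Chinese

To convert between GBK and Big5, see the b2g.pl and g2b.pl programs
included with that module, or use the following in your program:

    use Encode::HanConvert;
    $euc_cn = big5_to_gb($big5); # convert from Big5 to GBK
    $big5 = gb_to_big5($euc_cn); # convert from GBK to Big5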

=head2 Further information

Please refer to the extensive documentation that ships with Perl
(unfortunately all written in English) to learn more about Perl and
about how to use Unicode. That said, external resources are quite
plentiful:

=head2 URLs providing Perl resources

=over 4

=item L

The Perl home page (maintained by O'Reilly)

=item L

The Comprehensive Perl Archive Network (CPAN)

=item L

A list of Perl mailing lists

=back

=head2 URLs for learning Perl

=over 4

=item L

O'Reilly's Perl books in Traditional Chinese

=item L

The Taiwan Perl discussion board (that is, the linked Perl boards of the major BBSes)

=back

=head2 Perl user groups

=over 4

=item L

A list of Taiwan Perl promotion groups

=item L

The Elixus online chat room

=back

=head2 Unicode-related URLs

=over 4

=item L

The Unicode Consortium (the maker of the Unicode standard)

=item L

UTF-8 and Unicode FAQ for Unix/Linux

=head2 Chinese localization information

=item Why is it called "Zhengti" Chinese (正體中文) rather than "Fanti" Chinese (繁體中文)?

L

=item The Chinese Software Localization Alliance

L

=item The Linux Software Chinese Localization Project

L

=back

=head1 SEE ALSO

L, L, L, L, L

=head1 AUTHORS

Jarkko Hietaniemi E[EMAIL PROTECTED]

Autrijus Tang (唐宗漢) E[EMAIL PROTECTED]

=cut


If you read this file _as_is_, just ignore the funny characters you
see. It is written in the POD format (see perlpod manpage) which is
specially designed to be readable as is.

The following documentation is written in EUC-CN enco

[PATCH] Big5-related changes.

2002-04-19 Thread Autrijus Tang

I've been immersed in Big5-related issues in the past few days, and
came back with these last-minute (err, week?) changes before 5.8-RC1.

The Diff contains fixes to TW.pm, Alias.pm, and README.(tw|cn).

(For jhi) README fixes are trivial -- they mention the new HanExtra
encodings, fix some China-specific word usage, and add my Latin-1 name.

(For dan) big5-hkscs should be upgraded to the 2001 edition, as per the
Hong Kong government's decree. It's available separately at:

http://egb.elixus.org/~autrijus/big5-hkscs.ucm.gz

Also, please delete big5.ucm and replace it with big5-eten, at:

http://egb.elixus.org/~autrijus/big5-eten.ucm.gz

I've fixed Alias.pm so big5 aliases to big5-eten. The reason is that
the 'Big5' as originally defined isn't used anywhere on earth;
non-Microsoft systems use 'big5' to mean 'big5-eten', and Microsoft
uses 'big5' to mean 'cp950'.

It is therefore unwise to have a canonical 'big5' encoding, much like
there should not be a 'gb2312' encoding. Since gb2312 is now aliased
to euc-cn and not cp936, I think big5 should alias to big5-eten and
not cp950.
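
The effect of the proposed aliasing can be sketched with a small stand-alone lookup table. This is not Encode::Alias itself: the resolve_big5_alias helper and the @ALIASES table are hypothetical, and the patterns follow the alias column documented in TW.pm (so the "en" in "eten" is optional).

```perl
use strict;
use warnings;

# Hypothetical stand-in for the alias table, one [pattern, canonical] pair
# per entry. The patterns follow the alias column documented in TW.pm.
my @ALIASES = (
    [ qr/\bbig-?5$/i           => 'big5-eten'  ],
    [ qr/\bbig5-?et(?:en)?$/i  => 'big5-eten'  ],
    [ qr/\bbig5-?hk(?:scs)?$/i => 'big5-hkscs' ],
);

# Return the canonical name for $name, or undef when no pattern matches.
sub resolve_big5_alias {
    my ($name) = @_;
    for my $entry (@ALIASES) {
        my ($pattern, $canonical) = @$entry;
        return $canonical if $name =~ $pattern;
    }
    return;
}

for my $name (qw(big5 Big-5 big5-eten big5et big5hk Big5-HKSCS cp950)) {
    my $canonical = resolve_big5_alias($name);
    printf "%-12s => %s\n", $name,
        defined $canonical ? $canonical : '(no alias)';
}
```

Under these patterns a bare 'big5' never names an encoding of its own; it always resolves to 'big5-eten', while 'cp950' stays a separate canonical name that the table does not touch.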



Oh, I just noticed that Dan retained the 'gb2312.ucm' name, although
the encoding is called 'gb2312-raw'. I admit that I don't fully
understand the reason, but if that's to stand, then big5-eten could also
be named 'big5.ucm', and still say '<code_set_name> "big5-eten"', for
consistency's sake.

Thanks,
/Autrijus/

--- /home/autrijus/perl/ext/Encode/TW/TW.pm Fri Apr 19 22:02:58 2002
+++ TW.pm   Sat Apr 20 03:13:07 2002
@@ -30,10 +30,10 @@
 
   Canonical   Alias                 Description
   
-  big5        /\bbig-?5$/i          The original Big5 encoding
-  big5-hkscs  /\bbig5-hk(scs)?$/i
-                                    Big5 plus Cantonese characters in
-                                    Hong Kong
+  big5-eten   /\bbig-?5$/i          Big5 encoding (with ETen extensions)
+              /\bbig5-?et(en)?$/i
+  big5-hkscs  /\bbig5-?hk(scs)?$/i
+                                    Big5 + Cantonese characters in Hong Kong
   MacChineseSimp                    Big5 + Apple Vendor Mappings
   cp950                             Code Page 950
                                     = Big5 + Microsoft vendor mappings
@@ -44,11 +44,18 @@
 =head1 NOTES
 
 Due to size concerns, C<EUC-TW> (Extended Unix Character), C<CCCII>
-(Chinese Character Code for Information Interchange) and C<BIG5PLUS>
-(CMEX's Big5+) are distributed separately on CPAN, under the name
-L<Encode::HanExtra>. That module also contains extra China-based encodings.
+(Chinese Character Code for Information Interchange), C<BIG5PLUS>
+(CMEX's Big5+) and C<BIG5EXT> (CMEX's Big5e) are distributed separately
+on CPAN, under the name L<Encode::HanExtra>. That module also contains
+extra China-based encodings.
 
 =head1 BUGS
+
+Since the original C<big5> encoding (1984) is not supported anywhere
+(glibc and DOS-based systems use C<big5> to mean C<big5-eten>; Microsoft
+uses C<big5> to mean C<cp950>), a conscious decision was made to alias
+C<big5> to C<big5-eten>, which is the de facto superset of the original
+big5.
 
 The C<CNS11643> encoding files are not complete. For common C<CNS11643>
 manipulation, please use C<EUC-TW> in L<Encode::HanExtra>, which contains
--- /home/autrijus/perl/ext/Encode/lib/Encode/Alias.pm  Wed Apr 10 05:13:28 2002
+++ Alias.pmSat Apr 20 03:11:11 2002
@@ -217,8 +217,9 @@
     define_alias( qr/(?:x-)?windows-949$/i   => '"cp949"' );
     define_alias( qr/\bks_c_5601-1987$/i     => '"cp949"' );
     # for Encode::TW
-    define_alias( qr/\bbig-?5$/i             => '"big5"' );
-    define_alias( qr/\bbig5-hk(?:scs)?$/i    => '"big5-hkscs"' );
+    define_alias( qr/\bbig-?5$/i             => '"big5-eten"' );
+    define_alias( qr/\bbig5-?et(?:en)?$/i    => '"big5-eten"' );
+    define_alias( qr/\bbig5-?hk(?:scs)?$/i   => '"big5-hkscs"' );
 }
 # utf8 is blessed :)
 define_alias( qr/^UTF-8$/i => '"utf8"',);
--- /home/autrijus/perl/README.tw   Thu Apr 18 06:01:01 2002
+++ README.tw   Sat Apr 20 03:15:51 2002
@@ -29,8 +29,8 @@
 
 Encode 延伸模組支援下列正體中文的編碼方式:
 
-big5   原始的 Big5 編碼 (含倚天日文字形)
-big5-hkscs Big5 + 香港外字集
+big5   Big5 編碼 (含倚天延伸字形)
+big5-hkscs Big5 + 香港外字集, 2001 年版
 cp950  字碼頁 950 (Big5 + 微軟添加的字符)
 
 舉例來說, 將 Big5 編碼的檔案轉成 Unicode, 祗需鍵入下列指令:
@@ -61,8 +61,10 @@
 如果需要更多的中文編碼, 可以從 CPAN (L) 下載
 Encode::HanExtra 模組. 它目前提供下列編碼方式:
 
+cccii  1980 年文建會的中文資訊交換碼
 euc-tw Unix 延伸字符集, 包含 CNS11643 平面 1-7
 big5plus   中文數位化技術推廣基金會的 Big5+
+big5ext    中文數位化技術推廣基金會的 Big5e
 
 另外, Encode::HanConvert 模組則提供了簡繁轉換用的兩種編碼:
 
@@ -163,6 +165,6 @@
 
 Jarkko Hietaniemi E[EMAIL PROTECTED]
 
-唐宗漢 E[EMAIL PROTECTED]
+Autrijus Tang (唐宗漢) E[EMAIL PROTECTED]
 
 =cut
--- /home/autrijus/perl/README.cn   Thu Apr 18 06:01:01 2002
+++ README.cn   Sat Apr 20 03:15:43 2002
@@ -24,7 +24,7 @@
 
 Perl 本身以 Unicode 进行操作. 这表示 Perl 内部的字符串数