Re: ucm/cp???.ucm will be updated

2002-10-18 Thread Dan Kogai
Autrijus and others,

On Friday, Oct 18, 2002, at 22:21 Asia/Tokyo, Dan Kogai wrote:

[2] http://www.microsoft.com/typography/unicode/cscp.htm
[3] http://www.microsoft.com/typography/unicode/932.txt

[snip]

The URI [2] also has links to other code pages, so I would also like
to review them and, if necessary, update them.  The 8-bit code pages
(CP12??) seem OK, but the other CJK ones (CP9??) need review.

So I did that for 932 (JP), 936 (CN), 949 (KR), and 950 (TW).  The new
maps generated via http://www.microsoft.com/typography/unicode/9??.txt
all seem to pass the roundtrip tests in t/CJKT.t, but 936 and 950 fail
in t/at-cn.t and t/at-tw.t.

Aiiiya!  It was the fault of my ms2ucm.pl, which forgot to ignore the
";Lead Byte Range" line;  that line was mistakenly parsed as part of
the mapping.  With that fixed, the new mappings look okay and pass all
tests.
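
The kind of fix described above can be sketched as follows.  This is a
minimal Python illustration, not the actual ms2ucm.pl (which is Perl and
is not reproduced here).  The input layout (hex bytes, whitespace, hex
code point, with ';' starting a comment) and the name ms_lines_to_ucm
are assumptions made for the example.

```python
import re

# Assumed input layout (hypothetical, for illustration only):
#   <hex bytes> <whitespace> <hex code point>, with ';' starting a comment.
MAPPING_RE = re.compile(
    r"^(?:0x)?([0-9A-Fa-f]{2}(?:[0-9A-Fa-f]{2})?)\s+(?:0x)?([0-9A-Fa-f]{4})\b"
)


def ms_lines_to_ucm(lines):
    """Yield UCM-style mapping lines, skipping comments and blank lines."""
    for line in lines:
        # Drop everything from ';' onwards, so that a comment line such as
        # ";Lead Byte Range" can never be mistaken for a mapping.
        line = line.split(";", 1)[0].strip()
        if not line:
            continue
        match = MAPPING_RE.match(line)
        if match is None:
            continue  # not a mapping line in the assumed layout
        raw, ucs = match.groups()
        # Render the one or two bytes as \xHH escapes, as UCM files do.
        byte_seq = "".join(
            "\\x" + raw[i:i + 2].upper() for i in range(0, len(raw), 2)
        )
        yield f"<U{ucs.upper()}> {byte_seq} |0"
```

Stripping the comment before matching is the point of the sketch: a
line is only considered as a mapping once anything after ';' is gone.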

Nevertheless, the updated mappings are still subject to review.
Should you have any objection, please say so ASAP.  Otherwise I will
commit the new *.ucm files.

I would like you to review them at (sorry, the last 't' was missing)

http://www.dan.co.jp/~dankogai/bleedperl/cp-cjkt/

You can also find the crude script that I used for the conversion at

http://www.dan.co.jp/~dankogai/bleedperl/cp-cjkt/ms2ucm.pl

Xie4Xie4Ge3Zuo1 !

Dan the Encode Maintainer



ucm/cp???.ucm will be updated

2002-10-18 Thread Dan Kogai
Autrijus and others,

On Friday, Oct 18, 2002, at 22:21 Asia/Tokyo, Dan Kogai wrote:

[2] http://www.microsoft.com/typography/unicode/cscp.htm
[3] http://www.microsoft.com/typography/unicode/932.txt

[snip]

The URI [2] also has links to other code pages, so I would also like
to review them and, if necessary, update them.  The 8-bit code pages
(CP12??) seem OK, but the other CJK ones (CP9??) need review.

So I did that for 932 (JP), 936 (CN), 949 (KR), and 950 (TW).  The new
maps generated via http://www.microsoft.com/typography/unicode/9??.txt
all seem to pass the roundtrip tests in t/CJKT.t, but 936 and 950 fail
in t/at-cn.t and t/at-tw.t.

Those tests were originally submitted by Autrijus long ago as a patch
to t/CJKT.t and later wound up where they are now.

I find those tests rather obsolete, but I am no expert in the
encodings tested there.  So I would like you to review the new maps at

http://www.dan.co.jp/~dankogai/bleedperl/cp-cjk/

You can also find the crude script that I used for the conversion at

http://www.dan.co.jp/~dankogai/bleedperl/cp-cjk/ms2ucm.pl

Xie4Xie4Ge3Zuo1 !

Dan the Encode Maintainer



[OT] That annoying yen mark! [Was: Re: [Encode] ...]

2002-10-18 Thread Dan Kogai
On Friday, Oct 18, 2002, at 22:25 Asia/Tokyo, Nicholas Clark wrote:

On Fri, Oct 18, 2002 at 10:21:07PM +0900, Dan Kogai wrote:


AFAIK, CP¥d+ should be avoided for any data exchanged in the Net so you
         ^
Yen sign? That should be a backslash, as in CP\d+  ?


Right.  The smart-ass Mail.app (among other *.app) does this to me when
the input method is Japanese and '\' is typed.  I have configured
Kotoeri (the input method) to be English-friendly, so that it does
Kana-Kanji conversion when and only when caps lock is set (much more
convenient than toggling the keyboard script with command-space), but
even that won't stop it from replacing backslashes with yen marks.  You
have to get Kotoeri out of the picture, something you easily forget in
apps like Mail.

[I seem to remember something about some Japanese character sets
swapping \ and ¥ so that the Yen sign had a 7 bit value.

As you see, '\' now appears correctly because I have toggled Kotoeri
off.  I usually notice this in Terminal.app, where the difference is
critical, but not in Mail.app.

Well, at least with MacOS X you can TELL THE DIFFERENCE even though it 
is sometimes annoying;  Win* won't even let you notice that and you are 
trapped in "Yen jail" :)

Dan the Man with Too Many (Script|Encoding|Charset)s to Fiddle With



[Encode] HEADS-UP: ucm/cp932.ucm will be updated

2002-10-18 Thread Dan Kogai
Porters (especially Nick Ing-XS),

  I would like to release Encode 1.78 soon to address a problem in
CP932 (the MS version of Shift_JIS) that MORIYAMA Masayuki
<[EMAIL PROTECTED]> has discovered.  Not only has he identified the
problem, he has also supplied me with a patch.  Though he was reluctant
to come to perl(5-porters|unicode)@perl.org (I invited him, but he was
too shy to talk to us in English), the problem and the solution he has
raised were too good to ignore, so I would like to update Encode on his
behalf.  Here is a summary of his points.

* ucm/cp932.ucm was based on the mapping file at unicode.org [0], but
that mapping is obsolete;  it works on Windows 3.1 but not in the era
of Win32.
* as a result, cp932 is rendered almost useless, or at least too
impractical
* a patch was made available [1]

My first suggestion was "Ask MS to update the data at unicode.org, and
if you are unsatisfied with the one that comes with Encode, you are free
to CPANize your version".  But he raised even more points, and I was
finally convinced.

* Though it is not at unicode.org, MS has already made the mapping
available on their website [2][3]
* Python and Ruby will be using the MS version, not the one at
unicode.org
* Java has been known to suffer badly from confusing Shift_JIS and
CP932, but Encode is already free of this problem because it supplies
different mappings for Shift_JIS and CP932.
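
To illustrate why separate mappings for Shift_JIS and CP932 matter,
here is a small Python sketch.  It uses the "shift_jis" and "cp932"
codecs bundled with CPython, which are not Encode's tables and may
differ from them in detail;  the helper name differing_decodes and the
probe bytes are chosen only for this example.

```python
# CPython ships separate "shift_jis" and "cp932" codecs; they stand in
# here for the two distinct mappings (an assumption for illustration).
SJIS, CP932 = "shift_jis", "cp932"


def differing_decodes(byte_seqs):
    """Return {bytes: (shift_jis_text, cp932_text)} where the codecs disagree."""
    result = {}
    for raw in byte_seqs:
        # errors="replace" keeps going when one table lacks the sequence.
        as_sjis = raw.decode(SJIS, errors="replace")
        as_cp932 = raw.decode(CP932, errors="replace")
        if as_sjis != as_cp932:
            result[raw] = (as_sjis, as_cp932)
    return result


probes = [b"\x81\x60", b"\x87\x40", b"\x82\xa0"]
for raw, (sjis_text, cp932_text) in differing_decodes(probes).items():
    print(raw.hex(), ascii(sjis_text), ascii(cp932_text))
```

With these codecs, the bytes 0x8160 decode to U+301C (WAVE DASH) under
shift_jis but to U+FF5E (FULLWIDTH TILDE) under cp932, and 0x8740 is a
vendor extension that only cp932 maps (to U+2460).  Treating the two
encodings as one silently changes such characters on a roundtrip.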

[0] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
[1] http://www2d.biglobe.ne.jp/~msyk/perl/cp932.html
[2] http://www.microsoft.com/typography/unicode/cscp.htm
[3] http://www.microsoft.com/typography/unicode/932.txt

One small but significant concern is Tcl/Tk.  So far Encode's CP932
has matched that of Tcl, but it will not after my next release of
Encode.  So I decided to call for opinions before I commit the release.

AFAIK, CP¥d+ should be avoided for any data exchanged on the Net, so
you should not use it on the web or in mail, and it's perfectly all
right if Tk(Web|Mail) has a problem handling them.  At the same time,
Win32 Perl users would be much happier if CP¥d+ were made more
practical.

The URI [2] also has links to other code pages, so I would also like
to review them and, if necessary, update them.  The 8-bit code pages
(CP12??) seem OK, but the other CJK ones (CP9??) need review.

Dan the Encode Maintainer