Re: ucm/cp???.ucm will be updated
Autrijus and others,

On Friday, Oct 18, 2002, at 22:21 Asia/Tokyo, Dan Kogai wrote:

[2] http://www.microsoft.com/typography/unicode/cscp.htm
[3] http://www.microsoft.com/typography/unicode/932.txt

[snip]

The URI [2] also has links to other code pages, so I would also like to review them and, if necessary, update them. The 8-bit code pages (CP12??) seem OK, but the other CJK ones (CP9??) need review.

So I did so for 932 (JP), 936 (CN), 949 (KR), and 950 (TW). The new maps generated via http://www.microsoft.com/typography/unicode/9??.txt all seem to pass the roundtrip tests in t/CJKT.t, but 936 and 950 fail in t/at-cn.t and t/at-tw.t.

Aiiiya! It was the fault of my ms2ucm.pl, which forgot to ignore the ";Lead Byte Range" line; that line was mistakenly parsed as part of the mapping. With that fixed, the new mappings look okay and pass all tests.

Nevertheless, the updated mappings are still subject to review. Should you have any objection, please say so ASAP. Otherwise I will commit the new *.ucm.

I would like you to review them at (sorry, the last 't' was missing)

http://www.dan.co.jp/~dankogai/bleedperl/cp-cjkt/

You can also find the crude script that I used for the conversion at

http://www.dan.co.jp/~dankogai/bleedperl/cp-cjkt/ms2ucm.pl

Xie4Xie4Ge3Zuo1 !

Dan the Encode Maintainer
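The ";Lead Byte Range" bug described above can be illustrated with a minimal sketch. This is not ms2ucm.pl (which is a Perl script and is not reproduced in this thread); it is a Python illustration under stated assumptions. The input layout (two hex fields per line, optionally followed by a ';' comment) is a simplification assumed for this example, and the names `parse_pairs` and `ucm_line` are hypothetical. The point is only that a lead-byte range line such as "0x81 0x9f ;Lead Byte Range" has the same shape as a mapping line and has to be skipped explicitly.

```python
import re

# Assumed line shape: two hex fields, optionally followed by a ';' comment.
PAIR = re.compile(r'^\s*0x([0-9A-Fa-f]+)\s+0x([0-9A-Fa-f]+)\s*(?:;(.*))?$')


def parse_pairs(lines):
    """Yield (code, codepoint) pairs, skipping lead-byte range lines."""
    for line in lines:
        m = PAIR.match(line)
        if m is None:
            continue  # headers, blank lines, full-line comments
        comment = (m.group(3) or '').strip()
        if comment.lower().startswith('lead byte range'):
            continue  # a byte range, not a character mapping
        yield int(m.group(1), 16), int(m.group(2), 16)


def ucm_line(code, ucs):
    """Format one mapping in UCM style: <Uxxxx> \\xNN... |0"""
    nbytes = (code.bit_length() + 7) // 8 or 1
    raw = code.to_bytes(nbytes, 'big')
    escaped = ''.join('\\x%02X' % b for b in raw)
    return '<U%04X> %s |0' % (ucs, escaped)


# Hypothetical sample input, not copied from the Microsoft files.
sample = [
    'CODEPAGE 932',
    '0x41 0x0041',
    '0x81 0x9f ;Lead Byte Range',
    '0x8140 0x3000 ; ideographic space',
]

for code, ucs in parse_pairs(sample):
    print(ucm_line(code, ucs))
```

Without the comment check, the range line would be emitted as if byte 0x81 mapped to U+009F, which is the kind of stray entry that makes otherwise correct tables fail their tests.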
ucm/cp???.ucm will be updated
Autrijus and others,

On Friday, Oct 18, 2002, at 22:21 Asia/Tokyo, Dan Kogai wrote:

[2] http://www.microsoft.com/typography/unicode/cscp.htm
[3] http://www.microsoft.com/typography/unicode/932.txt

[snip]

The URI [2] also has links to other code pages, so I would also like to review them and, if necessary, update them. The 8-bit code pages (CP12??) seem OK, but the other CJK ones (CP9??) need review.

So I did so for 932 (JP), 936 (CN), 949 (KR), and 950 (TW). The new maps generated via http://www.microsoft.com/typography/unicode/9??.txt all seem to pass the roundtrip tests in t/CJKT.t, but 936 and 950 fail in t/at-cn.t and t/at-tw.t.

Those tests were originally submitted as a patch to t/CJKT.t by Autrijus a long time ago and then wound up where they are now. I find them rather obsolete, but I am no expert in the encodings they test. So I would like you to review the new maps at

http://www.dan.co.jp/~dankogai/bleedperl/cp-cjk/

You can also find the crude script that I used for the conversion at

http://www.dan.co.jp/~dankogai/bleedperl/cp-cjk/ms2ucm.pl

Xie4Xie4Ge3Zuo1 !

Dan the Encode Maintainer
[OT] That annoying yen mark! [Was: Re: [Encode] ...]
On Friday, Oct 18, 2002, at 22:25 Asia/Tokyo, Nicholas Clark wrote:

On Fri, Oct 18, 2002 at 10:21:07PM +0900, Dan Kogai wrote:

AFAIK, CP¥d+ should be avoided for any data exchanged in the Net so you

^ Yen sign? That should be a backslash, as in CP\d+ ?

Right. The smart-ass Mail.app (among other *.app) does this to me when the input method is Japanese and '\' is typed. I have configured Kotoeri (the input method) to be English-friendly, so that it does Kana-Kanji conversion when and only when caps lock is set (much more convenient than toggling the keyboard script with command-space), but even that won't stop it from replacing backslashes with yen marks. You have to get Kotoeri out of the picture, something you so easily forget in apps like Mail.

[I seem to remember something about some Japanese character sets swapping \ and ¥ so that the Yen sign had a 7 bit value.]

As you can see, '\' now appears correctly because I have toggled Kotoeri off. I usually notice this in Terminal.app, where the difference is critical, but not in Mail.app.

Well, at least with MacOS X you can TELL THE DIFFERENCE, even though it is sometimes annoying; Win* won't even let you notice it, and you are trapped in "Yen jail" :)

Dan the Man with Too Many (Script|Encoding|Charset)s to Fiddle With
[Encode] HEADS-UP: ucm/cp932.ucm will be updated
Porters (especially Nick Ing-XS),

I would like to release Encode 1.78 soon to address the problem in CP932 (the MS version of Shift_JIS) which MORIYAMA Masayuki <[EMAIL PROTECTED]> has discovered. Not only has he identified the problem, he has also supplied me with a patch. Though he was reluctant to come to perl(5-porters|unicode)@perl.org (I invited him, but he was too shy to talk to us in English), the problem and solution he raised were too good to ignore, so I would like to update Encode on his behalf.

Here is a summary of his points.

* ucm/cp932.ucm was based on the mapping file at unicode.org [0], but that mapping is obsolete; it works on Windows 3.1 but not in the era of Win32.
* As a result, cp932 is rendered almost useless, or at least too impractical.
* A patch was made available [1].

My first suggestion was to "Ask MS to update the data at unicode.org and if you are unsatisfied w/ the one that comes w/ Encode you are free to CPANize your version". But he raised even more points and I was finally convinced.

* Though not at unicode.org, MS has already made the mapping available on their web site [2][3].
* Python and Ruby will be using the MS version, not the one at unicode.org.
* Java has been known to suffer badly from confusing Shift_JIS and CP932, but Encode is already free of this problem because it supplies different mappings for Shift_JIS and CP932.

[0] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
[1] http://www2d.biglobe.ne.jp/~msyk/perl/cp932.html
[2] http://www.microsoft.com/typography/unicode/cscp.htm
[3] http://www.microsoft.com/typography/unicode/932.txt

One small but significant concern is Tcl/Tk. So far Encode's CP932 matches that of Tcl, but it will not after my next release of Encode. So I decided to call for opinions before I commit the release. AFAIK, CP¥d+ should be avoided for any data exchanged in the Net, so you should not use it on the web or in mail; it is therefore perfectly all right if Tk(Web|Mail) has a problem handling them.

At the same time, Win32 Perl users would be much happier if CP¥d+ were made more practical.

The URI [2] also has links to other code pages, so I would also like to review them and, if necessary, update them. The 8-bit code pages (CP12??) seem OK, but the other CJK ones (CP9??) need review.

Dan the Encode Maintainer
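To make the Shift_JIS versus CP932 distinction concrete, here is a small sketch. It uses Python's built-in `shift_jis` and `cp932` codecs rather than Perl's Encode, purely as an assumption for illustration; Encode's own tables are the *.ucm files discussed in this thread and may differ in detail. The helper name `encodable` is hypothetical. The sketch shows one byte sequence that the two encodings map to different characters, and one character that only CP932 can encode.

```python
# The same two bytes decode to different characters in the two codecs.
sjis_char = b'\x81\x60'.decode('shift_jis')
cp932_char = b'\x81\x60'.decode('cp932')
print('shift_jis: U+%04X' % ord(sjis_char))   # U+301C WAVE DASH
print('cp932:     U+%04X' % ord(cp932_char))  # U+FF5E FULLWIDTH TILDE


def encodable(text, encoding):
    """Return True if text can be encoded in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# U+2460 CIRCLED DIGIT ONE is a vendor extension present only in CP932.
print(encodable('\u2460', 'cp932'))      # True
print(encodable('\u2460', 'shift_jis'))  # False
```

Treating the two names as aliases for one table silently changes characters like these, which is why keeping separate mappings matters.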