Robin explained:

> First off - I didn't post specifics because I wasn't sure that it might
> be of interest to the OSX perl community as a whole, I hoped to get the
> interested parties emailing me privately, but then again the total
> scarcity of docs (that I could find in English) regarding this topic on
> the net, means it probably merits public posting.

Sure does. Lots of people pop up with these kinds of questions lately.

> On Tuesday, October 1, 2002, at 06:47 PM, Dan Kogai wrote:
>
> > just perldoc Encode && perldoc Encode::JP.
>
> Dan, I know you are one of the driving forces behind the unicode side
> of perl 5.8.0, hats off to you man (sincerely), I got as far as perldoc
> Encode but haven't yet got to Encode::JP - there's a lot to read.
>
> My basic problem is I don't have any fast n' hard examples to go by
> which I can apply to the situation where I find myself now which is:
>
> *parse a collection of ASCII docs mixed in with docs in iso-2022-jp,
> shiftjis and possibly 7bit-jis, (by which I mean each doc could be 1 of
> three encodings, not 1 doc a mixture of all three).

So do you need something that will guess at the encoding? It would sure be neat if Perl 5.8 has that. I've seen something like that for Japanese only, and could probably build it myself if I had the time. (Not volunteering at this point, anyway.)

> *parse for tokens (Kanji characters - ie neither Hiragana or Katakana)

Something equivalent to ctype? If so, and if you don't find anything likely in the pods, _and_ if you're okay reading C, look around here for ideas:

http://www.page.sannet.ne.jp/joel_rees/sjctype/jtypeplan.html

(I really ought to finish that project. But I guess my kids come first.)

> *do regex substitutions accordingly

Good question. Of course, you should be able to convert to Unicode and do regex on that, then convert back. That might be more effective than borrowing constants from the C source I linked above and hardwiring things. Unicode guarantees round-trip conversion at this point, and it's definitely close enough for anything you're likely to run across in normal text. (That's assuming there are no bugs in the Perl code.)
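
To make the "convert to Unicode, regex, convert back" idea concrete, here is a rough sketch. It is written in Python only because that is quick to try out; Perl's Encode module offers the same decode and encode steps. The function names (guess_encoding, substitute_kanji) are made up for illustration, the guess assumes each document is exactly one of ASCII, ISO-2022-JP, or Shift-JIS, and "kanji" is approximated as the CJK Unified Ideographs block (U+4E00 to U+9FFF), which leaves hiragana and katakana out.

```python
import re

def guess_encoding(data: bytes) -> str:
    """Very rough guess among ASCII, ISO-2022-JP and Shift-JIS (assumed set)."""
    if b"\x1b" in data:
        # ISO-2022-JP switches character sets with ESC sequences;
        # ESC is not a valid Shift-JIS trail byte, so this check is safe here.
        return "iso2022_jp"
    if all(b < 0x80 for b in data):
        # Pure 7-bit text with no escape sequences.
        return "ascii"
    # Anything with 8-bit bytes is assumed to be Shift-JIS.
    return "shift_jis"

# Assumption: "kanji" means the CJK Unified Ideographs block.
# Hiragana (U+3040-U+309F) and katakana (U+30A0-U+30FF) fall outside it.
KANJI = re.compile(r"[\u4e00-\u9fff]+")

def substitute_kanji(data: bytes, replace) -> bytes:
    """Decode to Unicode, rewrite each run of kanji, encode back."""
    encoding = guess_encoding(data)
    text = data.decode(encoding)  # raises UnicodeDecodeError on a wrong guess
    text = KANJI.sub(lambda m: replace(m.group(0)), text)
    return text.encode(encoding)

if __name__ == "__main__":
    sample = "10月のひらがなとカタカナ".encode("shift_jis")
    marked = substitute_kanji(sample, lambda run: "[" + run + "]")
    print(marked.decode("shift_jis"))
```

A wrong guess will usually show up as a decode error rather than as silently mangled text, which is probably what you want when the input is a mixed pile of files.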

> The unicode site however unicode apparently lumps kanji in with
> Chinese, which is understandable but not helpful as I need specific
> code points for specific Kanji characters ie '月' which are featured in
> U3200.pdf but as glyphs combined with the number ie codes 32C1 - 32CB.

Does this mean you'd like a tool to parse the characters and show them in hex? I could do that pretty quickly, I suppose, if you don't find anything in the pods. Perl would be better than C or Java, right? Actually, you might even find the machinery to do it exposed somewhere in Perl 5.8.

Wait, wait, I get it. You need to do things like convert month numbers in shift-JIS to the combined forms provided by Unicode. Hmm.

What you really want is a table of characters vs. values. That would be doable in Javascript, for anyone who has Japanese support loaded for their web browsers. (My provider is static-only at this point, so it would have to be browser side or static, and last I looked I'm only allowed 10 MB.) Maybe I could give that a try in Javascript sometime in the near future.

Anyway, the JIS codes for the full-width digits 0 through 9 are the range 0x2330 to 0x2339. For shift-JIS, the equivalent range is 0x824f to 0x8258. For euc-jp, you add 0x80 to each byte of the JIS codes, which gives 0xa3b0 to 0xa3b9. Of course, with shift-JIS and euc-jp, you'll also want to check the ASCII range (0x30 to 0x39).

Do you need the kanji numerals?

You probably want these:

toshi (year) is 0x472f (JIS) and 0x944e (shift).
tsuki (month) is 0x376e (JIS) and 0x8c8e (shift).
nichi (day) is 0x467c (JIS) and 0x93fa (shift).

But, thinking again, converting to Unicode and applying the regex would make more sense, especially since the combined characters don't exist in the older versions of standard JIS. (Don't remember whether they exist in the newest version.)
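
Here is a rough sketch of both pieces: a small table of characters vs. values, and the month conversion done on the Unicode side. It is in Python rather than Perl or Javascript, purely as an illustration. The names code_table and to_combined_month are invented for this example. The table assumes every character is a two-byte JIS X 0208 character, so that the JIS code is just the euc-jp bytes minus 0x80 each. The month conversion assumes the combined month symbols run from U+32C0 (January) through U+32CB (December), which is how the Unicode chart lists them.

```python
import re

def code_table(chars: str):
    """Return (char, Unicode, JIS, shift-JIS, euc-jp) rows, codes as hex strings.

    Assumption: every character is a two-byte JIS X 0208 character.
    """
    rows = []
    for ch in chars:
        euc = ch.encode("euc_jp")
        sjis = ch.encode("shift_jis")
        # euc-jp is the JIS code with 0x80 added to each byte, so subtract it.
        jis = bytes(b - 0x80 for b in euc)
        rows.append((ch, f"U+{ord(ch):04X}", jis.hex(), sjis.hex(), euc.hex()))
    return rows

# A month number (ASCII or full-width digits, 1 through 12) followed by 月.
# The lookbehind keeps "13月" from matching as "3月".
MONTH = re.compile(r"(?<!\d)(1[0-2]|[1-9]|１[０-２]|[１-９])月")

def to_combined_month(text: str) -> str:
    """Replace e.g. '10月' with the single combined symbol for October."""
    return MONTH.sub(lambda m: chr(0x32C0 + int(m.group(1)) - 1), text)

if __name__ == "__main__":
    for row in code_table("年月日０９"):
        print(*row)
    print(to_combined_month("2002年10月1日"))
```

To apply the month conversion to shift-JIS text, decode it to Unicode first and run to_combined_month on the result. Encoding the output back to shift-JIS will likely fail for the combined symbols themselves, since the older JIS standards don't have them.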

If you instead need to convert from the combined forms to character-by-character, apply the regex to the Unicode and then convert to whichever JIS.

> Then, as my own intuition was drawing blanks, I thought perhaps I
> should ask if anyone else has some pointers which led to my original
> posting.
>
> Pointers anyone ^_^?

I'm probably thinking all around what you want to do and missing the point. :)

--
Joel Rees <[EMAIL PROTECTED]>