Re: Parsing JIS X 0208 & Shift JIS with 5.8.0

Joel Rees Wed, 02 Oct 2002 00:59:17 -0700

Robin explained:

> First off - I didn't post specifics because I wasn't sure that it might 
> be of interest to the OSX perl community as a whole, I hoped to get the 
> interested parties emailing me privately, but then again the total 
> scarcity of docs (that I could find in English) regarding this topic on 
> the net means it probably merits public posting.

Sure does. Lots of people pop up with these kinds of questions lately.

> On Tuesday, October 1, 2002, at 06:47 PM, Dan Kogai wrote:
> > just perldoc Encode && perldoc Encode::JP.
> 
> Dan, I know you are one of the driving forces behind the unicode side 
> of perl 5.8.0, hats off to you man (sincerely), I got as far as perldoc 
> Encode but haven't yet got to Encode::JP - there's a lot to read.
> 
> My basic problem is I don't have any fast n' hard examples to go by 
> which I can apply to the situation where I find myself now which is:
> 
> *parse a collection of ASCII docs mixed in with docs in iso-2022-jp, 
> shiftjis and possibly 7bit-jis, (by which I mean each doc could be 1 of 
> three encodings, not 1 doc a mixture of all three).

So do you need something that will guess at the encoding? It would sure
be neat if Perl 5.8 has that. I've seen something like that for Japanese
only, could probably build it myself, if I had the time. (Not
volunteering at this point, anyway.)
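
For the guessing part, here is a minimal sketch, assuming the Encode::Guess module is installed alongside Encode. The helper name `guess_name` and the sample strings are made up for illustration. `guess_encoding` returns an encoding object on success and a plain error string otherwise, so the result has to be checked with `ref`.

```perl
use strict;
use warnings;
use Encode;
use Encode::Guess;

# Hypothetical helper: name of the guessed encoding, or undef if the
# guess failed (guess_encoding returns an error string in that case).
sub guess_name {
    my ($octets) = @_;
    my $enc = guess_encoding($octets, qw/euc-jp shiftjis 7bit-jis/);
    return ref($enc) ? $enc->name : undef;
}

# Sample data: the year, month and day kanji (U+5E74, U+6708, U+65E5).
my $text = "\x{5E74}\x{6708}\x{65E5}";
my $sjis = encode('shiftjis', $text);
my $jis  = encode('7bit-jis', $text);

for my $sample ($sjis, $jis, "plain ASCII") {
    my $name = guess_name($sample);
    print defined $name ? $name : 'no unique guess', "\n";
}
```

Short samples can be ambiguous, especially between euc-jp and shift-JIS, so a real script should be prepared for the "no unique guess" case.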

> *parse for tokens (Kanji characters - i.e. neither Hiragana nor Katakana)

Something equivalent to ctype?

If so, and if you don't find anything likely in the pods, _and_ if you're
okay reading C, look around here for ideas:

    http://www.page.sannet.ne.jp/joel_rees/sjctype/jtypeplan.html

(I really ought to finish that project. But I guess my kids come first.)
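
If the text has already been decoded to Perl characters, something ctype-like can be sketched with the Unicode script properties in Perl's regex engine. This is only an illustration: the function name `jp_class` is hypothetical, and it assumes that `\p{Han}` is an acceptable definition of "kanji" for your purposes.

```perl
use strict;
use warnings;

# Hypothetical ctype-style helper for one already-decoded character.
sub jp_class {
    my ($ch) = @_;
    return 'kanji'    if $ch =~ /\A\p{Han}\z/;
    return 'hiragana' if $ch =~ /\A\p{Hiragana}\z/;
    return 'katakana' if $ch =~ /\A\p{Katakana}\z/;
    return 'other';
}

# U+6708 (month kanji), U+3042 (hiragana A), U+30A2 (katakana A), ASCII "A"
for my $cp (0x6708, 0x3042, 0x30A2, 0x41) {
    printf "U+%04X %s\n", $cp, jp_class(chr $cp);
}
```

Some marks that appear inside katakana words, such as the long-vowel mark, are not in the Katakana script, so they fall into "other" here.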

> *do regex substitutions accordingly

Good question. Of course, you should be able to convert to Unicode and
do regex on that, then convert back. That might be more effective than
borrowing constants from the C source I linked above and hardwiring
things.

Unicode is supposed to guarantee round-trip conversion with JIS at this
point, and it's definitely close enough for anything you're likely to run
across in normal text. (That's assuming there are no bugs in the Perl
code.)
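
A minimal sketch of the convert, regex, convert-back idea, assuming the Encode module and its `shiftjis` encoding name. The sample date string and the zero-padding substitution are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Shift-JIS bytes: "2002" + year kanji + "9" + month kanji (ASCII digits).
my $sjis_in = "2002\x94\x4e" . "9\x8c\x8e";

# Bytes -> Perl characters.
my $text = decode('shiftjis', $sjis_in);

# Work on characters: zero-pad the number in front of the month kanji.
(my $edited = $text) =~ s/(\d+)\x{6708}/sprintf("%02d\x{6708}", $1)/e;

# Characters -> bytes again.
my $sjis_out = encode('shiftjis', $edited);

print unpack('H*', $sjis_out), "\n";

# Round-trip check on the untouched text.
print "round trip ok\n" if encode('shiftjis', $text) eq $sjis_in;
```

The regex never sees a lead or trail byte on its own, which is the main advantage over matching against the raw shift-JIS bytes.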

> As for the Unicode site, Unicode apparently lumps kanji in with 
> Chinese, which is understandable but not helpful as I need specific 
> code points for specific Kanji characters, i.e. '月', which are featured in 
> U3200.pdf but as glyphs combined with the number, i.e. codes 32C1 - 32CB.

Does this mean you'd like a tool to parse the characters and show them
in hex? I could do that pretty quickly, I suppose, if you don't find
anything in the pods. Perl would be better than C or Java, right?
Actually, you might even find the machinery to do it exposed somewhere in
Perl 5.8.
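
A tool like that could be sketched in a few lines, assuming the Encode module. The helper name `codepoints` is hypothetical, and the input here is just the shift-JIS bytes for the month and day kanji.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a byte string and list its Unicode
# code points in hex.
sub codepoints {
    my ($encoding, $octets) = @_;
    my $text = decode($encoding, $octets);
    return map { sprintf('U+%04X', ord($_)) } split //, $text;
}

# Shift-JIS bytes for the month and day kanji.
print join(' ', codepoints('shiftjis', "\x8c\x8e\x93\xfa")), "\n";
# prints "U+6708 U+65E5"
```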

Wait, wait, I get it. You need to do things like convert month numbers
in shift-JIS to the combined forms provided by Unicode.

Hmm. What you really want is a table of characters vs. values. That
would be doable in JavaScript, for anyone who has Japanese support
loaded in their web browser. (My provider is static-only at this point,
so it would have to be browser-side or static, and last I looked I'm
only allowed 10 MB.) Maybe I could give that a try in JavaScript sometime
in the near future.

Anyway, the JIS codes for the (full-width) digits are the range 0x2330 to
0x2339. For shift-JIS, the equivalent range is 0x824f to 0x8258. For
euc-jp, you just add 0x80 to each byte of the JIS codes, which gives
0xa3b0 to 0xa3b9. (0xa0 is what you add to the row and cell numbers, not
to the JIS bytes.)

Of course, with shift-JIS and euc-jp, you'll also want to check the
ASCII range (0x30 to 0x39).
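
The arithmetic between the three forms can be sketched without any modules. The function names are hypothetical, and the sketch assumes plain two-byte JIS X 0208 codes as input. For euc-jp the offset is 0x80 on each byte; the shift-JIS folding is the usual one.

```perl
use strict;
use warnings;

# JIS X 0208 code -> EUC-JP: set the high bit of both bytes
# (the same as adding 0x80 to each byte).
sub jis_to_euc {
    my ($jis) = @_;
    return $jis | 0x8080;
}

# JIS X 0208 code -> Shift-JIS: two JIS rows fold into one lead byte.
sub jis_to_sjis {
    my ($jis) = @_;
    my ($j1, $j2) = ($jis >> 8, $jis & 0xFF);
    my $s1 = (($j1 - 0x21) >> 1) + 0x81;
    $s1 += 0x40 if $s1 > 0x9F;                # lead bytes jump from 0x9F to 0xE0
    my $s2 = ($j1 & 1)
        ? $j2 + 0x1F + ($j2 >= 0x60 ? 1 : 0)  # odd row: skip 0x7F
        : $j2 + 0x7E;                         # even row
    return ($s1 << 8) | $s2;
}

for my $jis (0x2330, 0x2339) {
    printf "JIS %04x  EUC %04x  SJIS %04x\n",
        $jis, jis_to_euc($jis), jis_to_sjis($jis);
}
```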

Do you need the kanji numerals?

You probably want these:

toshi (year, 年) is 0x472f (JIS) and 0x944e (shift-JIS).
tsuki (month, 月) is 0x376e (JIS) and 0x8c8e (shift-JIS).
nichi (day, 日) is 0x467c (JIS) and 0x93fa (shift-JIS).
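
These values can be cross-checked from the Unicode side, assuming the Encode module and its `shiftjis` and `euc-jp` encodings. The helper name `hex_in` is hypothetical. The JIS column is derived from the euc-jp bytes by clearing the high bits.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex string of one character's bytes in an encoding.
sub hex_in {
    my ($encoding, $char) = @_;
    return unpack 'H*', encode($encoding, $char);
}

my @kanji = ([year => "\x{5E74}"], [month => "\x{6708}"], [day => "\x{65E5}"]);
for my $pair (@kanji) {
    my ($name, $char) = @$pair;
    my $euc = hex_in('euc-jp', $char);
    printf "%-5s JIS %04x  shift-JIS %s  euc-jp %s\n",
        $name, hex($euc) & 0x7F7F, hex_in('shiftjis', $char), $euc;
}
```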

But, thinking again, converting to Unicode and applying the regex would
make more sense, especially since the combined characters don't exist in
the older versions of standard JIS. (Don't remember whether they exist
in the newest version.)

Or, if you need to convert from the combined forms to
character-by-character, you should apply the regex to the Unicode and
then convert to whichever JIS.
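
Both directions can be sketched on already-decoded text. The assumptions: the combined month symbols are the Unicode block U+32C0 (January) through U+32CB (December), the month numbers are written with ASCII digits, and the function names are made up for illustration.

```perl
use strict;
use warnings;

my $MONTH = "\x{6708}";   # the month kanji

# "10" + month kanji -> one combined symbol (U+32C0 is January).
sub to_combined {
    my ($text) = @_;
    $text =~ s/(?<![0-9])(1[0-2]|0?[1-9])$MONTH/chr(0x32C0 + $1 - 1)/ge;
    return $text;
}

# Combined symbol -> ASCII digits followed by the month kanji.
sub from_combined {
    my ($text) = @_;
    $text =~ s/([\x{32C0}-\x{32CB}])/(ord($1) - 0x32C0 + 1) . $MONTH/ge;
    return $text;
}

my $combined = to_combined("10$MONTH");
printf "U+%04X\n", ord $combined;
print from_combined($combined) eq "10$MONTH" ? "round trip ok\n" : "mismatch\n";
```

When the target is one of the JIS encodings, run `from_combined` before encoding, since the combined symbols may have no JIS mapping.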

> Then, as my own intuition was drawing blanks, I thought perhaps I 
> should ask if anyone else has some pointers which led to my original 
> posting.
> 
> Pointers anyone ^_^?

I'm probably thinking all around what you want to do and missing the
point. :)

-- 
Joel Rees <[EMAIL PROTECTED]>
