Re: Japanese text search problem

Benjamin Franz Wed, 08 Aug 2001 11:51:50 -0700
On Wed, 8 Aug 2001, Dan Kogai wrote:

> on 01.8.8 1:14 AM, Benjamin Franz at [EMAIL PROTECTED] wrote:
> > On Tue, 7 Aug 2001, Ashutosh Salgarkar wrote:
> >
> > my $safe_key = quotemeta($key1);
> > $searchStr =~ m/$safe_key/;
> >
> > is probably what you want. I am presuming you are trying to use m// to
> > search for exact string matches rather than exploiting the full regex
> > facilities.
>
>   No.  quotemeta would not cut it.  It depends on what character set is fed
> to the regexes, but in most (virtually all) cases you convert strings to
> either EUC-jp or utf8.  Neither EUC-jp nor utf8 produces metacharacters when
> you use Japanese (or Korean or Chinese).  The problem is a bit deeper.
>   The problem is that before perl 5.6.x, characters and bytes are
> interchangeable, and a Japanese character (Kanji in what follows) takes 2
> bytes in EUC (and 3 bytes in utf8).
>
>   For example,
>
>   /\xd1\xf1/ and print; # I want to find a line that contains 'to bore'
>
>   not only matches the desired character but also 'camel', which is
> represented by two Kanji (4 bytes).
>
> \xb4\xc1 \xbb\xfa
> -------- --------
> <RAKU>   <DA>     = a camel
>     ---------
>     <TEKI>        = to bore
>
>   There are ways to overcome this character boundary problem with EUC, like
> inserting a delimiter character (such as beep or tab) between each Kanji, but
> that's way too counter-intuitive, not to mention slow.

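Oh, yeah. I forgot about that since I don't normally keep stuff in
JIS/SJIS/EUC-JP once I've acquired it. I always make my working store
UTF8. In UTF8 the 'frame' problem doesn't exist because continuation bytes
_ALWAYS_ have their top two bits set to 10, while character start bytes
_NEVER_ do (a start byte is either 0xxxxxxx for ASCII or 11xxxxxx for the
lead byte of a multi-byte character). A search string that is itself valid
UTF8 therefore can't match starting in the middle of a character.
'quotemeta' works fine if you use UTF8 as your working encoding.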
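
Here is a rough sketch of the difference, working purely on bytes so it
does not depend on any encoding module. The EUC-JP bytes are the ones from
the diagram above; the straddling needle and the two UTF8 characters
(U+6F22 and U+5B57) are just values I picked for illustration, and
is_continuation is a made-up helper, not anything built in.

```perl
use strict;
use warnings;

# EUC-JP bytes from the diagram above: two 2-byte characters.
my $euc_text = "\xb4\xc1\xbb\xfa";

# Illustrative needle: trailing byte of the first character followed by the
# leading byte of the second. In EUC-JP this pair is itself a well-formed
# 2-byte sequence, so a byte-wise search cannot tell the difference.
my $euc_needle    = quotemeta("\xc1\xbb");
my $euc_false_hit = ($euc_text =~ /$euc_needle/) ? 1 : 0;

# Two 3-byte characters (U+6F22 U+5B57) encoded as UTF8, chosen as an example.
my $utf8_text = "\xe6\xbc\xa2\xe5\xad\x97";

# Needle: the second character, as a complete UTF8 sequence.
my $utf8_needle = quotemeta("\xe5\xad\x97");

# UTF8 continuation bytes always look like 10xxxxxx.
sub is_continuation {
    my ($byte) = @_;
    return (ord($byte) & 0xC0) == 0x80 ? 1 : 0;
}

# Collect the byte offset of every match.
my @starts;
while ($utf8_text =~ /$utf8_needle/g) {
    push @starts, $-[0];
}

# Count matches that begin in the middle of a character.
my $misaligned = grep { is_continuation(substr($utf8_text, $_, 1)) } @starts;

print "EUC-JP straddling match found: $euc_false_hit\n";
print "UTF8 match offsets: @starts\n";
print "UTF8 matches starting mid-character: $misaligned\n";
```

The EUC-JP search reports a hit even though the needle straddles two
characters, while the UTF8 search only ever starts on a lead byte. Note
that this only holds when the needle itself is complete, valid UTF8; a
needle cut in the middle of a character can still match mid-character.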

-- 
Benjamin Franz

  Programs must be written for people to read, and only
  incidentally for machines to execute.
                             ---Abelson and Sussman