Re: Japanese text search problem

Dan Kogai Tue, 07 Aug 2001 10:36:11 -0700

on 01.8.8 1:14 AM, Benjamin Franz at [EMAIL PROTECTED] wrote:
> On Tue, 7 Aug 2001, Ashutosh Salgarkar wrote:
> 
> my $safe_key = quotemeta($key1);
> $searchStr =~ m/$safe_key/;
> 
> is probably what you want. I am presuming you are trying to use m// to
> search for exact string matches rather than exploiting the full regex
> facilities.


  No.  quotemeta will not cut it.  It depends on which character set is fed
to the regex, but in most (virtually all) cases you convert strings to either
EUC-JP or UTF-8.  In both, every byte of a multibyte Japanese (or Korean or
Chinese) character has its high bit set, so none of those bytes can be
mistaken for an ASCII regex metacharacter.  The problem is a bit deeper.
  The problem is that before perl 5.6.x, characters and bytes were
interchangeable, so a regex matches byte by byte and knows nothing about
character boundaries, while a Japanese character (Kanji hereafter) takes
2 bytes in EUC (and 3 bytes in UTF-8).

  For example,

  /\xd1\xf1/ and print; # I want to find a line that contains 'to bore'

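  not only matches the desired character but also matches inside 'camel',
which is written with two Kanji (4 bytes): the second byte of the first
Kanji followed by the first byte of the second Kanji is the same byte pair.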

\xf1\xd1 \xf1\xcc
-------- --------
<RAKU>   <DA>     = a camel
    ---------
    <TEKI>        = to bore
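
  The false match can be reproduced with a few lines of Perl.  This is only
a minimal sketch: the variable names are made up for illustration, and the
4-byte value used for 'camel' is an assumption, chosen so that its two middle
bytes are the \xd1\xf1 pair from the pattern above.  It also applies
quotemeta to the key, to show that escaping does not change the outcome.

```perl
use strict;
use warnings;

# EUC-JP byte strings (assumed values, see the note above).
my $teki   = "\xd1\xf1";           # the single Kanji being searched for
my $rakuda = "\xf1\xd1\xf1\xcc";   # a two-Kanji word, 4 bytes

# Byte-wise match: succeeds although $rakuda does not contain that Kanji.
my $naive = ($rakuda =~ /$teki/) ? 1 : 0;

# quotemeta does not help: the escaped pattern still matches the same bytes.
my $safe_key = quotemeta($teki);
my $quoted   = ($rakuda =~ /$safe_key/) ? 1 : 0;

# Where the match starts: offset 1 is the middle of the first Kanji.
my $offset = index($rakuda, $teki);

print "naive=$naive quoted=$quoted offset=$offset\n";
```

  An odd byte offset in a string made only of 2-byte characters means the
hit began in the middle of a character, which is exactly the boundary
problem in the diagram.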

  There are ways to overcome this character boundary problem with EUC, such
as inserting a delimiter character (BEL or tab, for example) between each
Kanji, but that is far too counter-intuitive, not to mention slow.
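
  For illustration only, the delimiter workaround could look roughly like
the sketch below.  The helpers delimit() and contains_chars() are
hypothetical names, the byte ranges are a simplified description of EUC-JP,
and bytes that do not form a valid character are silently skipped.  Every
search pays for re-scanning and copying the whole text, which is the
slowness mentioned above.

```perl
use strict;
use warnings;

# One EUC-JP character: ASCII, SS3 (\x8f) plus two bytes, or a two-byte
# sequence (which also covers SS2 (\x8e) plus a half-width kana byte).
my $euc_char = qr/[\x00-\x7f]|\x8f[\xa1-\xfe][\xa1-\xfe]|[\x8e\xa1-\xfe][\xa1-\xfe]/;

# Hypothetical helper: append a tab to every character so that the
# character boundaries become visible to a byte-wise search.
sub delimit {
    my ($bytes) = @_;
    return join '', map { "$_\t" } $bytes =~ /($euc_char)/g;
}

# Hypothetical helper: true only if $key occurs in $text on character
# boundaries.  The leading tab anchors the start of the first character.
sub contains_chars {
    my ($text, $key) = @_;
    my $t = "\t" . delimit($text);
    my $k = "\t" . delimit($key);
    return index($t, $k) >= 0 ? 1 : 0;
}

my $teki   = "\xd1\xf1";           # Kanji to search for (2 bytes)
my $rakuda = "\xf1\xd1\xf1\xcc";   # assumed 4-byte word whose middle bytes are \xd1\xf1

my $false_hit = contains_chars($rakuda, $teki);           # straddles two characters
my $real_hit  = contains_chars("abc" . $teki, $teki);     # aligned occurrence

print "false_hit=$false_hit real_hit=$real_hit\n";
```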

Dan the Man with Too Many Character Sets to Fiddle
