On Wed, 8 Aug 2001, Dan Kogai wrote:
> on 01.8.8 1:14 AM, Benjamin Franz at [EMAIL PROTECTED] wrote:
> > On Tue, 7 Aug 2001, Ashutosh Salgarkar wrote:
> >
> > my $safe_key = quotemeta($key1);
> > $searchStr =~ m/$safe_key/;
> >
> > is probably what you want. I am presuming you are trying to use m// to
> > search for exact string matches rather than exploiting the full regex
> > facilities.
>
> No. quotemeta would not cut it. It depends on what character set is fed
> to the regex, but in most (virtually all) cases you convert strings to
> either EUC-JP or UTF-8. Neither EUC-JP nor UTF-8 uses regex metacharacters
> in the bytes that encode Japanese (or Korean or Chinese) characters. The
> problem is a bit deeper.
> The problem is that before perl 5.6.x, characters and bytes are
> interchangeable, and a Japanese character (Kanji, as follows) takes 2 bytes
> in EUC (and 3 bytes in UTF-8).
>
> For example,
>
> /\xd1\xf1/ and print; # I want to find a line that contains 'to bore'
>
> not only matches the desired character but also 'camel', which is
> represented by two Kanji (4 bytes).
>
> \xb4\xc1 \xbb\xfa
> -------- --------
>  <RAKU>    <DA>   = a camel
>     ---------
>      <TEKI>       = to bore
>
> There are ways to overcome this character boundary problem with EUC, like
> inserting a delimiter character (such as a bell or a tab) between each
> Kanji, but that's way too counter-intuitive, not to mention slow.
Oh, yeah. I forgot about that since I don't normally keep stuff in
JIS/SJIS/EUC-JP once I've acquired it. I always make my working store
UTF-8. In UTF-8 the 'frame' problem doesn't exist because character start
bytes _ALWAYS_ look like 0xxxxxxx or 11xxxxxx while continuation bytes
_ALWAYS_ look like 10xxxxxx, so a match for a whole character can never
begin in the middle of another one. 'quotemeta' works fine if you use
UTF-8 as your working encoding.
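
Here is a minimal sketch of the difference, reusing the four bytes from
the diagram quoted above. It assumes a perl that ships the Encode module
(5.8 or later, so not the 5.6.x under discussion); the variable names are
made up for illustration, and the needle is simply whatever EUC-JP
character the two middle bytes happen to encode.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two EUC-JP Kanji, 2 bytes each (the bytes from the quoted diagram).
my $euc_text = "\xb4\xc1\xbb\xfa";

# A different 2-byte EUC-JP character whose bytes equal the second byte
# of the first Kanji followed by the first byte of the second Kanji.
my $euc_needle = "\xc1\xbb";

# Byte-wise match on EUC-JP data: a false hit that straddles the boundary.
my $euc_hit = $euc_text =~ /\Q$euc_needle\E/ ? 1 : 0;

# The same text and needle, converted to UTF-8 bytes.
my $utf8_text   = encode('UTF-8', decode('euc-jp', $euc_text));
my $utf8_needle = encode('UTF-8', decode('euc-jp', $euc_needle));

# Byte-wise match on UTF-8 data: the needle starts with a lead byte, so it
# can only line up with the start of a character, and neither character
# in the text is the needle.
my $utf8_hit = $utf8_text =~ /\Q$utf8_needle\E/ ? 1 : 0;

print "EUC-JP byte match: $euc_hit\n";   # 1 (false positive)
print "UTF-8 byte match:  $utf8_hit\n";  # 0
```

The \Q...\E here is just quotemeta applied inline, so this is the same
approach as in my earlier message, only run against a UTF-8 working copy.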
--
Benjamin Franz
Programs must be written for people to read, and only
incidentally for machines to execute.
---Abelson and Sussman