On Thursday, Oct 3, 2002, at 11:29 Asia/Tokyo, Jarkko Hietaniemi wrote:
> On Wed, Oct 02, 2002 at 10:44:06PM +0900, Dan Kogai wrote:
>> On Wednesday, Oct 2, 2002, at 22:34 Asia/Tokyo, Jarkko Hietaniemi
>> wrote:
>> Both. I think the operation needed is straight-forward. When you get
>> tr[LHS][RHS], decode'em then
>> feed it to the naked tr// .
> Urk... That means a dip into the toke.c, how the tr/// ranges are
> implemented is... tricky. sv_recode_to_utf8() is needed somewhere...
> but I'm a little bit pressed for time right now. I suggest you
> perlbug this and move the process to perl5-porters. (Inaba Hiroto
> also might have insight on this; he's the tr///-with-Unicode sensei,
> really-- he practically implemented all of it. And he might read
> *[gk]ana much better than me :-)
So now this thread is in perl5-porters. Since this undocumented (lack
of) feature has a very easy workaround, I have yet to perlbug this.
=head1 PROBLEM
C<use encoding 'foo-encoding'> nicely converts string literals and
regexes into UTF-8, so you can get the power of perl 5.8.0 even when
your source code is in a text encoding other than UTF-8. But tr/// does
not embrace this magic.
=head1 WORKAROUND
Suppose your script is in EUC-JP and your source contains this:
  $kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/;
And you want perl to do the following:
  $kana =~ tr/\x{3041}-\x{3093}/\x{30a1}-\x{30f3}/;
All you have to do is:
  use encoding 'euc-jp';
  #
  eval qq{ \$kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/ };
=over
=item chars in this example
  utf8     euc-jp   charnames::viacode()
  -----------------------------------------
  \x{3041} \xA4\xA1 HIRAGANA LETTER SMALL A
  \x{3093} \xA4\xF3 HIRAGANA LETTER N
  \x{30a1} \xA5\xA1 KATAKANA LETTER SMALL A
  \x{30f3} \xA5\xF3 KATAKANA LETTER N
=back
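The byte-to-code-point correspondence in the table can be checked, and the same Hiragana-to-Katakana mapping performed, by decoding explicitly with Encode rather than relying on the pragma. This is only a minimal sketch of an alternative, not the workaround proposed in this report; the helper name hira2kata_eucjp is hypothetical and was made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# EUC-JP byte pairs for the four boundary characters listed above.
my @boundaries = ("\xA4\xA1", "\xA4\xF3", "\xA5\xA1", "\xA5\xF3");

# Print each byte pair next to the code point it decodes to.
for my $bytes (@boundaries) {
    my $char = decode('euc-jp', $bytes);
    printf "%s -> U+%04X\n", uc unpack('H*', $bytes), ord($char);
}

# Hypothetical helper: Hiragana -> Katakana on an EUC-JP byte string.
sub hira2kata_eucjp {
    my ($bytes) = @_;
    my $str = decode('euc-jp', $bytes);    # bytes -> characters
    $str =~ tr/\x{3041}-\x{3093}/\x{30a1}-\x{30f3}/;
    return encode('euc-jp', $str);         # characters -> bytes
}
```

Because the tr/// here is written with code points, it does not depend on the encoding of the source file; the cost is an explicit decode and encode around every string.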
=head1 DISCUSSION
I found this when I was writing a CGI book and wanted form
validation/correction. The example above converts all Hiragana to
Katakana, which is a common task in Japan. Traditionally this kind of
operation was done via jcode::tr() (C<require 'jcode.pl';>) or
Jcode::tr() (C<use Jcode;>). But as of perl 5.6.0 you can use Japanese
directly in regexes and tr/// -- so long as your script is in UTF-8.
With perl 5.8.0, the direct application of multibyte regexes was made
possible via the C<use encoding> pragma. The pragma applies its magic
as follows. Suppose you C<use encoding 'foo';>
=over
=item 0.
C<${^ENCODING}>, a special, non-scoped variable, is set to
C<Encode::find_encoding('foo')>. If 'foo' is an encoding supported by
Encode, C<${^ENCODING}> is now a transcoder object.
=item 1.
All string literals in q//, qq//, qw// and qr// (not sure of qx//) are
first fed to C<< ${^ENCODING}->decode() >>. So from perl's point of
view, they are the same as literals written in UTF-8.
=item 2.
C<binmode STDIN, ':encoding(foo)';> and C<binmode STDOUT,
':encoding(foo)';> are implicitly applied, so you can feed STDIN in
encoding 'foo' and get STDOUT in encoding 'foo'.
=back
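The three steps above can be approximated by hand with explicit Encode and binmode calls. This is a rough sketch of the equivalent operations, assuming EUC-JP stands in for 'foo'; it is not the pragma's actual implementation, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Step 0: look up a transcoder object, which is what ${^ENCODING} holds.
my $enc = Encode::find_encoding('euc-jp')
    or die "encoding not supported by Encode\n";

# Step 1: decode the bytes of a "literal" into perl's internal characters.
my $bytes  = "\xA4\xA2";          # HIRAGANA LETTER A in EUC-JP
my $nbytes = length $bytes;       # measured before decoding
my $chars  = $enc->decode($bytes);

# Step 2: have STDIN and STDOUT transcode from and to the same encoding.
binmode STDIN,  ':encoding(euc-jp)';
binmode STDOUT, ':encoding(euc-jp)';

printf "%d byte(s) decoded to %d character(s), U+%04X\n",
    $nbytes, length $chars, ord $chars;
```

Nothing in this sketch touches tr///, which matches the gap described in this report: the decoding in step 1 has to reach the tr/// operands some other way.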
Very clever and powerful. But 1. is not applied to tr///. C<qq{}> is
under the control of C<use encoding>, so C<eval qq{}> works as expected.
Though the workaround is simple, easy and clever, it still leaves an
inconsistency in how C<${^ENCODING}> gets used; it does indeed work on
non-interpolated literals already.
=head1 REPORTED BY
Dan the Encode Maintainer E<lt>[EMAIL PROTECTED]E<gt>