Re: tr/// and use encoding

2002-10-03 Thread Jarkko Hietaniemi

> your source code is other text encodings than UTF-8.  But tr/// does
> not embrace this magic.

Clarification: tr/A-E/P-T/ (the ranges) does not embrace that magic.
tr/ABCDE/PQRST/ does work with the encoding pragma since that employs
string literals.

-- 
Jarkko Hietaniemi [EMAIL PROTECTED] http://www.iki.fi/jhi/ There is this special
biologist word we use for 'stable'.  It is 'dead'. -- Jack Cohen



tr/// and use encoding

2002-10-03 Thread Dan Kogai

On Thursday, Oct 3, 2002, at 11:29 Asia/Tokyo, Jarkko Hietaniemi wrote:
> On Wed, Oct 02, 2002 at 10:44:06PM +0900, Dan Kogai wrote:
>> On Wednesday, Oct 2, 2002, at 22:34 Asia/Tokyo, Jarkko Hietaniemi wrote:
>> Both.  I think the operation needed is straight-forward.  When you get
>> tr[LHS][RHS], decode'em then feed it to the naked tr// .
>
> Urk...  That means a dip into the toke.c, how the tr/// ranges are
> implemented is... tricky.  sv_recode_to_utf8() is needed somewhere...
> but I'm a little bit pressed for time right now.  I suggest you
> perlbug this and move the process to perl5-porters.  (Inaba Hiroto
> also might have insight on this; he's the tr///-with-Unicode sensei,
> really-- he practically implemented all of it.  And he might read
> *[gk]ana much better than me :-)

So now this thread is on perl5-porters.  Since this undocumented (lack 
of) feature has a very easy workaround, I have yet to perlbug this.

=head1 PROBLEM

C<use encoding 'foo-encoding'> nicely converts string literals and 
regexes into UTF-8, so you can get the power of perl 5.8.0 even when 
your source code is in a text encoding other than UTF-8.  But tr/// 
does not embrace this magic.

=head1 WORKAROUND

Suppose your script is in EUC-JP and your source contains this:

   $kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/;
      

And you want perl to do the following:

   $kana =~ tr/\x{3041}-\x{3093}/\x{30a1}-\x{30f3}/

All you have to do is:

   use encoding 'euc-jp';
   # 
   eval qq{ \$kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/ };

=over

=item chars in this example

   utf8      euc-jp    charnames::viacode()
   ---------------------------------------------
   \x{3041}  \xA4\xA1  HIRAGANA LETTER SMALL A
   \x{3093}  \xA4\xF3  HIRAGANA LETTER N
   \x{30a1}  \xA5\xA1  KATAKANA LETTER SMALL A
   \x{30f3}  \xA5\xF3  KATAKANA LETTER N

=back
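For comparison, here is a minimal sketch of the same Hiragana-to-Katakana conversion done without the pragma: the EUC-JP bytes are decoded explicitly with Encode, and tr/// is given the Unicode ranges from the table above. The helper name hira2kata_eucjp is hypothetical; decode() and encode() are the ordinary Encode functions.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert Hiragana to Katakana in an EUC-JP byte string.
sub hira2kata_eucjp {
    my ($octets) = @_;
    my $chars = decode('euc-jp', $octets);    # EUC-JP bytes -> characters
    # Same ranges as the tr/// perl is meant to run in the example above.
    $chars =~ tr/\x{3041}-\x{3093}/\x{30a1}-\x{30f3}/;
    return encode('euc-jp', $chars);          # characters -> EUC-JP bytes
}

my $kana = "\xA4\xA2\xA4\xA4";                # HIRAGANA LETTER A, I in EUC-JP
my $out  = hira2kata_eucjp($kana);
print unpack('H*', $out), "\n";               # a5a2a5a4 (KATAKANA LETTER A, I)
```

This works on bytes read from a file or a form field; it does not help with ranges typed literally into the source, which is what the eval qq{} workaround addresses.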

=head1 DISCUSSION

I found this when I was writing a CGI book and wanted to do form 
validation/correction.  The example above converts all Hiragana to 
Katakana, which is a common task in Japan.  Traditionally this kind of 
operation was done via jcode::tr() (require jcode.pl;) or Jcode::tr() 
(use Jcode;).  But as of perl 5.6.0 you can use Japanese directly 
in regexes and tr/// -- so long as your script is in UTF-8.

With perl 5.8.0, the direct application of multibyte regexes was made 
possible via the C<use encoding> pragma.  The pragma applies its 
magic as follows.  Suppose you C<use encoding 'foo';>

=over

=item 0.

${^ENCODING}, a special, non-scoped variable, is set to 
C<Encode::find_encoding('foo')>.  If 'foo' is an encoding supported by 
Encode, ${^ENCODING} is now a transcoder object.

=item 1.

All string literals in q//, qq//, qw// and qr// (not sure of qx//) are 
first fed to ${^ENCODING}->decode().  So from perl's point of view, 
they are the same as literals written in UTF-8.

=item 2.

C<binmode STDIN, ":encoding(foo)";> and C<binmode STDOUT, 
":encoding(foo)";> are implicitly applied.  So you can feed STDIN in 
encoding 'foo' and get STDOUT in encoding 'foo'.

=back
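The three steps above can be sketched by hand with Encode and PerlIO layers. This is only an illustration under stated assumptions, not the pragma's actual implementation: 'euc-jp' stands in for 'foo', an ordinary lexical $enc stands in for ${^ENCODING}, and in-memory filehandles stand in for STDIN and STDOUT.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Step 0: look up the transcoder object (the role ${^ENCODING} plays).
my $enc = find_encoding('euc-jp')
    or die "euc-jp is not supported by this Encode";

# Step 1: a literal written as EUC-JP bytes is decoded into characters.
my $literal_bytes = "\xA4\xA1";               # HIRAGANA LETTER SMALL A
my $literal_chars = $enc->decode($literal_bytes);
printf "U+%04X\n", ord $literal_chars;        # U+3041

# Step 2: encoding layers on input and output.  In-memory handles are
# used here in place of STDIN and STDOUT so the sketch is self-contained.
my $input_bytes = "\xA4\xF3\n";               # HIRAGANA LETTER N in EUC-JP
open my $in, '<:encoding(euc-jp)', \$input_bytes or die "open: $!";
my $line = <$in>;                             # arrives as characters
chomp $line;
close $in;

my $output_bytes = '';
open my $out, '>:encoding(euc-jp)', \$output_bytes or die "open: $!";
print {$out} $line;                           # leaves as EUC-JP bytes
close $out;
print unpack('H*', $output_bytes), "\n";      # a4f3
```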

Very clever and powerful.  But step 1 is not applied to tr///.  qq{} 
is under the control of C<use encoding>, so C<eval qq{}> works as 
expected.
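The "decode them, then feed the result to the naked tr//" idea can also be spelled out by hand. The sketch below decodes each range endpoint with Encode, writes it as a \x{...} escape so the generated source is plain ASCII, and compiles the tr/// with a string eval. The helper name make_tr and its argument order are hypothetical, and this is not how toke.c handles ranges.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: build a tr/// from range endpoints given as bytes
# in a legacy encoding.  Returns a code ref that maps a character string.
sub make_tr {
    my ($encoding, @endpoints) = @_;   # from_lo, from_hi, to_lo, to_hi
    my $enc = find_encoding($encoding)
        or die "unknown encoding: $encoding";

    my @esc;
    for my $octets (@endpoints) {
        my $char = $enc->decode($octets);            # bytes -> one character
        push @esc, sprintf('\x{%04x}', ord $char);   # e.g. \x{3041}
    }

    # tr/// has no run-time interpolation, hence the string eval.
    my $src  = sprintf 'sub { (my $s = shift) =~ tr/%s-%s/%s-%s/; $s }', @esc;
    my $code = eval $src or die "cannot compile tr///: $@";
    return $code;
}

my $hira2kata = make_tr('euc-jp', "\xA4\xA1", "\xA4\xF3", "\xA5\xA1", "\xA5\xF3");
my $out = $hira2kata->("\x{3042}\x{3044}");          # HIRAGANA LETTER A, I
print join(' ', map { sprintf 'U+%04X', ord($_) } split //, $out), "\n";
# U+30A2 U+30A4
```

The returned code ref expects already-decoded character strings, which is what the pragma's I/O layers would hand to the script.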

Though the workaround is simple, easy and clever, it still leaves an 
inconsistency in how ${^ENCODING} gets used; it does indeed work on 
non-interpolated literals already.

=head1 REPORTED BY

Dan the Encode Maintainer E<lt>[EMAIL PROTECTED]E<gt>