[FYI] use encoding 'non-utf8-encoding'; use CGI;

2002-10-02 Thread Dan Kogai
I am currently writing yet another CGI book.  That is for the Japanese 
market and written in Japanese.  So it is inevitable that you have to 
face the labyrinth of character encoding.

Before perl 5.8.0, most books teaching how to handle Japanese in CGI 
went as follows:

* stick with EUC-JP; it does not poison perl the way Shift_JIS does.
* use jcode.pl or Jcode.pm when you have to convert encodings.
* use jcode::tr or Jcode->tr when you have to convert between 
Hiragana and Katakana.

Fine, so far.  But:

* totally forget regexes, unless you are happy with the very 
counter-intuitive measure illustrated in recipe 6.18 of the Cookbook.
* if you are desperate for Kanji regexes, use jperl instead.

That has now changed with 'use encoding'.  When it comes to CGI, 
though, 'use encoding' alone will not cut it; you also need CGI.pm to 
handle multipart/form-data.  Together they let you use regexes safely 
and intuitively, without converting your CGI script to UTF-8.

The 120-line script right after my signature illustrates that.  Sorry, 
it contains some Japanese (otherwise my point would get blurred).

As you see, tr/// is not subject to the magic of 'use encoding'.  jhi, 
have we made it so deliberately?  I am beginning to think tr/// is 
happier to embrace the power thereof.

Still, it can be overcome by a simple eval qq{} as illustrated.  This 
much idiom would not hurt much; at least not as much as the Cookbook 
sample.

Dan the Transcoded Man

#!/usr/local/bin/perl
#
# Save me in EUC-JP!

use 5.008;
use strict;
use CGI;
use CGI::Carp qw(fatalsToBrowser);
our $Method  = 'POST';
#our $Method  = 'GET';
our $Enctype = 'multipart/form-data';
#our $Enctype = 'application/x-www-form-urlencoded';
our $Charset = 'euc-jp';
use encoding 'euc-jp';

my $cgi = CGI->new();

my %Label =
 (
  name=> '名前',
  kana=> 'フリガナ',
  mailto  => '電子メール',
  mailto2 => '電子メール(確認)',
  tel => '電話',
  fax => 'ファックス',
  zip => '〒',
  address => '住所',
  comment => 'ご意見',
  );


unless ($cgi->param()){
 print_input($cgi);
}else{
 my $kana = $cgi->param('kana');
 $kana =~ s/[\s　]+//g; # beware: the class contains a zenkaku (full-width) space!
 eval qq{ \$kana =~ tr/ぁ-ん/ァ-ン/ };
 # $kana =~ tr/ぁ-ん/ァ-ン/; # will not work, but do you know why?
 $cgi->param(kana => $kana);
 print_output($cgi);
}

sub print_input{
 my $c = shift;
 print_html(
$c,
title =>"Form:入力",
name=> $c->textfield(-name => 'name'),
kana=> $c->textfield(-name => 'kana'),
mailto  => $c->textfield(-name => 'mailto'),
mailto2 => $c->textfield(-name => 'mailto2'),
tel => $c->textfield(-name => 'tel'),
fax => $c->textfield(-name => 'fax'),
zip => $c->textfield(-name => 'zip'),
address => $c->textfield(-name => 'address'),
comment => $c->textarea(-name => 'comment'),
);
}

sub print_output{
 my $c = shift;
 print_html(
$c,
title   => "Form:出力",
name=> $c->param('name'),
kana=> $c->param('kana'),
mailto  => $c->param('mailto'),
mailto2 => $c->param('mailto2'),
tel => $c->param('tel'),
fax => $c->param('fax'),
zip => $c->param('zip'),
address => $c->param('address'),
comment => $c->param('comment'),
);
};

sub print_html{
 my $c = shift;
 my %arg = @_;
 print
 $c->header(-charset   => $Charset),
 $c->start_html(-title => $arg{title}),
 $c->h1($arg{title});
 $c->param() or print
 $c->start_form(-method => $Method, -enctype => $Enctype);
 print
 $c->start_table({border => 1}),
 $c->Tr([
 $c->td([ $Label{name}=> $arg{name} ]),
 $c->td([ $Label{kana}=> $arg{kana} ]),
 $c->td([ $Label{mailto}  => $arg{mailto} ]),
 $c->td([ $Label{mailto2} => $arg{mailto2} ]),
 $c->td([ $Label{tel} => $arg{tel} ]),
 $c->td([ $Label{fax} => $arg{fax} ]),
 $c->td([ $Label{zip} => $arg{zip} ]),
 $c->td([ $Label{address} => $arg{address} ]),
 $c->td([ $Label{comment} => $arg{comment} ]),
 ]);
 if ($c->param()){
 print
 $c->td($c->a({href=>$ENV{SCRIPT_NAME}}, "Retry"));
 }else{
 print
 $c->td([$c->reset(), $c->submit()]),
 };
 print $c->end_form() unless $c->param();
 print
 $c->end_table(),
 $c->end_html();
}
__END__


Re: [FYI] use encoding 'non-utf8-encoding'; use CGI;

2002-10-02 Thread Jarkko Hietaniemi

> As you see, tr/// is not subject to the magic of 'use encoding'.  
> jhi, have we made it so deliberately?  I am beginning to think tr/// 

Not deliberately, no.  I agree that making tr/// to understand
'use encoding' would be good.

> is happier to embrace the power thereof.
> 
> Still, it can be overcome by simple eval qq{} as illustrated.  This 
> much idiom would not hurt much, at least not as much as the Cookbook 
> sample

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen



Re: [FYI] use encoding 'non-utf8-encoding'; use CGI;

2002-10-02 Thread Jarkko Hietaniemi

However, I will need to stare at your example some more, since
for simpler cases I think tr/// *is* obeying the 'use encoding':

use encoding 'greek';
($a = "\x{3af}bc\x{3af}de") =~ tr/\xdf/a/;
print $a, "\n";

This does print "abcade\n", and it also works when I replace the \xdf
with the literal character ί.
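The reason this works is that byte 0xDF in ISO-8859-7 ('greek') is exactly U+03AF, which can be checked directly (sketched in Python purely for byte-level convenience; the thread's code is Perl):

```python
# Byte 0xDF in ISO-8859-7 ("greek") decodes to U+03AF,
# GREEK SMALL LETTER IOTA WITH TONOS, so under "use encoding 'greek'"
# a literal \xdf in tr/// denotes the same character as \x{3af}.
ch = b"\xdf".decode("iso8859_7")
print(hex(ord(ch)))  # 0x3af

# Hence tr/\xdf/a/ on "\x{3af}bc\x{3af}de" behaves like this replacement:
print("\u03afbc\u03afde".replace("\u03af", "a"))  # abcade
```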

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen



Re: [FYI] use encoding 'non-utf8-encoding'; use CGI;

2002-10-02 Thread Jarkko Hietaniemi

(Not that I understand any Japanese but) could you resend your script
as an attachment?  I'm afraid it might get mangled otherwise.  In the
headers I see the following:

  Content-Type: text/plain; charset=ISO-2022-JP; format=flowed
  ...
  Content-Transfer-Encoding: 7bit

and when I save the message from mutt, I do not see any eight-bit
characters in the saved file...


-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen



Re: [FYI] use encoding 'non-utf8-encoding'; use CGI;

2002-10-02 Thread Jarkko Hietaniemi

(Hi, it's me again...)

Are you doing character ranges in the tr/// under 'use encoding'?
(I'm asking because I see a "-" in the middle of what I assume is
mangled EUC-JP)

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen



Re: [FYI] use encoding 'non-utf8-encoding'; use CGI;

2002-10-02 Thread Dan Kogai

On Wednesday, Oct 2, 2002, at 22:15 Asia/Tokyo, Jarkko Hietaniemi wrote:
> (Hi, it's me again...)
>
> Are you doing character ranges in the tr/// under 'use encoding'?
> (I'm asking because I see a "-" in the middle of what I assume is
> mangled EUC-JP)

Yes, that's where hiragana -> katakana conversion is attempted; the 
English equivalent of tr/A-Z/a-z/.

Dan




Re: [FYI] use encoding 'non-utf8-encoding'; use CGI;

2002-10-02 Thread Dan Kogai

On Wednesday, Oct 2, 2002, at 21:51 Asia/Tokyo, Jarkko Hietaniemi wrote:
> However, I will need to stare at your example some more, since
> for simpler cases I think tr/// *is* obeying the 'use encoding':
>
> use encoding 'greek';
> ($a = "\x{3af}bc\x{3af}de") =~ tr/\xdf/a/;
> print $a, "\n";
>
> This does print "abcade\n", and it also works when I replace the \xdf
> with the literal \xdf.

I can explain that.  "\x{3af}bc\x{3af}de" is a string literal, so it 
gets encoded.  However, my example in escaped form is:

   $kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/

which does not get encoded.  The intention was:

   $kana =~ tr/\x{3041}-\x{3093}/\x{30a1}-\x{30f3}/

That's why

   eval qq{ \$kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/ }

works: \xA4\xA1-\xA4\xF3 and \xA5\xA1-\xA5\xF3 are converted to 
\x{3041}-\x{3093} and \x{30a1}-\x{30f3}, respectively.
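The byte pairs and codepoints quoted above can be verified mechanically (a sketch in Python, used here only because it makes byte-level checks easy; the thread's code is Perl):

```python
# The endpoints of the two tr/// ranges: EUC-JP byte pairs on the left,
# the Unicode codepoints they decode to on the right.
endpoints = {
    b"\xa4\xa1": 0x3041,  # hiragana small A  (start of LHS range)
    b"\xa4\xf3": 0x3093,  # hiragana N        (end of LHS range)
    b"\xa5\xa1": 0x30A1,  # katakana small A  (start of RHS range)
    b"\xa5\xf3": 0x30F3,  # katakana N        (end of RHS range)
}
for raw, cp in endpoints.items():
    ch = raw.decode("euc_jp")
    assert ord(ch) == cp
    print(f"{raw.hex()} -> U+{cp:04X}")
```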

Dan




Re: [FYI] use encoding 'non-utf8-encoding'; use CGI;

2002-10-02 Thread Jarkko Hietaniemi

> >Are you doing character ranges in the tr/// under 'use encoding'?
> >(I'm asking because I see a "-" in the middle of what I assume is
> >mangled EUC-JP)
> 
> Yes. that's where hiragana -> katakana conversion is attempted;  
> English equivalent of tr/A-Z/a-z/.

Okay...  What are the {begin,end} codepoints of those ranges,
both LHS and RHS of tr, both in EUC-JP and in Unicode?

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen



Re[4]: Encode::compat 0.01 says "Unsupported conversion"

2002-10-02 Thread Robert Allerstorfer

Hi,

On Tue, 1 Oct 2002, 10:07 GMT+08 (04:07 local time) Autrijus Tang
wrote:

> On Wed, Sep 25, 2002 at 07:44:47PM +0200, Robert Allerstorfer wrote:
>>  my ($from, $to) = map { s/^utf8$/utf-8/i; lc($_) } ($_[1], $_[2]);
>> But this fails due to the attempt to change $_[1]. I fixed this by
>> replacing this line by
>> 
>> my ($from, $to) = @_[1, 2];
>> ($from, $to) = map {
>> s/^utf8$/utf-8/i;
>> s/^shiftjis$/shift_jis/i;
>> lc;
>> } ($from, $to);

> Thanks for your input.  I decided to bundle Alias.pm with 
> Encode::compat to fix this problem more cleanly.

Cool, but this produces encoding names that may not be supported by
the iconv that the Iconv module calls on some platforms.  For
example, if the Perl code is

Encode::from_to($text, "sjis", "utf8");

Encode::compat sends "sjis" to Encode::Alias, which comes back
with "shiftjis".  On Windows, this results in an "Unsupported
conversion" error, because Iconv.dll does not know "shiftjis" but
does know "sjis".  "sjis" has worked fine on the several platforms I
have tested so far.

Similarly, I have found some limitations on certain platforms:

s/^utf8$/utf-8/i unless $^O eq "hpux";
# at least Win32 requires UTF-8 to be called 'utf-8'
# On HP-UX, the encoding name for UTF-8 must be exactly 'utf8' (lowercase)
s/^utf-8$/UTF-8/i if $^O eq "solaris";
# On SunOS, the encoding name for UTF-8 must be exactly 'UTF-8' (UPPERcase)

s/^ucs-2$/ucs2/i if $^O eq "hpux";
# On HP-UX, the encoding name for UCS-2 must be exactly 'ucs2' (lowercase)
s/^ucs-2$/UCS-2/i if $^O eq "solaris";
# On SunOS, the encoding name for UCS-2 must be exactly 'UCS-2' (UPPERcase)

s/^shiftjis$/sjis/i;
s/^sjis$/SJIS/i if $^O eq "solaris";
# On SunOS, the encoding name for Shift_JIS must be exactly 'SJIS' (UPPERcase)
lc unless $^O eq "solaris";

I used this code with Encode::compat 0.02 and will now have to adapt
it to 0.04.  You have announced that you are planning to use
Unicode::MapUTF8 instead of Text::Iconv in a future version.  Will
this add more platform independence?
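For comparison, other codec registries face the same alias problem; Python's, for instance, keeps its own alias table so that "sjis", "shiftjis", and "utf8" all resolve to one canonical name (shown only as an illustration of the approach, not as a fix for iconv's platform-specific names):

```python
import codecs

# Python's codec registry resolves encoding-name aliases to a single
# canonical name, much as Encode::Alias does for Perl's Encode.
for alias in ("sjis", "shiftjis", "shift-jis", "utf8"):
    print(alias, "->", codecs.lookup(alias).name)
```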

best,
rob.




Re: [FYI] use encoding 'non-utf8-encoding'; use CGI;

2002-10-02 Thread Jarkko Hietaniemi

> I can explain that.  "\x{3af}bc\x{3af}de" is a string literal, so 
> it gets encoded.  However, my example in escaped form is:
> 
>   $kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/
> 
>   which does not get encoded.  the intention was;
> 
>   $kana =~ tr/\x{3041}-\x{3093}/\x{30a1}-\x{30f3}/
> 
>   That's why
> 
>   eval qq{ \$kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/ }
> 
> works: \xA4\xA1-\xA4\xF3 and \xA5\xA1-\xA5\xF3 are converted 
> to \x{3041}-\x{3093} and \x{30a1}-\x{30f3}, respectively.

I'm confused.  Firstly, the tr/\xA4... converts bytes thusly:

  A1 -> A1
  A2 -> A2
  A3 -> A3
  A4 -> A5
  A5 -> A5
  F3 -> A5

So why isn't it just tr/\xA4\xF3/\xA5/?

Secondly, aren't you expecting tr/// to magically recognize that when
the EUC-JP codes \xA4\xA1 to \xA4\xF3 are converted to their Unicode
counterparts they are supposed to spell out the Hiragana range?
The "range" concept of tr/// is very limited.  I think you want s///e.

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen



Re: [FYI] use encoding 'non-utf8-encoding'; use CGI;

2002-10-02 Thread Dan Kogai

On Wednesday, Oct 2, 2002, at 22:34 Asia/Tokyo, Jarkko Hietaniemi wrote:
>> Yes. that's where hiragana -> katakana conversion is attempted;
>> English equivalent of tr/A-Z/a-z/.
>
> Okay...  What are the {begin,end} codepoints of those ranges,
> both LHS and RHS of tr, both in EUC-JP and in Unicode?

Both.  I think the operation needed is straightforward: when you get 
tr[LHS][RHS], decode 'em, then feed them to the naked tr//.

Dan






Re: [FYI] use encoding 'non-utf8-encoding'; use CGI;

2002-10-02 Thread Jarkko Hietaniemi

On Wed, Oct 02, 2002 at 10:44:06PM +0900, Dan Kogai wrote:
> On Wednesday, Oct 2, 2002, at 22:34 Asia/Tokyo, Jarkko Hietaniemi wrote:
> >>Yes. that's where hiragana -> katakana conversion is attempted;
> >>English equivalent of tr/A-Z/a-z/.
> >
> >Okay...  What are the {begin,end} codepoints of those ranges,
> >both LHS and RHS of tr, both in EUC-JP and in Unicode?
> 
> Both.  I think the operation needed is straight-forward.  When you get 
> tr[LHS][RHS], decode'em then
> feed it to the naked tr// .

Urk...  That means a dip into the toke.c, how the tr/// ranges are
implemented is... tricky.  sv_recode_to_utf8() is needed somewhere...
but I'm a little bit pressed for time right now.  I suggest you
perlbug this and move the process to perl5-porters.  (Inaba Hiroto
also might have insight on this; he's the tr///-with-Unicode sensei,
really-- he practically implemented all of it.  And he might read
*[gk]ana much better than me :-)

> Dan
> 
> 

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen



Re: Parsing JIS X 0208 & Shift JIS with 5.8.0 +++++Success

2002-10-02 Thread Robin
I'm cross-posting this to the perl-unicode list because the pods say they might be interested in my dopey luser feedback (well, actually, not in those words :-).

The process by which I arrived at the solution might seem painful to some, but I'm listing it here in case anyone else is, or will be, facing the same problem, which is:

On Tuesday, October 1, 2002, at 11:50 PM, Robin wrote:
* parse a collection of ASCII docs mixed in with docs in iso-2022-jp, shiftjis, and possibly 7bit-jis (by which I mean each doc could be in one of the three encodings, not one doc a mixture of all three)
* parse for tokens (Kanji characters, i.e. neither Hiragana nor Katakana)
* do regex substitutions accordingly


On Wednesday, October 2, 2002, at 02:14 PM, Joel Rees wrote:

You probably want these:
toshi (year) is 0x472f (JIS) and 0x944e (shift).
tsuki (month) is 0x376e (JIS) and 0x8c8e (shift).
nichi (day) is 0x467c (JIS) and 0x93fa (shift).

Thanks, Joel (for all your input); that is exactly what I needed: the character codes of the kanji I'm testing for.

===from the perluniintro pod===

See "Further Resources" for how to find all these numeric codes.

===from the perluniintro pod===

The  "Further Resources" mentioned in the pod, lists the vast unicode website.  Once there I can't find kanji related documents, which is logical due to their Chinese origins but doesn't seem to faciliate my search as I have no idea what Chinese characters belong together as a group -  after wading through various pdfs the only listing I can find which features any (one actually) of the kanji I'm intending to use as a token is in U3200.pdf and has character codes 32C1 - 32CB (namely character =[NUMBER + tsuki] ). All I want (from this group) is the character code for tsuki . Of course it's on the site somewhere, but not where I expect it to be (ie in a section called Kanji, but that's my problem not the unicode consortium's).

Time for hubris and laziness: I know which kanji I want to test for, so why not get these codes programmatically using ord(), the way I would with ASCII:

===from the perluniintro pod===
At run-time you can use "chr()":

my $hebrew_alef = chr(0x05d0);

Naturally, "ord()" will do the reverse: it turns a character into a code
point.
===from the perluniintro pod===

use Encode::jp;


print ord('月'); #tsuki

which outputs: 140 (aka 0x8C)

OK, obviously ord() is assuming it's testing ASCII and returning the value for the first byte of a multi-byte character encoding; ergo, I'm missing something vital about how encoding works.
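The 140 is explained entirely by the bytes: in Shift_JIS, 月 is the pair 0x8C 0x8E, and an undecoded string hands ord() its first byte. A reproduction of both the wrong and the right reading (in Python, used here only for byte inspection; the Perl fix, decode() before ord(), appears in the message below):

```python
# "月" (tsuki) in Shift_JIS is the two-byte sequence 0x8C 0x8E.
raw = "月".encode("shift_jis")
print(raw.hex())  # 8c8e

# On an undecoded byte string, the first byte alone is 0x8C = 140,
# which is exactly the value Robin's ord() reported.
print(raw[0])  # 140

# Decoding first gives the real codepoint: U+6708 = 26376.
print(ord(raw.decode("shift_jis")))  # 26376
```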

A while back, when I was researching how to approach dealing with Japanese text, Sadahiro Tomoyuki (owner of http://homepage1.nifty.com/nomenclator/perl/indexE.htm) kindly wrote and told me how it was effectively done in the past by Japanese perlers (arigatou gozaimasu, thank you, Sadahiro-san):

(1) conversion of input in Shift_JIS to EUC_JP
(2) processing (in EUC_JP)
(3) conversion of the result (in EUC_JP) to Shift_JIS
(4) output

The same method is used by perl 5.8.0, except that Unicode (UTF-8) is used as the internal processing form instead of EUC_JP.

Dan Kogai wrote:

use Encode qw/encode decode/;
#...
my $utf8 = decode('shift-jis', $string);

use strict;
use diagnostics -verbose;
use Encode qw/encode decode/;

my $data = '月';
my $utf8 = decode('shift-jis', $data);
print ord($utf8);
print chr($utf8);   # wrong: chr() expects a number, not a character

yielded: 
26376
Argument "\x{6708}" isn't numeric in chr at /Users/robin/Desktop/test jp.pl line 13 (#1)

OK, so it was a partial success, but I grasp what I'm doing now.  Thanks to everyone who took the time to reply.