Re: Japanese chars and ARGV

2006-12-02 Thread Chris Wagner
Eh, I don't think this is right.  I'm mixing up the code point numbers with
the numeric value of the constituent bytes.  I'll keep looking though.

At 07:58 PM 12/1/2006 -0500, Chris Wagner wrote:
The S format turns your 6 bytes into 3 integers (machine native).  You can then
feed those 3 integers to the U format to get a Unicode string.  This is
merely a workaround; Perl by all rights should take Unicode directly from
STDIN.  Maybe declaring a Unicode pragma on STDIN might help.
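
A minimal sketch of why pairing bytes into integers does not yield code
points. The six byte values below are an assumption (what a CP932/Shift-JIS
console would likely pass for the three kana), not something taken from the
thread; the code points are 12486, 12473 and 12488 in decimal.

```perl
use strict;
use warnings;

# Assumption: the argument arrives as these six CP932 bytes for the three kana.
my $bytes = "\x83\x65\x83\x58\x83\x67";

# 'n*' reads big-endian 16-bit integers; 'S*' would use the machine's byte order.
my @paired = unpack 'n*', $bytes;

# Code points of the three characters (decimal 12486, 12473, 12488).
my @code_points = (12486, 12473, 12488);

printf "paired bytes: %s\n", join ' ', map { sprintf '0x%04X', $_ } @paired;
printf "code points:  %s\n", join ' ', map { sprintf '0x%04X', $_ } @code_points;
# paired bytes: 0x8365 0x8358 0x8367
# code points:  0x30C6 0x30B9 0x30C8
```

The paired integers are just the raw bytes glued together, so feeding them to
pack "U*" would produce three unrelated characters.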



--
REMEMBER THE WORLD TRADE CENTER ---= WTC 911 =--
...ne cede malis

0100

___
Perl-Win32-Users mailing list
Perl-Win32-Users@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs


Re: Japanese chars and ARGV

2006-12-02 Thread Eric Amick
On Sat, 02 Dec 2006 12:00:12 -0800, you wrote:

On a Japanese version of Windows, when you run a Perl script, the length()
function returns the wrong number of characters for anything you pass in as
$ARGV[0], and the split() function seems to behave the same way.

Using some of the samples shown in perluniintro, we do not get the same
results, so something is wrong.

Using ActivePerl 5.8.8 Build 819 on Win2003 Server, Japanese. No emulation,
all default Japanese installation.

Here is what we are doing:

perl script.pl テスト

(there are three characters in $ARGV[0], the Japanese word for 'test')

The perl script does this:

print length($ARGV[0]);  # prints 6

If one tries to use split(//, $ARGV[0]) there are 6 iterations.

Tried use encoding 'utf8', the -C6 flag, and a ton of other stuff.
Oddly, if one does 'print $ARGV[0]' the output is テスト.

Even used something from perluniintro:
$Unicode_string = pack("U*", unpack("W*", $ARGV[0]));
print $Unicode_string;          # prints テスト
print length($Unicode_string);  # prints 6

We need to capture each character in テスト (3 of them) and get the hex or
Unicode value for each character. Since Perl thinks the length is 6, we cannot
get correct hex/Unicode values using pack/unpack or anything else for that
matter.

I may be missing something, but wouldn't -CA or -C32 do what you want?
According to perlrun, it means the elements of @ARGV are strings
encoded in UTF-8.
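
One caveat: -CA only helps if the arguments really are UTF-8. Three kana in
UTF-8 would be nine bytes, so a length of 6 suggests the shell is passing a
double-byte encoding instead. A rough sketch, assuming the arguments arrive
in CP932 (the usual Japanese Windows code page); decode_args is a
hypothetical helper name, and the hard-coded bytes stand in for a real
command-line argument.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode every argument from an assumed encoding.
sub decode_args {
    my ($encoding, @args) = @_;
    return map { decode($encoding, $_) } @args;
}

# In a real script this would be: my @args = decode_args('cp932', @ARGV);
# The bytes below are an assumed CP932 spelling of the three kana.
my @args = decode_args('cp932', "\x83\x65\x83\x58\x83\x67");

print length($args[0]), "\n";   # 3
```

If the arguments turn out to be in some other encoding, only the name passed
to decode_args should need to change.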
-- 
Eric Amick
Columbia, MD


Re: Japanese chars and ARGV

2006-12-02 Thread Chris Wagner
Ok, I think I've got this figured out.  utf8::decode() does what you want.  The
bytes in @bytes are the constituent UTF-8 octets of tesuto in kana.  Using
utf8::decode() successfully turned the 9 bytes into 3 characters.  Let me know
if this gets you what you need.


@bytes = split /\|/, "e3|83|86|e3|82|b9|e3|83|88";
@bytes = map {chr hex $_} @bytes;
print scalar @bytes, "\n";
$text = join "", @bytes;
print length $text, "\n$text\n";
utf8::decode($text) or die;
print length $text, "\n$text\n";
^D
9
9
pâåpé¦pâê
Wide character in print at - line 7.
3
pâåpé¦pâê
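
The "Wide character" warning appears because the decoded text is printed to a
handle with no encoding layer. A small sketch of the same idea that sets an
output layer and then pulls out the hex value of each character, which is
what the original question asked for. The UTF-8 output layer is an
assumption; a Japanese Windows console would probably want a different one.

```perl
use strict;
use warnings;

# Same nine UTF-8 octets as above, joined into one byte string.
my $text = join '', map { chr hex $_ } split /\|/, 'e3|83|86|e3|82|b9|e3|83|88';

# Turn the octets into characters in place; returns false on invalid UTF-8.
utf8::decode($text) or die "not valid UTF-8";

# Assumed output encoding; avoids "Wide character in print".
binmode STDOUT, ':encoding(UTF-8)';

# One hex code point per character.
my @hex = map { sprintf '%04X', ord($_) } split //, $text;
print scalar(@hex), " characters: @hex\n";   # 3 characters: 30C6 30B9 30C8
```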



