Re: still working with utf8

2007-06-22 Thread Dr.Ruud
Tom Allison schreef:

 I have a string:
 =?iso-2022-jp?B?Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC?=
 That is a MIME::Base64 encoded string of iso-2022-jp characters.

 After I decode_base64 them and decode($text,'iso-2022-jp','utf8') them
 I can print out something that looks exactly like Japanese characters.

 But you can't match /(\w+)/ on them.  It's apparently one word
 without spaces in it.

http://www.patentstorm.us/patents/5337233-description.html
(look for JLE)

So maybe if you convert to EUC, then insert spaces as the text suggests,
then convert back to utf8, you might have a better string to work
with.

-- 
Affijn, Ruud

Gewoon is een tijger.


-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/




Re: still working with utf8

2007-06-22 Thread tom

 Yes, be prepared for the fact that not all foreign languages will
 support the concept of spaces between words. I don't know anything about
 Japanese, but I do vaguely remember from high school that, for Chinese
 texts, there are often no spaces between words and the reader's
 knowledge of the language allows him or her to infer the word separations.

So the Chinese might have a sentence like:

thequickbrownfoxjumpedoverthefence

and it's up to you, the reader, to figure out where the spaces are?

 However, even without knowing Japanese, we might be able to help you
 find acceptable solutions. What is your program supposed to do?



Well, for phonetic, character-based languages it's trying to do
something like:

while ($string =~ /(\w+)/g) {
  push @array, $1;
}

which would be a great start.
Similarly, I guess @array = split /\W+/, $string would be close.
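
For reference, here are the two idioms above written out as a small runnable sketch. The sample sentence is only an assumption for illustration, and the grep on the split result guards against the empty leading field that split returns when the string starts with a separator.

```perl
use strict;
use warnings;

# Illustrative input only; any space-separated text behaves the same way.
my $string = 'the quick brown fox jumped over the fence';

# Loop form: collect each run of word characters.
my @array;
while ($string =~ /(\w+)/g) {
    push @array, $1;
}

# The same thing as a single list-context match.
my @matched = $string =~ /(\w+)/g;

# split on runs of non-word characters, dropping any empty fields.
my @split = grep { length } split /\W+/, $string;

print scalar(@array), " words: @array\n";
```

All three arrays hold the same list for text like this. For text without spaces between words, each of them returns the whole run as one element.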





Re: still working with utf8

2007-06-22 Thread tom

On 6/22/2007, Tom Phoenix [EMAIL PROTECTED] wrote:

On 6/21/07, Tom Allison [EMAIL PROTECTED] wrote:

 I guess my question is, for CJK languages, should I expect the notion
 of using a regex like \w+ to pick up entire strings of text instead
 of discrete words like Latin-based languages?

Once you've enabled what the perlunicode manpage calls Character
Semantics, it says:

   Character classes in regular expressions match characters instead
   of bytes and match against the character properties specified in
   the Unicode properties database.  \w can be used to match a
   Japanese ideograph, for instance.

   http://perldoc.perl.org/perlunicode.html

Does that manpage get you any closer to a solution? Hope this helps!





I got a long way with this.

Given a base64 encoded string I can decode it using MIME::Base64.  But it
returns octets (though they all look the same).
Convert the octets to a string using decode() from Encode and you can
use regex on it just fine.
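
A rough sketch of that pipeline, using the header string from earlier in the thread. It assumes a single B-encoded word and pulls it apart with a hand-written regex, which is a simplification rather than a full RFC 2047 parser. Note that Encode's decode() takes the encoding name first and the octets second, and returns a Perl character string.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);
use Encode qw(decode);

my $header = '=?iso-2022-jp?B?Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC?=';

# Pull the charset, transfer encoding and payload out of the encoded word.
my ($charset, $enc, $payload) =
    $header =~ /^=\?([^?]+)\?([BbQq])\?([^?]*)\?=$/
    or die "not a MIME encoded-word\n";
die "only B (base64) is handled in this sketch\n" unless uc($enc) eq 'B';

my $octets = decode_base64($payload);    # raw bytes, still ISO-2022-JP
my $text   = decode($charset, $octets);  # Perl character string

# With character semantics, \w+ now works on characters, not bytes.
my @words = $text =~ /(\w+)/g;

binmode STDOUT, ':encoding(UTF-8)';
printf "%d characters, %d runs of word characters\n",
    length($text), scalar(@words);
print "$_\n" for @words;
```

The regex matches fine here. The Japanese part simply comes back as one long run, because there is no whitespace or punctuation inside it to break it up.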



But I was surprised to find that my first test case was a Japanese
string of some 8-10 characters with no whitespace.  I suppose it could
be a single word, but I didn't think the CJK languages had more than
2-4 characters (pictographs?) to a word.  But I have no real experience.
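
One crude way to break such a string up, offered only as a sketch: split it wherever the script changes between kanji, hiragana and katakana, using Perl's \p{Han}, \p{Hiragana} and \p{Katakana} properties. This is a heuristic and not real word segmentation, which would need a dictionary-based tokenizer. The sample sentence is a made-up assumption, not the string from the original header.

```perl
use strict;
use warnings;
use utf8;    # this source contains literal UTF-8 text

# Hypothetical sample: kanji, hiragana and katakana with no spaces.
my $text = '私は東京でカメラを買いました';

# Collect runs of a single script; \w+ is a fallback for other letters/digits.
my @runs = $text =~ /(\p{Han}+|\p{Hiragana}+|\p{Katakana}+|\w+)/g;

binmode STDOUT, ':encoding(UTF-8)';
print join(' | ', @runs), "\n";
print scalar(@runs), " runs\n";
```

The runs do not line up with dictionary words (a verb stem and its ending land in separate runs, for instance), but they may be a usable unit for something like token counting.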





RE: still working with utf8

2007-06-22 Thread Bob McConnell
 -----Original Message-----
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
 Sent: Friday, June 22, 2007 8:36 AM
 To: [EMAIL PROTECTED]; beginners@perl.org;
 Mumia W.; Beginners List
 Subject: Re: still working with utf8

 Yes, be prepared for the fact that not all foreign languages will
 support the concept of spaces between words. I don't know anything about
 Japanese, but I do vaguely remember from high school that, for Chinese
 texts, there are often no spaces between words and the reader's
 knowledge of the language allows him or her to infer the word separations.

 So the Chinese might have a sentence like:

 thequickbrownfoxjumpedoverthefence

 and it's up to you, the reader, to figure out where the spaces are?
 

It has been a while since I had to deal with Asian character sets, but
for Chinese and (I believe) Kanji (Japanese) each pictograph (character)
is a word, so no spaces are required. Katakana is the phonetic version
of Japanese, which may or may not have spaces between the words. I never
had to read them, only validate that the images in the service manuals
looked like what was being displayed or printed.

Bob McConnell





Re: still working with utf8

2007-06-21 Thread Tom Phoenix

On 6/21/07, Tom Allison [EMAIL PROTECTED] wrote:


I guess my question is, for CJK languages, should I expect the notion
of using a regex like \w+ to pick up entire strings of text instead
of discrete words like Latin-based languages?


Once you've enabled what the perlunicode manpage calls Character
Semantics, it says:

   Character classes in regular expressions match characters instead
   of bytes and match against the character properties specified in
   the Unicode properties database.  \w can be used to match a
   Japanese ideograph, for instance.

   http://perldoc.perl.org/perlunicode.html

Does that manpage get you any closer to a solution? Hope this helps!
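
A small sketch of what the quoted passage means in practice, assuming three CJK characters written as escapes so the file stays plain ASCII. The same text is tested once as a character string and once as UTF-8 octets.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+65E5 U+672C U+8A9E as a Perl character string.
my $chars  = "\x{65E5}\x{672C}\x{8A9E}";
# The same text as raw UTF-8 octets (three bytes per character).
my $octets = encode('UTF-8', $chars);

# On the character string, \w matches each ideograph.
my @per_char = $chars =~ /(\w)/g;
my $chars_all_word  = $chars  =~ /^\w+$/ ? 1 : 0;

# On the octets, the regex sees bytes, so the string is not one \w+ run.
my $octets_all_word = $octets =~ /^\w+$/ ? 1 : 0;

printf "characters: %d, octets: %d\n", length($chars), length($octets);
printf "\\w+ covers characters: %d, covers octets: %d\n",
    $chars_all_word, $octets_all_word;
```

So the decode step matters: until the octets have been turned into characters, \w is tested against individual bytes rather than ideographs.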

--Tom Phoenix
Stonehenge Perl Training





Re: still working with utf8

2007-06-21 Thread Mumia W.

On 06/21/2007 09:42 PM, Tom Allison wrote:
OK, I sorted out what the deal is with charsets, Encode, utf8 and other
goodies.

Now I have something I'm just not sure exactly how it is supposed to
operate.

I have a string:
=?iso-2022-jp?B?Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC?=
That is a MIME::Base64 encoded string of iso-2022-jp characters.

After I decode_base64 them and decode($text,'iso-2022-jp','utf8') them I
can print out something that looks exactly like Japanese characters.

But you can't match /(\w+)/ on them.  It's apparently one word without
spaces in it.
Um... I don't know Japanese.  But I guess this string of spaghetti (to
me) is actually a language where one character as represented in a
Unicode terminal is actually one 'word' according to the Perl definition
of a word...

In English, this would pick apart words in a sense that is simple for me
and many on this list to understand.

I guess my question is, for CJK languages, should I expect the notion of
using a regex like \w+ to pick up entire strings of text instead of
discrete words like Latin-based languages?




Sadly, I must admit that I'm operating way outside of my knowledge 
domain on this one, but I'll try to give an answer.


Yes, be prepared for the fact that not all foreign languages will 
support the concept of spaces between words. I don't know anything about 
Japanese, but I do vaguely remember from high school that, for Chinese 
texts, there are often no spaces between words and the reader's 
knowledge of the language allows him or her to infer the word separations.


However, even without knowing Japanese, we might be able to help you 
find acceptable solutions. What is your program supposed to do?


