Re: still working with utf8
Tom Allison wrote:

> I have a string:
> =?iso-2022-jp?B?Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC?=
>
> That is a MIME::Base64 encoded string of iso-2022-jp characters.
> After I decode_base64 them and decode($text,'iso-2022-jp','utf8') them
> I can print out something that looks exactly like Japanese characters.
>
> But you can't match /(\w+)/ on them. It's apparently one word
> without spaces in it.

http://www.patentstorm.us/patents/5337233-description.html
(look for JLE)

So maybe if you convert to EUC, then insert spaces as the text
suggests, then convert back to utf8, you might have a better string
to work with.

--
Affijn, Ruud

Gewoon is een tijger.

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/
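For reference, the decoding step described in the quoted message can be sketched roughly as follows. This is a minimal sketch, assuming the header consists of a single base64 ("B") encoded-word; the variable names are illustrative only. Note that Encode's decode() takes the encoding name first and the octets second, and returns a Perl character string rather than UTF-8 bytes.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);
use Encode qw(decode);

my $header = '=?iso-2022-jp?B?Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC?=';

# Split the encoded-word into its charset and its base64 payload.
my ($charset, $payload) = $header =~ /^=\?([^?]+)\?B\?([^?]+)\?=$/i
    or die "not a base64 encoded-word\n";

my $octets  = decode_base64($payload);    # raw bytes, still in iso-2022-jp
my $subject = decode($charset, $octets);  # Perl character string

# Encode to UTF-8 only on the way out.
binmode STDOUT, ':encoding(UTF-8)';
print "$subject\n";
printf "%d characters\n", length $subject;
```

Encode also ships a 'MIME-Header' encoding, so decode('MIME-Header', $header) may be a shorter route, depending on the installed Encode version.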
Re: still working with utf8
> Yes, be prepared for the fact that not all foreign languages will
> support the concept of spaces between words. I don't know anything
> about Japanese, but I do vaguely remember from high school that, for
> Chinese texts, there are often no spaces between words and the
> reader's knowledge of the language allows him or her to infer the
> word separations.

So the Chinese might have a sentence like:
thequickbrownfoxjumpedoverthefence
and it's up to you, the reader, to figure out where the spaces are?

> However, even without knowing Japanese, we might be able to help you
> find acceptable solutions. What is your program supposed to do?

Well, for phonetic, character-based languages, something like:

    while ($string =~ /(\w+)/g) { push @array, $1; }

would be a great start. Similarly, I guess

    @array = split /\W/, $string;

would be close.
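The two approaches mentioned above can be compared side by side. The following is a rough sketch, not the poster's actual program: the helper names words_by_match and words_by_split and both sample strings are made up for illustration, and the split pattern uses \W+ so that consecutive separators do not produce empty fields.

```perl
use strict;
use warnings;
use utf8;   # this source file contains literal UTF-8 text

# Collect every run of word characters.
sub words_by_match {
    my ($string) = @_;
    my @words;
    while ($string =~ /(\w+)/g) {
        push @words, $1;
    }
    return @words;
}

# Split on runs of non-word characters, dropping any empty fields.
sub words_by_split {
    my ($string) = @_;
    return grep { length } split /\W+/, $string;
}

my $english  = 'the quick brown fox';
my $japanese = '日本語のテキスト';   # hypothetical sample without spaces

my @en = words_by_match($english);
my @ja = words_by_match($japanese);

printf "English:  %d token(s)\n", scalar @en;
printf "Japanese: %d token(s)\n", scalar @ja;
```

With text that has no spaces, both helpers hand back the whole string as a single token, which is the behaviour discussed in this thread.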
Re: still working with utf8
On 6/22/2007, Tom Phoenix [EMAIL PROTECTED] wrote:

> On 6/21/07, Tom Allison [EMAIL PROTECTED] wrote:
>
> > I guess my question is, for CJK languages, should I expect the
> > notion of using a regex like \w+ to pick up entire strings of text
> > instead of discrete words like Latin-based languages?
>
> Once you've enabled what the perlunicode manpage calls Character
> Semantics, it says:
>
>     Character classes in regular expressions match characters instead
>     of bytes and match against the character properties specified in
>     the Unicode properties database. \w can be used to match a
>     Japanese ideograph, for instance.
>
>     http://perldoc.perl.org/perlunicode.html
>
> Does that manpage get you any closer to a solution? Hope this helps!

I got a long way with this.

Given a base64 encoded string, I can decode it using MIME::Base64.
But it returns octets (though they all look the same). Convert the
octets to a character string using decode() from Encode and you can
use regexes on it just fine.

But I was surprised to find that my first test case was a Japanese
string of some 8-10 characters with no whitespace. I suppose it could
be a single word, but I didn't think the CJK languages had more than
2-4 characters (pictographs?) to a word. But I have no real
experience.
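The difference between octets and characters mentioned above can be seen in a small experiment. This is only a sketch under stated assumptions: the three-character sample string is made up, and UTF-8 stands in for whatever encoding the real octets arrive in.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: U+65E5 U+672C U+8A9E, written with escapes so
# that this source file stays plain ASCII.
my $chars  = "\x{65E5}\x{672C}\x{8A9E}";
my $octets = encode('UTF-8', $chars);   # raw bytes, as a transport decoder would return them

my @byte_runs = $octets =~ /(\w+)/g;    # matched against individual bytes
my $decoded   = decode('UTF-8', $octets);
my @char_runs = $decoded =~ /(\w+)/g;   # matched against characters

printf "octets:     length %d, %d word run(s)\n", length $octets,  scalar @byte_runs;
printf "characters: length %d, %d word run(s)\n", length $decoded, scalar @char_runs;
```

Only the decoded string gets the character semantics that perlunicode describes; the undecoded octets are matched one byte at a time.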
RE: still working with utf8
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Friday, June 22, 2007 8:36 AM
To: [EMAIL PROTECTED]; beginners@perl.org; Mumia W.; Beginners List
Subject: Re: still working with utf8

> > Yes, be prepared for the fact that not all foreign languages will
> > support the concept of spaces between words. I don't know anything
> > about Japanese, but I do vaguely remember from high school that,
> > for Chinese texts, there are often no spaces between words and the
> > reader's knowledge of the language allows him or her to infer the
> > word separations.
>
> So the Chinese might have a sentence like:
> thequickbrownfoxjumpedoverthefence
> and it's up to you, the reader, to figure out where the spaces are?

It has been a while since I had to deal with Asian character sets,
but for Chinese and (I believe) Kanji (Japanese) each pictograph
(character) is a word, so no spaces are required. Katakana is the
phonetic version of Japanese, which may or may not have spaces between
the words. I never had to read them, only validate that the images in
the service manuals looked like what was being displayed or printed.

Bob McConnell
Re: still working with utf8
On 6/21/07, Tom Allison [EMAIL PROTECTED] wrote:

> I guess my question is, for CJK languages, should I expect the notion
> of using a regex like \w+ to pick up entire strings of text instead
> of discrete words like Latin-based languages?

Once you've enabled what the perlunicode manpage calls Character
Semantics, it says:

    Character classes in regular expressions match characters instead
    of bytes and match against the character properties specified in
    the Unicode properties database. \w can be used to match a
    Japanese ideograph, for instance.

    http://perldoc.perl.org/perlunicode.html

Does that manpage get you any closer to a solution? Hope this helps!

--Tom Phoenix
Stonehenge Perl Training
Re: still working with utf8
On 06/21/2007 09:42 PM, Tom Allison wrote:

> OK, I sorted out what the deal is with charsets, Encode, utf8 and
> other goodies.
>
> Now I have something I'm just not sure exactly how it is supposed to
> operate.
>
> I have a string:
> =?iso-2022-jp?B?Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC?=
>
> That is a MIME::Base64 encoded string of iso-2022-jp characters.
> After I decode_base64 them and decode($text,'iso-2022-jp','utf8') them
> I can print out something that looks exactly like Japanese characters.
>
> But you can't match /(\w+)/ on them. It's apparently one word
> without spaces in it.
>
> Um... I don't know Japanese. But I guess this string of spaghetti
> (to me) is actually a language where one character as represented in
> a unicode terminal is actually one 'word' according to the perl
> definition of a word...
>
> In English, this would pick apart words in a sense that is simple for
> me and many on this list to understand.
>
> I guess my question is, for CJK languages, should I expect the notion
> of using a regex like \w+ to pick up entire strings of text instead
> of discrete words like Latin-based languages?

Sadly, I must admit that I'm operating way outside of my knowledge
domain on this one, but I'll try to give an answer.

Yes, be prepared for the fact that not all foreign languages will
support the concept of spaces between words. I don't know anything
about Japanese, but I do vaguely remember from high school that, for
Chinese texts, there are often no spaces between words and the
reader's knowledge of the language allows him or her to infer the word
separations.

However, even without knowing Japanese, we might be able to help you
find acceptable solutions. What is your program supposed to do?