Re: still working with utf8

tom Fri, 22 Jun 2007 05:41:55 -0700

On 6/22/2007, "Tom Phoenix" <[EMAIL PROTECTED]> wrote:
>
>On 6/21/07, Tom Allison <[EMAIL PROTECTED]> wrote:
>
>> I guess my question is, for CJK languages, should I expect the notion
>> of using a regex like \w+ to pick up entire strings of text instead
>> of discrete words like latin based languages?
>
>Once you've enabled what the perlunicode manpage calls "Character
>Semantics", it says:
>
>    Character classes in regular expressions match characters instead
>    of bytes and match against the character properties specified in
>    the Unicode properties database.  "\w" can be used to match a
>    Japanese ideograph, for instance.
>
>    http://perldoc.perl.org/perlunicode.html
>
>Does that manpage get you any closer to a solution? Hope this helps!
>



I got a long way with this.



Given a base64-encoded string, I can decode it using MIME::Base64, but it
returns octets (though they all look the same).
Convert the octets to a character string using decode_utf8() from Encode
and you can use a regex on it just fine.
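
Roughly, the steps look like the sketch below. The sample string
(Japanese for "Japanese-language text") and the variable names are made
up for illustration; a real payload would come from the message being
parsed. The string is written with \x{...} escapes so the example does
not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical sample: eight Japanese characters, no whitespace.
my $original = "\x{65E5}\x{672C}\x{8A9E}\x{306E}\x{30C6}\x{30AD}\x{30B9}\x{30C8}";

# Build a base64 payload so the example is self-contained.
my $base64 = encode_base64(encode_utf8($original), '');

# Step 1: base64 -> octets. length() counts bytes here.
my $octets = decode_base64($base64);

# Step 2: octets -> character string. length() now counts characters.
my $text = decode_utf8($octets);

# Step 3: regexes now use character semantics, so \w matches ideographs.
my @runs = $text =~ /(\w+)/g;

printf "octets: %d, characters: %d, \\w+ runs: %d\n",
    length($octets), length($text), scalar @runs;
```

Each of these characters takes three bytes in UTF-8, so the octet count
is three times the character count, and \w+ returns the whole string as
one run.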



But I was surprised to find that my first test case was a Japanese
string of some 8-10 characters with no whitespace.  I suppose it could
be a single word, but I didn't think the CJK languages had more than
2-4 characters (pictographs?) to a word.  But I have no real experience
with them.
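
To illustrate the one-long-match behaviour: \w+ only stops at
non-word characters, and a run of Japanese text has none. A crude way
to get smaller pieces is to break where the script changes, using the
\p{Han}, \p{Hiragana} and \p{Katakana} properties. This is only a rough
heuristic on a made-up sample string, not real word segmentation.

```perl
use strict;
use warnings;

# Hypothetical sample: three kanji, one hiragana, four katakana,
# written with escapes so the source file encoding does not matter.
my $sample = "\x{65E5}\x{672C}\x{8A9E}\x{306E}\x{30C6}\x{30AD}\x{30B9}\x{30C8}";

# \w+ sees one unbroken run: every character is a letter and there
# is no whitespace or punctuation to break on.
my @words = $sample =~ /(\w+)/g;

# Assumption: splitting at script boundaries is a stand-in for word
# boundaries. It will be wrong for many real sentences.
my @chunks = $sample =~ /(\p{Han}+|\p{Hiragana}+|\p{Katakana}+)/g;

printf "%d run(s) from \\w+, %d chunk(s) by script\n",
    scalar @words, scalar @chunks;
```

On this sample, \w+ yields a single run while the script-based pattern
yields three chunks (kanji, hiragana, katakana).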

