On 6/22/2007, "Tom Phoenix" <[EMAIL PROTECTED]> wrote:
>On 6/21/07, Tom Allison <[EMAIL PROTECTED]> wrote:
>
>> I guess my question is, for CJK languages, should I expect the notion
>> of using a regex like \w+ to pick up entire strings of text instead
>> of discrete words like latin based languages?
>
>Once you've enabled what the perlunicode manpage calls "Character
>Semantics", it says:
>
>    Character classes in regular expressions match characters instead
>    of bytes and match against the character properties specified in
>    the Unicode properties database. "\w" can be used to match a
>    Japanese ideograph, for instance.
>
>    http://perldoc.perl.org/perlunicode.html
>
>Does that manpage get you any closer to a solution? Hope this helps!

I got a long way with this.

Given a base64 encoded string, I can decode it using MIME::Base64. But
that returns octets (though they all look the same). Convert the octets
to a character string using decode_utf8() from Encode and you can use a
regex on it just fine.

But I was surprised to find that my first test case was a Japanese
string of some 8-10 characters with no whitespace, and \w+ picked up
the whole thing. I suppose it could be a single word, but I didn't
think the CJK languages had more than 2-4 characters (pictographs?) to
a word. But I have no real experience with them.
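For anyone following along, here is a minimal sketch of the steps described above: base64 to octets, octets to characters, then a regex. The sample sentence is a made-up stand-in for the real test case, and the script-based split at the end is only an illustration added here, not something from the thread. It shows why \w+ swallows the whole string: every ideograph and kana character counts as a word character, and Japanese does not put whitespace between words. Splitting on script changes gives a rough breakdown, but it is not real word segmentation.

```perl
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 string literals
use MIME::Base64 qw(encode_base64 decode_base64);
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical sample: a 10-character Japanese sentence with no whitespace.
my $text    = "日本語のテキストです";
my $encoded = encode_base64(encode_utf8($text), '');

# decode_base64 returns octets; decode_utf8 turns them into characters.
my $octets = decode_base64($encoded);
my $chars  = decode_utf8($octets);

# With character semantics, \w matches ideographs and kana alike,
# so \w+ picks up the whole unspaced run as one "word".
my @words = $chars =~ /(\w+)/g;

# Assumption for illustration: split wherever the script changes.
# This is a crude heuristic, not dictionary-based segmentation.
my @runs = $chars =~ /(\p{Han}+|\p{Hiragana}+|\p{Katakana}+)/g;

binmode STDOUT, ':encoding(UTF-8)';
print scalar(@words), " match(es) from \\w+, ",
      length($words[0]), " characters long\n";
print join(' | ', @runs), "\n";
```

If the octets are matched without the decode_utf8 step, \w sees individual bytes rather than characters, so the results will not line up with the text at all.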