On Mon, Mar 26, 2012 at 12:57 PM, Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯 <da...@cpan.org> wrote:
> Let the regex engine help you advance the character counter.
>
> $ cat langs
> ΕλληνικάEnglish한국어日本語Русскийไทย
>
> ----
>
> $ cat langs.pl
> use 5.010;
> use strictures;
> use Unicode::UCD qw(charinfo);
>
> sub script {
>     return charinfo(ord substr($_[0], 0, 1))->{script}
> };
>
> # necessary because pos() magic is tracked on the scalar.
> my $copy = $_;
> while (/(\X)/g) {
>     my $script = script $1;
>     my ($part) = $copy =~ /(\p{$script}+)/;
>     say $part;
>     pos($_) = pos($_) + length($part);
> }

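Thanks a lot! Here is the first version of my tokenizer based on this idea:

    use Unicode::UCD qw(charinfo);
    use Lingua::ZH::MMSEG;

    sub tokenize {
        my $text = shift;
        my @tokens;
        # Each pass starts at the next extended grapheme cluster.
        while ( $text =~ /(\X)/g ) {
            my $part   = $1;
            my $script = charinfo( ord $1 )->{script};
            # Continue from the current pos() and consume the rest of
            # the run of characters in the same script (possibly empty).
            $text =~ /(\p{$script}*)/g;
            # Whitespace and punctuation are 'Common'; drop them.
            next if $script eq 'Common';
            $part .= $1;
            if ( $script eq 'Han' ) {
                # Han runs are segmented further into words.
                push @tokens, mmseg( $part );
            }
            else {
                push @tokens, $part;
            }
        }
        return @tokens;
    }

And the surprise: this works even without further splitting, because
spaces, dots and other punctuation all get the 'Common' script and so
are not matched by \p{Latin}.

-- 
Zbigniew Lukasiak
http://brudnopis.blogspot.com/
http://perlalchemy.blogspot.com/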