On Mon, Mar 26, 2012 at 12:57 PM, Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯 <da...@cpan.org> wrote:
> Let the regex engine help you advance the character counter.
>
>    $ cat langs
>    ΕλληνικάEnglish한국어日本語Русскийไทย
>
> ----
>
>    $ cat langs.pl
>    use 5.010;
>    use strictures;
>    use Unicode::UCD qw(charinfo);
>
>    sub script {
>        return charinfo(ord substr($_[0], 0, 1))->{script}
>    };
>
>    # necessary because pos() magic is tracked on the scalar.
>    my $copy = $_;
>    while (/(\X)/g) {
>        my $script = script $1;
>        my ($part) = $copy =~ /(\p{$script}+)/;
>        say $part;
>        pos($_) = pos($_) + length($part);
>    }

Thanks a lot!

Here is the first version of my tokenizer based on this idea:


use strict;
use warnings;
use Unicode::UCD qw(charinfo);
use Lingua::ZH::MMSEG;

sub tokenize {
    my $text = shift;
    my @tokens;
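    while ( $text =~ /(\X)/g ) {
        my $part = $1;
        # Classify the run by the script of the cluster's first code point.
        my $script = charinfo( ord $part )->{script};
        # /g in scalar context resumes at pos($text); the * lets this match
        # succeed (possibly empty) right there, so $1 is the rest of the run.
        $text=~ /(\p{$script}*)/g;
        # Whitespace, digits and punctuation have script 'Common': skip them.
        next if $script eq 'Common';
        $part .= $1;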
        if( $script eq 'Han' ){
            push @tokens, mmseg( $part );
        }
        else{
            push @tokens, $part;
        }
    }
    return @tokens;
}

And the surprise: this works without any further splitting, because
spaces and punctuation all get the 'Common' script and so are not
matched by \p{Latin}.
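
To illustrate the point about 'Common', here is a minimal sketch that
uses only core modules. The names script_of and script_runs are made up
for this example, and the Han/mmseg step is left out, so treat it as an
illustration of the splitting behaviour rather than the tokenizer itself:

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Hypothetical helper: script name of a single character, or undef
# when Unicode::UCD has no record for the code point.
sub script_of {
    my ($char) = @_;
    my $info = charinfo( ord $char );
    return $info ? $info->{script} : undef;
}

# Hypothetical helper: split text into runs of a single script,
# dropping every character whose script is 'Common'.
sub script_runs {
    my ($text) = @_;
    my @runs;
    while ( $text =~ /(\X)/g ) {
        my $first  = $1;
        my $script = script_of($first);
        next if !defined $script || $script eq 'Common';
        # \G anchors at pos($text), i.e. right after the cluster above;
        # the match may be empty, in which case the run is one cluster.
        $text =~ /\G(\p{$script}*)/gc;
        push @runs, $first . $1;
    }
    return @runs;
}

# Space, comma, full stop and hyphen are all 'Common', so they act
# as separators without any explicit split.
print join( '|', script_runs('Hello, world. Foo-bar') ), "\n";
# prints: Hello|world|Foo|bar
```

Since the separators are skipped one cluster at a time here, the result
should be the same as in the version above, where a whole 'Common' run
is consumed before the "next".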

-- 
Zbigniew Lukasiak
http://brudnopis.blogspot.com/
http://perlalchemy.blogspot.com/
