For people who want to process Chinese text (Traditional, but
also Simplified via Encode::HanConvert), I have just uploaded
Lingua::ZH::Toke to CPAN.

That module is a 'use utf8;'-friendly frontend to my Lingua::ZH::TaBE
module; it lets you manipulate linguistic objects as shown below
(the example source is Big5-encoded):

    use Lingua::ZH::Toke;       # add 'utf8' to use unicode strings

    # Create Lingua::ZH::Toke::Sentence object (->Sentence also works)
    my $token = Lingua::ZH::Toke->new( '那人卻在/燈火闌珊處/益發意興闌珊' );

    # Easy tokenization via array dereferencing
    print $token->[0]           # Fragment      - 那人卻在
                ->[2]           # Phrase        - 卻在
                ->[0]           # Character     - 卻
                ->[0]           # Pronunciation - ㄑㄩㄝˋ
                ->[2];          # Phonetic      - ㄝ

    # Magic histogram via hash dereferencing
    print $token->{'那人卻在'}; # 1 - One such fragment there
    print $token->{'意興闌珊'}; # 1 - One such phrase there
    print $token->{'發意興闌'}; # undef - That's not a phrase
    print $token->{'珊'};       # 2 - Two such characters there
    print $token->{'ㄧˋ'};      # 2 - Two such pronunciations: 益意
    print $token->{'ㄨ'};       # 3 - Three such phonetics: 那火處

    # Iteration over fragments
    while (my $fragment = <$token>) {
        # Iteration over phrases within the current fragment
        while (my $phrase = <$fragment>) {
            # ...
        }
    }

The 'phonetic' symbols are expressed in BoPoMoFo notation.
There are also various utility methods (complex segmentation, etc.);
see Lingua::ZH::TaBE for details.
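
For readers curious how one object can answer to ->[...], ->{...} and
<...> at once, here is a minimal sketch using Perl's core 'overload'
pragma. It is an illustration only and not the actual Lingua::ZH::Toke
code: the Demo::Node package, its '/', space and per-letter splitting
rules, and the ASCII sample string are all hypothetical stand-ins for
the real dictionary-based segmentation.

```perl
use strict;
use warnings;

# Hypothetical toy class; NOT how Lingua::ZH::Toke is implemented.
package Demo::Node;

# The object is a blessed closure, so the handlers below can reach the
# real data with $obj->() without re-triggering the overloaded derefs.
use overload
    '@{}'  => sub { $_[0]->()->{children} },    # $node->[N]
    '%{}'  => sub { $_[0]->()->{count} },       # $node->{'text'}
    '<>'   => sub {                             # <$node>
        my $data = $_[0]->();
        my $kids = $data->{children};
        if ($data->{pos} >= @$kids) {           # exhausted: rewind, stop
            $data->{pos} = 0;
            return undef;
        }
        return $kids->[ $data->{pos}++ ];
    },
    '""'   => sub { $_[0]->()->{text} },        # stringify to own text
    'bool' => sub { 1 },
    fallback => 1;

# One split rule per level: fragments, then words, then letters.
my @SPLIT = (qr{/}, qr{ }, qr{});

sub new {
    my ($class, $text, $level) = @_;
    $level ||= 0;
    my %data = (text => $text, children => [], count => {}, pos => 0);
    if ($level < @SPLIT) {
        for my $part (split $SPLIT[$level], $text) {
            my $child = $class->new($part, $level + 1);
            push @{ $data{children} }, $child;
            $data{count}{$part}++;
            # Fold the child's histogram into this node's histogram.
            my $sub = $child->()->{count};
            $data{count}{$_} += $sub->{$_} for keys %$sub;
        }
    }
    return bless sub { \%data }, $class;
}

package main;

my $s = Demo::Node->new('ab cd/cd ef');

print $s->[0][1][0], "\n";      # fragment "ab cd" -> word "cd" -> "c"
print $s->{'cd'}, "\n";         # 2: the word "cd" occurs twice

while (my $frag = <$s>) {       # iterate over fragments
    while (my $word = <$frag>) {    # iterate over words in the fragment
        print "[$word]";
    }
    print "\n";
}
```

The closure-based object is a deliberate choice in this sketch: with a
plain hash-based object, every internal $self->{...} access would go
through the overloaded '%{}' handler and recurse.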

Comments welcome. :-)

Thanks,
/Autrijus/
