For people who wish to process Chinese text (Traditional, and also Simplified via Encode::HanConvert), I have just uploaded Lingua::ZH::Toke to CPAN.
This module is a 'use utf8;'-friendly frontend to my Lingua::ZH::TaBE
module; it lets you manipulate linguistic objects as shown below
(in Big5):
use Lingua::ZH::Toke; # add 'utf8' to use unicode strings
# Create Lingua::ZH::Toke::Sentence object (->Sentence also works)
my $token = Lingua::ZH::Toke->new( '那人卻在/燈火闌珊處/益發意興闌珊' );
# Easy tokenization via array dereferencing
print $token->[0] # Fragment - 那人卻在
    ->[2] # Phrase - 卻在
    ->[0] # Character - 卻
    ->[0] # Pronunciation - ㄑㄩㄝˋ
    ->[2]; # Phonetic - ㄝ
# Magic histogram via hash dereferencing
print $token->{'那人卻在'}; # 1 - One such fragment there
print $token->{'意興闌珊'}; # 1 - One such phrase there
print $token->{'發意興闌'}; # undef - That's not a phrase
print $token->{'珊'}; # 2 - Two such characters there
print $token->{'ㄧˋ'}; # 2 - Two such pronunciations: 益意
print $token->{'ㄨ'}; # 3 - Three such phonetics: 那火處
# Iteration over fragments
while (my $fragment = <$token>) {
    # Iteration over phrases within the current fragment
    while (my $phrase = <$fragment>) {
        # ...
    }
}
The 'phonetic' symbols are expressed in BoPoMoFo notation.
There are also various utility methods (complex segmentation, etc.);
see Lingua::ZH::TaBE for details.
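
For readers curious how a single object can answer to ->[...], ->{...}
and <...> at once, here is a minimal sketch using Perl's core overload
pragma. It is not taken from Lingua::ZH::Toke and makes no claim about
how the module is implemented: Sketch::Node and sketch_sentence are
hypothetical names, and a plain split on '/' and on characters stands in
for the real segmentation done through libtabe.

```perl
use strict;
use warnings;

package Sketch::Node;   # hypothetical class, not part of Lingua::ZH::Toke

# The object is a blessed scalar ref holding a plain hash ref, so the
# handlers below can reach the state via ${...} without recursing.
use overload
    '@{}' => sub { ${ $_[0] }->{parts} },    # $node->[1]: child by position
    '%{}' => sub { ${ $_[0] }->{count} },    # $node->{ab}: occurrence count
    '<>'  => sub {                           # <$node>: next child, then undef
        my $state = ${ $_[0] };
        my $next  = $state->{parts}[ $state->{pos}++ ];
        $state->{pos} = 0 unless defined $next;   # rewind for the next loop
        return $next;
    },
    '""'  => sub { ${ $_[0] }->{text} },     # stringify to the original text
    fallback => 1;

sub new {
    my ($class, $text, @parts) = @_;
    my %count;
    $count{"$_"}++ for @parts;               # histogram of direct children
    my $state = { text => $text, parts => [@parts], count => \%count, pos => 0 };
    return bless \$state, $class;
}

package main;

# Placeholder segmentation: '/' separates fragments, and every character
# is its own unit. The real module asks libtabe for phrase boundaries.
sub sketch_sentence {
    my ($text) = @_;
    my @fragments;
    push @fragments, Sketch::Node->new($_, split //, $_) for split m{/}, $text;
    my $sentence = Sketch::Node->new($text, @fragments);
    # Let the top-level histogram count characters as well as fragments.
    ${$sentence}->{count}{$_}++ for map { @$_ } @fragments;
    return $sentence;
}

my $sentence = sketch_sentence('abca/bcd');
print $sentence->[0], "\n";       # first fragment: abca
print $sentence->[0][3], "\n";    # fourth character of that fragment: a
print $sentence->{c}, "\n";       # 'c' occurs twice overall: 2

while (my $fragment = <$sentence>) {
    while (my $char = <$fragment>) {
        print "$fragment: $char\n";
    }
}
```

The sketch uses ASCII input only to stay independent of source encoding;
the same structure applies to Chinese strings once 'use utf8;' is in
effect.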
Comments welcome. :-)
Thanks,
/Autrijus/
