[FYI] Lingua::ZH::Toke (and ::TaBE)

2003-01-19 Thread Autrijus Tang
For people who wish to process Chinese text (Traditional, and also
Simplified via Encode::HanConvert), I have just uploaded
Lingua::ZH::Toke to CPAN.

That module is a 'use utf8;'-friendly frontend to my Lingua::ZH::TaBE
module; it lets you manipulate linguistic objects as shown below
(the example was written in Big5 encoding):

use Lingua::ZH::Toke;   # add 'utf8' to use Unicode strings

# Create Lingua::ZH::Toke::Sentence object (->Sentence also works)
my $token = Lingua::ZH::Toke->new( '那人卻在/燈火闌珊處/益發意興闌珊' );

# Easy tokenization via array dereferencing
print $token->[0]   # Fragment      - 那人卻在
            ->[2]   # Phrase        - 卻在
            ->[0]   # Character     - 卻
            ->[0]   # Pronunciation - ㄑㄩㄝˋ
            ->[2];  # Phonetic      - ㄝ

# Magic histogram via hash dereferencing
print $token->{'那人卻在'}; # 1 - One such fragment there
print $token->{'意興闌珊'}; # 1 - One such phrase there
print $token->{'發意興闌'}; # undef - That's not a phrase
print $token->{'珊'};       # 2 - Two such characters there
print $token->{'ㄧˋ'};      # 2 - Two such pronunciations: 益意
print $token->{'ㄨ'};       # 3 - Three such phonetics: 那火處

# Iteration over fragments
while (my $fragment = <$token>) {
    # Iteration over phrases within the current fragment
    while (my $phrase = <$fragment>) {
        # ...
    }
}

The 'phonetic' symbols are expressed in BoPoMoFo notation.
There are also various utility methods (complex segmentation, etc.);
see Lingua::ZH::TaBE for details.

Comments welcome. :-)

Thanks,
/Autrijus/



Re: CGI and UTF

2003-01-19 Thread Benjamin Franz
On Sat, 18 Jan 2003, Jarkko Hietaniemi wrote:

> Now Perl-5.8.1-to-be has been changed to
> 
> (1) not to do any implicit UTF-8-ification of any filehandles unless
> explicitly asked to do so (either by the -C command line switch
> or by setting the env var PERL_UTF8_LOCALE to a true value, the switch
> wins if both are present) (and if the locale settings do not indicate
> a UTF-8 locale, both are silent no-ops)
> 
> (2) illegal UTF-8 causing a -w(arning) immediately when read in e.g. by <>
> (an immediate croak is a possibility, but a warning is how it now works,
> and a croak would be, err, even more non-traditional for UNIX...)
> 
> Note that the above do not change the fact that if a *programmer* wants
> their code to be UTF-8 aware, they need to think about the evil binmode().
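
The last point in the quoted text, that code wanting to be UTF-8 aware
must ask for it explicitly, can be illustrated with a minimal sketch.
It uses only core Perl (open() layers and binmode()); the sample bytes
and the in-memory filehandle are assumptions chosen to keep the example
self-contained, not anything taken from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data: the three UTF-8 bytes that encode U+4EBA.
my $bytes = "\xE4\xBA\xBA";

# Write the raw bytes to an in-memory "file" so that no real path is needed.
my $buffer = '';
open(my $out, '>:raw', \$buffer) or die "open for write: $!";
print {$out} $bytes, "\n";
close($out);

# Read them back through an explicit :encoding(UTF-8) layer rather than
# relying on any implicit, locale-driven default.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open for read: $!";
my $line = <$in>;
close($in);
chomp $line;

# With the layer in place, the three bytes arrive as a single character.
printf "characters: %d\n", length $line;

# binmode() applies a layer to a handle that is already open, e.g. STDOUT.
binmode(STDOUT, ':encoding(UTF-8)');
print "$line\n";
```

Opening the same buffer with '<:raw' instead would hand back three
separate bytes, which is the behaviour a program gets when nothing asks
for UTF-8.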

Wonderful. :) This will definitely simplify the day I have to migrate our
existing codebase to 5.8.

Thank you.

-- 
Benjamin Franz

"If the code and the comments disagree, then both are probably wrong."
-- Norm Schryer, Bell Labs