[FYI] Lingua::ZH::Toke (and ::TaBE)
For people who wish to process Chinese text (Traditional, but also Simplified via Encode::HanConvert), I have just uploaded Lingua::ZH::Toke to CPAN.

That module is a 'use utf8;'-friendly frontend to my Lingua::ZH::TaBE module; it allows you to manipulate linguistic objects like below (in big5):

    use Lingua::ZH::Toke;   # add 'utf8' to use unicode strings

    # Create Lingua::ZH::Toke::Sentence object (->Sentence also works)
    my $token = Lingua::ZH::Toke->new( '那人卻在/燈火闌珊處/益發意興闌珊' );

    # Easy tokenization via array dereferencing
    print $token->[0]   # Fragment      - 那人卻在
                ->[2]   # Phrase        - 卻在
                ->[0]   # Character     - 卻
                ->[0]   # Pronunciation - ㄑㄩㄝˋ
                ->[2];  # Phonetic      - ㄝ

    # Magic histogram via hash dereferencing
    print $token->{'那人卻在'};  # 1 - One such fragment there
    print $token->{'意興闌珊'};  # 1 - One such phrase there
    print $token->{'發意興闌'};  # undef - That's not a phrase
    print $token->{'珊'};        # 2 - Two such characters there
    print $token->{'ㄧˋ'};       # 2 - Two such pronunciations: 益意
    print $token->{'ㄨ'};        # 3 - Three such phonetics: 那火處

    # Iteration over fragments
    while (my $fragment = <$token>) {
        # Iteration over phrases
        while (my $phrase = <$fragment>) {
            # ...
        }
    }

The 'phonetic' symbols are expressed in BoPoMoFo notation.

There are also various utility methods (complex segmentation, etc.); see Lingua::ZH::TaBE for details.

Comments welcome. :-)

Thanks,
/Autrijus/
Re: CGI and UTF
On Sat, 18 Jan 2003, Jarkko Hietaniemi wrote:

> Now Perl-5.8.1-to-be has been changed to
>
> (1) not to do any implicit UTF-8-ification of any filehandles unless
>     explicitly asked to do so (either by the -C command line switch
>     or by setting the env var PERL_UTF8_LOCALE to a true value, the switch
>     wins if both are present) (and if the locale settings do not indicate
>     a UTF-8 locale, both are silent no-ops)
>
> (2) illegal UTF-8 causing a -w(arning) immediately when read in e.g. by <>
>     (an immediate croak is a possibility, but a warning is how it now works,
>     and a croak would be, err, even more non-traditional for UNIX...)
>
> Note that the above do not change the fact that if a *programmer* wants
> their code to be UTF-8 aware, they need to think about the evil binmode().

Wonderful. :) This will definitely simplify the day I have to migrate our existing codebase to 5.8. Thank you.

--
Benjamin Franz

"If the code and the comments disagree, then both are probably wrong."
  -- Norm Schryer, Bell Labs
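The point about binmode() in the quoted message can be illustrated with a minimal sketch, assuming Perl 5.8 or later with PerlIO layers. The variable names, the sample string, and the in-memory filehandle are illustrative choices, not something taken from the thread.

```perl
use strict;
use warnings;

# Two Chinese characters as a Perl character string (illustrative sample data).
my $text = "\x{90A3}\x{4EBA}";

# Write through an explicitly requested UTF-8 layer into an in-memory buffer.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open for write: $!";
print {$out} $text;
close($out) or die "close: $!";

# The buffer now holds encoded bytes, not characters.
my $byte_count = length($buffer);

# Read the bytes back, again asking for decoding explicitly.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open for read: $!";
my $decoded = <$in>;
close($in);

# STDOUT needs its own layer as well; without it Perl warns about wide characters.
binmode(STDOUT, ':encoding(UTF-8)');
print "bytes: $byte_count, characters: ", length($decoded), "\n";
```

Nothing here relies on the locale or on the -C switch: each handle gets its layer explicitly, either in open() or through binmode(), which is the programmer-side step the quoted note refers to.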