> It seems a bit odd to convert UTF-8 into EUC and back like this. The
> cost of transcoding is admittedly small compared to the cost of using
> Perl's UTF-8 regex support for the tests, but I would suggest you
> evaluate tokenizers that can work directly in UTF-8. I believe MeCab is
> one such tokenizer.
I tried MeCab today. It works fine. I changed the code from:
use Text::Kakasi;
use Encode;

# Kakasi expects EUC-JP input, so transcode the UTF-8 text first
my $res = Encode::encode("euc-jp", Encode::decode("utf8", $text));
my $rc  = Text::Kakasi::getopt_argv('kakasi', '-ieuc', '-w');
my $str = Text::Kakasi::do_kakasi($res);
# ... then convert the segmented result back to UTF-8
$utf8 = Encode::decode("euc-jp", $str);
to
use MeCab;

# -Owakati asks for space-separated tokens; 'dummy' just fills the argv[0] slot
my @arg = ('dummy', '-Owakati');
my $mecab = new MeCab::Tagger (\@arg);
$utf8 = $mecab->parse ($text);
I compiled MeCab with the --with-charset=utf8 option, so no charset
conversion is necessary. There is no big difference in processing time,
though MeCab is slightly faster than Kakasi.
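For reference, here is a minimal standalone sketch of the MeCab path as I
understand it. The sample sentence and the $wakati variable are only for
illustration, and it assumes the Perl binding was built against a UTF-8
dictionary as described above:

#!/usr/bin/perl
use strict;
use warnings;
use MeCab;

# Without "use utf8" the literal below is passed to MeCab as raw UTF-8 bytes,
# which is what a --with-charset=utf8 build expects.
my $text = 'MeCabで分かち書きのテストです。';

my @arg   = ('dummy', '-Owakati');      # wakati = space-separated tokens
my $mecab = new MeCab::Tagger (\@arg);

my $wakati = $mecab->parse($text);      # roughly "MeCab で 分かち書き の テスト です 。"
print $wakati;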
The results are nearly identical but slightly different for Japanese;
MeCab is more sophisticated than Kakasi. However, MeCab is clever enough
to also split English words such as "EUC_JP" into "EUC _ JP" and
"http://www.yahoo.com/" into "http :// www . yahoo . com /". Because URLs
and e-mail addresses are important signatures, this may be problematic.
I will ask the developer whether we can avoid splitting URLs etc.
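Until that is resolved, one possible workaround would be to pull URLs and
e-mail addresses out of the text before handing it to MeCab and keep them
verbatim. The sketch below is only an illustration of that idea; the
regular expression and the tokenize_keeping_urls helper are my own and not
part of any patch:

use strict;
use warnings;
use MeCab;

my @arg   = ('dummy', '-Owakati');
my $mecab = new MeCab::Tagger (\@arg);

# Rough patterns only; real URL and address syntax is more involved.
my $url_re = qr{https?://\S+|[\w.+-]+\@[\w.-]+\.\w+};

# Hand only the ordinary text to MeCab and keep URLs / e-mail addresses
# untouched, so they survive as single signature tokens.
sub tokenize_keeping_urls {
    my ($text) = @_;
    my @out;
    foreach my $chunk (split /($url_re)/, $text) {
        next if $chunk eq '';
        if ($chunk =~ /^$url_re$/) {
            push @out, $chunk;                   # keep the URL/address whole
        } else {
            my $wakati = $mecab->parse($chunk);  # wakati-gaki the rest
            $wakati =~ s/\s+$//;
            push @out, $wakati if $wakati ne '';
        }
    }
    return join(' ', @out);
}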
--
Motoharu Kubo
[EMAIL PROTECTED]