It seems a bit odd to convert UTF-8 into EUC and back like this. The cost of transcoding is admittedly small compared to the cost of using Perl's UTF-8 regex support for the tests, but I would suggest you evaluate tokenizers that can work directly in UTF-8. I believe MeCab is one such tokenizer.

I tried MeCab today and it works fine.  I changed the code from:

  use Text::Kakasi;
  use Encode;

  # Kakasi works on EUC-JP, so convert from UTF-8 first and back afterwards.
  my $res = Encode::encode("euc-jp", Encode::decode("utf8", $text));
  my $rc  = Text::Kakasi::getopt_argv('kakasi', '-ieuc', '-w');
  my $str = Text::Kakasi::do_kakasi($res);
  $utf8   = Encode::decode("euc-jp", $str);

to

  use MeCab;

  # MeCab built with --with-charset=utf8 tokenizes the UTF-8 text directly.
  my @arg   = ('dummy', '-Owakati');
  my $mecab = new MeCab::Tagger (\@arg);
  $utf8     = $mecab->parse($text);

I compiled MeCab with the --with-charset=utf8 option, so no charset conversion is necessary. There is no big difference in processing time, but MeCab is slightly faster than Kakasi.
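
A rough way to reproduce the timing comparison is the standard Benchmark
module. This is only a sketch; it assumes $text already holds a UTF-8 sample
message and that both modules are set up exactly as above:

  use Benchmark qw(cmpthese);
  use Encode;
  use Text::Kakasi;
  use MeCab;

  # one-time setup for both tokenizers
  Text::Kakasi::getopt_argv('kakasi', '-ieuc', '-w');
  my @arg   = ('dummy', '-Owakati');
  my $mecab = new MeCab::Tagger (\@arg);

  # run each tokenizer for about 3 CPU seconds and compare rates
  cmpthese(-3, {
      kakasi => sub {
          my $euc = Encode::encode("euc-jp", Encode::decode("utf8", $text));
          Encode::decode("euc-jp", Text::Kakasi::do_kakasi($euc));
      },
      mecab  => sub { $mecab->parse($text) },
  });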

The results are almost equal for Japanese, with only slight differences; MeCab is more sophisticated than Kakasi. However, MeCab also splits English tokens: "EUC_JP" becomes "EUC _ JP", and "http://www.yahoo.com/" becomes "http :// www . yahoo . com /". Because URLs and e-mail addresses are important signatures, this may be problematic.

I will ask the MeCab developer whether we can avoid this splitting for URLs, etc.
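
In the meantime, one possible workaround (only a sketch, with deliberately
crude regexes for illustration) is to cut URLs and mail addresses out of the
text before handing the rest to MeCab, so those signatures survive intact:

  use MeCab;

  my @arg   = ('dummy', '-Owakati');
  my $mecab = new MeCab::Tagger (\@arg);

  sub tokenize {
      my ($text) = @_;
      my @out;
      # split with a capturing group keeps the URL/address pieces in the list
      for my $piece (split /(https?:\/\/\S+|[\w.+-]+\@[\w.-]+)/, $text) {
          next unless length $piece;
          if ($piece =~ m{^https?://} or $piece =~ /\@/) {
              push @out, $piece;                      # keep as one token
          } else {
              push @out, split ' ', $mecab->parse($piece);
          }
      }
      return @out;
  }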

--
Motoharu Kubo
[EMAIL PROTECTED]
