It seems a bit odd to convert UTF-8 into EUC and back like this. The cost of transcoding is admittedly small compared to the cost of using Perl's UTF-8 regex support for the tests, but I would suggest you evaluate tokenizers that can work directly in UTF-8. I believe MeCab is one such tokenizer.

I tried MeCab today and it works fine.  I changed the code from:

  use Text::Kakasi;
  use Encode;

  # Kakasi works on EUC-JP, so convert from UTF-8 first and back afterwards.
  my $res = Encode::encode("euc-jp", Encode::decode("utf8", $text));
  my $rc  = Text::Kakasi::getopt_argv('kakasi', '-ieuc', '-w');
  my $str = Text::Kakasi::do_kakasi($res);
  $utf8   = Encode::decode("euc-jp", $str);

to

  use MeCab;

  # MeCab built with --with-charset=utf8 tokenizes the UTF-8 text directly.
  my @arg   = ('dummy', '-Owakati');
  my $mecab = new MeCab::Tagger (\@arg);
  $utf8     = $mecab->parse($text);

I compiled MeCab with the --with-charset=utf8 option, so no charset conversion is necessary. There is no big difference in processing time, but MeCab is slightly faster than Kakasi.
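
A rough way to reproduce the timing comparison is the standard Benchmark
module. This is only a sketch; it assumes $text already holds a UTF-8 sample
message and that both modules are set up exactly as above:

  use Benchmark qw(cmpthese);
  use Encode;
  use Text::Kakasi;
  use MeCab;

  # one-time setup for both tokenizers
  Text::Kakasi::getopt_argv('kakasi', '-ieuc', '-w');
  my @arg   = ('dummy', '-Owakati');
  my $mecab = new MeCab::Tagger (\@arg);

  # run each tokenizer for about 3 CPU seconds and compare rates
  cmpthese(-3, {
      kakasi => sub {
          my $euc = Encode::encode("euc-jp", Encode::decode("utf8", $text));
          Encode::decode("euc-jp", Text::Kakasi::do_kakasi($euc));
      },
      mecab  => sub { $mecab->parse($text) },
  });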

The results are almost equal for Japanese, with only slight differences; MeCab is more sophisticated than Kakasi. However, MeCab also splits English tokens: "EUC_JP" becomes "EUC _ JP", and "http://www.yahoo.com/" becomes "http :// www . yahoo . com /". Because URLs and e-mail addresses are important signatures, this may be problematic.

I will ask the MeCab developer whether we can avoid this splitting for URLs, etc.
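
In the meantime, one possible workaround (only a sketch, with deliberately
crude regexes for illustration) is to cut URLs and mail addresses out of the
text before handing the rest to MeCab, so those signatures survive intact:

  use MeCab;

  my @arg   = ('dummy', '-Owakati');
  my $mecab = new MeCab::Tagger (\@arg);

  sub tokenize {
      my ($text) = @_;
      my @out;
      # split with a capturing group keeps the URL/address pieces in the list
      for my $piece (split /(https?:\/\/\S+|[\w.+-]+\@[\w.-]+)/, $text) {
          next unless length $piece;
          if ($piece =~ m{^https?://} or $piece =~ /\@/) {
              push @out, $piece;                      # keep as one token
          } else {
              push @out, split ' ', $mecab->parse($piece);
          }
      }
      return @out;
  }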

--
Motoharu Kubo
[EMAIL PROTECTED]
