"John Myers" writes:
> I must say I was quite pleasantly surprised to find my change tested so 
> quickly during a weekend.
> 
> I don't use Bayes, so I won't be putting a lot of effort into Japanese 
> support in Bayes.  I will review your proposals:
> 
> > (1) "split word with space" (tokenization) feature.  There is no space
> >     between words in Japanese (and Chinese, Korean).  Human can
> >     understand easily but tokenization is necessary for computer
> >     processing.  There is a program called kakasi and Text:Kakasi
> >     (GPLed) which handles tokenization based on special dictionary.  I
> >     made quick hack to John's patch experimentally and tested.
> >
> >     As Kakasi does not support UTF-8, we have to convert UTF-8 to
> >     EUC-JP, process with kakasi, and then convert back to UTF-8.  It
> >     is ugly, but it works fine.  Most words are split correctly.  The
> >     mismatch mentioned above will not occur.
> 
> It seems a bit odd to convert UTF-8 into EUC and back like this.  The 
> cost of transcoding is admittedly small compared to the cost of using 
> Perl's UTF-8 regex support for the tests, but I would suggest you 
> evaluate tokenizers that can work directly in UTF-8.  I believe MeCab is 
> one such tokenizer.
> 
> Converting UTF-8 to EUC-JP and back is problematic when the source 
> charset does not fit in EUC-JP.  Consider what would happen with Russian 
> spam, for example.  It is probably not a good idea to tokenize if the 
> message is not in CJK.
> 
> The GPL license of Kakasi and MeCab might be problematic if you want 
> tokenization support to be included in stock SpamAssassin.

For what it's worth, Kakasi looks good for tokenizing Japanese text, and
it's well-established. Given that it's pretty widely packaged (e.g., in
Debian as 'libtext-kakasi-perl'), I think it's a reasonable optional
dependency for sites that expect to see a lot of traffic in Japanese
charsets. (There's a rough sketch of the round trip below.)
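
To make that concrete: here's roughly what the round trip could look
like in Perl. This is untested and assumes the Text::Kakasi 2.x OO
interface (new/get) plus the core Encode module; tokenize_ja() is just
a name I made up for the wrapper.

  use Encode qw(encode decode);
  use Text::Kakasi;

  # UTF-8 -> EUC-JP -> kakasi -w (wakati-gaki: insert spaces
  # between words) -> back to UTF-8.
  sub tokenize_ja {
      my ($utf8_bytes) = @_;

      # -ieuc/-oeuc: EUC-JP in and out; -w: word splitting.
      my $kakasi = Text::Kakasi->new(qw(-ieuc -oeuc -w));

      my $euc   = encode('euc-jp', decode('utf8', $utf8_bytes));
      my $split = $kakasi->get($euc);
      return encode('utf8', decode('euc-jp', $split));
  }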

However, I'd greatly prefer it if there were some way we could detect
when Kakasi tokenization should be used, instead of applying it in all
cases.  It only deals with one language, and I can foresee a situation
where we have 10 different tokenizers for different Asian charsets.

We could make it dependent on TextCat's language identification... if
the detected language is "ja", then apply the Kakasi tokenizer, if
available.

That would also help reduce the impact of transcoding EUC-JP <-> UTF-8
if it's still required.
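
Very roughly, something like this; detect_language() is a stand-in for
whatever call we'd actually make into the TextCat code (I haven't
checked that API), and tokenize_ja() is the sketch above:

  # Only run Kakasi when TextCat says the text is Japanese;
  # otherwise leave the text untouched, so e.g. Russian spam
  # never goes through the EUC-JP round trip at all.
  sub maybe_tokenize {
      my ($utf8_bytes) = @_;

      my $lang = detect_language($utf8_bytes);    # e.g. "ja", "ru", "en"
      if ($lang eq 'ja' && eval { require Text::Kakasi; 1 }) {
          return tokenize_ja($utf8_bytes);        # sketch above
      }
      return $utf8_bytes;
  }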

> I believe tokenization should be done in Bayes, not in Message::Node.  I 
> believe tests should be run against the non-tokenized form.

+1 agreed.
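
Just to illustrate where that leaves the Kakasi step (illustrative
only, the real Bayes tokenizer is more involved than this): the
rendered body reaches the rule tests untouched, and the word splitting
happens only on the Bayes side, e.g.

  # Word splitting lives in the Bayes tokenize path, not in
  # Message::Node, so rule tests still see the unmodified body.
  sub bayes_tokenize {
      my ($rendered_body) = @_;
      my $text = maybe_tokenize($rendered_body);  # sketch above
      return grep { length } split /\s+/, $text;
  }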

> > (2) Raw text body is passed to Bayes tokenizer.  This causes some
> >     difficulties.
> 
> My reading of the Bayes code suggests the "visible rendered" form of the 
> body is what is passed to the Bayes tokenizer.  But then I don't use 
> Bayes so haven't seen what really happens.

Yes, that is the intent (and what happens with English text, at least).

--j.