I must say I was quite pleasantly surprised to find my change tested so quickly over a weekend.

I don't use Bayes, so I won't be putting a lot of effort into Japanese support in Bayes. I will review your proposals:

(1) "split word with space" (tokenization) feature.  There is no space
    between words in Japanese (and Chinese, Korean).  Human can
    understand easily but tokenization is necessary for computer
    processing.  There is a program called kakasi and Text:Kakasi
    (GPLed) which handles tokenization based on special dictionary.  I
    made quick hack to John's patch experimentally and tested.

    As Kakasi does not support UTF-8, we have to convert UTF-8 to
    EUC-JP, process with kakasi, and then convert to UTF-8 again.  It is
    ugly, but it works fine.  Most words are split correctly.  The
    mismatch mentioned above will not occur.
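For reference, the round trip described in (1) might look roughly like the following untested sketch. It assumes the old functional Text::Kakasi interface (getopt_argv/do_kakasi) and kakasi's -w (word-splitting) option; the newer object-oriented interface would look different.

    use Encode qw(encode decode);
    use Text::Kakasi;

    # Configure kakasi for EUC-JP input and word-splitting ("wakatigaki") mode.
    Text::Kakasi::getopt_argv('kakasi', '-ieuc', '-w');

    sub tokenize_ja {
        my ($utf8_text) = @_;
        # UTF-8 -> EUC-JP, since kakasi does not understand UTF-8.
        my $euc = encode('euc-jp', decode('utf-8', $utf8_text));
        # Let kakasi insert spaces between words.
        my $split = Text::Kakasi::do_kakasi($euc);
        # EUC-JP -> UTF-8 again for the rest of the pipeline.
        return encode('utf-8', decode('euc-jp', $split));
    }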

It seems a bit odd to convert UTF-8 into EUC and back like this. The cost of transcoding is admittedly small compared to the cost of using Perl's UTF-8 regex support for the tests, but I would suggest you evaluate tokenizers that can work directly in UTF-8. I believe MeCab is one such tokenizer.
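As a sketch of the UTF-8-native alternative: assuming a mecab binary on the PATH built against a UTF-8 dictionary, its -Owakati output format already returns the text with words separated by spaces, so no transcoding is needed. Something like:

    use IPC::Open2 qw(open2);

    # Hypothetical helper: feed the text to an external mecab process
    # and read back the space-separated ("wakati") form.
    sub tokenize_with_mecab {
        my ($utf8_text) = @_;
        my $pid = open2(my $out, my $in, 'mecab', '-Owakati');
        print {$in} $utf8_text, "\n";
        close $in;
        my $tokenized = do { local $/; <$out> };
        waitpid($pid, 0);
        return $tokenized;
    }

If spawning a process per message is too expensive, there is also a Text::MeCab binding on CPAN.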

Converting UTF-8 to EUC-JP and back is problematic when the source charset does not fit in EUC-JP. Consider what would happen with Russian spam, for example. It is probably not a good idea to tokenize if the message is not in CJK.
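A cheap guard along those lines, as a sketch: only attempt word-splitting when the decoded text actually contains CJK characters, using Perl's Unicode script properties.

    # True if the decoded text contains Japanese/Chinese/Korean characters;
    # otherwise tokenization is skipped, so Russian or other non-CJK text
    # never goes through an EUC-JP round trip.
    sub looks_like_cjk {
        my ($text) = @_;
        return $text =~ /[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/;
    }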

The GPL license of Kakasi and MeCab might be problematic if you want tokenization support to be included in stock SpamAssassin.

I believe tokenization should be done in Bayes, not in Message::Node, and that tests should be run against the non-tokenized form; a rough sketch of that arrangement follows.
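To make the placement concrete, a purely hypothetical arrangement (the sub name is mine, and I have not checked it against the actual Bayes plumbing) would wrap only the text handed to the Bayes tokenizer, so Message::Node keeps returning the untokenized rendered body and body rules still match the original text:

    # Called only from the Bayes code path, never from Message::Node.
    sub _bayes_pretokenize {
        my ($self, $text) = @_;
        return $text unless looks_like_cjk($text);
        return tokenize_ja($text);   # or tokenize_with_mecab($text)
    }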


(2) The raw text body is passed to the Bayes tokenizer.  This causes
    some difficulties.

My reading of the Bayes code suggests the "visible rendered" form of the body is what is passed to the Bayes tokenizer. But then, I don't use Bayes, so I haven't seen what really happens.
