I must say I was quite pleasantly surprised to find my change tested so quickly over a weekend.

I don't use Bayes, so I won't be putting a lot of effort into Japanese support in Bayes. I will review your proposals:

(1) "split word with space" (tokenization) feature.  There is no space
    between words in Japanese (and Chinese, Korean).  Human can
    understand easily but tokenization is necessary for computer
    processing.  There is a program called kakasi and Text:Kakasi
    (GPLed) which handles tokenization based on special dictionary.  I
    made quick hack to John's patch experimentally and tested.

    As Kakasi does not support UTF-8, we have to convert UTF-8 to
    EUC-JP, process with kakasi, and then convert to UTF-8 again.  It is
    ugly, but it works fine.  Most words are split correctly.  The
    mismatch mentioned above will not occur.
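For reference, the round trip described in (1) might look roughly like the following untested sketch. It assumes the old functional Text::Kakasi interface (getopt_argv/do_kakasi) and kakasi's -w (word-splitting) option; the newer object-oriented interface would look different.

    use Encode qw(encode decode);
    use Text::Kakasi;

    # Configure kakasi for EUC-JP input and word-splitting ("wakatigaki") mode.
    Text::Kakasi::getopt_argv('kakasi', '-ieuc', '-w');

    sub tokenize_ja {
        my ($utf8_text) = @_;
        # UTF-8 -> EUC-JP, since kakasi does not understand UTF-8.
        my $euc = encode('euc-jp', decode('utf-8', $utf8_text));
        # Let kakasi insert spaces between words.
        my $split = Text::Kakasi::do_kakasi($euc);
        # EUC-JP -> UTF-8 again for the rest of the pipeline.
        return encode('utf-8', decode('euc-jp', $split));
    }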

It seems a bit odd to convert UTF-8 into EUC and back like this. The cost of transcoding is admittedly small compared to the cost of using Perl's UTF-8 regex support for the tests, but I would suggest you evaluate tokenizers that can work directly in UTF-8. I believe MeCab is one such tokenizer.
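As a sketch of the UTF-8-native alternative: assuming a mecab binary on the PATH built against a UTF-8 dictionary, its -Owakati output format already returns the text with words separated by spaces, so no transcoding is needed. Something like:

    use IPC::Open2 qw(open2);

    # Hypothetical helper: feed the text to an external mecab process
    # and read back the space-separated ("wakati") form.
    sub tokenize_with_mecab {
        my ($utf8_text) = @_;
        my $pid = open2(my $out, my $in, 'mecab', '-Owakati');
        print {$in} $utf8_text, "\n";
        close $in;
        my $tokenized = do { local $/; <$out> };
        waitpid($pid, 0);
        return $tokenized;
    }

If spawning a process per message is too expensive, there is also a Text::MeCab binding on CPAN.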

Converting UTF-8 to EUC-JP and back is problematic when the source charset does not fit in EUC-JP. Consider what would happen with Russian spam, for example. It is probably not a good idea to tokenize if the message is not in CJK.
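A cheap guard along those lines, as a sketch: only attempt word-splitting when the decoded text actually contains CJK characters, using Perl's Unicode script properties.

    # True if the decoded text contains Japanese/Chinese/Korean characters;
    # otherwise tokenization is skipped, so Russian or other non-CJK text
    # never goes through an EUC-JP round trip.
    sub looks_like_cjk {
        my ($text) = @_;
        return $text =~ /[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/;
    }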

The GPL license of Kakasi and MeCab might be problematic if you want tokenization support to be included in stock SpamAssassin.

I believe tokenization should be done in Bayes, not in Message::Node, and that tests should be run against the non-tokenized form; a rough sketch of that arrangement follows.
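To make the placement concrete, a purely hypothetical arrangement (the sub name is mine, and I have not checked it against the actual Bayes plumbing) would wrap only the text handed to the Bayes tokenizer, so Message::Node keeps returning the untokenized rendered body and body rules still match the original text:

    # Called only from the Bayes code path, never from Message::Node.
    sub _bayes_pretokenize {
        my ($self, $text) = @_;
        return $text unless looks_like_cjk($text);
        return tokenize_ja($text);   # or tokenize_with_mecab($text)
    }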


(2) The raw text body is passed to the Bayes tokenizer.  This causes
    some difficulties.

My reading of the Bayes code suggests the "visible rendered" form of the body is what is passed to the Bayes tokenizer. But then, I don't use Bayes, so I haven't seen what really happens.
