I must say I was pleasantly surprised to find my change tested so
quickly over the weekend.
I don't use Bayes, so I won't be putting a lot of effort into Japanese
support in Bayes. I will review your proposals:
(1) "split word with space" (tokenization) feature. There are no spaces
    between words in Japanese (or in Chinese and Korean). Humans can
    understand such text easily, but tokenization is necessary for
    computer processing. There is a program called kakasi, and a Perl
    module Text::Kakasi (GPLed), which handle tokenization based on a
    special dictionary. I made a quick experimental hack to John's
    patch and tested it.
    As kakasi does not support UTF-8, we have to convert UTF-8 to
    EUC-JP, process the text with kakasi, and then convert back to
    UTF-8. It is ugly, but it works fine. Most words are split
    correctly. The mismatch mentioned above will not occur.
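If I understand the round trip correctly, the hack amounts to something
like the sketch below. This is my own reconstruction, not your patch:
it assumes Text::Kakasi's functional interface plus the Encode module,
and the kakasi options (-ieuc/-oeuc for EUC-JP in and out, -w for
wakati splitting) may need adjusting.

  use Encode qw(from_to);
  use Text::Kakasi;

  # kakasi speaks EUC-JP, not UTF-8, so configure it for EUC-JP in/out
  # and ask it to insert spaces between words (wakati-gaki).
  Text::Kakasi::getopt_argv('kakasi', '-ieuc', '-oeuc', '-w');

  sub tokenize_ja {
      my ($text) = @_;                       # UTF-8 octets
      from_to($text, 'utf-8', 'euc-jp');     # UTF-8 -> EUC-JP
      my $split = Text::Kakasi::do_kakasi($text);
      from_to($split, 'euc-jp', 'utf-8');    # back to UTF-8
      return $split;
  }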
It seems a bit odd to convert UTF-8 into EUC and back like this. The
cost of transcoding is admittedly small compared to the cost of using
Perl's UTF-8 regex support for the tests, but I would suggest you
evaluate tokenizers that can work directly in UTF-8. I believe MeCab is
one such tokenizer.
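For comparison, if MeCab's Perl binding works the way I expect (I have
not tried it myself, and it needs a dictionary built for UTF-8), the
same step without any transcoding would look roughly like:

  use MeCab;

  # -Owakati asks MeCab for space-separated (wakati) output.
  my $tagger = MeCab::Tagger->new('-Owakati');

  sub tokenize_ja_utf8 {
      my ($utf8_text) = @_;
      return $tagger->parse($utf8_text);     # no EUC-JP round trip
  }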
Converting UTF-8 to EUC-JP and back is problematic when the source
charset does not fit in EUC-JP. Consider what would happen with Russian
spam, for example. It is probably not a good idea to tokenize if the
message is not in CJK.
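A cheap guard would be to check for CJK characters before doing any
conversion at all; something along these lines (untested, and it
assumes the text has already been decoded to Perl characters):

  # Only pass text to the tokenizer if it actually contains CJK
  # characters; Russian or Latin-script text falls straight through.
  sub looks_like_cjk {
      my ($text) = @_;
      return $text =~ /[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/;
  }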
The GPL license of Kakasi and MeCab might be problematic if you want
tokenization support to be included in stock SpamAssassin.
I believe tokenization should be done in Bayes, not in Message::Node,
and that the tests should be run against the non-tokenized form.
(2) The raw text body is passed to the Bayes tokenizer. This causes
    some difficulties.
My reading of the Bayes code suggests that the "visible rendered" form
of the body is what is passed to the Bayes tokenizer, but then I don't
use Bayes, so I haven't seen what really happens.
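If the accessor is the one I am thinking of, the relevant call would be
something like the line below, but treat the method name as my
assumption rather than a statement about the current code:

  # Assuming $msg is the Mail::SpamAssassin::Message being scanned;
  # Bayes would then see these rendered, visible lines, not the raw body.
  my @visible_lines = $msg->get_visible_rendered_body_text_array();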