On 4/12/2016 1:16 PM, Reindl Harald wrote:
On 12.04.2016 at 18:44, Yu Qian wrote:
SpamAssassin uses Bayes as its classifier, which is typical and efficient
for English. But how does it process languages such as Asian languages?
Can anyone explain that, or show the code where SpamAssassin does it?
Bayes is by definition language-agnostic.
*You train* Bayes with samples of ham and spam (at least a few hundred
of each); the tokenizer splits the messages into parts and builds a
database of how often each word appears in spam and in ham (a simplified
explanation).
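
To make the training step above concrete, here is a minimal sketch in Python. It is an illustration only, not SpamAssassin's code: the `tokenize`, `train`, and `spamminess` names are hypothetical, the tokenizer is a naive regex, and the score is a plain frequency ratio rather than whatever combining formula SA actually uses.

```python
import re
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of word characters.
    # This is an assumption for illustration, not SpamAssassin's tokenizer.
    return re.findall(r"\w+", text.lower())

def train(messages):
    # For each token, count how many messages it appears in.
    counts = Counter()
    for message in messages:
        counts.update(set(tokenize(message)))
    return counts

def spamminess(token, spam_counts, ham_counts, n_spam, n_ham):
    # Ratio of per-class frequencies; 0.5 means "no evidence either way".
    spam_freq = spam_counts[token] / n_spam
    ham_freq = ham_counts[token] / n_ham
    if spam_freq + ham_freq == 0:
        return 0.5
    return spam_freq / (spam_freq + ham_freq)

# Tiny made-up training set; a real one needs hundreds of each.
spam = ["cheap pills online", "cheap watches online now"]
ham = ["meeting notes attached", "lunch meeting tomorrow"]
spam_counts = train(spam)
ham_counts = train(ham)

for token in ("cheap", "meeting", "hello"):
    print(token, spamminess(token, spam_counts, ham_counts, len(spam), len(ham)))
```

Nothing in the counting depends on the language; only `tokenize` does, which is why the tokenizer is the part that matters for text without whitespace between words.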
While that's true, tokenizing languages that don't delimit words by
whitespace is extremely difficult. For languages like Chinese, it can
only be done by carrying around a language dictionary.
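
A small Python sketch of the problem, under stated assumptions: splitting on whitespace works for the English sample but returns the whole Chinese sample as a single token. The `char_bigrams` helper is a hypothetical workaround that emits overlapping character pairs; it is not real word segmentation, and I am not claiming SpamAssassin does this.

```python
def whitespace_tokens(text):
    # Split on runs of whitespace, as a naive English tokenizer would.
    return text.split()

def char_bigrams(text):
    # Hypothetical fallback: overlapping two-character tokens.
    # This approximates words without a dictionary; it does not segment them.
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]

english = "buy cheap pills"
chinese = "免费领取奖品"  # roughly "claim a free prize"; no spaces between words

print(whitespace_tokens(english))  # three separate words
print(whitespace_tokens(chinese))  # the whole string as one token
print(char_bigrams(chinese))       # overlapping character pairs
```

Some of the bigrams happen to be real words and some straddle a word boundary, so the token database gets noisier than it would with proper segmentation.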
Yu Qian, if you're up to reading code you may want to look at
lib/Mail/SpamAssassin/Bayes.pm and
lib/Mail/SpamAssassin/Plugin/Bayes.pm. I'm not familiar enough with the
Bayes side of SA to say for sure, but you might be able to configure it
or write a plugin that can do the tokenization you desire. You may also
be able to reuse existing research from http://nlp.stanford.edu/ and such.