On 4/12/2016 1:16 PM, Reindl Harald wrote:


On 12.04.2016 at 18:44, Yu Qian wrote:
SpamAssassin uses Bayes as its classifier, which is typical and efficient
for English. But how does it process languages such as Asian languages?

Can anyone explain that, or show the code where SpamAssassin
does it?

bayes is by definition language agnostic

*you train* bayes with samples of ham and spam (at least a few hundred of each) and the tokenizer splits the messages into parts and creates a database of how often each word appears in spam and in ham (a simplified explanation)
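
To make the training step concrete, here is a minimal sketch of the idea in Python. It is not SpamAssassin's code: the function names, the whitespace/punctuation tokenizer, the add-one smoothing and the four sample messages are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: words are runs of word characters, lowercased.
    return re.findall(r"\w+", text.lower())


def train(messages):
    """Count in how many messages each token occurs."""
    counts = Counter()
    for msg in messages:
        counts.update(set(tokenize(msg)))
    return counts


def spam_probability(text, spam_counts, ham_counts, n_spam, n_ham):
    """Naive Bayes estimate of P(spam | text), with add-one smoothing."""
    log_spam = math.log(n_spam / (n_spam + n_ham))
    log_ham = math.log(n_ham / (n_spam + n_ham))
    for tok in set(tokenize(text)):
        log_spam += math.log((spam_counts[tok] + 1) / (n_spam + 2))
        log_ham += math.log((ham_counts[tok] + 1) / (n_ham + 2))
    # Turn the two log scores back into a probability.
    return 1 / (1 + math.exp(log_ham - log_spam))


# Made-up training samples; a real setup needs hundreds of each.
spam = ["cheap pills buy now", "buy cheap watches now"]
ham = ["meeting moved to monday", "see you at the meeting"]
spam_counts, ham_counts = train(spam), train(ham)

p = spam_probability("buy cheap pills", spam_counts, ham_counts, len(spam), len(ham))
print(f"{p:.2f}")
```

Nothing in the counting or scoring depends on the language. Only tokenize() assumes that words can be found by looking for delimiters.
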
While that's true, tokenizing languages that don't delimit words by whitespace is extremely difficult. For languages like Chinese, it can only be done by carrying around a language dictionary.
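
As a rough illustration of what dictionary-based tokenization means, here is a sketch of greedy forward maximum matching in Python. The segment() function, the max_len parameter and the three-word dictionary are hypothetical; a real segmenter needs a large dictionary and usually something smarter than greedy matching.

```python
def segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary word that matches;
    if none matches, emit a single character as its own token.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy dictionary: "free", "claim", "prize".
WORDS = {"免费", "领取", "奖品"}

print(segment("免费领取奖品", WORDS))
```

The resulting tokens could then be counted in spam and ham the same way whitespace-delimited words are.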

Yu Qian, if you're up to reading code you may want to look at lib/Mail/SpamAssassin/Bayes.pm and lib/Mail/SpamAssassin/Plugin/Bayes.pm. I'm not familiar enough with the Bayes side of SA to say for sure, but you might be able to configure it or write a plugin that can do the tokenization you desire. You may also be able to reuse existing research from http://nlp.stanford.edu/ and such.
