Re: How does SpamAssassin processing languages other than English

Joe Quinn Tue, 12 Apr 2016 10:24:12 -0700

On 4/12/2016 1:16 PM, Reindl Harald wrote:

On 12.04.2016 at 18:44, Yu Qian wrote:
SpamAssassin uses Bayes as its classifier, which is typical and efficient for
English. But how does it process languages such as Asian languages?

Can anyone explain that, or show me the code where SpamAssassin
does it?
bayes is by definition language agnostic
*you train* bayes with samples of ham and spam (at least a few hundred of each) and the tokenizer splits the messages into parts and creates a database of how often each word appears in spam and ham (a simplified explanation)
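
As a rough illustration of the training step described above, here is a minimal Python sketch. It is not SpamAssassin's code: the whitespace tokenizer, the TokenCounts class, and the sample messages are all hypothetical, and the per-token score is a plain ratio of spam and ham frequencies rather than whatever SpamAssassin actually computes.

```python
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()


class TokenCounts:
    """Toy token database: in how many spam/ham messages each token appears."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.nspam = 0
        self.nham = 0

    def train(self, text, is_spam):
        # Count each token once per message.
        tokens = set(tokenize(text))
        if is_spam:
            self.spam.update(tokens)
            self.nspam += 1
        else:
            self.ham.update(tokens)
            self.nham += 1

    def spam_probability(self, token):
        # Fraction of spam messages containing the token, relative to the
        # sum of the spam and ham fractions. Unseen tokens score 0.5.
        s = self.spam[token] / self.nspam if self.nspam else 0.0
        h = self.ham[token] / self.nham if self.nham else 0.0
        if s + h == 0:
            return 0.5
        return s / (s + h)


# Made-up training samples, for illustration only.
db = TokenCounts()
db.train("Buy cheap pills", is_spam=True)
db.train("cheap pills online", is_spam=True)
db.train("meeting on monday", is_spam=False)
db.train("cheap lunch on monday", is_spam=False)

for word in ("pills", "cheap", "monday", "unknown"):
    print(word, db.spam_probability(word))
```

Nothing in the counting step depends on the language of the message; everything language-specific is inside tokenize().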

While that's true, tokenizing languages that don't delimit words by whitespace is extremely difficult. For languages like Chinese, it can only be done by carrying around a language dictionary.
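
To make the difficulty concrete, here is a small Python sketch comparing a whitespace split with a dictionary-based forward maximum-matching segmenter. This is only an illustration of the dictionary approach mentioned above, not anything taken from SpamAssassin; the function names, the three-word dictionary, and the sample text are all made up for the example.

```python
def whitespace_tokens(text):
    """Split on whitespace, as a tokenizer written for English might."""
    return text.split()


def max_match(text, dictionary, max_len=4):
    """Forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the loop advances.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy dictionary and sample text (assumptions for illustration only).
dictionary = {"免费", "获取", "优惠"}
text = "免费获取优惠"

print(whitespace_tokens(text))      # the whole string comes back as one token
print(max_match(text, dictionary))  # split into dictionary words
```

With no spaces in the input, the whitespace split yields a single long token that is unlikely to recur in other messages, while the dictionary lookup recovers word-sized tokens. A real segmenter needs a far larger dictionary and some way to resolve ambiguous splits.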

Yu Qian, if you're up to reading code you may want to look at lib/Mail/SpamAssassin/Bayes.pm and lib/Mail/SpamAssassin/Plugin/Bayes.pm. I'm not familiar enough with the Bayes side of SA to say for sure, but you might be able to configure it or write a plugin that can do the tokenization you desire. You may also be able to reuse existing research from http://nlp.stanford.edu/ and such.
