On Tue, 13 Mar 2012 06:42:05 -0700 (PDT) John Hardin <jhar...@impsec.org> wrote:

> > PS: I haven't looked at SA's Bayes implementation. Can it handle
> > words in non-western character sets properly?

> It seems to. All of the Chinese-language spam I get hits BAYES_99.

I took a look at the code, and it does sort-of handle non-Western
character sets, although I wouldn't say "properly". It looks like
it simply tokenizes without regard to the character set. So a word
like "français" would be tokenized as "fran\x{c3}\x{a7}ais" if the
source character set is UTF-8, but as "fran\x{e7}ais" if the source
character set is ISO-8859-1.

Am I misunderstanding?

Regards,

David.
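
PS: To make the byte-level difference concrete, here is a small Python
sketch. It is not SpamAssassin's code (SA is written in Perl);
byte_token is a hypothetical helper that only shows which bytes a
charset-unaware tokenizer would see for the same word, assuming the
message body is handed over as raw bytes with no decoding step.

```python
# Illustration only: NOT SpamAssassin's tokenizer. byte_token is a
# hypothetical helper that renders the raw bytes of a word, escaping
# non-ASCII bytes in the \x{..} notation used above.
def byte_token(word: str, charset: str) -> str:
    raw = word.encode(charset)
    return "".join(chr(b) if b < 0x80 else f"\\x{{{b:02x}}}" for b in raw)

word = "fran\u00e7ais"  # "français", written with an escape to stay ASCII-safe

utf8_token = byte_token(word, "utf-8")
latin1_token = byte_token(word, "iso-8859-1")

print(utf8_token)    # fran\x{c3}\x{a7}ais
print(latin1_token)  # fran\x{e7}ais
print(utf8_token == latin1_token)  # False
```

If the tokenizer really does work on undecoded bytes, the same word
arriving in two charsets would be learned as two unrelated tokens.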