On Tue, 13 Mar 2012 06:42:05 -0700 (PDT)
John Hardin <jhar...@impsec.org> wrote:

> > PS: I haven't looked at SA's Bayes implementation.  Can it handle
> > words in non-western character sets properly?

> It seems to. All of the Chinese-language spam I get hits BAYES_99.

I took a look at the code, and it does sort of handle non-Western
character sets, although I wouldn't say "properly".

It looks like it simply tokenizes without regard to the character set.
So a word like "français" would be tokenized as "fran\x{c3}\x{a7}ais"
if the source character set is UTF-8, but as "fran\x{e7}ais" if the
source character set is ISO-8859-1.  Am I misunderstanding?
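
To make the concern concrete, here is a minimal Python sketch of what I
think is going on. It is not SpamAssassin's code: tokenize_bytes and
tokenize_text are hypothetical helpers of mine, and splitting on
whitespace is an assumption standing in for whatever the real tokenizer
does. The only point is that a charset-blind tokenizer sees two unrelated
tokens for the same word.

```python
# Hypothetical illustration only; not SpamAssassin's actual tokenizer.

def tokenize_bytes(raw: bytes) -> list[bytes]:
    # Charset-blind: split the raw bytes on ASCII whitespace and keep
    # any high-bit bytes exactly as they arrived.
    return raw.split()

def tokenize_text(raw: bytes, charset: str) -> list[str]:
    # Charset-aware: decode to Unicode text first, then split.
    return raw.decode(charset).split()

word = "français"
utf8_raw = word.encode("utf-8")         # b'fran\xc3\xa7ais'
latin1_raw = word.encode("iso-8859-1")  # b'fran\xe7ais'

# Same word, different byte sequences, so different tokens.
print(tokenize_bytes(utf8_raw) == tokenize_bytes(latin1_raw))  # False

# After decoding with the declared charset, the tokens agree.
print(tokenize_text(utf8_raw, "utf-8") ==
      tokenize_text(latin1_raw, "iso-8859-1"))                 # True
```

If that is roughly what happens, the Bayes database would have to learn
each encoding of a word separately, which may still work well enough in
practice when most spam in a given language arrives in one encoding.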

Regards,

David.
