A troublesome one for me was Chinese, specifically the GB2312 character set. I started adding weight for charset=GB2312 and soon noticed legitimate e-mail in English from users/computers in China that use the GB2312 character set. The characters a-z and A-Z are encoded the same way in GB2312 as in plain ASCII, so just because a message uses the character set doesn't mean it is written in that language.
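To make that concrete, here is a rough sketch in Python (my choice of language, not anything Message Sniffer does internally). It assumes you already have the declared charset and the raw body bytes in hand. The function names and the 0.3 threshold are made up for illustration. The idea is to weight only when the body really contains a fair share of high-bit bytes, since GB2312 encodes Chinese characters as byte pairs with the high bit set and leaves ASCII untouched.

```python
def high_byte_ratio(body: bytes) -> float:
    """Fraction of bytes in the body that have the high bit set (non-ASCII)."""
    if not body:
        return 0.0
    return sum(b >= 0x80 for b in body) / len(body)


def looks_chinese(charset: str, body: bytes, threshold: float = 0.3) -> bool:
    """Hypothetical check: declared GB2312 AND mostly non-ASCII content.

    The threshold is an arbitrary illustration value, not a tuned setting.
    """
    if charset.strip().lower() != "gb2312":
        return False
    return high_byte_ratio(body) >= threshold


# English text labelled GB2312 is byte-for-byte identical to ASCII.
english = "Please find the invoice attached.".encode("gb2312")
chinese = "你好世界".encode("gb2312")

print(looks_chinese("GB2312", english))
print(looks_chinese("GB2312", chinese))
```

A check like this would still miss mixed-language mail and says nothing about whether the message is spam; it only avoids penalizing English mail for its charset label alone.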
I also get some Spanish spam. So I thought, I'll add some weight on the ñ character. Soon I started getting false hits on el niño, piñata, señor. So that went out the door too. Language-based spam filtering is a tough nut.

----- Original Message -----
From: "Jorge Asch" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, August 20, 2004 9:36 AM
Subject: Re: [sniffer] Charset

> >Just to be clear - we're not precisely talking about spam per-se.
> >Rather we're talking about stating that all traffic on a particular
> >system should be only in one language as a matter of policy...
>
> Well, since 100% of my users speak english/spanish I can safely bet that
> NONE of my mail should have strange character sets. So I can assume if
> they do, they must be spam.
>
> It's just a matter of demographics, and I am sure such a rule would not
> apply to all other customers. But for some of them, it would... (foreign
> spam messages seems to have increased ten-fold over the last couple of
> months).
>
> --
> Jorge Asch Revilla
> CONEXION DCR
> www.conexion.co.cr
> 800-CONEXION
>
> This E-Mail came from the Message Sniffer mailing list. For information and (un)subscription instructions go to http://www.sortmonster.com/MessageSniffer/Help/Help.html