A troublesome one for me was Chinese, specifically the GB2312 character set. I
started weighting on charset=GB2312 and soon noticed legitimate e-mail in
English from users/computers in China that use GB2312 by default. The
characters a-z and A-Z are encoded the same way in GB2312 as in ASCII. So just
because a message declares the character set doesn't mean it is written in
that language.
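
To illustrate the point, here is a rough Python sketch (not anything Message Sniffer does, as far as I know) that looks at what the body actually contains instead of trusting the declared charset. The function name non_ascii_ratio and the sample strings are made up for the example; any cutoff you apply to the ratio would be a local policy choice.

```python
def non_ascii_ratio(raw: bytes, charset: str = "gb2312") -> float:
    """Return the fraction of non-whitespace characters outside ASCII.

    Hypothetical helper: decodes the body with the declared charset and
    counts how much of it is really non-ASCII text.
    """
    text = raw.decode(charset, errors="replace")
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if ord(c) > 127) / len(chars)


# English text labelled as GB2312: the bytes are plain ASCII.
english = "Please find the invoice attached.".encode("gb2312")

# Actual Chinese text in the same charset.
chinese = "请查收附件中的发票。".encode("gb2312")

print(non_ascii_ratio(english))
print(non_ascii_ratio(chinese))
```

A rule could then add weight only when the ratio is high, so that English mail sent from a machine defaulting to GB2312 is left alone.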

I also get some Spanish spam. So I thought, I'll add some weight on the ñ
character. Soon I started getting false hits on el niño, piñata, señor. So
that went out the door too.

Language-based spam filtering is a tough nut.

----- Original Message ----- 

From: "Jorge Asch" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, August 20, 2004 9:36 AM
Subject: Re: [sniffer] Charset


>
> >Just to be clear - we're not precisely talking about spam per-se.
> >Rather we're talking about stating that all traffic on a particular
> >system should be only in one language as a matter of policy...
> >
> >
> Well, since 100% of my users speak English/Spanish I can safely bet that
> NONE of my mail should have strange character sets. So I can assume if
> they do, they must be spam.
>
> It's just a matter of demographics, and I am sure such a rule would not
> apply to all other customers. But for some of them, it would... (foreign
> spam messages seem to have increased ten-fold over the last couple of
> months).
>
>
> -- 
> Jorge Asch Revilla
> CONEXION DCR
> www.conexion.co.cr
> 800-CONEXION
>
>
>
>
> This E-Mail came from the Message Sniffer mailing list. For information
and (un)subscription instructions go to
http://www.sortmonster.com/MessageSniffer/Help/Help.html
>
>



