Re: Eliminating russian spam

2009-09-22 Thread Makoev Alan
Thank you, John!
Both the "how-to" (http://sa-russian.narod.ru/no_russian.html) and the ruleset
(http://sa-russian.narod.ru/files/20090916/99_no_russian_mail.cf) have been
updated.


Re: Cyrillic charsets normalization

2009-02-16 Thread Makoev Alan
But that would also prevent MUAs from rendering the contents correctly,
wouldn't it?

16.02.09, 10:48, "Jeff Chan" :

> On Sunday, February 15, 2009, 11:19:17 PM, Makoev Alan wrote:
> > So my question is: Is it just due
> > to developers' time shortage, or there are some reasons for
> > avoiding using the charset indicated in the header field as a
> > source charset for normalization? 
> Perhaps spammers set that field deceptively or incorrectly some
> of the time or don't set it at all other times, so that an
> attempt to automatically detect the character set is useful in
> some cases?  This is just a guess on my part however.
> Cheers,
> Jeff C.
> -- 
> Jeff Chan
> mailto:je...@surbl.org
> http://www.surbl.org/


Cyrillic charsets normalization

2009-02-15 Thread Makoev Alan
There was recently a discussion here of the "charset normalization" feature
(see e.g.
http://markmail.org/message/hvdtbca6lm5tsjtm?q=list:org.apache.spamassassin.users+date:200901+&page=42)
I ran a simple check on the results that Encode::Detect::Detector yields.
I manually selected a set of 39 messages in Russian, nearly all of them spam
(those that were not MIME-encoded, so I could see the contents by just pressing
F3 in mc): 32 in KOI8-R, 6 in CP-1251 and 1 (ham) in UTF-8. I then ran a
simple script that feeds the message body to
Encode::Detect::Detector::detect, and got the following:
- among the 6 CP-1251 messages, 1 was detected as Mac-Cyrillic (which might be
pardonable when rendering text for humans, since these encodings differ in only
2 lowercase letters, but it may negatively affect text analysis results) and 1
was not recognized at all (Encode::Detect::Detector::detect returned "undef");
- among the 32 KOI8-R messages, 3 were detected as CP-1255 (Hebrew);
- the 1 UTF-8 message was detected correctly.
Of course, this set is by no means representative, but it illustrates possible
drawbacks of using the "normalize_charset" option.
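
The remark about CP-1251 and Mac-Cyrillic can be checked directly. Below is a
minimal sketch in Python (not the Perl script used for the test above) that
compares the byte each encoding assigns to every Russian letter. It relies only
on Python's standard "cp1251" and "mac_cyrillic" codecs; the helper name
differing_letters is made up for this illustration.

```python
# Compare how CP-1251 and Mac-Cyrillic encode the Russian alphabet.
LOWER = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
UPPER = LOWER.upper()


def differing_letters(letters, enc_a="cp1251", enc_b="mac_cyrillic"):
    """Return the letters whose single-byte code differs between the codecs."""
    return [ch for ch in letters if ch.encode(enc_a) != ch.encode(enc_b)]


lower_diff = differing_letters(LOWER)
upper_diff = differing_letters(UPPER)

print("lowercase letters that differ:", "".join(lower_diff))
print("uppercase letters that differ:", len(upper_diff), "of", len(UPPER))
```

Lowercase text, which makes up most of a message body, differs only in the
letters "ё" and "я", whereas the uppercase letters occupy a different byte range
in each encoding. That probably explains why a statistical detector can confuse
the two on mostly-lowercase text.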
Strictly speaking, such a result is to be expected, because the tricks widely
used by spammers (replacing Cyrillic letters with similar-looking Latin ones,
replacing digits with similar-looking letters and vice versa, adding random
letter sequences to poison Bayes, etc.) should affect the detection result.
And yet SA ignores the "charset=" parameter in the "Content-Type:" header
field. So my question is: is this just due to a shortage of developer time, or
are there reasons to avoid using the charset indicated in the header field as
the source charset for normalization?
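
To make the question concrete, here is a rough sketch of "trust the declared
charset first, fall back otherwise", written in Python with the standard email
package. It is only an illustration of the idea, not how SpamAssassin is
implemented; the function name decode_body and the UTF-8 fallback are
assumptions made for this example.

```python
from email import message_from_bytes


def decode_body(raw_message, fallback="utf-8"):
    """Decode a non-multipart body, trying the declared charset first.

    Returns a (text, charset_used) pair.
    """
    msg = message_from_bytes(raw_message)
    body = msg.get_payload(decode=True) or b""
    declared = msg.get_content_charset()  # lowercased charset= value, or None
    if declared:
        try:
            return body.decode(declared), declared
        except (LookupError, UnicodeDecodeError):
            # Unknown or wrong label: a detector could take over at this point.
            pass
    return body.decode(fallback, errors="replace"), fallback


raw = (
    b"Content-Type: text/plain; charset=koi8-r\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
) + "Привет".encode("koi8-r") + b"\r\n"

text, used = decode_body(raw)
print(used, text.strip())
```

A missing, unknown or plainly wrong label drops through to the fallback branch,
and that is where automatic detection would still be useful.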


Cyrillic charsets normalization

2009-02-13 Thread Makoev Alan
There was recently a discussion here of the "charset normalization" feature
(see e.g.
http://markmail.org/message/hvdtbca6lm5tsjtm?q=list:org.apache.spamassassin.users+date:200901+&page=42)
I ran a simple check of the results that Encode::Detect::Detector yields.
I manually selected a set of 39 spam messages in Russian (those that were not
MIME-encoded, so I could check them by just pressing F3 in mc): 32 in KOI8-R,
6 in CP-1251 and 1 in UTF-8. I then ran a simple script that feeds the message
body to Encode::Detect::Detector::detect, and got the following:
- among the 6 CP-1251 messages, 1 was detected as Mac-Cyrillic (which might be
pardonable when rendering text for humans, since these encodings differ in only
2 lowercase letters, but it may negatively affect text analysis results) and 1
was not recognized at all (Encode::Detect::Detector::detect returned "undef");
- among the 32 KOI8-R messages, 3 were detected as CP-1255 (Hebrew);
- the 1 UTF-8 message was detected correctly.
Of course, this set is by no means representative, but it illustrates possible
drawbacks of using the "normalize_charset" option.
Strictly speaking, such a result is to be expected, since the tricks widely
used by spammers (replacing Cyrillic letters with similar-looking Latin ones,
replacing digits with similar-looking letters and vice versa, adding random
letter sequences to poison Bayes, etc.) should affect the detection result.
And yet SA ignores the "charset=" parameter in the "Content-Type:" header
field. So my question is: is this just due to a shortage of developer time, or
are there reasons to avoid using the charset indicated in the header field as
the source charset for normalization?