Re: Eliminating russian spam
Thank you, John! Both the "how-to" (http://sa-russian.narod.ru/no_russian.html) and the ruleset (http://sa-russian.narod.ru/files/20090916/99_no_russian_mail.cf) are updated.
Re: Cyrillic charsets normalization
But that would also prevent MUAs from correctly rendering the contents, wouldn't it?

16.02.09, 10:48, "Jeff Chan":
> On Sunday, February 15, 2009, 11:19:17 PM, Makoev Alan wrote:
> > So my question is: Is it just due to developers' time shortage, or are
> > there some reasons for avoiding using the charset indicated in the
> > header field as a source charset for normalization?
> Perhaps spammers set that field deceptively or incorrectly some of the
> time, or don't set it at all other times, so that an attempt to
> automatically detect the character set is useful in some cases? This is
> just a guess on my part, however.
> Cheers,
> Jeff C.
> --
> Jeff Chan
> mailto:je...@surbl.org
> http://www.surbl.org/
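Jeff's guess can be illustrated with a small sketch (Python here for brevity; the original test script was Perl, and the message bytes below are an assumption for illustration). Trusting the declared charset works only when the sender is honest: a deceptive "charset=" parameter decodes without any error, just to mojibake, so there is no cheap way to tell that the header lied.

```python
import email

def decode_body(raw: bytes) -> str:
    """Decode a message body using the charset its own headers declare."""
    msg = email.message_from_bytes(raw)
    charset = msg.get_content_charset() or "us-ascii"
    return msg.get_payload(decode=True).decode(charset, errors="replace")

body = "привет".encode("koi8_r")  # Russian "hello" as KOI8-R bytes

honest = b"Content-Type: text/plain; charset=koi8-r\r\n\r\n" + body
lying  = b"Content-Type: text/plain; charset=windows-1251\r\n\r\n" + body

print(decode_body(honest))  # renders the intended Russian text
print(decode_body(lying))   # no decode error, but the text is garbage
```

Note that the lying case raises no exception at all: every byte is "valid" CP-1251, which is exactly why a charset-normalizing filter cannot blindly validate the header against the body.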
Cyrillic charsets normalization
There was recently a discussion of the "charset normalization" feature (see e.g. http://markmail.org/message/hvdtbca6lm5tsjtm?q=list:org.apache.spamassassin.users+date:200901+&page=42).

I ran a simple check on the results that the Encode::Detect::Detector facility yields. I manually selected a set of 39 messages in Russian (ones that were not MIME-encoded, so I could see the contents just by pressing F3 in mc): 32 in KOI8-R, 6 in CP-1251, and 1 (ham) in UTF-8. I then ran a simple script that feeds each message body to Encode::Detect::Detector::detect, with the following results:

- Of the 6 CP-1251 messages, 1 was detected as Mac-Cyrillic (which might be pardonable when producing text for humans, since these encodings differ only in 2 letters, but it may negatively affect text analysis), and 1 was not recognized at all (Encode::Detect::Detector::detect returned "undef");
- Of the 32 KOI8-R messages, 3 were detected as CP-1255 (Hebrew);
- The 1 UTF-8 message was detected correctly.

Of course, this set is by no means representative, but it illustrates possible drawbacks of the "normalize_charset" option. Strictly speaking, such a result is to be expected, because the tricks widely used by spammers (replacing Cyrillic letters with similar-looking Latin ones, replacing digits with letters that look like digits and vice versa, adding random letter sequences to poison Bayes, etc.) are bound to affect the detection result. And despite this, SA ignores the "charset=" parameter in the "Content-Type:" header field.

So my question is: Is it just due to developers' time shortage, or are there some reasons for avoiding using the charset indicated in the header field as the source charset for normalization?
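The root of the ambiguity can be shown in a few lines (a Python sketch, not the original Perl script that called Encode::Detect::Detector): nearly every byte in the 0x80-0xFF range is assigned a character in each single-byte Cyrillic codepage, so the same byte sequence decodes "successfully" under all of them, and a detector can only guess from letter-frequency statistics, which the spammer tricks listed above deliberately skew.

```python
# The same KOI8-R bytes decode without error under several Cyrillic
# codepages, each producing different text; none of the decodes fails,
# so byte-level validity alone cannot identify the true charset.
koi8_bytes = "привет".encode("koi8_r")  # Russian "hello" in KOI8-R

for codec in ("koi8_r", "cp1251", "mac_cyrillic"):
    print(f"{codec:12} -> {koi8_bytes.decode(codec)}")
```

All three lines print readable-looking (if nonsensical) Cyrillic: only a statistical model of which letters are plausible can pick the right codepage, and that model is exactly what homoglyph substitution and random padding attack.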