There was recently a discussion on the "charset normalization" feature (see e.g. 
http://markmail.org/message/hvdtbca6lm5tsjtm?q=list:org.apache.spamassassin.users+date:200901+&page=42)
I ran a simple check of the results the Encode::Detect::Detector facility yields.
I manually selected a set of 39 spam messages in Russian (ones that were not 
MIME-encoded, so I could check them by just pressing F3 in mc): 32 in KOI8-R, 
6 in CP-1251 and 1 in UTF-8. I then ran a simple script (sketched below) that 
feeds each message body to Encode::Detect::Detector::detect, and got the following:
- among the 6 CP-1251 messages, 1 was detected as Mac-Cyrillic (which might be 
pardonable when preparing text for humans, since these encodings differ in only 2 
letters, but it may negatively affect text analysis) and 1 was not 
recognized at all (Encode::Detect::Detector::detect returned undef);
- among the 32 KOI8-R messages, 3 were detected as CP-1255 (Hebrew);
- the 1 UTF-8 message was detected correctly.
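For reference, here is roughly the script I used (reconstructed; the file handling 
and the header/body split are only illustrative, the detect() call is what matters):

  #!/usr/bin/perl
  # Feed each message body to the detector and print what it says.
  use strict;
  use warnings;
  use Encode::Detect::Detector;

  for my $file (@ARGV) {
      open my $fh, '<', $file or die "cannot open $file: $!";
      local $/;                       # slurp the whole message
      my $msg = <$fh>;
      close $fh;

      # crude header/body split; fine for the plain, non-MIME samples I used
      my ($body) = $msg =~ /\r?\n\r?\n(.*)/s;
      $body = $msg unless defined $body;

      my $charset = Encode::Detect::Detector::detect($body);
      printf "%s: %s\n", $file, defined $charset ? $charset : "undef";
  }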
Of course, this set is by no means representative, but it illustrates possible 
drawbacks of using the "normalize_charset" option.
Strictly speaking, such results are to be expected, since the tricks widely used by 
spammers (replacing Cyrillic letters with similar-looking Latin ones, replacing 
digits with letters that look like digits and vice versa, adding random 
letter sequences to poison Bayes, etc.) are bound to affect the detection result.
And despite that, SA ignores the "charset=" parameter of the "Content-Type:" header 
field. So my question is: is this just due to a shortage of developer time, or are 
there reasons to avoid using the charset indicated in the header field as the 
source charset for normalization?
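To make the question concrete, here is a rough sketch of what I mean (my own 
illustration, not how SA is actually structured; normalize_body and the simplified 
Content-Type regex are hypothetical): use the declared charset when it is present 
and known to Encode, and fall back to detection only otherwise.

  use strict;
  use warnings;
  use Encode qw(decode);
  use Encode::Detect::Detector;

  sub normalize_body {
      my ($headers, $body) = @_;

      # simplified extraction of the declared charset (ignores header folding etc.)
      my ($declared) = $headers =~ /^Content-Type:[^\n]*;\s*charset="?([\w.-]+)"?/im;

      my $charset = $declared;
      $charset = Encode::Detect::Detector::detect($body)
          unless $charset && Encode::find_encoding($charset);

      return defined $charset
          ? decode($charset, $body, Encode::FB_DEFAULT)   # replace bad bytes, don't die
          : $body;
  }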
