[Bug 7022] New: normalize_charset

bugzilla-daemon Wed, 12 Mar 2014 14:40:24 -0700

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7022


            Bug ID: 7022
           Summary: normalize_charset
           Product: Spamassassin
           Version: unspecified
          Hardware: All
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: spamassassin
          Assignee: [email protected]
          Reporter: [email protected]

Created attachment 5189
  --> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5189&action=edit
SA/Conf.pm - changes at normalize_charset

English is, I believe, the only language written in the Latin alphabet without
any diacritics (except in foreign words). Dutch can get by without diacritics
relatively well, but that is about all, I think. For all other languages
SpamAssassin does not work as well as it could. Yes, when the option
normalize_charset is enabled, practically all of the many foreign charsets are
converted into Unicode, which solves at least one part of the problem - the
multitude of standards. It is not a full fix, though. Unicode brings some
problems of its own (more complex handling, slower regexes, higher memory use,
faster growth of the Bayes database, the need to write and maintain the rules
in Unicode, ...), but the main problem lies elsewhere.

A large share of users in most countries often write their email without any
diacritics, with incorrect diacritics, or with only some of them. The reasons
differ: ignorance on the user's part, technical limitations of the device, OS,
or software they use, compatibility issues, conversions, simple laziness, and
many others. As a result, even if you set the normalize_charset option and
carefully write and maintain your rules in Unicode, they will still very often
miss the target unless you add all possible variants with and without
diacritics (and with partial diacritics as well). The same goes for Bayes. You
may train it on plenty of spam, but it is enough for the spammer to use
different diacritics (including completely wrong ones), and the tokens have to
be learned again, separately from their equivalents.
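
To make the problem concrete, here is a minimal sketch in Python (not
SpamAssassin code). The Czech word and the rule pattern are made up for
illustration, and the lowercasing step is only a stand-in for a Bayes
tokenizer, not the real one.

```python
import re

# Hypothetical rule written with correct Czech diacritics ("loan").
rule = re.compile(r"\bpůjčka\b", re.IGNORECASE)

# The same word as different senders might type it:
# correct, no diacritics, partial diacritics, wrong diacritics.
variants = ["půjčka", "pujcka", "pujčka", "pùjcka"]

# Which spellings does the rule catch?
matches = {v: bool(rule.search(v)) for v in variants}

# Stand-in for Bayes tokens: every spelling is a different token,
# so each one has to be learned separately.
tokens = {v.lower() for v in variants}

for v in variants:
    print(v, "hit" if matches[v] else "miss")
print(len(tokens), "distinct tokens")
```

Only the correctly accented spelling hits the rule, and the four spellings
end up as four unrelated tokens.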

So despite the existence of Unicode, I believe that normalizing email to good
old 7-bit ASCII is still the best way for spam detection. For this reason I
patched SA with some rather minor changes to allow US-ASCII normalizing in
addition to the current UTF-8 normalizing. I used the Text::Unidecode Perl
module, which not only converts accented letters into their ASCII
transcriptions (é => e, ô => o, ü => ue, ...), but also transliterates Greek,
Cyrillic, and practically any other characters, including Asian scripts. It
uses systematic non-contextual transliteration, so for the more exotic scripts
it is not always perfect, but it should be sufficient for the needs of
SpamAssassin (especially for those who primarily need it for European
languages).
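
The idea of folding accented text to ASCII can be sketched with Python's
standard library. This is a stand-in, not Text::Unidecode and not the patch:
fold_to_ascii is a hypothetical helper that only strips combining marks after
Unicode decomposition. It does not transliterate Greek, Cyrillic, or Asian
scripts (such characters are simply dropped here), and it never expands a
letter into several characters.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Strip diacritics from Latin letters; drop anything still non-ASCII."""
    # NFKD splits e.g. "é" into "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Remove the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Characters with no decomposition (Greek, Cyrillic, ...) are dropped,
    # which is where a real transliteration table would be needed.
    return stripped.encode("ascii", "ignore").decode("ascii")


print(fold_to_ascii("é ô"))
print(fold_to_ascii("půjčka"))
```

A full transliteration module replaces the last step with per-character
lookup tables, which is what makes non-Latin scripts usable after folding.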

The type of the normalize_charset setting was changed from Boolean to string.
It can be 0 (no normalizing), 1, UTF, or UTF8 (normalizing to Unicode), or
ASCII (aliases like US-ASCII can be used too). The setting is
case-insensitive. When set to ASCII, the Node.pm module converts
text_visible_rendered and text_invisible_rendered into plain 7-bit ASCII with
only unaccented characters. Because in the original modules the normalizing
happens before HTML entities are decoded, the decoded entities would be left
in UTF-8, so I had to add the ASCII normalizing there as well.
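
The setting values and the entity-ordering issue can be sketched as follows.
This is Python for illustration only, not the Perl in the attachment. The
function names, the returned mode strings, and the exact alias sets are
assumptions (the report names only 0, 1, UTF, UTF8, ASCII, and US-ASCII), and
fold_to_ascii is a simplified stand-in for Text::Unidecode.

```python
import html
import unicodedata

# Assumed value sets; the patch may accept more aliases than these.
UTF8_VALUES = {"1", "utf", "utf8"}
ASCII_VALUES = {"ascii", "us-ascii"}


def parse_normalize_charset(value: str) -> str:
    """Map a case-insensitive setting value to 'none', 'utf8', or 'ascii'."""
    v = value.strip().lower()
    if v == "0":
        return "none"
    if v in UTF8_VALUES:
        return "utf8"
    if v in ASCII_VALUES:
        return "ascii"
    raise ValueError(f"unsupported normalize_charset value: {value!r}")


def fold_to_ascii(text: str) -> str:
    """Simplified stand-in for transliteration: strip combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.encode("ascii", "ignore").decode("ascii")


def render(text: str, mode: str) -> str:
    """Decode HTML entities first, then fold, so entities get folded too."""
    decoded = html.unescape(text)
    return fold_to_ascii(decoded) if mode == "ascii" else decoded


mode = parse_normalize_charset("US-ASCII")
print(render("caf&eacute;", mode))
# Folding before entity decoding leaves the accent in place:
print(html.unescape(fold_to_ascii("caf&eacute;")))
```

The last line shows why folding only before entity decoding is not enough:
the entity is still plain ASCII at that point and turns into an accented
character afterwards.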

Bayes works with the rendered arrays, so the change affects Bayes as well as
the regexes in rules. When writing rules for your language with ASCII
normalizing enabled, you just write them unaccented. You only need to remember
that some special characters are transliterated into multiple characters (for
example, characters with umlauts). In such cases some ambiguity remains: some
people write Müller as Mueller and others as Muller, so you may still need to
write more complex regexes for such cases.
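
As an illustration of such a regex, the sketch below uses Python's re module
rather than a SpamAssassin rule; the pattern name and the sample strings are
made up. It assumes the text has already been normalized to ASCII, so only
the "ue" and "u" spellings remain to be covered.

```python
import re

# After ASCII normalizing there is no "ü" left in the text, so the rule
# only has to accept both common ASCII spellings: "Mueller" and "Muller".
muller = re.compile(r"\bMue?ller\b", re.IGNORECASE)

samples = ["Herr Mueller", "Herr Muller", "Herr Miller"]
hits = [s for s in samples if muller.search(s)]
print(hits)
```

Without the normalizing, the same rule would also need an alternative for
the accented spelling.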

I am attaching the modified files - SA/Conf.pm, SA/Message/Node.pm, and
SA/Util/DependencyInfo.pm (added dependency on the Text::Unidecode module). The
originals are from v3.004.000. Although I have wanted to write this for a long
time, I only stitched it together today, so it is not well tested, and there
may still be some bugs and issues.

-- 
You are receiving this mail because:
You are the assignee for the bug.
