[Bug 4636] Charset normalization plugin support

bugzilla-daemon Mon, 09 Jan 2006 18:08:04 -0800

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4636

------- Additional Comments From [EMAIL PROTECTED]  2006-01-10 03:07 -------
I've commented in the past that I'm opposed to the idea of character set
normalization and that this functionality would need to be isolated with
options and/or a plugin interface.  My reasoning is as follows:

- there is a performance penalty associated with character set recoding
- spam patterns are generally encoded in a limited number of character sets
- therefore, spam catch rates do not increase with recoding (if anything, they
  are quite likely to decrease, since spam tricks can cause us to pick the
  wrong character set)
- however, ham hit rates will INCREASE, since more ham is likely to match the
  pattern (the matches being accidental)
- so, S/O is likely to go down for multi-character set rules (more often than
  not), and performance will go down as well
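
To put rough numbers on the S/O point above, here is a minimal sketch. It
assumes the simplified definition S/O = spam hits / (spam hits + ham hits),
and the hit counts are invented for illustration only; they are not
measurements from any corpus.

```python
def s_over_o(spam_hits: int, ham_hits: int) -> float:
    """Simplified S/O: fraction of a rule's hits that land on spam."""
    total = spam_hits + ham_hits
    if total == 0:
        raise ValueError("rule has no hits, so S/O is undefined")
    return spam_hits / total


# Hypothetical counts for one rule before recoding.
before = s_over_o(spam_hits=980, ham_hits=20)

# Hypothetical counts after recoding: spam hits are unchanged, but more ham
# now matches the pattern by accident.
after = s_over_o(spam_hits=980, ham_hits=60)

print(f"S/O before recoding: {before:.3f}")
print(f"S/O after recoding:  {after:.3f}")
```

With the spam hits held constant, any additional ham hits can only push S/O
down, which is the effect described in the list.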

For these reasons, I am -1 (that is, vetoing) on the current form of this
code, which incurs the performance loss and requires recoding.  I would also
be -1 on requiring any non-utf8 rules to be utf8.

Basically, SpamAssassin does need a better understanding of character sets
and the ability to support more of them well, for rules, descriptions,
rendering, and tokenization, but I see no benefit to recoding messages,
especially since anti-spam patterns are written against a small subset of
possible encodings.
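
As an illustration only, the sketch below shows a pattern written against the
raw bytes of one encoding matching without any recoding, and the same message
failing to match a Unicode pattern once it has been decoded with the wrong
character set.  The sample word, the choice of KOI8-R and Windows-1251, and
all variable names are assumptions made for this example; this is not
SpamAssassin code.

```python
import re

# Hypothetical rule written directly against KOI8-R bytes.
word = "виагра"
raw_rule = re.compile(re.escape(word.encode("koi8-r")))

# Hypothetical message body as it arrives on the wire, encoded in KOI8-R.
message_bytes = f"купить {word} дешево".encode("koi8-r")

# Matching on raw bytes needs no recoding step.
raw_match = raw_rule.search(message_bytes)

# Suppose charset detection is fooled and the bytes are decoded as
# Windows-1251 instead.  Decoding succeeds but yields different letters.
misdecoded = message_bytes.decode("cp1251")
recoded_match = re.search(re.escape(word), misdecoded)

print("raw byte rule matched:", raw_match is not None)
print("rule after wrong-charset recoding matched:", recoded_match is not None)
```

The raw rule hits because the pattern and the message share the same byte
encoding, while the recoded path depends entirely on guessing the character
set correctly.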




------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
