On Tue, 16 Jan 2007 14:06:14 -0500, Theo Van Dinter <[EMAIL PROTECTED]> posted to spamassassin-devel:
> On Tue, Jan 16, 2007 at 10:49:36AM -0800, Karl Chen wrote:
>> As I understand it, this rule is intended to match subject lines
>> that were encoded, then RE-encoded recursively, or perhaps with
>> two different encodings in the same subject line.
>
> It looks for Subject headers which have multiple encodings in it.
>
>> However, this regexp also matches singly-encoded long subject
>> lines, since (from what I've seen) the subject string is broken up
>> and encoded per line:
>>   Subject:
>>    =?iso-8859-1?Q?Automatisk_svar_n=E5r_du_er_borte_fra_kontoret=3A_and_the_?=
>>    =?iso-8859-1?Q?poor=2C_of_the_innocent_person_shall_in?=
>
> This isn't a single encoding, there are two independent encodings.
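To make the distinction concrete, here is a rough Python sketch (not the actual SpamAssassin rule, whose regexp is not reproduced here). It contrasts "more than one encoded-word in the Subject" with "more than one charset/encoding combination in the Subject". The pattern and the function names `encoded_word_kinds`, `has_multiple_encoded_words` and `mixes_encodings` are my own assumptions for illustration only.

```python
import re

# One RFC 2047 encoded-word has the shape =?charset?encoding?text?=
# (hypothetical pattern for illustration, not the rule's real regexp).
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([BbQq])\?([^?\s]*)\?=")


def encoded_word_kinds(subject):
    """Return the set of distinct (charset, encoding) pairs in a raw Subject value."""
    return {
        (m.group(1).lower(), m.group(2).upper())
        for m in ENCODED_WORD.finditer(subject)
    }


def has_multiple_encoded_words(subject):
    """True if the raw Subject contains two or more encoded-words of any kind."""
    return len(ENCODED_WORD.findall(subject)) > 1


def mixes_encodings(subject):
    """True only if the encoded-words disagree on charset or encoding."""
    return len(encoded_word_kinds(subject)) > 1


# The folded Subject from the quoted example: two encoded-words, one charset.
folded = (
    "=?iso-8859-1?Q?Automatisk_svar_n=E5r_du_er_borte_fra_kontoret=3A_and_the_?=\n"
    " =?iso-8859-1?Q?poor=2C_of_the_innocent_person_shall_in?="
)

# A made-up Subject that really does mix two charsets and two encodings.
mixed = "=?iso-8859-1?Q?n=E5r?= =?utf-8?B?bsOlcg==?="

print(has_multiple_encoded_words(folded), mixes_encodings(folded))
print(has_multiple_encoded_words(mixed), mixes_encodings(mixed))
```

A test of the first kind fires on every long non-ASCII Subject that a mail client has folded, while a test of the second kind fires only when the encoded-words actually differ.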
It's already been discussed before in SpamAssassin bug #5026
<http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5026> but I
think it should be emphasized again:

 * This rule is prone to produce a lot of false positives in any
   locale where RFC 2047 Subject encoding is a necessity (basically
   any long header line which is split up like in the example above)
 * The rule's name and description are somewhat inaccurate

The first problem can be fixed by feeding more international
messages into the mass checks, I suppose. The second will become
moot if the mass-check results prove that the rule is flawed ;-)
but in the meantime, perhaps a bug should be filed about that?

/* era */

-- 
The email address era at iki dot fi is heavily spam filtered. If you
want to reach me, see the contact information link on my home page at
<http://www.iki.fi/era/> instead. Just for kicks, imagine what it's
like to get 1000 pieces of spam for each wanted message.
