On Tue, 16 Jan 2007 14:06:14 -0500, Theo Van Dinter <[EMAIL PROTECTED]>
posted to spamassassin-devel:
 > On Tue, Jan 16, 2007 at 10:49:36AM -0800, Karl Chen wrote:
 >> As I understand it, this rule is intended to match subject lines
 >> that were encoded, then RE-encoded recursively, or perhaps with
 >> two different encodings in the same subject line.
 >
 > It looks for Subject headers which have multiple encodings in it.
 >
 >> However, this regexp also matches singly-encoded long subject
 >> lines, since (from what I've seen) the subject string is broken up
 >> and encoded per line:
 >> Subject: 
 >> =?iso-8859-1?Q?Automatisk_svar_n=E5r_du_er_borte_fra_kontoret=3A_and_the_?=
 >> =?iso-8859-1?Q?poor=2C_of_the_innocent_person_shall_in?=
 >
 > This isn't a single encoding, there are two independent encodings.

This has already been discussed in SpamAssassin bug #5026
<http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5026>,
but I think it bears emphasizing again:

  * This rule is prone to false positives in any locale where RFC 2047
    Subject encoding is a necessity (basically, any long header line
    that gets split into several encoded-words, as in the example above)

  * The rule's name and description are somewhat inaccurate
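
To make the distinction concrete, here is a rough Python sketch (not the
rule's actual regexp, which I am not reproducing here) that separates
"more than one encoded-word" from "more than one charset". The function
names and the simplified encoded-word pattern are my own assumptions,
for illustration only:

```python
import re

# Simplified RFC 2047 encoded-word: =?charset?encoding?encoded-text?=
# (an approximation for illustration, not a full RFC 2047 parser)
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


def encoded_word_charsets(raw_subject):
    """Return the lowercased charset of each encoded-word, in order."""
    return [m.group(1).lower() for m in ENCODED_WORD.finditer(raw_subject)]


def has_multiple_encoded_words(raw_subject):
    """True for any Subject that contains more than one encoded-word."""
    return len(encoded_word_charsets(raw_subject)) > 1


def has_mixed_charsets(raw_subject):
    """True only if the encoded-words use more than one distinct charset."""
    return len(set(encoded_word_charsets(raw_subject))) > 1


# The folded Subject quoted above: two encoded-words, one charset.
subject = (
    "=?iso-8859-1?Q?Automatisk_svar_n=E5r_du_er_borte_fra_kontoret=3A_and_the_?=\n"
    " =?iso-8859-1?Q?poor=2C_of_the_innocent_person_shall_in?="
)

print(has_multiple_encoded_words(subject))
print(has_mixed_charsets(subject))
```

A check along the lines of has_multiple_encoded_words() fires on every
long, folded non-ASCII Subject, which is the false positive described
above; has_mixed_charsets() is closer to what the rule's name suggests.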

The first problem can be fixed by feeding more international messages
into the mass checks, I suppose.

The second will become moot if the mass-check results prove that the
rule is flawed ;-) but in the meantime, perhaps a bug should be filed
about that?

/* era */

-- 
The email address era     the contact information   Just for kicks, imagine
at iki dot fi is heavily  link on my home page at   what it's like to get
spam filtered.  If you    <http://www.iki.fi/era/>  1000 pieces of spam for
want to reach me, see     instead.                  each wanted message.
