Header rule performance

Robert Menschel 11 Aug 2004 05:04:46 -0000

One of the longer SARE rules in our header rule set is

header    SARE_HEAD_SPAM           ALL =~ /(?:Error-path|Rot|X-(?:BounceTrace|Camp...|ClientHost|cross|Contact|CS-IP|E(?:[Mm]ail)?|Encoding-Version|ENVID|EXP32-SerialNo|Find|[Ii][Mm]?|INFO_.Z|JLH|L-C|LIDCode|Mailid|MailingID|Message-Info|Misc_ID|mlcipher|mlmsgid|mpm|ms|ntc|PMG-.+|POPFile-Link|Rec|RMD-Text|SP-Track-ID|srk|Text-Classification|TID|T2-Posting-ID|Tnz-Problem-Type|Trans|Vig|WCMailID|yd)):/
describe  SARE_HEAD_SPAM           Message headers used which identify spam
score     SARE_HEAD_SPAM           2.222
#stype    SARE_HEAD_SPAM           spamp 
#hist     SARE_HEAD_SPAM           June 5 2004: Added X-T2-Posting-ID
#hist     SARE_HEAD_SPAM           Aug 10 2004: Added several more headers
#counts   SARE_HEAD_SPAM           3260s/0h of 58338 corpus (33610s/24728h RM) 08/07/04
#counts   SARE_HEAD_SPAM           2143s/1h of 32586 corpus (9341s/23245h JH) 06/10/04
#counts   SARE_HEAD_SPAM           731s/3h of 17050 corpus (14617s/2433h MY) 08/08/04


An alternative form of this same rule would be:
header    __SARE_HEAD_SPAM_01      exists:Error-path
header    __SARE_HEAD_SPAM_02      exists:Rot
  ...
header    __SARE_HEAD_SPAM_xx      exists:X-T2-Posting-ID
  ...
meta      SARE_HEAD_SPAM           __SARE_HEAD_SPAM_01 || __SARE_HEAD_SPAM_02 || ...

I suspect that the "exists" version would be more efficient and use fewer
resources.  My guess is that SA records which headers are present as, or
shortly after, it first reads the email, so each "exists" test is
simply a boolean "have we seen it?" lookup, while the regex at the top
requires a full scan of the headers to see if any of them match.
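
To make the guess above concrete, here is a rough Python sketch of the two strategies. It is only an illustration of the idea, not SpamAssassin's actual implementation (which is Perl, and whose internals are exactly what I am asking about). The header list is a small subset taken from the rule, the function names are made up for the example, and I have assumed line-anchored, case-insensitive matching so that both forms give the same answer.

```python
import re

# Small illustrative subset of the header names from SARE_HEAD_SPAM.
SPAM_HEADERS = ["Error-path", "Rot", "X-BounceTrace", "X-ClientHost",
                "X-T2-Posting-ID"]

# Regex form: one alternation searched across the whole header block.
# Assumption: anchored to line starts and case-insensitive, which the
# original rule is not.
PATTERN = re.compile(
    r"^(?:%s):" % "|".join(re.escape(h) for h in SPAM_HEADERS),
    re.MULTILINE | re.IGNORECASE,
)


def regex_hit(raw_headers):
    """Scan the full header text for any listed header name."""
    return PATTERN.search(raw_headers) is not None


def parse_header_names(raw_headers):
    """Parse once, collecting the lowercased name of every header seen."""
    names = set()
    for line in raw_headers.splitlines():
        # Continuation lines start with whitespace and add no new name.
        if line and not line[0].isspace() and ":" in line:
            names.add(line.split(":", 1)[0].strip().lower())
    return names


def exists_hit(seen_names):
    """'exists' form: each test is a set lookup, combined with OR."""
    return any(h.lower() in seen_names for h in SPAM_HEADERS)
```

The point of the sketch is only that `parse_header_names` runs once per message, after which every "exists"-style test is a cheap lookup, whereas `regex_hit` has to walk the header text each time. Whether SA really behaves this way is the open question.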

Can anyone confirm this?  If so, I'll rework SARE_HEAD_SPAM into an
"exists" format for the release expected shortly.

Bob Menschel
