One of the longer SARE rules in our header rule set is:

header SARE_HEAD_SPAM ALL =~ /(?:Error-path|Rot|X-(?:BounceTrace|Camp...|ClientHost|cross|Contact|CS-IP|E(?:[Mm]ail)?|Encoding-Version|ENVID|EXP32-SerialNo|Find|[Ii][Mm]?|INFO_.Z|JLH|L-C|LIDCode|Mailid|MailingID|Message-Info|Misc_ID|mlcipher|mlmsgid|mpm|ms|ntc|PMG-.+|POPFile-Link|Rec|RMD-Text|SP-Track-ID|srk|Text-Classification|TID|T2-Posting-ID|Tnz-Problem-Type|Trans|Vig|WCMailID|yd)):/
describe SARE_HEAD_SPAM Message headers used which identify spam
score SARE_HEAD_SPAM 2.222
#stype SARE_HEAD_SPAM spamp
#hist SARE_HEAD_SPAM June 5 2004: Added X-T2-Posting-ID
#hist SARE_HEAD_SPAM Aug 10 2004: Added several more headers
#counts SARE_HEAD_SPAM 3260s/0h of 58338 corpus (33610s/24728h RM) 08/07/04
#counts SARE_HEAD_SPAM 2143s/1h of 32586 corpus (9341s/23245h JH) 06/10/04
#counts SARE_HEAD_SPAM 731s/3h of 17050 corpus (14617s/2433h MY) 08/08/04
An alternative form of this same rule would be:

header __SARE_HEAD_SPAM_01 exists:Error-path
header __SARE_HEAD_SPAM_02 exists:Rot
...
header __SARE_HEAD_SPAM_xx exists:X-T2-Posting-ID
...
meta SARE_HEAD_SPAM __SARE_HEAD_SPAM_01 || __SARE_HEAD_SPAM_02 || ...

I suspect that the "exists" version would be more efficient and use fewer resources. My reasoning is that I believe SA identifies the headers as, or shortly after, it first reads the email, so each "exists" test is simply a boolean "have we seen it?", while the regex at the top requires a full scan of the headers to see if any of them match.

Can anyone confirm this? If so, I'll rework SARE_HEAD_SPAM into an "exists" format for the release expected shortly.

Bob Menschel
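
P.S. To make the reasoning above concrete, here is a rough Python sketch of the two strategies. It is only an illustration of my guess, not a description of how SA is actually implemented. The header list is a shortened subset of the rule, and the function names (regex_hit, parse_header_names, exists_hit) are hypothetical.

```python
import re

# Shortened subset of the header names in SARE_HEAD_SPAM (illustration only).
SPAM_HEADERS = ["Error-path", "Rot", "X-T2-Posting-ID", "X-Mailid"]

# Regex form: one alternation searched across the whole header text,
# roughly analogous to "header ... ALL =~ /.../".
SPAM_RE = re.compile(r"(?:Error-path|Rot|X-(?:T2-Posting-ID|Mailid)):")


def regex_hit(raw_headers: str) -> bool:
    """Scan the complete header text for any of the names followed by ':'."""
    return SPAM_RE.search(raw_headers) is not None


def parse_header_names(raw_headers: str) -> set:
    """Collect header names once, as a parser might while reading the mail.

    Assumption: names are compared case-insensitively and folded
    continuation lines (starting with space or tab) are skipped.
    """
    names = set()
    for line in raw_headers.splitlines():
        if line[:1] in (" ", "\t"):
            continue  # continuation of the previous header's value
        name, sep, _value = line.partition(":")
        if sep:
            names.add(name.strip().lower())
    return names


def exists_hit(seen_names: set) -> bool:
    """'exists' form: one set lookup per header name, OR-ed like the meta rule."""
    return any(name.lower() in seen_names for name in SPAM_HEADERS)


if __name__ == "__main__":
    sample = "Received: from a\nX-T2-Posting-ID: 123\nSubject: hi\n"
    print(regex_hit(sample), exists_hit(parse_header_names(sample)))
```

One thing the sketch brings out: the regex is unanchored, so in this Python version it also fires when one of the names followed by a colon appears inside another header's value (for example a Subject containing "Rot:"), while the lookup only matches real header names. Whether the same holds inside SA is something I'd want confirmed along with the performance question.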
