Hi again, John -

It's a good idea to add the real-time rules to the beginning of the filter. I didn't realize that would have such an impact. And the (?=x) tip is a good one too; thank you for that.

As far as Bayes, don't get me started! :) I work for an Email Service Provider and about 2 million messages go through our servers every day, so we have Bayes turned off because it would be too computationally expensive. I wish we could turn it on - it'd certainly make my job easier - but The Boss says no. Go figure. Autolearn, same story.

Having such a large organization makes it difficult to strike a balance and avoid false positives, too. We have one client who deals with credit reports and refinancing and stuff, and pretty much every message that goes to their mailboxes looks like spam. We just have them set up to skip all our financial rules. Luckily we don't have too many doctors' offices, so we needn't really concern ourselves with legitimate Viagra email! :)

I've scoured the net looking for rulesets from others that already have a lot of this stuff in there, but I haven't found any rulesets more recent than 2006. A lot of what I've seen is irrelevant - do you know a good place to get custom rulesets? I feel like there's someone else out there who already figured out how to write a rule that captures all those "learn a new language" spam messages so I don't need to just score "Language" as +4! :)

-----Original Message-----
From: John Hardin [mailto:jhar...@impsec.org]
Sent: Wednesday, April 24, 2013 1:53 PM
To: users@spamassassin.apache.org
Subject: RE: More longer rules or fewer shorter ones?

On Wed, 24 Apr 2013, Andrew Talbot wrote:

> John,
>
> Thanks for your prompt response!
>
> A lot of the rules are big jumbles of rules we are generating in real
> time and adding to as things come in. Like I said in my original
> question, we have them separated into separate cf files by category,
> and within those cf files they are separated by score. So we have just
> absolutely gargantuan rules for (for instance) sex words that we
> assign a 5 to automatically. There's also lists of specific words and
> phrases that we see in real-time spam (like the *$#ing garden hose
> spam).
>
> We are just tacking new rules on to the end to make them easier to
> read. Our rules properly work with (this|that|theother) if it hits any
> one of the words.
>
> Should we maybe have separate rules for all the phrases, since they're
> longer strings? There's rules in there that are like
>
> RULE Subject =~ /you.have.(new|waiting|blah|blah).*(ecard|message|calendar.invite|blah|blah))|(garden|new|stretchy|bendy|whatever).*(hose|vaccum|other.thing) . . . . . .
>
> Etc. It goes on. .. My syntax is terrible and obviously those aren't
> the actual rules but the point is that it's a bunch of "Or" for some
> really long strings. Should I separate them out and have those long
> (this|that|theother) rules be only for single words?

Simple alternations on phrases are equivalent to simple alternations on single words with respect to the performance concerns. Performance is governed more by the number of alternations and the presence of repetition and .* than by their raw length. You might want to limit the total number of alternations per rule.

Another performance optimization would be to ensure all of the alternations in a given rule start with the same letter, and put (?=x) before the list of alternatives, e.g. /\b(?=x)(x1|x2|x3|x4)/ so that the engine can more easily skip positions that cannot match.

If they are simple alternations, it also depends on how you want to score them. For "poison pill" words or phrases, sure, a long alternation with a high score will be pretty efficient. I'd suggest tacking new hits onto the *front* of the list of alternatives, though, as it's reasonable to assume a spam run will use the same phrasing for a while, then change.

> Alternately, should I separate out the rules with embedded pipes in
> them (like in the example above)?

Yeah, avoiding nested alternatives where possible will help.

Is Bayes not catching things like this?

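To make the (?=x) and front-of-the-list suggestions above concrete, here is a minimal sketch of what such a rule might look like. The rule name, the phrases, and the score are made up for illustration and do not come from any real ruleset; the non-capturing group (?:...) is an extra choice of this sketch, not something the advice above requires.

```
# Hypothetical example, not a real rule.
# Every alternative starts with "g", so the (?=g) lookahead lets the
# regex engine reject any position that does not begin with "g" before
# it tries the alternatives one by one.
# The phrase seen most recently in a spam run is listed first.
header   LOCAL_SUBJ_G_PHRASES  Subject =~ /\b(?=g)(?:garden.hose|gutter.guard|gift.card)\b/i
describe LOCAL_SUBJ_G_PHRASES  Subject contains a locally flagged phrase starting with "g"
score    LOCAL_SUBJ_G_PHRASES  2.0
```

When a new phrase starting with "g" shows up, it would go at the front of the list, ahead of garden.hose; phrases starting with other letters would go into their own rule with their own lookahead.
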
> -----Original Message-----
> From: John Hardin [mailto:jhar...@impsec.org]
> Sent: Wednesday, April 24, 2013 12:58 PM
> To: users@spamassassin.apache.org
> Subject: Re: More longer rules or fewer shorter ones?
>
> On Wed, 24 Apr 2013, Andrew Talbot wrote:
>
>> Hey, all -
>>
>> I have my customized deployment split up into a bunch of separate CF
>> files (by category) and I have those further split up into rules
>> based on score.
>>
>> So, I have a bunch of stuff like:
>>
>> header RULE_1 Subject =~ /\b(this|that|theother|blah|blah)/i
>> score RULE_1 1
>> describe RULE_1 Rule 1
>>
>> header RULE_2 Subject =~ /\b(foo|bar|etc)/i
>> score RULE_2 2
>> describe RULE_2 Rule 2
>>
>> They are WAY longer than that (and some of them include further
>> nesting of the pipe), but that's the general idea.
>>
>> My question is: is it better performance-wise to have the rules set
>> up like this, or to have each separate thing have its own separate rule?
>
> For performance, with simple lists of variant values having no
> repetition across the list e.g. (x|y|z){n,m}, if the most-likely
> variants are listed first a "big" rule will (generally-speaking)
> process less than a set of individual rules for each variant. The big
> rule will stop trying as soon as a match for one variant is found,
> whereas all of the individual rules must be tried regardless of what
> other rules may have hit. RULE_1 won't try matching "that",
> "theother", "blah", etc. if "this" matches.
>
> Ignoring performance, the alternatives are *not* syntactically equivalent.
> Absent "tflags multiple", RULE_1 would hit only once on a subject
> containing both "this" and "that" and "theother", but if you split it
> up into separate rules *each* would hit. This likely would affect scoring.

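The scoring difference described in the quoted message can be sketched side by side. All rule names and scores below are hypothetical, and in practice only one of the two layouts would be loaded, not both.

```
# Hypothetical example, not real rules.

# Layout A: one combined rule. Without "tflags multiple" it hits at most
# once, so a subject containing both "this" and "that" adds 1.0.
header   LOCAL_SUBJ_COMBINED  Subject =~ /\b(?:this|that|theother)\b/i
describe LOCAL_SUBJ_COMBINED  Subject contains any one of the listed words
score    LOCAL_SUBJ_COMBINED  1.0

# Layout B: one rule per word. Every rule is tried on every message and
# each one that matches adds its own score, so the same subject
# containing both "this" and "that" adds 1.0 + 1.0.
header   LOCAL_SUBJ_THIS      Subject =~ /\bthis\b/i
describe LOCAL_SUBJ_THIS      Subject contains "this"
score    LOCAL_SUBJ_THIS      1.0

header   LOCAL_SUBJ_THAT      Subject =~ /\bthat\b/i
describe LOCAL_SUBJ_THAT      Subject contains "that"
score    LOCAL_SUBJ_THAT      1.0
```

Which layout is preferable depends on whether several listed words in one subject should count as stronger evidence than a single one.
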
--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Maxim IX: Never turn your back on an enemy.
-----------------------------------------------------------------------
 328 days since the first successful private support mission to ISS (SpaceX)