RE: More longer rules or fewer shorter ones?

Andrew Talbot Wed, 24 Apr 2013 11:16:48 -0700

Hi again, John -

It's a good idea to add the realtime rules to the beginning of the filter. I
didn't realize that would have such an impact. And the (?=x) tip is a good
one too; thank you for that.

As far as Bayes, don't get me started! :)  I work for an Email Service
Provider and about 2 million messages go through our servers every day, so
we have Bayes turned off because it would be too computationally expensive.
I wish we could turn it on - it'd certainly make my job easier - but The
Boss says no. Go figure. Autolearn, same story. 

Having such a large organization makes it a difficult balance to avoid false
positives, too. We have one client who deals with credit reports and
refinancing and stuff and pretty much every message that goes to their
mailboxes looks like spam. We just have them set up to avoid all our
financial rules. 

Luckily we don't have too many doctors' offices so we needn't really concern
ourselves with legitimate Viagra email! :) 

I've scoured the net looking for rulesets from others that already have a
lot of this stuff in there, but I haven't found any rulesets newer than 2006.
A lot of what I've seen is irrelevant - do you know a good place to get custom
rulesets? I feel like there's someone else out there who has already figured
out how to write a rule that captures all those "learn a new language" spam
messages so I don't need to just score "Language" as +4! :)

-----Original Message-----
From: John Hardin [mailto:jhar...@impsec.org] 
Sent: Wednesday, April 24, 2013 1:53 PM
To: users@spamassassin.apache.org
Subject: RE: More longer rules or fewer shorter ones?

On Wed, 24 Apr 2013, Andrew Talbot wrote:

> John,
>
> Thanks for your prompt response!
>
> A lot of the rules are big jumbles of rules we are generating in real 
> time and adding to as things come in. Like I said in my original 
> question, we have them separated into separate cf files by category, 
> and within those cf files they are separated by score. So we have just 
> absolutely gargantuan rules for (for instance) sex words that we assign a
> 5 to automatically.
> There's also lists of specific words and phrases that we see in 
> real-time spam (like the *$#ing garden hose spam).
>
> We are just tacking new rules on to the end to make them easier to 
> read. Our rules properly work with (this|that|theother) if it hits any 
> one of the words.
>
> Should we maybe have separate rules for all the phrases, since they're 
> longer strings? There are rules in there that are like:
>
> RULE Subject =~ /you.have.(new|waiting|blah|blah).*(ecard|message|calendar.invite|blah|blah)|(garden|new|stretchy|bendy|whatever).*(hose|vacuum|other.thing) ...
>
> Etc. It goes on... My syntax is terrible and obviously those aren't 
> the actual rules but the point is that it's a bunch of "Or" for some 
> really long strings. Should I separate them out and have those long 
> (this|that|theother) rules be only for single words?

Simple alternations on phrases are equivalent to simple alternations on
single words with respect to the performance concerns. Performance is more
governed by the number of alternations and the presence of repetition and
.* than their raw length. You might want to limit the total number of
alternations per rule.

Another performance optimization would be to ensure all of the alternations
in a given rule start with the same letter, and put (?=x) before the list of
alternatives, e.g. /\b(?=x)(x1|x2|x3|x4)/, so that the engine can skip more
easily.
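
A minimal sketch of the (?=x) idea, written in Python rather than the Perl
regex engine SpamAssassin uses. The word list is made up for illustration, and
this only checks that the guarded pattern matches the same subjects as the
plain one; it does not measure speed.

```python
import re

# Hypothetical word list; every alternative starts with the same letter.
words = ["viagra", "vicodin", "valium"]
alternation = "|".join(re.escape(w) for w in words)

# Plain alternation anchored at a word boundary.
plain = re.compile(r"\b(" + alternation + r")", re.I)
# Same list, with a one-character lookahead in front of the group.
guarded = re.compile(r"\b(?=v)(" + alternation + r")", re.I)

subjects = ["Cheap Valium today", "Meeting moved to 3pm", "very nice vicodin"]
# Pair up (plain result, guarded result) for each subject line.
results = [(bool(plain.search(s)), bool(guarded.search(s))) for s in subjects]
```

The lookahead consumes nothing, so the two patterns should agree on every
input; any benefit comes from failing early at positions that do not start
with the shared letter.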

If they are simple alternations, it also depends on how you want to score
them.

For "poison pill" words or phrases, sure, a long alternation with a high
score will be pretty efficient. I'd suggest tacking new hits onto the
*front* of the list of alternatives, though, as it's reasonable to assume a
spam run will use the same phrasing for a while, then change.
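
As a rough sketch of that ordering, assuming the rule text is generated from a
list of phrases (the helper name and the phrases here are hypothetical, and
Python stands in for whatever actually builds the cf files):

```python
import re

def build_rule(phrases):
    """Join phrases into one alternation; earlier entries are tried first."""
    return r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")"

phrases = ["old phrase", "older phrase"]   # hypothetical existing list
phrases.insert(0, "garden hose")           # newest hit goes to the front
pattern = build_rule(phrases)

# The newest phrase is now the first alternative the engine tries.
match = re.search(pattern, "new garden hose deal", re.I)
```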

> Alternately, should I separate out the rules with embedded pipes in 
> them (like in the example above)?

Yeah, avoiding nested alternatives where possible will help.
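
For illustration only, here is one way the split might look, using Python
regexes and made-up patterns in the shape of the example above. In
SpamAssassin terms the two flat patterns would be two separate rules.

```python
import re

# Hypothetical nested rule: two phrase families joined by a top-level "|".
nested = re.compile(
    r"(you.have.(new|waiting).*(ecard|message))"
    r"|((garden|stretchy).*(hose|vacuum))",
    re.I,
)

# The same coverage as two flat patterns, one per phrase family.
flat = [
    re.compile(r"you.have.(?:new|waiting).*(?:ecard|message)", re.I),
    re.compile(r"(?:garden|stretchy).*(?:hose|vacuum)", re.I),
]

subjects = ["You have new ecard", "Stretchy garden hose!", "Lunch on Friday?"]
nested_hits = [bool(nested.search(s)) for s in subjects]
flat_hits = [any(p.search(s) for p in flat) for s in subjects]
```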

Is Bayes not catching things like this?

> -----Original Message-----
> From: John Hardin [mailto:jhar...@impsec.org]
> Sent: Wednesday, April 24, 2013 12:58 PM
> To: users@spamassassin.apache.org
> Subject: Re: More longer rules or fewer shorter ones?
>
> On Wed, 24 Apr 2013, Andrew Talbot wrote:
>
>> Hey, all -
>>
>> I have my customized deployment split up into a bunch of separate CF 
>> files (by category) and I have those further split up into rules 
>> based on score.
>>
>> So, I have a bunch of stuff like:
>>
>> header RULE_1 Subject =~ /\b(this|that|theother|blah|blah)/i
>> score RULE_1 1
>> describe RULE_1 Rule 1
>>
>> header RULE_2 Subject =~ /\b(foo|bar|etc)/i
>> score RULE_2 2
>> describe RULE_2 Rule 2
>>
>> They are WAY longer than that (and some of them include further 
>> nesting of the pipe), but that's the general idea.
>>
>> My question is: is it better performance-wise to have the rules set 
>> up like this, or to have each separate thing have its own separate rule?
>
> For performance, with simple lists of variant values having no 
> repetition across the list e.g. (x|y|z){n,m}, if the most-likely 
> variants are listed first a "big" rule will (generally-speaking) 
> process less than a set of individual rules for each variant. The big 
> rule will stop trying as soon as a match for one variant is found, 
> whereas all of the individual rules must be tried regardless of what 
> other rules may have hit. RULE_1 won't try matching "that", "theother",
> "blah", etc. if "this" matches.
>
> Ignoring performance, the alternatives are *not* semantically equivalent.
> Absent "tflags multiple", RULE_1 would hit only once on a subject 
> containing both "this" and "that" and "theother", but if you split it 
> up into separate rules *each* would hit. This likely would affect scoring.
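
A small sketch of that scoring difference, in Python rather than SpamAssassin
itself. The subject line and the score of 1 per rule are assumptions for
illustration, and the combined rule is modelled as hitting at most once
(i.e. no "tflags multiple").

```python
import re

subject = "this and that and theother"
words = ["this", "that", "theother"]

# One combined rule: counted as a single hit no matter how many words match.
combined = re.compile(r"\b(" + "|".join(words) + r")", re.I)
combined_score = 1 if combined.search(subject) else 0

# One rule per word, each scored 1: every rule that matches adds to the total.
split_score = sum(1 for w in words if re.search(r"\b" + w, subject, re.I))
```

With these assumed scores the combined rule contributes 1 and the split rules
contribute 3 for the same subject.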

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Maxim IX: Never turn your back on an enemy.
-----------------------------------------------------------------------
  328 days since the first successful private support mission to ISS (SpaceX)
