Re: Need help with several things in SA

Matt Kettler Mon, 09 Oct 2006 21:59:55 -0700

Steve Lake wrote:
>         Ok, I've got several pesky problems that won't go away and I
> need some help.  On some emails it automatically flags some as ham and
> says "autolearn=ham" and others that say "autolearn=no".  I'm guessing
> that the autolearn feature isn't always working.  Is there a way I can
> completely turn it off?  I know there used to be a way, but I can't
> figure it out in the newer version.
>
>         Second item, and this may be related to the bayes poisoning
> that Marc Perkel mentioned in his email.  I'm seeing a lot of mixed
> emails come through.  IE, the gif spam as people call it.  Any way to
> deal with this?
>
>         Third item.  I've noticed on a bunch of the spam messages
> coming through, that many of them have several common elements.  I'm
> wondering which would be safe to bump to a 5 to automatically score
> these as spam with just that credential.  
None of the below are safe for that. It really takes a HIGHLY accurate
rule for this, and none of the below are 100%.
> The ones I'm considering are as follows:
>
> MIME_HTML_ONLY_## (I saw a variety of numbers at the end of this one
> ranging from nothing up to 32.  Is there some correlation to which
> does what and which is best to use?)
Double check that. There's no such rule in SA. There's MIME_HTML_ONLY,
and MIME_HTML_##_##, but no MIME_HTML_ONLY_##.


The numbered ones indicate the percentage of the message that is HTML.
MIME_HTML_ONLY indicates that the message has only text/html sections.

Looking at the S/O column in STATISTICS-set3.txt, we can see how well
these rules performed in the last mass-check:
http://spamassassin.apache.org/full/3.1.x/dist/rules/STATISTICS-set3.txt

MIME_HTML_ONLY has an S/O of 0.905, meaning that 90.5% of the messages
it matches are spam, and 9.5% are nonspam.
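
As a rough illustration of how to read that number, here is a minimal
sketch that treats S/O as "spam hits divided by all hits". The function
name and the hit counts are made up for the example, and the mass-check
tooling may normalize for corpus size differently, so take it as a
reading aid rather than the actual calculation.

```python
def spam_over_overall(spam_hits: int, ham_hits: int) -> float:
    """Fraction of a rule's hits that were spam (hypothetical helper)."""
    total = spam_hits + ham_hits
    if total == 0:
        raise ValueError("rule matched no messages")
    return spam_hits / total


# Made-up counts, chosen only to reproduce the 0.905 figure quoted above.
so = spam_over_overall(spam_hits=905, ham_hits=95)
print(f"S/O = {so:.3f}")
```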

> HTML_IMAGE_ONLY_## (ditto here)
These rules mean the message contains HTML and an image, with less than
## * 100 bytes of body text, i.e. HTML_IMAGE_ONLY_04 indicates an image
and less than 400 bytes of body text.
These rules are fairly accurate in 04, 08 and 12, but are still on the
order of 99.5% spam, 0.5% nonspam. Not good enough for a 5 point score
unless you want to lose some nonspam.
> RCVD_NUMERIC_HELO
Means the HELO string passed to your server was a numeric IP, not a
hostname. 98.4% spam, 1.6% nonspam.
> URIBL_OB_SURBL
The message had a URL in the body that is listed in the Outblaze URIBL
hosted at surbl.org.

99.9% spam, 0.1% nonspam. IMHO, not good enough for a 5.0, but if you're
willing to take chances with FPs, go for it. I myself find that the OB
list occasionally lists weblinks from large-volume, solicited commercial
mailings, but this is somewhat rare. If you make it a 5 pointer, you'll
lose those. That's a trade-off you'll have to make, but the perceptron
assigned this rule its score for a reason. Think LONG and HARD before
changing it. (And read below.)
> EXTRA_MPART_TYPE
Invalid MIME encoding with an extra Content-Type entry. 91.2% spam, 8.8%
nonspam.
> URIBL_SBL
The body contains a URL for which one of the nameservers that serves the
DNS records for the domain is listed in SBL. 98.6% spam, 1.4% nonspam.
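
To put the percentages in this message side by side, the sketch below
turns each quoted S/O figure into "nonspam messages per 1000 hits". The
helper name is hypothetical, the HTML_IMAGE_ONLY entry uses the rough
99.5% figure given for the 04/08/12 variants, and it assumes that a rule
scored at 5 would flag everything it matches.

```python
# S/O values as quoted in this message; nothing here is measured anew.
QUOTED_SO = {
    "MIME_HTML_ONLY": 0.905,
    "HTML_IMAGE_ONLY_04/08/12": 0.995,  # rough figure from the text
    "RCVD_NUMERIC_HELO": 0.984,
    "URIBL_OB_SURBL": 0.999,
    "EXTRA_MPART_TYPE": 0.912,
    "URIBL_SBL": 0.986,
}


def nonspam_per_hits(so: float, hits: int = 1000) -> int:
    """Expected nonspam messages among `hits` matches (hypothetical helper)."""
    return round((1.0 - so) * hits)


for rule, so in QUOTED_SO.items():
    print(f"{rule:26s} ~{nonspam_per_hits(so):3d} nonspam per 1000 hits")
```

Even the best rule on the list would still be expected to catch about
one legitimate message per thousand hits if it alone decided the verdict.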
>
> That's the only real common ones I saw that for certain were on each
> spam message.  Which would be good to automatically flag as a 5 and
> are there any others I should use?
Really, if there were any rules that were "good" to automatically flag
as 5, they'd already be set that way.

The scores are assigned the way they are based on how the rules perform
in the real world. This is a very rigorous test, and is generally much
better at picking scores for the rules than any individual person. The
process winds up accommodating not just the behavior of the rule, but
also other rules that wind up "pairing up" on the same messages. It's
not an infallible process, but it's not common for it to under- or
over-score a rule without VERY good reason in the test data.

>   I appreciate any guidance you guys can give me.  Thanks. 

I would give up looking for a single "magic bullet" rule that's good for
it all. No such rule exists. Focus on either:

1) using Bayes, and training it well.

or

2) finding and testing some of the add-on rulesets to expand the
diversity of rules in your SA set.  Generally speaking, you'll get fewer
FPs from 2 rules that score 2.5 each on a particular spam than you will
from 1 rule scoring 5.0.
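
A minimal sketch of why splitting the weight helps, assuming a message
is flagged once its summed rule scores reach 5.0. RULE_A and RULE_B are
hypothetical rule names, and the threshold and scores are illustrative
values rather than anything taken from a real ruleset.

```python
REQUIRED_SCORE = 5.0  # assumed spam threshold for this example


def is_flagged(hits, scores, required=REQUIRED_SCORE):
    """Sum the scores of the rules that hit and compare to the threshold."""
    total = sum(scores.get(name, 0.0) for name in hits)
    return total >= required


one_big_rule = {"RULE_A": 5.0}                      # single 5.0 rule
two_small_rules = {"RULE_A": 2.5, "RULE_B": 2.5}    # weight split in two

nonspam_hits = {"RULE_A"}           # a legitimate mail that trips one rule
spam_hits = {"RULE_A", "RULE_B"}    # a spam that trips both

print(is_flagged(nonspam_hits, one_big_rule))     # nonspam is flagged: an FP
print(is_flagged(nonspam_hits, two_small_rules))  # nonspam stays below 5.0
print(is_flagged(spam_hits, two_small_rules))     # spam still reaches 5.0
```

With the weight split, a nonspam message has to trip both rules at once
before it is flagged, which is less likely than tripping either one.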
