Re: Scoring Philosophy?

David Jones Tue, 21 Nov 2017 15:45:03 -0800

On 11/21/2017 05:11 PM, Jerry Malcolm wrote:

On 11/21/2017 3:52 PM, Bowie Bailey wrote:
On 11/21/2017 4:01 PM, Jerry Malcolm wrote:
I have been using SpamAssassin in my hosting environment for several years. It catches thousands of spam messages (thank you...). But my concern is that it doesn't catch a couple of hundred messages per day. I have the Bayesian filter working, with a simple way to train it. I have sent over 5000 training messages to it over the past 6-8 months. I have set up a non-forwarding caching DNS, and the blacklist tests are working.
My question is with the scoring. I understand the general theory of adding up 'votes' by all of the spam tests to determine if it's indeed spam. But it appears that no one test, no matter how certain it is, has enough power to qualify the message as spam. The Bayesian filter can say it's 80-100% certain it's spam. But some other test decides it's not and even sometimes has a negative number that subtracts the Bayesian score from the total. But my biggest problem is that even if it's scored as coming from a BL URL, but if Bayesian doesn't also say it's spam, then it's apparently still not spam. I spend a couple of hours every day trying to tell the Bayesian filter about today's new strains of spam that it hasn't yet seen.
Am I missing something obvious? Is this just the way it works, and I should expect to have to run a couple of hundred missed spams through the Bayesian filter each day? My threshold score was originally set to 5.0. I don't even remember where that came from. I dropped it to 4.0 a couple of years ago, and that's where it is now. But (see example output below) when BL says it's spam and adds 2.5, then Bayesian says it's 40-60% spam and adds 0.8, and it's got a small font and gets another 0.5, and all other tests are neutral... it's now 3.8 and STILL not spam with a threshold of 4.0.
Can someone tell me if this is by design and/or if my configuration should be adjusted? I realize I can easily drop the threshold to 1.0 or 2.0. But that would probably just shift the problem to tons of false-positives which obviously is not a good solution.
The general philosophy is that no one rule should be able to mark a message as spam on its own. However, quite a few of us have bumped the scores beyond that point for rules that we trust. Most commonly the Bayes_99 or Bayes_99+Bayes_999 can be made to score 5+ points if your database is well trained. You can also keep an eye on negative scoring rules that seem to hit too frequently on spams and bump the scores for those.
You can also add third-party rules. I believe the only actively-maintained set at the moment is from KAM.
http://www.pccc.com/downloads/SpamAssassin/contrib/KAM.cf
http://www.pccc.com/downloads/SpamAssassin/contrib/nonKAMrules.cf

I have both of these in use on my server with good results.
You can use trusted blacklists to block spam at the MTA before it even gets to SA. I use zen.spamhaus.org for this on my server.
You didn't specify the version of SA you are running, but you should make sure you are upgraded to the newest version (3.4.1) for best results.
The amount of spam that gets through will be dependent on your mailflow. I usually see 1 or 2 missed spams in my inbox per day. Today, the zen blacklist blocked 33 messages to my inbox and SA marked another 20 as spam while delivering 95 clean messages. I think there might have been one spam that made it to my inbox. I have the threshold set at 4.0 for my mailbox.
Hi Bowie,
Thanks for the quick response. I'm a bit concerned about going in and playing around with scores, etc., considering my minimal knowledge of the overall process at this point. I'll download and install the KAM rules. And I'll need to start logging my percentages of clean/spam/total per account and determine if my results are typical. But I'm just curious what others are using for their installation. Is running SA out-of-the-box simply not done? Is it expected that all serious users will add 3rd party rule plugins and adjust scores?

Mail filtering is very dependent on your location, expected languages, and recipients. SpamAssassin is pretty generic/conservative out of the box to fit in most environments safely. You need to tune it a bit for your mail flow.
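
As a rough sketch of what that tuning can look like, the score bumps Bowie describes could go in a local file such as /etc/mail/spamassassin/local.cf. The numbers below are illustrative assumptions, not recommendations, and they only make sense if the Bayes database is well trained:

```
# Illustrative values only (assumptions); adjust for your own mail flow.
# Threshold at which a message is tagged as spam (4.0 is the value used in this thread).
required_score  4.0

# Let a very confident Bayes verdict carry most of the weight.
# With these values, BAYES_99 + BAYES_999 together add 5.0 points.
score           BAYES_99        4.0
score           BAYES_999       1.0
```

With scores like these a message can be marked as spam on the Bayes result alone, which departs from the default "no single rule decides" philosophy, so it is worth watching for false positives after the change.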

I'm currently being inundated with solar, skin tag removal, 3D organ printing, Shark Tank, and Russian brides spam messages. Seems like some of these could be pretty obvious rules. No matter what I do with Bayesian training, they are still getting through. Just out of curiosity, what rule(s), if any, should be expected to catch spam emails with subjects such as these? Or, said differently.... are these specific subject lines being caught by other users, and if so, what rules are catching them? Is there some way I can quickly add a rule that says if the subject contains 'Skin Tags', score it to 5 (without getting into coding rule plugins, etc)? Or is there some place that is creating these rules for 'pretty obvious' subject lines as the new strains of spam appear?

Absolutely. Start with downloading the KAM.cf into your /etc/mail/spamassassin once a day. Then look at the KAM.cf to find examples on how to make custom rules. It's pretty easy:


https://wiki.apache.org/spamassassin/WritingRules

For example, this can be put in /etc/mail/spamassassin/99_my_rules.cf:

header          SUBJ_INVOICE        Subject =~ /Invoice/i
describe        SUBJ_INVOICE        Subject contains invoice
score           SUBJ_INVOICE        0.2

Then you restart or reload whatever is launching SA (spamd, Amavis, Mimedefang, MailScanner, etc.) to load the new rule. If you are simply using it in a mail client like Thunderbird or Apple Mail, I don't think you have to restart anything since it probably launches SpamAssassin for each email. I am not that familiar with using SA in a mail client.
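
For the 'Skin Tags' subject asked about above, a rule along the same lines might look like the following. This is only a sketch: the rule name LOCAL_SUBJ_SKIN_TAGS and the regex are made up for illustration, and a score of 5.0 lets this one rule mark a message as spam on its own at a threshold of 4.0, so check it against legitimate mail first.

```
# Hypothetical rule name and pattern, for illustration only.
# Matches "skin tag" or "skin tags" in the Subject header, case-insensitively.
header          LOCAL_SUBJ_SKIN_TAGS    Subject =~ /\bskin\s+tags?\b/i
describe        LOCAL_SUBJ_SKIN_TAGS    Subject mentions skin tags
score           LOCAL_SUBJ_SKIN_TAGS    5.0
```

Running "spamassassin --lint" before restarting should flag any syntax errors in the file.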

I still feel like I'm missing something obvious....

Thanks again.

Jerry

On 11/21/2017 3:52 PM, Bowie Bailey wrote:
You can use trusted blacklists to block spam at the MTA before it even gets to SA. I use zen.spamhaus.org for this on my server.

Yes. If you are running a mail filtering server with an MTA like Postfix, you should put as much filtering as possible in the MTA. For example, Postfix has postscreen that is very easy to enable and does wonders without any tuning. Postscreen also has the ability to do weighted RBLs so you can combine the score of multiple RBLs into a threshold. Normally any single RBL hit will block an email but this is too risky. With weighted RBLs, you can set up 20+ RBLs to work together for better overall accuracy.

Greylisting is very effective too if your users can handle the delay on new senders. I rolled out greylisting slowly where most users didn't even know it and now it's helping a lot. You have to exclude Google's mail servers from greylisting.


--
David Jones
