Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

John Hardin Mon, 14 Jan 2013 17:17:07 -0800

On Mon, 14 Jan 2013, Ben Johnson wrote:

> I understand that snowshoe spam may not hit any net tests. I guess my
> confusion is around what, exactly, classifies spam as "snowshoe".


  http://www.spamhaus.org/faq/section/Glossary

Basically, a large number of spambots sending the message so that no one sending IP can be easily tagged as evil.

Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or are they all performed by SA?

Recommendation: consider using the Spamhaus ZEN DNSBL as a hard-reject SMTP-time DNS check in your MTA. It is well-respected and very reliable. One thing it includes is ranges of IP addresses that should not ever be sending email, so it may help reduce snowshoe spam.


  http://www.spamhaus.org/zen/
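
For illustration, a DNSBL check is an ordinary DNS lookup: the client IP's octets are reversed and prepended to the list's zone name. Below is a minimal Python sketch, assuming IPv4 only. The helper names are made up for this example, and in practice the MTA performs this lookup itself. Treating only 127.0.0.x answers as listings is also an assumption of this sketch.

```python
import ipaddress
import socket

ZEN_ZONE = "zen.spamhaus.org"


def dnsbl_query_name(ip, zone=ZEN_ZONE):
    """Build the DNS name a DNSBL client looks up for an IPv4 address."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone=ZEN_ZONE):
    """Return True if the zone answers with a 127.0.0.x address.

    Needs working DNS. A lookup failure is treated as "not listed" here,
    which a real MTA might handle differently.
    """
    try:
        answer = socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False  # NXDOMAIN (not listed) or a DNS failure
    # Assumption: listings come back as 127.0.0.x; anything else is
    # treated as "not a listing" in this sketch.
    return answer.startswith("127.0.0.")
```

For example, dnsbl_query_name("192.0.2.1") yields "1.2.0.192.zen.spamhaus.org", which is the name that actually gets resolved.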

Another tactic that many report good results from is Greylisting. Do you have greylisting in place? Does your userbase demand no delays in mail delivery? In addition to blocking spam from spambots that do not retry, it can delay mail enough for the BLs to get a chance to list new IPs/domains, which can reduce the leakage if you happen to be at the leading edge of a new delivery campaign.


  http://www.greylisting.org/
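
The core idea of greylisting can be sketched in a few lines of Python. This is a toy in-memory model, not any particular greylisting daemon: the class name, the (client IP, sender, recipient) triplet key, and the 300-second delay are all assumptions chosen for illustration.

```python
import time


class Greylist:
    """Toy in-memory greylist keyed on (client IP, sender, recipient)."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay  # seconds a new triplet must wait
        self.clock = clock          # injectable so the logic can be tested
        self.first_seen = {}        # triplet -> time of first attempt

    def check(self, client_ip, sender, recipient):
        """Return "accept", or "tempfail" (meaning: answer with an SMTP 4xx)."""
        triplet = (client_ip, sender.lower(), recipient.lower())
        now = self.clock()
        # Record the first attempt; later attempts keep the original time.
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.min_delay:
            return "accept"
        return "tempfail"
```

A sender that never retries stays at "tempfail" forever, while a legitimate MTA retries after the delay and gets through. That delay is also the window in which the BLs can catch up.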

> Are most/all of the BL services hash-based?


Generally:

        DNSBL: Blacklist of IP addresses
        URIBL: Blacklist of domain and host names appearing in URIs
        EMAILBL: (not widely used) Blacklist of email addresses (e.g.
                phishing response addresses)
        Razor, Pyzor: Blacklist of message content checksums/hashes

> In other words, if a known spam message was added yesterday, will it be considered "snowshoe" spam if the spammer sends the same message today and changes only one character within the body?

No, the diverse IP addresses are the hallmark of "snowshoe", not so much the specific message content. If you see identical or generally-similar (e.g.) pharma spam coming from a wide range of different IP addresses, that's snowshoe.

> If so, then I guess the only remedy here is to focus on why Bayes seems
> to perform so miserably.


Agreed.

> It must be a configuration issue, because I've sa-learn-ed messages that are incredibly similar for two days now and not only do their Bayes scores not change significantly, but sometimes they decrease. And I have a hard time believing that one of my users is sa-train-ing these messages as ham and negating my efforts.

This is why you retain your Bayes training corpora: so that if Bayes goes off the rails you can review your corpora for misclassifications, wipe and retrain. Do you have your training corpora? Or do you discard messages once you've trained them?

_Do_ you allow your users to train Bayes? Do they do so unsupervised or do you review their submissions? And if the process is automated, do you retain what they have provided for training so that you can go back later and do a troubleshooting review?

Do you have autolearn turned on? My opinion is that autolearn is only appropriate for a large and very diverse userbase where a sufficiently "common" corpus of ham can't be manually collected. But then, I don't admin a Really Large Install, so YMMV.

Do you use per-user or sitewide Bayes? If per-user, then you need to make sure that you're training Bayes as the same user that the MTA is running SA as.


What user does your MTA run SA as? What user do you train Bayes as?

One possibility is that the MTA is running SA as a different user than you are training Bayes as, and you have autolearn turned on, and Bayes has been running in its own little world since day one regardless of what you think you're telling it to do.
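
A quick way to see why the user matters is sketched below in Python. It assumes file-based Bayes with the default per-user location under ~/.spamassassin and no bayes_path override, and it does not cover an SQL backend or a content filter that sets its own path. The function names are hypothetical.

```python
import os
import pwd  # Unix-only


def default_bayes_prefix(home_dir):
    """Per-user Bayes path prefix under a home directory (assumed default)."""
    return os.path.join(home_dir, ".spamassassin", "bayes")


def bayes_prefix_for(username):
    """Resolve the assumed default Bayes prefix for a local account."""
    return default_bayes_prefix(pwd.getpwnam(username).pw_dir)


def same_bayes_db(scan_user, train_user):
    """True if both accounts would read and write the same Bayes files."""
    return bayes_prefix_for(scan_user) == bayes_prefix_for(train_user)
```

If the account that scans mail and the account that runs sa-learn resolve to different prefixes, training never reaches the database that scoring actually uses.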

> I have ensured that the spam token count increases when I train these
> messages. That said, I do notice that the token count does not *always*
> change; sometimes, sa-learn reports "Learned tokens from 0 message(s) (1
> message(s) examined)". Does this mean that all tokens from these
> messages have already been learned, thereby making it pointless to
> continue feeding them to sa-learn?


No, it means that a message with that Message-ID has already been learned.
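
The behaviour can be modelled with a toy Python sketch: a learner that remembers which messages it has already processed and skips repeats. This is not SpamAssassin's code. The class name is made up, the use of the Message-ID header as the key (with a content hash as fallback) is a simplifying assumption, and the relearn-on-type-change step reflects my understanding of sa-learn rather than anything stated above.

```python
import hashlib
from email import message_from_string


class SeenTracker:
    """Toy model of a 'seen' database keyed by message identifier."""

    def __init__(self):
        self.seen = {}  # message identifier -> "spam" or "ham"

    def learn(self, raw_message, as_type):
        """Return how many messages were actually learned (0 or 1)."""
        msg = message_from_string(raw_message)
        msgid = (msg.get("Message-ID") or "").strip()
        if not msgid:
            # Assumption for this sketch: fall back to a content hash.
            msgid = hashlib.sha1(raw_message.encode("utf-8")).hexdigest()
        if self.seen.get(msgid) == as_type:
            return 0  # already learned as this type: nothing to do
        # New message, or previously learned as the other type: (re)learn.
        self.seen[msgid] = as_type
        return 1
```

Feeding the same message twice as spam returns 1 and then 0, which corresponds to the "Learned tokens from 0 message(s) (1 message(s) examined)" report.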

> Finally, I added the test you supplied to my SA configuration, restarted
> Amavis, and all messages appear to be tagged with RCVD_IN_HITALL=0.001.


So this proves DNS lookups are indeed working for all messages.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  One death is a tragedy; thirty is a media sensation;
  a million is a statistic.              -- Joseph Stalin, modernized
-----------------------------------------------------------------------
 3 days until Benjamin Franklin's 307th Birthday
