On Mon, 14 Jan 2013, Ben Johnson wrote:

I understand that snowshoe spam may not hit any net tests. I guess my
confusion is around what, exactly, classifies spam as "snowshoe".

  http://www.spamhaus.org/faq/section/Glossary

Basically, a large number of spambots send the same message, so that no single sending IP can easily be tagged as evil.

Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or are they all performed by SA?

Recommendation: consider using the Spamhaus ZEN DNSBL as a hard-reject SMTP-time DNS check in your MTA. It is well respected and very reliable. Among other things, it includes ranges of IP addresses that should never be sending email, so it may help reduce snowshoe spam.

  http://www.spamhaus.org/zen/
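For reference, a DNSBL check is just a DNS lookup on the client IP with its octets reversed and the list's zone appended. In practice your MTA does this for you (Postfix's reject_rbl_client, for example). The Python below is only a minimal sketch of how the query name is built; the function name is made up for illustration, it handles IPv4 only, and it performs no network lookup.

```python
import ipaddress


def dnsbl_query_name(ip: str, zone: str = "zen.spamhaus.org") -> str:
    """Return the DNS name a DNSBL client would query for an IPv4 address.

    Hypothetical helper for illustration only; it does not resolve anything.
    """
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        raise ValueError("this sketch handles IPv4 only")
    # 192.0.2.25 is queried as 25.2.0.192.<zone>
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


print(dnsbl_query_name("192.0.2.25"))  # 25.2.0.192.zen.spamhaus.org
```

If that name resolves to an address record, the IP is listed; if the lookup returns NXDOMAIN, it is not.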

Another tactic that many report good results from is greylisting. Do you have greylisting in place? Does your userbase demand no delays in mail delivery? In addition to blocking spam from spambots that do not retry, it can delay mail long enough for the BLs to list new IPs/domains, which can reduce leakage if you happen to be at the leading edge of a new delivery campaign.

  http://www.greylisting.org/
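The core idea of greylisting fits in a few lines: temporarily reject the first delivery attempt for an unseen (client IP, sender, recipient) triplet, and accept a retry that arrives after a minimum delay. The sketch below is a toy model, not any particular greylisting daemon; the class name, the 300-second delay, and the in-memory dictionary are all assumptions for illustration.

```python
import time


class Greylist:
    """Toy in-memory greylist keyed on (client IP, sender, recipient)."""

    def __init__(self, min_delay: float = 300.0):
        self.min_delay = min_delay  # seconds a triplet must wait (assumed value)
        self.first_seen = {}        # triplet -> time of first attempt

    def check(self, client_ip: str, sender: str, recipient: str, now=None) -> str:
        now = time.time() if now is None else now
        key = (client_ip, sender.lower(), recipient.lower())
        # Record the first attempt; later attempts reuse the original time.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "accept"
        return "tempfail"  # the MTA would answer with a 4xx code here


g = Greylist()
print(g.check("192.0.2.25", "a@example.org", "b@example.net", now=0))
print(g.check("192.0.2.25", "a@example.org", "b@example.net", now=600))
```

A spambot that never retries never gets past the first "tempfail", and a legitimate MTA that retries later is accepted, by which time the BLs may have caught up.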

Are most/all of the BL services hash-based?

Generally:

        DNSBL: Blacklist of IP addresses
        URIBL: Blacklist of domain and host names appearing in URIs
        EMAILBL: (not widely used) Blacklist of email addresses (e.g.
                phishing response addresses)
        Razor, Pyzor: Blacklist of message content checksums/hashes

In other words, if a known spam message was added yesterday, will it be considered "snowshoe" spam if the spammer sends the same message today and changes only one character within the body?

No, the diverse IP addresses are the hallmark of "snowshoe", not so much the specific message content. If you see identical or generally-similar (e.g.) pharma spam coming from a wide range of different IP addresses, that's snowshoe.
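To make the distinction concrete, here is a rough sketch that groups messages by a digest of the body and counts how many distinct IPs sent each one. It uses an exact SHA-256 over whitespace-normalized text, which is an assumption for illustration only: real checksum services use fuzzier schemes, and the function names and the sample log are made up.

```python
import hashlib
from collections import defaultdict


def body_digest(body: str) -> str:
    """Exact digest of a whitespace-normalized, lowercased body (toy scheme)."""
    normalized = " ".join(body.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def sources_per_message(log):
    """Map each body digest to the number of distinct sending IPs."""
    ips = defaultdict(set)
    for ip, body in log:
        ips[body_digest(body)].add(ip)
    return {digest: len(addrs) for digest, addrs in ips.items()}


# Hypothetical sample data: (sending IP, message body)
log = [
    ("192.0.2.1", "Buy pills now"),
    ("198.51.100.7", "Buy pills now"),
    ("203.0.113.9", "Buy pills now"),
    ("192.0.2.1", "Buy pills n0w"),
]
for digest, count in sources_per_message(log).items():
    print(digest[:12], count)
```

One changed character gives a different exact digest, which is a content-matching problem. The same content arriving from many unrelated IPs is the snowshoe pattern.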

If so, then I guess the only remedy here is to focus on why Bayes seems
to perform so miserably.

Agreed.

It must be a configuration issue, because I've sa-learn-ed messages that are incredibly similar for two days now and not only do their Bayes scores not change significantly, but sometimes they decrease. And I have a hard time believing that one of my users is sa-train-ing these messages as ham and negating my efforts.

This is why you retain your Bayes training corpora: so that if Bayes goes off the rails you can review your corpora for misclassifications, wipe and retrain. Do you have your training corpora? Or do you discard messages once you've trained them?

_Do_ you allow your users to train Bayes? Do they do so unsupervised or do you review their submissions? And if the process is automated, do you retain what they have provided for training so that you can go back later and do a troubleshooting review?

Do you have autolearn turned on? My opinion is that autolearn is only appropriate for a large and very diverse userbase where a sufficiently "common" corpus of ham can't be manually collected. But then, I don't admin a Really Large Install, so YMMV.

Do you use per-user or sitewide Bayes? If per-user, then you need to make sure that you're training Bayes as the same user that the MTA is running SA as.

What user does your MTA run SA as? What user do you train Bayes as?

One possibility is that the MTA is running SA as a different user than you are training Bayes as, and you have autolearn turned on, and Bayes has been running in its own little world since day one regardless of what you think you're telling it to do.
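A quick way to picture the mismatch: with file-based Bayes and no bayes_path override, each user's database lives under that user's own ~/.spamassassin/bayes, so two different users mean two different databases. The sketch below only compares the paths; the home directories are hypothetical, and it assumes the default file-based storage (no SQL backend, no sitewide bayes_path).

```python
from pathlib import PurePosixPath


def default_bayes_path(home: str) -> PurePosixPath:
    """Default per-user Bayes path prefix, assuming no bayes_path override."""
    return PurePosixPath(home) / ".spamassassin" / "bayes"


def same_bayes_db(scan_home: str, train_home: str) -> bool:
    """True if scanning and training would use the same database files."""
    return default_bayes_path(scan_home) == default_bayes_path(train_home)


# Hypothetical homes: the scanner's service account vs. the admin's shell.
print(default_bayes_path("/var/lib/amavis"))
print(same_bayes_db("/var/lib/amavis", "/root"))
```

If the two paths differ, training is going into a database that the scanner never reads.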

I have ensured that the spam token count increases when I train these
messages. That said, I do notice that the token count does not *always*
change; sometimes, sa-learn reports "Learned tokens from 0 message(s) (1
message(s) examined)". Does this mean that all tokens from these
messages have already been learned, thereby making it pointless to
continue feeding them to sa-learn?

No, it means that a message with that Message-ID has already been learned.
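The behavior can be pictured as a "seen" table keyed on Message-ID: a message that was already learned with the same classification is examined but contributes nothing new. The toy learner below is an illustrative sketch only, not SA's actual implementation (which does more, such as handling messages without a usable Message-ID); the class and method names are made up.

```python
from email import message_from_string


class ToyLearner:
    """Toy model of 'already learned' tracking keyed on Message-ID."""

    def __init__(self):
        self.seen = {}  # Message-ID -> "spam" or "ham"

    def learn(self, raw: str, as_spam: bool) -> int:
        """Return how many messages were actually learned from (0 or 1)."""
        msg = message_from_string(raw)
        msgid = (msg.get("Message-ID") or "").strip()
        label = "spam" if as_spam else "ham"
        if msgid and self.seen.get(msgid) == label:
            return 0  # examined, but already learned with this label
        if msgid:
            self.seen[msgid] = label
        return 1


raw = "Message-ID: <abc@example.org>\nSubject: test\n\nbody\n"
learner = ToyLearner()
print(learner.learn(raw, as_spam=True))
print(learner.learn(raw, as_spam=True))
```

So feeding the exact same message again is harmless but pointless; feeding a different message with similar content is still worthwhile.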

Finally, I added the test you supplied to my SA configuration, restarted
Amavis, and all messages appear to be tagged with RCVD_IN_HITALL=0.001.

So this proves DNS lookups are indeed working for all messages.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  One death is a tragedy; thirty is a media sensation;
  a million is a statistic.              -- Joseph Stalin, modernized
-----------------------------------------------------------------------
 3 days until Benjamin Franklin's 307th Birthday
