On 04.06.2012 21:09, Matt Simerson wrote:
On Jun 4, 2012, at 6:40 AM, Stevan Bajić wrote:

On 02.06.2012 23:23, Matt Simerson wrote:
On Jun 2, 2012, at 11:15 AM, Jared Johnson wrote:

Yup. Part of the motivation for this plugin was to short circuit all the
intermediate plugins and handlers so I can feed the message to sa-learn
and dspam. Until dspam is trained, that's a very important step in
training it. But there's no gain in validating the HELO name, SPF,  or
DomainKeys. This plugin and associated changes add that flexibility while
reducing the code and complexity of the plugins.
It might not be fair to say there's *no* gain.  Our HELO validation and
SPF plugins (we don't have a DKIM plugin at the moment, for shame) now do
their lookups unconditionally and add headers to the message so that our
bayes engine can tokenize the headers themselves.
Wait until you actually run DomainKeys before you decide if it's a gain. It requires more resources than I'd
have guessed. And, surprisingly to me, the most reliably signed messages are spam, or come from very big
"mostly good" senders. I've seen enough ham senders with broken DomainKeys that I don't consider it
reliable enough to reject or train on. The same goes for SPF. Spammers are far more likely to have good SPF
than legit mailers: spammers automate their SPF records, so they don't make typos like
"ip:127..." (should be "ip4:127...") or omit the spaces between the declarations and the
~all. The errors are common enough, and affect ham often enough, that I'm tempted to fix them up in the SPF
plugin before validation.
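That kind of pre-validation cleanup could look roughly like the following sketch. This is not code from the SPF plugin; `repair_spf` is a hypothetical helper, and it only handles the two typo classes mentioned above:

```python
import re

def repair_spf(record):
    """Best-effort cleanup of common SPF record typos (illustrative only)."""
    # "ip:1.2.3.4" should be "ip4:1.2.3.4" (an existing "ip4:"/"ip6:" is untouched)
    record = re.sub(r'\bip:(?=\d)', 'ip4:', record)
    # insert the missing space before a terminal "~all" / "?all"
    record = re.sub(r'(?<=\S)([~?]all)$', r' \1', record)
    return record

print(repair_spf('v=spf1 ip:127.0.0.0/8 ~all'))   # -> v=spf1 ip4:127.0.0.0/8 ~all
print(repair_spf('v=spf1 ip4:1.2.3.4~all'))       # -> v=spf1 ip4:1.2.3.4 ~all
```

Whether rewriting a publisher's record before evaluating it is a good idea is debatable; it trades strict standards compliance for fewer false failures on ham.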

And SPF breaks legit forwarding servers that don't implement SRS. So I don't 
reject or train based on SPF alone.

I too have a custom HELO validation plugin (it needs more work, but I'll 
contribute it eventually), and it may actually provide some gain, but I think 
it's safe to say the one presently in plugins is not a gain.

How do you measure if the resources expended are worth the (likely small) 
benefit you would get from the additional bayes tokens? That will determine if 
it's a gain or not. I've placed my bet on the table, and I'd be pleased to be 
proven wrong.

Bayes is a little bit of a black box to me, so I can't really quantify
just how useful this is, but I'd say it's greater than zero. Dspam even
treats headers in a special way to ensure that their usefulness is
maximized.
Usefulness != gain.  There may be some gain, but I'm not familiar with bayes 
enough either. But I know someone who is. The dspam author (Stevan Bajić) 
noticed my plugin, contacted me, and will be submitting some improvements, like 
talking directly to the dspam server.  I'm BCC'ing him on this message, and 
hopefully we'll get a more informed opinion.
I don't 100% understand what you are trying to do with bayes. Is this 'reaper'
plugin adding some additional data to the header of the mail, and is the other
person posting questioning whether that additional header is beneficial to the
bayes engine?

Care to explain a little more to me what this is all about?
Hi again Stevan,
Hello Matt,

Here's an example of what I'm doing:

49237 250 mail.theartfarm.com Hi S0106001560c96a0b.wp.shawcable.net 
[50.72.202.227]; I am so happy to meet you.
49237 dispatching MAIL FROM:<no-re...@shawcable.net>
49237 (mail) badmailfrom: skip, naughty
49237 (mail) resolvable_fromhost: skip, naughty
49237 (mail) sender_permitted_from: skip, naughty
49237 250<no-re...@shawcable.net>, sender OK - how exciting to get mail from 
you!
49237 dispatching RCPT TO:<u...@example.com>
49237 (rcpt) rhsbl: pass
49237 (rcpt) dnsbl: skip, naughty
49237 (rcpt) resolvable_fromhost: skip, naughty
49237 (rcpt) sender_permitted_from: skip, naughty
49237 (rcpt) badrcptto: skip, naughty
49237 (rcpt) qmail_deliverable: skip, naughty
49237 (rcpt) rcpt_ok: pass: example.com found in morercpthosts
49237 250<u...@example.com>, recipient ok
49237 dispatching DATA
49237 354 go ahead
49237 (data_post) basicheaders: skip, naughty
49237 (data_post) bogus_bounce: skip, not a null sender
49237 (data_post) domainkeys: skip, naughty
49237 (data_post) spamassassin: skip, naughty
49237 (data_post) dspam: training naughty as spam
49237 spooling message to disk
49237 (data_post) virus::clamdscan: skip, naughty
49237 (data_post) naughty: disconnecting
49237 552 Blocked - see http://cbl.abuseat.org/lookup.cgi?ip=50.72.202.227
49237 click, disconnecting
49237 (post-connection) connection_time: 0.575 s.
86740 cleaning up after 49237

First, I renamed the reaper plugin to 'naughty', but it does exactly the same
things. It lets other plugins identify a message as naughty, and then the
'naughty' plugin handles disposal of the message at some predetermined time. I
have added immunity tests to all the other plugins, so that they'll skip
processing if one of the immunity conditions is met. (See is_immune() here:
https://github.com/smtpd/qpsmtpd/pull/20/files) You can see above that most of
the plugins have skipped processing, saving much time and CPU.
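The pattern is simple enough to sketch. This is an illustrative Python model, not the actual Perl implementation from the pull request; the connection-notes mechanism and return strings merely mirror the log output above:

```python
class Connection:
    """Toy stand-in for a qpsmtpd connection carrying per-connection notes."""
    def __init__(self):
        self.notes = {}

def mark_naughty(conn, reason):
    # an early plugin (dnsbl, karma, ...) flags the connection instead of
    # rejecting it immediately
    conn.notes['naughty'] = reason

def is_immune(conn):
    # later plugins call this first and skip their expensive work if flagged
    return 'naughty' in conn.notes

def spamassassin_plugin(conn, message):
    if is_immune(conn):
        return 'skip, naughty'
    return 'scanned'  # the expensive scan would happen here

conn = Connection()
mark_naughty(conn, 'dnsbl hit')
print(spamassassin_plugin(conn, 'msg'))  # -> skip, naughty
```

The point of deferring the rejection is that the flagged message can still be routed to cheap consumers (like dspam training) while everything expensive is skipped.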

In typical usage, I intend to run with 'naughty reject rcpt', so that dnsbl and
karma hits are disposed of much sooner.  A week ago I truncated my dspam tables
and started over. I have a script that feeds my users' ham and spam into dspam
to train it, but I'm fairly aggressive about cleaning out their spam folders, so
users don't have much of a spam corpus. So, while dspam was solid at
identifying ham, it wasn't recognizing spam at all. And most users don't bother
dragging their spam into their spam folder. So I need to train dspam. Training
just the messages that spamassassin recognized works, but it takes a very long
time.

So I changed to 'naughty reject data_post', so that naughty messages would be
rejected after the body arrived and had been fed to dspam, as you can see in this
line:

49237 (data_post) dspam: training naughty as spam

Overnight, dspam's spam detection accuracy improved from about 1% to 60%.  In
another day or two, I expect training will no longer be necessary.  But again,
I'm learning dspam as I go.
Now I understand what you are trying to do.


It might make a lot of sense to add a header with the MAIL FROM information, 
before feeding it to dspam.  Is it worth the effort?
No. IMHO it is not worth the effort.

  Is there a standard header name DSPAM looks for?
You mean for the MAIL FROM information?


Any advice you offer is appreciated.
Blindly feeding any 'naughty' mail into DSPAM or SA can result in over-training. I often see people thinking that the more they train, the better it is for the software. But this is not the case in the real world. It is better to train less, and only when it is needed. Spam is usually easier to capture than ham.

If you really want to automatically train 'naughty' mail then I would do it the following way:

1) I would create a global, merged group in DSPAM (I don't know if you are aware of DSPAM's group capabilities/concept?)
2) Any 'naughty' mail training would go automatically to that group
3) You need to ensure that the tokens for that global, merged group do not get too biased towards spam. So you must feed ham messages too!

Personally I would even go one step further and mimic TONE training (train on error or near error) with an asymmetric thickness threshold for spam/ham.

I used to do that kind of stuff in the past as well. I used a honey pot and trained DSPAM using TONE training with an asymmetric thickness threshold for spam/ham, with double-sided training. The problem with this is that unsupervised training can (very often) lead to more problems than not doing unsupervised training at all.
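The TONE decision itself can be sketched like this. This is an illustrative model, not DSPAM's implementation; the margin values and the function name are invented for the example. "Asymmetric thickness" just means the near-error band around the decision boundary is wider on one side than the other:

```python
def should_train(probability_spam, true_label, spam_margin=0.05, ham_margin=0.20):
    """TONE-style decision: train on error, or when the score falls inside
    an asymmetric 'thickness' band around the 0.5 decision boundary.
    The margin values here are invented for illustration."""
    predicted = 'spam' if probability_spam >= 0.5 else 'ham'
    if predicted != true_label:
        return True                                    # train on error, always
    if predicted == 'spam':
        return probability_spam < 0.5 + spam_margin    # near-error, spam side
    return probability_spam > 0.5 - ham_margin         # near-error, ham side

print(should_train(0.40, 'spam'))  # True: misclassified -> always train
print(should_train(0.52, 'spam'))  # True: correct but inside the thin spam band
print(should_train(0.95, 'spam'))  # False: confidently correct, skip training
```

With a wider band on the ham side, a confident-but-wrongish ham verdict gets corrected sooner than a confident spam verdict, which is one way to bias the trainer toward avoiding false negatives without retraining on everything.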

If you really need spam messages to boost your capture rate, then go here -> http://untroubled.org/spam/ <- download the last bunch of months and use dspam_train to train your DSPAM instance. The corpus from untroubled.org is pretty good. If you look at the data, you will find from time to time that he has classified a newsletter as spam, but mostly the corpus is very clean and accurate.

If you don't trust the untroubled.org corpora, then go and make two directories on your file system:

~/ham
~/spam

In ~/ham you add NN (where NN should be > 1000, if possible) messages that you have manually verified to be innocent. In ~/spam you add the same number of manually verified spam messages. Then:

1) Use dspam_train to train a certain DSPAM user (let's call that user 'testmatt').
2) After you are finished with training, download a bunch of months from untroubled.org and extract them (let's say you use 2012-06.7z). That should give you a directory 2012/06.
3) Now check how many of those messages are, according to your token data, not spam. Usually most messages will be properly identified as spam.
4) If you have the time, check each message that DSPAM claims is not spam, and if it is indeed spam, use dspam --source=corpus --class=spam to learn the message as spam.
5) Don't forget to check the messages that your DSPAM claims are ham as well. If they are spam, I would learn them as spam with dspam --source=inoculation --class=spam.

When you are finished with that month, take the next one and look at how much difference you have there. I am very confident that you will have a very, very low FP/FN rate with the untroubled.org data. However, for proper training you need to have ham data too. Just spam data will not be sufficient.

You have to think of DSPAM or SA or any other statistical anti-spam solution as being like a kid. It has no concept of what is good and what is bad, so you need to teach it what is good and what is bad. Once the kid knows what is good and what is bad, you are not going to teach it the same thing again, right? You are only going to correct it when it makes errors. So the forced learning that you intend to do with 'naughty' is going to do more damage than good. It would really be better to learn only when needed (aka: when the kid makes a mistake, you go and explain that it has made a mistake, and the kid learns the new situation). Forcing it to learn every time (even when it has given you the right answer) is going to damage the kid. Another form of forced learning is letting the kid train on something without first asking it for the result. So if you blindly train any 'naughty' mail into DSPAM/SA without first asking how DSPAM/SA would have classified that 'naughty' mail, you do more damage than good. Of course that kind of damage is not huge, but over time it can build up and completely destroy your results.


btw: I stopped doing that unsupervised training.


If you want my advice on how to stop a lot of spammers, then I would do the following:
- Make the SMTP banner span more than just one line. Aka:
  220-This is my first line of my SMTPD banner
  220 localhost.localdomain ESMTP qpsmtpd .....
- Have a (configurable) delay between printing the first line of the banner and the second line.
- Every idiot sending before the '220<space>FQDN ESMTP....' line gets rejected (early talker) and gets banned from connecting for the next NN seconds. Every reconnection attempt gets punished with +30 seconds.
- I would run DNSWL checks against the connecting IP.
- If DNSWL has no positive result, then I would run DNSBL checks against the IP. I would use weighted results and block the IP if it reaches a certain score/weight.
- If you want to implement sender whitelisting, then I would do the DNSBL checks after the MAIL FROM stage and first check if the sender is in your whitelist (IMHO this is very dangerous, but people often need that stuff, even if it is easily forged).
- I would maybe do RHSBL as well (IMHO DNSBL is more than enough).
- If it is important to you, then doing something like GeoIP lookups could be interesting for certain users (either to block or whitelist based on continent, region, country). I usually use that data to compute the distance between me and the sender. The bigger the distance, the more likely it is spam (search for SNARE if you need a research paper on that topic).
- I would also run something like p0f and by default add punishing points to each connection coming from a desktop OS (aka: Windows XP, Windows 7, Windows CE, etc.).
- I would also look at whether the connecting IP is from a dynamic/dialup range and add some punishing points to that connection. (I think there are DNSBLs available to identify dialups or end-user DSL lines. If that kind of DNSBL is not available, then I would suggest using the simple regexps from S25R -> http://www.gabacho-net.jp/en/anti-spam/ <- to identify questionable clients.)

If all points from above reach a certain score then I would disconnect the client.
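The weighted-DNSBL part of that scoring could be sketched as follows. The zone names are real, widely used blacklists, but the weights and the reject threshold are made-up example values, and the DNS lookup is stubbed out so the sketch stays self-contained:

```python
# Hypothetical weights; real deployments tune these per list.
DNSBL_WEIGHTS = {
    'zen.spamhaus.org': 5,
    'bl.spamcop.net': 3,
    'dnsbl.sorbs.net': 2,
}
REJECT_SCORE = 5   # example threshold

def dnsbl_score(ip, lookup):
    """Sum the weights of every blacklist that lists `ip`.
    `lookup(ip, zone)` must return True if the IP is listed in that zone."""
    return sum(w for zone, w in DNSBL_WEIGHTS.items() if lookup(ip, zone))

def should_reject(ip, lookup):
    return dnsbl_score(ip, lookup) >= REJECT_SCORE

# demo with a stubbed resolver: the IP is on spamcop and sorbs (3 + 2 = 5)
listed = {('192.0.2.1', 'bl.spamcop.net'), ('192.0.2.1', 'dnsbl.sorbs.net')}
stub = lambda ip, zone: (ip, zone) in listed
print(dnsbl_score('192.0.2.1', stub))    # -> 5
print(should_reject('192.0.2.1', stub))  # -> True
```

The advantage of weighting over treating any single hit as fatal is that an aggressive list alone cannot block a sender; it takes corroboration from several lists (or one high-confidence list) to cross the threshold.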


This should IMHO already block most spam messages from even reaching your queue.

If you use DSPAM then you should not forget to clean unused tokens from time to time. One 'all-in-one' script that can help you do that is this one here -> http://dspam.git.sourceforge.net/git/gitweb.cgi?p=dspam/dspam;a=tree;f=contrib/dspam_maintenance;hb=HEAD <-


ohhh and btw: I would disconnect those bastards as fast as I can. Forget about trying to be smart and doing fancy computations and such; it's mostly useless. Get them off your line and save your resources for connections that have value.

btw2: Every IP passing the above test scenario should not be forced to go through the whole evaluation again the next time. I would cache the result for a bunch of hours and refresh the cache time each time the IP reconnects. I would delete the cache entry after NN hours/minutes without a reconnection from the IP.
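A sliding-TTL cache like that is a few lines of code. This is an illustrative sketch (the class name and the 4-hour TTL are invented); the key behavior is that every cache hit pushes the expiry forward, and an entry that goes unused past its TTL silently disappears:

```python
import time

class VerdictCache:
    """Cache per-IP verdicts; each hit refreshes the expiry (sliding TTL).
    The 4-hour default TTL is an arbitrary example value."""
    def __init__(self, ttl_seconds=4 * 3600):
        self.ttl = ttl_seconds
        self.entries = {}  # ip -> (verdict, expires_at)

    def put(self, ip, verdict, now=None):
        now = time.time() if now is None else now
        self.entries[ip] = (verdict, now + self.ttl)

    def get(self, ip, now=None):
        now = time.time() if now is None else now
        entry = self.entries.get(ip)
        if entry is None or entry[1] < now:
            self.entries.pop(ip, None)      # expired (or never cached)
            return None
        # a reconnecting IP refreshes its own cache entry
        self.entries[ip] = (entry[0], now + self.ttl)
        return entry[0]

cache = VerdictCache(ttl_seconds=60)
cache.put('192.0.2.1', 'pass', now=0)
print(cache.get('192.0.2.1', now=30))   # -> pass (expiry slides to t=90)
print(cache.get('192.0.2.1', now=120))  # -> None (entry expired at t=90)
```

In a real smtpd you would share this across forked connection handlers (e.g. via a small database or shared memory) rather than keeping it in one process's dictionary.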

btw3: I would maybe even run something like fail2ban and use iptables to block those idiots if they try to send spam for XX times in NN minutes/seconds. I would even block IPs trying to do directory attacks and such against my POP3/IMAPv4 server.



Matt


--
Kind Regards from Switzerland,

Stevan Bajić
