On 04.06.2012 21:09, Matt Simerson wrote:
On Jun 4, 2012, at 6:40 AM, Stevan Bajić wrote:

On 02.06.2012 23:23, Matt Simerson wrote:
On Jun 2, 2012, at 11:15 AM, Jared Johnson wrote:

Yup. Part of the motivation for this plugin was to short circuit all the
intermediate plugins and handlers so I can feed the message to sa-learn
and dspam. Until dspam is trained, that's a very important step in
training it. But there's no gain in validating the HELO name, SPF,  or
DomainKeys. This plugin and associated changes add that flexibility while
reducing the code and complexity of the plugins.
It might not be fair to say there's *no* gain.  Our HELO validation and
SPF plugins (we don't have a DKIM plugin at the moment, for shame) now do
their lookups unconditionally and add headers to the message so that our
bayes engine can tokenize the headers themselves.
Wait until you actually run DomainKeys before you decide if it's a gain. It requires more resources than I'd
have guessed. And, surprisingly to me, the most reliably signed messages are spam, or come from very big
"mostly good" senders. I've seen enough ham senders with broken DomainKeys that I don't consider it
reliable enough to reject or train on. The same goes for SPF. Spammers are far more likely to have good SPF
than legit mailers: spammers automate their SPF records, so they don't make typos like
"ip:127..." (should be "ip4:127...") or omit the spaces between the declarations and the
~all. The errors are common enough, and affect ham often enough, that I'm tempted to fix them up in the SPF
plugin before validation.
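That kind of pre-validation cleanup could look roughly like the following sketch. This is not code from the SPF plugin; `repair_spf` is a hypothetical helper, and it only handles the two typo classes mentioned above:

```python
import re

def repair_spf(record):
    """Best-effort cleanup of common SPF record typos (illustrative only)."""
    # "ip:1.2.3.4" should be "ip4:1.2.3.4" (an existing "ip4:"/"ip6:" is untouched)
    record = re.sub(r'\bip:(?=\d)', 'ip4:', record)
    # insert the missing space before a terminal "~all" / "?all"
    record = re.sub(r'(?<=\S)([~?]all)$', r' \1', record)
    return record

print(repair_spf('v=spf1 ip:127.0.0.0/8 ~all'))   # -> v=spf1 ip4:127.0.0.0/8 ~all
print(repair_spf('v=spf1 ip4:1.2.3.4~all'))       # -> v=spf1 ip4:1.2.3.4 ~all
```

Whether rewriting a publisher's record before evaluating it is a good idea is debatable; it trades strict standards compliance for fewer false failures on ham.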

And SPF breaks legit forwarding servers that don't implement SRS. So I don't 
reject or train based on SPF alone.

I too have a custom HELO validation plugin (it needs more work, but I'll 
contribute it eventually), and it may actually provide some gain, but I think 
it's safe to say the one presently in plugins is not a gain.

How do you measure if the resources expended are worth the (likely small) 
benefit you would get from the additional bayes tokens? That will determine if 
it's a gain or not. I've placed my bet on the table, and I'd be pleased to be 
proven wrong.

Bayes is a little bit of a black box to me, so I can't really quantify
just how useful this is, but I'd say it's greater than zero. Dspam even
treats headers in a special way to ensure that their usefulness is
maximized.
Usefulness != gain.  There may be some gain, but I'm not familiar with bayes 
enough either. But I know someone who is. The dspam author (Stevan Bajić) 
noticed my plugin, contacted me, and will be submitting some improvements, like 
talking directly to the dspam server.  I'm BCC'ing him on this message, and 
hopefully we'll get a more informed opinion.
I don't 100% understand what you are trying to do with bayes. Is this 'reaper'
plugin adding some additional data to the header of the mail, and is the other
person posting questioning whether that additional header is beneficial to the
bayes engine?

Care to explain a little more to me what this is all about?
Hi again Stevan,
Hello Matt,

Here's an example of what I'm doing:

49237 250 mail.theartfarm.com Hi S0106001560c96a0b.wp.shawcable.net 
[50.72.202.227]; I am so happy to meet you.
49237 dispatching MAIL FROM:<no-re...@shawcable.net>
49237 (mail) badmailfrom: skip, naughty
49237 (mail) resolvable_fromhost: skip, naughty
49237 (mail) sender_permitted_from: skip, naughty
49237 250<no-re...@shawcable.net>, sender OK - how exciting to get mail from 
you!
49237 dispatching RCPT TO:<u...@example.com>
49237 (rcpt) rhsbl: pass
49237 (rcpt) dnsbl: skip, naughty
49237 (rcpt) resolvable_fromhost: skip, naughty
49237 (rcpt) sender_permitted_from: skip, naughty
49237 (rcpt) badrcptto: skip, naughty
49237 (rcpt) qmail_deliverable: skip, naughty
49237 (rcpt) rcpt_ok: pass: example.com found in morercpthosts
49237 250<u...@example.com>, recipient ok
49237 dispatching DATA
49237 354 go ahead
49237 (data_post) basicheaders: skip, naughty
49237 (data_post) bogus_bounce: skip, not a null sender
49237 (data_post) domainkeys: skip, naughty
49237 (data_post) spamassassin: skip, naughty
49237 (data_post) dspam: training naughty as spam
49237 spooling message to disk
49237 (data_post) virus::clamdscan: skip, naughty
49237 (data_post) naughty: disconnecting
49237 552 Blocked - see http://cbl.abuseat.org/lookup.cgi?ip=50.72.202.227
49237 click, disconnecting
49237 (post-connection) connection_time: 0.575 s.
86740 cleaning up after 49237

First, I renamed the reaper plugin to 'naughty', but it does exactly the same
things. It lets other plugins identify a message as naughty, and then the
'naughty' plugin handles disposal of the message at some predetermined time. I
have added immunity tests to all the other plugins, so that they'll skip
processing if one of the immunity conditions is met. (See is_immune() here:
https://github.com/smtpd/qpsmtpd/pull/20/files) You can see above that most of
the plugins have skipped processing, saving much time and CPU.
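The pattern is simple enough to sketch. This is an illustrative Python model, not the actual Perl implementation from the pull request; the connection-notes mechanism and return strings merely mirror the log output above:

```python
class Connection:
    """Toy stand-in for a qpsmtpd connection carrying per-connection notes."""
    def __init__(self):
        self.notes = {}

def mark_naughty(conn, reason):
    # an early plugin (dnsbl, karma, ...) flags the connection instead of
    # rejecting it immediately
    conn.notes['naughty'] = reason

def is_immune(conn):
    # later plugins call this first and skip their expensive work if flagged
    return 'naughty' in conn.notes

def spamassassin_plugin(conn, message):
    if is_immune(conn):
        return 'skip, naughty'
    return 'scanned'  # the expensive scan would happen here

conn = Connection()
mark_naughty(conn, 'dnsbl hit')
print(spamassassin_plugin(conn, 'msg'))  # -> skip, naughty
```

The point of deferring the rejection is that the flagged message can still be routed to cheap consumers (like dspam training) while everything expensive is skipped.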

In typical usage, I intend to run with 'naughty reject rcpt', so that dnsbl and
karma hits are disposed of much sooner.  A week ago I truncated my dspam tables
and started over. I have a script that feeds my users' ham and spam into dspam
to train it, but I'm fairly aggressive about cleaning out their spam folders, so
users don't have much of a spam corpus. So, while dspam was solid at
identifying ham, it wasn't recognizing spam at all. And most users don't bother
dragging their spam into their spam folder. So I need to train dspam. Training
just the messages that spamassassin recognized works, but it takes a very long
time.

So I changed to 'naughty reject data_post', so that naughty messages would be
rejected after the body arrived and had been fed to dspam, as you can see in this
line:

49237 (data_post) dspam: training naughty as spam

Overnight, dspam's spam detection accuracy improved from about 1% to 60%.  In
another day or two, I expect training will no longer be necessary.  But again,
I'm learning dspam as I go.
Now I understand what you are trying to do.


It might make a lot of sense to add a header with the MAIL FROM information, 
before feeding it to dspam.  Is it worth the effort?
No. IMHO it is not worth the effort.

  Is there a standard header name DSPAM looks for?
You mean for the MAIL FROM information?


Any advice you offer is appreciated.
Blindly feeding any 'naughty' mail into DSPAM or SA can result in over-training. I often see people thinking that the more they train, the better it is for the software. But this is not the case in the real world. It is better to train less, and only when it is needed. Spam is usually easier to capture than ham.

If you really want to automatically train 'naughty' mail then I would do it the following way:

1) I would create a global, merged group in DSPAM (I don't know if you are aware of DSPAM's group capabilities/concept?)
2) Any 'naughty' mail training would go automatically to that group
3) You need to ensure that the tokens for that global, merged group do not get too biased towards spam. So you must feed ham messages too!

Personally I would even go one step further and mimic TONE training (train on error or near error) with an asymmetric thickness threshold for spam/ham.

I used to do that kind of stuff in the past as well. I used a honey pot and trained DSPAM using TONE training with an asymmetric thickness threshold for spam/ham, with double-sided training. The problem with this is that unsupervised training can (very often) lead to more problems than not doing unsupervised training at all.
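The TONE decision itself can be sketched like this. This is an illustrative model, not DSPAM's implementation; the margin values and the function name are invented for the example. "Asymmetric thickness" just means the near-error band around the decision boundary is wider on one side than the other:

```python
def should_train(probability_spam, true_label, spam_margin=0.05, ham_margin=0.20):
    """TONE-style decision: train on error, or when the score falls inside
    an asymmetric 'thickness' band around the 0.5 decision boundary.
    The margin values here are invented for illustration."""
    predicted = 'spam' if probability_spam >= 0.5 else 'ham'
    if predicted != true_label:
        return True                                    # train on error, always
    if predicted == 'spam':
        return probability_spam < 0.5 + spam_margin    # near-error, spam side
    return probability_spam > 0.5 - ham_margin         # near-error, ham side

print(should_train(0.40, 'spam'))  # True: misclassified -> always train
print(should_train(0.52, 'spam'))  # True: correct but inside the thin spam band
print(should_train(0.95, 'spam'))  # False: confidently correct, skip training
```

With a wider band on the ham side, a confident-but-wrongish ham verdict gets corrected sooner than a confident spam verdict, which is one way to bias the trainer toward avoiding false negatives without retraining on everything.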

If you really need spam messages to boost your capture rate, then go here -> http://untroubled.org/spam/ <- download the last bunch of months and use dspam_train to train your DSPAM instance. The corpus from untroubled.org is pretty good. If you look at the data, you will find from time to time that he has classified a newsletter as spam, but mostly the corpus is very clean and accurate.

If you don't trust the untroubled.org corpora, then go and make two directories on your file system:

~/ham
~/spam

In ~/ham you add NN (where NN should be > 1000, if possible) messages that you have manually verified to be innocent. In ~/spam you add the same number of manually verified spam messages. Then:

1) Use dspam_train to train a certain DSPAM user (let's call that user 'testmatt').
2) After you are finished with training, download a bunch of months from untroubled.org and extract them (let's say you use 2012-06.7z). That should give you a directory 2012/06.
3) Now check how many of those messages are, according to your token data, not spam. Usually most messages will be properly identified as spam.
4) If you have the time, check each message that DSPAM claims is not spam, and if it is indeed spam, use dspam --source=corpus --class=spam to learn the message as spam.
5) Don't forget to check the messages that your DSPAM claims are ham as well. If they are spam, I would learn them as spam with dspam --source=inoculation --class=spam.

When you are finished with that month, take the next one and look at how much difference you have there. I am very confident that you will have a very, very low FP/FN rate with the untroubled.org data. However, for proper training you need to have ham data too. Just spam data will not be sufficient.

You have to think of DSPAM or SA or any other statistical anti-spam solution as being like a kid. It has no concept of what is good and what is bad, so you need to teach it what is good and what is bad. Once the kid knows what is good and what is bad, you are not going to teach it the same thing again, right? You are only going to correct it when it makes errors. So the forced learning that you intend to do with 'naughty' is going to do more damage than good. It would really be better to learn only when needed (aka: when the kid makes a mistake, you go and explain that it has made a mistake, and the kid learns the new situation). Forcing it to learn every time (even when it has given you the right answer) is going to damage the kid. Another form of forced learning is letting the kid train on something without first asking it for the result. So if you blindly train any 'naughty' mail into DSPAM/SA without first asking how DSPAM/SA would have classified that 'naughty' mail, you do more damage than good. Of course that kind of damage is not huge, but over time it can build up and completely destroy your results.


btw: I stopped doing that unsupervised training.


If you want my advice on how to stop a lot of spammers, then I would do the following:
- Make the SMTP banner span more than just one line. Aka:
  220-This is my first line of my SMTPD banner
  220 localhost.localdomain ESMTP qpsmtpd .....
- Have a (configurable) delay between printing the first line of the banner and the second line.
- Every idiot sending before the '220<space>FQDN ESMTP....' line gets rejected (early talker) and gets banned from connecting for the next NN seconds. Every reconnection attempt gets punished with +30 seconds.
- I would run DNSWL checks against the connecting IP.
- If DNSWL has no positive result, then I would run DNSBL checks against the IP. I would use weighted results and block the IP if it reaches a certain score/weight.
- If you want to implement sender whitelisting, then I would do the DNSBL checks after the MAIL FROM stage and first check if the sender is in your whitelist (IMHO this is very dangerous, but people often need that stuff, even if it is easily forged).
- I would maybe do RHSBL as well (IMHO DNSBL is more than enough).
- If it is important to you, then doing something like GeoIP lookups could be interesting for certain users (either to block or whitelist based on continent, region, country). I usually use that data to compute the distance between me and the sender. The bigger the distance, the more likely it is spam (search for SNARE if you need a research paper on that topic).
- I would also run something like p0f and by default add punishing points to each connection coming from a desktop OS (aka: Windows XP, Windows 7, Windows CE, etc.).
- I would also look at whether the connecting IP is from a dynamic/dialup range and add some punishing points to that connection. (I think there are DNSBLs available to identify dialups or end-user DSL lines. If that kind of DNSBL is not available, then I would suggest using the simple regexps from S25R -> http://www.gabacho-net.jp/en/anti-spam/ <- to identify questionable clients.)

If all points from above reach a certain score then I would disconnect the client.
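The weighted-DNSBL part of that scoring could be sketched as follows. The zone names are real, widely used blacklists, but the weights and the reject threshold are made-up example values, and the DNS lookup is stubbed out so the sketch stays self-contained:

```python
# Hypothetical weights; real deployments tune these per list.
DNSBL_WEIGHTS = {
    'zen.spamhaus.org': 5,
    'bl.spamcop.net': 3,
    'dnsbl.sorbs.net': 2,
}
REJECT_SCORE = 5   # example threshold

def dnsbl_score(ip, lookup):
    """Sum the weights of every blacklist that lists `ip`.
    `lookup(ip, zone)` must return True if the IP is listed in that zone."""
    return sum(w for zone, w in DNSBL_WEIGHTS.items() if lookup(ip, zone))

def should_reject(ip, lookup):
    return dnsbl_score(ip, lookup) >= REJECT_SCORE

# demo with a stubbed resolver: the IP is on spamcop and sorbs (3 + 2 = 5)
listed = {('192.0.2.1', 'bl.spamcop.net'), ('192.0.2.1', 'dnsbl.sorbs.net')}
stub = lambda ip, zone: (ip, zone) in listed
print(dnsbl_score('192.0.2.1', stub))    # -> 5
print(should_reject('192.0.2.1', stub))  # -> True
```

The advantage of weighting over treating any single hit as fatal is that an aggressive list alone cannot block a sender; it takes corroboration from several lists (or one high-confidence list) to cross the threshold.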


This should IMHO already block most spam messages from even reaching your queue.

If you use DSPAM then you should not forget to clean unused tokens from time to time. One 'all-in-one' script that can help you do that is this one here -> http://dspam.git.sourceforge.net/git/gitweb.cgi?p=dspam/dspam;a=tree;f=contrib/dspam_maintenance;hb=HEAD <-


ohhh and btw: I would disconnect those bastards as fast as I can. Forget about trying to be smart and doing fancy computations and such; it's mostly useless. Get them off your line and save your resources for connections that have value.

btw2: Every IP passing the above test scenario should not be forced to go through the whole evaluation again the next time. I would cache the result for a bunch of hours and refresh the cache time each time the IP reconnects. I would delete the cache entry after NN hours/minutes without a reconnection from the IP.
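A sliding-TTL cache like that is a few lines of code. This is an illustrative sketch (the class name and the 4-hour TTL are invented); the key behavior is that every cache hit pushes the expiry forward, and an entry that goes unused past its TTL silently disappears:

```python
import time

class VerdictCache:
    """Cache per-IP verdicts; each hit refreshes the expiry (sliding TTL).
    The 4-hour default TTL is an arbitrary example value."""
    def __init__(self, ttl_seconds=4 * 3600):
        self.ttl = ttl_seconds
        self.entries = {}  # ip -> (verdict, expires_at)

    def put(self, ip, verdict, now=None):
        now = time.time() if now is None else now
        self.entries[ip] = (verdict, now + self.ttl)

    def get(self, ip, now=None):
        now = time.time() if now is None else now
        entry = self.entries.get(ip)
        if entry is None or entry[1] < now:
            self.entries.pop(ip, None)      # expired (or never cached)
            return None
        # a reconnecting IP refreshes its own cache entry
        self.entries[ip] = (entry[0], now + self.ttl)
        return entry[0]

cache = VerdictCache(ttl_seconds=60)
cache.put('192.0.2.1', 'pass', now=0)
print(cache.get('192.0.2.1', now=30))   # -> pass (expiry slides to t=90)
print(cache.get('192.0.2.1', now=120))  # -> None (entry expired at t=90)
```

In a real smtpd you would share this across forked connection handlers (e.g. via a small database or shared memory) rather than keeping it in one process's dictionary.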

btw3: I would maybe even run something like fail2ban and use iptables to block those idiots if they try to send spam for XX times in NN minutes/seconds. I would even block IPs trying to do directory attacks and such against my POP3/IMAPv4 server.



Matt


--
Kind Regards from Switzerland,

Stevan Bajić
