On 04.06.2012 21:09, Matt Simerson wrote:
On Jun 4, 2012, at 6:40 AM, Stevan Bajić wrote:
On 02.06.2012 23:23, Matt Simerson wrote:
On Jun 2, 2012, at 11:15 AM, Jared Johnson wrote:
Yup. Part of the motivation for this plugin was to short circuit all the
intermediate plugins and handlers so I can feed the message to sa-learn
and dspam. Until dspam is trained, that's a very important step in
training it. But there's no gain in validating the HELO name, SPF, or
DomainKeys. This plugin and associated changes adds that flexibility while
reducing the code and complexity of the plugins.
It might not be fair to say there's *no* gain. Our HELO validation and
SPF plugins (we don't have a DKIM plugin at the moment, for shame) now do
their lookups unconditionally and add headers to the message so that our
bayes engine can tokenize the headers themselves.
Wait until you actually run DomainKeys before you decide if it's a gain. It
requires more resources than I'd have guessed. What surprised me is that the
most reliably signed messages are spam, or come from very big "mostly good"
senders. I've seen enough ham senders with broken DomainKeys that I don't
consider it reliable enough to reject or train on. The same goes for SPF.
Spammers are far more likely to have good SPF than legit mailers. Spammers
automate their SPF records, so they don't make typos like "ip:127..." (should
be "ip4:127...") or omit the spaces between the declarations and the ~all. The
errors are common enough, and affect ham often enough, that I'm tempted to fix
them up in the SPF plugin before validation.
And SPF breaks legit forwarding servers that don't implement SRS. So I don't
reject or train based on SPF alone.
I too have a custom HELO validation plugin (it needs more work, but I'll
contribute it eventually), and it may actually provide some gain, but I think
it's safe to say the one presently in plugins is not a gain.
How do you measure if the resources expended are worth the (likely small)
benefit you would get from the additional bayes tokens? That will determine if
it's a gain or not. I've placed my bet on the table, and I'd be pleased to be
proven wrong.
Bayes is a little bit of a black box to me, so I can't really quantify
just how useful this is, but I'd say it's greater than zero. Dspam even
treats headers in a special way to ensure that their usefulness is
maximized.
Usefulness != gain. There may be some gain, but I'm not familiar enough with
bayes either. But I know someone who is. The dspam author (Stevan Bajić)
noticed my plugin, contacted me, and will be submitting some improvements, like
talking directly to the dspam server. I'm BCC'ing him on this message, and
hopefully we'll get a more informed opinion.
I don't 100% understand what you are trying to do with bayes? Is this 'reaper'
plugin adding some additional data to the header of the mail and the other
person posting is questioning if that additional header is beneficial to the
bayes engine?
Care to explain little more to me what this is all about?
Hi again Stevan,
Hello Matt,
Here's an example of what I'm doing:
49237 250 mail.theartfarm.com Hi S0106001560c96a0b.wp.shawcable.net
[50.72.202.227]; I am so happy to meet you.
49237 dispatching MAIL FROM:<no-re...@shawcable.net>
49237 (mail) badmailfrom: skip, naughty
49237 (mail) resolvable_fromhost: skip, naughty
49237 (mail) sender_permitted_from: skip, naughty
49237 250<no-re...@shawcable.net>, sender OK - how exciting to get mail from
you!
49237 dispatching RCPT TO:<u...@example.com>
49237 (rcpt) rhsbl: pass
49237 (rcpt) dnsbl: skip, naughty
49237 (rcpt) resolvable_fromhost: skip, naughty
49237 (rcpt) sender_permitted_from: skip, naughty
49237 (rcpt) badrcptto: skip, naughty
49237 (rcpt) qmail_deliverable: skip, naughty
49237 (rcpt) rcpt_ok: pass: example.com found in morercpthosts
49237 250<u...@example.com>, recipient ok
49237 dispatching DATA
49237 354 go ahead
49237 (data_post) basicheaders: skip, naughty
49237 (data_post) bogus_bounce: skip, not a null sender
49237 (data_post) domainkeys: skip, naughty
49237 (data_post) spamassassin: skip, naughty
49237 (data_post) dspam: training naughty as spam
49237 spooling message to disk
49237 (data_post) virus::clamdscan: skip, naughty
49237 (data_post) naughty: disconnecting
49237 552 Blocked - see http://cbl.abuseat.org/lookup.cgi?ip=50.72.202.227
49237 click, disconnecting
49237 (post-connection) connection_time: 0.575 s.
86740 cleaning up after 49237
First, I renamed the reaper plugin to 'naughty', but it does exactly the same
things: it lets other plugins identify a message as naughty, and then the
'naughty' plugin handles disposal of the message at some predetermined time. I
have added immunity tests to all the other plugins, so that they'll skip
processing if one of the immunity conditions is met. (See is_immune() here:
https://github.com/smtpd/qpsmtpd/pull/20/files) You can see above that most of
the plugins have skipped processing, saving much time and CPU.
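The skip behaviour visible in the log can be sketched roughly as follows. This
is an illustration in Python, not the qpsmtpd code: the Connection class, the
"whitelisted" condition and run_plugin() are hypothetical names, and the real
immunity conditions are the ones in the pull request linked above.

```python
from dataclasses import dataclass, field


@dataclass
class Connection:
    """Hypothetical per-connection state; not a qpsmtpd structure."""
    notes: dict = field(default_factory=dict)


def is_immune(conn):
    """Return a reason string if later plugins should skip, else None."""
    # Assumed conditions for illustration only.
    if conn.notes.get("naughty"):
        return "naughty"
    if conn.notes.get("whitelisted"):
        return "whitelisted"
    return None


def run_plugin(name, conn, check):
    """Run check(conn) unless the connection is immune; return a log line."""
    reason = is_immune(conn)
    if reason:
        return f"{name}: skip, {reason}"
    return f"{name}: {check(conn)}"
```

Once an earlier plugin has set the naughty note, every later plugin that goes
through run_plugin() returns early without doing its DNS lookups or scans.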
In typical usage, I intend to run with 'naughty reject rcpt', so that dnsbl and
karma hits are disposed of much sooner. A week ago I truncated my dspam tables
and started over. I have a script that feeds my users' ham and spam into dspam
to train it, but I'm fairly aggressive at cleaning out their spam folders, so
users don't have much of a spam corpus. So, while dspam was solid at
identifying ham, it wasn't recognizing spam at all. And most users don't bother
dragging their spam into their spam folder. So I need to train dspam. Training
just the messages that spamassassin recognized works, but it takes a very long
time.
So I changed to 'naughty reject data_post', so that naughty messages would be
rejected after the body arrived and was fed to dspam, as you can see in this
line:
49237 (data_post) dspam: training naughty as spam
Overnight, dspam's spam detection accuracy improved from about 1% to 60%. In
another day or two, I expect training will no longer be necessary. But again,
I'm learning dspam as I go.
Now I understand what you are trying to do.
It might make a lot of sense to add a header with the MAIL FROM information,
before feeding it to dspam. Is it worth the effort?
No. IMHO it is not worth the effort.
Is there a standard header name DSPAM looks for?
You mean for the MAIL FROM information?
Any advice you offer is appreciated.
Blindly feeding any 'naughty' mail into DSPAM or SA can result in
over-training. I often see people thinking that the more they train, the
better it is for the software. But this is not the case in the real
world. It is better to train less, and only when it is needed.
Spam is usually easier to capture than Ham.
If you really want to automatically train 'naughty' mail then I would do
it the following way:
1) I would create a global, merged group in DSPAM (I don't know if you
are aware of DSPAM's group capabilities/concept?)
2) Any 'naughty' mail training would go automatically to that group
3) You need to ensure that the tokens for that global, merged group do
not get too biased towards Spam. So you MUST feed Ham messages too!
Personally I would even go one step further and mimic TONE training (train
on error or near error) with an asymmetric thickness threshold for spam/ham.
I used to do that kind of stuff in the past as well. I used a honey pot
and trained DSPAM using TONE training with an asymmetric thickness
threshold for spam/ham and with double sided training. The problem is
that unsupervised training very often creates more problems than you
would have without it.
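The "train on error or near error" decision with different thresholds for
spam and ham could look something like the sketch below. It is not DSPAM's
implementation. The function name, the confidence scale (0.0 to 1.0) and the
two threshold values are assumptions chosen only to show the shape of the rule.

```python
# Hypothetical thresholds: a correct verdict below this confidence counts
# as a "near error". They differ per class, which is the asymmetric part.
NEAR_ERROR_BELOW = {"spam": 0.60, "ham": 0.80}


def should_train(predicted, confidence, actual):
    """Decide whether one message is worth training on.

    predicted  -- class the filter chose ("spam" or "ham")
    confidence -- filter's confidence in that choice, 0.0 to 1.0
    actual     -- class the message is believed to belong to
    """
    if predicted != actual:
        return True                       # plain error: always train
    return confidence < NEAR_ERROR_BELOW[actual]   # near error
```

A confidently and correctly classified message is left alone, so the filter
is asked first and only corrected or reinforced when it was wrong or unsure.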
If you really need SPAM messages to boost your capture rate then go here
-> http://untroubled.org/spam/ <- download the last few months and
use dspam_train to train your DSPAM instance. The corpus from
untroubled.org is pretty good. If you look at the data you will
find that from time to time he has classified a newsletter as spam, but
mostly the corpus is very clean and accurate.
If you don't trust the untroubled.org corpora then go and make two
directories on your file system:
~/ham
~/spam
In ham you add NN (where NN should be > 1000, if possible) messages that
you have manually verified to be innocent. In spam you add the same
number of manually verified spam messages. Then you use dspam_train to
train a certain DSPAM user (let's call that user 'testmatt'). After you
are finished with training, download a few months from untroubled.org
and extract them (let's say you use 2012-06.7z). That should give you a
directory 2012/06. Now check how many of those messages are (according
to your token data) not SPAM. Usually most messages will be properly
identified as SPAM. If you have the time then check again each message
that DSPAM claims is not SPAM, and if it is indeed Spam then use dspam
--source=corpus --class=spam to learn the message as spam. Don't forget
to check messages that your DSPAM is claiming to be Ham as well. If they
are Spam then I would learn them as Spam with dspam --source=inoculation
--class=spam.
When you are finished with that month, take the next one and see
how much difference you have there. I am very confident that you
will have a very, very, very low FP/FN rate with the untroubled.org
data. However... for proper training you need to have ham data too. Just
spam data will not be sufficient. You have to think of DSPAM or SA or
any other statistical anti-spam solution as a kid. It has no
concept of what is good and what is bad. So you need to teach it what is
good and what is bad. After that kid knows what is good and what is bad
you are not going to teach it the same thing again. Right? You are only
going to correct it when it makes errors. So the forced learning
that you intend to do with 'naughty' is going to do more damage than
good. It really would be better to train only when it is needed (aka:
when the kid makes a mistake you explain that it has made a mistake
and the kid learns the new situation). Forcing it to learn every
time (even if it has given you the right answer) is going to
damage the kid. Another form of forced learning is when you let the kid
train on something without first asking for the result. So if you
blindly train any 'naughty' mail into DSPAM/SA without first asking how
DSPAM/SA would have classified it then you do more damage
than good. Of course that kind of damage is not huge but over time
it can build up and completely destroy your results.
btw: I stopped doing that unsupervised training.
If you want my advice how to stop a lot of spammers then I would do the
following:
- Make the SMTP banner span more than just one line. Aka:
220-This is my first line of my SMTPD banner
220 localhost.localdomain ESMTP qpsmtpd .....
- Have a delay (configurable) between printing the first line of the
banner and the second line of the banner
- Every idiot sending before the '220<space>FQDN ESMTP....' line gets
rejected (early talker) and gets banned from connecting for the next NN
seconds. Every reconnection attempt gets punished with +30 seconds.
- I would run DNSWL checks against the connecting IP
- If DNSWL had no positive result then I would run DNSBL checks against
the IP. I would use weighted results and block the IP if it reaches a
certain score/weight.
- If you want to implement sender whitelisting then I would do those
DNSBL checks after the MAIL FROM stage and first check if the sender is
in your white list (IMHO this is very dangerous but people often need
that stuff, even if it is easily forged).
- I would maybe do RHSBL as well (IMHO DNSBL is more than enough).
- If it is important to you then doing something like GeoIP lookups
could be interesting for certain users (either to block or whitelist
based on continent, region, country). I usually use that data to compute
the distance between me and the sender. The bigger the distance is the
more likely it is spam (search for SNARE if you need a research paper on
that topic).
- I would run as well something like p0f and by default add punishing
points to each connection coming from a desktop OS (aka: Windows XP,
Windows 7, Windows CE, etc....)
- I would also check whether the connecting IP is from a dynamic/dialup
range and add some punishing points to that connection (I think there
are DNSBLs available to identify dialups or end-user DSLx lines. If that
kind of DNSBL is not available then I would suggest using the simple
regexp from S25R -> http://www.gabacho-net.jp/en/anti-spam/ <- to
identify questionable clients)
If the points from all the checks above add up to a certain score then I
would disconnect the client.
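A rough sketch of how the checks in the list above could be combined into one
score. The signal names, weights and cut-off are made up for illustration and
would need tuning against real traffic. The lookups themselves (DNSWL, DNSBL,
p0f, GeoIP) are assumed to have happened already and are passed in as
booleans.

```python
# Hypothetical weights and cut-off; tune these against your own traffic.
WEIGHTS = {
    "dnsbl_hit": 5.0,    # listed on one or more DNSBLs
    "desktop_os": 2.0,   # p0f says the client is a desktop OS
    "dynamic_ip": 2.0,   # dialup / end-user DSL range
    "far_away": 1.0,     # large GeoIP distance to the sender
}
REJECT_AT = 6.0


def connection_score(signals):
    """Sum the weights of all signals that fired for this connection."""
    score = 0.0
    for name, weight in WEIGHTS.items():
        if name == "dnsbl_hit" and signals.get("dnswl_hit"):
            continue  # a DNSWL hit means the DNSBL stage is not run
        if signals.get(name):
            score += weight
    return score


def should_disconnect(signals):
    """True once the accumulated score reaches the cut-off."""
    return connection_score(signals) >= REJECT_AT
```

For example, a DNSBL-listed desktop client scores 7.0 and is dropped, while a
dynamic-range desktop client with no listing scores 4.0 and is let through.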
This should IMHO already block most Spam messages from even reaching your queue.
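The early-talker ban with a growing penalty from the list above could be
tracked like this. The class name and the 60 second base ban are assumptions
(the text only says "NN seconds"); the 30 second penalty comes from the text.
Time is passed in explicitly so the logic is easy to test.

```python
class EarlyTalkerBans:
    """Track banned IPs; each attempt during a ban extends it."""

    def __init__(self, base_ban=60.0, penalty=30.0):
        self.base_ban = base_ban   # hypothetical value for "NN seconds"
        self.penalty = penalty     # the "+30 seconds" per reconnect
        self._until = {}           # ip -> time at which the ban expires

    def ban(self, ip, now):
        """Start a ban for an IP that talked before the final 220 line."""
        self._until[ip] = now + self.base_ban

    def allow(self, ip, now):
        """Return True if the IP may connect; punish it if still banned."""
        until = self._until.get(ip)
        if until is None:
            return True
        if now >= until:
            del self._until[ip]    # ban has run out
            return True
        self._until[ip] = until + self.penalty
        return False
```

In a server, now would typically come from time.monotonic().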
If you use DSPAM then you should not forget to clean unused tokens from
time to time. One 'all-in-one' script that can help you do that is
this one here ->
http://dspam.git.sourceforge.net/git/gitweb.cgi?p=dspam/dspam;a=tree;f=contrib/dspam_maintenance;hb=HEAD
<-
ohhh and btw: I would disconnect those bastards as fast as I can. Forget
about trying to be smart and doing fancy computations and such. It's
mostly useless. Get them off your line and save your resources for
connections that have value.
btw2: Every IP passing the above test scenario should next time not be
forced to go through the whole evaluation again. I would cache the result
for a bunch of hours and refresh the cache time each time the IP
reconnects. I would delete the cache entry after NN hours/minutes
without reconnection from the IP.
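A minimal sketch of such a cache, where each reconnect pushes the expiry
forward and an entry disappears after a quiet period. The class name and the
four hour default are assumptions, not part of any existing plugin.

```python
class PassCache:
    """Remember IPs that passed the checks, with a sliding expiry."""

    def __init__(self, ttl=4 * 3600.0):
        self.ttl = ttl        # hypothetical "NN hours" of allowed silence
        self._expires = {}    # ip -> time at which the entry expires

    def remember(self, ip, now):
        """Store an IP that has just passed the full evaluation."""
        self._expires[ip] = now + self.ttl

    def is_known_good(self, ip, now):
        """True if the IP is cached; a hit refreshes its expiry."""
        expires = self._expires.get(ip)
        if expires is None:
            return False
        if now >= expires:
            del self._expires[ip]   # silent for too long: re-evaluate
            return False
        self._expires[ip] = now + self.ttl
        return True
```

An in-memory dict like this only works within one process. A forking server
would need shared storage, which is left out here.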
btw3: I would maybe even run something like fail2ban and use iptables to
block those idiots if they try to send spam for XX times in NN
minutes/seconds. I would even block IPs trying to do directory attacks
and such against my POP3/IMAPv4 server.
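The "XX times in NN minutes/seconds" rule is a sliding-window count per IP.
The sketch below shows only that counting logic, with made-up defaults. It is
not how fail2ban is configured, and the actual blocking (iptables or
otherwise) is left to whatever calls it.

```python
from collections import defaultdict, deque


class OffenceWindow:
    """Count offences per IP inside a sliding time window."""

    def __init__(self, max_offences=5, window=600.0):
        self.max_offences = max_offences   # hypothetical "XX times"
        self.window = window               # hypothetical "NN seconds"
        self._events = defaultdict(deque)  # ip -> offence timestamps

    def record(self, ip, now):
        """Record one offence; return True if the IP should be blocked."""
        events = self._events[ip]
        events.append(now)
        # Drop offences that have fallen out of the window.
        while events and now - events[0] > self.window:
            events.popleft()
        return len(events) >= self.max_offences
```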
Matt
--
Kind Regards from Switzerland,
Stevan Bajić