On Jun 4, 2012, at 6:40 AM, Stevan Bajić wrote:

> On 02.06.2012 23:23, Matt Simerson wrote:
>> On Jun 2, 2012, at 11:15 AM, Jared Johnson wrote:
>> 
>>>> Yup. Part of the motivation for this plugin was to short circuit all the
>>>> intermediate plugins and handlers so I can feed the message to sa-learn
>>>> and dspam. Until dspam is trained, that's a very important step in
>>>> training it. But there's no gain in validating the HELO name, SPF,  or
>>>> DomainKeys. This plugin and associated changes adds that flexibility while
>>>> reducing the code and complexity of the plugins.
>>> It might not be fair to say there's *no* gain.  Our HELO validation and
>>> SPF plugins (we don't have a DKIM plugin at the moment, for shame) now do
>>> their lookups unconditionally and add headers to the message so that our
>>> bayes engine can tokenize the headers themselves.
>> Wait until you actually run DomainKeys before you decide if it's a gain. It 
>> requires more resources than I'd have guessed. And surprisingly (to me) is 
>> that the most reliably signed messages are spam. Or very big "mostly good" 
>> senders.  I've seen enough ham senders with broken DomainKeys so I don't 
>> consider it reliable enough to reject or train based on. Same goes for SPF. 
>> Spammers are far more likely to have good SPF than legit mailers. Spammers 
>> automate their SPF records, so they don't make typo mistakes like 
>> "ip:127..." (should be "ip4:127...") or missing spaces between the 
>> declarations and the ~all. The errors are common enough, and affect ham 
>> often enough, that I'm tempted to fix them up in the SPF plugin before 
>> validation.
>> 
>> And SPF breaks legit forwarding servers that don't implement SRS. So I don't 
>> reject or train based on SPF alone.
>> 
>> I too have a custom HELO validation plugin (it needs more work, but I'll 
>> contribute it eventually), and it may actually provide some gain, but I 
>> think it's safe to say the one presently in plugins is not a gain.
>> 
>> How do you measure if the resources expended are worth the (likely small) 
>> benefit you would get from the additional bayes tokens? That will determine 
>> if it's a gain or not. I've placed my bet on the table, and I'd be pleased 
>> to be proven wrong.
>> 
>>> Bayes is a little bit of a black box to me, so I can't really quantify
>>> just how useful this is, but I'd say it's greater than zero. Dspam even
>>> treats headers in a special way to ensure that their usefulness is
>>> maximized.
>> Usefulness != gain.  There may be some gain, but I'm not familiar with bayes 
>> enough either. But I know someone who is. The dspam author (Stevan Bajić) 
>> noticed my plugin, contacted me, and will be submitting some improvements, 
>> like talking directly to the dspam server.  I'm BCC'ing him on this message, 
>> and hopefully we'll get a more informed opinion.
> I don't 100% understand what you are trying to do with bayes? Is this 
> 'reaper' plugin adding some additional data to the header of the mail and the 
> other person posting is questioning if that additional header is beneficial 
> to the bayes engine?
> 
> Care to explain a little more to me what this is all about?

Hi again Stevan,

Here's an example of what I'm doing:

49237 250 mail.theartfarm.com Hi S0106001560c96a0b.wp.shawcable.net [50.72.202.227]; I am so happy to meet you.
49237 dispatching MAIL FROM: <no-re...@shawcable.net>
49237 (mail) badmailfrom: skip, naughty
49237 (mail) resolvable_fromhost: skip, naughty
49237 (mail) sender_permitted_from: skip, naughty
49237 250 <no-re...@shawcable.net>, sender OK - how exciting to get mail from you!
49237 dispatching RCPT TO: <u...@example.com>
49237 (rcpt) rhsbl: pass
49237 (rcpt) dnsbl: skip, naughty
49237 (rcpt) resolvable_fromhost: skip, naughty
49237 (rcpt) sender_permitted_from: skip, naughty
49237 (rcpt) badrcptto: skip, naughty
49237 (rcpt) qmail_deliverable: skip, naughty
49237 (rcpt) rcpt_ok: pass: example.com found in morercpthosts
49237 250 <u...@example.com>, recipient ok
49237 dispatching DATA
49237 354 go ahead
49237 (data_post) basicheaders: skip, naughty
49237 (data_post) bogus_bounce: skip, not a null sender
49237 (data_post) domainkeys: skip, naughty
49237 (data_post) spamassassin: skip, naughty
49237 (data_post) dspam: training naughty as spam
49237 spooling message to disk
49237 (data_post) virus::clamdscan: skip, naughty
49237 (data_post) naughty: disconnecting
49237 552 Blocked - see http://cbl.abuseat.org/lookup.cgi?ip=50.72.202.227
49237 click, disconnecting
49237 (post-connection) connection_time: 0.575 s.
86740 cleaning up after 49237

First, I renamed the reaper plugin to 'naughty', but it does exactly the same 
things: it lets other plugins identify a message as naughty, and then the 
'naughty' plugin handles disposal of the message at some predetermined time. I 
have added immunity tests to all the other plugins, so that they'll skip 
processing if one of the immunity conditions is met. (See is_immune() here: 
https://github.com/smtpd/qpsmtpd/pull/20/files) You can see above that most of 
the plugins have skipped processing, saving much time and CPU.
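
To make the skip behaviour concrete, here is a rough sketch in Python (qpsmtpd 
itself is Perl). Every name in it (Connection, notes, run_plugin, and this 
is_immune) is made up for illustration and is not the code from the pull 
request; it only assumes that one plugin flags the connection and later plugins 
check that flag before doing any work.

```python
# Illustrative only: these names are hypothetical, not qpsmtpd's actual API.

class Connection:
    def __init__(self):
        # Per-connection scratch space shared by all plugins.
        self.notes = {}


def is_immune(conn):
    """Return a reason string if later plugins should skip, else None."""
    if conn.notes.get("naughty"):
        return "naughty"
    return None


def run_plugin(name, check, conn, log):
    """Run one plugin's check unless the connection is already immune."""
    reason = is_immune(conn)
    if reason:
        log.append(f"{name}: skip, {reason}")
        return
    verdict = check(conn)
    if verdict == "naughty":
        # Flag the connection; disposal is left to a separate plugin.
        conn.notes["naughty"] = f"flagged by {name}"
    log.append(f"{name}: {verdict}")


def demo():
    conn = Connection()
    log = []
    run_plugin("dnsbl", lambda c: "naughty", conn, log)
    run_plugin("badmailfrom", lambda c: "pass", conn, log)
    run_plugin("resolvable_fromhost", lambda c: "pass", conn, log)
    return log
```

Calling demo() produces one "dnsbl: naughty" entry followed by "skip, naughty" 
entries for the remaining plugins, which mirrors the shape of the log above.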

In typical usage, I intend to run with 'naughty reject rcpt', so that dnsbl and 
karma hits are disposed of much sooner.  A week ago I truncated my dspam tables 
and started over. I have a script that feeds my users' ham and spam into dspam 
to train it, but I'm fairly aggressive at cleaning out their spam folders, so 
users don't have much of a spam corpus. So, while dspam was solid at 
identifying ham, it wasn't recognizing spam at all. And most users don't bother 
dragging their spam into their spam folder. So I need to train dspam. Training 
on just the messages that spamassassin recognized works, but it takes a very 
long time.

So I changed to 'naughty reject data_post', so that naughty messages would be 
rejected only after the body arrived and was fed to dspam, as you can see in 
this line:

49237 (data_post) dspam: training naughty as spam

Overnight, dspam's spam detection accuracy improved from about 1% to 60%.  In 
another day or two, I expect training will no longer be necessary.  But again, 
I'm learning dspam as I go.
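
The difference between the two reject settings can be shown with a toy model in 
Python. It is only a sketch under stated assumptions: the hook list is 
simplified, the process() function and its event strings are hypothetical, and 
the one thing it takes from the log above is that dspam trains during data_post 
before the naughty plugin disconnects.

```python
# Toy model of hook ordering; names are hypothetical, not qpsmtpd code.

HOOKS = ["connect", "mail", "rcpt", "data_post"]


def process(reject_hook, flagged_at="connect"):
    """Walk the hooks in order and return the events that occur."""
    events = []
    naughty = False
    for hook in HOOKS:
        if hook == flagged_at:
            naughty = True
            events.append(f"{hook}: flagged naughty")
        if hook == "data_post" and naughty:
            # The body is only available here, so training can only happen here.
            events.append("data_post: train dspam as spam")
        if naughty and hook == reject_hook:
            events.append(f"{hook}: reject and disconnect")
            break
    return events
```

With process("rcpt") the connection is dropped before any body arrives, so the 
training event never appears. With process("data_post") the training event 
comes first and the rejection follows it.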

It might make a lot of sense to add a header with the MAIL FROM information, 
before feeding it to dspam.  Is it worth the effort? Is there a standard header 
name DSPAM looks for?
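
For reference, the mechanics of adding such a header are simple. The sketch 
below uses Python's standard email module rather than qpsmtpd's Perl API. The 
header name X-Envelope-From and the function add_envelope_sender are 
placeholders of my own; I am not assuming DSPAM gives any particular header name 
special treatment.

```python
from email import message_from_string


def add_envelope_sender(raw_message, mail_from, header_name="X-Envelope-From"):
    """Return the message text with the envelope sender recorded in a header.

    header_name is a placeholder, not a name DSPAM is known to look for.
    """
    msg = message_from_string(raw_message)
    # Remove any copy the sender supplied, so only the server's value remains.
    del msg[header_name]
    msg[header_name] = f"<{mail_from}>"
    return msg.as_string()
```

The returned text would then be handed to dspam in place of the original 
message. Deleting the header first matters because a spammer could otherwise 
supply a misleading value of their own.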

Any advice you offer is appreciated.

Matt
