On Jun 4, 2012, at 6:40 AM, Stevan Bajić wrote:
> On 02.06.2012 23:23, Matt Simerson wrote:
>> On Jun 2, 2012, at 11:15 AM, Jared Johnson wrote:
>>
>>>> Yup. Part of the motivation for this plugin was to short circuit all the
>>>> intermediate plugins and handlers so I can feed the message to sa-learn
>>>> and dspam. Until dspam is trained, that's a very important step in
>>>> training it. But there's no gain in validating the HELO name, SPF, or
>>>> DomainKeys. This plugin and associated changes adds that flexibility while
>>>> reducing the code and complexity of the plugins.
>>> It might not be fair to say there's *no* gain. Our HELO validation and
>>> SPF plugins (we don't have a DKIM plugin at the moment, for shame) now do
>>> their lookups unconditionally and add headers to the message so that our
>>> bayes engine can tokenize the headers themselves.
>> Wait until you actually run DomainKeys before you decide if it's a gain. It
>> requires more resources than I'd have guessed. And surprisingly (to me) is
>> that the most reliably signed messages are spam. Or very big "mostly good"
>> senders. I've seen enough ham senders with broken DomainKeys so I don't
>> consider it reliable enough to reject or train based on. Same goes for SPF.
>> Spammers are far more likely to have good SPF than legit mailers. Spammers
>> automate their SPF records, so they don't make typo mistakes like
>> "ip:127..." (should be "ip4:127...") or missing spaces between the
>> declarations and the ~all. The errors are common enough, and affect ham
>> often enough, that I'm tempted to fix them up in the SPF plugin before
>> validation.
>>
>> And SPF breaks legit forwarding servers that don't implement SRS. So I don't
>> reject or train based on SPF alone.
>>
>> I too have a custom HELO validation plugin (it needs more work, but I'll
>> contribute it eventually), and it may actually provide some gain, but I
>> think it's safe to say the one presently in plugins is not a gain.
>>
>> How do you measure if the resources expended are worth the (likely small)
>> benefit you would get from the additional bayes tokens? That will determine
>> if it's a gain or not. I've placed my bet on the table, and I'd be pleased
>> to be proven wrong.
>>
>>> Bayes is a little bit of a black box to me, so I can't really quantify
>>> just how useful this is, but I'd say it's greater than zero. Dspam even
>>> treats headers in a special way to ensure that their usefulness is
>>> maximized.
>> Usefulness != gain. There may be some gain, but I'm not familiar with bayes
>> enough either. But I know someone who is. The dspam author (Stevan Bajić)
>> noticed my plugin, contacted me, and will be submitting some improvements,
>> like talking directly to the dspam server. I'm BCC'ing him on this message,
>> and hopefully we'll get a more informed opinion.
> I don't 100% understand what you are trying to do with bayes? Is this
> 'reaper' plugin adding some additional data to the header of the mail and the
> other person posting is questioning if that additional header is beneficial
> to the bayes engine?
>
> Care to explain little more to me what this is all about?

Hi again Stevan,

Here's an example of what I'm doing:

49237 250 mail.theartfarm.com Hi S0106001560c96a0b.wp.shawcable.net [50.72.202.227]; I am so happy to meet you.
49237 dispatching MAIL FROM: <no-re...@shawcable.net>
49237 (mail) badmailfrom: skip, naughty
49237 (mail) resolvable_fromhost: skip, naughty
49237 (mail) sender_permitted_from: skip, naughty
49237 250 <no-re...@shawcable.net>, sender OK - how exciting to get mail from you!
49237 dispatching RCPT TO: <u...@example.com>
49237 (rcpt) rhsbl: pass
49237 (rcpt) dnsbl: skip, naughty
49237 (rcpt) resolvable_fromhost: skip, naughty
49237 (rcpt) sender_permitted_from: skip, naughty
49237 (rcpt) badrcptto: skip, naughty
49237 (rcpt) qmail_deliverable: skip, naughty
49237 (rcpt) rcpt_ok: pass: example.com found in morercpthosts
49237 250 <u...@example.com>, recipient ok
49237 dispatching DATA
49237 354 go ahead
49237 (data_post) basicheaders: skip, naughty
49237 (data_post) bogus_bounce: skip, not a null sender
49237 (data_post) domainkeys: skip, naughty
49237 (data_post) spamassassin: skip, naughty
49237 (data_post) dspam: training naughty as spam
49237 spooling message to disk
49237 (data_post) virus::clamdscan: skip, naughty
49237 (data_post) naughty: disconnecting
49237 552 Blocked - see http://cbl.abuseat.org/lookup.cgi?ip=50.72.202.227
49237 click, disconnecting
49237 (post-connection) connection_time: 0.575 s.
86740 cleaning up after 49237

First, I renamed the reaper plugin to 'naughty', but it does exactly the same
things. It lets other plugins identify a message as naughty, and then the
'naughty' plugin handles disposal of the message at some predetermined time.

I have added immunity tests to all the other plugins, so that they'll skip
processing if one of the immunity conditions is met. (See is_immune() here:
https://github.com/smtpd/qpsmtpd/pull/20/files) You can see above that most of
the plugins skipped processing, saving much time and CPU.

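To make the flow in that log concrete, here is a rough sketch of the pattern in Python. It is not qpsmtpd's Perl code: apart from the name is_immune(), every class, function, and message string below is made up for illustration, and the real immunity conditions in the pull request may cover more than the naughty flag.

```python
# Sketch of "flag early, reject later". All names here are hypothetical.
HOOKS = ("connect", "mail", "rcpt", "data_post")

# One stand-in check per later hook (plugin names borrowed from the log).
CHECKS = {
    "mail": "sender_permitted_from",
    "rcpt": "resolvable_fromhost",
    "data_post": "domainkeys",
}


class Session:
    """Per-connection state shared by all plugins."""

    def __init__(self, remote_ip):
        self.remote_ip = remote_ip
        self.naughty = None  # rejection text, set by the plugin that flags us
        self.log = []


def is_immune(session):
    """True when an earlier plugin already flagged the connection."""
    return session.naughty is not None


def dnsbl(session, listed_ips):
    """Flag, but do not reject, connections from listed IPs."""
    if is_immune(session):
        session.log.append("dnsbl: skip, naughty")
    elif session.remote_ip in listed_ips:
        session.naughty = "Blocked - %s is listed" % session.remote_ip
        session.log.append("dnsbl: fail, marked naughty")
    else:
        session.log.append("dnsbl: pass")


def expensive_check(session, name):
    """Stand-in for SPF, DomainKeys, etc.: skipped once flagged."""
    if is_immune(session):
        session.log.append(name + ": skip, naughty")
    else:
        session.log.append(name + ": pass")


def naughty(session, hook, reject_at):
    """Dispose of a flagged connection, but only at the configured hook."""
    if hook == reject_at and session.naughty is not None:
        session.log.append("naughty: disconnecting")
        return session.naughty
    return None


def run(remote_ip, listed_ips, reject_at):
    """Walk the hooks in order; return (rejection text or None, log)."""
    session = Session(remote_ip)
    for hook in HOOKS:
        if hook == "connect":
            dnsbl(session, listed_ips)
        else:
            expensive_check(session, CHECKS[hook])
        verdict = naughty(session, hook, reject_at)
        if verdict is not None:
            return verdict, session.log
    return None, session.log
```

Calling run() with reject_at set to "data_post" lets the flagged connection reach the end of DATA before the reject, while "rcpt" drops it earlier; in both cases the stand-in checks after the flag are skipped.
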
In typical usage, I intend to run with 'naughty reject rcpt', so that dnsbl and
karma hits are disposed of much sooner.

A week ago I truncated my dspam tables and started over. I have a script that
feeds my users' ham and spam into dspam to train it, but I'm fairly aggressive
at cleaning out their spam folders, so users don't have much of a spam corpus.
So, while dspam was solid at identifying ham, it wasn't recognizing spam at
all. And most users don't bother dragging their spam into their spam folder.
So I need to train dspam. Training on just the messages that spamassassin
recognized works, but it takes a very long time.

So I changed the setting to 'naughty reject data_post', so that naughty
messages would be rejected only after the body had arrived and been fed to
dspam, as you can see in this line:

49237 (data_post) dspam: training naughty as spam

Overnight, dspam's spam detection accuracy improved from about 1% to 60%. In
another day or two, I expect training will no longer be necessary.

But again, I'm learning dspam as I go. It might make a lot of sense to add a
header with the MAIL FROM information before feeding the message to dspam. Is
it worth the effort? Is there a standard header name DSPAM looks for? Any
advice you offer is appreciated.

Matt
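
For the MAIL FROM header idea, a minimal Python sketch of the step might look like the following. The header name X-Envelope-From and the function add_envelope_sender() are placeholders invented for this example; whether DSPAM looks for any particular header name is exactly the open question above, and the actual plugin is written in Perl.

```python
from email import message_from_string


def add_envelope_sender(raw_message, mail_from, header_name="X-Envelope-From"):
    """Return raw_message with the envelope sender recorded in a header.

    header_name is a placeholder, not a name DSPAM is known to look for.
    """
    msg = message_from_string(raw_message)
    # Drop any copy supplied by the sender so only our value is tokenized.
    # Deleting a header that is not present does not raise.
    del msg[header_name]
    msg[header_name] = mail_from
    return msg.as_string()
```

The result would then be handed to dspam in place of the original message text, so that the envelope sender becomes one more header for the bayes engine to tokenize.
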