Might want to put this on the wiki too! Adding SASA group too for their input.

--
Kevin A. McGrail
VP Fundraising, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171

On Thu, Oct 4, 2018 at 10:28 AM Henrik K <h...@hege.li> wrote:

>
> Still hoping to get some conversation going on about reuse.
>
> Personally I create my corpus like this:
>
> - hacked amavisd-milter to save unmodified message copy to "pristine"
>   directory
>
> - run a separate clean install of trunk SA/spamd that has default rules,
>   razor/pyzor/dcc etc, and only runs all "reuse" flagged rules
>   (my recent trunk commit)
>     --pre "loadplugin Mail::SpamAssassin::Plugin::Reuse"
>     --pre "run_reuse_tests_only 1"
>
> - cron every minute: run messages from "pristine" directory through
>   above spamd to add X-Spam-Status header and move to "corpus"
>
> - a bit later get mailids and resulting ham/spam status from my main
>   amavis, and sort out "corpus" to "corpus_ham/spam" (of course with some
>   manual vetting, dspam crosscheck etc)
>
> Since my main setup uses extreme whitelisting and shortcircuiting, this is
> the only way to get 100% legit corpus.  It takes very little resources
> anyway, since that spamd just runs network lookups (which are mostly cached
> already).
>
> Basically I'd like to see masscheckers do something similar.  Doesn't matter
> where you source all the corpus, it is possible to clean them up to
> "pristine status" and run ASAP through spamd setup like above.  That way they
> have legit X-Spam-Status header that can be reused even years later.
>
> Of course if your corpus already has X-Spam-Status from mail receive time
> (and all possible plugins and checks are enabled), then it's simply the case
> of enabling reuse.  But shortcircuited messages should be skipped.
>
> I also recently added REUSE config here:
>
> http://svn.apache.org/viewvc/spamassassin/trunk/masses/contrib/automasscheck-minimal/
>
>
> On Mon, Sep 03, 2018 at 05:55:05PM +0300, Henrik Krohns wrote:
> >
> > If you look at the ancient mass-check code before Reuse.pm was split from
> > it, it shows the original intention:
> >
> > http://svn.apache.org/viewvc/spamassassin/trunk/masses/mass-check?revision=721962&view=markup
> >
> > # --reuse without --net means we need to just zero ALL net rules; skip net
> > # lookups entirely except for the reused ones.
> > (then it proceeds to zero scores for all "tflags net" rules)
> >
> > Ok I'm not even sure why it's talking about --reuse withOUT --net, since the
> > point here is to do separate scoresets with and without network checks?  One
> > would simply run local checks only, or --reuse --net.
> >
> > If everyone used reuse, would there even be need for "weekly" masschecks as
> > every day simply included the network checks!?  If you ask me, without
> > --reuse one would be only allowed to submit "nightly" masschecks (no --net).
> >
> > Current Reuse.pm simply reads "reuse XXX" config clauses, and zeroes scores
> > for those.  So it is important to remember to use "reuse XXX" for any net
> > rules, since it doesn't automatically iterate through them anymore!  Which
> > in my mind is silly, why not simply iterate again through "tflags net" and
> > forget "reuse" stanza completely.
> >
> > Cheers,
> > Henrik
> >
> >
> > On Mon, Sep 03, 2018 at 05:29:20PM +0300, Henrik K wrote:
> > >
> > > Hey guys,
> > >
> > > I'm wondering why pretty much no masscheck submitter is using --reuse?
> > >
> > > I just committed fixes for lots of missing reuse flags, and now I can
> > > actually do a ./mass-check --reuse --net run without ANY dns lookups
> > > launching.  So it's super fast too.
> > >
> > > What reason would there be to prefer running without reuse?  Is this simply
> > > a case of missing guidance/documentation?  Looking at some corpus logs,
> > > judging by Maildir file timestamps there are even few years old messages run
> > > through.  How can that make any sense, I wouldn't run anything older than
> > > an hour through DNSBLs.
> > >
> > > Of course I understand if someone's messages don't have a scantime
> > > X-Spam-Status header for some reason, but even that could be easily fixable
> > > by simply running the messages through a dedicated spamd as soon as possible
> > > to add the headers.
> > >
> > > Cheers,
> > > Henrik
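
For a wiki write-up, the cron step from the quoted workflow (run messages from "pristine" through the dedicated spamd to add X-Spam-Status, then move them to "corpus") could look roughly like the Python sketch below. It is only an illustration under assumptions, not Henrik's actual script: the function names, the spamd address 127.0.0.1:783, and the way shortcircuited mail is detected (a `shortcircuit=` tag other than `no`, or a SHORTCIRCUIT hit in `tests=`) are hypothetical and depend on the local add_header configuration.

```python
"""Sketch: tag "pristine" mail with a scan-time X-Spam-Status, then file it
into the corpus. All names and defaults here are illustrative assumptions."""
import email
import re
import subprocess
from pathlib import Path


def scan_with_spamc(raw: bytes, host: str = "127.0.0.1", port: int = 783) -> bytes:
    """Pipe one message through spamc and return the rewritten message.

    host/port are assumptions: point them at the dedicated reuse-only spamd.
    """
    result = subprocess.run(
        ["spamc", "-d", host, "-p", str(port)],
        input=raw, stdout=subprocess.PIPE, check=True,
    )
    return result.stdout


def is_reusable(raw: bytes) -> bool:
    """True if the message carries an X-Spam-Status that was not shortcircuited."""
    status = email.message_from_bytes(raw).get("X-Spam-Status")
    if status is None:
        return False
    # Undo header folding inside the comma-separated tests= list.
    flat = re.sub(r",\s+", ",", str(status))
    # Assumption: the header template includes a shortcircuit=<type> tag.
    tag = re.search(r"shortcircuit=(\w+)", flat)
    if tag and tag.group(1) != "no":
        return False
    # Assumption: a shortcircuited scan also lists a SHORTCIRCUIT rule hit.
    tests = re.search(r"tests=(\S*)", flat)
    names = tests.group(1).split(",") if tests else []
    return "SHORTCIRCUIT" not in names


def process_pristine(pristine: Path, corpus: Path, scan=scan_with_spamc) -> list:
    """Scan every file in `pristine`; move the usable results into `corpus`."""
    corpus.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(pristine.iterdir()):
        if not path.is_file():
            continue
        tagged = scan(path.read_bytes())
        if not is_reusable(tagged):
            continue  # leave it in "pristine" for a retry or a manual look
        (corpus / path.name).write_bytes(tagged)
        path.unlink()
        moved.append(path.name)
    return moved
```

A cron job would call something like `process_pristine(Path("pristine"), Path("corpus"))` every minute. The same `is_reusable` check could also be used to skip shortcircuited messages in a corpus that already has receive-time headers.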