Re: Masscheck reuse
Might want to put this on the wiki too! Adding SASA group too for their input.

--
Kevin A. McGrail
VP Fundraising, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171

On Thu, Oct 4, 2018 at 10:28 AM Henrik K wrote:
>
> Still hoping to get some conversation going on about reuse.
>
> Personally I create my corpus like this:
>
> - hacked amavisd-milter to save unmodified message copy to "pristine"
>   directory
>
> - run a separate clean install of trunk SA/spamd that has default rules,
>   razor/pyzor/dcc etc, and only runs all "reuse" flagged rules
>   (my recent trunk commit)
>     --pre "loadplugin Mail::SpamAssassin::Plugin::Reuse"
>     --pre "run_reuse_tests_only 1"
>
> - cron every minute: run messages from "pristine" directory through
>   above spamd to add X-Spam-Status header and move to "corpus"
>
> - a bit later get mailids and resulting ham/spam status from my main amavis,
>   and sort out "corpus" to "corpus_ham/spam" (of course with some manual
>   vetting, dspam crosscheck etc)
>
> Since my main setup uses extreme whitelisting and shortcircuiting, this is
> the only way to get 100% legit corpus. It takes very little resources
> anyway, since that spamd just runs network lookups (which are mostly cached
> already).
>
> Basically I'd like to see masscheckers do something similar. Doesn't matter
> where you source all the corpus, it is possible to clean them up to
> "pristine status" and run ASAP through spamd setup like above. That way they
> have legit X-Spam-Status header that can be reused even years later.
>
> Of course if your corpus already has X-Spam-Status from mail receive time
> (and all possible plugins and checks are enabled), then it's simply the case
> of enabling reuse. But shortcircuited messages should be skipped.
>
> I also recently added REUSE config here:
>
> http://svn.apache.org/viewvc/spamassassin/trunk/masses/contrib/automasscheck-minimal/
>
>
> On Mon, Sep 03, 2018 at 05:55:05PM +0300, Henrik Krohns wrote:
> >
> > If you look at the ancient mass-check code before Reuse.pm was split from
> > it, it shows the original intention:
> >
> > http://svn.apache.org/viewvc/spamassassin/trunk/masses/mass-check?revision=721962&view=markup
> >
> > # --reuse without --net means we need to just zero ALL net rules; skip net
> > # lookups entirely except for the reused ones.
> > (then it proceeds to zero scores for all "tflags net" rules)
> >
> > Ok I'm not even sure why it's talking about --reuse withOUT --net, since the
> > point here is to do separate scoresets with and without network checks? One
> > would simply run local checks only, or --reuse --net.
> >
> > If everyone used reuse, would there even be need for "weekly" masschecks as
> > every day simply included the network checks!? If you ask me, without
> > --reuse one would be only allowed to submit "nightly" masschecks (no --net).
> >
> > Current Reuse.pm simply reads "reuse XXX" config clauses, and zeroes scores
> > for those. So it is important to remember to use "reuse XXX" for any net
> > rules, since it doesn't automatically iterate through them anymore! Which
> > in my mind is silly, why not simply iterate again through "tflags net" and
> > forget "reuse" stanza completely.
> >
> > Cheers,
> > Henrik
> >
> >
> > On Mon, Sep 03, 2018 at 05:29:20PM +0300, Henrik K wrote:
> > >
> > > Hey guys,
> > >
> > > I'm wondering why pretty much no masscheck submitter is using --reuse?
> > >
> > > I just committed fixes for lots of missing reuse flags, and now I can
> > > actually do a ./mass-check --reuse --net run without ANY dns lookups
> > > launching. So it's super fast too.
> > >
> > > What reason would there be to prefer running without reuse?
> > > Is this simply a case of missing guidance/documentation? Looking at some
> > > corpus logs, judging by Maildir file timestamps there are even messages
> > > a few years old being run through. How can that make any sense? I
> > > wouldn't run anything older than an hour through DNSBLs.
> > >
> > > Of course I understand if someone's messages don't have a scantime
> > > X-Spam-Status header for some reason, but even that could be easily fixable
> > > by simply running the messages through a dedicated spamd as soon as possible
> > > to add the headers.
> > >
> > > Cheers,
> > > Henrik
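
For anyone wanting to try the workflow described in this thread, here is a minimal Python sketch of the "cron every minute" step plus the "skip shortcircuited messages" check. It is an illustration under stated assumptions, not Henrik's actual setup: the function names (run_spamc, scan_pristine, is_reusable) are hypothetical, spamc is assumed to reach the dedicated reuse-only spamd with its default settings, and the check assumes the X-Spam-Status header carries a "shortcircuit=<type>" field, which depends on how the Shortcircuit plugin's header is configured locally.

```python
import re
import subprocess
from email import message_from_bytes
from pathlib import Path
from typing import Callable, List


def run_spamc(raw: bytes) -> bytes:
    """Pipe one message through spamc and return the rewritten message.

    Assumption: the dedicated reuse-only spamd is reachable with spamc's
    defaults; add -d/-p here if it listens elsewhere.
    """
    return subprocess.run(["spamc"], input=raw, stdout=subprocess.PIPE,
                          check=True).stdout


def scan_pristine(pristine: Path, corpus: Path,
                  scanner: Callable[[bytes], bytes] = run_spamc) -> List[Path]:
    """Scan every file in `pristine` and move the scanned copy to `corpus`.

    Messages that come back without an X-Spam-Status header (for example
    because spamd was unreachable) stay in `pristine` for the next run.
    """
    corpus.mkdir(parents=True, exist_ok=True)
    moved = []
    for src in sorted(p for p in pristine.iterdir() if p.is_file()):
        scanned = scanner(src.read_bytes())
        if message_from_bytes(scanned)["X-Spam-Status"] is None:
            continue  # not scanned; retry on the next cron run
        dst = corpus / src.name
        dst.write_bytes(scanned)
        src.unlink()
        moved.append(dst)
    return moved


def is_reusable(raw: bytes) -> bool:
    """Return True if the message has scan-time results worth reusing.

    Assumption: a shortcircuited scan is marked "shortcircuit=<type>" with
    a type other than "no" in X-Spam-Status; such messages are skipped
    because most rules never ran on them.
    """
    status = message_from_bytes(raw)["X-Spam-Status"]
    if status is None:
        return False
    match = re.search(r"shortcircuit=(\w+)", str(status))
    return match is None or match.group(1) == "no"
```

A cron job could call scan_pristine(Path("pristine"), Path("corpus")) once a minute, and the later ham/spam sorting step could use is_reusable() to drop shortcircuited messages before they reach corpus_ham or corpus_spam.
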
Cron ~/svn/trunk/build/mkupdates/run_nightly | /usr/bin/tee /var/www/automc.spamassassin.org/mkupdates/mkupdates.txt
+ promote_active_rules
+ pwd
+ /usr/bin/perl build/mkupdates/listpromotable
/usr/local/spamassassin/automc/svn/trunk
HTTP get: http://ruleqa.spamassassin.org/1-days-ago?xml=1
HTTP get: http://ruleqa.spamassassin.org/2-days-ago?xml=1
HTTP get: http://ruleqa.spamassassin.org/3-days-ago?xml=1
+ mv rules/active.list.new rules/active.list
+ svn diff rules
+ cat /var/www/ruleqa.spamassassin.org/reports/LATEST
Index: rules/active.list
===
--- rules/active.list (revision 1842688)
+++ rules/active.list (working copy)
@@ -1,5 +1,5 @@
 # active ruleset list, automatically generated from http://ruleqa.spamassassin.org/
-# with results from: day 1: darxus ena-week0 ena-week1 ena-week4 giovanni grenier hege jarif jbrooks mmiroslaw-mails-ham mmiroslaw-mails-spam sihde thendrikx; day 2: ena-week0 ena-week4 jbrooks llanga mmiroslaw-mails-ham mmiroslaw-mails-spam sihde; day 3: ena-week0 ena-week4 jarif jbrooks llanga mmiroslaw-mails-ham mmiroslaw-mails-spam sihde
+# with results from: day 1: darxus ena-week0 ena-week1 ena-week4 giovanni grenier hege jarif jbrooks mmiroslaw-mails-ham mmiroslaw-mails-spam sihde thendrikx; day 2: ena-week0 ena-week4 jarif jbrooks llanga mmiroslaw-mails-ham mmiroslaw-mails-spam sihde; day 3: ena-week0 ena-week4 jbrooks llanga mmiroslaw-mails-ham mmiroslaw-mails-spam sihde

 # tflags publish
 AC_BR_BONANZA
@@ -448,6 +448,9 @@
 # good enough
 KB_FORGED_MOZ4

+# good enough
+KHOP_DYNAMIC
+
 # tflags publish
 LIST_PRTL_PUMPDUMP

@@ -911,9 +914,6 @@
 TO_NO_BRKTS_PCNT

 # good enough
-TVD_INCREASE_SIZE
-
-# good enough
 TVD_PH_BODY_META

 # good enough
@@ -929,53 +929,62 @@
 TW_GIBBERISH_MANY

 # good enough
-ADVANCE_FEE_3_NEW_FRM_MNY
+ADMITS_SPAM

 # good enough
-ADVANCE_FEE_5_NEW_FRM_MNY
+ADVANCE_FEE_2_NEW_FRM_MNY

 # good enough
-AXB_XMAILER_MIMEOLE_OL_1ECD5
+ADVANCE_FEE_4_NEW_FRM_MNY

 # good enough
-BIGNUM_EMAILS
+ADVANCE_FEE_4_NEW_MONEY

 # good enough
-DATE_IN_FUTURE_96_Q
+ADVANCE_FEE_5_NEW

 # good enough
-DEAR_BENEFICIARY
+BODY_SINGLE_URI

 # good enough
-FILL_THIS_FORM_LONG
+DUP_SUSP_HDR

 # good enough
-FROM_MISSP_DYNIP
+FILL_THIS_FORM_LOAN

 # good enough
-FROM_MISSP_TO_UNDISC
+FROM_MISSPACED

 # good enough
-HK_CTE_RAW
+FROM_MISSP_USER

 # good enough
-KHOP_DYNAMIC
+HDRS_MISSP

 # good enough
-LOTTO_AGENT
+HK_NAME_MR_MRS

 # good enough
-MONEY_ATM_CARD
+HK_SCAM_N3

 # good enough
-NSL_RCVD_HELO_USER
+LIST_PARTIAL_SHORT_MSG

 # good enough
-SERGIO_SUBJECT_VIAGRA01
+MILLION_USD

 # good enough
-TO_EQ_FM_DIRECT_MX
+MONEY_BARRISTER

+# good enough
+NSL_RCVD_FROM_USER
+
+# good enough
+SHARE_50_50
+
+# good enough
+SHORTENED_URL_SRC
+
 # tflags publish
 UC_GIBBERISH_OBFU

+ echo 'Committing promotions in rules/active.list...'
+ svn commit -m 'promotions validated' rules/active.list
Committing promotions in rules/active.list...
Sending        rules/active.list
Transmitting file data .done
Committing transaction...
Committed revision 1842786.
+ /usr/bin/perl masses/rule-qa/list-bad-rules
++ date +%w
+ [[ 4 = 3 ]]
+ for VER in '$VERSIONS'
+ make_tarball_for_version 3.4.2
+ version=3.4.2
+ tmpdir=/usr/local/spamassassin/automc/tmp/stage/3.4.2
+ rm -rf /usr/local/spamassassin/automc/tmp/stage/3.4.2
+ mkdir -p /usr/local/spamassassin/automc/tmp/stage/3.4.2
+ make clean
rm -f \
  SpamAssassin.bso SpamAssassin.def \
  SpamAssassin.exp SpamAssassin.x \
  blib/arch/auto/Mail/SpamAssassin/extralibs.all \
  blib/arch/auto/Mail/SpamAssassin/extralibs.ld Makefile.aperl \
  *.a *.o \
  *perl.core MYMETA.json \
  MYMETA.yml blibdirs.ts \
  core core.*perl.*.? \
  core.[0-9] core.[0-9][0-9] \
  core.[0-9][0-9][0-9] core.[0-9][0-9][0-9][0-9] \
  core.[0-9][0-9][0-9][0-9][0-9] libSpamAssassin.def \
  mon.out perl \
  perl perl.exe \
  perlmain.c pm_to_blib \
  pm_to_blib.ts so_locations \
  tmon.out
rm -rf \
  *.cache blib \
  doc pod2htm* \
  qmail rules/*.pm \
  rules/70_inactive.cf sa-awl \
  sa-check_spamd sa-compile \
  sa-learn sa-update \
  spamassassin spamc/*.cache \
  spamc/*.o* spamc/*.so \
  spamc/Makefile spamc/config.h \
  spamc/config.log spamc/config.status \
  spamc/qmail-spamc spamc/replace/*.o* \
  spamc/spamc spamc/spamc.h \
  spamc/version.h spamd/*spamc* \
  spamd/spamd t/bayessql.cf \
  t/do_net t/log \
  t/sql_based_whitelist.cf version.env
mv Makefile Makefile.old > /dev/null 2>&1
+ /usr/bin/perl Makefile.PL PREFIX=/usr/local/spamassassin/automc/tmp/stage/3.4.2
What email address or URL should be used in the suspected-spam report
text for users who want more information on your filter installation?
(In particular, ISPs should change this to a local Postmaster contact)
default text: [the administrator of that system] the administrator of that system
NOTE: settings for "make test" are now controlled using "t/config.dist".
See that file if you wish to customize what tests are run, and how.
checking module dependencies and their versions...
***