Thank you very much for your help! A few answers inline.

-------- Original Message --------
Subject: Re: very basic SA-Learn performance question: is 90 seconds or so per token really, really slow or roughly normal?
From: Matus UHLAR - fantomas <uh...@fantomas.sk>
To: users@spamassassin.apache.org
Date: Tue Oct 31 2017 11:27:47 GMT+0300 (AST)
> On 31.10.17 01:35, David Gessel wrote:
>> amavisd-new-2.11.0_2,1
>> I'm finding the command /usr/local/bin/sa-learn --spam --showdots
>> /mail/blackrosetech.com/gessel/.Junk/{cur,new} is taking a while to
>
> if you use amavis, you must train amavis' bayes database
> (/var/lib/amavis/.spamassassin/ here), not your own.

Huh, I was getting bayes filter results, as I =think= I'm training a global bayes database per https://wiki.apache.org/spamassassin/SiteWideBayesSetup

>
>> complete... by a while I mean it has been running for 3 days. The folder
>> has a few months of spam in it, 4760 "conversations" according to
>> Thunderbird, which is roughly the message count since spam doesn't tend to
>> thread deeply.
>
> It's not needed to train on all spam you have. after initial training of
> let's say 200-500 pieces of (different types of) logged spam, it may be
> enough to train only spam that does not hit BAYES_99
>
> It's much more important to train on ham, since SA must know the DIFFERENCES
> between ham
> and spam - otherwise all mail will of course look like spam.
> (Also, SA won't hit if you don't have enough of ham trained).
>
> It's much worse to have FP than FN - train everything that does not hit
> BAYES_00
>
> I have trained my DB years ago and I rarely need new training now.

Yes, I do understand that.
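For reference, the suggestion to train only on spam that does not already hit BAYES_99 could be sketched roughly as below. This is only a sketch under assumptions: the function name list_not_hitting is made up for illustration, and it assumes the scanner writes the rule names into each stored message (for example in an X-Spam-Status header), which not every amavis setup does. It also matches the whole file, so a message that merely mentions the rule name in its body would be skipped as well.

```shell
#!/bin/sh
# Hypothetical helper (not part of SpamAssassin): list the messages in a
# directory whose text does not contain the given rule name.
# Assumption: rule hits are recorded inside the stored message, e.g. in an
# X-Spam-Status header. If they are not, every file will be listed.
list_not_hitting() {
    rule=$1
    dir=$2
    # grep -L prints the names of files that contain no matching line.
    # Errors (e.g. an empty directory) are silenced.
    grep -L -- "$rule" "$dir"/* 2>/dev/null
}

# Example usage (path taken from the commands in this thread):
#   list_not_hitting BAYES_99 /mail/blackrosetech.com/gessel/.Junk/cur
```

The resulting list of files could then be handed to sa-learn --spam instead of the whole folder, which should cut down the amount of mail that has to be tokenized on each run.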
The cron jobs I set up quite some time ago:

# learn ham and spam
17 3 * * 0 root /usr/local/bin/sa-learn --ham --no-sync /mail/blackrosetech.com/gessel/.archives.2017/{cur,new}
22 3 * * 0 root /usr/local/bin/sa-learn --ham --no-sync /mail/blackrosetech.com/gessel/.Sent/{cur,new}
27 3 * * 0 root /usr/local/bin/sa-learn --spam --no-sync /mail/blackrosetech.com/gessel/.ManJunk/{cur,new}
22 3 * * 0 root /usr/local/bin/sa-learn --ham --no-sync /mail/blackrosetech.com/carolyn/.Archives.2017/{cur,new}
32 3 * * 0 root /usr/local/bin/sa-learn --spam --no-sync /mail/blackrosetech.com/carolyn/.ManJunk/{cur,new}
37 3 * * 0 root /usr/local/bin/sa-learn --ham --no-sync /mail/blackrosetech.com/carolyn/.Sent/{cur,new}
55 3 * * 0 root /usr/local/bin/sa-learn --sync

I disabled auto-learn because non-spam would occasionally get classified as spam and I didn't want to train on that.

The theory here was to wipe the database, then groom the huge automatic spam folder to clear out any non-spam (manually moving it to the archives directory for later analysis as non-spam). Then sa-learn a large set of spam tokens from obvious spam, and on an ongoing basis keep it trained with the spam that slips through (which I move to ManJunk).

The incentive to restart was registering a few domain names, which triggers a deluge of "let me design your new logo" emails from rafts of hotmail and google accounts. I thought retraining the bayes database would improve detection of these linguistically distinctive spam messages.

>
>> I was trying to track progress and...
>> # sa-learn --dump magic
>> 0.000 0 3 0 non-token data: bayes db version
>> 0.000 0 1646 0 non-token data: nspam
>> 0.000 0 0 0 non-token data: nham
>
>> but then 24 hours later...
>>
>> # sa-learn --dump magic
>> 0.000 0 3 0 non-token data: bayes db version
>> 0.000 0 0 0 non-token data: nspam
>> 0.000 0 0 0 non-token data: nham
>
> are you sure someone did not back up your spam DB

Aside from the cron jobs above, no, but if they did that, then yes.
>
>> Two issues:
>>
>> 1) sa-learn seems really, really slow. Slow enough that spam sometimes
>> comes in faster. This seems far slower than the benchmark results suggest
>> is within the range of normal. I'm sure I'm doing something really wrong,
>> but not sure what.
>>
>> 2) what happened to my hard won spam tokens?
>>
>>
>> I know --no-sync should speed up the process and if the task ever completes
>> (or can be killed) I'll test that for speed on a smaller collection.
>
> --no-sync only helps if you have "bayes_learn_to_journal 1" - it's 0 by
> default. try turning it on.

OK, will do this. It is not in my local.cf.

The bayes config reads:

# Use Bayesian classifier (default: 1)
#
use_bayes 1

# Bayesian classifier auto-learning (default: 1)
#
bayes_auto_learn 0

# Set headers which may provide inappropriate cues to the Bayesian
# classifier
#
# bayes_ignore_header X-Bogosity
bayes_ignore_header X-Spam-Flag
bayes_ignore_header X-Spam-Status

# Set the default directory for the bayes classifier
bayes_path /var/amavis/.spamassassin/bayes
bayes_file_mode 0777

Just added this:

# If this option is set, whenever SpamAssassin does Bayes learning, it will
# put the information into the journal instead of directly into the database.
# This lowers contention for locking the database to execute an update, but
# will also cause more access to the journal and cause a delay before the
# updates are actually committed to the Bayes database.
#
bayes_learn_to_journal 1

>
>> Would something like specifying the mailbox format also help?
>
> only if you use mbox format.

No, maildir.

Not really relevant (I don't think) but:
dovecot2-2.2.31_1
dovecot-pigeonhole-0.4.19
postfix-3.2.2,1

Now that "bayes_learn_to_journal 1" is set, I've stopped the process, and....
# sa-learn --dump magic
0.000 0 3 0 non-token data: bayes db version
0.000 0 2326 0 non-token data: nspam
0.000 0 1 0 non-token data: nham
0.000 0 154919 0 non-token data: ntokens
0.000 0 1438503364 0 non-token data: oldest atime
0.000 0 1508964396 0 non-token data: newest atime
0.000 0 1508964658 0 non-token data: last journal sync atime
0.000 0 0 0 non-token data: last expiry atime
0.000 0 0 0 non-token data: last expire atime delta
0.000 0 0 0 non-token data: last expire reduction count

Restarting with:

# sa-learn --spam --showdots --no-sync /mail/blackrosetech.com/gessel/.Junk/{cur,new}

And will let it run for a bit to see what the rate looks like.
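One rough way to put a number on the rate is to take two snapshots of the nspam counter from sa-learn --dump magic and divide the difference by the elapsed time. The sketch below does that; the function names nspam_count and learn_rate_per_min are made up for illustration, and it assumes the dump format shown above, where the count is the third field on the nspam line. With --no-sync and bayes_learn_to_journal set, the counter may not move until a sync has run, so a flat reading does not necessarily mean nothing was learned.

```shell
#!/bin/sh
# Hypothetical helper: read `sa-learn --dump magic` output on stdin and
# print the nspam counter (third whitespace-separated field on that line).
nspam_count() {
    awk '/non-token data: nspam/ { print $3 }'
}

# Hypothetical helper: messages learned per minute between two counter
# readings taken the given number of seconds apart (integer arithmetic,
# so the result is rounded down).
learn_rate_per_min() {
    before=$1
    after=$2
    seconds=$3
    echo $(( (after - before) * 60 / seconds ))
}

# Example usage (assumes sa-learn reads the same bayes_path as the
# running learn job):
#   a=$(sa-learn --dump magic | nspam_count)
#   sleep 600
#   b=$(sa-learn --dump magic | nspam_count)
#   learn_rate_per_min "$a" "$b" 600
```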