Thank you very much for your help!  A few answers inline.  

-------- Original Message --------
Subject: Re: very basic SA-Learn performance question: is 90 seconds or so per token really, really slow or roughly normal?
From: Matus UHLAR - fantomas <uh...@fantomas.sk>
To: users@spamassassin.apache.org
Date: Tue Oct 31 2017 11:27:47 GMT+0300 (AST)

> On 31.10.17 01:35, David Gessel wrote:
>> amavisd-new-2.11.0_2,1
>> I'm finding the command /usr/local/bin/sa-learn --spam --showdots
>> /mail/blackrosetech.com/gessel/.Junk/{cur,new} is taking a while to
> 
> if you use amavis, you must train amavis' bayes database
> (/var/lib/amavis/.spamassassin/ here), not your own.

Huh.  I was getting bayes filter results, and I =think= I'm training a global
bayes database per
https://wiki.apache.org/spamassassin/SiteWideBayesSetup


> 
>> complete...  by a while I mean it has been running for 3 days.  The folder
>> has a few months of spam in it, 4760 "conversations" according to
>> Thunderbird, which is roughly the message count since spam doesn't tend to
>> thread deeply.
> 
> It's not needed to train on all spam you have. after initial training of
> let's say 200-500 pieces of (different types of) logged spam, it may be
> enough to train only spam that does not hit BAYES_99
> 
> It's much more important to train on ham, since SA must know the DIFFERENCES
> between ham and spam - otherwise all mail will of course look like spam.
> (Also, SA won't hit if you don't have enough of ham trained).
> 
> It's much worse to have FP than FN - train everything that does not hit
> BAYES_00
> 
> I have trained my DB years ago and I rarely need new training now.

Yes, I do understand that.  These are the cron jobs I set up quite some time ago:
# learn ham and spam
17      3       *       *       0       root  /usr/local/bin/sa-learn --ham --no-sync /mail/blackrosetech.com/gessel/.archives.2017/{cur,new}
22      3       *       *       0       root  /usr/local/bin/sa-learn --ham --no-sync /mail/blackrosetech.com/gessel/.Sent/{cur,new}
27      3       *       *       0       root  /usr/local/bin/sa-learn --spam --no-sync /mail/blackrosetech.com/gessel/.ManJunk/{cur,new}
22      3       *       *       0       root  /usr/local/bin/sa-learn --ham --no-sync /mail/blackrosetech.com/carolyn/.Archives.2017/{cur,new}
32      3       *       *       0       root  /usr/local/bin/sa-learn --spam --no-sync /mail/blackrosetech.com/carolyn/.ManJunk/{cur,new}
37      3       *       *       0       root  /usr/local/bin/sa-learn --ham --no-sync /mail/blackrosetech.com/carolyn/.Sent/{cur,new}
55      3       *       *       0       root  /usr/local/bin/sa-learn --sync

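To gauge how much work each of these jobs actually has, a rough per-folder message count can be compared against the nspam/nham totals.  This is only a minimal sketch, assuming a standard Maildir layout with one file per message under cur/ and new/; the function name is my own and not part of any tool mentioned here.

```python
import os


def count_maildir_messages(folder):
    """Count regular files in the cur/ and new/ subdirectories of a Maildir folder.

    Assumption: one file per message; tmp/ is ignored on purpose because it
    holds deliveries that are still in progress.
    """
    total = 0
    for sub in ("cur", "new"):
        path = os.path.join(folder, sub)
        if not os.path.isdir(path):
            continue  # a missing subdirectory simply contributes nothing
        with os.scandir(path) as entries:
            total += sum(1 for entry in entries if entry.is_file())
    return total
```

Calling it on one of the folders above (for example the .ManJunk path) returns a single integer for cur and new combined.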

I disabled auto-learn because non-spam would occasionally be misclassified as
spam and I didn't want to train on that.  The plan was to wipe the database,
then groom the huge automatic spam folder to clear out any non-spam (manually
moving it to the archives directory for later analysis as non-spam), then run
sa-learn on a large set of obvious spam, and on an ongoing basis keep it
trained with the spam that slips through (which I move to ManJunk).
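
The suggestion quoted above, to train only on spam that did not already hit BAYES_99, could be sketched roughly like this.  It assumes each stored message carries an X-Spam-Status header listing the rules that fired (which depends on how amavis/SpamAssassin is configured to tag mail); the function names are hypothetical.

```python
import re
from email import message_from_string

# Matches rule names such as BAYES_00, BAYES_50, BAYES_99, BAYES_999.
BAYES_RE = re.compile(r"\bBAYES_\d{2,3}\b")


def bayes_rules(raw_message):
    """Return the set of BAYES_* rule names found in X-Spam-Status (may be empty)."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    return set(BAYES_RE.findall(status))


def needs_spam_training(raw_message):
    """True if the message did not already hit BAYES_99 (or BAYES_999).

    Assumption: a message with no X-Spam-Status header at all is treated as
    worth training on.
    """
    return not (bayes_rules(raw_message) & {"BAYES_99", "BAYES_999"})
```

Messages for which needs_spam_training() is true would be the ones to copy into the folder that sa-learn --spam reads.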

The incentive to restart was registering a few domain names, which triggered a
deluge of "let me design your new logo" emails from rafts of hotmail and google
accounts.  I thought retraining the bayes database would improve detection of
these linguistically distinctive spam messages.

> 
>> I was trying to track progress and...
>> # sa-learn --dump magic
>> 0.000          0          3          0  non-token data: bayes db version
>> 0.000          0       1646          0  non-token data: nspam
>> 0.000          0          0          0  non-token data: nham
> 
>> but then 24 hours later...
>>
>> # sa-learn --dump magic
>> 0.000          0          3          0  non-token data: bayes db version
>> 0.000          0          0          0  non-token data: nspam
>> 0.000          0          0          0  non-token data: nham
> 
> are you sure someone did not back up your spam DB

No, I'm not sure.  Nothing I know of touches the database aside from the cron
jobs above, but if one of those did that, then yes.

> 
>> Two issues:
>>
>> 1) sa-learn seems really, really slow.  Slow enough that spam sometimes
>> comes in faster.  This seems far slower than the benchmark results suggest
>> is within the range of normal.  I'm sure I'm doing something really wrong,
>> but not sure what.
>>
>> 2)  what happened to my hard won spam tokens?
>>
>>
>> I know --no-sync should speed up the process and if the task ever completes
>> (or can be killed) I'll test that for speed on a smaller collection. 
> 
> --no-sync only helps if you have "bayes_learn_to_journal 1" - it's 0 by
> default.  try turning it on.

OK, will do this.  It is not in my local.cf.

The bayes config reads:

#   Use Bayesian classifier (default: 1)
#
use_bayes 1


#   Bayesian classifier auto-learning (default: 1)
#
bayes_auto_learn 0


#   Set headers which may provide inappropriate cues to the Bayesian
#   classifier
#
# bayes_ignore_header X-Bogosity
bayes_ignore_header X-Spam-Flag
bayes_ignore_header X-Spam-Status

#  Set the default directory for the bayes classifier
bayes_path /var/amavis/.spamassassin/bayes
bayes_file_mode 0777


Just added this:
#  If this option is set, whenever SpamAssassin does Bayes learning, it will
#  put the information into the journal instead of directly into the database.
#  This lowers contention for locking the database to execute an update, but 
#  will also cause more access to the journal and cause a delay before the 
#  updates are actually committed to the Bayes database.
#
bayes_learn_to_journal 1


> 
>> Would something like specifying the mailbox format also help?
> 
> only if you use mbox format.

No, maildir.  Not really relevant (I don't think) but:

dovecot2-2.2.31_1
dovecot-pigeonhole-0.4.19
postfix-3.2.2,1  

Now that "bayes_learn_to_journal 1" is set, I've stopped the process, and....

# sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0       2326          0  non-token data: nspam
0.000          0          1          0  non-token data: nham
0.000          0     154919          0  non-token data: ntokens
0.000          0 1438503364          0  non-token data: oldest atime
0.000          0 1508964396          0  non-token data: newest atime
0.000          0 1508964658          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
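
For tracking progress between runs, the counters in this output can be pulled out with a few lines of script.  This is a rough sketch, assuming the column layout shown above (value in the third column, label after "non-token data:") and treating the atime fields as Unix epoch seconds, which their magnitude suggests; the function names are my own.

```python
from datetime import datetime, timezone


def parse_dump_magic(text):
    """Map each 'non-token data' label to its integer value (third column)."""
    values = {}
    for line in text.splitlines():
        left, sep, label = line.partition("non-token data:")
        fields = left.split()
        if not sep or len(fields) < 3:
            continue  # skip blank lines and anything that is not a magic row
        values[label.strip()] = int(fields[2])
    return values


def atime_to_utc(seconds):
    """Render an atime value as an ISO 8601 UTC timestamp (assumes epoch seconds)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()
```

Feeding it the output above gives a dictionary keyed by labels such as "nspam", "nham" and "ntokens", and atime_to_utc() makes the oldest and newest atime values readable.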


Restarting with:
# sa-learn --spam --showdots --no-sync /mail/blackrosetech.com/gessel/.Junk/{cur,new}

And will let it run for a bit to see what the rate looks like.
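
To put a number on "the rate", two nspam readings taken a known interval apart are enough.  A minimal sketch follows; the function names are hypothetical, and with journalling and --no-sync in use the counters may lag until sa-learn --sync has run.

```python
def learn_rate_per_hour(count_start, count_end, elapsed_seconds):
    """Messages learned per hour between two nspam (or nham) readings."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (count_end - count_start) * 3600.0 / elapsed_seconds


def eta_hours(remaining_messages, rate_per_hour):
    """Hours left at the observed rate; infinite if nothing is being learned."""
    if rate_per_hour <= 0:
        return float("inf")
    return remaining_messages / rate_per_hour
```

As a purely made-up illustration, going from 2326 to 2386 learned messages in 30 minutes is 120 messages per hour, so 240 remaining messages would take about two more hours.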
