On Sun, Nov 13, 2005 at 04:34:38PM +0100, Michael Monnerie wrote:
>    MASS_PARAM="--progress --after=-15552000 --net -j=10"
> cd masses
> mkdir spamassassin
> rm -f spamassassin/bayes*
> #echo "use_bayes 0" > spamassassin/user_prefs
> 
> As you can see, the bayes stuff is commented out. As I "rsync --delete", 
> there should be no options set here that are not set at the source. I 
> could enable the "use_bayes 0" line, but I thought I shouldn't change 
> it, and left the original settings.

Yeah, we should probably document things a little more.  Basically, for the
nightly and weekly runs, all we really care about is the rule output -- things
like Bayes, AWL, etc. are superfluous and just waste CPU cycles.  Bayes *is*
important for a score generation run, but not so much during general
development.  (Arguably, if Bayes were run we could see what changes to the
tokenizer, etc. do, but those have traditionally been handled as one-off cases
since they don't happen often.)

For the masses/spamassassin/user_prefs file, I usually have this in there all
the time:

use_bayes 0
lock_method flock
bayes_path /home/felicity/SA/spamassassin-head/masses/spamassassin/bayes
use_pyzor 0
use_auto_whitelist 0
dns_available yes

along with any trusted_networks or internal_networks settings, etc.  Then if
I do want to enable Bayes, it's a simple one-bit change at the top (use_bayes
0 to 1) and I'm ready to go.  flock is better than the default lock method if
you're on local disk, btw.
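
As a minimal sketch, the settings above could be written out from the same
kind of setup script quoted earlier.  The function name write_user_prefs and
the choice to derive bayes_path from the prefs directory are assumptions for
illustration, not part of mass-check; pass an absolute directory so that
bayes_path ends up absolute.

```shell
# Hypothetical helper: write a user_prefs file with the settings listed above.
# $1 is the prefs directory, e.g. /path/to/masses/spamassassin (assumed layout).
write_user_prefs() {
    prefs_dir=$1
    mkdir -p "$prefs_dir" || return 1
    # bayes_path is derived from the directory; adjust it to your own tree.
    cat > "$prefs_dir/user_prefs" <<EOF
use_bayes 0
lock_method flock
bayes_path $prefs_dir/bayes
use_pyzor 0
use_auto_whitelist 0
dns_available yes
EOF
}

# Example call (commented out so that sourcing this file writes nothing):
# write_user_prefs "$PWD/spamassassin"
```

Any trusted_networks or internal_networks lines would still need to be
appended by hand, since they are site-specific.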

> >Couldn't lock so it skipped the write.
> 
> Could that be because of "-j=10" in the weekly run? If yes, then it's a 
> bug I guess, as the -j option is not usable.

Yes, I'd imagine so, and it's not a bug.  With 10 processes potentially all
trying to get the lock to do a write, you'll start running into contention
pretty quickly.  Any time there's contention, a process may have to give up
trying and move on.

Personally, I find -j=10 pretty large.  On my fairly beefy server w/
very beefy network connection, I still only use -j=4 (2xCPU).  If you
arbitrarily chose 10, I'd play around and see what results you get with
other (probably lower) numbers.  Generally it's recommended to only go
2-3xCPU with network checks, and that largely depends on your network
connection -- if you're using --reuse (which you should with --net,
assuming the messages already have markup), then the number ought to go
down since you'll be waiting less for network traffic.
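
As a rough sketch of that rule of thumb, a -j value can be derived from the
CPU count.  The helper name suggest_jobs and the multipliers passed to it
are assumptions for illustration; the only guidance taken from the text is
"at most 2-3xCPU with network checks, fewer otherwise".

```shell
# Hypothetical helper: multiply a CPU count by a chosen factor to get -j.
# $1 = number of CPUs, $2 = multiplier (1 for local-only, 2-3 with --net).
suggest_jobs() {
    cpus=$1
    multiplier=$2
    echo $((cpus * multiplier))
}

# Detect online CPUs where getconf supports it; fall back to 1 otherwise.
cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "local-only run: -j $(suggest_jobs "$cpus" 1)"
echo "network run:    -j $(suggest_jobs "$cpus" 2)"
```

Treat the result as a starting point and tune downward if lock contention
or network waits show up.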

So just for completeness, here's what I tend to use for parameters to
mass-check for nightly runs:
--progress --cache --all -j 2 -n --after "-90 days"

--progress is optional.  --cache is also optional (it's really useful for
multiple runs while tuning a rule locally before committing, but since I
already have the cache files built, I leave it on).  --all is optional.
"-j 2" is for set0 since I have 2 CPUs.  -n saves time, etc. for
nightly/weekly runs since we don't care about running the messages in
received order.  I use --after to limit which messages are used from the
list of files/dirs I provide -- usually I'll list 90-365 days of dirs and
let SA sort it out (--cache is a huge win here too, btw).

for weekly runs:
--progress --cache --all --net --reuse -j 4 -n --after "-45 days"

Basically the same thing.  Add --net (obvious) and --reuse (reuses the markup
already in the message for network checks -- faster, and it also provides
"real-time" results).  Since there'll be some network traffic, I up the -j
parameter to 4.  However, since I've found that network checks tend to take a
while, I cut the number of messages by setting --after to look at only 45
days instead of 90.
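
To keep the two parameter sets side by side, something like the following
sketch could live in a wrapper script.  The function name mass_params and
the run-type labels are assumptions; the option strings themselves are
copied from the two command lines above, with -j values that fit a 2-CPU
machine.

```shell
# Hypothetical helper: print the mass-check options for a given run type.
mass_params() {
    case "$1" in
        nightly)
            # Local-only run: one process per CPU on a 2-CPU box.
            echo '--progress --cache --all -j 2 -n --after "-90 days"'
            ;;
        weekly)
            # Network run: --net plus --reuse, more processes, fewer days.
            echo '--progress --cache --all --net --reuse -j 4 -n --after "-45 days"'
            ;;
        *)
            echo "usage: mass_params nightly|weekly" >&2
            return 1
            ;;
    esac
}

mass_params nightly
mass_params weekly
```

The function only prints the options; how they are passed to mass-check,
and which corpus directories follow them, depends on the local setup.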

-- 
Randomly Generated Tagline:
"When it is not necessary to make a decision, it is necessary not to make a
 decision."               - From www.slashdot.org
