On Sun, Nov 13, 2005 at 04:34:38PM +0100, Michael Monnerie wrote:
> MASS_PARAM="--progress --after=-15552000 --net -j=10"
> cd masses
> mkdir spamassassin
> rm -f spamassassin/bayes*
> #echo "use_bayes 0" > spamassassin/user_prefs
>
> As you can see, the bayes stuff is commented out. As I "rsync --delete",
> there should be no options set here that are not set at the source. I
> could enable the "use_bayes 0" line, but I thought I shouldn't change
> it, and left the original settings.

Yeah, we should probably document things a little more. Basically for
the nightly and weekly runs, all we really care about is the rule
output -- things like Bayes, AWL, etc, are superfluous and just waste
CPU cycles. Bayes *is* important for a score generation run, but not so
much during general development. (arguably, if bayes was run we could
see what changes to the tokenizer/etc do, but those have traditionally
been handled in one-off cases since it doesn't happen often)

For the masses/spamassassin/user_prefs file, I usually have this in
there all the time:

  use_bayes 0
  lock_method flock
  bayes_path /home/felicity/SA/spamassassin-head/masses/spamassassin/bayes
  use_pyzor 0
  use_auto_whitelist 0
  dns_available yes

along with any trusted or internal_networks settings, etc. Then if I do
want to enable Bayes, it's a simple bit change at the top and I'm ready
to go. flock is better than the default if you're on local disk, btw.

> >Couldn't lock so it skipped the write.
>
> Could that be because of "-j=10" in the weekly run? If yes, then it's I
> bug I guess, as the -j option is not usable.

Yes, I'd imagine, and not a bug. With 10 processes, potentially, all
trying to get the lock to do a write, you'll start running into
contention pretty quickly. Anytime there's contention, a process may
have to give up trying and move on.

Personally, I find -j=10 pretty large. On my fairly beefy server w/ very
beefy network connection, I still only use -j=4 (2xCPU). If you
arbitrarily chose 10, I'd play around and see what results you get with
other (probably lower) numbers. Generally it's recommended to only go
2-3xCPU with network checks, and that largely depends on your network
connection -- if you're using --reuse (which you should with --net,
assuming the messages already have markup), then the number ought to go
down since you'll be waiting less for network traffic.

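To make the -j rule of thumb above a bit more concrete, here's a minimal
shell sketch. suggest_jobs is a hypothetical helper (it is not part of
SpamAssassin or mass-check), and the multipliers are only assumed
starting points taken from the numbers above: 1xCPU without network
checks, up to 3xCPU with --net, and 2xCPU with --net --reuse. Treat the
result as something to tune, not a hard rule.

```shell
# Hypothetical helper: suggest a starting -j value for mass-check.
# The multipliers are rules of thumb (assumptions), not anything
# mass-check itself enforces.
suggest_jobs() {
    cpus=$1   # number of CPUs on the box
    mode=$2   # "local" (no network checks), "net", or "reuse"
    case "$mode" in
        local) echo "$cpus" ;;          # CPU-bound: one job per CPU
        net)   echo $((cpus * 3)) ;;    # upper end of the 2-3xCPU range
        reuse) echo $((cpus * 2)) ;;    # --net --reuse: less waiting, fewer jobs
        *)     echo "unknown mode: $mode" >&2
               return 1 ;;
    esac
}
```

On a 2-CPU box, "suggest_jobs 2 reuse" prints 4, which matches the -j=4
mentioned above. If the result leads to lock contention or a saturated
network link, go lower.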
So just for completeness, here's what I tend to use for parameters to
mass-check:

  --progress --cache --all -j 2 -n --after "-90 days"

progress is optional. cache is also optional (it's really useful for
the multiple runs while tuning a rule locally before committing, but
since I already have the cache files built...) all is optional. "-j 2"
is for set0 since I have 2 CPUs. -n saves time/etc for nightly/weekly
runs since we don't care about running the messages in received order.
I use --after to limit what messages are used from the list of
files/dirs I provide -- usually I'll list 90-365 days of dirs and let SA
sort it out (--cache is a huge win here too, btw...)

for weekly runs:

  --progress --cache --all --net --reuse -j 4 -n --after "-45 days"

Basically the same thing. add --net (obvious) and --reuse (reuses the
markup already in the message for network checks -- faster and also
provides "real-time" results). Since there'll be some network traffic,
I up the -j parameter to 4. However, since I found that network delays
tend to take a while, I drop the number of messages by setting --after
to only look at 45 days instead of 90.

-- 
Randomly Generated Tagline:
"When it is not necessary to make a decision, it is necessary not to make
 a decision." - From www.slashdot.org
