Here's an interesting observation.
I set bayes_auto_expire to 0, as what I thought would be a temporary
workaround, and restarted spamd. The hogging still occurs at least as
often as before. Am I looking in the wrong direction, or shouldn't this
have helped at least somewhat?
Another observation:
# sa-learn --dump magic
bayes: cannot open bayes databases /usr/local/share/spamassassin/bayes/bayes_* R/W: lock failed: Interrupted system call
0.000 0 3 0 non-token data: bayes db version
0.000 0 437041 0 non-token data: nspam
0.000 0 253396 0 non-token data: nham
0.000 0 4616765 0 non-token data: ntokens
0.000 0 1156977303 0 non-token data: oldest atime
0.000 0 1159200779 0 non-token data: newest atime
0.000 0 1159199860 0 non-token data: last journal sync atime
0.000 0 1158904222 0 non-token data: last expiry atime
0.000 0 0 0 non-token data: last expire atime delta
0.000 0 0 0 non-token data: last expire reduction count
The last expiry atime converts to September 22, the same day my
problems started. But if the hogging continues even with
bayes_auto_expire set to 0, then where should I be looking instead?
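For reference, the atime values in the dump are Unix epoch seconds, so
the conversion can be checked with a few lines of Python. This is only
a sketch: it assumes UTC, and the helper name to_utc is made up for the
example, not part of SpamAssassin.

```python
from datetime import datetime, timezone

# Epoch values copied from the `sa-learn --dump magic` output above.
MAGIC_ATIMES = {
    "oldest atime": 1156977303,
    "newest atime": 1159200779,
    "last journal sync atime": 1159199860,
    "last expiry atime": 1158904222,
}


def to_utc(epoch_seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


if __name__ == "__main__":
    for name, value in MAGIC_ATIMES.items():
        print(f"{name:24s} {to_utc(value).isoformat()}")
```

In UTC the last expiry atime falls on 2006-09-22; the local date can
differ near midnight depending on the server's time zone.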
Regards,
Andreas
Andreas Pettersson wrote:
Me again. Since I'm not getting any responses, I'd better keep posting
more information, as I've done some more investigating today.
Sometimes when I run sa-learn --force-expire I get this response
almost immediately:
Bus error (core dumped)
When I run it again, the process just hogs the CPU until I interrupt it
after about 15 minutes.
I have also changed bayes_learn_to_journal back to 0 and lock_method
to flock.
Now I get these in spamd.log:
Mon Sep 25 17:05:18 2006 [8853] warn: bayes: cannot open bayes databases /usr/local/share/spamassassin/bayes/bayes_* R/W: lock failed: Interrupted system call
I also lowered --max-children from 8 to 6 with this result:
Mon Sep 25 17:11:03 2006 [6702] info: prefork: server reached --max-children setting, consider raising it
Here's some top output of a typical situation:
  PID USERNAME PRI NICE   SIZE    RES STATE  TIME   WCPU    CPU COMMAND
 8287 spamd    132    0 48056K 44220K RUN    8:00 88.43% 88.43% perl5.8.7
 8853 spamd     20    0 40416K 38356K lockf  0:11  1.32%  1.32% perl5.8.7
 9128 spamd     20    0 38592K 36544K lockf  0:03  0.63%  0.63% perl5.8.7
 8879 spamd     20    0 40804K 38484K lockf  0:08  0.59%  0.59% perl5.8.7
 9103 spamd     20    0 39728K 37736K lockf  0:04  0.54%  0.54% perl5.8.7
-rw------- 1 spamd wheel 45 Sep 25 17:04 bayes.mutex
-rw------- 1 spamd wheel 240024 Sep 25 17:15 bayes_journal
-rw------- 1 spamd wheel 1039920 Sep 25 17:04 bayes_journal.old
-rw-r--r-- 1 spamd wheel 83787776 Sep 25 16:09 bayes_seen
-rw------- 1 spamd wheel 85901312 Sep 25 17:04 bayes_toks
# cat bayes.mutex
8287
6708
6708
6708
6708
6708
6708
6708
6708
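To see at a glance which PIDs the file holds, the lines can be tallied.
A small Python sketch, assuming the file contains one PID per line as
shown above (count_pids is a made-up helper name):

```python
from collections import Counter

# Contents of bayes.mutex as shown above, one PID per line.
MUTEX_CONTENTS = """\
8287
6708
6708
6708
6708
6708
6708
6708
6708
"""


def count_pids(text):
    """Count how many times each PID appears in the mutex file contents."""
    return Counter(int(line) for line in text.split())


if __name__ == "__main__":
    for pid, count in count_pids(MUTEX_CONTENTS).most_common():
        print(pid, count)
```

PID 8287 is the process in RUN state in the top output above; 6708
does not appear there. The tally only summarises the file, it does not
say what the repeated entries mean.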
What is wrong?! What is making spamd go *kaboom* several times an hour?
Is it something with expiring tokens that's not working correctly?
Is it normal to have a bayes_journal.old lying around?
What more can I do to find the cause?
If the core dump (22 MB) is of any interest, I'll upload it somewhere.
Best regards,
Andreas
Andreas Pettersson wrote:
Ok, more information here.
I found this line in spamd.log from when the problem started:
Fri Sep 22 19:55:22 2006 [74581] warn: bayes: expire_old_tokens: child processing timeout at /usr/local/bin/spamd line 1082
which was followed by lots of these:
Fri Sep 22 19:55:52 2006 [74581] warn: bayes: cannot open bayes databases /usr/local/share/spamassassin/bayes/bayes_* R/W: lock failed: File exists
In an attempt to find what's wrong I changed bayes_learn_to_journal
to 1. That didn't fix the problem, but at least I got rid of the
'lock failed: File exists' error messages in spamd.log, and bayes
keeps working.
For the moment I have a script that checks for the existence of
bayes.lock, kills the hogging process and removes the lock file. It
runs every minute.
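A more cautious variant of such a watchdog would only flag the lock
file once it has been around for a while. The following Python sketch
is not the actual script. The lock file path (the directory from the
log messages plus the bayes.lock name) and the 300-second threshold are
assumptions, and the kill and remove steps are deliberately left out.

```python
import os
import time

# Assumed location: the directory from the log messages plus the bayes.lock
# name mentioned above. Adjust to the real bayes_path.
LOCK_FILE = "/usr/local/share/spamassassin/bayes/bayes.lock"
MAX_AGE_SECONDS = 300  # hypothetical threshold, not a SpamAssassin setting


def lock_age_seconds(path, now=None):
    """Return the lock file's age in seconds, or None if it does not exist."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return None
    if now is None:
        now = time.time()
    return now - mtime


def is_stale(path, max_age=MAX_AGE_SECONDS, now=None):
    """True only if the lock file exists and is older than max_age seconds."""
    age = lock_age_seconds(path, now)
    return age is not None and age > max_age


if __name__ == "__main__":
    if is_stale(LOCK_FILE):
        print(f"stale lock: {LOCK_FILE}")
```

Run from cron every minute, this would report a lock only after it has
outlived the threshold, which avoids acting on a lock that a healthy
process is holding briefly.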
I have tried changing lock_method to flock; the problem is still there
(but with a new lock file name).
I also tried running sa-learn --force-expire. It took about 30 seconds
to complete, but it didn't solve my problem either.
Any ideas of what might be wrong?
Regards,
Andreas