Here's an interesting observation.
I set bayes_auto_expire to 0 as what I thought would be a temporary workaround and restarted spamd, but the hogging occurs at least as often as before. Am I looking in the wrong direction, or shouldn't this change have helped?
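For reference, this is what the change amounts to in local.cf (the path below is just where the FreeBSD port puts it; adjust as needed):

  # /usr/local/etc/mail/spamassassin/local.cf
  bayes_auto_expire 0    # disable automatic (opportunistic) token expiry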

Another observation:
# sa-learn --dump magic
bayes: cannot open bayes databases /usr/local/share/spamassassin/bayes/bayes_* R/W: lock failed: Interrupted system call
0.000          0          3          0  non-token data: bayes db version
0.000          0     437041          0  non-token data: nspam
0.000          0     253396          0  non-token data: nham
0.000          0    4616765          0  non-token data: ntokens
0.000          0 1156977303          0  non-token data: oldest atime
0.000          0 1159200779          0  non-token data: newest atime
0.000          0 1159199860          0  non-token data: last journal sync atime
0.000          0 1158904222          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
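(Those atime columns are Unix epoch seconds; they can be converted to readable dates with the date command, e.g.

  # date -r 1158904222        (BSD date)
  # date -d @1158904222       (GNU date)

for the last expiry value.)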

The last expiry atime converts to September 22, the same day my problems started. But if the hogging continues even with bayes_auto_expire set to 0, where should I be looking instead?

Regards,
Andreas



Andreas Pettersson wrote:

Me again. Since I'm not getting any responses, I'd better keep posting more information, as I've done some more investigating today.

Sometimes when I run sa-learn --force-expire I get this response almost immediately:
Bus error (core dumped)
When I run it again, the process just hogs the CPU until I break it after about 15 minutes.
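(For what it's worth, sa-learn also has a -D/--debug switch, so something like

  # sa-learn -D --force-expire

should at least show which stage it stalls in.)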

I have also changed bayes_learn_to_journal back to 0 and lock_method to flock.
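In local.cf that now amounts to (other settings omitted):

  bayes_learn_to_journal 0
  lock_method flock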

Now I get these in spamd.log:
Mon Sep 25 17:05:18 2006 [8853] warn: bayes: cannot open bayes databases /usr/local/share/spamassassin/bayes/bayes_* R/W: lock failed: Interrupted system call

I also lowered --max-children from 8 to 6 with this result:
Mon Sep 25 17:11:03 2006 [6702] info: prefork: server reached --max-children setting, consider raising it

Here's some top output of a typical situation:
 PID USERNAME PRI NICE   SIZE    RES STATE    TIME   WCPU    CPU COMMAND
8287 spamd    132    0 48056K 44220K RUN      8:00 88.43% 88.43% perl5.8.7
8853 spamd     20    0 40416K 38356K lockf    0:11  1.32%  1.32% perl5.8.7
9128 spamd     20    0 38592K 36544K lockf    0:03  0.63%  0.63% perl5.8.7
8879 spamd     20    0 40804K 38484K lockf    0:08  0.59%  0.59% perl5.8.7
9103 spamd     20    0 39728K 37736K lockf    0:04  0.54%  0.54% perl5.8.7

-rw-------  1 spamd  wheel        45 Sep 25 17:04 bayes.mutex
-rw-------  1 spamd  wheel    240024 Sep 25 17:15 bayes_journal
-rw-------  1 spamd  wheel   1039920 Sep 25 17:04 bayes_journal.old
-rw-r--r--  1 spamd  wheel  83787776 Sep 25 16:09 bayes_seen
-rw-------  1 spamd  wheel  85901312 Sep 25 17:04 bayes_toks

# cat bayes.mutex
8287
6708
6708
6708
6708
6708
6708
6708
6708


What is wrong?! What is making spamd go *kaboom* several times an hour?
Is token expiry somehow not working correctly?
Is it normal to have a bayes_journal.old lying around?
What more can I do to find the cause?

If the core dump (22 MB) is of any interest, I'll upload it somewhere.
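(A quick backtrace from the dump might already say something. Roughly, assuming the dump comes from the perl binary running spamd and FreeBSD's default core naming:

  # gdb /usr/local/bin/perl5.8.7 perl5.8.7.core
  (gdb) bt

The binary path and core file name are guesses.)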



Best regards,
Andreas





Andreas Pettersson wrote:

Ok, more information here.

I found this line in spamd.log from when the problem started:
Fri Sep 22 19:55:22 2006 [74581] warn: bayes: expire_old_tokens: child processing timeout at /usr/local/bin/spamd line 1082

which was followed by lots of these:
Fri Sep 22 19:55:52 2006 [74581] warn: bayes: cannot open bayes databases /usr/local/share/spamassassin/bayes/bayes_* R/W: lock failed: File exists

In an attempt to find what's wrong I changed bayes_learn_to_journal to 1. It didn't help, but at least I got rid of the 'lock failed: File exists' error messages in spamd.log, and bayes keeps working. For the moment I have a script that checks whether bayes.lock exists, kills the hogging process and removes the lock file; it runs every minute (a rough sketch of it is below).
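Roughly, the watchdog looks like this (a simplified sketch; the paths and the assumption that the first line of the lock file holds the owning PID reflect my setup and may not match others):

  #!/bin/sh
  # Simplified sketch of the once-a-minute watchdog, run from cron.
  # Assumes the first line of bayes.lock is the PID of the lock holder.
  BAYES_DIR=/usr/local/share/spamassassin/bayes
  LOCK=$BAYES_DIR/bayes.lock

  if [ -f "$LOCK" ]; then
      PID=`head -n 1 "$LOCK"`
      # Kill the spamd child sitting on the lock, then clean up the lock files
      [ -n "$PID" ] && kill -9 "$PID" 2>/dev/null
      rm -f "$LOCK" "$LOCK".*
  fi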


I have tried changing lock_method to flock; the problem is still there (but with a new lock file name). I also tried sa-learn --force-expire. It took about 30 seconds to complete, but didn't solve my problem either.


Any idea what might be wrong?

Regards,
Andreas




