On May 24, 2005, at 19:30, Damian Menscher wrote:
On Tue, 24 May 2005, Doug Hardie wrote:
On May 24, 2005, at 13:21, Stephen Gran wrote:
On Tue, May 24, 2005 at 12:54:47PM -0700, Doug Hardie said:
http://www.lafn.org/clamav/ktrace.html
http://www.lafn.org/clamav/clamd.html
clamav-milter is only one process. It has multiple threads but
those are not visible to the kernel. The problem does not occur
immediately with a database reload. It takes 10 or so minutes
before it hangs/quits. I suspect that the problem occurs when
there are active messages that do not complete before some timeout
value. clamav-milter is waiting for everything to go quiet, but
on my receiving mail server that never happens. There are always
30-40 active sendmail children. As a result it never goes quiet.
I suspect that clamav-milter eventually gives up and that's when
the problem occurs. On my outgoing mail server which handles
considerably less mail, most of the database updates do not cause
a problem. On my test server, which handles 3 emails daily, it never
causes a problem.
Just to bring you (and anyone else joining us) up to speed, here's
a description of how it's supposed to work:
When there's a database update, the milter wants everything to be
quiet. So it stops accepting new connections. It then waits for
the currently-running children to finish. Once n_children drops to
0, it reloads the database and resumes accepting connections.
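
For anyone who finds it easier to read as code, here is a rough sketch
of that sequence. It is a toy model in Python, not the milter's actual
C source: the class name ReloadGate, its method names, and the
optional timeout are all made up for illustration. Only the
accepting flag and the n_children counter correspond to things
mentioned in this thread.

```python
import threading


class ReloadGate:
    """Toy model of the stop-accepting / drain / reload / resume sequence."""

    def __init__(self):
        self._cond = threading.Condition()
        self.accepting = True   # False while a reload is pending
        self.n_children = 0     # sessions currently in flight
        self.reloads = 0        # number of completed (simulated) reloads

    def begin_session(self):
        """Block while a reload is pending, then count the new session."""
        with self._cond:
            while not self.accepting:
                self._cond.wait()
            self.n_children += 1

    def end_session(self):
        """Uncount a finished session and wake a waiting reload."""
        with self._cond:
            self.n_children -= 1
            self._cond.notify_all()

    def reload_database(self, timeout=None):
        """Stop accepting, wait for n_children == 0, reload, resume.

        Returns True if the reload happened. The timeout is an assumption
        of this sketch: if it expires with sessions still active, the
        reload is skipped and False is returned.
        """
        with self._cond:
            self.accepting = False
            drained = self._cond.wait_for(
                lambda: self.n_children == 0, timeout
            )
            if drained:
                self.reloads += 1   # stand-in for the real database reload
            self.accepting = True   # resume accepting either way
            self._cond.notify_all()
            return drained
```

With timeout=None, reload_database() waits for as long as n_children
stays above 0, and begin_session() blocks for the same period.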
At least, that's the theory. In practice, n_children isn't ever
hitting 0, so it stays in the !accepting state forever. For
example, in the ktrace you posted, n_children dropped from 7 down
to 2. The fact that it never reached 0 is the entire problem. Of
course, nobody knows *why* it isn't reaching 0. It might be from a
hung scanner thread, or from a pthreads race condition, or even a
locking issue.
The hope was that getting an strace of each thread of a hung milter
would provide information on which of those causes was at fault,
and perhaps enable us to actually locate the bug.
I frequently see sendmail children alive for over 30 minutes and
sometimes considerably longer. Some connections are very slow at
transferring data. I would guess it's just not waiting long enough.