David Blank-Edelman wrote:
Howdy-

I just wanted to pop in and provide the latest update on our saga (clamd 0.83 just stops playing nice after running for a while) with some more interesting information like stack traces.

Last we left off I had just upped the ulimit for the clamd process from the default of 256 fds to 1024. I can't tell if this truly helped things, but the number of times a day our babysitting process restarted clamd because it couldn't connect went down considerably. We have whole stretches of days at a time with nary a restart. Things got a little worse today for no reason I can discern.
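
For anyone who wants to check or raise the descriptor limit from inside a process rather than from the shell, here is a minimal Python sketch. It is not part of clamd; the helper name `raise_fd_limit` is made up for illustration, and the 1024 target only mirrors the number mentioned above. It raises the soft limit toward the target but never past the hard limit.

```python
import resource

def raise_fd_limit(target=1024):
    """Raise the soft RLIMIT_NOFILE toward `target`; return the resulting soft limit.

    Hypothetical helper for illustration only.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # The soft limit may never exceed the hard limit, so clamp the target.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Nothing to do if the soft limit is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Calling `raise_fd_limit()` early in a wrapper script and logging the return value would at least confirm which limit the process really runs with.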

Since people began to help us, I have periodically checked the process's file descriptors with pfiles and the memory on the machine, and neither seemed to be even close to being pegged. We also switched from using a network socket to a local unix socket just to eliminate any funny business. You may take comfort knowing that we're going to be rev'ing everything (all dependent libraries and our MTA) on Wednesday to eliminate all of those possibilities as well.

Today I managed to catch clamd in a hung state and so I poked and prodded at it with gdb. Btw, by hung I mean that attempts to contact clamd on the local socket failed with "connection refused" from clamdmon.
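
To make "hung" concrete, the kind of check a monitor performs can be sketched in Python as below. This is not clamdmon's actual code; the function name `clamd_alive` and the socket path you pass in are assumptions. It relies only on clamd's PING command, which is answered with PONG.

```python
import socket

def clamd_alive(sock_path, timeout=5.0):
    """Return True if whatever listens on sock_path answers PING with PONG.

    Hypothetical probe for illustration; sock_path is whatever the
    local socket is set to in your clamd configuration.
    """
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(sock_path)
        s.sendall(b"PING\n")
        reply = s.recv(16)
        return reply.strip() == b"PONG"
    except OSError:
        # Covers "connection refused", a missing socket file, and timeouts.
        return False
    finally:
        s.close()
```

A refused connection, as seen here, lands in the `except` branch and is reported as "not alive", which is the condition that triggers a restart.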

I wasn't quite sure what I was looking for, so the following might be too little, too much, or the wrong info. If there was something I should have done, please let me know and I'll do it next time. Here's what I found:

info threads
  2 LWP 271  0xfef1e878 in _read () from /usr/lib/libc.so.1
* 1 LWP 1  0xfee45dd4 in __lwp_park () from /usr/lib/libthread.so.1

Thread 1, presumably the thing that should be listening for new connects:

#0  0xfee45dd4 in __lwp_park () from /usr/lib/libthread.so.1
#1  0xfee430ec in cond_wait_queue () from /usr/lib/libthread.so.1
#2  0xfee438a8 in cond_wait () from /usr/lib/libthread.so.1
#3  0xfee438e4 in pthread_cond_wait () from /usr/lib/libthread.so.1
#4  0x0001864c in thrmgr_destroy ()
#5  0x0001a19c in acceptloop_th ()
#6  0x00017ac4 in localserver ()
#7  0x00017190 in clamd ()
#8  0x00015d5c in main ()

A truss confirmed that it just stayed parked like that.

Thread 2 (which was going like a busy bee, appearing to actually still be scanning based on a truss of the process):

thread 2
[Switching to thread 2 (LWP 271)]#0  0xfef1e878 in _read ()
   from /usr/lib/libc.so.1
(gdb) where
#0  0xfef1e878 in _read () from /usr/lib/libc.so.1
#1  0xfee3dd90 in read () from /usr/lib/libthread.so.1
#2  0xff30b570 in cli_scandesc ()
   from /priv/daemons/packages/clamav-0.83/lib/libclamav.so.1
[....]
#58 0xff3191d4 in cli_scanmail ()
   from /priv/daemons/packages/clamav-0.83/lib/libclamav.so.1
#59 0xff319cc4 in cli_magic_scandesc ()
   from /priv/daemons/packages/clamav-0.83/lib/libclamav.so.1
#60 0xff319ee4 in cl_scandesc ()
   from /priv/daemons/packages/clamav-0.83/lib/libclamav.so.1
#61 0xff31a008 in cl_scanfile ()
   from /priv/daemons/packages/clamav-0.83/lib/libclamav.so.1
#62 0x0001a850 in dirscan ()
#63 0x0001ad20 in scan ()
#64 0x00017ca4 in command ()
#65 0x00018dc4 in scanner_thread ()
#66 0x00018a20 in thrmgr_worker ()

> [ open files omitted ]

I'm not quite sure how to interpret this information. Does this mean the main thread was parked waiting for the second to complete what it was doing? Something else entirely going on?


Thanks again for any help you can offer.

This definitely looks like a mail scan with 17 attachments (or levels of attachments?), and a thread manager that, after a database update, is waiting for the mail scan to finish.
Tomasz? Trog?
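
To illustrate the pattern I mean (this is a Python sketch, not the actual thrmgr code in clamd; `TinyPool` and all its names are invented): the thread tearing down the pool sits in a condition wait until the last worker has finished, which matches thread 1 being parked in pthread_cond_wait under thrmgr_destroy while thread 2 is still scanning.

```python
import threading
import time

class TinyPool:
    """Toy thread pool whose destroy() blocks until all workers are done."""

    def __init__(self):
        self.cond = threading.Condition()
        self.alive = 0  # number of workers still running

    def dispatch(self, job):
        with self.cond:
            self.alive += 1
        threading.Thread(target=self._worker, args=(job,)).start()

    def _worker(self, job):
        try:
            job()
        finally:
            with self.cond:
                self.alive -= 1
                self.cond.notify_all()  # wake a waiting destroy()

    def destroy(self):
        # Parks here, much like thread 1 in the trace, until no worker is left.
        with self.cond:
            while self.alive > 0:
                self.cond.wait()

def demo(scan_seconds=0.2):
    """Run one slow 'scan' and return how long destroy() had to wait."""
    pool = TinyPool()
    pool.dispatch(lambda: time.sleep(scan_seconds))
    start = time.monotonic()
    pool.destroy()
    return time.monotonic() - start
```

While `destroy()` is waiting, nothing in this toy accepts new work, so one very long scan keeps the whole thing unavailable for that long.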



Thomas Lamy

PS: Sorry for the small confusion about attachments vs attachment levels, but I'm not too deep into Nigel's mail code for 0.82+.
_______________________________________________
http://lurker.clamav.net/list/clamav-users.html
