Dan - Thanks for the e-mail.
I'll work with these suggestions:

>If you can get a stack trace out of the process (gdb it and run "thread
>apply all bt"), that would help narrow down what's hanging. Also try
>upgrading to a less-buggy glibc, or set the environment variable
>LD_ASSUME_KERNEL=2.4.1. Any time I see a process hang on a futex,
>setting that has fixed it (it disables futexes entirely).

and report what I find.

First, I'll run the stack trace on the process as it is. I don't use gdb
frequently, so if you have recommendations that are more specific than
what you've already provided, please e-mail me again. I'll try the env
fix, then the glibc update if I can (I'll have to check some
dependencies). It probably won't be until next week.

Chris Crowley

Dan Nelson wrote:

>In the last episode (Feb 09), Chris Crowley said:
>
>>My question: "Is CVS for the 0.3.0 branch improved from the distro,
>>and stable for production use?" If not, I'll drill down into the
>>problems with the 0.3.0 tar file that I've got; otherwise, I'll
>>install the CVS version and see if the problems persist.
>
>Only minor changes have been made since 0.3.0; none should affect
>stability one way or the other. My milters never seem to crash or
>hang, but I only process 1 message every 5-10 seconds. Each milter
>thread is independent, so (barring OS bugs) hangs/crashes due to race
>conditions should not be possible.
>
>>...details...
>>I've been running 0.2.0, and plan to upgrade soon. I've built 0.3.0,
>>and have noticed in some high load testing that it fails differently
>>than the 0.2.0 spamass-milter. By failure I mean that I see error
>>messages in the log. For example:
>><log>
>>Milter (spamassassin): local socket name /var/run/sendmail/spamass.sock unsafe
>>sendmail[10000]: ###ID: Milter (spamassassin): to error state
>>spamass-milter[13360]: SpamAssassin, mi_rd_cmd: read returned -1: Connection reset by peer
>>spamass-milter[19980]: SpamAssassin: thread_create() failed: 12, try again
>></log>
>>
>>and a strace on the process shows that it is "hung":
>><strace>
>>strace -p 13360
>>Process 13360 attached - interrupt to quit
>>futex(0xc9e20c, FUTEX_WAIT, 2, NULL <unfinished ...>
>></strace>
>
>If you can get a stack trace out of the process (gdb it and run "thread
>apply all bt"), that would help narrow down what's hanging. Also try
>upgrading to a less-buggy glibc, or set the environment variable
>LD_ASSUME_KERNEL=2.4.1. Any time I see a process hang on a futex,
>setting that has fixed it (it disables futexes entirely).
>
>>From the logs, and a quick non-scientific assessment, I don't think
>>that 0.3.0 is failing any less frequently than 0.2.0 was. It's just
>>that the 0.3.0 process actually persists after it fails, so my
>>restart script (which checks whether the socket exists) doesn't work
>>to repair things.
>>
>>Thanks for any insight you can provide. Of course, I'm able to
>>provide more details if they would be beneficial.

-- 
Christopher Crowley
Network Administrator
Tulane Technology Services
[EMAIL PROTECTED]
phone: (504) 324-2249

_______________________________________________
Spamass-milt-list mailing list
[email protected]
http://lists.nongnu.org/mailman/listinfo/spamass-milt-list
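
On the restart script mentioned in the quoted message: because the hung 0.3.0 process leaves its socket file in place, a check for the file cannot see the hang. A plain connect() may not help either, since the kernel can queue a connection to a listening socket even when the process never accepts it. Below is a minimal sketch of a probe that sends a milter option-negotiation packet and waits for a reply within a timeout. The function name `milter_responds` is hypothetical, and the negotiation values (version 2, actions 0x3F, protocol 0x7F) are assumptions that should be checked against the installed libmilter headers; none of this is part of spamass-milter itself.

```python
import socket
import struct

# Milter command byte for option negotiation (assumed from libmilter's mfdef.h).
SMFIC_OPTNEG = b"O"


def _recv_exact(sock, n):
    """Read exactly n bytes, or return None if the peer closes first."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            return None
        buf += chunk
    return buf


def milter_responds(sock_path, timeout=5.0, version=2, actions=0x3F, protocol=0x7F):
    """Return True if the milter answers option negotiation within `timeout`.

    The default negotiation values are assumptions, not taken from the
    original message; adjust them for the libmilter version in use.
    """
    # Packet layout: 4-byte big-endian length, command byte, three uint32 fields.
    payload = SMFIC_OPTNEG + struct.pack("!III", version, actions, protocol)
    packet = struct.pack("!I", len(payload)) + payload
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            s.connect(sock_path)
            s.sendall(packet)
            # Reply header: 4-byte length followed by the command byte.
            header = _recv_exact(s, 5)
    except OSError:
        # Missing socket, refused connection, or timeout: treat as unhealthy.
        return False
    return header is not None and header[4:5] == SMFIC_OPTNEG
```

A restart script could call `milter_responds("/var/run/sendmail/spamass.sock")` (the path shown in the log above) and restart the milter when it returns False. A milter that rejects the negotiation and closes the connection would also be reported as unhealthy by this sketch, so the values need to match what the milter expects.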
