--On Tuesday, May 14, 2002 00:23:07 -0500 Dustin Puryear <[EMAIL PROTECTED]> wrote:
> At 11:13 PM 5/13/2002 -0400, Michael Bacon wrote:
>> Sounds like what we're running into at the moment, which appears to be
>> the master processes ending up with an incorrect count of available
>> workers. The problem occurs when a worker process dies while in the
>> "available" state, and doesn't notify the master. Jeremy Howard
>> recently posted a patch which addresses this problem, by decrementing
>> the "available workers" counter when receiving a SIGCLD, which strikes
>> me as the right way to go. However, his patch is for 2.1.3, and like
>> you, we're using 2.0.16 (the bleeding edge is a bad place
>
> This is extremely interesting. Michael, do you find this happens at
> seemingly random times, though? We can go a week or two with no problems,
> and then bam, I get a 911. Of course, our volume is considerably lower
> than yours. Another issue, and one that may differentiate our problems
> from yours (but hopefully not, as you at least have a work-around), is
> that I can sometimes restart Cyrus, and even after a restart, no new
> connections are serviced. (They connect, but get no service.) I've found
> that when this happens, Cyrus will often appear to work for a VERY short
> while, and then revert back to the point where connections occur but no
> service (pop3d) responds.
>
> Shouldn't a restart completely fix the problem? If so, we may be fighting
> something different. A reboot also doesn't always clear up the problem.
> Again, Cyrus will come up, but then fail shortly thereafter.
>
> What is really odd is that the problem just goes away after a few hours.

From what we've seen, and from the best explanation for the behavior that we can come up with, the incorrect-number-of-workers problem is a complicating factor that makes things a lot worse when something else goes wrong. First, it was the database locking problem that was inherent in the 2.0.11 and 2.0.14 builds, which greatly improved with 2.0.16.
Then, for a while, it was database deadlocks on the mailboxes.db file -- we nailed that one down by using a flatfile mailboxes.db. Now, it only seems to happen when we run out of memory, or some other problem disrupts the normal flow. We could just upgrade the RAM (and plan to), but that doesn't address the core issue of unreliability following a problem. So, in a very roundabout way, I'm trying to say that the frequency with which we see the problem appears to be directly tied to the frequency with which we see other problems.

For us, a reboot almost always fixes the problem -- in fact, a restart of the master process does too -- but it may be that you're running into some other issue that we're not, and that's what is causing the worker miscount problem.

A couple of things to try if you're experiencing frequent reliability problems, which may or may not be related to the symptoms you described above:

* Recompile with the mailboxes.db set to use the flatfile format. (Instructions for doing so are somewhere in the list archive -- don't forget to dump your mailboxes file before swapping out the binaries!)

* Set all of your preforks to 0. This may not be as much of a problem with the flatfile mailboxes.db, but we definitely noticed a decrease in reliability with non-zero preforks when we were using the DB-3 code.

* With the service stopped, clean out the /var/imap/db/ and /var/imap/deliverdb/db directories, particularly the "log" files. Also, running ctl_mboxlist -r and ctl_deliver -r may help.

However, you're probably going to have problems until you track down why your processes are dying in the first place, which is what Larry and Ken were getting at.

Michael Bacon
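The cleanup step above might look something like the following. This is a sketch only -- the init script name, install prefix (/usr/cyrus/bin), and the "cyrus" user are assumptions typical of a 2.0.x install and may differ on your system, so check your own layout before running anything like it.

```shell
# Stop the master first -- never touch the db environment while it runs.
/etc/init.d/cyrus stop

# Clear out stale Berkeley DB environment and transaction log files.
rm -f /var/imap/db/log.* /var/imap/db/__db.*
rm -f /var/imap/deliverdb/db/log.* /var/imap/deliverdb/db/__db.*

# Rebuild the mailboxes and delivery databases, as the cyrus user.
su cyrus -c "/usr/cyrus/bin/ctl_mboxlist -r"
su cyrus -c "/usr/cyrus/bin/ctl_deliver -r"

/etc/init.d/cyrus start
```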