--On Tuesday, May 14, 2002 00:23:07 -0500 Dustin Puryear <[EMAIL PROTECTED]> 
wrote:

> At 11:13 PM 5/13/2002 -0400, Michael Bacon wrote:
>> Sounds like what we're running into at the moment, which appears to be
>> the master processes ending up with an incorrect count of available
>> workers. The problem occurs when a worker process dies while in the
>> "available" state, and doesn't notify the master. Jeremy Howard
>> recently posted a patch which addresses this problem, by decrementing
>> the "available workers" counter when receiving a SIGCLD, which strikes
>> me as the right way to go. However, his patch is for 2.1.3, and like
>> you, we're using 2.0.16 (the bleeding edge is a bad place
>
> This is extremely interesting. Michael, do you find this happens at
> seemingly random times though? We can go a week or two with no problems,
> and then bam, I get a 911. Of course, our volume is considerably lower
> than yours. Another issue, and one that may differentiate our problems
> from yours (but hopefully not, as you at least have a work-around), is
> that I can sometimes restart Cyrus, and even after a restart, no new
> connections are serviced. (They connect, but get no service.) I've found
> that when this happens Cyrus will often appear to work for a VERY short
> while, and then revert back to the point where connections occur but no
> service (pop3d) responds.
>
> Shouldn't a restart completely fix the problem? If so we may be fighting
> something different. A reboot also doesn't always clear up the problem.
> Again, Cyrus will come up, but then fail shortly thereafter.
>
> What is really odd is that the problem just goes away after a few hours.


From what we've seen, and from the best explanation for the behavior that 
we can come up with, the incorrect worker count is a complicating factor 
that makes things a lot worse when something else goes wrong.  First it 
was the database locking problem inherent in the 2.0.11 and 2.0.14 builds, 
which improved greatly with 2.0.16.  Then, for a while, it was database 
deadlocks on the mailboxes.db file -- we nailed that one down by switching 
to a flatfile mailboxes.db.  Now it only seems to happen when we run out 
of memory or some other problem disrupts the normal flow.  We could just 
add more RAM (and plan to), but that doesn't address the core issue, which 
is that reliability falls apart after some other problem hits.

So, in a very roundabout way, I'm trying to say that the frequency with 
which we see this problem appears to be directly tied to the frequency 
with which we see other problems.  For us, a reboot almost always fixes 
it -- in fact, a restart of the master process does too -- but it may be 
that you're running into some other issue that we're not, and that's 
what's causing the worker miscount for you.

A few things to try if you're experiencing frequent reliability 
problems, which may or may not be related to the symptoms you described 
above:
        * Recompile with the mailboxes.db set to use the flatfile format. 
(Instructions for doing so are somewhere in the list archive -- don't 
forget to dump your mailboxes file before swapping out the binaries!)
        * Set all of your preforks to 0 (there's a rough example after this 
list).  This may not be as much of a problem with the flatfile 
mailboxes.db, but we definitely noticed a decrease in reliability with 
non-zero preforks when we were using the DB-3 code.
        * With the service stopped, clean out the /var/imap/db/ and 
/var/imap/deliverdb/db directories, particularly the "log" files.  Also, 
running ctl_mboxlist -r and ctl_deliver -r may help.
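
If it helps, here's roughly what the second and third items look like on 
our boxes.  Treat it as a sketch rather than a recipe -- the service names 
and the /var/imap and /usr/cyrus/bin paths are from our install, so adjust 
them to match yours:

    # cyrus.conf -- SERVICES section, with every prefork set to 0
    SERVICES {
      imap      cmd="imapd"   listen="imap"                   prefork=0
      pop3      cmd="pop3d"   listen="pop3"                   prefork=0
      lmtpunix  cmd="lmtpd"   listen="/var/imap/socket/lmtp"  prefork=0
    }

    # with the master stopped: clear the db environment logs, then recover
    rm -f /var/imap/db/log.*
    rm -f /var/imap/deliverdb/db/log.*
    su cyrus -c "/usr/cyrus/bin/ctl_mboxlist -r"
    su cyrus -c "/usr/cyrus/bin/ctl_deliver -r"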

However, you're probably going to have problems until you track down why 
your processes are dying in the first place, which is what Larry and Ken 
were getting at.
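
One last note, since the mechanism came up above: the way I read Jeremy's 
fix, the master decrements its count of ready workers when it catches a 
SIGCLD for a child that was still marked available.  Here's a rough sketch 
of that idea in C -- my own illustration from memory, not his actual 
patch, and all of the names and data layout below are made up:

    #include <sys/types.h>
    #include <sys/wait.h>

    #define MAX_CHILDREN 256
    #define MAX_SERVICES 16

    enum child_state { UNAVAILABLE, AVAILABLE };

    struct child {
        pid_t pid;
        enum child_state state; /* AVAILABLE once the worker reports idle */
        int service;            /* index into ready_workers[] */
    };

    static struct child children[MAX_CHILDREN];
    static int ready_workers[MAX_SERVICES]; /* master's idle-worker counts */

    static void reap_children(int sig)
    {
        pid_t pid;
        int i;

        (void)sig;

        /* collect every child that has exited, without blocking */
        while ((pid = waitpid(-1, NULL, WNOHANG)) > 0) {
            for (i = 0; i < MAX_CHILDREN; i++) {
                if (children[i].pid != pid) continue;

                /* A worker that dies while still marked "available"
                 * never tells the master it's gone, so correct the
                 * idle count here; otherwise ready_workers[] stays
                 * inflated and the master stops forking new workers. */
                if (children[i].state == AVAILABLE)
                    ready_workers[children[i].service]--;

                children[i].pid = 0;
                children[i].state = UNAVAILABLE;
                break;
            }
        }
    }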

Michael Bacon
