I finally had some time to look into this.  It appears the issue is that when
the server says it has X messages, but some of them are errors (couldn't be
read, etc.), the client doesn't realize that it actually has fewer than X
messages to process.  This causes the parent/child mass-check client to
deadlock: the parent reads off the end of the list, therefore never sends a
message to the child, and then waits for the child to return a result while
the child continues to wait for a message.
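
For anyone trying to picture the hang, here's a rough sketch (Python, not the
actual Perl mass-check code; the pipe setup, names, and timeouts are purely
for illustration) of the mutual wait: the parent has nothing usable left to
hand out but still expects a result, while the child is still waiting to be
fed:

    import multiprocessing as mp

    def child(conn):
        # The child expects to be handed a message for every slot it was
        # promised.  poll() with a timeout stands in for the real blocking
        # read so this sketch terminates instead of hanging.
        if conn.poll(2):
            msg = conn.recv()
            conn.send("result for " + msg)
        else:
            print("child: still waiting for a message (would block forever)")

    if __name__ == "__main__":
        parent_end, child_end = mp.Pipe()
        proc = mp.Process(target=child, args=(child_end,))
        proc.start()

        # The server claimed one message, but it turned out to be an error
        # entry, so nothing usable is left to send...
        usable = []
        for msg in usable:
            parent_end.send(msg)

        # ...yet the parent still waits for a result for the slot it thinks
        # it filled, while the child waits to be sent that message.
        if parent_end.poll(3):
            print("parent: got", parent_end.recv())
        else:
            print("parent: still waiting for a result (would block forever)")
        proc.join()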

Looking through the code, there are several other ways this could happen
(lots of "next" and "last" exits while processing the server's response)
beyond just the msg-error entries from the server.  I submitted r721907 to
(hopefully) deal with the issue generically.
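
The general shape of that fix is to pair each result read with a message that
was actually sent, rather than with the count the server advertised up front.
A minimal sketch of the idea (hypothetical helper names and entry format, not
the r721907 diff itself):

    def dispatch(entries, send_to_child, read_result):
        # Hand out only the entries that survive the error checks and read
        # back exactly that many results, instead of trusting the count the
        # server advertised.
        sent = 0
        for entry in entries:
            if entry is None or entry.get("error"):
                # msg-error entries, unreadable messages, and the other
                # "next"/"last" style bail-outs all land here.
                continue
            send_to_child(entry)
            sent += 1
        # One result per message actually sent, so the two sides can't get
        # out of step no matter how many entries were skipped.
        return [read_result() for _ in range(sent)]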

Now I need to go through and find out why the server has so many errors
accessing a non-changing corpus. :(


On Fri, Oct 31, 2008 at 12:06:35PM -0400, Theo Van Dinter wrote:
> This week I noticed that my usual run took around 23h to complete, which
> is much (2x?) longer than usual.  Poking around, it seems that my second
> machine starts running through its message queue and then stops at some
> point, leaving only the first machine to do the processing.
> 
> I'm not going to be able to deal with debugging it for a while, so I
> decided to just turn off the cronjobs for now and take a look in a few
> weeks when I get some time.


-- 
Randomly Selected Tagline:
"Dad, are you okay?  I see food on your plate instead of blurry motions."
         - Lisa on the Simpsons, "Husbands and Knives"
