Vlad, Stephen,
What do you think?
Begin forwarded message:
From: Jeff Rogers <[EMAIL PROTECTED]>
Date: January 12, 2006, 20:34:09 CET
To: [EMAIL PROTECTED]
Subject: [AOLSERVER] aolserver bug
Reply-To: AOLserver Discussion <[EMAIL PROTECTED]>
I found a bug in aolserver 4.0.10 (and previous 4.x versions, not sure
about earlier) that causes the server to lock up. I'm fairly certain I
understand the cause, and my fix appears to work, although I'm not sure
it is the best approach.

The bug: when benchmarking the server with a program like ab with
concurrency=1 (that is, it issues a single request, waits for it to
complete, then immediately issues the next one), the server will lock
up, consuming no cpu but not responding to any requests.
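For reference, the kind of ab run described above looks something like
this (the URL and request count are just placeholders; -c is the
concurrency and -n the total number of requests):

    ab -c 1 -n 10000 http://localhost:8000/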
My explanation: when the max number of threads has been reached and a
new connection is queued (NsQueueConn), it is unable to find a free
connection in the pool, the queueing fails, and the new connection is
added to the wait list (waitPtr). If there is a wait list then no
drivers are polled for new connections (driver.c:801); rather, the
driver waits to be triggered (SockTrigger) to indicate that a thread is
available to handle the connection. The triggering is done when the
connection is completed, within NsSockClose. NsSockClose in turn is
going to be called somewhere within the running of the connection
(ConnRun - queue.c:617). However, the available thread is not put back
onto the queue free list until after ConnRun has completed
(queue.c:638). So if the driver thread runs in the time slice after
ConnRun has completed for all active connections but before they are
added back to the free list, then it attempts to queue the connection,
fails, adds it to the wait list, then waits for a trigger which will
never come, and everything stops.
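To make the ordering concrete, here is a minimal standalone sketch of
the same kind of lost-wakeup race. It is not AOLserver code: the names
(trig, freeConns, ConnThread, DriverThread) are invented, and a sleep()
stands in for the timing window that the benchmark hits by chance. The
"driver" consumes the trigger while the free list is still empty and
then blocks waiting for a trigger that never comes, so on most runs the
program hangs at the final join:

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static int      trig[2];        /* self-pipe standing in for the trigger pipe */
    static int      freeConns = 0;  /* stands in for the queue's free list */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /*
     * Stand-in for a conn thread finishing a request: the trigger is
     * written (as NsSockClose does) before the conn goes back on the
     * free list (which only happens after ConnRun returns).  The sleep
     * forces the bad window.
     */
    static void *
    ConnThread(void *arg)
    {
        write(trig[1], "x", 1);
        sleep(1);
        pthread_mutex_lock(&lock);
        freeConns = 1;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    /*
     * Stand-in for the driver thread: it only re-checks for a free conn
     * when a trigger fires, so once a trigger is consumed while the
     * free list is still empty, it waits forever.
     */
    static void *
    DriverThread(void *arg)
    {
        char c;

        for (;;) {
            read(trig[0], &c, 1);          /* wait for a trigger */
            pthread_mutex_lock(&lock);
            if (freeConns > 0) {
                printf("queued waiting conn\n");
                pthread_mutex_unlock(&lock);
                return NULL;
            }
            pthread_mutex_unlock(&lock);   /* no free conn: wait for next trigger */
        }
    }

    int
    main(void)
    {
        pthread_t d, c;

        pipe(trig);
        pthread_create(&d, NULL, DriverThread, NULL);
        pthread_create(&c, NULL, ConnThread, NULL);
        pthread_join(c, NULL);
        pthread_join(d, NULL);             /* hangs: the trigger was already consumed */
        return 0;
    }

Reversing the two steps in ConnThread (update freeConns first, then
write the trigger) makes the hang impossible, which is exactly the
ordering the fix below enforces.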
The problem is a race condition, and as such is extremely timing
sensitive; I cannot reproduce the problem on a generic setup, but when
I'm benchmarking my OpenACS setup it hits the bug very quickly and
reliably. The explanation suggests, and my testing confirms, that it
seems to occur much less reliably with concurrency > 1 or if there is a
small delay between sending the connections. Together these mean that
the lockup is most likely to show up in exactly my test case, while
being much less likely on a production server or with high-concurrency
load testing.
My solution is to register SockTrigger as a ready proc; ready procs are
run immediately after the freed conns are put back onto the free queue
(queue.c:645). This fixes the problem by ensuring that the trigger pipe
is notified strictly after the free queue is updated, so the waiting
conn will successfully be queued. However, I'm not sure this is best:
NsSockClose attempts to minimize the number of times SockTrigger is
called in the case where multiple connections are being closed at the
same time; my fix means it is called exactly once for each connection,
or twice counting the call in NsSockClose. It's not clear to me what
adverse impact this has, if any, but one thing that could be done is to
remove the SockTrigger calls from NsSockClose as redundant. Some
additional logic could be added to SockTrigger to not send to the
trigger pipe under certain conditions (e.g., if it has been triggered
but not yet acknowledged, or if there is no waiting connection), but
that would require mutex protection, which could ultimately be more
expensive than just blindly triggering the pipe.
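For illustration only, the "don't re-trigger while a trigger is still
pending" idea could look roughly like this; none of these names
(trigPipe, triggerPending, triggerLock, TriggerOnce, AckTrigger) exist
in AOLserver, and the point is just that every trigger and every
acknowledgement would now take a mutex, which is the cost weighed
above:

    #include <pthread.h>
    #include <unistd.h>

    static int      trigPipe[2];          /* the trigger pipe (made-up name) */
    static int      triggerPending = 0;
    static pthread_mutex_t triggerLock = PTHREAD_MUTEX_INITIALIZER;

    /*
     * Write to the trigger pipe only if no earlier trigger is still
     * pending; later callers in the same window are coalesced.
     */
    static void
    TriggerOnce(void)
    {
        int doWrite = 0;

        pthread_mutex_lock(&triggerLock);
        if (!triggerPending) {
            triggerPending = 1;
            doWrite = 1;
        }
        pthread_mutex_unlock(&triggerLock);
        if (doWrite) {
            write(trigPipe[1], "x", 1);
        }
    }

    /*
     * Called by the driver thread once it has drained the trigger pipe,
     * allowing the next trigger through.
     */
    static void
    AckTrigger(void)
    {
        pthread_mutex_lock(&triggerLock);
        triggerPending = 0;
        pthread_mutex_unlock(&triggerLock);
    }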
Here's a context diff for my patch:
*** driver.c.orig Thu Jan 12 11:39:05 2006
--- driver.c Thu Jan 12 11:39:10 2006
***************
*** 773,778 ****
--- 773,781 ----
drvPtr = nextDrvPtr;
}
+ /* register a ready proc to trigger the poll */
+ Ns_RegisterAtReady(SockTrigger,NULL);
+
/*
* Loop forever until signalled to shutdown and all
* connections are complete and gracefully closed.
-J