> I do not believe that we have a scalability problem in the worker MPM.
> I believe we have a scalability problem in our testing tool. I agree
> that there is a problem that can cause some new connections to appear
> to hang under certain unlikely conditions, but I do not believe this can
> cause the server to hang as a whole, nor do I believe that this problem
> can show up enough to cause a ceiling on concurrent request processing.

It doesn't cause a ceiling -- it causes the M+1 request to go hang out in
limbo when it could have been processed by the other child.  This isn't
only likely, it will happen consistently on any server that handles more
than M requests per second on a regular basis.  Any high-end site.

Go ahead and try it with flood -- it has nothing whatsoever to do with the
tool other than the fact that the tool is attempting to make simultaneous
connections very fast.  The same problem will be seen by a busy site trying
to serve many slow connections over time.  I'd show it to you on our own
apache.org server, but that one doesn't use worker.

> Since this is an important issue, and I do not want this to become a
> flame fest, I will describe what I think is happening here:
>
>  The worker MPM has N children.
>  Each child has M threads.
>  Each thread can handle exactly 1 concurrent request.
>
>  In the worst case imagine that M requests were all handled by the same
>  child, and that 1 additional request arrives and is to be handled by
>  that same child. In this case, that last request must wait for 1 of the
>  M busy threads to finish with a previous request before it can be
>  processed. The likelihood of this happening, however, is a function of
>  the ability of the accept mutex to deal with lock contention, and the
>  number of concurrent requests. In my opinion, this likelihood is very
>  small, so small that in normal testing I do not believe we will
>  encounter this scenario.

What do you mean by "worst case"?  That is almost every case.  You are
forgetting that the child has LIFO characteristics, which means it will
handle every request until the M+1 arrives.  Furthermore, you are assuming
that request arrival rates are normally distributed, which simply isn't
the case.  What will happen in the real case is a single child will
accept connections from a sequence of M/4 or so clients until its threads
are busy, and the last client will have one of its requests stuck waiting
for the other threads to be finished with serving a slow client or simply
waiting in lingering close or simply writing the log file.  In other words,
one in every M/4 clients will encounter two or three seconds of additional
latency because the MPM is broken.
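
To make the failure mode concrete, here is a minimal sketch (illustrative
Python, not Apache source; M=25 is an assumed figure): because accept is
effectively LIFO per child, one child keeps winning the accept race until
all M of its threads are busy, and the M+1 connection queues inside that
child even while another child sits entirely idle.

```python
M = 25  # threads per child (assumed figure, matches the example below)

class Child:
    def __init__(self, threads):
        self.idle = threads    # idle worker threads
        self.queued = 0        # connections accepted with no thread free

    def accept(self):
        # current worker MPM behaviour: accept unconditionally
        if self.idle > 0:
            self.idle -= 1
        else:
            self.queued += 1   # the request sits in limbo

busy, spare = Child(M), Child(M)
for _ in range(M + 1):
    busy.accept()              # LIFO accept: the same child wins every time

print(busy.queued, spare.idle)  # → 1 25
```

One connection is stuck behind 25 busy threads while 25 idle threads exist
in the other child -- exactly the "limbo" case described above.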

On a server that is receiving 1000 reqs/sec with 25 threads/child, a child
will become impacted within 25 ms.  That means the 26th request is going
to be sitting around for at least

    (time required to close fastest connection) - 25ms.

which is a buttload of time in Web terms.  Furthermore, the actual
long-lived, multi-minute connections will pile up over time, reducing the
number of "short-timer" threads available on average, and thus increasing
the rate of impact on clients.
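
As a sanity check on the arithmetic above (a back-of-envelope calculation
with the assumed figures, not measured data):

```python
# assumed figures from the example: 1000 reqs/sec, 25 threads per child
reqs_per_sec = 1000
threads_per_child = 25

# time for one child to saturate if it keeps winning the accept race
ms_to_saturate = threads_per_child / reqs_per_sec * 1000
print(ms_to_saturate)  # → 25.0
```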

Brian is right -- the worker MPM must be fixed to not accept connections
when it has no available threads.
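
A hedged sketch of that fix (illustrative Python, not Apache source): a
saturated child simply declines to enter the accept race, so the
connection falls to a child that still has an idle worker thread.

```python
class Child:
    def __init__(self, idle_threads):
        self.idle = idle_threads

    def try_accept(self):
        """Accept only if a worker thread is free to take the socket."""
        if self.idle == 0:
            return False       # stay out of the accept race
        self.idle -= 1
        return True

saturated, spare = Child(0), Child(25)
# the saturated child declines, so the idle child gets the connection
accepted_by_spare = (not saturated.try_accept()) and spare.try_accept()
print(accepted_by_spare)  # → True
```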

....Roy
