> I do not believe that we have a scalability problem in the worker MPM.
> I believe we have a scalability problem in our testing tool. I agree
> that there is a problem that can cause some new connections to appear
> to hang under certain unlikely conditions, but I do not believe this can
> cause the server to hang as a whole, nor do I believe that this problem
> can show up enough to cause a ceiling on concurrent request processing.
It doesn't cause a ceiling -- it causes the M+1st request to go hang out in limbo when it could have been processed by another child. This isn't only likely, it will happen consistently on any server that handles more than M requests per second on a regular basis. Any high-end site. Go ahead and try it with flood -- it has nothing whatsoever to do with the tool other than the fact that the tool is attempting to make simultaneous connections very fast. The same problem will be seen by a busy site trying to serve many slow connections over time. I'd show it to you on our own apache.org server, but that one doesn't use worker.

> Since this is an important issue, and I do not want this to become a
> flame fest, I will describe what I think is happening here:
>
> The worker MPM has N children.
> Each child has M threads.
> Each thread can handle exactly 1 concurrent request.
>
> In the worst case, imagine that M requests were all handled by the same
> child, and that 1 additional request arrives and is to be handled by
> that same child. In this case, that last request must wait for 1 of the
> M busy threads to finish with a previous request before it can be
> processed. The likelihood of this happening, however, is a function of
> the ability of the accept mutex to deal with lock contention, and the
> number of concurrent requests. In my opinion, this likelihood is very
> small, so small that in normal testing I do not believe we will
> encounter this scenario.

What do you mean by "worst case"? That is almost every case. You are forgetting that the child has LIFO characteristics, which means it will handle every request until the M+1st arrives. Furthermore, you are assuming that request arrival rates are normally distributed, which simply isn't the case.
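To make the failure mode concrete, here is a toy model (plain Python, with invented sizes -- the `Child` class and `M` are illustrative, not the actual worker MPM code) of a child that keeps accepting connections even when all M of its threads are busy. The M+1st connection sits in that child's private queue while a sibling child, which could serve it immediately, never sees it:

```python
# Toy model of the broken accept behavior described above.
M = 4  # threads per child (illustrative value)

class Child:
    def __init__(self, name, threads):
        self.name = name
        self.threads = threads
        self.busy = 0      # threads currently serving requests
        self.queue = []    # connections accepted but not yet served

    def accept(self, conn):
        # Broken behavior: accept unconditionally, even with no idle thread.
        if self.busy < self.threads:
            self.busy += 1           # served immediately
        else:
            self.queue.append(conn)  # the M+1st connection goes into limbo

# Child A keeps winning the accept mutex (LIFO wake-up), so it accepts
# the first M+1 connections in a burst while child B sits idle.
a, b = Child("A", M), Child("B", M)
for conn in range(M + 1):
    a.accept(conn)

print(a.busy, len(a.queue), b.busy)  # -> 4 1 0
# Connection 4 now waits for one of A's threads to finish, even though
# child B has M idle threads that could have served it at once.
```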
What will happen in the real case is that a single child will accept connections from a sequence of M/4 or so clients until its threads are busy, and the last client will have one of its requests stuck waiting for the other threads to be finished serving a slow client, or simply waiting in lingering close, or simply writing the log file. In other words, one in every M/4 clients will encounter two or three seconds of additional latency because the MPM is broken. On a server that is receiving 1000 reqs/sec with 25 threads/child, a child will become impacted within 25 ms. That means the 26th request is going to be sitting around for at least (time required to close fastest connection) - 25ms, which is a buttload of time in Web terms. Furthermore, the actual long-lived, multi-minute connections will pile up over time, reducing the number of "short-timer" threads available on average, and thus increasing the rate of impact on clients. Brian is right -- the worker MPM must be fixed to not accept connections when it has no available threads. ....Roy
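The proposed fix can be sketched with the same toy model (again an illustration, not the actual MPM patch): a child checks for an idle thread before competing for the accept mutex, so a saturated child steps aside and the connection is picked up by a sibling instead of sitting in limbo:

```python
# Toy model of the fixed accept behavior: no accept without an idle thread.
M = 4  # threads per child (illustrative value)

class Child:
    def __init__(self, name, threads):
        self.name = name
        self.threads = threads
        self.busy = 0  # threads currently serving requests

    def has_idle_thread(self):
        return self.busy < self.threads

    def serve(self, conn):
        self.busy += 1

def dispatch(conn, children):
    # Fixed behavior: only a child with an idle thread may take the
    # accept mutex; a saturated child never pulls a connection it
    # cannot serve.
    for child in children:
        if child.has_idle_thread():
            child.serve(conn)
            return child.name
    return None  # all children saturated: connection stays in listen queue

a, b = Child("A", M), Child("B", M)
served_by = [dispatch(conn, [a, b]) for conn in range(M + 1)]
print(served_by)  # -> ['A', 'A', 'A', 'A', 'B']
# The M+1st connection goes straight to child B instead of waiting
# behind A's busy threads.
```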