Here's an outline of my latest thinking on how to build a multiple-connections-per-thread MPM for Apache 2.2. I'm eager to hear feedback from others who have been researching this topic.
Thanks,
Brian

Overview
--------

The design described here is a hybrid sync/async architecture:

* Do the slow part of request processing--network reads and
  writes--in an event loop for scalability.

* Do the fast part of request processing--everything other than
  network I/O--in a one-request-per-thread mode so that module
  developers don't have to rewrite all their code as reentrant
  state machines.

Basic structure
---------------

Each httpd child process has four thread pools:

1. Listener thread

   A Listener thread accept(2)s a connection, creates a conn_rec
   for it, and sends it to the Reader thread.

2. Reader thread

   A Reader thread runs a poll loop to watch for incoming data on
   all connections that have been passed to it by a Listener or
   Writer.  It reads the next request from each connection, builds
   a request_rec, and passes the conn_rec and the request_rec on
   to the Request Processor thread pool.

3. Request Processor threads

   Each Request Processor thread handles one request_rec at a
   time.  When it receives a request from the Reader thread, the
   Request Processor runs all the request processing hooks (auth,
   map to storage, handler, etc) except the logger, plus the
   output filter stack except the core_output_filter.  As the
   Request Processor produces output brigades, it sends them to
   the Writer thread pool.  Once the Request Processor has
   finished handling the request, it sends the last of the output
   data, plus the request_rec, to the Writer.

4. Writer thread

   The Writer thread runs a poll loop to output the data for all
   connections that have been passed to it.  When it finishes
   writing the response for a request, the Writer calls the
   logger, destroys the request_rec, and either executes the
   lingering_close on the connection or sends the connection back
   to the Reader, depending on whether the connection is a
   keep-alive.

Component details
-----------------

* Listener thread: This thread will need to use an accept_mutex
  to serialize the accept, just like 2.0 does.

* Passing connections from Listener to Reader: When the Listener
  creates a new connection, it adds it to a global queue and
  writes one byte to a pipe.  The other end of the pipe is in the
  Reader's pollset.  When the poll(2) in the Reader completes,
  the Reader detects the data available on the pipe, reads and
  discards the byte, and retrieves all the new connections from
  the queue.  (A rough sketch of this queue-plus-pipe handoff
  appears below.)

* Passing connections from Reader to Request Processor: When the
  Reader has consumed all the data in a connection, it adds the
  connection and the newly created request_rec to a global queue
  and signals a condition variable.  The idle Request Processor
  threads take turns waiting on the condition variable
  (leader/followers model).  (Also sketched below.)

* Passing output brigades from Request Processor to Writer: Same
  model as the Listener-to-Reader handoff: add to a queue, and
  write a byte to a pipe.

* Bucket management: Implicit in this design is the idea that the
  Writer thread can be writing part of an HTTP response while a
  Request Processor thread is still generating more buckets for
  that request.  This is a good thing because it means that the
  Request Processor thread won't ever find itself blocked on a
  network write, so it can produce all its output quickly and
  move on to another request (which is the key to keeping the
  number of threads low).  However, it does mean that we need a
  thread-safe solution for allocating and destroying buckets and
  brigades.
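
  To make the two handoffs above concrete, here is a rough sketch
  of the queue-plus-pipe mechanism, written against APR.  The
  conn_queue_t type, its ring-buffer layout, and the function names
  are hypothetical, not existing httpd code; only the apr_file_*,
  apr_thread_mutex_*, and pollset calls are real APR APIs.

      #include "httpd.h"              /* conn_rec, request_rec */
      #include "apr_file_io.h"
      #include "apr_poll.h"
      #include "apr_thread_mutex.h"

      typedef struct conn_queue_t {
          apr_thread_mutex_t *lock;
          conn_rec          **conns;   /* ring buffer of pending connections */
          int                 head, tail, size;
      } conn_queue_t;

      static apr_file_t *wakeup_read;   /* this end sits in the Reader's pollset */
      static apr_file_t *wakeup_write;  /* the Listener writes to this end */

      /* Listener side: enqueue the new connection, then wake the Reader */
      static apr_status_t pass_to_reader(conn_queue_t *q, conn_rec *c)
      {
          apr_size_t len = 1;

          apr_thread_mutex_lock(q->lock);
          q->conns[q->tail] = c;
          q->tail = (q->tail + 1) % q->size;
          apr_thread_mutex_unlock(q->lock);

          /* one byte is enough; the Reader reads and discards it */
          return apr_file_write(wakeup_write, "x", &len);
      }

      /* Reader side: called when apr_pollset_poll() reports the pipe readable */
      static void drain_new_connections(conn_queue_t *q, apr_pollset_t *pollset)
      {
          char buf[64];
          apr_size_t len = sizeof(buf);

          apr_file_read(wakeup_read, buf, &len);   /* discard the wakeup byte(s) */

          apr_thread_mutex_lock(q->lock);
          while (q->head != q->tail) {
              conn_rec *c = q->conns[q->head];
              q->head = (q->head + 1) % q->size;
              /* ... add c's socket to the Reader's pollset here ... */
          }
          apr_thread_mutex_unlock(q->lock);
      }

  And a similar sketch of the condition-variable handoff from the
  Reader to the Request Processors (request_queue_t is again a
  hypothetical type, not existing code):

      #include "apr_thread_cond.h"

      typedef struct request_queue_t {
          apr_thread_mutex_t *lock;
          apr_thread_cond_t  *not_empty;
          request_rec       **reqs;    /* ring buffer of parsed requests */
          int                 head, tail, size;
      } request_queue_t;

      /* Reader side: hand a parsed request to one idle Request Processor */
      static void pass_to_processor(request_queue_t *q, request_rec *r)
      {
          apr_thread_mutex_lock(q->lock);
          q->reqs[q->tail] = r;
          q->tail = (q->tail + 1) % q->size;
          apr_thread_cond_signal(q->not_empty);    /* wake exactly one waiter */
          apr_thread_mutex_unlock(q->lock);
      }

      /* Request Processor side: idle threads take turns blocking here */
      static request_rec *wait_for_request(request_queue_t *q)
      {
          request_rec *r;

          apr_thread_mutex_lock(q->lock);
          while (q->head == q->tail) {             /* guards against spurious wakeups */
              apr_thread_cond_wait(q->not_empty, q->lock);
          }
          r = q->reqs[q->head];
          q->head = (q->head + 1) % q->size;
          apr_thread_mutex_unlock(q->lock);
          return r;
      }
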
* request_rec lifetime: When a Request Processor thread has
  produced all of the output for a response, it adds a metadata
  bucket to the last output brigade.  This bucket points to the
  request_rec.  Upon sending the last of the request's output,
  the Writer thread is responsible for calling the logger and
  destroying the request and its pool.

  This would be a major change from how 1.x and 2.0 work.  The
  rationale for it is twofold:

  - Eliminate the need to set aside buckets from the request pool
    into the connection pool in the core_output_filter, which has
    been a source of many bugs in 2.0.

  - Allow for more accurate logging of bytes_sent (e.g., in
    mod_logio) by delaying the logger until the request has
    actually been sent.

  One implication of this change is that the request pool could
  no longer be a sub-pool of the connection pool, unless we make
  subpool creation a thread-safe operation.

Open questions
--------------

* Limiting the Reader and Writer pools to one thread each will
  simplify the design and implementation.  But will this impair
  our ability to take advantage of lots of CPUs?

* Can we eliminate the Listener thread?  It would be faster to
  just have the Reader thread include the listen socket(s) in its
  pollset.  But if we did that, we'd need some new way to
  synchronize the accept handling among multiple child processes,
  because we can't have the Reader thread blocking on an accept
  mutex when it has existing connections to watch.

* Is there a more efficient way to interrupt a thread that's
  blocked in a poll call?  That's a crucial step in the
  Listener-to-Reader and Request-Processor-to-Writer handoffs.
  Writing a byte to a pipe requires two extra syscalls (a read
  and a write) per handoff.  Sending a signal to the target
  thread is the only other solution I can think of at the moment,
  but that's bad because the target thread might be in the middle
  of a read or write call, rather than a poll, at the moment when
  we hit it with a signal, so the read or write will fail with
  EINTR.  Maybe the best solution would be a hybrid: using atomic
  operations, have the Reader maintain a flag that indicates
  whether or not it's blocked in a poll call.  If the Listener
  sees that the Reader is blocked in a poll, it sends a signal to
  the Reader to interrupt the poll; otherwise, it just adds the
  new connection to the queue and expects the Reader to check the
  queue again before its next poll call.  (A sketch of this
  hybrid appears at the end of this message.)

* Do any major modules have a need to do blocking I/O or
  expensive computation within their input handlers?  That would
  cause problems for the single Reader thread, which depends on
  input handlers running quickly so it can get back to its poll
  loop.
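
For the hybrid wakeup idea, here is a rough sketch using APR's
atomic operations.  The reader_state flag and the ordering of "set
the flag, then re-check the queue" are my own additions, included
to avoid a lost-wakeup race; only the apr_atomic_* and pollset
calls are real APR APIs, and conn_queue_t is the hypothetical type
from the earlier sketch.

    #include "apr_atomic.h"
    #include "apr_poll.h"

    #define READER_RUNNING 0   /* Reader is between poll calls */
    #define READER_POLLING 1   /* Reader is (about to be) blocked in poll */

    static volatile apr_uint32_t reader_state = READER_RUNNING;

    /* Reader side: one iteration of the poll loop */
    static void reader_wait_for_events(apr_pollset_t *pollset, conn_queue_t *q)
    {
        apr_int32_t num;
        const apr_pollfd_t *descs;
        int queue_empty;

        /* Publish "polling" BEFORE the final queue check, so a Listener
         * that enqueues after this point is guaranteed to see the flag
         * and send a wakeup; checking in the other order could lose a
         * wakeup and leave the Reader blocked indefinitely. */
        apr_atomic_set32(&reader_state, READER_POLLING);

        apr_thread_mutex_lock(q->lock);
        queue_empty = (q->head == q->tail);
        apr_thread_mutex_unlock(q->lock);

        if (!queue_empty) {
            apr_atomic_set32(&reader_state, READER_RUNNING);
            return;            /* drain the queue instead of blocking */
        }

        apr_pollset_poll(pollset, -1, &num, &descs);
        apr_atomic_set32(&reader_state, READER_RUNNING);
        /* ... handle ready descriptors, then drain the queue ... */
    }

    /* Listener side: after enqueueing a new connection */
    static void wake_reader_if_polling(void)
    {
        if (apr_atomic_read32(&reader_state) == READER_POLLING) {
            /* interrupt the poll: signal the Reader thread, or fall
             * back to the one-byte pipe write sketched earlier */
        }
        /* otherwise do nothing: the Reader will notice the queued
         * connection when it re-checks before its next poll */
    }
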