Hello,
For a while I have been noticing performance problems with CGI scripts
running under Apache (1.3.27) on Solaris (2.8). These are "normal" forked
CGI scripts: not FastCGI, mod_perl, etc.
The symptom is that very often there is a three-second delay between
requesting the page and getting the completed response. It's actually
reasonably easy to replicate. Take a simple CGI script like this:
#!/bin/sh
echo "Content-Type: text/html"
echo ""
echo "Hello from shell"
If I hit the webserver with 50 requests for this page, one straight after the
other, I find that almost every alternate request takes 3 seconds. However,
with this example, trussing the process that answers the requests is enough
to make the problem go away (hinting at a race condition).
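For timing the requests, something along these lines is enough (a rough
sketch only: the address, port and URL path are placeholders for wherever the
test script lives, and on Solaris it needs -lsocket -lnsl):

/* timecgi.c: make 50 back-to-back requests for the test script and print
 * how long each one takes.  Address, port and path are placeholders.
 * Compile on Solaris with: cc -o timecgi timecgi.c -lsocket -lnsl
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    const char *req = "GET /cgi-bin/hello.sh HTTP/1.0\r\n\r\n";
    struct sockaddr_in addr;
    char buf[4096];
    int i;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(80);
    addr.sin_addr.s_addr = inet_addr("127.0.0.1");

    for (i = 1; i <= 50; i++) {
        struct timeval t0, t1;
        double elapsed;
        int fd;

        gettimeofday(&t0, NULL);
        fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0 || connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
            perror("connect");
            return 1;
        }
        write(fd, req, strlen(req));
        while (read(fd, buf, sizeof(buf)) > 0)
            ;                   /* drain the response until the server closes */
        close(fd);
        gettimeofday(&t1, NULL);

        elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("request %2d: %.3f s\n", i, elapsed);
    }
    return 0;
}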
Running the equivalent script in Ruby did allow me to keep a truss attached
and still see the occasional failure, which showed me:
12458: 24.7093 waitid(P_PID, 12472, 0xFFBEF7E8, WEXITED|WTRAPPED|WNOHANG) = 0
12458: 24.7096 kill(12472, SIGTERM) = 0
...
12458: 24.7104 alarm(3) = 0
...
12458: sigsuspend(0xFFBEF7F8) (sleeping...)
...
12458: 27.7048 Received signal #14, SIGALRM, in sigsuspend() [caught]
12458: 27.7051 sigsuspend(0xFFBEF7F8) Err#4 EINTR
It then turns out that you can replicate the problem 100% reliably by
introducing a small delay after closing stdin/stdout but before exiting:
#!/usr/local/bin/ruby
puts "Content-Type: text/html"
puts ""
puts "Hello from Ruby"
$stdin.close
$stdout.close
sleep 0.2
From looking through the code, I believe the problem lies in free_proc_chain()
in src/main/alloc.c. The logic is as follows (sketched in code after the list):
- call waitpid() with WNOHANG for each process to be cleaned up, to see if it
has already died
- if it hasn't, send it a SIGTERM with kill()
- if kill() returns 0 (meaning the process was still alive), sleep for 3
seconds and then send it a SIGKILL
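Or, in outline (the struct and enum below are cut-down stand-ins for Apache's
own process_chain and kill_conditions, and I have left out the kill_always
case, so treat this as a paraphrase of my reading rather than the actual
source):

/* Paraphrase of the cleanup logic in free_proc_chain() (alloc.c). */
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

enum kill_conditions { kill_never, kill_after_timeout, kill_only_once };

struct process_chain {
    pid_t pid;
    enum kill_conditions kill_how;
    struct process_chain *next;
};

static void cleanup_procs(struct process_chain *procs)
{
    struct process_chain *p;
    int need_timeout = 0;

    /* 1. reap anything which has already exited */
    for (p = procs; p; p = p->next) {
        if (waitpid(p->pid, (int *) 0, WNOHANG) > 0)
            p->kill_how = kill_never;
    }

    /* 2. anything still alive gets a SIGTERM straight away */
    for (p = procs; p; p = p->next) {
        if (p->kill_how == kill_after_timeout || p->kill_how == kill_only_once) {
            if (kill(p->pid, SIGTERM) == 0)
                need_timeout = 1;   /* at least one process is still around */
        }
    }

    /* 3. give it a grace period, then finish it off */
    if (need_timeout)
        sleep(3);                   /* <-- the 3-second stall */

    for (p = procs; p; p = p->next) {
        if (p->kill_how != kill_never)
            kill(p->pid, SIGKILL);
    }
}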
The race condition seems to be as follows: when the CGI process has finished
its work, it closes stdin/stdout, which completes the request. However, it
takes a short period of time to tidy up after itself before it finally dies,
by which time Apache has already decided to send it a SIGTERM and is
committed to a 3-second wait.
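The child's side of this is easy to show outside Apache altogether. In the
sketch below, the parent sees EOF on the pipe as soon as the child closes its
end, but an immediate waitpid() with WNOHANG still finds nothing to reap,
because the child hangs around for another 200ms before exiting:

/* Standalone demonstration of the child side of the race (no Apache
 * involved): EOF arrives on the pipe before the writer actually dies.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    int fds[2];
    char buf[256];
    pid_t pid;

    pipe(fds);
    pid = fork();
    if (pid == 0) {                     /* the "CGI" child */
        close(fds[0]);
        write(fds[1], "Hello\n", 6);
        close(fds[1]);                  /* request is complete from here on */
        usleep(200000);                 /* 200ms of tidying up before exit */
        _exit(0);
    }

    close(fds[1]);
    while (read(fds[0], buf, sizeof(buf)) > 0)
        ;                               /* wait for EOF, i.e. end of "response" */

    if (waitpid(pid, (int *) 0, WNOHANG) == 0)
        printf("EOF seen, but the child has not exited yet\n");
    else
        printf("child had already exited at EOF\n");

    waitpid(pid, (int *) 0, 0);         /* clean up */
    return 0;
}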
This is the first time I have dug around in the internals of Apache and I am
not certain what is _supposed_ to happen in this case; in particular I don't
know whether the sleep(3) is meant to be interruptible by a SIGCHLD.
However, if it is, it certainly isn't happening on my platform.
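(If anyone wants to check the interruptibility question in isolation, a
standalone test along the lines below should do it. My understanding is that
sleep() is only cut short when SIGCHLD is actually caught, not when it is
left at its default disposition, and I don't know which applies inside httpd
at that point.)

/* Quick check of whether sleep() gets cut short by SIGCHLD.  With no
 * handler installed (SIGCHLD at its default, which discards the signal)
 * the full 3 seconds should elapse; with a handler installed, sleep()
 * should return as soon as the child dies.  Run with any argument to
 * install the handler.
 */
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/wait.h>

static void on_chld(int sig)
{
    (void) sig;     /* nothing to do; merely catching the signal is enough */
}

int main(int argc, char **argv)
{
    struct timeval t0, t1;
    pid_t pid;

    if (argc > 1)
        signal(SIGCHLD, on_chld);

    pid = fork();
    if (pid == 0) {
        sleep(1);                       /* child dies after one second */
        _exit(0);
    }

    gettimeofday(&t0, NULL);
    sleep(3);
    gettimeofday(&t1, NULL);

    printf("slept for %.2f seconds\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
    waitpid(pid, (int *) 0, 0);
    return 0;
}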
My proposed solution is to poll waitpid() a few times before deciding that
the process needs to be sent a SIGTERM. I have put together a rough
proof-of-concept patch (included below) which does this at intervals of 10ms,
20ms, 40ms and 80ms, giving a total potential wait of 150ms. Maybe it should
have one more iteration to make it 310ms. It's not particularly pretty, but I
wanted to localise my changes as far as possible.
Given my lack of knowledge of the Apache codebase, I can't say whether this
is The Right Way[TM] to deal with this problem, but it has certainly made a
huge improvement here. Comments gratefully received.
If this is the right approach, perhaps one of the main Apache developers
would like to produce a tidier patch?
Regards,
Brian Candler.
--- apache_1.3.27/src/main/alloc.c.orig Tue Mar 18 16:54:41 2003
+++ apache_1.3.27/src/main/alloc.c Tue Mar 18 17:05:14 2003
@@ -2917,9 +2917,32 @@
#else
#ifndef NEED_WAITPID
/* Pick up all defunct processes */
- for (p = procs; p; p = p->next) {
- if (waitpid(p->pid, (int *) 0, WNOHANG) > 0) {
- p->kill_how = kill_never;
+ {
+ int waiting;
+ int try = 0;
+ long interval = 10000; /* 10ms */
+ struct timeval timeout;
+ int status;
+
+ while (1) {
+ waiting = 0;
+ for (p = procs; p; p = p->next) {
+ if (try > 0 && p->kill_how == kill_never)
+ continue;
+ status = waitpid(p->pid, (int *) 0, WNOHANG);
+ if (status > 0) {
+ p->kill_how = kill_never;
+ } else if (status == 0) {
+ waiting = 1;
+ }
+ }
+ if (!waiting || ++try >= 5)
+ break;
+ timeout.tv_sec = 0;
+ timeout.tv_usec = interval;
+ select(0, 0, 0, 0, &timeout);
+ interval *= 2;
+ /* we will wait 10+20+40+80 = 150ms before sending TERM */
}
}
#endif