After a couple of days of research (and talking with Ed Korthof), I think
the mysterious bug is solved. Recall the reproduction recipe:
* start a long running 'svn import'
* run 'apachectl graceful'
* a few seconds later, httpd just hangs.
It looks like the main bug is with the 'rotatelogs' process that comes
with apache. It has nothing to do with SSL at all -- it's
reproducible over HTTP, and even without subversion.
The theory (as I understand it) is this: if you set up your httpd.conf
appropriately, the httpd parent process launches a 'rotatelogs' child
process, along with N other httpd child processes. All of the httpd
children keep write-pipes open to the rotatelogs process, and write log
data into their pipes. 'rotatelogs' has the job of reading these pipes
and spewing the data into appropriate files, creating new logfiles when
necessary.
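To make the reader's job concrete, here is a toy model in Python. It is
only a sketch of the idea, not the real 'rotatelogs' code: the function
name, the 'clock' parameter, and the "base.<interval start>" file naming
are my own assumptions for illustration.

```python
import io
import os
import tempfile


def rotate_stream(stream, base_path, interval, clock):
    """Copy lines from stream into files named base_path.<interval start>.

    clock is a callable returning the current time in seconds. A new file
    is opened whenever the time crosses into a new interval.
    Returns the list of files written, in order.
    """
    written = []
    current_bucket = None
    out = None
    try:
        for line in stream:
            # Round the current time down to the start of its interval.
            bucket = int(clock()) // interval * interval
            if bucket != current_bucket:
                if out is not None:
                    out.close()
                path = f"{base_path}.{bucket}"
                out = open(path, "a")
                written.append(path)
                current_bucket = bucket
            out.write(line)
            out.flush()
    finally:
        if out is not None:
            out.close()
    return written
```

The important point for this bug is the 'for line in stream' loop: the
writers only make progress as long as some process keeps running that
loop on the read end of the pipe.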
Here's what Ed Korthof thinks is happening:
* the svn client (using neon) opens a long-lived connection to do a
commit. Using 'keepalive', it sends a huge number of PUT and
PROPPATCH requests over one connection to a single httpd child.
* when the 'graceful' signal hits, httpd children wait for their
current connection to close, then exit. Meanwhile, the httpd
parent spawns a new "generation" of httpd children. Obviously,
the httpd child servicing the svn commit sticks around a very long
time, because svn doesn't hang up until it's done sending
everything.
* for some reason, the 'rotatelogs' process dies. It's not clear
whether it's responding to a signal, or if the httpd parent is
killing it, or what. A new 'rotatelogs' takes its place, with new
httpd children connecting to it. Meanwhile, the "old" httpd child
continues to service svn, and continues to write logdata to a dead
pipe... there's nobody reading data from the pipe on the other end
anymore!
* Eventually the pipe fills up, and the httpd child just hangs
trying to write to it.
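The last step is easy to see in isolation. The following Python sketch
(my own illustration, not anything taken from httpd) writes into a pipe
whose read end stays open but is never read. It uses non-blocking mode
so that it reports the point where the kernel buffer is full instead of
hanging there; a blocking writer, which is presumably what the httpd
child is, would simply stop at that point.

```python
import os


def fill_pipe_without_reader(chunk_size=4096):
    """Write to a pipe that nobody reads until the kernel buffer is full.

    Returns the number of bytes accepted before a write would block.
    """
    read_fd, write_fd = os.pipe()
    # Non-blocking mode, so the demo reports the stall instead of hanging.
    os.set_blocking(write_fd, False)
    chunk = b"x" * chunk_size
    total = 0
    try:
        while True:
            try:
                total += os.write(write_fd, chunk)
            except BlockingIOError:
                # The buffer is full; a blocking write would hang here.
                return total
    finally:
        os.close(write_fd)
        os.close(read_fd)


if __name__ == "__main__":
    print(fill_pipe_without_reader(), "bytes accepted before write would block")
```

Note that the read end is deliberately left open here. If every copy of
the read end were closed, the write would fail with EPIPE
(BrokenPipeError in Python) rather than block, so a hang like the one
above suggests that some process still holds the read end open without
reading from it.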
I think this theory is true, for a few reasons:
* every time I run 'strace -p PID' on the frozen httpd, it claims to
be writing logdata. gdb confirms this as well.
* edk is able to reproduce the problem without subversion, simply by
hand-typing HTTP requests chained together by a Keep-Alive header.
* 'svn import' claims to have received 'success' responses on about
20 more files beyond what the access log shows, implying a pipe-backup.
* The clincher: in all my testing on different platforms (7 or 8
different setups) this bug is reproducible *every* time httpd.conf
is using 'rotatelogs', and the bug vanishes when I stop using
'rotatelogs'.
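For anyone who wants to try the subversion-free reproduction without
typing requests by hand, something like this Python helper could
generate them. It is a sketch under assumptions: I don't have the exact
requests edk typed, so the GET method, the HTTP/1.1 version, and the
function name are mine.

```python
def build_keepalive_requests(host, paths):
    """Return raw bytes for a chain of GET requests on one connection."""
    requests = []
    for path in paths:
        requests.append(
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Connection: Keep-Alive\r\n"
            "\r\n"
        )
    return "".join(requests).encode("ascii")
```

The resulting bytes could be sent over a single socket (for example one
opened with socket.create_connection) to a test server, with
'apachectl graceful' run while the connection is still open.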
Final analysis:
This looks like some kind of bug in Apache itself, not
related to SSL or Subversion at all... it looks like a bug in the
interaction between the 'rotatelogs' process and clients that use
Keep-Alive across a graceful restart.
Any comments or thoughts?