After a couple days of research (and talking with Ed Korthof), I think the mysterious bug is solved. Recall the reproduction recipe:
  * start a long-running 'svn import'
  * run 'apachectl graceful'
  * a few seconds later, httpd just hangs.

It looks like the main bug is with the 'rotatelogs' process that comes
with apache. It has nothing to do with SSL at all -- it's reproducible
over plain HTTP, and even without subversion.

The theory (as I understand it) is this: if you set up your httpd.conf
to pipe logs through 'rotatelogs', the httpd parent process launches a
'rotatelogs' child process, along with N other httpd child processes.
All of the httpd children keep write-pipes open to the rotatelogs
process, and write log data into their pipes. 'rotatelogs' has the job
of reading these pipes and spewing the data into the appropriate files,
creating new logfiles when necessary.

Here's what Ed Korthof thinks is happening:

  * the svn client (using neon) opens a long-lived connection to do a
    commit. Using 'keepalive', it sends a huge number of PUT and
    PROPPATCH requests over one connection to a single httpd child.
  * when the 'graceful' signal hits, httpd children wait for their
    current connection to close, then exit. Meanwhile, the httpd parent
    spawns a new "generation" of httpd children. Obviously, the httpd
    child servicing the svn commit sticks around a very long time,
    because svn doesn't hang up until it's done sending everything.
  * for some reason, the 'rotatelogs' process dies. It's not clear
    whether it's responding to a signal, or if the httpd parent is
    killing it, or what. A new 'rotatelogs' takes its place, with new
    httpd children connecting to it. Meanwhile, the "old" httpd child
    continues to service svn, and continues to write logdata to a dead
    pipe... there's nobody reading data from the pipe on the other end
    anymore!
  * eventually the pipe fills up, and the httpd child just hangs trying
    to write to it.

I think this theory is true, for a few reasons:

  * every time I run 'strace -p PID' on the frozen httpd, it claims to
    be writing logdata. gdb confirms this as well.
  * edk is able to reproduce the problem without subversion, simply by
    hand-typing HTTP requests chained together by a Keep-Alive header.
  * 'svn import' claims to have received 'success' responses on about
    20 more files beyond what the access log shows, implying a
    pipe-backup.
  * The clincher: in all my testing on different platforms (7 or 8
    different setups) this bug is reproducible *every* time httpd.conf
    is using 'rotatelogs', and the bug vanishes when I stop using
    'rotatelogs'.

Final analysis: this looks like some kind of bug in Apache itself, not
related to SSL or Subversion at all... it looks like a bug in the
interaction between the 'rotatelogs' process and clients that use
Keep-Alive.

Any comments or thoughts?
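
For anyone who wants to see the last step of the theory in isolation,
here is a minimal sketch in Python. It is not taken from httpd or
rotatelogs; the function name fill_pipe_without_reader and the 4096-byte
chunk size are made up for illustration. It opens a pipe, keeps the read
end open but never reads from it, and counts how many bytes the kernel
accepts before the pipe is full. The write end is set non-blocking only
so the demo can stop and report; an ordinary blocking write (which is
presumably what the logging code does) would simply hang at that point.

```python
import os


def fill_pipe_without_reader(chunk_size=4096):
    """Return how many bytes a pipe accepts when nobody reads from it.

    Hypothetical helper for illustration only.
    """
    read_fd, write_fd = os.pipe()
    # Non-blocking so that a full pipe raises instead of hanging the demo.
    os.set_blocking(write_fd, False)
    chunk = b"x" * chunk_size
    total = 0
    try:
        while True:
            try:
                # os.write returns the number of bytes actually accepted.
                total += os.write(write_fd, chunk)
            except BlockingIOError:
                # Pipe buffer is full; a blocking writer would be stuck here.
                break
    finally:
        os.close(write_fd)
        os.close(read_fd)
    return total


if __name__ == "__main__":
    print("pipe accepted", fill_pipe_without_reader(), "bytes before filling")
```

The exact capacity depends on the OS. Note that the read end stays open
in this sketch: if every read end of a pipe were closed, the writer
would get SIGPIPE/EPIPE instead of blocking, so the observed hang
suggests some process still holds the read side open without reading.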