Am 20.01.2015 um 10:15 schrieb Rainer Jung:
Am 20.01.2015 um 08:45 schrieb Ruediger Pluem:
On 01/19/2015 11:40 PM, Rainer Jung wrote:
I noticed a hanging child process on our ASF server aurora.
It currently uses 2.4.11 (plus the post tag commit) and event MPM.
Most processes exiting due to MaxConnectionsPerChild get cleaned up
after some time but this one doesn't. It now hangs
for more than an hour. I'll let it hang. In case anyone has a good
question I can answer with gdb let me know.
It shows a strange connection view in the server status table:
PID Connections Threads Async connections
total accepting busy idle writing keep-alive closing
93557 1 yes 0 0 0 0 0
So it has 1 connection, but 0s in all other columns.
The connection can be seen by lsof:
FD TYPE DEVICE SIZE/OFF NODE NAME
txt VREG 183,3400335528 36497117 275235
/x1/www/archive.apache.org/dist/cordova/cordova-3.4.0-src.zip
9u PIPE 0xfffffe061ecfab60 16384 ->0xfffffe061ecfacb8
10u PIPE 0xfffffe061ecfacb8 0 ->0xfffffe061ecfab60
24u KQUEUE 0xfffffe033071be00 count=0, state=0x2
41u IPv4 0xfffffe01316243d0 0t0 TCP
127.0.0.1:35849->127.0.0.1:8050 (CLOSE_WAIT)
83u IPv4 0xfffffe0255d08b70 0t0 TCP
127.0.0.1:52023->127.0.0.1:8050 (CLOSE_WAIT)
108u IPv4 0xfffffe09990eeb70 0t0 TCP
127.0.0.1:22532->127.0.0.1:8050 (CLOSE_WAIT)
This is the established connectioN:
110u IPv4 0xfffffe0255ab4b70 0t0 TCP
192.87.106.229:http->179.206.174.192:65496 (ESTABLISHED)
And this is likely the file being served on that connection:
126r VREG 183,3400335528 36497117 275235
/x1/www/archive.apache.org/dist/cordova/cordova-3.4.0-src.zip
156u IPv4 0xfffffe048d0ff3d0 0t0 TCP
127.0.0.1:26685->127.0.0.1:8050 (CLOSE_WAIT)
229u IPv4 0xfffffe0131d013d0 0t0 TCP
127.0.0.1:31538->127.0.0.1:8050 (CLOSE_WAIT)
netstat shows:
Proto Recv-Q Send-Q Local Address Foreign Address
(state)
tcp4 0 87650 192.87.106.229.80 179.206.174.192.65496
ESTABLISHED
so there's 87650 bytes in the send-q. Most lilely the client hans't
acked what we send.
Isn't it weird that the connection remains in this state for an hour?
I would guess the OS tries to resent whats in the
buffer and if doesn't get ACK'ed it would somehow timeout the TCP
connection and assume the peer is gone.
I was misguided by the server-status table. I have another child process
now with one open connection, all zeroes in the other server status
columns and for that I checked network and process activity. And indeed
ktrace and tcpdump both show, that the process still does send some data
every few seconds. It just takes very long for the big file served and
the low rate of transfer. Thanks for putting me on the right track. I
will check for a few more "hanging" children during the next days
whether they actually still serve files.
So it all boils down to why the connection isn't shown in any of the
state columns in the server status table. Plus it could be interesting
to limit the time a closing child still serves requests, because it
keeps blocking a scoreboard entry and aborting one or two requests every
now and them might be better than running out of children.
And now another hour later, the same process actually shows the
connection in the "Async connection - writing" column. Even more
confusing. I wonder in which state the connection was in between.
Regards,
Rainer