On Thu, May 17, 2012 at 3:52 PM, Willy Tarreau <[email protected]> wrote: > > Hi, > > first, let me congratulate you for such a well documented analysis. > > On Thu, May 17, 2012 at 03:28:40PM -0700, Yehuda Sadeh wrote: > > Hi, > > > > We have a setup where there's haproxy (1.4.15) in front of a few apache > > servers. We have a test suite that occasionally fails with 502 when going > > through haproxy (this doesn't happen when going directly to one of the > > apache servers). I was able to have a tcp capture while that was happening, > > and I do see some strange behavior. From what I can see haproxy sends a > > request to the apache server, then hangs up the tcp connection (while > > response is arriving), and reopens a second tcp connection on which it > > sends nothing. After 20 seconds apache closes down the second connection > > and haproxy sends 502 to the client. > > > > The haproxy log shows the following: > > > > May 17 14:03:20 localhost haproxy[3695]: > > 10.214.131.22:38909[17/May/2012:14:02:59.347] api-http-ipv4 > > radosgw-http/plana93-1 > > 0/0/2/-1/21020 502 204 - - SH-- 0/0/0/0/1 0/0 "GET > > /jenkins-d15dxs9og2dxhfhlywf8i-182/testobj HTTP/1.1" > > > > Which means, if I decipher it correctly, that the server hung the > > connection after more than 20 seconds. > > ... and a retry! I think this is the key of the mystery. > > > However, looking at the server logs > > I couldn't find evidence for that happening. I do see the server responding > > immediately. > > I was taking a tcpdump while that was happening, and I do see some strange > > behavior. The haproxy log shows the following (captured on the haproxy > > machine): > > > > 10.214.131.22 client > > 10.214.131.25 haproxy > > 10.214.133.23 apache > > > > > > 1. client <-> haproxy > > > > Client sends request to haproxy > > > > 18073 2.787839 10.214.131.22 -> 10.214.131.25 TCP 38909 > http [SYN] > > Seq=0 Win=14600 Len=0 MSS=1460 TSV=25089868 TSER=0 WS=7 > > 18074 2.787867 10.214.131.25 -> 10.214.131.22 TCP http > 38909 [SYN, ACK] > > Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 TSV=25177841 TSER=25089868 WS=7 > > 18075 2.787948 10.214.131.22 -> 10.214.131.25 TCP 38909 > http [ACK] > > Seq=1 Ack=1 Win=14720 Len=0 TSV=25089868 TSER=25177841 > > 18076 2.788028 10.214.131.22 -> 10.214.131.25 HTTP GET > > /jenkins-d15dxs9og2dxhfhlywf8i-182/testobj HTTP/1.1 > > > > > > 2. haproxy <-> apache (connection 1) > > > > haproxy sends request to apache > > > > 18077 2.788185 10.214.131.25 -> 10.214.133.23 TCP 39579 > http [SYN] > > Seq=0 Win=14600 Len=0 MSS=1460 TSV=25177841 TSER=0 WS=7 > > 18078 2.789498 10.214.133.23 -> 10.214.131.25 TCP http > 39579 [SYN, ACK] > > Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 TSV=248773204 TSER=25177841 WS=7 > > 18079 2.789530 10.214.131.25 -> 10.214.133.23 TCP 39579 > http [ACK] > > Seq=1 Ack=1 Win=14720 Len=0 TSV=25177841 TSER=248773204 > > 18080 2.789576 10.214.131.25 -> 10.214.133.23 HTTP GET > > /jenkins-d15dxs9og2dxhfhlywf8i-182/testobj HTTP/1.1 > > 18081 2.789605 10.214.131.25 -> 10.214.133.23 TCP 39579 > http [FIN, ACK] > > Seq=333 Ack=1 Win=14720 Len=0 TSV=25177841 TSER=248773204 > > This one is really strange. The only reason why I could imagine the FIN here > would be if the client had used one *and* there was "option abortonclose", > but none of these hypothesis were true. > > Ah I just saw Cyril's discovery about your erroneous timeouts, he's plain > right ! > > > 3. haproxy <-> apache (connection 2) > > > > haproxy creates a second tcp connection to the same apache server > > > > 18082 2.789650 10.214.131.25 -> 10.214.133.23 TCP 39580 > http [SYN] > > Seq=0 Win=14600 Len=0 MSS=1460 TSV=25177841 TSER=0 WS=7 > > 18083 2.790896 10.214.133.23 -> 10.214.131.25 TCP http > 39580 [SYN, ACK] > > Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 TSV=248773204 TSER=25177841 WS=7 > > 18084 2.790926 10.214.131.25 -> 10.214.133.23 TCP 39580 > http [ACK] > > Seq=1 Ack=1 Win=14720 Len=0 TSV=25177841 TSER=248773204 > > Here we have a bug in my opinion, despite the timeout issue. The new > connection should not have been attempted since the request had already > been sent. And if the connection remains open, it's precisely because > there is no more request to send ! > > > I'll have to dig through the code to find how it is possible to get > into this situation. > > Please in parallel recheck with more correct timeouts (at least 4s > each). >
Sure, checking now. It used to take a while to reproduce, so it might take a bit. Thanks, Yehuda

