On Thu, May 17, 2012 at 4:29 PM, Yehuda Sadeh <[email protected]> wrote:
> On Thu, May 17, 2012 at 3:52 PM, Willy Tarreau <[email protected]> wrote:
>>
>> Hi,
>>
>> first, let me congratulate you for such a well documented analysis.
>>
>> On Thu, May 17, 2012 at 03:28:40PM -0700, Yehuda Sadeh wrote:
>> > Hi,
>> >
>> >    We have a setup where there's haproxy (1.4.15) in front of a few apache
>> > servers. We have a test suite that occasionally fails with 502 when going
>> > through haproxy (this doesn't happen when going directly to one of the
>> > apache servers). I was able to have a tcp capture while that was happening,
>> > and I do see some strange behavior. From what I can see haproxy sends a
>> > request to the apache server, then hangs up the tcp connection (while
>> > response is arriving), and reopens a second tcp connection on which it
>> > sends nothing. After 20 seconds apache closes down the second connection
>> > and haproxy sends 502 to the client.
>> >
>> > The haproxy log shows the following:
>> >
>> > May 17 14:03:20 localhost haproxy[3695]:
>> > 10.214.131.22:38909[17/May/2012:14:02:59.347] api-http-ipv4
>> > radosgw-http/plana93-1
>> > 0/0/2/-1/21020 502 204 - - SH-- 0/0/0/0/1 0/0 "GET
>> > /jenkins-d15dxs9og2dxhfhlywf8i-182/testobj HTTP/1.1"
>> >
>> > Which means, if I decipher it correctly, that the server hung the
>> > connection after more than 20 seconds.
>>
>> ... and a retry! I think this is the key of the mystery.
>>
>> > However, looking at the server logs
>> > I couldn't find evidence for that happening. I do see the server responding
>> > immediately.
>> > I was taking a tcpdump while that was happening, and I do see some strange
>> > behavior. The haproxy log shows the following (captured on the haproxy
>> > machine):
>> >
>> > 10.214.131.22 client
>> > 10.214.131.25 haproxy
>> > 10.214.133.23 apache
>> >
>> >
>> > 1. client <-> haproxy
>> >
>> >  Client sends request to haproxy
>> >
>> > 18073   2.787839 10.214.131.22 -> 10.214.131.25 TCP 38909 > http [SYN]
>> > Seq=0 Win=14600 Len=0 MSS=1460 TSV=25089868 TSER=0 WS=7
>> > 18074   2.787867 10.214.131.25 -> 10.214.131.22 TCP http > 38909 [SYN, ACK]
>> > Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 TSV=25177841 TSER=25089868 WS=7
>> > 18075   2.787948 10.214.131.22 -> 10.214.131.25 TCP 38909 > http [ACK]
>> > Seq=1 Ack=1 Win=14720 Len=0 TSV=25089868 TSER=25177841
>> > 18076   2.788028 10.214.131.22 -> 10.214.131.25 HTTP GET
>> > /jenkins-d15dxs9og2dxhfhlywf8i-182/testobj HTTP/1.1
>> >
>> >
>> > 2. haproxy <-> apache (connection 1)
>> >
>> > haproxy sends request to apache
>> >
>> > 18077   2.788185 10.214.131.25 -> 10.214.133.23 TCP 39579 > http [SYN]
>> > Seq=0 Win=14600 Len=0 MSS=1460 TSV=25177841 TSER=0 WS=7
>> > 18078   2.789498 10.214.133.23 -> 10.214.131.25 TCP http > 39579 [SYN, ACK]
>> > Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 TSV=248773204 TSER=25177841 WS=7
>> > 18079   2.789530 10.214.131.25 -> 10.214.133.23 TCP 39579 > http [ACK]
>> > Seq=1 Ack=1 Win=14720 Len=0 TSV=25177841 TSER=248773204
>> > 18080   2.789576 10.214.131.25 -> 10.214.133.23 HTTP GET
>> > /jenkins-d15dxs9og2dxhfhlywf8i-182/testobj HTTP/1.1
>> > 18081   2.789605 10.214.131.25 -> 10.214.133.23 TCP 39579 > http [FIN, ACK]
>> > Seq=333 Ack=1 Win=14720 Len=0 TSV=25177841 TSER=248773204
>>
>> This one is really strange. The only reason why I could imagine the FIN here
>> would be if the client had used one *and* there was "option abortonclose",
>> but none of these hypothesis were true.
>>
>> Ah I just saw Cyril's discovery about your erroneous timeouts, he's plain
>> right !
>>
>> > 3. haproxy <-> apache (connection 2)
>> >
>> > haproxy creates a second tcp connection to the same apache server
>> >
>> > 18082   2.789650 10.214.131.25 -> 10.214.133.23 TCP 39580 > http [SYN]
>> > Seq=0 Win=14600 Len=0 MSS=1460 TSV=25177841 TSER=0 WS=7
>> > 18083   2.790896 10.214.133.23 -> 10.214.131.25 TCP http > 39580 [SYN, ACK]
>> > Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 TSV=248773204 TSER=25177841 WS=7
>> > 18084   2.790926 10.214.131.25 -> 10.214.133.23 TCP 39580 > http [ACK]
>> > Seq=1 Ack=1 Win=14720 Len=0 TSV=25177841 TSER=248773204
>>
>> Here we have a bug in my opinion, despite the timeout issue. The new
>> connection should not have been attempted since the request had already
>> been sent. And if the connection remains open, it's precisely because
>> there is no more request to send !
>>
>>
>> I'll have to dig through the code to find how it is possible to get
>> into this situation.
>>
>> Please in parallel recheck with more correct timeouts (at least 4s
>> each).
>>
>
> Sure, checking now. It used to take a while to reproduce, so it might
> take a bit.
>
Yeah, it definitely solved the issue.

Thanks!
Yehuda

Reply via email to