On 2017-01-17 19:58, Willy Tarreau wrote:
> I'm surprised that it only kicks in "sometimes". That makes me wonder
> whether the problem is not here.
I'm starting to wonder if it might be related to network latency between the
client and HAProxy. More details below.
> - it would be useful to check (using netstat -atnp) if these eternal
>   FIN_WAIT2 sockets still belong to haproxy or are real orphans. I
>   suspect that most of them are still attached to haproxy given your
>   timeouts; if you manage to find some very old ones still attached
>   to haproxy then it would mean the issue is in haproxy.
Yes, in most cases if I select a random socket in FIN_WAIT2 from
netstat, there will be an HAProxy session associated with that socket. e.g.
$ netstat -an | grep FIN_WA | grep 122.56.198.232:26734
tcp 0 0 10.24.20.104:993 122.56.198.232:26734 FIN_WAIT2
$ echo "show sess" | sudo socat unix-connect:/var/lib/haproxy/stats
stdio | grep 122.56.198.232:26734
0x7f906fdb7050: proto=tcpv4 src=122.56.198.232:26734
fe=hosting-imaps_proxy-external-vip be=hosting-imaps_proxy-external-vip
srv=imap1 ts=08 age=25m42s calls=3
rq[f=848202h,i=0,an=00h,rx=14m45s,wx=,ax=]
rp[f=04a020h,i=0,an=00h,rx=,wx=,ax=] s0=[7,0h,fd=6684,ex=]
s1=[7,8h,fd=10039,ex=] exp=14m45s
> - your idea of a race is interesting. I'm wondering what happens if
>   the connection spends more than 60s still attached to haproxy and
>   only then becomes orphaned. It could be possible that the kernel
>   timeout only strikes when the socket age reaches the timeout while
>   already being orphaned.
In my testing, I find that most of the time, 'timeout client-fin' is
applied. HAProxy sends a FIN to the client, which acks the FIN, but does
not send a FIN of its own. The HAProxy session moves into the client-fin
timeout period of 30 seconds. After 30 seconds, the HAProxy session is
removed. The FIN_WAIT2 connection state is then maintained by the kernel
for tcp_fin_timeout of 60 seconds, before being removed altogether.
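The 60-second figure is the kernel's timeout for orphaned FIN_WAIT2 sockets; a minimal sketch for checking it directly (assumes a Linux /proc filesystem):

```python
# Read the kernel's timeout for orphaned FIN_WAIT2 sockets (Linux only).
# This is the ~60s window that starts once haproxy drops the session.
with open("/proc/sys/net/ipv4/tcp_fin_timeout") as f:
    fin_timeout = int(f.read().strip())

print(f"net.ipv4.tcp_fin_timeout = {fin_timeout}s")
```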
Aha! Testing just now, I managed to catch one.
To test, I open a connection to the HAProxy VIP with my python test
client. After 3 minutes, an idle timeout on the backend server causes
the connection to be closed. My client doesn't close its end though,
leaving the client side connection in FIN_WAIT2.
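For anyone wanting to reproduce this locally, the behaviour of my test client can be sketched roughly like so (loopback-only; the server thread is a stand-in for the backend's idle timeout, and all names and timings here are made up):

```python
import socket
import threading
import time

def idle_server(port_holder, idle_timeout=0.5):
    # Stand-in for the backend: accept one connection, sit idle for the
    # timeout, then close its side, which sends a FIN to the client.
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port_holder.append(srv.getsockname()[1])
    conn, _ = srv.accept()
    time.sleep(idle_timeout)
    conn.close()   # server-side FIN; server socket heads for FIN_WAIT2
    srv.close()

port_holder = []
t = threading.Thread(target=idle_server, args=(port_holder,))
t.start()
while not port_holder:
    time.sleep(0.01)

# The "misbehaving" client: connect, then never close our side.
client = socket.socket()
client.connect(("127.0.0.1", port_holder[0]))

# recv() returning b"" tells us the peer sent a FIN...
assert client.recv(1024) == b""
# ...but we deliberately keep our socket open, leaving the peer's
# side of the connection parked in FIN_WAIT2 until its timeout fires.
t.join()
print("peer closed; client side intentionally left open")
```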
Here's the HAProxy session at 2m19s:
$ echo "show sess" | sudo socat unix-connect:/var/lib/haproxy/stats
stdio | grep 146.185.142.165
0x7f9075ba98c0: proto=tcpv4 src=146.185.142.165:56198
fe=hosting-imap_proxy-external-vip be=hosting-imap_proxy-external-vip
srv=imap1 ts=08 age=2m19s calls=2
rq[f=808000h,i=0,an=00h,rx=32m41s,wx=,ax=]
rp[f=048202h,i=0,an=00h,rx=32m41s,wx=,ax=] s0=[7,8h,fd=14490,ex=]
s1=[7,8h,fd=14919,ex=] exp=32m41s
And then again at 3m16s after the backend has closed its side of the
connection:
$ echo "show sess" | sudo socat unix-connect:/var/lib/haproxy/stats
stdio | grep 146.185.142.165
0x7f9075ba98c0: proto=tcpv4 src=146.185.142.165:56198
fe=hosting-imap_proxy-external-vip be=hosting-imap_proxy-external-vip
srv=imap1 ts=08 age=3m16s calls=3
rq[f=808000h,i=0,an=00h,rx=31m44s,wx=,ax=]
rp[f=04a020h,i=0,an=00h,rx=,wx=,ax=] s0=[7,0h,fd=14490,ex=]
s1=[7,8h,fd=14919,ex=] exp=31m44s
$ netstat -an | grep 146.185.142.165:56198
tcp 0 0 10.24.20.104:143 146.185.142.165:56198 FIN_WAIT2
Note that the exp value is 31m44s, not ~14s as we'd expect if timeout
client-fin had been applied.
To compare, here's what usually happens. At age=2m59s just before the
server disconnection:
$ echo "show sess" | sudo socat unix-connect:/var/lib/haproxy/stats
stdio | grep 146.185.142.165
0x7f907f8e1ef0: proto=tcpv4 src=146.185.142.165:56202
fe=hosting-imap_proxy-external-vip be=hosting-imap_proxy-external-vip
srv=imap1 ts=08 age=2m59s calls=2
rq[f=808000h,i=0,an=00h,rx=32m,wx=,ax=]
rp[f=048202h,i=0,an=00h,rx=32m,wx=,ax=] s0=[7,8h,fd=20385,ex=]
s1=[7,8h,fd=20440,ex=] exp=32m
And then at 3m:
$ echo "show sess" | sudo socat unix-connect:/var/lib/haproxy/stats
stdio | grep 146.185.142.165
0x7f907f8e1ef0: proto=tcpv4 src=146.185.142.165:56202
fe=hosting-imap_proxy-external-vip be=hosting-imap_proxy-external-vip
srv=imap1 ts=08 age=3m calls=3 rq[f=808000h,i=0,an=00h,rx=29s,wx=,ax=]
rp[f=04a020h,i=0,an=00h,rx=,wx=,ax=] s0=[7,0h,fd=20385,ex=]
s1=[7,8h,fd=20440,ex=] exp=29s
Here you can see exp=29s, suggesting that the client-fin timeout has
kicked in.
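Spotting stuck sessions in bulk is easier if the exp durations are converted to seconds and compared against timeout client-fin; a hypothetical helper (the 30s value is the client-fin timeout from my config, mentioned above):

```python
import re

def to_seconds(dur):
    # Convert HAProxy "show sess" durations like "31m44s" or "29s"
    # into plain seconds.
    total = 0
    for value, unit in re.findall(r"(\d+)([hms])", dur):
        total += int(value) * {"h": 3600, "m": 60, "s": 1}[unit]
    return total

CLIENT_FIN = 30  # timeout client-fin, in seconds

# Any exp value well above client-fin suggests the timeout was
# never applied to that session.
print(to_seconds("31m44s") > CLIENT_FIN)  # True: this session is stuck
print(to_seconds("29s") > CLIENT_FIN)     # False: timeout applied
```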
I still don't know what causes the two different behaviours. It's quite
difficult to trigger the abnormal behaviour in my testing (only happens
maybe 1 time in 10), but with a session rate of up to 80/s on the
production service, it occurs frequently.
I don't have enough data to back this up yet, but I wonder if network
latency might be involved somehow. For most of today, testing from a
machine with a 3ms RTT to the service, I was unable to reproduce the
problem. Just now though, I've tested from a VM in Amsterdam with a
350ms RTT, and have been able to reproduce the problem 2 times in 4
attempts. Anecdata at this point, but it lends support to the race
condition theory.
Thanks,
Richard