Re: Increase in sockets in TIME_WAIT with 1.9.x
Hi again,

On Fri, Jun 14, 2019 at 05:25:34AM +0200, Willy Tarreau wrote:
> On Thu, Jun 13, 2019 at 05:31:09PM -0500, Dave Chiluk wrote:
> > I used the number of calls to setsockopt with SO_LINGER in them using the
> > following command.
> > $ sudo timeout 60s strace -e setsockopt,close -p $(ps -lf -C haproxy | tail
> > -n 1 | awk -e '{print $4}') 2>&1 | tee 1.9-${V} ; grep LINGER 1.9-${V} | wc
> > -l
> >
> > 53216e7 = 1
> > 81a15af6b = 69
> >
> > Interesting to note is that 1.8.17 only has roughly 17. I'll see if I can
> > do a bisection for that tomorrow.
>
> This is interesting because there is no equivalent commit in 1.8 so it
> may be possible that we've created a bug long ago that triggers more
> easily in certain situations.

So in fact it was not that black or white between versions: we've seen
that, depending on the check and its behaviour, a shutw could be emitted
before the NOLINGER flag was set, which is problematic because at that
point the flag is ignored. Olivier addressed this by ensuring we don't
shutw() when the flag is present, so that only close() does the job. I'm
also seeing that in some cases we call conn_sock_drain(), which disables
the flag. Thus I'm interested in knowing whether what is present in the
latest development tree (2.0 master branch) works fine for you, or if
you're still experiencing the issue.

Thanks,
Willy
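The ordering problem described above is internal to HAProxy's connection flags, but it can be modelled with plain sockets. The sketch below is a toy model, not HAProxy's actual code: `NOLINGER` stands in for HAProxy's CO_FL_NOLINGER flag, and the `Conn` class is invented for the demo. The point is that the linger option only reaches the kernel if the flag is set before shutw() runs; the fix mirrors skipping shutw() entirely when the flag is present, so that close() performs the abortive close.

```python
import socket
import struct

NOLINGER = 0x1  # stand-in for HAProxy's CO_FL_NOLINGER connection flag


def tcp_pair():
    """Build a connected loopback TCP pair for the demo."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    cli.connect(srv.getsockname())
    peer, _ = srv.accept()
    srv.close()
    return cli, peer


class Conn:
    """Toy model of the ordering bug: the linger option is only applied
    at shutw() time, so if shutw() runs before the NOLINGER flag is set,
    no setsockopt() is ever issued and close() is graceful (FIN, then
    TIME_WAIT on the closing side)."""

    def __init__(self, sock):
        self.sock = sock
        self.flags = 0
        self.linger_disabled = False  # did we actually issue setsockopt()?

    def shutw(self):
        if self.flags & NOLINGER:
            # Fixed behaviour: don't send a FIN here; arm SO_LINGER {1, 0}
            # so the final close() aborts the connection with an RST.
            self.sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                                 struct.pack("ii", 1, 0))
            self.linger_disabled = True
        else:
            self.sock.shutdown(socket.SHUT_WR)  # FIN is already on the wire

    def close(self):
        self.sock.close()


# Correct order: flag first, then shutw() -> close() aborts with an RST.
cli, peer = tcp_pair()
good = Conn(cli)
good.flags |= NOLINGER
good.shutw()
good.close()
peer.close()
print(good.linger_disabled)   # True

# Buggy order: shutw() before the flag -> the flag is effectively ignored.
cli, peer = tcp_pair()
bad = Conn(cli)
bad.shutw()
bad.flags |= NOLINGER
bad.close()
peer.close()
print(bad.linger_disabled)    # False
```

This only models the flag bookkeeping; the real code paths (checks, mux layers, conn_sock_drain()) are much more involved.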
Re: Increase in sockets in TIME_WAIT with 1.9.x
On Thu, Jun 13, 2019 at 05:31:09PM -0500, Dave Chiluk wrote:
> I was able to bisect this down to 53216e7 being the problematic commit,
> when using calls to setsockopt(... SO_LINGER ...) as the test metric.

Oh great, thank you for doing this!

> I used the number of calls to setsockopt with SO_LINGER in them using the
> following command.
> $ sudo timeout 60s strace -e setsockopt,close -p $(ps -lf -C haproxy | tail
> -n 1 | awk -e '{print $4}') 2>&1 | tee 1.9-${V} ; grep LINGER 1.9-${V} | wc
> -l
>
> 53216e7 = 1
> 81a15af6b = 69
>
> Interesting to note is that 1.8.17 only has roughly 17. I'll see if I can
> do a bisection for that tomorrow.

This is interesting because there is no equivalent commit in 1.8 so it
may be possible that we've created a bug long ago that triggers more
easily in certain situations.

> Hope that helps.

Of course it does :-)

Cheers,
Willy
Re: Increase in sockets in TIME_WAIT with 1.9.x
I was able to bisect this down to 53216e7 being the problematic commit,
when using calls to setsockopt(... SO_LINGER ...) as the test metric.

I used the number of calls to setsockopt with SO_LINGER in them using the
following command:

$ sudo timeout 60s strace -e setsockopt,close -p $(ps -lf -C haproxy | tail -n 1 | awk -e '{print $4}') 2>&1 | tee 1.9-${V} ; grep LINGER 1.9-${V} | wc -l

53216e7 = 1
81a15af6b = 69

Interesting to note is that 1.8.17 only has roughly 17. I'll see if I can
do a bisection for that tomorrow.

Hope that helps.
Dave.

On Thu, Jun 13, 2019 at 3:30 PM Willy Tarreau wrote:
> On Thu, Jun 13, 2019 at 03:20:20PM -0500, Dave Chiluk wrote:
> > I've attached an haproxy.cfg that is as minimal as I felt comfortable.
> (...)
>
> Many thanks for this, Dave, I truly appreciate it. I'll have a look at
> it hopefully tomorrow morning.
>
> Willy
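The counting step of the metric above (grep for LINGER in the strace capture, then wc -l) can be restated as a tiny helper when post-processing several capture files. A minimal sketch; the function name is made up for illustration:

```python
def count_linger_calls(strace_log: str) -> int:
    """Count setsockopt(..., SO_LINGER, ...) calls in an strace capture.

    This is the bisection signal used in the thread: each such call is one
    connection whose lingering was disabled before close(), i.e. one
    health-check connection closed with an RST instead of entering
    TIME_WAIT."""
    return sum(1 for line in strace_log.splitlines()
               if line.lstrip().startswith("setsockopt(")
               and "SO_LINGER" in line)
```

Run against the per-commit capture files (1.9-${V} above), this reproduces the 1-vs-69 comparison without the shell pipeline.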
Re: Increase in sockets in TIME_WAIT with 1.9.x
On Thu, Jun 13, 2019 at 03:20:20PM -0500, Dave Chiluk wrote:
> I've attached an haproxy.cfg that is as minimal as I felt comfortable.
(...)

Many thanks for this, Dave, I truly appreciate it. I'll have a look at
it hopefully tomorrow morning.

Willy
Re: Increase in sockets in TIME_WAIT with 1.9.x
I've attached an haproxy.cfg that is as minimal as I felt comfortable.
We are using admin sockets for dynamic configuration of backends so I
left the server-templating in, but no other application was configured
to orchestrate haproxy at the time of testing.

I've also attached output from:

$ sudo timeout 60s strace -e setsockopt,close -p $(ps -lf -C haproxy | tail -n 1 | awk -e '{print $4}') 2>&1 | tee 1.8.17

which shows the significant decrease in the setting of SO_LINGER. I
guess I lied earlier when I said there were none, but over 60s it looks
like 1.9.8 had 1/17th the number of SO_LINGER setsockopt calls vs
1.8.17. Unfortunately the number of sockets sitting in TIME_WAIT
fluctuates to the point where there's not a great metric to use.
Looking at SO_LINGER settings does appear to be consistent though. I
bet if I spawned 700 backend instances instead of 7 it would be more
pronounced.

I got perf stack traces for setsockopt from both versions on our
production servers, but inlining made those traces mostly useless.

Let me know if there's anything else I can grab.

Dave.

On Thu, Jun 13, 2019 at 1:30 AM Willy Tarreau wrote:
> On Wed, Jun 12, 2019 at 12:08:03PM -0500, Dave Chiluk wrote:
> > I did a bit more introspection on our TIME_WAIT connections. The increase
> > in sockets in TIME_WAIT is definitely from old connections to our backend
> > server instances. Considering the fact that this server doesn't
> > actually serve real traffic, we can make a reasonable assumption that this
> > is almost entirely due to increases in healthchecks.
>
> Great!
>
> > Doing an strace on haproxy 1.8.17 we see
> >
> > sudo strace -e setsockopt,close -p 15743
> > strace: Process 15743 attached
> > setsockopt(17, SOL_TCP, TCP_NODELAY, [1], 4) = 0
> > setsockopt(17, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
> > close(17) = 0
> >
> > Doing the same strace on 1.9.8 we see
> >
> > sudo strace -e setsockopt,close -p 6670
> > strace: Process 6670 attached
> > setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
> > close(4) = 0
> >
> > The calls to setsockopt(17, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
> > appear to be missing.
>
> Awesome, that's exactly the info I was missing. I suspected that for
> whatever reason the lingering was not disabled, at least now we have
> a proof of this! Now the trick is to figure out why :-/
>
> > We are running centos 7 with kernel 3.10.0-957.1.3.el7.x86_64.
>
> OK, and with the setsockopt it should behave properly.
>
> > I'll keep digging into this, and see if I can get stack traces that result
> > in the setsockopt calls on 1.8.17 so the stack can be more closely
> > inspected.
>
> Don't worry about this for now; this is something we at least need to resolve
> before issuing 2.0 or it will cause some trouble. Then we'll backport the
> fix once the cause is figured out.
>
> However when I try here I don't have the problem, either in 1.9.8 or
> 2.0-dev7:
>
> 08:27:30.212570 connect(14, {sa_family=AF_INET, sin_port=htons(9003), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
> 08:27:30.212590 recvfrom(14, NULL, 2147483647, MSG_TRUNC|MSG_DONTWAIT|MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> 08:27:30.212610 setsockopt(14, SOL_SOCKET, SO_LINGER, {l_onoff=1, l_linger=0}, 8) = 0
> 08:27:30.212630 close(14) = 0
> 08:27:30.212659 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=6993282}) = 0
>
> So it must depend on the type of check. Could you please share a
> minimalistic config that reproduces this?
>
> Thanks,
> Willy

1.8.17
Description: Binary data

domain_map
Description: Binary data

1.9.8
Description: Binary data

haproxy.cfg
Description: Binary data
Re: Increase in sockets in TIME_WAIT with 1.9.x
On Wed, Jun 12, 2019 at 12:08:03PM -0500, Dave Chiluk wrote:
> I did a bit more introspection on our TIME_WAIT connections. The increase
> in sockets in TIME_WAIT is definitely from old connections to our backend
> server instances. Considering the fact that this server doesn't
> actually serve real traffic, we can make a reasonable assumption that this
> is almost entirely due to increases in healthchecks.

Great!

> Doing an strace on haproxy 1.8.17 we see
>
> sudo strace -e setsockopt,close -p 15743
> strace: Process 15743 attached
> setsockopt(17, SOL_TCP, TCP_NODELAY, [1], 4) = 0
> setsockopt(17, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
> close(17) = 0
>
> Doing the same strace on 1.9.8 we see
>
> sudo strace -e setsockopt,close -p 6670
> strace: Process 6670 attached
> setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
> close(4) = 0
>
> The calls to setsockopt(17, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
> appear to be missing.

Awesome, that's exactly the info I was missing. I suspected that for
whatever reason the lingering was not disabled, at least now we have
a proof of this! Now the trick is to figure out why :-/

> We are running centos 7 with kernel 3.10.0-957.1.3.el7.x86_64.

OK, and with the setsockopt it should behave properly.

> I'll keep digging into this, and see if I can get stack traces that result
> in the setsockopt calls on 1.8.17 so the stack can be more closely
> inspected.

Don't worry about this for now; this is something we at least need to resolve
before issuing 2.0 or it will cause some trouble. Then we'll backport the
fix once the cause is figured out.

However when I try here I don't have the problem, either in 1.9.8 or
2.0-dev7:

08:27:30.212570 connect(14, {sa_family=AF_INET, sin_port=htons(9003), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
08:27:30.212590 recvfrom(14, NULL, 2147483647, MSG_TRUNC|MSG_DONTWAIT|MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
08:27:30.212610 setsockopt(14, SOL_SOCKET, SO_LINGER, {l_onoff=1, l_linger=0}, 8) = 0
08:27:30.212630 close(14) = 0
08:27:30.212659 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=6993282}) = 0

So it must depend on the type of check. Could you please share a
minimalistic config that reproduces this?

Thanks,
Willy
Re: Increase in sockets in TIME_WAIT with 1.9.x
I did a bit more introspection on our TIME_WAIT connections. The increase
in sockets in TIME_WAIT is definitely from old connections to our backend
server instances. Considering the fact that this server doesn't actually
serve real traffic, we can make a reasonable assumption that this is
almost entirely due to increases in healthchecks.

Doing an strace on haproxy 1.8.17 we see:

sudo strace -e setsockopt,close -p 15743
strace: Process 15743 attached
setsockopt(17, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(17, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
close(17) = 0

Doing the same strace on 1.9.8 we see:

sudo strace -e setsockopt,close -p 6670
strace: Process 6670 attached
setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
close(4) = 0

The calls to setsockopt(17, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
appear to be missing.

We are running centos 7 with kernel 3.10.0-957.1.3.el7.x86_64.

I'll keep digging into this, and see if I can get stack traces that result
in the setsockopt calls on 1.8.17 so the stack can be more closely
inspected.

Thanks for any help,
Dave

On Tue, Jun 11, 2019 at 2:29 AM Willy Tarreau wrote:
> On Mon, Jun 10, 2019 at 04:01:27PM -0500, Dave Chiluk wrote:
> > We are in the process of evaluating upgrading to 1.9.8 from 1.8.17,
> > and we are seeing a roughly 70% increase in sockets in TIME_WAIT on
> > our haproxy servers with a mostly idle server cluster:
> > $ sudo netstat | grep 'TIME_WAIT' | wc -l
>
> Be careful, TIME_WAIT on the frontend is neither important nor
> representative of anything, only the backend counts.
>
> > Looking at the source/destination of this it seems likely that this
> > comes from healthchecks. We also see a corresponding load increase on
> > the backend applications serving the healthchecks.
>
> It's very possible and problematic at the same time.
>
> > Checking the git logs for healthcheck was unfruitful. Any clue what
> > might be going on?
>
> Normally we make lots of efforts to close health-check responses with
> a TCP RST (by disabling lingering before closing). I don't see why it
> wouldn't be done here. What OS are you running on, and what do your
> health checks look like in the configuration?
>
> Thanks,
> Willy
Re: Increase in sockets in TIME_WAIT with 1.9.x
On Mon, Jun 10, 2019 at 04:01:27PM -0500, Dave Chiluk wrote:
> We are in the process of evaluating upgrading to 1.9.8 from 1.8.17,
> and we are seeing a roughly 70% increase in sockets in TIME_WAIT on
> our haproxy servers with a mostly idle server cluster:
> $ sudo netstat | grep 'TIME_WAIT' | wc -l

Be careful, TIME_WAIT on the frontend is neither important nor
representative of anything, only the backend counts.

> Looking at the source/destination of this it seems likely that this
> comes from healthchecks. We also see a corresponding load increase on
> the backend applications serving the healthchecks.

It's very possible and problematic at the same time.

> Checking the git logs for healthcheck was unfruitful. Any clue what
> might be going on?

Normally we make lots of efforts to close health-check responses with
a TCP RST (by disabling lingering before closing). I don't see why it
wouldn't be done here. What OS are you running on, and what do your
health checks look like in the configuration?

Thanks,
Willy
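The "disabling lingering before closing" trick works like this at the socket level: setting SO_LINGER with l_onoff=1 and l_linger=0 makes close() abort the connection with an RST instead of the normal FIN exchange, so the closing side never enters TIME_WAIT. A minimal self-contained sketch on the loopback (illustrative only; the behaviour described is the Linux one, which matters here since the reporter runs CentOS 7):

```python
import socket
import struct

# struct linger { int l_onoff; int l_linger; } packed as two ints —
# the 8-byte {onoff=1, linger=0} argument visible in the strace output.
ABORTIVE = struct.pack("ii", 1, 0)

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())
peer, _ = srv.accept()

# Disable lingering, then close: the kernel sends an RST and frees the
# connection immediately instead of walking through FIN_WAIT/TIME_WAIT.
cli.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, ABORTIVE)
cli.close()

# The server side observes the abort as a connection reset on read.
reset_seen = False
try:
    peer.recv(16)
except ConnectionResetError:
    reset_seen = True

peer.close()
srv.close()
```

Without the setsockopt() call, the same close() would send a FIN and leave the client socket in TIME_WAIT for 2*MSL, which is exactly the accumulation being reported for health checks.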
Increase in sockets in TIME_WAIT with 1.9.x
We are in the process of evaluating upgrading to 1.9.8 from 1.8.17,
and we are seeing a roughly 70% increase in sockets in TIME_WAIT on
our haproxy servers with a mostly idle server cluster:

$ sudo netstat | grep 'TIME_WAIT' | wc -l

Looking at the source/destination of this it seems likely that this
comes from healthchecks. We also see a corresponding load increase on
the backend applications serving the healthchecks.

Checking the git logs for healthcheck was unfruitful. Any clue what
might be going on?

Thanks,
Dave.