Re: Increase in sockets in TIME_WAIT with 1.9.x

2019-06-14 Thread Willy Tarreau
Hi again,

On Fri, Jun 14, 2019 at 05:25:34AM +0200, Willy Tarreau wrote:
> On Thu, Jun 13, 2019 at 05:31:09PM -0500, Dave Chiluk wrote:
> > I counted the number of setsockopt() calls with SO_LINGER in them using the
> > following command.
> > $ sudo timeout 60s strace -e setsockopt,close -p $(ps -lf -C haproxy | tail -n 1 | awk -e '{print $4}') 2>&1 | tee 1.9-${V} ; grep LINGER 1.9-${V} | wc -l
> > 
> > 53216e7 = 1
> > 81a15af6b = 69
> > 
> > It's interesting to note that 1.8.17 only has roughly 17.  I'll see if I can
> > do a bisection for that tomorrow.
> 
> This is interesting because there is no equivalent commit in 1.8 so it
> may be possible that we've created a bug long ago that triggers more
> easily in certain situations.

So in fact it was not that black or white between versions, but we've
seen that depending on the check and its behaviour, a shutw could
be emitted before setting the NOLINGER flag, which is problematic
because at this point the flag will be ignored. Olivier addressed
this by ensuring we don't shutw() when the flag is there so that only
close() does the job. I'm seeing that in some cases we call
conn_sock_drain(), which will disable the flag. Thus I'm interested in
knowing if what is present in the latest development tree (2.0 master
branch) works fine for you or if you're still experiencing the issue.
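
To make the ordering concrete, here is a minimal sketch (a hypothetical
close_with_rst() helper, not HAProxy's actual code) of the one sequence
that reliably turns the close into an RST and avoids TIME_WAIT:

#include <sys/socket.h>
#include <unistd.h>

/* Close fd with an RST rather than an orderly FIN so the socket never
 * enters TIME_WAIT. The zero-timeout SO_LINGER must be set before any
 * shutdown(fd, SHUT_WR): once a FIN has been emitted, the kernel
 * performs an orderly close and the setting no longer helps.
 */
static void close_with_rst(int fd)
{
    struct linger lg = { .l_onoff = 1, .l_linger = 0 };

    setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
    close(fd); /* emits RST; no TIME_WAIT on this side */
}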

thanks,
Willy



Re: Increase in sockets in TIME_WAIT with 1.9.x

2019-06-13 Thread Willy Tarreau
On Thu, Jun 13, 2019 at 05:31:09PM -0500, Dave Chiluk wrote:
> I was able to bisect this down to 53216e7 being the problematic commit,
> when using calls to setsockopt(... SO_LINGER ...) as the test metric.

Oh great, thank you for doing this!

> I counted the number of setsockopt() calls with SO_LINGER in them using the
> following command.
> $ sudo timeout 60s strace -e setsockopt,close -p $(ps -lf -C haproxy | tail -n 1 | awk -e '{print $4}') 2>&1 | tee 1.9-${V} ; grep LINGER 1.9-${V} | wc -l
> 
> 53216e7 = 1
> 81a15af6b = 69
> 
> It's interesting to note that 1.8.17 only has roughly 17.  I'll see if I can
> do a bisection for that tomorrow.

This is interesting because there is no equivalent commit in 1.8 so it
may be possible that we've created a bug long ago that triggers more
easily in certain situations.

>  Hope that helps.

Of course it does :-)

Cheers,
Willy



Re: Increase in sockets in TIME_WAIT with 1.9.x

2019-06-13 Thread Dave Chiluk
I was able to bisect this down to 53216e7 being the problematic commit,
when using calls to setsockopt(... SO_LINGER ...) as the test metric.

I counted the number of setsockopt() calls with SO_LINGER in them using the
following command.
$ sudo timeout 60s strace -e setsockopt,close -p $(ps -lf -C haproxy | tail -n 1 | awk -e '{print $4}') 2>&1 | tee 1.9-${V} ; grep LINGER 1.9-${V} | wc -l

53216e7 = 1
81a15af6b = 69

It's interesting to note that 1.8.17 only has roughly 17.  I'll see if I can
do a bisection for that tomorrow.  Hope that helps.

Dave.

On Thu, Jun 13, 2019 at 3:30 PM Willy Tarreau wrote:

> On Thu, Jun 13, 2019 at 03:20:20PM -0500, Dave Chiluk wrote:
> > I've attached an haproxy.cfg that is as minimal as I felt comfortable making it.
> (...)
>
> Many thanks for this, Dave; I truly appreciate it. I'll have a look at
> it hopefully tomorrow morning.
>
> Willy
>


Re: Increase in sockets in TIME_WAIT with 1.9.x

2019-06-13 Thread Willy Tarreau
On Thu, Jun 13, 2019 at 03:20:20PM -0500, Dave Chiluk wrote:
> I've attached an haproxy.cfg that is as minimal as I felt comfortable making it.
(...)

Many thanks for this, Dave; I truly appreciate it. I'll have a look at
it hopefully tomorrow morning.

Willy



Re: Increase in sockets in TIME_WAIT with 1.9.x

2019-06-13 Thread Dave Chiluk
I've attached an haproxy.cfg that is as minimal as I felt comfortable making
it.  We are using admin sockets for dynamic configuration of backends, so I
left the server-templating in, but no other application was configured to
orchestrate haproxy at the time of testing.

I've also attached output from
$ sudo timeout 60s strace -e setsockopt,close -p $(ps -lf -C haproxy | tail -n 1 | awk -e '{print $4}') 2>&1 | tee 1.8.17

This shows the significant decrease in the setting of SO_LINGER.  I guess I
lied earlier when I said there were none, but over 60s it looks like 1.9.8
had 1/17th the number of SO_LINGER setsockopt calls vs 1.8.17.

Unfortunately the number of sockets sitting in TIME_WAIT fluctuates to the
point where it's not a great metric to use.  Counting SO_LINGER setsockopt
calls does appear to be consistent though.  I bet if I spawned 700
backend instances instead of 7 it would be more pronounced.

I got perf stack traces for setsockopt from both versions on our production
servers, but inlining made those traces mostly useless.

Let me know if there's anything else I can grab.

Dave.


On Thu, Jun 13, 2019 at 1:30 AM Willy Tarreau wrote:

> On Wed, Jun 12, 2019 at 12:08:03PM -0500, Dave Chiluk wrote:
> > I did a bit more introspection on our TIME_WAIT connections.  The increase
> > in sockets in TIME_WAIT is definitely from old connections to our backend
> > server instances.  Considering the fact that this server doesn't
> > actually serve real traffic, we can make a reasonable assumption that this
> > is almost entirely due to increases in healthchecks.
>
> Great!
>
> > Doing an strace on haproxy 1.8.17 we see
> > 
> > sudo strace -e setsockopt,close -p 15743
> > strace: Process 15743 attached
> > setsockopt(17, SOL_TCP, TCP_NODELAY, [1], 4) = 0
> > setsockopt(17, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
> > close(17)   = 0
> > 
> >
> > Doing the same strace on 1.9.8 we see
> > 
> > sudo strace -e setsockopt,close -p 6670
> > strace: Process 6670 attached
> > setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
> > close(4) = 0
> > 
> >
> > The calls to setsockopt(17, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
> > appear to be missing.
>
> Awesome, that's exactly the info I was missing. I suspected that for
> whatever reason the lingering was not disabled; at least now we have
> proof of it! Now the trick is to figure out why :-/
>
> > We are running CentOS 7 with kernel 3.10.0-957.1.3.el7.x86_64.
>
> OK, and with the setsockopt it should behave properly.
>
> > I'll keep digging into this, and see if I can get stack traces that result
> > in the setsockopt calls on 1.8.17 so the stack can be more closely
> > inspected.
>
> Don't worry about this for now; this is something we at least need to resolve
> before issuing 2.0 or it will cause some trouble. Then we'll backport the
> fix once the cause is figured out.
>
> However when I try here I don't have the problem, either in 1.9.8 or 2.0-dev7:
>
> 08:27:30.212570 connect(14, {sa_family=AF_INET, sin_port=htons(9003), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
> 08:27:30.212590 recvfrom(14, NULL, 2147483647, MSG_TRUNC|MSG_DONTWAIT|MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> 08:27:30.212610 setsockopt(14, SOL_SOCKET, SO_LINGER, {l_onoff=1, l_linger=0}, 8) = 0
> 08:27:30.212630 close(14)   = 0
> 08:27:30.212659 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=6993282}) = 0
>
> So it must depend on the type of check. Could you please share a minimalistic
> config that reproduces this?
>
> Thanks,
> Willy
>


1.8.17
Description: Binary data


domain_map
Description: Binary data


1.9.8
Description: Binary data


haproxy.cfg
Description: Binary data


Re: Increase in sockets in TIME_WAIT with 1.9.x

2019-06-12 Thread Willy Tarreau
On Wed, Jun 12, 2019 at 12:08:03PM -0500, Dave Chiluk wrote:
> I did a bit more introspection on our TIME_WAIT connections.  The increase
> in sockets in TIME_WAIT is definitely from old connections to our backend
> server instances.  Considering the fact that this server doesn't
> actually serve real traffic, we can make a reasonable assumption that this
> is almost entirely due to increases in healthchecks.

Great!

> Doing an strace on haproxy 1.8.17 we see
> 
> sudo strace -e setsockopt,close -p 15743
> strace: Process 15743 attached
> setsockopt(17, SOL_TCP, TCP_NODELAY, [1], 4) = 0
> setsockopt(17, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
> close(17)   = 0
> 
> 
> Doing the same strace on 1.9.8 we see
> 
> sudo strace -e setsockopt,close -p 6670
> strace: Process 6670 attached
> setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
> close(4) = 0
> 
> 
> The calls to setsockopt(17, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
> appear to be missing.

Awesome, that's exactly the info I was missing. I suspected that for
whatever reason the lingering was not disabled; at least now we have
proof of it! Now the trick is to figure out why :-/

> We are running CentOS 7 with kernel 3.10.0-957.1.3.el7.x86_64.

OK, and with the setsockopt it should behave properly.

> I'll keep digging into this, and see if I can get stack traces that result
> in the setsockopt calls on 1.8.17 so the stack can be more closely
> inspected.

Don't worry about this for now; this is something we at least need to resolve
before issuing 2.0 or it will cause some trouble. Then we'll backport the
fix once the cause is figured out.

However when I try here I don't have the problem, either in 1.9.8 or 2.0-dev7:

08:27:30.212570 connect(14, {sa_family=AF_INET, sin_port=htons(9003), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
08:27:30.212590 recvfrom(14, NULL, 2147483647, MSG_TRUNC|MSG_DONTWAIT|MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
08:27:30.212610 setsockopt(14, SOL_SOCKET, SO_LINGER, {l_onoff=1, l_linger=0}, 8) = 0
08:27:30.212630 close(14)   = 0
08:27:30.212659 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=6993282}) = 0

So it must depend on the type of check. Could you please share a minimalistic
config that reproduces this?

Thanks,
Willy



Re: Increase in sockets in TIME_WAIT with 1.9.x

2019-06-12 Thread Dave Chiluk
I did a bit more introspection on our TIME_WAIT connections.  The increase
in sockets in TIME_WAIT is definitely from old connections to our backend
server instances.  Considering the fact that this server doesn't
actually serve real traffic, we can make a reasonable assumption that this
is almost entirely due to increases in healthchecks.

Doing an strace on haproxy 1.8.17 we see

sudo strace -e setsockopt,close -p 15743
strace: Process 15743 attached
setsockopt(17, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(17, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
close(17)   = 0


Doing the same strace on 1.9.8 we see

sudo strace -e setsockopt,close -p 6670
strace: Process 6670 attached
setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
close(4) = 0


The calls to setsockopt(17, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
appear to be missing.
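
In case it's useful for confirming this on the backend side, here's a
small sketch (a hypothetical report_close_kind() helper, nothing from
HAProxy) that could tell whether the checker closed with an RST or an
orderly FIN, when called on a socket returned by accept() on the check
port:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Drain an accepted check connection until the peer closes it.
 * recv() returning 0 means an orderly FIN (the closing side then sits
 * in TIME_WAIT); ECONNRESET means the peer sent an RST, which is what
 * a zero-linger close produces and what leaves no TIME_WAIT behind.
 */
static void report_close_kind(int fd)
{
    char buf[256];
    ssize_t n;

    while ((n = recv(fd, buf, sizeof(buf), 0)) > 0)
        ; /* discard whatever the checker sent */

    if (n == 0)
        puts("orderly close (FIN) -> closing side enters TIME_WAIT");
    else if (errno == ECONNRESET)
        puts("reset (RST) -> no TIME_WAIT on either side");
    else
        printf("recv: %s\n", strerror(errno));
}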

We are running CentOS 7 with kernel 3.10.0-957.1.3.el7.x86_64.

I'll keep digging into this, and see if I can get stack traces that result
in the setsockopt calls on 1.8.17 so the stack can be more closely
inspected.

Thanks for any help,
Dave


On Tue, Jun 11, 2019 at 2:29 AM Willy Tarreau wrote:

> On Mon, Jun 10, 2019 at 04:01:27PM -0500, Dave Chiluk wrote:
> > We are in the process of evaluating upgrading to 1.9.8 from 1.8.17,
> > and we are seeing a roughly 70% increase in sockets in TIME_WAIT on
> > our haproxy servers with a mostly idle server cluster
> > $ sudo netstat | grep 'TIME_WAIT' | wc -l
>
> Be careful: TIME_WAIT on the frontend is neither important nor
> representative of anything; only the backend counts.
>
> > Looking at the source/destination of these sockets, it seems likely that this
> > comes from healthchecks.  We also see a corresponding load increase on
> > the backend applications serving the healthchecks.
>
> It's very possible and problematic at the same time.
>
> > Checking the git logs for healthcheck was unfruitful.  Any clue what
> > might be going on?
>
> Normally we make lots of efforts to close health-check responses with
> a TCP RST (by disabling lingering before closing). I don't see why it
> wouldn't be done here. What OS are you running on and what do your
> health checks look like in the configuration?
>
> Thanks,
> Willy
>


Re: Increase in sockets in TIME_WAIT with 1.9.x

2019-06-11 Thread Willy Tarreau
On Mon, Jun 10, 2019 at 04:01:27PM -0500, Dave Chiluk wrote:
> We are in the process of evaluating upgrading to 1.9.8 from 1.8.17,
> and we are seeing a roughly 70% increase in sockets in TIME_WAIT on
> our haproxy servers with a mostly idle server cluster
> $ sudo netstat | grep 'TIME_WAIT' | wc -l

Be careful: TIME_WAIT on the frontend is neither important nor
representative of anything; only the backend counts.
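
To count only what matters, here is a small sketch (hypothetical,
Linux-only, IPv4 via /proc/net/tcp) that counts the sockets in TIME_WAIT
toward one remote port, e.g. your backend or check port:

#include <stdio.h>
#include <stdlib.h>

/* Count IPv4 sockets in TIME_WAIT (state 0x06 in /proc/net/tcp)
 * whose remote port matches the one given on the command line.
 */
int main(int argc, char **argv)
{
    unsigned int port = (argc > 1) ? strtoul(argv[1], NULL, 10) : 8080;
    char line[512];
    int count = 0;
    FILE *f = fopen("/proc/net/tcp", "r");

    if (!f)
        return 1;
    fgets(line, sizeof(line), f); /* skip the header line */
    while (fgets(line, sizeof(line), f)) {
        unsigned int rport, state;

        /* fields: sl local_address rem_address st ... (ports in hex) */
        if (sscanf(line, "%*d: %*x:%*x %*x:%x %x", &rport, &state) == 2
            && state == 0x06 && rport == port)
            count++;
    }
    fclose(f);
    printf("%d sockets in TIME_WAIT to port %u\n", count, port);
    return 0;
}

(Compile with e.g. gcc -O2 -o twcount twcount.c and pass the backend
port; 8080 is just a placeholder default.)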

> Looking at the source/destination of these sockets, it seems likely that this
> comes from healthchecks.  We also see a corresponding load increase on
> the backend applications serving the healthchecks.

It's very possible and problematic at the same time.

> Checking the git logs for healthcheck was unfruitful.  Any clue what
> might be going on?

Normally we make lots of efforts to close health-check responses with
a TCP RST (by disabling lingering before closing). I don't see why it
wouldn't be done here. What OS are you running on and what do your
health checks look like in the configuration?

Thanks,
Willy



Increase in sockets in TIME_WAIT with 1.9.x

2019-06-10 Thread Dave Chiluk
We are in the process of evaluating upgrading to 1.9.8 from 1.8.17,
and we are seeing a roughly 70% increase in sockets in TIME_WAIT on
our haproxy servers with a mostly idle server cluster:
$ sudo netstat | grep 'TIME_WAIT' | wc -l

Looking at the source/destination of these sockets, it seems likely that this
comes from healthchecks.  We also see a corresponding load increase on
the backend applications serving the healthchecks.

Checking the git logs for healthcheck was unfruitful.  Any clue what
might be going on?

Thanks,
Dave.