Re: HAProxy 1.6.3: 100% cpu utilization for >17 days with 1 connection

2017-05-19 Thread Vincent Bernat
 ❦ 19 May 2017 08:28 +0200, Willy Tarreau  :

> The problem is that this is what was attempted back in the 1.3 days,
> resulting in still highly bogus versions being deployed in the field and
> users being very confident in them because they were recently updated.
> These days, every patch going into a stable release MUST be applied.
> What is considered major for some has no impact for others, and what is
> minor for some is business critical for others. In all cases it ends up
> with reports here on the list.

While there are some exceptions, the rule of thumb in distributions
is to stick to a version and backport only very critical (and security)
patches. While this creates some friction with upstreams, it seems to
match the expectations of many users (no random breakage during
upgrades). This is a bit unfair to you, since you maintain a bugfix-only
branch, which is also our goal (we just have more draconian rules about
which patches we accept).

As I said, there are exceptions (maybe 5 packages) that are kept
up to date with the latest version of a branch, but they only apply to
unfriendly upstreams (Oracle, for example). I'll let you draw your own
conclusions from that. :)

In Debian, we acknowledged a long time ago that this doesn't fit the
needs of all our users, so we have official backports. Ubuntu has them
too, but they are not widely used (Trusty got a backport of HAProxy
1.5.14 because they needed it for a customer).

>> FYI, we are likely to release 1.7.5 (with USE_GETADDRINFO=1 enabled) in
>> our next release (to happen in July I hope).
>
> Do you think there's an opportunity to get 1.7.6 in if I release it next week?
> It provides -fwrapv, which will likely avoid certain bugs with more recent
> compilers, and there's a fix for a segfault in Lua.

We need to make that decision with the release team. We have been in
freeze since January, but we were nonetheless able to push the latest
version. I think it is likely we will be able to push 1.7.6 as well.
Otherwise, we'll push the patches marked "BUG/MAJOR" (this is what we
did for the 1.5.8 currently in Jessie, which is 1.5.8 + patches from
1.5.9 and 1.5.10 + a security patch).
-- 
Format a program to help the reader understand it.
- The Elements of Programming Style (Kernighan & Plauger)



Re: HAProxy 1.6.3: 100% cpu utilization for >17 days with 1 connection

2017-05-19 Thread Willy Tarreau
Hi Vincent,

On Fri, May 19, 2017 at 07:38:20AM +0200, Vincent Bernat wrote:
>  ❦ 19 May 2017 07:04 +0200, Willy Tarreau  :
> 
> >> I saw many similar issues posted earlier by others, but could not
> >> find a thread where this is resolved or fixed in a newer release. We
> >> are using Ubuntu 16.04 with distro HAProxy (1.6.3), and see that
> >> HAProxy spins at 100% with 1-10 TCP connections, sometimes just 1 - a
> >> stale connection that does not seem to belong to any frontend
> >> session. Strace with -T shows the following:
> >
> > In fact a few bugs have caused this situation and all known ones were
> > fixed, which doesn't mean there are none left, of course. However, your
> > version is totally outdated and contains tons of known bugs which were
> > later fixed (196 total, 22 major, 78 medium, 96 minor):
> >
> >http://www.haproxy.org/bugs/bugs-1.6.3.html
> 
> Those pages are quite useful!

I made them to help everyone know when they're using a bogus version and
to encourage any user to upgrade (including by using your packages for
those on debian/ubuntu).

> That's the version in Ubuntu Xenial. It is possible to add some patches
> and push a new release. However, we have to select the patches (all the
> MAJOR ones?) and create this hybrid version. It could be useful for
> people not allowed to use third-party packages (like the ones on
> haproxy.debian.net) or for those who just don't know they exist. While
> I think this would be useful for many, the gap is so wide that it even
> seems risky. If we are able to identify a couple of patches, I can walk
> through the process of pushing them.

The problem is that this is what was attempted back in the 1.3 days,
resulting in still highly bogus versions being deployed in the field and
users being very confident in them because they were recently updated.
These days, every patch going into a stable release MUST be applied.
What is considered major for some has no impact for others, and what is
minor for some is business critical for others. In all cases it ends up
with reports here on the list.

In fact, if I were feeling a bit itchy, I would suggest that another
update to the package shipped by default should systematically cause
haproxy to emit a warning on startup saying "this version is outdated
and cannot be upgraded for internal backport policy reasons, please
check haproxy.debian.net for well-maintained, up-to-date packages".

At the very least we could point the "updates" link on the stats page to
haproxy.debian.net.

> This version is in Ubuntu because it was the version in Debian
> unstable a few months before the freeze. It's always a bit random, as we
> (in Debian) don't really care about that when choosing the version we
> push to unstable (we care about our own release).

I see. This is also what helps us push for better versions in future
releases :-)

> FYI, we are likely to release 1.7.5 (with USE_GETADDRINFO=1 enabled) in
> our next release (to happen in July I hope).

Do you think there's an opportunity to get 1.7.6 in if I release it next week?
It provides -fwrapv, which will likely avoid certain bugs with more recent
compilers, and there's a fix for a segfault in Lua.

Cheers,
Willy



Re: HAProxy 1.6.3: 100% cpu utilization for >17 days with 1 connection

2017-05-18 Thread Vincent Bernat
 ❦ 19 May 2017 07:04 +0200, Willy Tarreau  :

>> I saw many similar issues posted earlier by others, but could not
>> find a thread where this is resolved or fixed in a newer release. We
>> are using Ubuntu 16.04 with distro HAProxy (1.6.3), and see that
>> HAProxy spins at 100% with 1-10 TCP connections, sometimes just 1 - a
>> stale connection that does not seem to belong to any frontend
>> session. Strace with -T shows the following:
>
> In fact a few bugs have caused this situation and all known ones were
> fixed, which doesn't mean there are none left, of course. However, your
> version is totally outdated and contains tons of known bugs which were
> later fixed (196 total, 22 major, 78 medium, 96 minor):
>
>http://www.haproxy.org/bugs/bugs-1.6.3.html

Those pages are quite useful!

That's the version in Ubuntu Xenial. It is possible to add some patches
and push a new release. However, we have to select the patches (all the
MAJOR ones?) and create this hybrid version. It could be useful for
people not allowed to use third-party packages (like the ones on
haproxy.debian.net) or for those who just don't know they exist. While
I think this would be useful for many, the gap is so wide that it even
seems risky. If we are able to identify a couple of patches, I can walk
through the process of pushing them.

This version is in Ubuntu because it was the version in Debian
unstable a few months before the freeze. It's always a bit random, as we
(in Debian) don't really care about that when choosing the version we
push to unstable (we care about our own release).

FYI, we are likely to release 1.7.5 (with USE_GETADDRINFO=1 enabled) in
our next release (to happen in July I hope).
-- 
There's small choice in rotten apples.
-- William Shakespeare, "The Taming of the Shrew"



Re: HAProxy 1.6.3: 100% cpu utilization for >17 days with 1 connection

2017-05-18 Thread Krishna Kumar (Engineering)
Hi Willy,

Thanks for your response/debug details.

> It seems that something is preventing the connection close from being
> considered, while the task is woken up on a timeout and on I/O. This
> exactly reminds me of the client-fin/server-fin bug in fact. Do you
> have any of these timeouts in your config ?

You are right! We have this: "timeout client-fin 3ms"
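
For context, the directive in question sits in a frontend (or defaults)
section. A minimal sketch of the relevant fragment; only the client-fin
value above is from our actual config, and the section name, port,
backend name, and other timeouts are illustrative:

```
frontend fe_example
    mode tcp
    bind :5667
    timeout client     30s
    # wait at most this long for the client's FIN/close after we have
    # shut down our side of the connection
    timeout client-fin 3ms
    default_backend be_example
```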

> So at least you have 3 times 196 bugs in production :-)

And many times that: we have *lots* of servers handling Flipkart's
traffic. Thanks for pointing out this information.

So we will upgrade once our internal processes are sorted out. Thanks
once again for identifying the source of the problem so quickly.

Regards,
- Krishna


On Fri, May 19, 2017 at 10:34 AM, Willy Tarreau  wrote:

> Hi Krishna,
>
> On Fri, May 19, 2017 at 09:47:52AM +0530, Krishna Kumar (Engineering)
> wrote:
> > I saw many similar issues posted earlier by others, but could not find
> > a thread where this is resolved or fixed in a newer release. We are
> > using Ubuntu 16.04 with distro HAProxy (1.6.3), and see that HAProxy
> > spins at 100% with 1-10 TCP connections, sometimes just 1 - a stale
> > connection that does not seem to belong to any frontend session.
> > Strace with -T shows the following:
>
> In fact a few bugs have caused this situation and all known ones were
> fixed, which doesn't mean there is none left of course. However your
> version is totally outdated and contains tons of known bugs which were
> later fixed (196 total, 22 major, 78 medium, 96 minor) :
>
>http://www.haproxy.org/bugs/bugs-1.6.3.html
>
> > The single connection has this session information:
> > 0xd1d790: [06/May/2017:02:44:37.373636] id=286529830 proto=tcpv4
> > source=a.a.a.a:35297
> >   flags=0x1ce, conn_retries=0, srv_conn=0xca4000, pend_pos=(nil)
> >   frontend=fe-fe-fe-fe-fe-fe (id=3 mode=tcp), listener=? (id=1)
> > addr=b.b.b.b:5667
> >   backend=be-be-be-be-be-be (id=4 mode=tcp) addr=c.c.c.c:11870
> >   server=d.d.d.d (id=4) addr=d.d.d.d:5667
> >   task=0xd1d710 (state=0x04 nice=0 calls=1117789229 exp=, running
> > age=12d11h)
> >   si[0]=0xd1d988 (state=CLO flags=0x00 endp0=CONN:0xd771c0 exp=,
> > et=0x000)
> >   si[1]=0xd1d9a8 (state=EST flags=0x10 endp1=CONN:0xccadb0 exp=,
> > et=0x000)
> >   co0=0xd771c0 ctrl=NONE xprt=NONE data=STRM target=LISTENER:0xc76ae0
> >   flags=0x002f9000 fd=55 fd.state=00 fd.cache=0 updt=0
> >   co1=0xccadb0 ctrl=tcpv4 xprt=RAW data=STRM target=SERVER:0xca4000
> >   flags=0x0020b310 fd=9 fd_spec_e=22 fd_spec_p=0 updt=0
> >   req=0xd1d7a0 (f=0x80a020 an=0x0 pipe=0 tofwd=-1 total=0)
> >   an_exp= rex=? wex=
> >   buf=0x6e9120 data=0x6e9134 o=0 p=0 req.next=0 i=0 size=0
> >   res=0xd1d7e0 (f=0x8000a020 an=0x0 pipe=0 tofwd=0 total=0)
> >   an_exp= rex= wex=
> >   buf=0x6e9120 data=0x6e9134 o=0 p=0 rsp.next=0 i=0 size=0
>
>
> That's quite useful, thanks!
>
>  - connection with client is closed
>  - connection with server is still established and theoretically stopped
>    from polling
>  - the request channel is closed in both directions
>  - the response channel is closed in both directions
>  - both buffers are empty
>
> It seems that something is preventing the connection close from being
> considered, while the task is woken up on a timeout and on I/O. This
> exactly reminds me of the client-fin/server-fin bug in fact. Do you
> have any of these timeouts in your config ?
>
> I'm also noticing that the session is 12.5 days old. So either it has
> been looping for that long (after all, the function has been called over
> 1 billion times), or it was a long session which recently timed out.
>
> > We have 3 systems running the identical configuration and haproxy binary,
>
> So at least you have 3 times 196 bugs in production :-)
>
> > and the 100% cpu is ongoing for the last 17 days on one system. The
> > client connection is no longer present. I am assuming that a haproxy
> > reload would solve this as the frontend connection is not present, but
> > have not tested it out yet. Since this box is in production, I am
> > unable to do invasive debugging (e.g. gdb).
>
> For sure. At least an upgrade to 1.6.12 would get rid of most of these
> known bugs. You could perform a rolling upgrade, starting with the machine
> having been in that situation for the longest time.
>
> > Please let me know if this is fixed in a later release, or any more
> > information that can help find the root cause.
>
> For me everything here looks like the client-fin/server-fin bug that was
> fixed two months ago, so if you're using these timeouts it's very likely
> fixed. If not, there's still a small probability that the fixes made to
> better deal with wakeup events in the case of the server-fin bug could
> have addressed a wider class of bugs: often we find one way to enter a
> certain bogus condition and can hardly imagine all the other possibilities.
>
> Regards,
> Willy
>


Re: HAProxy 1.6.3: 100% cpu utilization for >17 days with 1 connection

2017-05-18 Thread Willy Tarreau
Hi Krishna,

On Fri, May 19, 2017 at 09:47:52AM +0530, Krishna Kumar (Engineering) wrote:
> I saw many similar issues posted earlier by others, but could not find
> a thread where this is resolved or fixed in a newer release. We are
> using Ubuntu 16.04 with distro HAProxy (1.6.3), and see that HAProxy
> spins at 100% with 1-10 TCP connections, sometimes just 1 - a stale
> connection that does not seem to belong to any frontend session.
> Strace with -T shows the following:

In fact a few bugs have caused this situation and all known ones were
fixed, which doesn't mean there are none left, of course. However, your
version is totally outdated and contains tons of known bugs which were
later fixed (196 total, 22 major, 78 medium, 96 minor):

   http://www.haproxy.org/bugs/bugs-1.6.3.html

> The single connection has this session information:
> 0xd1d790: [06/May/2017:02:44:37.373636] id=286529830 proto=tcpv4
> source=a.a.a.a:35297
>   flags=0x1ce, conn_retries=0, srv_conn=0xca4000, pend_pos=(nil)
>   frontend=fe-fe-fe-fe-fe-fe (id=3 mode=tcp), listener=? (id=1)
> addr=b.b.b.b:5667
>   backend=be-be-be-be-be-be (id=4 mode=tcp) addr=c.c.c.c:11870
>   server=d.d.d.d (id=4) addr=d.d.d.d:5667
>   task=0xd1d710 (state=0x04 nice=0 calls=1117789229 exp=, running
> age=12d11h)
>   si[0]=0xd1d988 (state=CLO flags=0x00 endp0=CONN:0xd771c0 exp=,
> et=0x000)
>   si[1]=0xd1d9a8 (state=EST flags=0x10 endp1=CONN:0xccadb0 exp=,
> et=0x000)
>   co0=0xd771c0 ctrl=NONE xprt=NONE data=STRM target=LISTENER:0xc76ae0
>   flags=0x002f9000 fd=55 fd.state=00 fd.cache=0 updt=0
>   co1=0xccadb0 ctrl=tcpv4 xprt=RAW data=STRM target=SERVER:0xca4000
>   flags=0x0020b310 fd=9 fd_spec_e=22 fd_spec_p=0 updt=0
>   req=0xd1d7a0 (f=0x80a020 an=0x0 pipe=0 tofwd=-1 total=0)
>   an_exp= rex=? wex=
>   buf=0x6e9120 data=0x6e9134 o=0 p=0 req.next=0 i=0 size=0
>   res=0xd1d7e0 (f=0x8000a020 an=0x0 pipe=0 tofwd=0 total=0)
>   an_exp= rex= wex=
>   buf=0x6e9120 data=0x6e9134 o=0 p=0 rsp.next=0 i=0 size=0


That's quite useful, thanks!

 - connection with client is closed
 - connection with server is still established and theoretically stopped
   from polling
 - the request channel is closed in both directions
 - the response channel is closed in both directions
 - both buffers are empty

It seems that something is preventing the connection close from being
considered, while the task is woken up on a timeout and on I/O. This
exactly reminds me of the client-fin/server-fin bug in fact. Do you
have any of these timeouts in your config ?

I'm also noticing that the session is 12.5 days old. So either it has
been looping for that long (after all, the function has been called over
1 billion times), or it was a long session which recently timed out.

> We have 3 systems running the identical configuration and haproxy binary,

So at least you have 3 times 196 bugs in production :-)

> and the 100% cpu is ongoing for the last 17 days on one system. The
> client connection is no longer present. I am assuming that a haproxy
> reload would solve this as the frontend connection is not present, but
> have not tested it out yet. Since this box is in production, I am unable
> to do invasive debugging (e.g. gdb).

For sure. At least an upgrade to 1.6.12 would get rid of most of these
known bugs. You could perform a rolling upgrade, starting with the machine
having been in that situation for the longest time.

> Please let me know if this is fixed in a later release, or any more
> information that can help find the root cause.

For me everything here looks like the client-fin/server-fin bug that was
fixed two months ago, so if you're using these timeouts it's very likely
fixed. If not, there's still a small probability that the fixes made to
better deal with wakeup events in the case of the server-fin bug could
have addressed a wider class of bugs: often we find one way to enter a
certain bogus condition and can hardly imagine all the other possibilities.

Regards,
Willy



HAProxy 1.6.3: 100% cpu utilization for >17 days with 1 connection

2017-05-18 Thread Krishna Kumar (Engineering)
Hi,

First of all, thanks for a great product that is working extremely well for
Flipkart!

I saw many similar issues posted earlier by others, but could not find
a thread where this is resolved or fixed in a newer release. We are
using Ubuntu 16.04 with distro HAProxy (1.6.3), and see that HAProxy
spins at 100% with 1-10 TCP connections, sometimes just 1 - a stale
connection that does not seem to belong to any frontend session.
Strace with -T shows the following:

epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.20>
epoll_wait(0, [], 200, 0)   = 0 <0.09>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [{EPOLLIN|EPOLLHUP|EPOLLRDHUP, {u32=2, u64=2}}], 200, 0) = 1
<0.06>
epoll_wait(0, [{EPOLLIN, {u32=11, u64=11}}], 200, 0) = 1 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.29>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.21>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.11>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [{EPOLLIN, {u32=7, u64=7}}], 200, 0) = 1 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.07>
epoll_wait(0, [{EPOLLOUT, {u32=2, u64=2}}], 200, 0) = 1 <0.15>
epoll_wait(0, [], 200, 0)   = 0 <0.07>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.16>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.08>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.17>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [{EPOLLIN, {u32=10, u64=10}}], 200, 0) = 1 <0.09>
epoll_wait(0, [{EPOLLIN|EPOLLRDHUP, {u32=10, u64=10}}], 200, 0) = 1
<0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.16>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.17>

The single connection has this session information:
0xd1d790: [06/May/2017:02:44:37.373636] id=286529830 proto=tcpv4
source=a.a.a.a:35297
  flags=0x1ce, conn_retries=0, srv_conn=0xca4000,