Re: HAProxy 1.6.3: 100% cpu utilization for >17 days with 1 connection

2017-05-18 Thread Willy Tarreau
Hi Vincent,

On Fri, May 19, 2017 at 07:38:20AM +0200, Vincent Bernat wrote:
>  ❦ 19 May 2017 07:04 +0200, Willy Tarreau  :
> 
> >> I saw many similar issues posted earlier by others, but could not
> >> find a thread where this is resolved or fixed in a newer release. We
> >> are using Ubuntu 16.04 with distro HAProxy (1.6.3), and see that
> >> HAProxy spins at 100% with 1-10 TCP connections, sometimes just 1 - a
> >> stale connection that does not seem to belong to any frontend
> >> session. Strace with -T shows the following:
> >
> > In fact a few bugs have caused this situation and all known ones were
> > fixed, which doesn't mean there is none left of course. However your
> > version is totally outdated and contains tons of known bugs which were
> > later fixed (196 total, 22 major, 78 medium, 96 minor) :
> >
> >http://www.haproxy.org/bugs/bugs-1.6.3.html
> 
> Those pages are quite useful!

I made them to help everyone know when they're using a bogus version and
to encourage any user to upgrade (including by using your packages for
those on debian/ubuntu).

> That's the version in Ubuntu Xenial. It is possible to add some patches
> and push a new release. However, we have to select the patches (all the
> MAJOR ones?) and create this hybrid version. It could be useful for
> people not allowed to use third party packages (like the ones on
> haproxy.debian.net) or for those that just don't know they exist. While
> I think this would be useful for many, the gap is so wide that it even
seems risky. If we are able to identify a couple of patches, I can walk
through the process of pushing them.

The problem is that this is what was attempted back in the days of 1.3,
resulting in still highly bogus versions being deployed in the field and
users being very confident in them because they had recently been updated.
These days, every patch going into a stable release MUST be applied.
What is considered major for some has no impact for others, and what is
minor for some is business-critical for others. In all cases it ends up
with reports here on the list.

In fact if I were a bit itchy, I would suggest that another update to
the package shipped by default would systematically cause haproxy to
emit a warning on startup saying "this version is outdated and cannot
be upgraded for internal backport policy reasons, please check
haproxy.debian.net for well-maintained, up-to-date packages".

At the very least we could point the "updates" link on the stats page to
haproxy.debian.net.

> This version is in Ubuntu because this was the version in Debian
> unstable a few months before the freeze. It's always a bit random as we
> (in Debian) don't really care about that when choosing the version we
> push in unstable (we care about our own release).

I see. This is also what helps us push for better versions in future
releases :-)

> FYI, we are likely to release 1.7.5 (with USE_GETADDRINFO=1 enabled) in
> our next release (to happen in July I hope).

Do you think there's an opportunity to get 1.7.6 if I release it next week ?
It provides -fwrapv which will likely avoid certain bugs with more recent
compilers, and there's a fix for a segfault in Lua.

Cheers,
Willy



Re: HAProxy 1.6.3: 100% cpu utilization for >17 days with 1 connection

2017-05-18 Thread Vincent Bernat
 ❦ 19 May 2017 07:04 +0200, Willy Tarreau  :

>> I saw many similar issues posted earlier by others, but could not
>> find a thread where this is resolved or fixed in a newer release. We
>> are using Ubuntu 16.04 with distro HAProxy (1.6.3), and see that
>> HAProxy spins at 100% with 1-10 TCP connections, sometimes just 1 - a
>> stale connection that does not seem to belong to any frontend
> >> session. Strace with -T shows the following:
>
> In fact a few bugs have caused this situation and all known ones were
> fixed, which doesn't mean there is none left of course. However your
> version is totally outdated and contains tons of known bugs which were
> later fixed (196 total, 22 major, 78 medium, 96 minor) :
>
>http://www.haproxy.org/bugs/bugs-1.6.3.html

Those pages are quite useful!

That's the version in Ubuntu Xenial. It is possible to add some patches
and push a new release. However, we have to select the patches (all the
MAJOR ones?) and create this hybrid version. It could be useful for
people not allowed to use third party packages (like the ones on
haproxy.debian.net) or for those that just don't know they exist. While
I think this would be useful for many, the gap is so wide that it even
seems risky. If we are able to identify a couple of patches, I can walk
through the process of pushing them.

This version is in Ubuntu because this was the version in Debian
unstable a few months before the freeze. It's always a bit random as we
(in Debian) don't really care about that when choosing the version we
push in unstable (we care about our own release).

FYI, we are likely to release 1.7.5 (with USE_GETADDRINFO=1 enabled) in
our next release (to happen in July I hope).
-- 
There's small choice in rotten apples.
-- William Shakespeare, "The Taming of the Shrew"



Re: HAProxy 1.6.3: 100% cpu utilization for >17 days with 1 connection

2017-05-18 Thread Krishna Kumar (Engineering)
Hi Willy,

Thanks for your response/debug details.

> It seems that something is preventing the connection close from being
> considered, while the task is woken up on a timeout and on I/O. This
> exactly reminds me of the client-fin/server-fin bug in fact. Do you
> have any of these timeouts in your config ?

You are right! We have this: "timeout client-fin 3ms"

> So at least you have 3 times 196 bugs in production :-)

And many times that number -- we have *lots* of servers handling the
Flipkart traffic. Thanks for pointing out this information.

So we will upgrade after internal processes are sorted out. Thanks once
again for this quick information on the source of the problem.

Regards,
- Krishna


On Fri, May 19, 2017 at 10:34 AM, Willy Tarreau  wrote:

> Hi Krishna,
>
> On Fri, May 19, 2017 at 09:47:52AM +0530, Krishna Kumar (Engineering)
> wrote:
> > I saw many similar issues posted earlier by others, but could not find a
> > thread
> > where this is resolved or fixed in a newer release. We are using Ubuntu
> > 16.04
> > with distro HAProxy (1.6.3), and see that HAProxy spins at 100% with 1-10
> > TCP
> > connections, sometimes just 1 - a stale connection that does not seem to
> > belong
> > to any frontend session. Strace with -T shows the following:
>
> In fact a few bugs have caused this situation and all known ones were
> fixed, which doesn't mean there is none left of course. However your
> version is totally outdated and contains tons of known bugs which were
> later fixed (196 total, 22 major, 78 medium, 96 minor) :
>
>http://www.haproxy.org/bugs/bugs-1.6.3.html
>
> > The single connection has this session information:
> > 0xd1d790: [06/May/2017:02:44:37.373636] id=286529830 proto=tcpv4
> > source=a.a.a.a:35297
> >   flags=0x1ce, conn_retries=0, srv_conn=0xca4000, pend_pos=(nil)
> >   frontend=fe-fe-fe-fe-fe-fe (id=3 mode=tcp), listener=? (id=1)
> > addr=b.b.b.b:5667
> >   backend=be-be-be-be-be-be (id=4 mode=tcp) addr=c.c.c.c:11870
> >   server=d.d.d.d (id=4) addr=d.d.d.d:5667
> >   task=0xd1d710 (state=0x04 nice=0 calls=1117789229 exp=, running
> > age=12d11h)
> >   si[0]=0xd1d988 (state=CLO flags=0x00 endp0=CONN:0xd771c0 exp=,
> > et=0x000)
> >   si[1]=0xd1d9a8 (state=EST flags=0x10 endp1=CONN:0xccadb0 exp=,
> > et=0x000)
> >   co0=0xd771c0 ctrl=NONE xprt=NONE data=STRM target=LISTENER:0xc76ae0
> >   flags=0x002f9000 fd=55 fd.state=00 fd.cache=0 updt=0
> >   co1=0xccadb0 ctrl=tcpv4 xprt=RAW data=STRM target=SERVER:0xca4000
> >   flags=0x0020b310 fd=9 fd_spec_e=22 fd_spec_p=0 updt=0
> >   req=0xd1d7a0 (f=0x80a020 an=0x0 pipe=0 tofwd=-1 total=0)
> >   an_exp= rex=? wex=
> >   buf=0x6e9120 data=0x6e9134 o=0 p=0 req.next=0 i=0 size=0
> >   res=0xd1d7e0 (f=0x8000a020 an=0x0 pipe=0 tofwd=0 total=0)
> >   an_exp= rex= wex=
> >   buf=0x6e9120 data=0x6e9134 o=0 p=0 rsp.next=0 i=0 size=0
>
>
> That's quite useful, thanks!
>
>  - connection with client is closed
>  - connection with server is still established and theoretically stopped from
>    polling
>  - the request channel is closed in both directions
>  - the response channel is closed in both directions
>  - both buffers are empty
>
> It seems that something is preventing the connection close from being
> considered, while the task is woken up on a timeout and on I/O. This
> exactly reminds me of the client-fin/server-fin bug in fact. Do you
> have any of these timeouts in your config ?
>
> I'm also noticing that the session is aged 12.5 days. So either it has
> been looping for this long (after all the function has been called 1
> billion times), or it was a long session which recently timed out.
>
> > We have 3 systems running the identical configuration and haproxy binary,
>
> So at least you have 3 times 196 bugs in production :-)
>
> > and
> > the 100% cpu is ongoing for the last 17 days on one system. The client
> > connection is no longer present. I am assuming that a haproxy reload
> would
> > solve this as the frontend connection is not present, but have not tested
> > it out yet. Since this box is in production, I am unable to do invasive
> > debugging
> > (e.g. gdb).
>
> For sure. At least an upgrade to 1.6.12 would get rid of most of these
> known bugs. You could perform a rolling upgrade, starting with the machine
> having been in that situation for the longest time.
>
> > Please let me know if this is fixed in a later release, or any more
> > information that
> > can help find the root cause.
>
> For me everything here looks like the client-fin/server-fin bug that was
> fixed two months ago, so if you're using this it's very likely fixed. If
> not, there's still a small probability that the fixes made to better
> deal with wakeup events in the case of the server-fin bug could have
> addressed a wider class of bugs : often we find one way to enter a
> certain bogus condition and hardly imagine all other possibilities.
>
> Regards,
> Willy
>


Re: HAProxy 1.6.3: 100% cpu utilization for >17 days with 1 connection

2017-05-18 Thread Willy Tarreau
Hi Krishna,

On Fri, May 19, 2017 at 09:47:52AM +0530, Krishna Kumar (Engineering) wrote:
> I saw many similar issues posted earlier by others, but could not find a
> thread
> where this is resolved or fixed in a newer release. We are using Ubuntu
> 16.04
> with distro HAProxy (1.6.3), and see that HAProxy spins at 100% with 1-10
> TCP
> connections, sometimes just 1 - a stale connection that does not seem to
> belong
> to any frontend session. Strace with -T shows the following:

In fact a few bugs have caused this situation and all known ones were
fixed, which doesn't mean there is none left of course. However your
version is totally outdated and contains tons of known bugs which were
later fixed (196 total, 22 major, 78 medium, 96 minor) :

   http://www.haproxy.org/bugs/bugs-1.6.3.html

> The single connection has this session information:
> 0xd1d790: [06/May/2017:02:44:37.373636] id=286529830 proto=tcpv4
> source=a.a.a.a:35297
>   flags=0x1ce, conn_retries=0, srv_conn=0xca4000, pend_pos=(nil)
>   frontend=fe-fe-fe-fe-fe-fe (id=3 mode=tcp), listener=? (id=1)
> addr=b.b.b.b:5667
>   backend=be-be-be-be-be-be (id=4 mode=tcp) addr=c.c.c.c:11870
>   server=d.d.d.d (id=4) addr=d.d.d.d:5667
>   task=0xd1d710 (state=0x04 nice=0 calls=1117789229 exp=, running
> age=12d11h)
>   si[0]=0xd1d988 (state=CLO flags=0x00 endp0=CONN:0xd771c0 exp=,
> et=0x000)
>   si[1]=0xd1d9a8 (state=EST flags=0x10 endp1=CONN:0xccadb0 exp=,
> et=0x000)
>   co0=0xd771c0 ctrl=NONE xprt=NONE data=STRM target=LISTENER:0xc76ae0
>   flags=0x002f9000 fd=55 fd.state=00 fd.cache=0 updt=0
>   co1=0xccadb0 ctrl=tcpv4 xprt=RAW data=STRM target=SERVER:0xca4000
>   flags=0x0020b310 fd=9 fd_spec_e=22 fd_spec_p=0 updt=0
>   req=0xd1d7a0 (f=0x80a020 an=0x0 pipe=0 tofwd=-1 total=0)
>   an_exp= rex=? wex=
>   buf=0x6e9120 data=0x6e9134 o=0 p=0 req.next=0 i=0 size=0
>   res=0xd1d7e0 (f=0x8000a020 an=0x0 pipe=0 tofwd=0 total=0)
>   an_exp= rex= wex=
>   buf=0x6e9120 data=0x6e9134 o=0 p=0 rsp.next=0 i=0 size=0


That's quite useful, thanks!

 - connection with client is closed
 - connection with server is still established and theoretically stopped from
   polling
 - the request channel is closed in both directions
 - the response channel is closed in both directions
 - both buffers are empty

It seems that something is preventing the connection close from being
considered, while the task is woken up on a timeout and on I/O. This
exactly reminds me of the client-fin/server-fin bug in fact. Do you
have any of these timeouts in your config ?
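
For clarity, these are the optional client-fin/server-fin settings; a purely
illustrative sketch of where they would sit, with made-up values:

    defaults
        timeout client-fin 30s   # inactivity timeout for half-closed client connections
        timeout server-fin 30s   # same, on the server side

Versions without the fix could end up spinning at 100% CPU when these were set.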

I'm also noticing that the session is aged 12.5 days. So either it has
been looping for this long (after all the function has been called 1
billion times), or it was a long session which recently timed out.

> We have 3 systems running the identical configuration and haproxy binary,

So at least you have 3 times 196 bugs in production :-)

> and
> the 100% cpu is ongoing for the last 17 days on one system. The client
> connection is no longer present. I am assuming that a haproxy reload would
> solve this as the frontend connection is not present, but have not tested
> it out yet. Since this box is in production, I am unable to do invasive
> debugging
> (e.g. gdb).

For sure. At least an upgrade to 1.6.12 would get rid of most of these
known bugs. You could perform a rolling upgrade, starting with the machine
having been in that situation for the longest time.

> Please let me know if this is fixed in a later release, or any more
> information that
> can help find the root cause.

For me everything here looks like the client-fin/server-fin bug that was
fixed two months ago, so if you're using this it's very likely fixed. If
not, there's still a small probability that the fixes made to better
deal with wakeup events in the case of the server-fin bug could have
addressed a wider class of bugs : often we find one way to enter a
certain bogus condition and hardly imagine all other possibilities.

Regards,
Willy



Re: [Patches] TLS methods configuration reworked

2017-05-18 Thread Willy Tarreau
Hi Cyril,

On Thu, May 18, 2017 at 11:02:29PM +0200, Cyril Bonté wrote:
> Hi all,
> 
> On 12/05/2017 at 15:13, Willy Tarreau wrote:
> > Hi guys,
> > 
> > On Tue, May 09, 2017 at 11:21:36AM +0200, Emeric Brun wrote:
> > > It seems to do what we want, so we can merge it.
> > 
> > So the good news is that this patch set now got merged :-)
> 
> Commit 5db33cbdc4 [1] seems to have broken the compilation when
> OPENSSL_NO_SSL3 is defined : SSLv3_server_method() and SSLv3_client_method()
> won't exist in this case.
> Previously there was a condition to verify this, which has disappeared with
> this patch set.

Ah, thanks for the report. If you've diagnosed this and you know what is
missing, do you think you could provide a patch ?

Thanks,
Willy



Re: haproxy consuming 100% cpu - epoll loop

2017-05-18 Thread Willy Tarreau
Hi Patrick,

On Thu, May 18, 2017 at 05:44:30PM -0400, Patrick Hemmer wrote:
> 
> On 2017/1/17 17:02, Willy Tarreau wrote:
> > Hi Patrick,
> >
> > On Tue, Jan 17, 2017 at 02:33:44AM +, Patrick Hemmer wrote:
> >> So on one of my local development machines haproxy started pegging the
> >> CPU at 100%
> >> `strace -T` on the process just shows:
> >>
> >> ...
> >> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
> >> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
> >> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
> >> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
> >> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
> >> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
> >> ...
> > Hmm not good.
> >
> >> Opening it up with gdb, the backtrace shows:
> >>
> >> (gdb) bt
> >> #0  0x7f4d18ba82a3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> >> #1  0x7f4d1a570ebc in _do_poll (p=, exp=-1440976915)
> >> at src/ev_epoll.c:125
> >> #2  0x7f4d1a4d3098 in run_poll_loop () at src/haproxy.c:1737
> >> #3  0x7f4d1a4cf2c0 in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:2097
> > Ok so an event is not being processed correctly.
> >
> >> This is haproxy 1.7.0 on CentOS/7
> > Ah, that could be a clue. We've had 2 or 3 very ugly bugs in 1.7.0
> > and 1.7.1. One of them is responsible for the few outages on haproxy.org
> > (last one happened today, I left it running to get the core to confirm).
> > One of them is an issue with the condition to wake up an applet when it
> > failed to get a buffer first and it could be what you're seeing. The
> > other ones could possibly cause some memory corruption resulting in
> > anything.
> >
> > Thus I'd strongly urge you to update this one to 1.7.2 (which I'm going
> > to do on haproxy.org now that I could get a core). Continue to monitor
> > it but I'd feel much safer after this update.
> >
> > Thanks for your report!
> > Willy
> >
> So I just had this issue recur, this time on version 1.7.2.

OK. If it's still doing it, capturing the output of "show sess all" on
the CLI could help a lot.
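
If the stats socket isn't already enabled, a minimal sketch of how to expose
it -- the socket path below is only an example:

    global
        stats socket /var/run/haproxy.sock mode 600 level admin

Then something like 'echo "show sess all" | socat stdio /var/run/haproxy.sock'
will dump all the sessions.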

I've looked at the changelog since 1.7.2 and other bugs have since been
fixed possibly responsible for this :
  - c691781 ("BUG/MEDIUM: stream: fix client-fin/server-fin handling")
=> fixes a cause of 100% CPU when timeout client-fin/server-fin are
   used
  - 57393fb ("BUG/MEDIUM: buffers: Fix how input/output data are injected into 
buffe
=> fixes the computation of free buffer space, it's unknown if we've
   ever hit this bug. It could be possible that it causes some applets
   like stats to fail to write and to be called again immediately for
   example.
  - 57393fb ("BUG/MEDIUM: buffers: Fix how input/output data are injected into 
buffe
=> some filters might be woken up to do nothing. Compression might trigger
   this.

  - there are a bunch of polling-related fixes which might have got rid of
such a bad situation on certain connect() cases, possibly over unix
sockets.

There are a few other fixes in the queue that I need to backport but none
of them is related to this. Thus if you're in an emergency, 1.7.5 could help
by bringing the fixes above. If you can wait a few more days, I expect to
issue 1.7.6 early next week.

Cheers,
Willy



HAProxy 1.6.3: 100% cpu utilization for >17 days with 1 connection

2017-05-18 Thread Krishna Kumar (Engineering)
Hi,

First of all, thanks for a great product that is working extremely well for
Flipkart!

I saw many similar issues posted earlier by others, but could not find a
thread where this is resolved or fixed in a newer release. We are using
Ubuntu 16.04 with distro HAProxy (1.6.3), and see that HAProxy spins at
100% with 1-10 TCP connections, sometimes just 1 - a stale connection that
does not seem to belong to any frontend session. Strace with -T shows the
following:

epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.20>
epoll_wait(0, [], 200, 0)   = 0 <0.09>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [{EPOLLIN|EPOLLHUP|EPOLLRDHUP, {u32=2, u64=2}}], 200, 0) = 1
<0.06>
epoll_wait(0, [{EPOLLIN, {u32=11, u64=11}}], 200, 0) = 1 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.29>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.21>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.11>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [{EPOLLIN, {u32=7, u64=7}}], 200, 0) = 1 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.07>
epoll_wait(0, [{EPOLLOUT, {u32=2, u64=2}}], 200, 0) = 1 <0.15>
epoll_wait(0, [], 200, 0)   = 0 <0.07>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.16>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.08>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.17>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [{EPOLLIN, {u32=10, u64=10}}], 200, 0) = 1 <0.09>
epoll_wait(0, [{EPOLLIN|EPOLLRDHUP, {u32=10, u64=10}}], 200, 0) = 1
<0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.16>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.17>

The single connection has this session information:
0xd1d790: [06/May/2017:02:44:37.373636] id=286529830 proto=tcpv4
source=a.a.a.a:35297
  flags=0x1ce, conn_retries=0, srv_conn=0xca4000, 

Re: 1.7.5 503 Timeouts with SNI backend

2017-05-18 Thread Ryan Schlesinger
That’s incredibly insightful of you.  I’ll set up a resolver for all of my
CF uses and report back if I can repro this apart from that config fix.

Thanks!


On May 18, 2017 at 3:42:35 PM, Michael Ezzell (mich...@ezzell.net) wrote:



On May 18, 2017 3:07 PM, "Ryan Schlesinger" 
wrote:

We have the following backend configuration:

backend clientsite_ember
  server cf foobar.cloudfront.net:443 ssl verify required sni str(foobar.cloudfront.net) ca-file /etc/ssl/certs/ca-certificates.crt

This has been working great with 1.7.2 since February.  I upgraded to 1.7.5
yesterday and today found that all requests through that backend were
returning 503.  Testing the cloudfront url manually loaded the site.

Sample Logs:
May 18 10:13:47 ip-10-4-13-35 haproxy:  :46924
[18/May/2017:17:13:32.237] http-in~ clientsite_ember/cf 0/0/-1/-1/14969 503
212 - - CC--


That second C is significant:

the proxy was waiting for the CONNECTION to establish on the server.
The server might at most have noticed a connection attempt.


You don't have a healthcheck configured.  You don't want option httpchk
with CloudFront, but you do need at least a TCP check.  The place where you
were connecting to could have been unavailable.

To understand how, take a look at the results of dig
dzzzexample.cloudfront.net.  There will be several responses.  But, without
a DNS resolver section configured on the proxy and attached to each backend
server to continually re-resolve the addresses, the proxy will latch to
just one, and stick to it until restarted.

The DNS responses from CloudFront can vary from day to day or hour to hour,
since the DNS is dynamically derived from their system's current notion of
the "closest" (most optimal) location from where you query DNS from.  From
Cincinnati, Ohio, I see DNS responses indicating I'm connecting to South
Bend, IN, one day,  Chicago, IL, another,  then Ashburn, VA.  As I type
this, I'm actually seeing New York, NY.  (Do a reverse lookup on the IP
addresses currently associated with the CloudFront hostname.  An
alphanumeric code in the hostname gives you the IATA code of the nearest
airport to the CloudFront edge in question -- IADx is Ashburn, JFKx is NYC,
etc.)

If CloudFront lost an edge or took one out of DNS rotation and shut it down
for maintenance, what you saw would potentially be one behavior HAProxy
could be expected to exhibit, because it wouldn't know.  Unless I missed a
memo, HAProxy only resolves DNS at startup unless configured otherwise.

The browser you tested with would have resolved a different address.

I'm not saying there can't be an issue in 1.7.5 but your configuration
seems vulnerable to service disruptions, since it can't take advantage of
CloudFront's fault tolerance mechanisms.


Re: 1.7.5 503 Timeouts with SNI backend

2017-05-18 Thread Michael Ezzell
On May 18, 2017 3:07 PM, "Ryan Schlesinger" 
wrote:

We have the following backend configuration:

backend clientsite_ember
  server cf foobar.cloudfront.net:443 ssl verify required sni str(foobar.cloudfront.net) ca-file /etc/ssl/certs/ca-certificates.crt

This has been working great with 1.7.2 since February.  I upgraded to 1.7.5
yesterday and today found that all requests through that backend were
returning 503.  Testing the cloudfront url manually loaded the site.

Sample Logs:
May 18 10:13:47 ip-10-4-13-35 haproxy:  :46924
[18/May/2017:17:13:32.237] http-in~ clientsite_ember/cf 0/0/-1/-1/14969 503
212 - - CC--


That second C is significant:

the proxy was waiting for the CONNECTION to establish on the server.
The server might at most have noticed a connection attempt.


You don't have a healthcheck configured.  You don't want option httpchk
with CloudFront, but you do need at least a TCP check.  The place where you
were connecting to could have been unavailable.

To understand how, take a look at the results of dig
dzzzexample.cloudfront.net.  There will be several responses.  But, without
a DNS resolver section configured on the proxy and attached to each backend
server to continually re-resolve the addresses, the proxy will latch to
just one, and stick to it until restarted.
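
Roughly, a sketch of what I mean -- the resolver address, the timings and the
extra server options below are placeholders, not a recommendation:

    resolvers mydns
        nameserver dns1 10.0.0.2:53      # placeholder nameserver address
        hold valid 10s

    backend clientsite_ember
        server cf foobar.cloudfront.net:443 ssl verify required sni str(foobar.cloudfront.net) ca-file /etc/ssl/certs/ca-certificates.crt check resolvers mydns resolve-prefer ipv4

With "resolvers" attached to the server, the address is re-resolved at run
time instead of being latched at startup, and the bare "check" gives at least
the basic TCP-level check mentioned above.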

The DNS responses from CloudFront can vary from day to day or hour to hour,
since the DNS is dynamically derived from their system's current notion of
the "closest" (most optimal) location from where you query DNS from.  From
Cincinnati, Ohio, I see DNS responses indicating I'm connecting to South
Bend, IN, one day,  Chicago, IL, another,  then Ashburn, VA.  As I type
this, I'm actually seeing New York, NY.  (Do a reverse lookup on the IP
addresses currently associated with the CloudFront hostname.  An
alphanumeric code in the hostname gives you the IATA code of the nearest
airport to the CloudFront edge in question -- IADx is Ashburn, JFKx is NYC,
etc.)

If CloudFront lost an edge or took one out of DNS rotation and shut it down
for maintenance, what you saw would potentially be one behavior HAProxy
could be expected to exhibit, because it wouldn't know.  Unless I missed a
memo, HAProxy only resolves DNS at startup unless configured otherwise.

The browser you tested with would have resolved a different address.

I'm not saying there can't be an issue in 1.7.5 but your configuration
seems vulnerable to service disruptions, since it can't take advantage of
CloudFront's fault tolerance mechanisms.


Re: haproxy consuming 100% cpu - epoll loop

2017-05-18 Thread Patrick Hemmer

On 2017/1/17 17:02, Willy Tarreau wrote:
> Hi Patrick,
>
> On Tue, Jan 17, 2017 at 02:33:44AM +, Patrick Hemmer wrote:
>> So on one of my local development machines haproxy started pegging the
>> CPU at 100%
>> `strace -T` on the process just shows:
>>
>> ...
>> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
>> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
>> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
>> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
>> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
>> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
>> ...
> Hmm not good.
>
>> Opening it up with gdb, the backtrace shows:
>>
>> (gdb) bt
>> #0  0x7f4d18ba82a3 in __epoll_wait_nocancel () from /lib64/libc.so.6
>> #1  0x7f4d1a570ebc in _do_poll (p=, exp=-1440976915)
>> at src/ev_epoll.c:125
>> #2  0x7f4d1a4d3098 in run_poll_loop () at src/haproxy.c:1737
>> #3  0x7f4d1a4cf2c0 in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:2097
> Ok so an event is not being processed correctly.
>
>> This is haproxy 1.7.0 on CentOS/7
> Ah, that could be a clue. We've had 2 or 3 very ugly bugs in 1.7.0
> and 1.7.1. One of them is responsible for the few outages on haproxy.org
> (last one happened today, I left it running to get the core to confirm).
> One of them is an issue with the condition to wake up an applet when it
> failed to get a buffer first and it could be what you're seeing. The
> other ones could possibly cause some memory corruption resulting in
> anything.
>
> Thus I'd strongly urge you to update this one to 1.7.2 (which I'm going
> to do on haproxy.org now that I could get a core). Continue to monitor
> it but I'd feel much safer after this update.
>
> Thanks for your report!
> Willy
>
So I just had this issue recur, this time on version 1.7.2.

-Patrick


haproxy doesn't restart after segfault on systemd

2017-05-18 Thread Patrick Hemmer
So we had an incident today where haproxy segfaulted and our site went
down. Unfortunately we did not capture a core, and the segfault message
logged to dmesg just showed it inside libc. So there's likely not much
we can do here. We'll be making changes to ensure we capture a core in
the future.

However, the issue I am reporting, which is reproducible (on version 1.7.5),
is that haproxy did not automatically restart, which would have minimized the
downtime to the site. We use nbproc > 1, so we have multiple haproxy
processes running, and when one of them dies, neither the
"haproxy-master" process nor the "haproxy-systemd-wrapper" process exits,
which prevents systemd from starting the service back up.
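
For reference, the layout in question is roughly the following; the process
count is illustrative, not our real value:

    global
        nbproc 4
        # the systemd wrapper starts a master process which forks the workers;
        # when a single worker segfaults, the wrapper and the master keep
        # running, so systemd never observes a failure it could act on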

While I think this behavior would be fine, a possible alternative would
be for the "haproxy-master" process to restart the dead worker without
having to kill all the other processes.

Another possible action would be to leave the workers running, but
signal them to stop accepting new connections, and then let the
"haproxy-master" exit so systemd will restart it.

But in any case, I think we need some way of handling this so that site
interruption is minimal.

-Patrick


Re: [Patches] TLS methods configuration reworked

2017-05-18 Thread Cyril Bonté

Hi all,

On 12/05/2017 at 15:13, Willy Tarreau wrote:

Hi guys,

On Tue, May 09, 2017 at 11:21:36AM +0200, Emeric Brun wrote:

It seems to do what we want, so we can merge it.


So the good news is that this patch set now got merged :-)


Commit 5db33cbdc4 [1] seems to have broken the compilation when 
OPENSSL_NO_SSL3 is defined : SSLv3_server_method() and 
SSLv3_client_method() won't exist in this case.
Previously there was a condition to verify this, which has disappeared 
with this patch set.




Thanks for your time and efforts back-and-forth on this one!
Willy



[1] 
http://www.haproxy.org/git?p=haproxy.git;a=commit;h=5db33cbdc4f2952cbd3c140edce0eda84e1447b4


--
Cyril Bonté



1.7.5 503 Timeouts with SNI backend

2017-05-18 Thread Ryan Schlesinger
We have the following backend configuration:

backend clientsite_ember
  server cf foobar.cloudfront.net:443 ssl verify required sni str(foobar.cloudfront.net) ca-file /etc/ssl/certs/ca-certificates.crt

This has been working great with 1.7.2 since February.  I upgraded to 1.7.5
yesterday and today found that all requests through that backend were
returning 503.  Testing the cloudfront url manually loaded the site.

Sample Logs:
May 18 10:13:47 ip-10-4-13-35 haproxy:  :46924
[18/May/2017:17:13:32.237] http-in~ clientsite_ember/cf 0/0/-1/-1/14969 503
212 - - CC-- 10/10/1/1/1 0/0 {clientsite.com||Mozilla/5.0 (iPhone; CPU
iPhone OS 10_3_1 like Mac OS X) AppleWe} "GET /path1/?slide=1 HTTP/1.1"
May 18 10:13:54 ip-10-4-13-35 haproxy:  :33235
[18/May/2017:17:13:22.354] http-in~ clientsite_ember/cf 0/30004/-1/-1/32296
503 212 - - CC-- 12/12/0/0/3 0/0 {clientsite.com||Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_11_1) AppleWebKit/601.} "GET /path1/?slide=1 HTTP/1.1"
May 18 10:14:45 ip-10-4-13-35 haproxy:  :9313
[18/May/2017:17:14:07.198] http-in~ clientsite_ember/cf 0/30003/-1/-1/38336
503 212 - - CC-- 13/13/0/0/3 0/0 {clientsite.com||Mozilla/5.0 (iPhone; CPU
iPhone OS 10_3_1 like Mac OS X) AppleWe} "GET /path1/?slide=1 HTTP/1.1"
May 18 10:15:30 ip-10-4-135-120 haproxy:  :37948
[18/May/2017:17:14:59.850] http-in~ clientsite_ember/cf 0/30004/-1/-1/30400
503 212 - - CC-- 9/9/0/0/3 0/0 {clientsite.com||Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_11_1) AppleWebKit/601.} "GET /path1/?slide=1 HTTP/1.1"
May 18 10:15:32 ip-10-4-69-34 haproxy:  :38451
[18/May/2017:17:15:17.652] http-in~ clientsite_ember/cf 0/0/-1/-1/14714 503
212 - - CC-- 12/12/0/0/1 0/0 {clientsite.com||Mozilla/5.0 (iPhone; CPU
iPhone OS 10_3_1 like Mac OS X) AppleWe} "GET /path1/?slide=1 HTTP/1.1"
May 18 10:16:12 ip-10-4-135-120 haproxy:  :52747
[18/May/2017:17:15:32.824] http-in~ clientsite_ember/cf 0/30004/-1/-1/40005
503 212 - - sC-- 12/12/0/0/3 0/0 {clientsite.com||Mozilla/5.0 (iPhone; CPU
iPhone OS 10_3_1 like Mac OS X) AppleWe} "GET /path1/?slide=1 HTTP/1.1"
May 18 10:17:45 ip-10-4-135-120 haproxy:  :60096
[18/May/2017:17:17:05.314] http-in~ clientsite_ember/cf 0/30005/-1/-1/40007
503 212 - - sC-- 9/9/0/0/3 0/0 {clientsite.com||Mozilla/5.0 (compatible;
YandexBot/3.0; +http://yandex.com/bots)} "GET /path2/ HTTP/1.1"
May 18 10:18:25 ip-10-4-69-34 haproxy:  :63513
[18/May/2017:17:17:45.827] http-in~ clientsite_ember/cf 0/30005/-1/-1/40006
503 212 - - sC-- 13/13/0/0/3 0/0 {clientsite.com||Mozilla/5.0 (compatible;
YandexBot/3.0; +http://yandex.com/bots)} "GET /path2/ HTTP/1.1"
May 18 10:18:27 ip-10-4-13-35 haproxy:  :57858
[18/May/2017:17:18:15.384] http-in~ clientsite_ember/cf 0/0/-1/-1/11631 503
212 - - CC-- 15/15/1/1/1 0/0 {clientsite.com||Mozilla/5.0 (iPhone; CPU
iPhone OS 10_3_1 like Mac OS X) AppleWe} "GET /path1/?slide=1 HTTP/1.1"
May 18 10:18:34 ip-10-4-135-120 haproxy:  :55173
[18/May/2017:17:18:14.921] http-in~ clientsite_ember/cf 0/0/-1/-1/19973 503
212 - - CC-- 11/11/0/0/1 0/0 {clientsite.com||Mozilla/5.0 (compatible;
Cliqzbot/1.0; +http://cliqz.com/company} "GET /path5 HTTP/1.1"
May 18 10:18:49 ip-10-4-69-34 haproxy:  :49219
[18/May/2017:17:18:34.138] http-in~ clientsite_ember/cf 0/0/-1/-1/15309 503
212 - - CC-- 16/16/0/0/1 0/0 {clientsite.com||Mozilla/5.0 (iPhone; CPU
iPhone OS 10_3_1 like Mac OS X) AppleWe} "GET /path1/?slide=1 HTTP/1.1"
May 18 10:18:55 ip-10-4-135-120 haproxy:  :58221
[18/May/2017:17:18:35.904] http-in~ clientsite_ember/cf 0/0/-1/-1/19988 503
212 - - CC-- 14/14/1/1/1 0/0 {clientsite.com||Mozilla/5.0 (compatible;
Cliqzbot/1.0; +http://cliqz.com/company} "GET /path5 HTTP/1.1"
May 18 10:19:06 ip-10-4-13-35 haproxy:  :36125
[18/May/2017:17:18:26.333] http-in~ clientsite_ember/cf 0/30005/-1/-1/40007
503 212 - - sC-- 19/19/0/0/3 0/0 {clientsite.com||Mozilla/5.0 (compatible;
YandexBot/3.0; +http://yandex.com/bots)} "GET /path2/ HTTP/1.1"
May 18 10:19:26 ip-10-4-135-120 haproxy:  :23388
[18/May/2017:17:18:47.167] http-in~ clientsite_ember/cf 0/30005/-1/-1/39090
503 212 - - CC-- 15/15/1/1/3 0/0 {clientsite.com||Mozilla/5.0 (Windows NT
6.2; Win64; x64) AppleWebKit/537.36 (KHT} "GET /path3 HTTP/1.1"
May 18 10:19:46 ip-10-4-135-120 haproxy:  :39212
[18/May/2017:17:19:06.835] http-in~ clientsite_ember/cf 0/30005/-1/-1/40006
503 212 - - sC-- 13/13/0/0/3 0/0 {clientsite.com||Mozilla/5.0 (compatible;
YandexBot/3.0; +http://yandex.com/bots)} "GET /path2/ HTTP/1.1"
May 18 10:19:47 ip-10-4-69-34 haproxy:  :43670
[18/May/2017:17:19:38.573] http-in~ clientsite_ember/cf 0/0/-1/-1/9047 503
212 - - CC-- 18/18/0/0/0 0/0 {clientsite.com||Mozilla/5.0 (iPhone; CPU
iPhone OS 10_3_1 like Mac OS X) AppleWe} "GET /path1/?slide=1 HTTP/1.1"
May 18 10:19:55 ip-10-4-13-35 haproxy:  :20040
[18/May/2017:17:19:15.429] http-in~ clientsite_ember/cf 0/30004/-1/-1/40006
503 212 - - sC-- 18/18/1/1/3 0/0 {clientsite.com||Mozilla/5.0 (iPhone; CPU
iPhone OS 10_2_1 like Mac OS X) AppleWe} "GET /path4/?slide=1 HTTP/1.1"
May 18 10:20:06 ip-10-4-13-35 haproxy:  :48559
[18/May

Re: Bug: DNS changes in 1.7.3+ break UNIX socket stats in daemon mode with resolvers on FreeBSD

2017-05-18 Thread Jim Pingle
On 05/12/2017 09:50 AM, Willy Tarreau wrote:
> On Fri, May 12, 2017 at 10:20:56AM +0200, Frederic Lecaille wrote:
>> Here is a more well-formed patch.
>> Feel free to amend the commit message if not enough clear ;)
> 
> It was clear enough, thanks. I added the mention of the faulty commit,
> that helps tracking backports and credited Jim and Lukas for the
> investigations.

Thanks for getting this in! Everything still appears to be good here
running with that patch applied.

I don't see it in the 1.7 tree yet, will it be backported there?

Is there an ETA on 1.7.6?

Jim P.



Re: [PATCH] MINOR: ssl: support ssl-min-ver and ssl-max-ver with crt-list

2017-05-18 Thread Emmanuel Hocdet
Hi,

Same patch, split into 3 parts for better understanding.

> On 12 May 2017 at 15:05, Emmanuel Hocdet wrote:
> 
> Hi,
> 
> This patch depends on "[Patches] TLS methods configuration reworked".
> 
> Actually it will only work with BoringSSL, because haproxy uses a special
> ssl_sock_switchctx_cbk with a BoringSSL callback to select the certificate
> before any handshake negotiation.
> This feature (and the others that depend on this ssl_sock_switchctx_cbk)
> could work with openssl 1.1.1 and the new callback
> https://www.openssl.org/docs/manmaster/man3/SSL_CTX_set_early_cb.html.
> 
> ++
> Manu
> 


0001-REORG-ssl-move-defines-and-methodVersions-table-uppe.patch
Description: Binary data


0002-MEDIUM-ssl-ctx_set_version-ssl_set_version-func-for-.patch
Description: Binary data


0003-MINOR-ssl-support-ssl-min-ver-and-ssl-max-ver-with-c.patch
Description: Binary data




Re: haproxy "inter" and "timeout check", retries and "fall"

2017-05-18 Thread Jiafan Zhou

Hi Bryan,

For reference:


defaults
    mode http
    log global
    option httplog
    option dontlognull
    option http-server-close
    option forwardfor except 127.0.0.0/8
    option redispatch
    retries 3
    timeout http-request 10s
    timeout queue 1m
    timeout connect 10s
    timeout client 1m
    timeout server 1m
    timeout http-keep-alive 10s
    timeout check 10s        # <-- the setting in question
    maxconn 3000


But in the backend setting, I have NOT defined the "inter", like below:

backend apache_http
    balance roundrobin
    cookie iPlanetDirectoryPro prefix nocache
    server httpdserver_80_1 httpd-1-internal:80 cookie S1 check
    server httpdserver_80_2 httpd-2-internal:80 cookie S2 check
    log global
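
For comparison, a sketch of the same backend with the check timing made
explicit (the values are illustrative, not what we run):

    backend apache_http
        balance roundrobin
        cookie iPlanetDirectoryPro prefix nocache
        # "inter" sets the interval between two checks; "timeout check" in the
        # defaults section only adds a read timeout once a check connection has
        # been established, it is not the check interval
        server httpdserver_80_1 httpd-1-internal:80 cookie S1 check inter 2s fall 3 rise 2
        server httpdserver_80_2 httpd-2-internal:80 cookie S2 check inter 2s fall 3 rise 2
        log global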



Thank you for your comment, really appreciate.

- In relation to the version of haproxy, this is installed as the 
supported package on RHEL 6.6, from which we get official support. You are 
right, it is too old; we will seek an upgrade from Red Hat.


- For "timeout check" and "inter", it was for some troubleshooting and 
would like to understand the behaviour a bit more. By reading haproxy 
official document, it is not clear to me.


I think in my case, it uses the "timeout check" as 10 seconds. There is 
no "inter" parameter in the configuration.


But here I am trying to understand which value will be used if "timeout check" 
is present but "inter" is not. I have already set "timeout check".


- Great for clarifying the "retries" parameter.

- Finally, I think I am still right about "fall" (which defaults to 3) and 
"rise" (which defaults to 2).


It takes up to 50 seconds for the server state to converge, as far as 
haproxy is concerned.


Is that correct to say?

Kindly inform me of anything wrong or incorrect here.

Regards,
Jiafan




On 05/15/2017 09:10 PM, Bryan Talbot wrote:


On May 13, 2017, at 10:59 PM, Jiafan Zhou wrote:



Hi all,

The version of haproxy I use is:

# haproxy -version
HA-Proxy version 1.5.2 2014/07/12
Copyright 2000-2014 Willy Tarreau 



This version is so old. I’m sure there must be hundreds of bugs fixed 
over the last 3 years. Why not use a properly current version?



I have a question regarding the Health Check. In the documentation of 
haproxy, it mentions the below for the "timeout check" and "inter":


Now I am wondering here which one and what value will be used for 
healthcheck interval. Is it "timeout check" as 10 seconds, or the 
"inter" as the default 2 seconds?





Why not just set the health check values that you care about and not 
worry about guessing what they’ll end up being when only some are set 
and some are using defaults? If you need / expect them to be a 
particular value for proper system operation, I’d set them no matter 
what the defaults may be declared to be.



Another question, since I defined the "retries" to be 3, in the case 
of server connection failure, will it reconnect 3 times? Or does it 
use the "fall" parameter (which defaults to 3 here as well) instead 
for healthcheck retry?






“retries” is for dispatching requests and is not used for health checks.


So in this configuration, in the case of server failure, does it wait 
for up to 30 seconds (3 fall or retries), then 20 seconds (2 rise), 
before the server is considered operational? (in total 50 seconds)





retries are not considered, only health check specific settings like 
"fall", "inter"



Thanks,

Jiafan








Re: [PATCH] MINOR: boringssl: basic support for OCSP Stapling

2017-05-18 Thread Emmanuel Hocdet
Hi Willy,

This patch only applies to boringssl. Could you merge it?

++
Emmanuel

> On 29 March 2017 at 16:46, Emmanuel Hocdet wrote:
> 
> 
> Use boringssl SSL_CTX_set_ocsp_response to set OCSP response from file with
> '.ocsp' extension. CLI update is not supported.
> 
> <0001-MINOR-boringssl-basic-support-for-OCSP-Stapling.patch>
> 



Re: truncated request in log lines

2017-05-18 Thread Willy Tarreau
On Thu, May 18, 2017 at 08:58:41AM +0200, Stéphane Cottin wrote:
> > Nice, that was fast :-)
> 
> Nobody has time, I just take care of things as they flow :)

you're right!

> Sorry, I didn't read the CONTRIBUTING, RTFM me.

no pb.

> Hope this one is better.

Definitely. The most suitable form is the git-format-patch, which we
can easily apply using git-am. But this one has everything I need so
I'll apply it.

Thank you!
Willy