Re: HAProxy 1.6.3: 100% cpu utilization for >17 days with 1 connection
Hi Vincent,

On Fri, May 19, 2017 at 07:38:20AM +0200, Vincent Bernat wrote:
> On 19 May 2017 07:04 +0200, Willy Tarreau:
> >> I saw many similar issues posted earlier by others, but could not
> >> find a thread where this is resolved or fixed in a newer release. We
> >> are using Ubuntu 16.04 with distro HAProxy (1.6.3), and see that
> >> HAProxy spins at 100% with 1-10 TCP connections, sometimes just 1 - a
> >> stale connection that does not seem to belong to any frontend
> >> session. Strace with -T shows the following:
> >
> > In fact a few bugs have caused this situation and all known ones were
> > fixed, which doesn't mean there is none left of course. However your
> > version is totally outdated and contains tons of known bugs which were
> > later fixed (196 total, 22 major, 78 medium, 96 minor):
> >
> > http://www.haproxy.org/bugs/bugs-1.6.3.html
>
> Those pages are quite useful!

I made them to help everyone know when they're using a bogus version and
to encourage any user to upgrade (including by using your packages for
those on debian/ubuntu).

> That's the version in Ubuntu Xenial. It is possible to add some patches
> and push a new release. However, we have to select the patches (all the
> MAJOR ones?) and create this hybrid version. It could be useful for
> people not allowed to use third party packages (like the ones on
> haproxy.debian.net) or for those that just don't know they exist. While
> I think this would be useful for many, the gap is so wide that it even
> seems risky. If we are able to identify a couple of patches, I can walk
> the process of pushing them.

The problem is that this is what was being attempted during the days of
1.3, resulting in still highly bogus versions being deployed in the field
and users being very confident in them because they were recently
updated. These days, every patch going into a stable release MUST be
applied. What is considered major for some has no impact for others, and
what is minor for some is business critical for others. In all cases it
ends up with reports here on the list.

In fact, if I were a bit itchy, I would suggest that another update to
the package shipped by default would systematically cause haproxy to
emit a warning on startup saying "this version is outdated and cannot be
upgraded for internal backport policy reasons, please check
haproxy.debian.net for well-maintained, up-to-date packages". At the
very least we could point the "updates" link on the stats page to
haproxy.debian.net.

> This version is in Ubuntu because this was the version in Debian
> unstable a few months before the freeze. It's always a bit random as we
> (in Debian) don't really care about that when choosing the version we
> push in unstable (we care about our own release).

I see. This is also what helps us push for better versions in future
releases :-)

> FYI, we are likely to release 1.7.5 (with USE_GETADDRINFO=1 enabled) in
> our next release (to happen in July I hope).

Do you think there's an opportunity to get 1.7.6 if I release it next
week? It provides -fwrapv which will likely avoid certain bugs with more
recent compilers, and there's a fix for a segfault in Lua.

Cheers,
Willy
Re: HAProxy 1.6.3: 100% cpu utilization for >17 days with 1 connection
❦ 19 May 2017 07:04 +0200, Willy Tarreau:

>> I saw many similar issues posted earlier by others, but could not
>> find a thread where this is resolved or fixed in a newer release. We
>> are using Ubuntu 16.04 with distro HAProxy (1.6.3), and see that
>> HAProxy spins at 100% with 1-10 TCP connections, sometimes just 1 - a
>> stale connection that does not seem to belong to any frontend
>> session. Strace with -T shows the following:
>
> In fact a few bugs have caused this situation and all known ones were
> fixed, which doesn't mean there is none left of course. However your
> version is totally outdated and contains tons of known bugs which were
> later fixed (196 total, 22 major, 78 medium, 96 minor):
>
> http://www.haproxy.org/bugs/bugs-1.6.3.html

Those pages are quite useful!

That's the version in Ubuntu Xenial. It is possible to add some patches
and push a new release. However, we have to select the patches (all the
MAJOR ones?) and create this hybrid version. It could be useful for
people not allowed to use third party packages (like the ones on
haproxy.debian.net) or for those that just don't know they exist. While
I think this would be useful for many, the gap is so wide that it even
seems risky. If we are able to identify a couple of patches, I can walk
the process of pushing them.

This version is in Ubuntu because this was the version in Debian
unstable a few months before the freeze. It's always a bit random as we
(in Debian) don't really care about that when choosing the version we
push in unstable (we care about our own release).

FYI, we are likely to release 1.7.5 (with USE_GETADDRINFO=1 enabled) in
our next release (to happen in July I hope).
--
There's small choice in rotten apples.
  -- William Shakespeare, "The Taming of the Shrew"
Re: HAProxy 1.6.3: 100% cpu utilization for >17 days with 1 connection
Hi Willy,

Thanks for your response/debug details.

> It seems that something is preventing the connection close from being
> considered, while the task is woken up on a timeout and on I/O. This
> exactly reminds me of the client-fin/server-fin bug in fact. Do you
> have any of these timeouts in your config ?

You are right! We have this: "timeout client-fin 3ms"

> So at least you have 3 times 196 bugs in production :-)

And many 'x' times that, we have *lots* of servers to handle the
Flipkart traffic. Thanks for pointing out this information. So we will
upgrade after internal processes are sorted out.

Thanks once again for this quick information on the source of the
problem.

Regards,
- Krishna

On Fri, May 19, 2017 at 10:34 AM, Willy Tarreau wrote:
> Hi Krishna,
>
> On Fri, May 19, 2017 at 09:47:52AM +0530, Krishna Kumar (Engineering)
> wrote:
> > I saw many similar issues posted earlier by others, but could not
> > find a thread where this is resolved or fixed in a newer release. We
> > are using Ubuntu 16.04 with distro HAProxy (1.6.3), and see that
> > HAProxy spins at 100% with 1-10 TCP connections, sometimes just 1 -
> > a stale connection that does not seem to belong to any frontend
> > session. Strace with -T shows the following:
>
> In fact a few bugs have caused this situation and all known ones were
> fixed, which doesn't mean there is none left of course. However your
> version is totally outdated and contains tons of known bugs which were
> later fixed (196 total, 22 major, 78 medium, 96 minor):
>
> http://www.haproxy.org/bugs/bugs-1.6.3.html
>
> > The single connection has this session information:
> > 0xd1d790: [06/May/2017:02:44:37.373636] id=286529830 proto=tcpv4
> >   source=a.a.a.a:35297
> >   flags=0x1ce, conn_retries=0, srv_conn=0xca4000, pend_pos=(nil)
> >   frontend=fe-fe-fe-fe-fe-fe (id=3 mode=tcp), listener=? (id=1)
> >   addr=b.b.b.b:5667
> >   backend=be-be-be-be-be-be (id=4 mode=tcp) addr=c.c.c.c:11870
> >   server=d.d.d.d (id=4) addr=d.d.d.d:5667
> >   task=0xd1d710 (state=0x04 nice=0 calls=1117789229 exp=, running age=12d11h)
> >   si[0]=0xd1d988 (state=CLO flags=0x00 endp0=CONN:0xd771c0 exp=, et=0x000)
> >   si[1]=0xd1d9a8 (state=EST flags=0x10 endp1=CONN:0xccadb0 exp=, et=0x000)
> >   co0=0xd771c0 ctrl=NONE xprt=NONE data=STRM target=LISTENER:0xc76ae0
> >     flags=0x002f9000 fd=55 fd.state=00 fd.cache=0 updt=0
> >   co1=0xccadb0 ctrl=tcpv4 xprt=RAW data=STRM target=SERVER:0xca4000
> >     flags=0x0020b310 fd=9 fd_spec_e=22 fd_spec_p=0 updt=0
> >   req=0xd1d7a0 (f=0x80a020 an=0x0 pipe=0 tofwd=-1 total=0)
> >     an_exp= rex=? wex=
> >     buf=0x6e9120 data=0x6e9134 o=0 p=0 req.next=0 i=0 size=0
> >   res=0xd1d7e0 (f=0x8000a020 an=0x0 pipe=0 tofwd=0 total=0)
> >     an_exp= rex= wex=
> >     buf=0x6e9120 data=0x6e9134 o=0 p=0 rsp.next=0 i=0 size=0
>
> That's quite useful, thanks!
>
> - connection with client is closed
> - connection with server is still established and theoretically
>   stopped from polling
> - the request channel is closed in both directions
> - the response channel is closed in both directions
> - both buffers are empty
>
> It seems that something is preventing the connection close from being
> considered, while the task is woken up on a timeout and on I/O. This
> exactly reminds me of the client-fin/server-fin bug in fact. Do you
> have any of these timeouts in your config ?
>
> I'm also noticing that the session is aged 12.5 days. So either it has
> been looping for this long (after all the function has been called 1
> billion times), or it was a long session which recently timed out.
>
> > We have 3 systems running the identical configuration and haproxy
> > binary,
>
> So at least you have 3 times 196 bugs in production :-)
>
> > and the 100% cpu is ongoing for the last 17 days on one system. The
> > client connection is no longer present. I am assuming that a haproxy
> > reload would solve this as the frontend connection is not present,
> > but have not tested it out yet. Since this box is in production, I
> > am unable to do invasive debugging (e.g. gdb).
>
> For sure. At least an upgrade to 1.6.12 would get rid of most of these
> known bugs. You could perform a rolling upgrade, starting with the
> machine having been in that situation for the longest time.
>
> > Please let me know if this is fixed in a later release, or any more
> > information that can help find the root cause.
>
> For me everything here looks like the client-fin/server-fin bug that
> was fixed two months ago, so if you're using this it's very likely
> fixed. If not, there's still a small probability that the fixes made
> to better deal with wakeup events in the case of the server-fin bug
> could have addressed a wider class of bugs: often we find one way to
> enter a certain bogus condition and hardly imagine all other
> possibilities.
>
> Regards,
> Willy
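For context, the timeouts being discussed look like this in a haproxy configuration. This is an illustrative sketch, not the reporter's actual config; values are placeholders. "timeout client-fin" / "timeout server-fin" bound how long haproxy waits for the peer's FIN after a shutdown, and the 3ms reported above is far shorter than typical values:

```
defaults
    mode                tcp
    timeout connect     5s
    timeout client      30s
    timeout server      30s
    # The two timeouts involved in the bug discussed above
    # (3ms, as in the report, is aggressively short):
    timeout client-fin  1s     # wait for the client's FIN after shutdown
    timeout server-fin  1s     # wait for the server's FIN after shutdown
```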
Re: HAProxy 1.6.3: 100% cpu utilization for >17 days with 1 connection
Hi Krishna,

On Fri, May 19, 2017 at 09:47:52AM +0530, Krishna Kumar (Engineering) wrote:
> I saw many similar issues posted earlier by others, but could not find
> a thread where this is resolved or fixed in a newer release. We are
> using Ubuntu 16.04 with distro HAProxy (1.6.3), and see that HAProxy
> spins at 100% with 1-10 TCP connections, sometimes just 1 - a stale
> connection that does not seem to belong to any frontend session.
> Strace with -T shows the following:

In fact a few bugs have caused this situation and all known ones were
fixed, which doesn't mean there is none left of course. However your
version is totally outdated and contains tons of known bugs which were
later fixed (196 total, 22 major, 78 medium, 96 minor):

http://www.haproxy.org/bugs/bugs-1.6.3.html

> The single connection has this session information:
> 0xd1d790: [06/May/2017:02:44:37.373636] id=286529830 proto=tcpv4
>   source=a.a.a.a:35297
>   flags=0x1ce, conn_retries=0, srv_conn=0xca4000, pend_pos=(nil)
>   frontend=fe-fe-fe-fe-fe-fe (id=3 mode=tcp), listener=? (id=1)
>   addr=b.b.b.b:5667
>   backend=be-be-be-be-be-be (id=4 mode=tcp) addr=c.c.c.c:11870
>   server=d.d.d.d (id=4) addr=d.d.d.d:5667
>   task=0xd1d710 (state=0x04 nice=0 calls=1117789229 exp=, running age=12d11h)
>   si[0]=0xd1d988 (state=CLO flags=0x00 endp0=CONN:0xd771c0 exp=, et=0x000)
>   si[1]=0xd1d9a8 (state=EST flags=0x10 endp1=CONN:0xccadb0 exp=, et=0x000)
>   co0=0xd771c0 ctrl=NONE xprt=NONE data=STRM target=LISTENER:0xc76ae0
>     flags=0x002f9000 fd=55 fd.state=00 fd.cache=0 updt=0
>   co1=0xccadb0 ctrl=tcpv4 xprt=RAW data=STRM target=SERVER:0xca4000
>     flags=0x0020b310 fd=9 fd_spec_e=22 fd_spec_p=0 updt=0
>   req=0xd1d7a0 (f=0x80a020 an=0x0 pipe=0 tofwd=-1 total=0)
>     an_exp= rex=? wex=
>     buf=0x6e9120 data=0x6e9134 o=0 p=0 req.next=0 i=0 size=0
>   res=0xd1d7e0 (f=0x8000a020 an=0x0 pipe=0 tofwd=0 total=0)
>     an_exp= rex= wex=
>     buf=0x6e9120 data=0x6e9134 o=0 p=0 rsp.next=0 i=0 size=0

That's quite useful, thanks!

- connection with client is closed
- connection with server is still established and theoretically stopped
  from polling
- the request channel is closed in both directions
- the response channel is closed in both directions
- both buffers are empty

It seems that something is preventing the connection close from being
considered, while the task is woken up on a timeout and on I/O. This
exactly reminds me of the client-fin/server-fin bug in fact. Do you
have any of these timeouts in your config ?

I'm also noticing that the session is aged 12.5 days. So either it has
been looping for this long (after all the function has been called 1
billion times), or it was a long session which recently timed out.

> We have 3 systems running the identical configuration and haproxy
> binary,

So at least you have 3 times 196 bugs in production :-)

> and the 100% cpu is ongoing for the last 17 days on one system. The
> client connection is no longer present. I am assuming that a haproxy
> reload would solve this as the frontend connection is not present, but
> have not tested it out yet. Since this box is in production, I am
> unable to do invasive debugging (e.g. gdb).

For sure. At least an upgrade to 1.6.12 would get rid of most of these
known bugs. You could perform a rolling upgrade, starting with the
machine having been in that situation for the longest time.

> Please let me know if this is fixed in a later release, or any more
> information that can help find the root cause.

For me everything here looks like the client-fin/server-fin bug that was
fixed two months ago, so if you're using this it's very likely fixed. If
not, there's still a small probability that the fixes made to better
deal with wakeup events in the case of the server-fin bug could have
addressed a wider class of bugs: often we find one way to enter a
certain bogus condition and hardly imagine all other possibilities.

Regards,
Willy
Re: [Patches] TLS methods configuration reworked
Hi Cyril,

On Thu, May 18, 2017 at 11:02:29PM +0200, Cyril Bonté wrote:
> Hi all,
>
> On 12/05/2017 at 15:13, Willy Tarreau wrote:
> > Hi guys,
> >
> > On Tue, May 09, 2017 at 11:21:36AM +0200, Emeric Brun wrote:
> > > It seems to do what we want, so we can merge it.
> >
> > So the good news is that this patch set now got merged :-)
>
> Commit 5db33cbdc4 [1] seems to have broken the compilation when
> OPENSSL_NO_SSL3 is defined: SSLv3_server_method() and
> SSLv3_client_method() won't exist in this case. Previously there was a
> condition to verify this, which has disappeared with this patch set.

Ah, thanks for the report. If you've diagnosed this and you know what is
missing, do you think you could provide a patch ?

Thanks,
Willy
Re: haproxy consuming 100% cpu - epoll loop
Hi Patrick,

On Thu, May 18, 2017 at 05:44:30PM -0400, Patrick Hemmer wrote:
>
> On 2017/1/17 17:02, Willy Tarreau wrote:
> > Hi Patrick,
> >
> > On Tue, Jan 17, 2017 at 02:33:44AM +, Patrick Hemmer wrote:
> >> So on one of my local development machines haproxy started pegging
> >> the CPU at 100%
> >> `strace -T` on the process just shows:
> >>
> >> ...
> >> epoll_wait(0, {}, 200, 0) = 0 <0.03>
> >> epoll_wait(0, {}, 200, 0) = 0 <0.03>
> >> epoll_wait(0, {}, 200, 0) = 0 <0.03>
> >> epoll_wait(0, {}, 200, 0) = 0 <0.03>
> >> epoll_wait(0, {}, 200, 0) = 0 <0.03>
> >> epoll_wait(0, {}, 200, 0) = 0 <0.03>
> >> ...
> >
> > Hmm not good.
> >
> >> Opening it up with gdb, the backtrace shows:
> >>
> >> (gdb) bt
> >> #0  0x7f4d18ba82a3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> >> #1  0x7f4d1a570ebc in _do_poll (p=<optimized out>, exp=-1440976915) at src/ev_epoll.c:125
> >> #2  0x7f4d1a4d3098 in run_poll_loop () at src/haproxy.c:1737
> >> #3  0x7f4d1a4cf2c0 in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:2097
> >
> > Ok so an event is not being processed correctly.
> >
> >> This is haproxy 1.7.0 on CentOS/7
> >
> > Ah, that could be a clue. We've had 2 or 3 very ugly bugs in 1.7.0
> > and 1.7.1. One of them is responsible for the few outages on
> > haproxy.org (last one happened today, I left it running to get the
> > core to confirm). One of them is an issue with the condition to wake
> > up an applet when it failed to get a buffer first and it could be
> > what you're seeing. The other ones could possibly cause some memory
> > corruption resulting in anything.
> >
> > Thus I'd strongly urge you to update this one to 1.7.2 (which I'm
> > going to do on haproxy.org now that I could get a core). Continue to
> > monitor it but I'd feel much safer after this update.
> >
> > Thanks for your report!
> > Willy
>
> So I just had this issue recur, this time on version 1.7.2.

OK. If it's still doing it, capturing the output of "show sess all" on
the CLI could help a lot.

I've looked at the changelog since 1.7.2 and other bugs have since been
fixed, possibly responsible for this:

- c691781 ("BUG/MEDIUM: stream: fix client-fin/server-fin handling")
  => fixes a cause of 100% CPU when timeout client-fin/server-fin are used

- 57393fb ("BUG/MEDIUM: buffers: Fix how input/output data are injected
  into buffers")
  => fixes the computation of free buffer space; it's unknown if we've
  ever hit this bug. It could be possible that it causes some applets
  like stats to fail to write and to be called again immediately, for
  example.

- 57393fb ("BUG/MEDIUM: buffers: Fix how input/output data are injected
  into buffers")
  => some filters might be woken up to do nothing. Compression might
  trigger this.

- there are a bunch of polling-related fixes which might have got rid of
  such a bad situation on certain connect() cases, possibly over unix
  sockets.

There are a few other fixes in the queue that I need to backport but
none of them is related to this. Thus if you're in an emergency, 1.7.5
could help by bringing the fixes above. If you can wait a few more days,
I expect to issue 1.7.6 early next week.

Cheers,
Willy
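The "show sess all" output mentioned above comes from haproxy's runtime CLI over a stats socket. A sketch of exposing and querying it (the socket path, mode, and the use of socat are illustrative assumptions; adapt to your setup):

```
# In haproxy.cfg, expose the runtime CLI on a unix socket
# (path/mode/level are illustrative):
global
    stats socket /var/run/haproxy.sock mode 600 level admin

# Then, from a shell on the same host:
#   echo "show sess all" | socat stdio /var/run/haproxy.sock
```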
HAProxy 1.6.3: 100% cpu utilization for >17 days with 1 connection
Hi,

First of all, thanks for a great product that is working extremely well
for Flipkart!

I saw many similar issues posted earlier by others, but could not find a
thread where this is resolved or fixed in a newer release. We are using
Ubuntu 16.04 with distro HAProxy (1.6.3), and see that HAProxy spins at
100% with 1-10 TCP connections, sometimes just 1 - a stale connection
that does not seem to belong to any frontend session. Strace with -T
shows the following:

epoll_wait(0, [], 200, 0) = 0 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.20>
epoll_wait(0, [], 200, 0) = 0 <0.09>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [{EPOLLIN|EPOLLHUP|EPOLLRDHUP, {u32=2, u64=2}}], 200, 0) = 1 <0.06>
epoll_wait(0, [{EPOLLIN, {u32=11, u64=11}}], 200, 0) = 1 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.29>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.21>
epoll_wait(0, [], 200, 0) = 0 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.11>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.06>
epoll_wait(0, [{EPOLLIN, {u32=7, u64=7}}], 200, 0) = 1 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.07>
epoll_wait(0, [{EPOLLOUT, {u32=2, u64=2}}], 200, 0) = 1 <0.15>
epoll_wait(0, [], 200, 0) = 0 <0.07>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.16>
epoll_wait(0, [], 200, 0) = 0 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.08>
epoll_wait(0, [], 200, 0) = 0 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.17>
epoll_wait(0, [], 200, 0) = 0 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [{EPOLLIN, {u32=10, u64=10}}], 200, 0) = 1 <0.09>
epoll_wait(0, [{EPOLLIN|EPOLLRDHUP, {u32=10, u64=10}}], 200, 0) = 1 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.16>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.06>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.05>
epoll_wait(0, [], 200, 0) = 0 <0.17>

The single connection has this session information:

0xd1d790: [06/May/2017:02:44:37.373636] id=286529830 proto=tcpv4
source=a.a.a.a:35297
flags=0x1ce, conn_retries=0, srv_conn=0xca4000,
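The trace above shows epoll_wait() called with a zero timeout, returning instantly, over and over. The busy-loop mechanics can be reproduced in miniature with a level-triggered epoll and a ready event that is never consumed. This is a standalone illustration (Python, Linux only), not haproxy code:

```python
import select
import socket

# A connected socket pair; writing one byte makes the peer readable.
a, b = socket.socketpair()
a.send(b"x")

ep = select.epoll()
ep.register(b.fileno(), select.EPOLLIN)

# Level-triggered epoll reports the fd as ready on EVERY call because
# the pending byte is never read. With a 0s timeout each call returns
# immediately -- spin this in a loop without consuming the event and
# you burn 100% CPU, exactly the pattern in the strace output above.
for _ in range(3):
    events = ep.poll(0)            # like epoll_wait(epfd, events, 200, 0)
    assert events == [(b.fileno(), select.EPOLLIN)]

ep.close()
a.close()
b.close()
print("event reported as ready on every zero-timeout poll")
```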
Re: 1.7.5 503 Timeouts with SNI backend
That's incredibly insightful of you. I'll set up a resolver for all of
my CF uses and report back if I can repro this apart from that config
fix.

Thanks!

On May 18, 2017 at 3:42:35 PM, Michael Ezzell (mich...@ezzell.net) wrote:

> On May 18, 2017 3:07 PM, "Ryan Schlesinger" wrote:
> > We have the following backend configuration:
> >
> > backend clientsite_ember
> >   server cf foobar.cloudfront.net:443 ssl verify required sni str(foobar.cloudfront.net) ca-file /etc/ssl/certs/ca-certificates.crt
> >
> > This has been working great with 1.7.2 since February. I upgraded to
> > 1.7.5 yesterday and today found that all requests through that
> > backend were returning 503. Testing the cloudfront url manually
> > loaded the site.
> >
> > Sample Logs:
> > May 18 10:13:47 ip-10-4-13-35 haproxy: :46924 [18/May/2017:17:13:32.237] http-in~ clientsite_ember/cf 0/0/-1/-1/14969 503 212 - - CC--
>
> That second C is significant: the proxy was waiting for the CONNECTION
> to establish on the server. The server might at most have noticed a
> connection attempt.
>
> You don't have a healthcheck configured. You don't want option httpchk
> with CloudFront, but you do need at least a TCP check. The place where
> you were connecting to could have been unavailable.
>
> To understand how, take a look at the results of dig
> dzzzexample.cloudfront.net. There will be several responses. But,
> without a DNS resolver section configured on the proxy and attached to
> each backend server to continually re-resolve the addresses, the proxy
> will latch to just one, and stick to it until restarted.
>
> The DNS responses from CloudFront can vary from day to day or hour to
> hour, since the DNS is dynamically derived from their system's current
> notion of the "closest" (most optimal) location relative to where you
> query DNS from. From Cincinnati, Ohio, I see DNS responses indicating
> I'm connecting to South Bend, IN, one day, Chicago, IL, another, then
> Ashburn, VA. As I type this, I'm actually seeing New York, NY.
>
> (Do a reverse lookup on the IP addresses currently associated with the
> CloudFront hostname. An alphanumeric code in the hostname gives you
> the IATA code of the nearest airport to the CloudFront edge in
> question -- IADx is Ashburn, JFKx is NYC, etc.)
>
> If CloudFront lost an edge or took one out of DNS rotation and shut it
> down for maintenance, what you saw would potentially be one behavior
> HAProxy could be expected to exhibit, because it wouldn't know. Unless
> I missed a memo, HAProxy only resolves DNS at startup unless
> configured otherwise. The browser you tested with would have resolved
> a different address.
>
> I'm not saying there can't be an issue in 1.7.5, but your
> configuration seems vulnerable to service disruptions, since it can't
> take advantage of CloudFront's fault tolerance mechanisms.
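The "latching" behaviour described above boils down to resolving once and pinning the first address forever. A tiny illustration of that naive pattern with Python's resolver API (it uses localhost so it runs anywhere; a CloudFront hostname would typically return several addresses whose set changes over time):

```python
import socket

# Resolve once and pin the first result -- the naive pattern being
# warned about. For a CloudFront hostname this list usually holds
# several addresses and its contents rotate over time; a process that
# keeps first_ip forever never notices when that edge goes away.
addrs = socket.getaddrinfo("localhost", 443, proto=socket.IPPROTO_TCP)
first_ip = addrs[0][4][0]   # pinned until the process re-resolves
print(first_ip)
```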
Re: 1.7.5 503 Timeouts with SNI backend
On May 18, 2017 3:07 PM, "Ryan Schlesinger" wrote:
> We have the following backend configuration:
>
> backend clientsite_ember
>   server cf foobar.cloudfront.net:443 ssl verify required sni str(foobar.cloudfront.net) ca-file /etc/ssl/certs/ca-certificates.crt
>
> This has been working great with 1.7.2 since February. I upgraded to
> 1.7.5 yesterday and today found that all requests through that backend
> were returning 503. Testing the cloudfront url manually loaded the
> site.
>
> Sample Logs:
> May 18 10:13:47 ip-10-4-13-35 haproxy: :46924 [18/May/2017:17:13:32.237] http-in~ clientsite_ember/cf 0/0/-1/-1/14969 503 212 - - CC--

That second C is significant: the proxy was waiting for the CONNECTION
to establish on the server. The server might at most have noticed a
connection attempt.

You don't have a healthcheck configured. You don't want option httpchk
with CloudFront, but you do need at least a TCP check. The place where
you were connecting to could have been unavailable.

To understand how, take a look at the results of dig
dzzzexample.cloudfront.net. There will be several responses. But,
without a DNS resolver section configured on the proxy and attached to
each backend server to continually re-resolve the addresses, the proxy
will latch to just one, and stick to it until restarted.

The DNS responses from CloudFront can vary from day to day or hour to
hour, since the DNS is dynamically derived from their system's current
notion of the "closest" (most optimal) location relative to where you
query DNS from. From Cincinnati, Ohio, I see DNS responses indicating
I'm connecting to South Bend, IN, one day, Chicago, IL, another, then
Ashburn, VA. As I type this, I'm actually seeing New York, NY.

(Do a reverse lookup on the IP addresses currently associated with the
CloudFront hostname. An alphanumeric code in the hostname gives you the
IATA code of the nearest airport to the CloudFront edge in question --
IADx is Ashburn, JFKx is NYC, etc.)

If CloudFront lost an edge or took one out of DNS rotation and shut it
down for maintenance, what you saw would potentially be one behavior
HAProxy could be expected to exhibit, because it wouldn't know. Unless I
missed a memo, HAProxy only resolves DNS at startup unless configured
otherwise. The browser you tested with would have resolved a different
address.

I'm not saying there can't be an issue in 1.7.5, but your configuration
seems vulnerable to service disruptions, since it can't take advantage
of CloudFront's fault tolerance mechanisms.
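A resolvers section of the kind described above might look like the following sketch (the nameserver address, timer values, and section name are illustrative assumptions; the server line reuses the reporter's backend). In haproxy 1.6+, attaching a "resolvers" section to a server line makes haproxy re-resolve the hostname at runtime instead of latching the address resolved at startup:

```
resolvers mydns
    nameserver dns1 10.0.0.2:53     # your local/VPC resolver (placeholder)
    resolve_retries 3
    timeout retry   1s
    hold valid      10s             # trust a resolved address for 10s, then re-check

backend clientsite_ember
    server cf foobar.cloudfront.net:443 resolvers mydns check ssl verify required sni str(foobar.cloudfront.net) ca-file /etc/ssl/certs/ca-certificates.crt
```

The added "check" also gives the backend the TCP health check mentioned above, so a dead edge is taken out of rotation rather than returning 503s.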
Re: haproxy consuming 100% cpu - epoll loop
On 2017/1/17 17:02, Willy Tarreau wrote:
> Hi Patrick,
>
> On Tue, Jan 17, 2017 at 02:33:44AM +, Patrick Hemmer wrote:
>> So on one of my local development machines haproxy started pegging the
>> CPU at 100%
>> `strace -T` on the process just shows:
>>
>> ...
>> epoll_wait(0, {}, 200, 0) = 0 <0.03>
>> epoll_wait(0, {}, 200, 0) = 0 <0.03>
>> epoll_wait(0, {}, 200, 0) = 0 <0.03>
>> epoll_wait(0, {}, 200, 0) = 0 <0.03>
>> epoll_wait(0, {}, 200, 0) = 0 <0.03>
>> epoll_wait(0, {}, 200, 0) = 0 <0.03>
>> ...
>
> Hmm not good.
>
>> Opening it up with gdb, the backtrace shows:
>>
>> (gdb) bt
>> #0  0x7f4d18ba82a3 in __epoll_wait_nocancel () from /lib64/libc.so.6
>> #1  0x7f4d1a570ebc in _do_poll (p=<optimized out>, exp=-1440976915) at src/ev_epoll.c:125
>> #2  0x7f4d1a4d3098 in run_poll_loop () at src/haproxy.c:1737
>> #3  0x7f4d1a4cf2c0 in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:2097
>
> Ok so an event is not being processed correctly.
>
>> This is haproxy 1.7.0 on CentOS/7
>
> Ah, that could be a clue. We've had 2 or 3 very ugly bugs in 1.7.0
> and 1.7.1. One of them is responsible for the few outages on haproxy.org
> (last one happened today, I left it running to get the core to confirm).
> One of them is an issue with the condition to wake up an applet when it
> failed to get a buffer first and it could be what you're seeing. The
> other ones could possibly cause some memory corruption resulting in
> anything.
>
> Thus I'd strongly urge you to update this one to 1.7.2 (which I'm going
> to do on haproxy.org now that I could get a core). Continue to monitor
> it but I'd feel much safer after this update.
>
> Thanks for your report!
> Willy

So I just had this issue recur, this time on version 1.7.2.

-Patrick
haproxy doesn't restart after segfault on systemd
So we had an incident today where haproxy segfaulted and our site went
down. Unfortunately we did not capture a core, and the segfault message
logged to dmesg just showed it inside libc. So there's likely not much
we can do here. We'll be making changes to ensure we capture a core in
the future.

However, the issue I am reporting that is reproducible (on version
1.7.5) is that haproxy did not auto-restart, which would have minimized
the downtime to the site. We use nbproc > 1, so we have multiple haproxy
processes running, and when one of them dies, neither the
"haproxy-master" process nor the "haproxy-systemd-wrapper" process
exits, which prevents systemd from starting the service back up.

While I think this behavior would be fine, a possible alternative would
be for the "haproxy-master" process to restart the dead worker without
having to kill all the other processes. Another possible action would be
to leave the workers running, but signal them to stop accepting new
connections, and then let the "haproxy-master" exit so systemd will
restart it. But in any case, I think we need some way of handling this
so that site interruption is minimal.

-Patrick
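For reference, systemd's restart logic only fires when the unit's main process (here, the haproxy-systemd-wrapper) exits, which is exactly what does not happen in the scenario above. The standard knob is a drop-in like the following sketch (the drop-in path is a hypothetical example), but it cannot help while the wrapper stays alive after a worker dies:

```
# /etc/systemd/system/haproxy.service.d/restart.conf (hypothetical path)
[Service]
Restart=on-failure
RestartSec=2s
```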
Re: [Patches] TLS methods configuration reworked
Hi all,

On 12/05/2017 at 15:13, Willy Tarreau wrote:
> Hi guys,
>
> On Tue, May 09, 2017 at 11:21:36AM +0200, Emeric Brun wrote:
> > It seems to do what we want, so we can merge it.
>
> So the good news is that this patch set now got merged :-)

Commit 5db33cbdc4 [1] seems to have broken the compilation when
OPENSSL_NO_SSL3 is defined: SSLv3_server_method() and
SSLv3_client_method() won't exist in this case. Previously there was a
condition to verify this, which has disappeared with this patch set.

> Thanks for your time and efforts back-and-forth on this one!
> Willy

[1] http://www.haproxy.org/git?p=haproxy.git;a=commit;h=5db33cbdc4f2952cbd3c140edce0eda84e1447b4

--
Cyril Bonté
1.7.5 503 Timeouts with SNI backend
We have the following backend configuration:

backend clientsite_ember
    server cf foobar.cloudfront.net:443 ssl verify required sni str(foobar.cloudfront.net) ca-file /etc/ssl/certs/ca-certificates.crt

This has been working great with 1.7.2 since February. I upgraded to 1.7.5 yesterday and today found that all requests through that backend were returning 503. Testing the CloudFront URL manually loaded the site.

Sample logs:

May 18 10:13:47 ip-10-4-13-35 haproxy: :46924 [18/May/2017:17:13:32.237] http-in~ clientsite_ember/cf 0/0/-1/-1/14969 503 212 - - CC-- 10/10/1/1/1 0/0 {clientsite.com||Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWe} "GET /path1/?slide=1 HTTP/1.1"
May 18 10:13:54 ip-10-4-13-35 haproxy: :33235 [18/May/2017:17:13:22.354] http-in~ clientsite_ember/cf 0/30004/-1/-1/32296 503 212 - - CC-- 12/12/0/0/3 0/0 {clientsite.com||Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/601.} "GET /path1/?slide=1 HTTP/1.1"
May 18 10:14:45 ip-10-4-13-35 haproxy: :9313 [18/May/2017:17:14:07.198] http-in~ clientsite_ember/cf 0/30003/-1/-1/38336 503 212 - - CC-- 13/13/0/0/3 0/0 {clientsite.com||Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWe} "GET /path1/?slide=1 HTTP/1.1"
May 18 10:15:30 ip-10-4-135-120 haproxy: :37948 [18/May/2017:17:14:59.850] http-in~ clientsite_ember/cf 0/30004/-1/-1/30400 503 212 - - CC-- 9/9/0/0/3 0/0 {clientsite.com||Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/601.} "GET /path1/?slide=1 HTTP/1.1"
May 18 10:15:32 ip-10-4-69-34 haproxy: :38451 [18/May/2017:17:15:17.652] http-in~ clientsite_ember/cf 0/0/-1/-1/14714 503 212 - - CC-- 12/12/0/0/1 0/0 {clientsite.com||Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWe} "GET /path1/?slide=1 HTTP/1.1"
May 18 10:16:12 ip-10-4-135-120 haproxy: :52747 [18/May/2017:17:15:32.824] http-in~ clientsite_ember/cf 0/30004/-1/-1/40005 503 212 - - sC-- 12/12/0/0/3 0/0 {clientsite.com||Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWe} "GET /path1/?slide=1 HTTP/1.1"
May 18 10:17:45 ip-10-4-135-120 haproxy: :60096 [18/May/2017:17:17:05.314] http-in~ clientsite_ember/cf 0/30005/-1/-1/40007 503 212 - - sC-- 9/9/0/0/3 0/0 {clientsite.com||Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)} "GET /path2/ HTTP/1.1"
May 18 10:18:25 ip-10-4-69-34 haproxy: :63513 [18/May/2017:17:17:45.827] http-in~ clientsite_ember/cf 0/30005/-1/-1/40006 503 212 - - sC-- 13/13/0/0/3 0/0 {clientsite.com||Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)} "GET /path2/ HTTP/1.1"
May 18 10:18:27 ip-10-4-13-35 haproxy: :57858 [18/May/2017:17:18:15.384] http-in~ clientsite_ember/cf 0/0/-1/-1/11631 503 212 - - CC-- 15/15/1/1/1 0/0 {clientsite.com||Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWe} "GET /path1/?slide=1 HTTP/1.1"
May 18 10:18:34 ip-10-4-135-120 haproxy: :55173 [18/May/2017:17:18:14.921] http-in~ clientsite_ember/cf 0/0/-1/-1/19973 503 212 - - CC-- 11/11/0/0/1 0/0 {clientsite.com||Mozilla/5.0 (compatible; Cliqzbot/1.0; +http://cliqz.com/company} "GET /path5 HTTP/1.1"
May 18 10:18:49 ip-10-4-69-34 haproxy: :49219 [18/May/2017:17:18:34.138] http-in~ clientsite_ember/cf 0/0/-1/-1/15309 503 212 - - CC-- 16/16/0/0/1 0/0 {clientsite.com||Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWe} "GET /path1/?slide=1 HTTP/1.1"
May 18 10:18:55 ip-10-4-135-120 haproxy: :58221 [18/May/2017:17:18:35.904] http-in~ clientsite_ember/cf 0/0/-1/-1/19988 503 212 - - CC-- 14/14/1/1/1 0/0 {clientsite.com||Mozilla/5.0 (compatible; Cliqzbot/1.0; +http://cliqz.com/company} "GET /path5 HTTP/1.1"
May 18 10:19:06 ip-10-4-13-35 haproxy: :36125 [18/May/2017:17:18:26.333] http-in~ clientsite_ember/cf 0/30005/-1/-1/40007 503 212 - - sC-- 19/19/0/0/3 0/0 {clientsite.com||Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)} "GET /path2/ HTTP/1.1"
May 18 10:19:26 ip-10-4-135-120 haproxy: :23388 [18/May/2017:17:18:47.167] http-in~ clientsite_ember/cf 0/30005/-1/-1/39090 503 212 - - CC-- 15/15/1/1/3 0/0 {clientsite.com||Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHT} "GET /path3 HTTP/1.1"
May 18 10:19:46 ip-10-4-135-120 haproxy: :39212 [18/May/2017:17:19:06.835] http-in~ clientsite_ember/cf 0/30005/-1/-1/40006 503 212 - - sC-- 13/13/0/0/3 0/0 {clientsite.com||Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)} "GET /path2/ HTTP/1.1"
May 18 10:19:47 ip-10-4-69-34 haproxy: :43670 [18/May/2017:17:19:38.573] http-in~ clientsite_ember/cf 0/0/-1/-1/9047 503 212 - - CC-- 18/18/0/0/0 0/0 {clientsite.com||Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWe} "GET /path1/?slide=1 HTTP/1.1"
May 18 10:19:55 ip-10-4-13-35 haproxy: :20040 [18/May/2017:17:19:15.429] http-in~ clientsite_ember/cf 0/30004/-1/-1/40006 503 212 - - sC-- 18/18/1/1/3 0/0 {clientsite.com||Mozilla/5.0 (iPhone; CPU iPhone OS 10_2_1 like Mac OS X) AppleWe} "GET /path4/?slide=1 HTTP/1.1"
May 18 10:20:06 ip-10-4-13-35 haproxy: :48559 [18/May
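For readers decoding the log lines above: the slash-separated numeric field is HAProxy's documented HTTP log timer group Tq/Tw/Tc/Tr/Tt (in milliseconds). A minimal sketch of how to pull it apart, not part of the original thread:

```python
# Sketch: decode the Tq/Tw/Tc/Tr/Tt timer field from one of the log
# lines above. Tc = -1 means the connection to the backend server never
# completed, which is consistent with the CC--/sC-- termination flags.
def parse_timers(field):
    """Split 'Tq/Tw/Tc/Tr/Tt' into named integer timers (ms)."""
    names = ("Tq", "Tw", "Tc", "Tr", "Tt")
    return dict(zip(names, (int(v) for v in field.split("/"))))

timers = parse_timers("0/30004/-1/-1/32296")
print(timers["Tc"])  # -1: the connect to the CloudFront origin never succeeded
```

Every failing line shows Tc = -1, so the problem is in establishing the TLS connection to the origin rather than in the request or response phase.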
Re: Bug: DNS changes in 1.7.3+ break UNIX socket stats in daemon mode with resolvers on FreeBSD
On 05/12/2017 09:50 AM, Willy Tarreau wrote:

> On Fri, May 12, 2017 at 10:20:56AM +0200, Frederic Lecaille wrote:
>> Here is a more well-formed patch.
>> Feel free to amend the commit message if it is not clear enough ;)
>
> It was clear enough, thanks. I added a mention of the faulty commit, which helps in tracking backports, and credited Jim and Lukas for the investigations.

Thanks for getting this in! Everything still appears to be good here running with that patch applied. I don't see it in the 1.7 tree yet; will it be backported there? Is there an ETA on 1.7.6?

Jim P.
Re: [PATCH] MINOR: ssl: support ssl-min-ver and ssl-max-ver with crt-list
Hi,

Same patch, split into 3 parts for easier understanding.

> On 12 May 2017 at 15:05, Emmanuel Hocdet wrote:
>
> Hi,
>
> This patch depends on "[Patches] TLS methods configuration reworked".
>
> Currently it will only work with BoringSSL, because haproxy uses a special ssl_sock_switchctx_cbk with a BoringSSL callback to select the certificate before any handshake negotiation. This feature (and others that depend on this ssl_sock_switchctx_cbk) could work with OpenSSL 1.1.1 and the new callback https://www.openssl.org/docs/manmaster/man3/SSL_CTX_set_early_cb.html.
>
> ++
> Manu

Attachments:
0001-REORG-ssl-move-defines-and-methodVersions-table-uppe.patch
0002-MEDIUM-ssl-ctx_set_version-ssl_set_version-func-for-.patch
0003-MINOR-ssl-support-ssl-min-ver-and-ssl-max-ver-with-c.patch
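For context, a crt-list entry using the proposed keywords might look like the sketch below. This is a hedged illustration based on the patch subject; file names and hostnames are hypothetical, and the exact syntax may differ from what was eventually merged:

```
# crt-list: per-certificate TLS version bounds (sketch)
site1.pem [ssl-min-ver TLSv1.1 ssl-max-ver TLSv1.2] www.example.com
site2.pem [ssl-min-ver TLSv1.2] *.example.org
```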
Re: haproxy "inter" and "timeout check", retries and "fall"
Hi Bryan,

For reference:

defaults
    mode http
    log global
    option httplog
    option dontlognull
    option http-server-close
    option forwardfor except 127.0.0.0/8
    option redispatch
    retries 3
    timeout http-request 10s
    timeout queue 1m
    timeout connect 10s
    timeout client 1m
    timeout server 1m
    timeout http-keep-alive 10s
    timeout check 10s
    maxconn 3000

But in the backend section, I have NOT defined "inter", as shown below:

backend apache_http
    balance roundrobin
    cookie iPlanetDirectoryPro prefix nocache
    server httpdserver_80_1 httpd-1-internal:80 cookie S1 check
    server httpdserver_80_2 httpd-2-internal:80 cookie S2 check
    log global

Thank you for your comments, really appreciated.

- Regarding the haproxy version: it is installed as the supported package on RHEL 6.6, through which we get official support. You are right that it is too old; I will seek an upgrade from Red Hat.

- Regarding "timeout check" and "inter": this was for some troubleshooting, and I would like to understand the behaviour a bit better, as the official haproxy documentation is not clear to me on this point. I think in my case it uses "timeout check" as 10 seconds. There is no "inter" parameter in the configuration, so I am trying to understand which value is used when "timeout check" is present but "inter" is not.

- Thanks for clarifying the "retries" parameter.

- Finally, I think I am still right about "fall" (default 3) and "rise" (default 2): it takes up to 50 seconds for haproxy to consider the server operational again. Is that correct to say?

Kindly let me know if anything above is wrong or incorrect.

Regards,
Jiafan

On 05/15/2017 09:10 PM, Bryan Talbot wrote:

>> On May 13, 2017, at 10:59 PM, Jiafan Zhou <jiafan.z...@ericsson.com> wrote:
>>
>> Hi all,
>>
>> The version of haproxy I use is:
>>
>> # haproxy -version
>> HA-Proxy version 1.5.2 2014/07/12
>> Copyright 2000-2014 Willy Tarreau
>
> This version is so old. I'm sure there must be hundreds of bugs fixed over the last 3 years. Why not use a properly current version?
>
>> I have a question regarding the health check. The haproxy documentation mentions both "timeout check" and "inter", and I am wondering which one, and what value, will be used as the health check interval. Is it "timeout check" as 10 seconds, or "inter" at its default of 2 seconds?
>
> Why not just set the health check values that you care about, rather than guessing what they'll end up being when only some are set and some use defaults? If you need or expect them to be a particular value for proper system operation, I'd set them explicitly no matter what the defaults are declared to be.
>
>> Another question: since I defined "retries" as 3, in the case of a server connection failure, will it reconnect 3 times? Or does it use the "fall" parameter (which also defaults to 3 here) for health check retries?
>
> "retries" is for dispatching requests and is not used for health checks.
>
>> So in this configuration, in the case of a server failure, does it wait up to 30 seconds (3 fall or retries), then 20 seconds (2 rise), before the server is considered operational again? (50 seconds in total)
>
> Retries are not considered, only health-check-specific settings like "fall" and "inter".
>
>> Thanks,
>> Jiafan
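As a worked illustration of the arithmetic being discussed (a hedged sketch, assuming the documented defaults of inter 2000 ms, fall 3, rise 2, and that "timeout check" only bounds the duration of an individual check rather than replacing the interval):

```python
# Hedged sketch of health-check convergence times under the documented
# defaults: checks run every "inter" ms; a server is marked DOWN after
# "fall" consecutive failures and UP again after "rise" successes.
inter_ms = 2000    # default interval; no "inter" is set in the config above
fall, rise = 3, 2  # defaults discussed in the thread

down_after_ms = fall * inter_ms   # ms of consecutive failures before DOWN
up_after_ms = rise * inter_ms     # ms of passing checks before UP again
print(down_after_ms, up_after_ms)
```

If the interval really were 10 seconds, the same arithmetic would give 30 s + 20 s, i.e. the 50 seconds mentioned in the thread; pinning "inter" explicitly removes the ambiguity.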
Re: [PATCH] MINOR: boringssl: basic support for OCSP Stapling
Hi Willy,

This patch only applies to boringssl. Could you merge it?

++
Emmanuel

> On 29 March 2017 at 16:46, Emmanuel Hocdet wrote:
>
> Use boringssl's SSL_CTX_set_ocsp_response to set the OCSP response from a file with the '.ocsp' extension. CLI update is not supported.
>
> <0001-MINOR-boringssl-basic-support-for-OCSP-Stapling.patch>
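As the quoted commit message describes, the stapled response is picked up from a file next to the certificate. A hedged sketch of the expected on-disk layout (the paths and file names are hypothetical):

```
/etc/haproxy/certs/site.pem        # certificate and key as usual
/etc/haproxy/certs/site.pem.ocsp   # DER-encoded OCSP response, loaded at startup
```

Since CLI updates are not supported by this patch, refreshing the response would mean replacing the '.ocsp' file and reloading haproxy.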
Re: truncated request in log lines
On Thu, May 18, 2017 at 08:58:41AM +0200, Stéphane Cottin wrote:

> Nice, that was fast :-)

Nobody has time, I just take care of things as they flow :) you're right!

> Sorry, I didn't read the CONTRIBUTING file, RTFM me.

No problem.

> Hope this one is better.

Definitely. The most suitable form is git-format-patch output, which we can easily apply using git-am. But this one has everything I need, so I'll apply it.

Thank you!
Willy
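For anyone following along, the git-format-patch / git-am flow Willy refers to can be sketched like this. It runs against a throwaway repository so the commands are self-contained; the file names, identities, and commit message are placeholders:

```shell
# Sketch of the preferred patch-submission flow: format a commit as a
# mail-ready patch with git-format-patch, which the maintainer can then
# apply with git-am. Everything happens in a temporary repository.
set -e
dir=$(mktemp -d) && cd "$dir"
git init -q repo && cd repo
git config user.email "dev@example.com"   # placeholder identity
git config user.name "Dev"
echo base > file.txt && git add file.txt && git commit -qm "initial import"
echo fix > file.txt && git commit -aqm "BUG/MINOR: example: fix truncated output"
git format-patch -1 HEAD   # emits the commit as 0001-*.patch
ls 0001-*.patch
```

The resulting file carries the author, date, commit message, and diff in one piece, which is why it is easier to apply than an inline or hand-edited diff.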