Re: Status Codes in H2 Mode
Hi Luke,

On Mon, Mar 18, 2019 at 11:14:12AM -0400, Luke Seelenbinder wrote:
(...)
> If I disable HTX, everything flows as normal and the status codes are even
> correctly reported as -1.
>
> I've replicated this on 1.9.4, 1.9.x master, and 2.0-dev master branches. The
> global "this will work" and "this will not work" switch is HTX mode. Anytime
> it's enabled, I see bad behavior. Anytime it's disabled, I see flawless
> behavior.
>
> Any thoughts? I've tried this with and without http-reuse, abortonclose, and
> various settings for pool-purge-delay.

That's useful information. Christopher has been working on fixing some issues related to abortonclose and ended up having to touch a large number of places. We figured that we need to make deeper changes to make this more reliable. I still need to check with him which of his patches can be merged now (some are unfortunately not suitable).

I'm assuming that this is always reproducible with H2 on the front and H1 on the back. I'll see if we can find a reliable reproducer for such situations; that will help us nail down this issue.

Thanks,
Willy
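For the record, the kind of setup I have in mind when trying to build a reproducer is roughly the following (a sketch only; the certificate path, names and addresses are placeholders, not taken from your config):

defaults
    mode http
    option http-use-htx                  # HTX enabled, as in the failing case

frontend fe_h2
    bind :443 ssl crt /etc/haproxy/site.pem alpn h2,http/1.1   # H2 on the front
    default_backend be_h1

backend be_h1
    server srv1 192.0.2.10:8080 check    # plain HTTP/1.1 towards the server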
DNS Resolver Issues
Hi everyone!

I assume I am misunderstanding something, but I cannot figure out what it is.

We are using haproxy in AWS, in this case as sidecars to applications so they need not know about changing backend addresses at all, but can always talk to localhost. Haproxy listens on localhost and then forwards traffic to an ELB instance. This works great, but there have now been two occasions where, due to a change in the ELB's IP addresses, our services went down because the backends could not be reached anymore. I don't understand why haproxy sticks to the old IP address instead of switching to one of the updated ones.

There is a resolvers section which points to the local dnsmasq instance (there to send some requests to consul, but that's not used here). All other traffic is forwarded on to the AWS DNS server set via DHCP. I managed to get timely updates and updated backend servers when using server-template, but from what I understand this should not really be necessary for this.

This is the trimmed-down sidecar config. I have not made any changes to dns timeouts etc.

resolvers default
    # dnsmasq
    nameserver local 127.0.0.1:53

listen regular
    bind 127.0.0.1:9300
    option dontlog-normal
    server lb-internal loadbalancer-internal.xxx.yyy:9300 resolvers default check addr loadbalancer-internal.xxx.yyy port 9300

listen templated
    bind 127.0.0.1:9200
    option dontlog-normal
    option httpchk /haproxy-simple-healthcheck
    server-template lb-internal 2 loadbalancer-internal.xxx.yyy:9200 resolvers default check port 9299

To simulate changing ELB addresses, I added entries for loadbalancer-internal.xxx.yyy in /etc/hosts, to be able to control them via dnsmasq. I tried different scenarios, but could not reliably predict what would happen in all cases. The address ending in 52 (marked as "valid" below) is a currently (as of the time of testing) valid IP for the ELB. The one ending in 199 (marked "invalid") is an unused private IP address in my VPC.

Starting with /etc/hosts:

10.205.100.52   loadbalancer-internal.xxx.yyy   # valid
10.205.100.199  loadbalancer-internal.xxx.yyy   # invalid

haproxy starts and reports:

regular:   lb-internal  UP/L7OK
templated: lb-internal1 DOWN/L4TOUT   lb-internal2 UP/L7OK

That's expected. Now when I edit /etc/hosts to _only_ contain the _invalid_ address and restart dnsmasq, I would expect both proxies to go fully down. But only the templated proxy behaves like that:

regular:   lb-internal  UP/L7OK
templated: lb-internal1 DOWN/L4TOUT   lb-internal2 MAINT (resolution)

Reloading haproxy in this state leads to:

regular:   lb-internal  DOWN/L4TOUT
templated: lb-internal1 MAINT (resolution)   lb-internal2 DOWN/L4TOUT

After fixing /etc/hosts to include the valid server again and restarting dnsmasq:

regular:   lb-internal  DOWN/L4TOUT
templated: lb-internal1 UP/L7OK   lb-internal2 DOWN/L4TOUT

Shouldn't the regular proxy also recognize the change and bring the backend up or down depending on the DNS change? I have waited for several health check rounds (seeing "* L4TOUT" and "L4TOUT" toggle), but it still never updates.

I also tried to have _only_ the invalid address in /etc/hosts and then restart haproxy. The regular backend will never recognize it when I add the valid one back in. The templated one does, _unless_ I set it up to have only 1 instead of 2 server slots; in that case it will also only pick up the valid server when reloaded. On the other hand, it _will_ recognize when I remove the valid server (without a reload, on the next health check), but will _not_ bring it back in and make the proxy UP when it comes back.
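For reference, this is how I read the resolvers/server side with the timing parameters spelled out explicitly (a sketch only; the hold/timeout values below are illustrative guesses at the defaults, not something I have actually set):

resolvers default
    nameserver local 127.0.0.1:53
    resolve_retries 3
    timeout resolve 1s        # interval between resolutions when nothing else triggers one
    timeout retry   1s
    hold valid      10s       # how long a resolved address is trusted
    hold obsolete   30s       # how long an address missing from the response is kept

listen regular
    bind 127.0.0.1:9300
    option dontlog-normal
    server lb-internal loadbalancer-internal.xxx.yyy:9300 resolvers default init-addr last,libc,none check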
I assume my understanding of something here is broken, and I would gladly be told about it :)

Thanks a lot!
Daniel

Version Info:
--
$ haproxy -vv
HA-Proxy version 1.8.19-1ppa1~trusty 2019/02/12
Copyright 2000-2019 Willy Tarreau

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -O2 -fPIE -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -Wno-unused-label
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_PCRE=1 USE_PCRE_JIT=1 USE_NS=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
Running on OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2
Built with Lua version : Lua 5.3.1
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.31 2012-07-06
Running on PCRE
Re: High p99 latency with HAProxy 1.9 in http mode compared to 1.8
Hi Ashwin,

On Mon, Mar 18, 2019 at 10:57:45AM -0700, Ashwin Neerabail wrote:
> Hi Willy,
>
> Thanks for the reply.
>
> My Test setup:
> Client Server1 using local HAProxy 1.9 > 2 Backend servers and
> Client Server2 using local HAProxy 1.8 > same 2 backend servers.
>
> I am measuring latency from the client server.
> So when I run a 1000rps test, 50% of them end up on 1.9 and 50% on 1.8. So
> if the backend servers have a problem, 1.8 should show similar high
> latency too.

Indeed.

> However, consistently only the 1.9 client shows latency.
>
> I even tested this against real traffic in production against various
> backends (Java Netty, Java Tomcat, Nginx). Across the board we saw
> similar latency spikes when we tested 1.9.

This is quite useful, especially with nginx, which is known for not being much bothered by idle connections and which we tested extensively during the server pools design as well.

Now I have some questions to dig into this issue further :

  - did you enable threads on 1.9 ?

  - do you have a "maxconn" setting on your server lines ?

  - if so, do you know if you've ever had some queue on the backend caused by
    this maxconn setting ? This can be seen in the stats page under the
    "Queue/Max" column.

  - do you observe connection retries in your stats page ? This could explain
    the higher latency. Maybe connections time out quickly and can't be reused,
    or maybe we fail to allocate some from time to time due to a low file
    descriptor limit which is hit earlier when server-side pools are enabled.

  - do you observe the problem if you put "http-reuse always" on your 1.8
    setup as well (I guess not, since you said it doesn't fail on 1.9 as soon
    as you remove server pools)?

Thanks,
Willy
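To make the last question concrete, the comparison I have in mind only changes the reuse policy on the 1.8 side, for example (a sketch; the backend and server names and addresses are placeholders, not your config):

backend be_servers
    http-reuse always                       # same reuse policy as on the 1.9 setup
    server srv1 10.0.0.1:8080 maxconn 100 check
    server srv2 10.0.0.2:8080 maxconn 100 check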
Re: High p99 latency with HAProxy 1.9 in http mode compared to 1.8
Hi Willy,

Thanks for the reply.

My Test setup:
Client Server1 using local HAProxy 1.9 > 2 Backend servers and
Client Server2 using local HAProxy 1.8 > same 2 backend servers.

I am measuring latency from the client server. So when I run a 1000rps test, 50% of them end up on 1.9 and 50% on 1.8. So if the backend servers have a problem, 1.8 should show similar high latency too. However, consistently only the 1.9 client shows latency.

I even tested this against real traffic in production against various backends (Java Netty, Java Tomcat, Nginx). Across the board we saw similar latency spikes when we tested 1.9.

Thanks,
Ashwin

On Thu, Feb 28, 2019 at 8:17 PM Willy Tarreau wrote:
> Ashwin,
>
> I've taken some time to read your tests completely now, and something
> bothers me :
>
> On Mon, Feb 25, 2019 at 11:11:08AM -0800, Ashwin Neerabail wrote:
> > > - by disabling server-side idle connections (using "pool-max-conn 0" on
> > > the server) though "http-reuse never" should be equivalent
> > >
> > > This seems to have done the trick. Adding `pool-max-conn 0` or `http-reuse
> > > never` fixes the problem.
> > > 1.8 and 1.9 perform similarly (client app that calls haproxy is using
> > > connection pooling). *Unfortunately, we have legacy clients that close
> > > connections to the front end for every request.*
>
> Well, the thing is that haproxy 1.8 doesn't have connection pooling and
> 1.9 does. So this means that there is no regression between 1.8 and 1.9
> when using the same features. However connection pooling exhibits extra
> latency. Are you really sure that your server remains performant when
> dealing with idle connections ? Maybe it has an accept dispatcher with
> a small queue and has trouble dealing with too many idle connections ?
>
> > > CPU Usage for 1.8 and 1.9 was the same, ~22%.
> > >
> > > - by placing an unconditional redirect rule in your backend so that we
> > > check how it performs when the connection doesn't leave :
> > > http-request redirect location /
> > >
> > > Tried adding monitor-uri and returning from the remote haproxy rather than
> > > hitting the backend server.
> > > Strangely, in this case I see nearly identical performance / CPU usage
> > > with 1.8 and 1.9 even with http-reuse set to aggressive.
> > > CPU Usage for 1.8 and 1.9 was the same, ~35%.
> > > *Set up is Client > HAProxy > HAProxy (with monitor-uri) > Server.*
>
> Ah this test is extremely interesting! It indeed shows that the only
> difference appears when reaching the server. But if the server has
> trouble with idle connections, why don't you disable them on haproxy ?
> As you've seen you can simply do that with "pool-max-conn 0" on the
> server lines. You could even try with different values. It might be
> possible that past a certain point the server's accept queue explodes
> and that's when it starts to have problems. You could try with a limited
> value, e.g. "pool-max-conn 10" then "pool-max-conn 100" etc. and see
> where it starts to break.
>
> Regards,
> Willy
>
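The stepped test suggested above would amount to something like this on the server lines (a sketch; the backend name and addresses are placeholders, not the real config):

backend be_pool_test
    http-reuse aggressive
    server srv1 10.0.0.1:8080 check pool-max-conn 10    # then retry with 100, etc.
    server srv2 10.0.0.2:8080 check pool-max-conn 10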
Re: Status Codes in H2 Mode
Hi Willy,

Unfortunately, I spoke too soon in my last email. After hitting send, I went down the rabbit hole again and uncovered some behaviors I thought we'd rooted out.

Namely, any time I use HTX mode with an H2 fe -> H1 or H2 backend and have frequent request cancellation as discussed previously, I'm seeing hung requests. It's not every request, nor is it every cycle of requests, but I'd say at least 10% of requests end up hanging indefinitely until they eventually time out according to HAProxy. (So perhaps this is an indicator itself of what might be wrong?) HAProxy reports retries / redispatches and maxes out the timeouts, then the request dies. Here are two example log lines; the second one is a request I killed myself:

[18/Mar/2019:15:02:49.723] stadiamaps~ tile/tile1 0/37204/-1/-1/49606 503 0 - - sC-- 2/1/2/2/3 0/0 {} "GET /tiles/osm_bright/10/565/3...@2x.png HTTP/2.0"
[18/Mar/2019:15:03:39.507] stadiamaps~ tile/tile1 0/24804/-1/-1/29123 503 0 - - CC-- 2/1/0/0/2 0/0 {} "GET /tiles/osm_bright/10/565/3...@2x.png HTTP/2.0"

If I disable HTX, everything flows as normal and the status codes are even correctly reported as -1.

I've replicated this on 1.9.4, 1.9.x master, and 2.0-dev master branches. The global "this will work" and "this will not work" switch is HTX mode. Anytime it's enabled, I see bad behavior. Anytime it's disabled, I see flawless behavior.

Any thoughts? I've tried this with and without http-reuse, abortonclose, and various settings for pool-purge-delay.

Best,
Luke

—
Luke Seelenbinder
Stadia Maps | Founder
stadiamaps.com

On Mon, Mar 18, 2019, at 13:46, Luke Seelenbinder wrote:
> Hi Willy,
>
> I finally had the opportunity to try out `option abortonclose`.
>
> Initially, it made the problem much worse. Instead of occasionally
> incorrect status codes in the logs, I saw requests fail in the
> following manner:
>
> [18/Mar/2019:12:30:08.040] stadiamaps~ tile/tile1 0/18603/-1/-1/24804
> 503 0 - - sC-- 2/1/1/1/3 0/0 {} "GET /tiles/osm_bright/6/31/20.png
> HTTP/2.0"
> [18/Mar/2019:12:30:08.041] stadiamaps~ tile/tile1 0/18602/-1/-1/24803
> 503 0 - - sC-- 2/1/0/0/3 0/0 {} "GET /tiles/osm_bright/6/34/20.png
> HTTP/2.0"
>
> What's further interesting, it was consistently 2 out of 18
> requests. That led me down the road of checking queue timeouts
> (noticing the timing correlation in the logs). I adjusted `timeout
> connect` up from 6200ms to 12400ms and set pool-purge-delay to 60s.
>
> After adjusting those timeouts and pool purges and re-enabling
> `abortonclose`, the request errors I was seeing magically went away.
> I'll push this config to production and see if we see a reduction in
> 503s. I also suspect we'll see a marginal improvement in throughput and
> response time due to keeping backend connections open longer.
>
> I'll also keep an eye out for inconsistencies between our backend
> accept capability and timeouts and see if perhaps we're overrunning
> some buffer somewhere in HAProxy, NGINX, or somewhere else.
>
> Thanks for your help so far!
>
> Best,
> Luke
>
> —
> Luke Seelenbinder
> Stadia Maps | Founder
> stadiamaps.com
>
> On Mon, Mar 4, 2019, at 14:08, Willy Tarreau wrote:
> > On Mon, Mar 04, 2019 at 11:45:53AM +, Luke Seelenbinder wrote:
> > > Hi Willy,
> > >
> > > > Do you have "option abortonclose" in your config ?
> > >
> > > We do not have abortonclose. Do you recommend this if we have a lot of
> > > client-side request aborts (but not connection-level closes)? From reading
> > > the docs, I came away conflicted as to the implications. :-)
> >
> > It will help, especially if you have maxconn configured on your server
> > lines, as it will allow the requests to be aborted while still in queue.
> >
> > That said, we still don't know exactly what causes your logs.
> >
> > Willy
> >
Re: Status Codes in H2 Mode
Hi Willy,

I finally had the opportunity to try out `option abortonclose`.

Initially, it made the problem much worse. Instead of occasionally incorrect status codes in the logs, I saw requests fail in the following manner:

[18/Mar/2019:12:30:08.040] stadiamaps~ tile/tile1 0/18603/-1/-1/24804 503 0 - - sC-- 2/1/1/1/3 0/0 {} "GET /tiles/osm_bright/6/31/20.png HTTP/2.0"
[18/Mar/2019:12:30:08.041] stadiamaps~ tile/tile1 0/18602/-1/-1/24803 503 0 - - sC-- 2/1/0/0/3 0/0 {} "GET /tiles/osm_bright/6/34/20.png HTTP/2.0"

What's further interesting, it was consistently 2 out of 18 requests. That led me down the road of checking queue timeouts (noticing the timing correlation in the logs). I adjusted `timeout connect` up from 6200ms to 12400ms and set pool-purge-delay to 60s.

After adjusting those timeouts and pool purges and re-enabling `abortonclose`, the request errors I was seeing magically went away. I'll push this config to production and see if we see a reduction in 503s. I also suspect we'll see a marginal improvement in throughput and response time due to keeping backend connections open longer.

I'll also keep an eye out for inconsistencies between our backend accept capability and timeouts and see if perhaps we're overrunning some buffer somewhere in HAProxy, NGINX, or somewhere else.

Thanks for your help so far!

Best,
Luke

—
Luke Seelenbinder
Stadia Maps | Founder
stadiamaps.com

On Mon, Mar 4, 2019, at 14:08, Willy Tarreau wrote:
> On Mon, Mar 04, 2019 at 11:45:53AM +, Luke Seelenbinder wrote:
> > Hi Willy,
> >
> > > Do you have "option abortonclose" in your config ?
> >
> > We do not have abortonclose. Do you recommend this if we have a lot of
> > client-side request aborts (but not connection-level closes)? From reading
> > the docs, I came away conflicted as to the implications. :-)
>
> It will help, especially if you have maxconn configured on your server
> lines, as it will allow the requests to be aborted while still in queue.
>
> That said, we still don't know exactly what causes your logs.
>
> Willy
>
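Concretely, the changes described above amount to something like this (a sketch only; the server address is a placeholder and this is nowhere near our full config):

global
    pool-purge-delay 60s        # keep idle server-side connections around longer

defaults
    timeout connect 12400ms     # raised from 6200ms

backend tile
    option abortonclose
    server tile1 192.0.2.20:8080 check    # placeholder address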
How to configure Email server in haproxy version 1.8
Hi,

I want to know how to configure an email server in haproxy version 1.8, so that an alert is sent when any server is down.

Regards,
Shweta
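For reference, the feature in question is haproxy's mailers / email-alert support; a minimal sketch (all names and addresses below are placeholders) looks like this:

mailers alert-mailers
    mailer smtp1 192.0.2.25:587              # your SMTP relay (placeholder address)

backend app
    email-alert mailers alert-mailers
    email-alert from haproxy@example.com
    email-alert to   ops@example.com
    email-alert level alert                  # alerts are sent on server state changes
    server srv1 10.0.0.1:80 check            # health checks are needed to detect DOWN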