Re: Status Codes in H2 Mode

2019-03-18 Thread Willy Tarreau
Hi Luke,

On Mon, Mar 18, 2019 at 11:14:12AM -0400, Luke Seelenbinder wrote:
(...)
> If I disable HTX, everything flows as normal and the status codes are even
> correctly reported as -1.
> 
> I've replicated this on 1.9.4, 1.9.x master, and 2.0-dev master branches. The
> global "this will work" and "this will not work" switch is HTX mode. Anytime
> it's enabled, I see bad behavior. Anytime it's disabled, I see flawless
> behavior.
> 
> Any thoughts? I've tried this with and without http-reuse, abortonclose,
> various settings for pool-purge-delay.

That's useful information. Christopher has been working on fixing some
issues related to abortonclose and ended up having to touch a large
number of places. We figured that we need to make deeper changes to
make this thing more reliable. I still need to check with him which of
his patches could be merged now (some are not suitable unfortunately).

I'm assuming that this is always reproducible with H2 on the front and
H1 on the back. I'll see if we can find a reliable reproducer for such
situations; that will help us nail down this issue.

Thanks,
Willy



DNS Resolver Issues

2019-03-18 Thread Daniel Schneller
Hi everyone!

I assume I am misunderstanding something, but I cannot figure out what it is.
We are using haproxy in AWS, in this case as sidecars to applications so they
need not know about changing backend addresses at all, but can always talk to
localhost.

Haproxy listens on localhost and then forwards traffic to an ELB instance.
This works great, but there have been two occasions now where, due to a change
in the ELB's IP addresses, our services went down because the backends could
not be reached anymore. I don't understand why haproxy sticks to the old IP
address instead of going to one of the updated ones.

There is a resolvers section which points to the local dnsmasq instance (it's
there to send some requests to consul, but that's not used here). All other
traffic is forwarded on to the AWS DNS server set via DHCP.

I managed to get timely updates and updated backend servers when using
server-template, but from what I understand this should not really be
necessary for this.

This is the trimmed-down sidecar config. I have not made any changes to DNS
timeouts etc.

resolvers default
  # dnsmasq
  nameserver local 127.0.0.1:53
  
listen regular
  bind 127.0.0.1:9300
  option dontlog-normal
  server lb-internal loadbalancer-internal.xxx.yyy:9300 resolvers default check addr loadbalancer-internal.xxx.yyy port 9300

listen templated
  bind 127.0.0.1:9200
  option dontlog-normal
  option httpchk /haproxy-simple-healthcheck
  server-template lb-internal 2 loadbalancer-internal.xxx.yyy:9200 resolvers default check port 9299
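
For completeness, the resolver timing parameters I have *not* touched would sit
in the same section; the directive names below are real, but the values are
only illustrative placeholders, not my actual settings:

resolvers default
  # dnsmasq
  nameserver local 127.0.0.1:53
  # illustrative values only -- none of these are set in my config
  resolve_retries 3
  timeout retry   1s
  hold valid      10s
  hold obsolete   30s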


To simulate changing ELB addresses, I added entries for
loadbalancer-internal.xxx.yyy to /etc/hosts so that I could control them via
dnsmasq.

I tried different scenarios, but could not reliably predict what would happen
in all cases.

The address ending in 52 (marked as "valid" below) is a currently (as of the
time of testing) valid IP for the ELB. The one ending in 199 (marked
"invalid") is an unused private IP address in my VPC.


Starting with /etc/hosts:

10.205.100.52  loadbalancer-internal.xxx.yyy   # valid
10.205.100.199 loadbalancer-internal.xxx.yyy   # invalid

haproxy starts and reports:

regular:   lb-internal    UP/L7OK
templated: lb-internal1   DOWN/L4TOUT
           lb-internal2   UP/L7OK

That's expected. Now when I edit /etc/hosts to _only_ contain the _invalid_
address and restart dnsmasq, I would expect both proxies to go fully down.
But only the templated proxy behaves like that:

regular:   lb-internal    UP/L7OK
templated: lb-internal1   DOWN/L4TOUT
           lb-internal2   MAINT (resolution)
   
Reloading haproxy in this state leads to:

regular:   lb-internal    DOWN/L4TOUT
templated: lb-internal1   MAINT (resolution)
           lb-internal2   DOWN/L4TOUT
   
After fixing /etc/hosts to include the valid server again and restarting
dnsmasq:

regular:   lb-internal    DOWN/L4TOUT
templated: lb-internal1   UP/L7OK
           lb-internal2   DOWN/L4TOUT


Shouldn't the regular proxy also recognize the change and bring the backend up
or down depending on the DNS change? I have waited for several health check
rounds (seeing "* L4TOUT" and "L4TOUT" toggle), but it still never updates.

I also tried having _only_ the invalid address in /etc/hosts and then
restarting haproxy. The regular backends never recognize it when I add the
valid one back in.

The templated one does, _unless_ I set it up to have only 1 instead of 2
server slots. In that case it will also only pick up the valid server when
reloaded.

On the other hand, it _will_ recognize on the next health check when I remove
the valid server without a reload, but it will _not_ bring the server back in
and mark the proxy UP when it comes back.


I assume my understanding of something here is broken, and I would gladly be
told about it :)


Thanks a lot!
Daniel


Version Info:
--
$ haproxy -vv
HA-Proxy version 1.8.19-1ppa1~trusty 2019/02/12
Copyright 2000-2019 Willy Tarreau 

Build options :
  TARGET  = linux2628
  CPU = generic
  CC  = gcc
  CFLAGS  = -O2 -g -O2 -fPIE -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -Wno-unused-label
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_PCRE=1 USE_PCRE_JIT=1 USE_NS=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
Running on OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2
Built with Lua version : Lua 5.3.1
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.31 2012-07-06
Running on PCRE 

Re: High p99 latency with HAProxy 1.9 in http mode compared to 1.8

2019-03-18 Thread Willy Tarreau
Hi Ashwin,

On Mon, Mar 18, 2019 at 10:57:45AM -0700, Ashwin Neerabail wrote:
> Hi Willy,
> 
> Thanks for the reply.
> 
> My Test setup:
> Client Server1 using local HAProxy 1.9 > 2 Backend servers  and
> Client Server2 using local HAProxy 1.8 > same 2 backend servers.
> 
> I am measuring latency from the client server.
> So when I run a 1000 rps test, 50% of them end up on 1.9 and 50% on 1.8. So
> if the backend servers have a problem, 1.8 should show similarly high
> latency too.

Indeed.

> However, only the 1.9 client consistently shows the latency.
>
> I even tested this against real traffic in production against various
> backends (Java Netty, Java Tomcat, Nginx). Across the board we saw
> similar latency spikes when we tested 1.9.

This is quite useful, especially with nginx, which is known for not
being much bothered by idle connections and which we also tested
extensively during the design of the server pools.

Now I have some questions to dig into this issue further :
  - did you enable threads on 1.9 ?
  - do you have a "maxconn" setting on your server lines ?
  - if so, do you know if you've ever had some queue on the
    backend caused by this maxconn setting ? This can be seen
    in the stats page under the "Queue/Max" column.
  - do you observe connection retries in your stats page ? This
    could explain the higher latency. Maybe connections time
    out quickly and can't be reused, or maybe we fail to allocate
    some from time to time due to a low file descriptor limit which
    is hit earlier when server-side pools are enabled.
  - do you observe the problem if you put "http-reuse always" on
    your 1.8 setup as well (I guess not since you said it doesn't
    fail on 1.9 as soon as you remove server pools)?
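
Just to make sure we're looking at the same knobs, here's a minimal sketch of
where these settings live (the backend name, addresses and values are only
placeholders, not taken from your configuration):

backend app
  http-reuse always
  # maxconn limits concurrent connections per server and queues the excess,
  # pool-max-conn (1.9 only) caps the idle connections kept per server
  server srv1 192.0.2.1:8080 check maxconn 100 pool-max-conn 100
  server srv2 192.0.2.2:8080 check maxconn 100 pool-max-conn 100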

Thanks,
Willy



Re: High p99 latency with HAProxy 1.9 in http mode compared to 1.8

2019-03-18 Thread Ashwin Neerabail
Hi Willy,

Thanks for the reply.

My Test setup:
Client Server1 using local HAProxy 1.9 > 2 Backend servers  and
Client Server2 using local HAProxy 1.8 > same 2 backend servers.

I am measuring latency from the client server.
So when I run a 1000 rps test, 50% of them end up on 1.9 and 50% on 1.8. So
if the backend servers have a problem, 1.8 should show similarly high
latency too.
However, only the 1.9 client consistently shows the latency.

I even tested this against real traffic in production against various
backends (Java Netty, Java Tomcat, Nginx). Across the board we saw
similar latency spikes when we tested 1.9.

Thanks,
Ashwin



On Thu, Feb 28, 2019 at 8:17 PM Willy Tarreau  wrote:

> Ashwin,
>
> I've taken some time to read your tests completely now, and something
> bothers me :
>
> On Mon, Feb 25, 2019 at 11:11:08AM -0800, Ashwin Neerabail wrote:
> > > - by disabling server-side idle connections (using "pool-max-conn 0" on
> > >   the server) though "http-reuse never" should be equivalent
> > >
> > > This seems to have done the trick. Adding `pool-max-conn 0` or
> > > `http-reuse never` fixes the problem.
> > > 1.8 and 1.9 perform similarly (client app that calls haproxy is using
> > > connection pooling). *Unfortunately, we have legacy clients that close
> > > connections to the front end for every request.*
>
> Well, the thing is that haproxy 1.8 doesn't have connection pooling and
> 1.9 does. So this means that there is no regression between 1.8 and 1.9
> when using the same features. However connection pooling exhibits extra
> latency. Are you really sure that your server remains performant when
> dealing with idle connections ? Maybe it has an accept dispatcher with
> a small queue and has trouble dealing with too many idle connections ?
>
> > > CPU Usage for 1.8 and 1.9 was the same, ~22%.
> > >
> > > - by placing an unconditional redirect rule in your backend so that we
> > >   check how it performs when the connection doesn't leave :
> > >   http-request redirect location /
> > >
> > > Tried adding monitor-uri and returning from the remote haproxy rather than
> > > hitting the backend server.
> > > Strangely, in this case I see nearly identical performance/CPU usage
> > > with 1.8 and 1.9 even with http-reuse set to aggressive.
> > > CPU Usage for 1.8 and 1.9 was the same, ~35%.
> > > *Set up is Client > HAProxy > HAProxy (with monitor-uri) > Server.*
>
> Ah this test is extremely interesting! It indeed shows that the only
> difference appears when reaching the server. But if the server has
> trouble with idle connections, why don't you disable them on haproxy ?
> As you've seen you can simply do that with "pool-max-conn 0" on the
> server lines. You could even try with different values. It might be
> possible that past a certain point the server's accept queue explodes
> and that's when it starts to have problems. You could try with a limited
> value, e.g. "pool-max-conn 10" then "pool-max-conn 100" etc and see
> where it starts to break.
>
> Regards,
> Willy
>
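
For reference, the stepped pool-max-conn test suggested above would look
roughly like this (the backend name, server name and address are placeholders):

backend app
  # run once with idle connections fully disabled, then raise the limit
  # step by step (0, 10, 100, ...) and watch where latency starts to degrade
  server srv1 192.0.2.1:8080 check pool-max-conn 0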


Re: Status Codes in H2 Mode

2019-03-18 Thread Luke Seelenbinder
Hi Willy,

Unfortunately, I spoke too soon in my last email. After hitting send, I went 
down the rabbit hole again and uncovered some behaviors I thought we'd rooted 
out. Namely, any time I use HTX mode with an H2 fe -> H1 or H2 backend and have 
frequent request cancellation as discussed previously, I'm seeing hung 
requests. It's not every request nor is it every cycle of requests, but I'd say 
at least 10% of requests end up hanging indefinitely until they eventually
time out according to HAProxy. (So perhaps this is an indicator itself of what
might be wrong?) HAProxy reports retries / redispatches and maxes out the
timeouts, then the request dies. Here are two example log lines; for the
second one I killed the request myself:

[18/Mar/2019:15:02:49.723] stadiamaps~ tile/tile1 0/37204/-1/-1/49606 503 0 - - 
sC-- 2/1/2/2/3 0/0 {} "GET /tiles/osm_bright/10/565/3...@2x.png HTTP/2.0"
[18/Mar/2019:15:03:39.507] stadiamaps~ tile/tile1 0/24804/-1/-1/29123 503 0 - - 
CC-- 2/1/0/0/2 0/0 {} "GET /tiles/osm_bright/10/565/3...@2x.png HTTP/2.0"

If I disable HTX, everything flows as normal and the status codes are even
correctly reported as -1.

I've replicated this on 1.9.4, 1.9.x master, and 2.0-dev master branches. The 
global "this will work" and "this will not work" switch is HTX mode. Anytime 
it's enabled, I see bad behavior. Anytime it's disabled, I see flawless 
behavior.

Any thoughts? I've tried this with and without http-reuse, abortonclose, 
various settings for pool-purge-delay.

Best,
Luke

—
Luke Seelenbinder
Stadia Maps | Founder
stadiamaps.com

On Mon, Mar 18, 2019, at 13:46, Luke Seelenbinder wrote:
> Hi Willy,
> 
> I finally had the opportunity to try out `option abortonclose`.
> 
> Initially, it made the problem much worse. Instead of occasionally 
> incorrect status codes in the logs, I saw requests fail in the 
> following manner:
> 
> [18/Mar/2019:12:30:08.040] stadiamaps~ tile/tile1 0/18603/-1/-1/24804 
> 503 0 - - sC-- 2/1/1/1/3 0/0 {} "GET /tiles/osm_bright/6/31/20.png 
> HTTP/2.0"
> [18/Mar/2019:12:30:08.041] stadiamaps~ tile/tile1 0/18602/-1/-1/24803 
> 503 0 - - sC-- 2/1/0/0/3 0/0 {} "GET /tiles/osm_bright/6/34/20.png 
> HTTP/2.0"
> 
> What's further interesting, it was consistently 2 out of 18
> requests. That led me down the road of checking queue timeouts 
> (noticing the timing correlation in the logs). I adjusted `timeout 
> connect` up from 6200ms to 12400ms and added pool-purge-delay to 60s.
> 
> After adjusting those timeouts and pool purges and re-enabling 
> `abortonclose`, the request errors I was seeing magically went away. 
> I'll push this config to production and see if we see a reduction in 
> 503s. I also suspect we'll see a marginal improvement in throughput and 
> response time due to keeping backend connections open longer.
> 
> I'll also keep an eye out for inconsistencies between our backend
> accept capability and timeouts and see if perhaps we're overrunning 
> some buffer somewhere in HAProxy, NGINX, or somewhere else.
> 
> Thanks for your help so far!
> 
> Best,
> Luke
> 
> —
> Luke Seelenbinder
> Stadia Maps | Founder
> stadiamaps.com
> 
> On Mon, Mar 4, 2019, at 14:08, Willy Tarreau wrote:
> > On Mon, Mar 04, 2019 at 11:45:53AM +, Luke Seelenbinder wrote:
> > > Hi Willy,
> > >
> > > > Do you have "option abortonclose" in your config ?
> > >
> > > We do not have abortonclose. Do you recommend this if we have a lot of
> > > client-side request aborts (but not connection level closes)? From reading
> > > the docs, I came away conflicted as to the implications. :-)
> > 
> > It will help, especially if you have maxconn configured on your server
> > lines, as it will allow the requests to be aborted while still in queue.
> > 
> > That said, we still don't know exactly what causes your logs.
> > 
> > Willy
> >
> 
>



Re: Status Codes in H2 Mode

2019-03-18 Thread Luke Seelenbinder
Hi Willy,

I finally had the opportunity to try out `option abortonclose`.

Initially, it made the problem much worse. Instead of occasionally incorrect 
status codes in the logs, I saw requests fail in the following manner:

[18/Mar/2019:12:30:08.040] stadiamaps~ tile/tile1 0/18603/-1/-1/24804 503 0 - - 
sC-- 2/1/1/1/3 0/0 {} "GET /tiles/osm_bright/6/31/20.png HTTP/2.0"
[18/Mar/2019:12:30:08.041] stadiamaps~ tile/tile1 0/18602/-1/-1/24803 503 0 - - 
sC-- 2/1/0/0/3 0/0 {} "GET /tiles/osm_bright/6/34/20.png HTTP/2.0"

What's further interesting, it was consistently 2 out of 18 requests. That
led me down the road of checking queue timeouts (noticing the timing 
correlation in the logs). I adjusted `timeout connect` up from 6200ms to 
12400ms and added pool-purge-delay to 60s.
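
For reference, the adjusted settings look roughly like this (the section
placement and server address are illustrative, not copied from my real config):

global
  pool-purge-delay 60s

defaults
  timeout connect 12400ms

backend tile
  option abortonclose
  server tile1 192.0.2.1:8080 check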

After adjusting those timeouts and pool purges and re-enabling `abortonclose`, 
the request errors I was seeing magically went away. I'll push this config to 
production and see if we see a reduction in 503s. I also suspect we'll see a 
marginal improvement in throughput and response time due to keeping backend 
connections open longer.

I'll also keep an eye out for inconsistencies between our backend accept
capability and timeouts and see if perhaps we're overrunning some buffer 
somewhere in HAProxy, NGINX, or somewhere else.

Thanks for your help so far!

Best,
Luke

—
Luke Seelenbinder
Stadia Maps | Founder
stadiamaps.com

On Mon, Mar 4, 2019, at 14:08, Willy Tarreau wrote:
> On Mon, Mar 04, 2019 at 11:45:53AM +, Luke Seelenbinder wrote:
> > Hi Willy,
> >
> > > Do you have "option abortonclose" in your config ?
> >
> > We do not have abortonclose. Do you recommend this if we have a lot of
> > client-side request aborts (but not connection level closes)? From reading
> > the docs, I came away conflicted as to the implications. :-)
> 
> It will help, especially if you have maxconn configured on your server
> lines, as it will allow the requests to be aborted while still in queue.
> 
> That said, we still don't know exactly what causes your logs.
> 
> Willy
>



How to configure Email server in haproxy version 1.8

2019-03-18 Thread Shweta Garg
Hi,

I want to know how to configure an email server in haproxy version 1.8 so that
an email alert is sent when any server is down.

Regards
Shweta
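
A minimal sketch of what this usually looks like in 1.8 is a "mailers" section
referenced by "email-alert" directives in the backend (the names, addresses and
SMTP server below are placeholders):

mailers alert-mailers
  mailer smtp1 192.0.2.25:25

backend app
  email-alert mailers alert-mailers
  email-alert from haproxy@example.com
  email-alert to ops@example.com
  email-alert level alert   # 'alert' (the default) covers servers going DOWN
  server srv1 192.0.2.1:80 check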