Re: HAProxy 1.8.4 crashing

2018-07-05 Thread Holger Just
Hi Praveen,

There are several fixes for segfaults which might occur in your version
of HAProxy. Before checking anything else, you should upgrade to the
latest version of HAProxy 1.8 (currently 1.8.12).

See http://www.haproxy.org/bugs/bugs-1.8.4.html for bugs fixed in this
version compared to your current version.

Regards,
Holger

UPPALAPATI, PRAVEEN wrote:
> 
> Hi Haproxy Team,
> 
> Our Prod Haproxy instance is crashing with the following error in /var/log/messages:
> 
> Jun 28 17:52:30 zlp32359 kernel: haproxy[55940]: segfault at 60 ip 
> 0045b0a9 sp 7f4ef6b9f010 error 4 in haproxy[40+12b000]
> Jun 28 17:56:01 zlp32359 systemd: Started Session 73792 of user root.
> Jun 28 17:56:01 zlp32359 systemd: Starting Session 73792 of user root.
> Jun 28 17:56:01 zlp32359 LMC: Hardware Manufacturer = VMWARE
> Jun 28 17:56:01 zlp32359 LMC: Hardware Product Name = VMware Virtual Platform
> Jun 28 17:56:01 zlp32359 LMC: Hardware Serial # = VMware-42 29 ea 5e 6c 7b 5b 
> 49-ca 32 48 fb 5a 9d e7 d6
> Jun 28 17:56:01 zlp32359 LMC: ### NO PID_MAX ISSUES FOUND ###
> Jun 28 17:56:01 zlp32359 LMC: ### NO READ ONLY FILE SYSTEM ISSUES FOUND ###
> Jun 28 17:56:01 zlp32359 LMC: ### NO SCSI ABORT ISSUES FOUND ###
> Jun 28 17:56:02 zlp32359 LMC: ### NO SCSI ERROR ISSUES FOUND ###
> 
> HAProxy version:
> 
> haproxy -v
> HA-Proxy version 1.8.4-1deb90d 2018/02/08
> Copyright 2000-2018 Willy Tarreau 
> 
> Command to run haproxy:
> 
> //opt/app/haproxy/sbin/haproxy -D -f //opt/app/haproxy/etc/haproxy.cfg -f //opt/app/haproxy/etc/haproxy-healthcheck.cfg -p //opt/app/haproxy/log/haprox.pid
> 
> Let me know if you need more information. If you need more logging, let me know 
> how I can enable it. I am not able to reproduce this in our dev box (probably 
> because I am not able to replicate the traffic on dev).
> 
> Thanks,
> Praveen.
> 
> 
> 
> 



Re: HAProxy 1.8.4 crashing

2018-07-05 Thread Willy Tarreau
Hi Praveen,

On Thu, Jul 05, 2018 at 04:13:25PM +, UPPALAPATI, PRAVEEN wrote:
> 
> 
> Hi Haproxy Team,
> 
> Our Prod Haproxy instance is crashing with the following error in /var/log/messages:
> 
> Jun 28 17:52:30 zlp32359 kernel: haproxy[55940]: segfault at 60 ip 
> 0045b0a9 sp 7f4ef6b9f010 error 4 in haproxy[40+12b000]
> Jun 28 17:56:01 zlp32359 systemd: Started Session 73792 of user root.
> Jun 28 17:56:01 zlp32359 systemd: Starting Session 73792 of user root.
> Jun 28 17:56:01 zlp32359 LMC: Hardware Manufacturer = VMWARE
> Jun 28 17:56:01 zlp32359 LMC: Hardware Product Name = VMware Virtual Platform
> Jun 28 17:56:01 zlp32359 LMC: Hardware Serial # = VMware-42 29 ea 5e 6c 7b 5b 
> 49-ca 32 48 fb 5a 9d e7 d6
> Jun 28 17:56:01 zlp32359 LMC: ### NO PID_MAX ISSUES FOUND ###
> Jun 28 17:56:01 zlp32359 LMC: ### NO READ ONLY FILE SYSTEM ISSUES FOUND ###
> Jun 28 17:56:01 zlp32359 LMC: ### NO SCSI ABORT ISSUES FOUND ###
> Jun 28 17:56:02 zlp32359 LMC: ### NO SCSI ERROR ISSUES FOUND ###
> 
> HAProxy version:
> 
> haproxy -v
> HA-Proxy version 1.8.4-1deb90d 2018/02/08


Well, as you can see, at least 109 bugs have been fixed since this version,
including one critical and 11 major, all potentially able to provoke a crash :

   http://www.haproxy.org/bugs/bugs-1.8.4.html

I think it's *really* time for you to apply maintenance updates and try again.

Regards,
Willy



HAProxy 1.8.4 crashing

2018-07-05 Thread UPPALAPATI, PRAVEEN



Hi Haproxy Team,

Our Prod Haproxy instance is crashing with the following error in /var/log/messages:

Jun 28 17:52:30 zlp32359 kernel: haproxy[55940]: segfault at 60 ip 
0045b0a9 sp 7f4ef6b9f010 error 4 in haproxy[40+12b000]
Jun 28 17:56:01 zlp32359 systemd: Started Session 73792 of user root.
Jun 28 17:56:01 zlp32359 systemd: Starting Session 73792 of user root.
Jun 28 17:56:01 zlp32359 LMC: Hardware Manufacturer = VMWARE
Jun 28 17:56:01 zlp32359 LMC: Hardware Product Name = VMware Virtual Platform
Jun 28 17:56:01 zlp32359 LMC: Hardware Serial # = VMware-42 29 ea 5e 6c 7b 5b 
49-ca 32 48 fb 5a 9d e7 d6
Jun 28 17:56:01 zlp32359 LMC: ### NO PID_MAX ISSUES FOUND ###
Jun 28 17:56:01 zlp32359 LMC: ### NO READ ONLY FILE SYSTEM ISSUES FOUND ###
Jun 28 17:56:01 zlp32359 LMC: ### NO SCSI ABORT ISSUES FOUND ###
Jun 28 17:56:02 zlp32359 LMC: ### NO SCSI ERROR ISSUES FOUND ###

HAProxy version:

haproxy -v
HA-Proxy version 1.8.4-1deb90d 2018/02/08
Copyright 2000-2018 Willy Tarreau 

Command to run haproxy:

//opt/app/haproxy/sbin/haproxy -D -f //opt/app/haproxy/etc/haproxy.cfg -f //opt/app/haproxy/etc/haproxy-healthcheck.cfg -p //opt/app/haproxy/log/haprox.pid

Let me know if you need more information. If you need more logging, let me know 
how I can enable it. I am not able to reproduce this in our dev box (probably 
because I am not able to replicate the traffic on dev).

Thanks,
Praveen.






Re: Haproxy 1.8.4 crashing workers and increased memory usage

2018-04-10 Thread Cyril Bonté
Hi Robin,

> De: "Robin Geuze" 
> À: "Willy Tarreau" 
> Cc: haproxy@formilux.org
> Envoyé: Lundi 9 Avril 2018 10:24:43
> Objet: Re: Haproxy 1.8.4 crashing workers and increased memory usage
> 
> Hey Willy,
> 
> So I made a build this morning with libslz and re-enabled compression
> and within an hour we had the exit code 134 errors, so zlib does not
> seem to be the problem here.

I have spent some time on this issue yesterday, without being able to 
reproduce it.
I suspect something wrong with pending connections (without any clue, except 
there's an abort() in the path), but couldn't see anything wrong in the code.

There's still something missing in this thread (maybe I missed it): can you 
provide the output of "haproxy -vv" ?
Also, are you 100% sure you're running the version you compiled ? I prefer to 
ask, as it sometimes happens ;-)
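
For example, a quick way to double-check both points (the path and PID below are placeholders):

  /usr/sbin/haproxy -vv                 # full build options of the binary on disk
  ls -l /proc/<worker-pid>/exe          # which binary the running worker was actually started from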

Thanks,
Cyril



Re: Haproxy 1.8.4 crashing workers and increased memory usage

2018-04-09 Thread Robin Geuze

Hey,

Won't that be a bit pointless since we don't use threads?

Regards,

Robin Geuze


On 4/9/2018 10:31, Илья Шипицин wrote:

can you try thread sanitizer (in real time)?

https://github.com/google/sanitizers/wiki#threadsanitizer


I'd like to try myself, however, we do not observe bad things in our 
environment


2018-04-09 13:24 GMT+05:00 Robin Geuze:


Hey Willy,

So I made a build this morning with libslz and re-enabled
compression and within an hour we had the exit code 134 errors, so
zlib does not seem to be the problem here.

Regards,

Robin Geuze



On 4/7/2018 00:30, Willy Tarreau wrote:

Hi Robin,

On Fri, Apr 06, 2018 at 03:52:33PM +0200, Robin Geuze wrote:

Hey Willy,

I was actually the one that had the hunch to disable compression. I
suspected that this was the issue because there was a bunch of "abort"
calls in "include/common/hathreads.h", which is used by the compression
stuff. However I just noticed those aborts are actually only there if
DEBUG_THREAD is defined, which it doesn't seem to be for our build. So
basically, I have no clue whatsoever why disabling compression fixes the
bug.

At least I don't feel alone :-)

I can see next week if we can make a build with slz instead of zlib (we
seem to be linked against zlib/libz atm).

Thank you, I appreciate it!

Cheers,
Willy








Re: Haproxy 1.8.4 crashing workers and increased memory usage

2018-04-09 Thread Илья Шипицин
can you try thread sanitizer (in real time)?

https://github.com/google/sanitizers/wiki#threadsanitizer


I'd like to try myself, however, we do not observe bad things in our
environment
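
For what it's worth, one possible way to build haproxy with ThreadSanitizer (a sketch, assuming a clang toolchain and the usual Makefile variables; it is only meaningful for a build that actually enables USE_THREAD/nbthread):

  make TARGET=linux2628 CC=clang USE_THREAD=1 \
       DEBUG_CFLAGS="-g -fsanitize=thread" LDFLAGS="-fsanitize=thread"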

2018-04-09 13:24 GMT+05:00 Robin Geuze :

> Hey Willy,
>
> So I made a build this morning with libslz and re-enabled compression and
> within an hour we had the exit code 134 errors, so zlib does not seem to be
> the problem here.
>
> Regards,
>
> Robin Geuze
>
>
>
> On 4/7/2018 00:30, Willy Tarreau wrote:
>
>> Hi Robin,
>>
>> On Fri, Apr 06, 2018 at 03:52:33PM +0200, Robin Geuze wrote:
>>
>>> Hey Willy,
>>>
>>> I was actually the one that had the hunch to disable compression. I
>>> suspected that this was the issue because there was a bunch of "abort"
>>> calls
>>> in include/common/hathreads.h" which is used by the compression stuff.
>>> However I just noticed those aborts are actually only there if
>>> DEBUG_THREAD
>>> is defined which it doesn't seem to be for our build. So basically, I
>>> have
>>> no clue whatsoever why disabling compression fixes the bug.
>>>
>> At least I don't feel alone :-)
>>
>> I can see next week if we can make a build with slz instead of zlib (we
>>> seem
>>> to be linked against zlib/libz atm).
>>>
>> Thank you, I appreciate it!
>>
>> Cheers,
>> Willy
>>
>
>
>


Re: Haproxy 1.8.4 crashing workers and increased memory usage

2018-04-09 Thread Robin Geuze

Hey Willy,

So I made a build this morning with libslz and re-enabled compression 
and within an hour we had the exit code 134 errors, so zlib does not 
seem to be the problem here.


Regards,

Robin Geuze


On 4/7/2018 00:30, Willy Tarreau wrote:

Hi Robin,

On Fri, Apr 06, 2018 at 03:52:33PM +0200, Robin Geuze wrote:

Hey Willy,

I was actually the one that had the hunch to disable compression. I
suspected that this was the issue because there was a bunch of "abort" calls
in include/common/hathreads.h" which is used by the compression stuff.
However I just noticed those aborts are actually only there if DEBUG_THREAD
is defined which it doesn't seem to be for our build. So basically, I have
no clue whatsoever why disabling compression fixes the bug.

At least I don't feel alone :-)


I can see next week if we can make a build with slz instead of zlib (we seem
to be linked against zlib/libz atm).

Thank you, I appreciate it!

Cheers,
Willy





Re: Haproxy 1.8.4 crashing workers and increased memory usage

2018-04-06 Thread Willy Tarreau
Hi Robin,

On Fri, Apr 06, 2018 at 03:52:33PM +0200, Robin Geuze wrote:
> Hey Willy,
> 
> I was actually the one that had the hunch to disable compression. I
> suspected that this was the issue because there was a bunch of "abort" calls
> in include/common/hathreads.h" which is used by the compression stuff.
> However I just noticed those aborts are actually only there if DEBUG_THREAD
> is defined which it doesn't seem to be for our build. So basically, I have
> no clue whatsoever why disabling compression fixes the bug.

At least I don't feel alone :-)

> I can see next week if we can make a build with slz instead of zlib (we seem
> to be linked against zlib/libz atm).

Thank you, I appreciate it!

Cheers,
Willy



Re: Haproxy 1.8.4 crashing workers and increased memory usage

2018-04-06 Thread Robin Geuze

Hey Willy,

I was actually the one that had the hunch to disable compression. I 
suspected that this was the issue because there was a bunch of "abort" 
calls in "include/common/hathreads.h", which is used by the compression 
stuff. However I just noticed those aborts are actually only there if 
DEBUG_THREAD is defined, which it doesn't seem to be for our build. So 
basically, I have no clue whatsoever why disabling compression fixes the 
bug.


I can see next week if we can make a build with slz instead of zlib (we 
seem to be linked against zlib/libz atm).


Regards,

Robin Geuze


On 4/6/2018 14:18, Willy Tarreau wrote:

Hi Frank,

On Fri, Apr 06, 2018 at 10:53:36AM +, Frank Schreuder wrote:

We tested haproxy 1.8.6 with compression enabled today, within the first few 
hours it already went wrong:
[ALERT] 095/120526 (12989) : Current worker 5241 exited with code 134

OK thanks, and sorry for that.


Our other balancer running haproxy 1.8.5 with compression disabled is still
running fine after 2 days with the same workload.
So there seems to be a locking issue when compression is enabled.

Well, an issue with compression, but I'm really not seeing what makes
you speak about locking since :
   - you don't seem to have threads enabled
   - locking issues generally cause deadlocks, not aborts

The other problem is that we noticed already that there are very few
abort() calls in haproxy and none of them in this area. So it's very
possible that it comes from another layer detecting an issue provoked
by compression. Typically the libc's malloc/free can stop the program
using abort() if they detect a corruption.

It would really help to know where this abort() happens, at least to
get a backtrace.

By the way, are you using zlib or slz ? zlib uses a tricky allocator.
I checked it again yesterday and it was made thread safe. But we couldn't
rule out an issue there. slz doesn't need memory however. If you're on
zlib, switching to slz could also indicate if the problem is related to
these memory allocations or not.

Thanks,
Willy






Re: Haproxy 1.8.4 crashing workers and increased memory usage

2018-04-06 Thread Willy Tarreau
Hi Frank,

On Fri, Apr 06, 2018 at 10:53:36AM +, Frank Schreuder wrote:
> We tested haproxy 1.8.6 with compression enabled today, within the first few 
> hours it already went wrong:
> [ALERT] 095/120526 (12989) : Current worker 5241 exited with code 134

OK thanks, and sorry for that.

> Our other balancer running haproxy 1.8.5 with compression disabled is still
> running fine after 2 days with the same workload.
> So there seems to be a locking issue when compression is enabled.

Well, an issue with compression, but I'm really not seeing what makes
you speak about locking since :
  - you don't seem to have threads enabled
  - locking issues generally cause deadlocks, not aborts

The other problem is that we noticed already that there are very few
abort() calls in haproxy and none of them in this area. So it's very
possible that it comes from another layer detecting an issue provoked
by compression. Typically the libc's malloc/free can stop the program
using abort() if they detect a corruption.

It would really help to know where this abort() happens, at least to
get a backtrace.

By the way, are you using zlib or slz ? zlib uses a tricky allocator.
I checked it again yesterday and it was made thread safe. But we couldn't
rule out an issue there. slz doesn't need memory however. If you're on
zlib, switching to slz could also indicate if the problem is related to
these memory allocations or not.
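
For reference, a possible build sketch for such a test (the USE_SLZ/SLZ_INC/SLZ_LIB Makefile options and the /opt/slz path are assumptions, adjust to your environment):

  # current build, presumably something like:
  make TARGET=linux2628 USE_OPENSSL=1 USE_ZLIB=1
  # rebuild against libslz instead of zlib:
  make TARGET=linux2628 USE_OPENSSL=1 USE_SLZ=1 SLZ_INC=/opt/slz/include SLZ_LIB=/opt/slz/lib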

Thanks,
Willy



Re: Haproxy 1.8.4 crashing workers and increased memory usage

2018-04-06 Thread Frank Schreuder
Hi Willy,

>>  There are very few abort() calls in the code :
>>    - some in the thread debugging code to detect recursive locks ;
>>    - one in the cache applet which triggers on an impossible case very
>>      likely resulting from cache corruption (hence a bug)
>>    - a few inside the Lua library
>>    - a few in the HPACK decompressor, detecting a few possible bugs there
>> 
>> After playing around with some config changes we managed to not have haproxy
>> throw the "worker  exited with code 134" error for at least a day. Which
>> is a long time as before we had this error at least 5 times a day...
>
> Great!
>
>> The line we removed from our config to get this result was:
>> compression algo gzip
>
> Hmmm interesting.
>
>> Could it be a locking issue in the compression code? I'm going to run a few
>> more days without compression enabled, but for now this looks promising!
>
> In fact, the locking is totally disabled when not using compression, so
> it cannot be an option. Also, most of the recently fixed bugs may only
> be triggered with H2 or threads, none of which you're using. I rechecked
> the compression code to try to spot anything obvious, but nothing popped
> out :-/
>
> All I can strongly recommend if you retry with compression enabled is to
> do it with latest 1.8 release. I'm currently checking that I didn't miss
> anything to issue 1.8.6 hopefully today. If it still dies, this will at
> least rule out the possible side effects of a few of the bugs we've fixed
> since, all of which were really tricky.

We tested haproxy 1.8.6 with compression enabled today; within the first few 
hours it already went wrong:
[ALERT] 095/120526 (12989) : Current worker 5241 exited with code 134

Our other balancer running haproxy 1.8.5 with compression disabled is still 
running fine after 2 days with the same workload.
So there seems to be a locking issue when compression is enabled.

Thanks,
Frank


Re: Haproxy 1.8.4 crashing workers and increased memory usage

2018-04-05 Thread Willy Tarreau
Hi Frank,

On Thu, Apr 05, 2018 at 09:41:25AM +, Frank Schreuder wrote:
> Hi Willy
> 
>  There are very few abort() calls in the code :
>    - some in the thread debugging code to detect recursive locks ;
>    - one in the cache applet which triggers on an impossible case very
>      likely resulting from cache corruption (hence a bug)
>    - a few inside the Lua library
>    - a few in the HPACK decompressor, detecting a few possible bugs there
> 
> After playing around with some config changes we managed to not have haproxy
> throw the "worker  exited with code 134" error for at least a day. Which
> is a long time as before we had this error at least 5 times a day...

Great!

> The line we removed from our config to get this result was:
> compression algo gzip

Hmmm interesting.

> Could it be a locking issue in the compression code? I'm going to run a few
> more days without compression enabled, but for now this looks promising!

In fact, the locking is totally disabled when not using compression, so
it cannot be an option. Also, most of the recently fixed bugs may only
be triggered with H2 or threads, none of which you're using. I rechecked
the compression code to try to spot anything obvious, but nothing popped
out :-/

All I can strongly recommend if you retry with compression enabled is to
do it with latest 1.8 release. I'm currently checking that I didn't miss
anything to issue 1.8.6 hopefully today. If it still dies, this will at
least rule out the possible side effects of a few of the bugs we've fixed
since, all of which were really tricky.

Cheers,
Willy



Re: Haproxy 1.8.4 crashing workers and increased memory usage

2018-04-05 Thread Frank Schreuder
Hi Willy

 There are very few abort() calls in the code :
   - some in the thread debugging code to detect recursive locks ;
   - one in the cache applet which triggers on an impossible case very
     likely resulting from cache corruption (hence a bug)
   - a few inside the Lua library
   - a few in the HPACK decompressor, detecting a few possible bugs there

After playing around with some config changes we managed to not have haproxy 
throw the "worker  exited with code 134" error for at least a day, which 
is a long time, as before we had this error at least 5 times a day...

The line we removed from our config to get this result was:
compression algo gzip

Could it be a locking issue in the compression code? I'm going to run a few 
more days without compression enabled, but for now this looks promising!

Thanks,
Frank



Re: Haproxy 1.8.4 crashing workers and increased memory usage

2018-02-28 Thread Johan Hendriks


On 23/02/2018 at 13:10, Frank Schreuder wrote:
> Hi Willy,
>
>>>> A few more things on the core dumps :
>>>>   - they are ignored if you have a chroot statement in the global section
>>>>   - you need not to use "user/uid/group/gid" otherwise the system also
>>>>     disables core dumps
>>> I'm using chroot and user/group in my config, so I'm not able to share core 
>>> dumps.
>> Well, if at least you can attach gdb to a process hoping to see it stop and 
>> emit
>> "bt full" to see the whole backtrace, it will help a lot.
> I will try to get a backtrace but this can take a while. I'm running with 7 
> processes which respawn every few minutes. Crashing workers only happen every 
> few hours at random moments. So I need some luck and timing here...
>
>>>> There are very few abort() calls in the code :
>>>>   - some in the thread debugging code to detect recursive locks ;
>>>>   - one in the cache applet which triggers on an impossible case very
>>>>     likely resulting from cache corruption (hence a bug)
>>>>   - a few inside the Lua library
>>>>   - a few in the HPACK decompressor, detecting a few possible bugs there
>>>>
>>>> Except for Lua, all of them were added during 1.8, so depending on what the
>>>> configuration uses, there are very few possible candidates.
>>> I added my configuration in this mail. Hopefully this will narrow down the
>>> possible candidates.
>> Well, at least you don't use threads nor lua nor caching nor HTTP/2 so
>> it cannot come from any of those we have identified. It could still come
>> from openssl however.
> There are some bugfixes marked as medium in the haproxy 1.8 repository 
> related to SSL. Would it be possible that they are related to the crashes I'm 
> seeing?
>  
>>> I did some more research to the memory warnings we encounter every few days.
>>> It seems like the haproxy processes use a lot of memory. Would haproxy with
>>> nbthreads share this memory?
>> It depends. In fact, the memory will indeed be shared between threads
>> started together, but if this memory is consumed at load time and never
>> modified, it's also shared between the processes already.
>>
>>> I'm using systemd to reload haproxy for new SSL certificates every few 
>>> minutes.
>> OK. I'm seeing that you load certs from a directory in your config. Do you
>> have a high number of certs ? I'm asking because we've already seen some
>> configs eating multiple gigs of RAM with the certs because there were a lot.
> Yes I have around 40k SSL certificates in this directory and is growing over 
> time.
>
>> In your case they're loaded twice (one for the IPv4 bind line, one for the
>> IPv6). William planned to work on a way to merge all identical certs and have
>> a single instance of them when loaded multiple times, which should already
>> reduce the amount of memory consumed by this.
> I can bind ipv4 and ipv6 in the same line with:
> bind ipv4@:443,ipv6@:443 ssl crt /etc/haproxy/ssl/invalid.pem crt 
> /etc/haproxy/ssl/ crt /etc/haproxy/customer-ssl/ strict-sni backlog 65534
>
> This would also solve the "double load" issue right?
>
>>> frontend fe_http
>>>  bind ipv4@:80 backlog 65534
>>>  bind ipv6@:80 backlog 65534
>>>  bind ipv4@:443 ssl crt /etc/haproxy/ssl/invalid.pem crt 
>>> /etc/haproxy/ssl/ crt /etc/haproxy/customer-ssl/ strict-sni backlog 65534
>>>  bind ipv6@:443 ssl crt /etc/haproxy/ssl/invalid.pem crt 
>>> /etc/haproxy/ssl/ crt /etc/haproxy/customer-ssl/ strict-sni backlog 65534
>>>  bind-process 1-7
>>>  tcp-request inspect-delay 5s
>>>  tcp-request content accept if { req_ssl_hello_type 1 }
>> This one is particularly strange. I suspect it's a leftover from an old
>> configuration dating from the days where haproxy didn't support SSL,
>> because it's looking for SSL messages inside the HTTP traffic, which
>> will never be present. You can safely remove those two lines.
> We use this to guard against some attacks we have seen in the past. Setting 
> up connections without ssl handshake to use all available connections. I will 
> remove them if you are sure this no longer works.
>
>>>  option forwardfor
>>>  acl secure dst_port 443
>>>  acl is_acme_request path_beg /.well-known/acme-challenge/
>>>  reqadd X-Forwarded-Proto:\ https if secure
>>>  default_backend be_reservedpage
>>>  use_backend be_acme if is_acme_request
>>>  use_backend 
>>> %[req.fhdr(host),lower,map_dom(/etc/haproxy/domain2backend.map)]
>>>  compression algo gzip
>>>  maxconn 32000
>> In my opinion it's not a good idea to let a single frontend steal all the
>> process's connections, it will prevent you from connecting to the stats
>> page when a problem happens. You should have a slightly larger global
>> maxconn setting to avoid this.
> Yes you are right, I will fix this in my configuration.
>
>>> backend be_acme
>>>  bind-process 1
>>>  option httpchk HEAD /ping.php HTTP/1.1\r\nHost:\ **removed hostname**
>>>  option http-server-close
>>>   

Re: Haproxy 1.8.4 crashing workers and increased memory usage

2018-02-27 Thread Willy Tarreau
Hi Frank,

On Fri, Feb 23, 2018 at 12:10:13PM +, Frank Schreuder wrote:
> > Well, at least you don't use threads nor lua nor caching nor HTTP/2 so
> > it cannot come from any of those we have identified. It could still come
> > from openssl however.
> 
> There are some bugfixes marked as medium in the haproxy 1.8 repository
> related to SSL. Would it be possible that they are related to the crashes I'm
> seeing?

Unfortunately no, I'm not seeing anything which could explain this. I'm
currently working on an issue affecting 1.8 with H2 where a wrongly sized
error message could cause a leak of struct stream+struct buffer, but I'm
not seeing this impact non-H2 traffic, so that cannot be your issue either
(it only happens when the advertised content-length in the response is
lower than the actual data sent).

> > > I'm using systemd to reload haproxy for new SSL certificates every few 
> > > minutes.
> >
> > OK. I'm seeing that you load certs from a directory in your config. Do you
> > have a high number of certs ? I'm asking because we've already seen some
> > configs eating multiple gigs of RAM with the certs because there were a lot.
> 
> Yes I have around 40k SSL certificates in this directory and is growing over 
> time.

OK then that makes sense. I would love to find a way to report the amount
of RAM used by certs, it could be helpful to troubleshoot such issues.

> > In your case they're loaded twice (one for the IPv4 bind line, one for the
> > IPv6). William planned to work on a way to merge all identical certs and 
> > have
> > a single instance of them when loaded multiple times, which should already
> > reduce the amount of memory consumed by this.
> 
> I can bind ipv4 and ipv6 in the same line with:
> bind ipv4@:443,ipv6@:443 ssl crt /etc/haproxy/ssl/invalid.pem crt 
> /etc/haproxy/ssl/ crt /etc/haproxy/customer-ssl/ strict-sni backlog 65534
> 
> This would also solve the "double load" issue right?

Sure! I totally forgot about this syntax ;-)

> > > frontend fe_http
> > > bind ipv4@:80 backlog 65534
> > > bind ipv6@:80 backlog 65534
> > > bind ipv4@:443 ssl crt /etc/haproxy/ssl/invalid.pem crt 
> > >/etc/haproxy/ssl/ crt /etc/haproxy/customer-ssl/ strict-sni backlog 65534
> > > bind ipv6@:443 ssl crt /etc/haproxy/ssl/invalid.pem crt 
> > >/etc/haproxy/ssl/ crt /etc/haproxy/customer-ssl/ strict-sni backlog 65534
> > > bind-process 1-7
> > > tcp-request inspect-delay 5s
> > > tcp-request content accept if { req_ssl_hello_type 1 }
> >
> > This one is particularly strange. I suspect it's a leftover from an old
> > configuration dating from the days where haproxy didn't support SSL,
> > because it's looking for SSL messages inside the HTTP traffic, which
> > will never be present. You can safely remove those two lines.
> 
> We use this to guard against some attacks we have seen in the past. Setting
> up connections without ssl handshake to use all available connections. I will
> remove them if you are sure this no longer works.

It doesn't do what you seem to believe it does (it has never worked). It
works if you're *not* decrypting SSL. Typically you'd have a TCP listener
passing the traffic to the next layer, and enforcing this control. Then it
would work. But here it's placed the other way around: SSL is terminated
at the edge, and the clear text extracted from the SSL stream is fed through
the TCP rules. Thus these rules can never match, unless of course you're
doing a double encapsulation of SSL, which is not really the purpose here :-)
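
For the record, a minimal sketch of the layout where such a rule does work: a plain TCP listener inspects the raw client hello and only then hands the bytes to the SSL-terminating frontend (names, ports and the use of the PROXY protocol here are hypothetical):

  frontend fe_tls_guard
      mode tcp
      bind ipv4@:443
      tcp-request inspect-delay 5s
      tcp-request content accept if { req_ssl_hello_type 1 }
      default_backend be_to_ssl

  backend be_to_ssl
      mode tcp
      server local_ssl 127.0.0.1:8443 send-proxy-v2

  frontend fe_https
      bind 127.0.0.1:8443 accept-proxy ssl crt /etc/haproxy/ssl/ strict-sni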

> > Aside this I really see nothing suspicious in your configuration that could
> > justify a problem. Let's hope you can at least either catch a core or attach
> > a gdb to one of these processes.
> 
> I will let you know as soon as I'm able to get a backtrace. In the meanwhile
> I will improve and test my new configuration changes.

OK.

Cheers,
Willy



Re: Haproxy 1.8.4 crashing workers and increased memory usage

2018-02-23 Thread Frank Schreuder
Hi Willy,

> > > A few more things on the core dumps :
> > >  - they are ignored if you have a chroot statement in the global section
> > >  - you need not to use "user/uid/group/gid" otherwise the system also
> > >    disables core dumps
> > 
> > I'm using chroot and user/group in my config, so I'm not able to share core 
> > dumps.
>
> Well, if at least you can attach gdb to a process hoping to see it stop and 
> emit
> "bt full" to see the whole backtrace, it will help a lot.

I will try to get a backtrace but this can take a while. I'm running with 7 
processes which respawn every few minutes. Crashing workers only happen every 
few hours at random moments. So I need some luck and timing here...

> > > There are very few abort() calls in the code :
> > >  - some in the thread debugging code to detect recursive locks ;
> > >  - one in the cache applet which triggers on an impossible case very
> > >    likely resulting from cache corruption (hence a bug)
> > >  - a few inside the Lua library
> > >  - a few in the HPACK decompressor, detecting a few possible bugs there
> > >
> > > Except for Lua, all of them were added during 1.8, so depending on what 
> > > the
> > > configuration uses, there are very few possible candidates.
> > 
> > I added my configuration in this mail. Hopefully this will narrow down the
> > possible candidates.
>
> Well, at least you don't use threads nor lua nor caching nor HTTP/2 so
> it cannot come from any of those we have identified. It could still come
> from openssl however.

There are some bugfixes marked as medium in the haproxy 1.8 repository related 
to SSL. Would it be possible that they are related to the crashes I'm seeing?
 
> > I did some more research to the memory warnings we encounter every few days.
> > It seems like the haproxy processes use a lot of memory. Would haproxy with
> > nbthreads share this memory?
>
> It depends. In fact, the memory will indeed be shared between threads
> started together, but if this memory is consumed at load time and never
> modified, it's also shared between the processes already.
>
> > I'm using systemd to reload haproxy for new SSL certificates every few 
> > minutes.
>
> OK. I'm seeing that you load certs from a directory in your config. Do you
> have a high number of certs ? I'm asking because we've already seen some
> configs eating multiple gigs of RAM with the certs because there were a lot.

Yes, I have around 40k SSL certificates in this directory and it is growing 
over time.

> In your case they're loaded twice (one for the IPv4 bind line, one for the
> IPv6). William planned to work on a way to merge all identical certs and have
> a single instance of them when loaded multiple times, which should already
> reduce the amount of memory consumed by this.

I can bind ipv4 and ipv6 in the same line with:
bind ipv4@:443,ipv6@:443 ssl crt /etc/haproxy/ssl/invalid.pem crt 
/etc/haproxy/ssl/ crt /etc/haproxy/customer-ssl/ strict-sni backlog 65534

This would also solve the "double load" issue right?

> > frontend fe_http
> > bind ipv4@:80 backlog 65534
> > bind ipv6@:80 backlog 65534
> > bind ipv4@:443 ssl crt /etc/haproxy/ssl/invalid.pem crt 
> >/etc/haproxy/ssl/ crt /etc/haproxy/customer-ssl/ strict-sni backlog 65534
> > bind ipv6@:443 ssl crt /etc/haproxy/ssl/invalid.pem crt 
> >/etc/haproxy/ssl/ crt /etc/haproxy/customer-ssl/ strict-sni backlog 65534
> > bind-process 1-7
> > tcp-request inspect-delay 5s
> > tcp-request content accept if { req_ssl_hello_type 1 }
>
> This one is particularly strange. I suspect it's a leftover from an old
> configuration dating from the days where haproxy didn't support SSL,
> because it's looking for SSL messages inside the HTTP traffic, which
> will never be present. You can safely remove those two lines.

We use this to guard against some attacks we have seen in the past, where 
connections were set up without an SSL handshake in order to use up all 
available connections. I will remove these lines if you are sure this no 
longer works.

> > option forwardfor
> > acl secure dst_port 443
> > acl is_acme_request path_beg /.well-known/acme-challenge/
> > reqadd X-Forwarded-Proto:\ https if secure
> > default_backend be_reservedpage
> > use_backend be_acme if is_acme_request
> > use_backend 
> >%[req.fhdr(host),lower,map_dom(/etc/haproxy/domain2backend.map)]
> > compression algo gzip
> > maxconn 32000
>
> In my opinion it's not a good idea to let a single frontend steal all the
> process's connections, it will prevent you from connecting to the stats
> page when a problem happens. You should have a slightly larger global
> maxconn setting to avoid this.

Yes you are right, I will fix this in my configuration.
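
For example (hypothetical numbers), keeping the global limit slightly above the frontend limit leaves headroom for the stats socket:

  global
      maxconn 33000        # a bit above what the frontends are allowed to take

  frontend fe_http
      maxconn 32000        # this frontend can no longer exhaust the whole process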

> > backend be_acme
> > bind-process 1
> > option httpchk HEAD /ping.php HTTP/1.1\r\nHost:\ **removed hostname**
> > option http-server-close
> > option http-pretend-keepalive
>
> Same comment here as above regarding close vs keep-alive.

Yes I

Re: Haproxy 1.8.4 crashing workers and increased memory usage

2018-02-23 Thread Willy Tarreau
Hi Frank,

On Fri, Feb 23, 2018 at 10:28:15AM +, Frank Schreuder wrote:
> > A few more things on the core dumps :
> >  - they are ignored if you have a chroot statement in the global section
> >  - you need not to use "user/uid/group/gid" otherwise the system also
> >    disables core dumps
> 
> I'm using chroot and user/group in my config, so I'm not able to share core 
> dumps.

Well, if at least you can attach gdb to a process hoping to see it stop and emit
"bt full" to see the whole backtrace, it will help a lot.

> > There are very few abort() calls in the code :
> >  - some in the thread debugging code to detect recursive locks ;
> >  - one in the cache applet which triggers on an impossible case very
> >    likely resulting from cache corruption (hence a bug)
> >  - a few inside the Lua library
> >  - a few in the HPACK decompressor, detecting a few possible bugs there
> >
> > Except for Lua, all of them were added during 1.8, so depending on what the
> > configuration uses, there are very few possible candidates.
> 
> I added my configuration in this mail. Hopefully this will narrow down the
> possible candidates.

Well, at least you don't use threads nor lua nor caching nor HTTP/2 so
it cannot come from any of those we have identified. It could still come
from openssl however.

> I did some more research to the memory warnings we encounter every few days.
> It seems like the haproxy processes use a lot of memory. Would haproxy with
> nbthreads share this memory?

It depends. In fact, the memory will indeed be shared between threads
started together, but if this memory is consumed at load time and never
modified, it's also shared between the processes already.

>  1160 haproxy   20   0 1881720 1.742g   5504 S  83.9 11.5   1:53.38 haproxy
>  1045 haproxy   20   0 1880120 1.740g   5572 S  71.0 11.5   1:36.62 haproxy
>  1104 haproxy   20   0 1880376 1.741g   6084 R  64.6 11.5   1:46.29 haproxy
>  1079 haproxy   20   0 1881116 1.741g   5564 S  58.1 11.5   1:42.29 haproxy
>  1135 haproxy   20   0 1881240 1.741g   5564 S  58.1 11.5   1:49.85 haproxy
>995 haproxy   20   0 1881852 1.742g   5584 R  38.7 11.5   1:30.05 haproxy
>  1020 haproxy   20   0 1881448 1.741g   5516 S  25.8 11.5   1:32.20 haproxy
>  4926 haproxy   20   0 1881008 1.718g   2176 S   6.5 11.3   3:11.74 haproxy
>  8526 haproxy   20   0 1878032   6516   1304 S   0.0  0.0   2:10.04 haproxy
>  8529 haproxy   20   0 1880336   5208  4 S   0.0  0.0   2:34.68 haproxy
> 11530 haproxy   20   0 1878748   6556   1392 S   0.0  0.0   2:25.94 haproxy
> 26938 haproxy   20   0 1882592   6032892 S   0.0  0.0   3:56.79 haproxy
> 29577 haproxy   20   0 1880480 1.738g   3132 S   0.0 11.5   2:08.74 haproxy
> 31124 haproxy   20   0 1880776 1.740g   4284 S   0.0 11.5   2:58.84 haproxy
>   7548 root  20   0 1869896 1.731g   4456 S   0.0 11.4   1008:23 haproxy
> 
> I'm using systemd to reload haproxy for new SSL certificates every few 
> minutes.

OK. I'm seeing that you load certs from a directory in your config. Do you
have a high number of certs ? I'm asking because we've already seen some
configs eating multiple gigs of RAM with the certs because there were a lot.

In your case they're loaded twice (one for the IPv4 bind line, one for the
IPv6). William planned to work on a way to merge all identical certs and have
a single instance of them when loaded multiple times, which should already
reduce the amount of memory consumed by this.

> Configuration:
(...)
> defaults
> log global
> timeout http-request 5s
> timeout connect  2s
> timeout client   125s
> timeout server   125s
> mode http
> option dontlog-normal
> option http-server-close
  
It is very likely that you don't need this one anymore, and can improve your
server's load by using keep-alive between haproxy and the backend servers.
But that's irrelevant to your current problem.
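
If you do try it, the change is a one-liner in the defaults section (just a sketch, and unrelated to the crash itself):

  defaults
      # replace "option http-server-close" with keep-alive towards the servers
      option http-keep-alive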

(...)
> frontend fe_http
> bind ipv4@:80 backlog 65534
> bind ipv6@:80 backlog 65534
> bind ipv4@:443 ssl crt /etc/haproxy/ssl/invalid.pem crt /etc/haproxy/ssl/ 
> crt /etc/haproxy/customer-ssl/ strict-sni backlog 65534
> bind ipv6@:443 ssl crt /etc/haproxy/ssl/invalid.pem crt /etc/haproxy/ssl/ 
> crt /etc/haproxy/customer-ssl/ strict-sni backlog 65534
> bind-process 1-7
> tcp-request inspect-delay 5s
> tcp-request content accept if { req_ssl_hello_type 1 }

This one is particularly strange. I suspect it's a leftover from an old
configuration dating from the days where haproxy didn't support SSL,
because it's looking for SSL messages inside the HTTP traffic, which
will never be present. You can safely remove those two lines.

> option forwardfor
> acl secure dst_port 443
> acl is_acme_request path_beg /.well-known/acme-challenge/
> reqadd X-Forwarded-Proto:\ https if secure
> default_backend be_reservedpage
> use_backend be_acme if is_acme_request
> use_backend 
> %[req.fhdr(host),low

Re: Haproxy 1.8.4 crashing workers and increased memory usage

2018-02-23 Thread Frank Schreuder
Hi Willy and Tim,

> > >> Code 134 implies the worker was killed with SIGABRT. You could check
> > >> whether there is a core dump.
> > > 
> > > I don't have any core dumps.
> > 
> > Check whether coredumps are enabled using `ulimit -c`, often they are
> > disabled by default, because they could contain sensitive information.
> > After the next crash you should be able to retrieve a backtrace using
> > gdb. Possibly recompile haproxy with debug symbols for it to be useful.
> 
> If it happens quickly, another option might be to attach gdb to the
> process after it is started. But with multiple processes it's not very
> convenient.
> 
> A few more things on the core dumps :
>  - they are ignored if you have a chroot statement in the global section
>  - you need not to use "user/uid/group/gid" otherwise the system also
>    disables core dumps

I'm using chroot and user/group in my config, so I'm not able to share core 
dumps.

> There are very few abort() calls in the code :
>  - some in the thread debugging code to detect recursive locks ;
>  - one in the cache applet which triggers on an impossible case very
>    likely resulting from cache corruption (hence a bug)
>  - a few inside the Lua library
>  - a few in the HPACK decompressor, detecting a few possible bugs there
>
> Except for Lua, all of them were added during 1.8, so depending on what the
> configuration uses, there are very few possible candidates.

I added my configuration in this mail. Hopefully this will narrow down the 
possible candidates.

I did some more research into the memory warnings we encounter every few days. It 
seems like the haproxy processes use a lot of memory. Would haproxy with 
nbthread share this memory?

 1160 haproxy   20   0 1881720 1.742g   5504 S  83.9 11.5   1:53.38 haproxy
 1045 haproxy   20   0 1880120 1.740g   5572 S  71.0 11.5   1:36.62 haproxy
 1104 haproxy   20   0 1880376 1.741g   6084 R  64.6 11.5   1:46.29 haproxy
 1079 haproxy   20   0 1881116 1.741g   5564 S  58.1 11.5   1:42.29 haproxy
 1135 haproxy   20   0 1881240 1.741g   5564 S  58.1 11.5   1:49.85 haproxy
   995 haproxy   20   0 1881852 1.742g   5584 R  38.7 11.5   1:30.05 haproxy
 1020 haproxy   20   0 1881448 1.741g   5516 S  25.8 11.5   1:32.20 haproxy
 4926 haproxy   20   0 1881008 1.718g   2176 S   6.5 11.3   3:11.74 haproxy
 8526 haproxy   20   0 1878032   6516   1304 S   0.0  0.0   2:10.04 haproxy
 8529 haproxy   20   0 1880336   5208  4 S   0.0  0.0   2:34.68 haproxy
11530 haproxy   20   0 1878748   6556   1392 S   0.0  0.0   2:25.94 haproxy
26938 haproxy   20   0 1882592   6032892 S   0.0  0.0   3:56.79 haproxy
29577 haproxy   20   0 1880480 1.738g   3132 S   0.0 11.5   2:08.74 haproxy
31124 haproxy   20   0 1880776 1.740g   4284 S   0.0 11.5   2:58.84 haproxy
  7548 root  20   0 1869896 1.731g   4456 S   0.0 11.4   1008:23 haproxy

I'm using systemd to reload haproxy for new SSL certificates every few minutes.

[Service]
Environment=CONFIG=/etc/haproxy/haproxy.cfg
EnvironmentFile=-/etc/default/haproxy
ExecStartPre=/usr/sbin/haproxy -f ${CONFIG} -c -q
ExecStart=/usr/sbin/haproxy -Ws -f ${CONFIG} -p /run/haproxy.pid $EXTRAOPTS
ExecReload=/usr/sbin/haproxy -c -f ${CONFIG}
ExecReload=/bin/kill -USR2 $MAINPID
KillMode=mixed
Restart=always


Configuration:
global
log **removed hostname** syslog
maxconn 32000
ulimit-n 65536
tune.maxrewrite 2048
user haproxy
group haproxy
daemon
chroot /var/lib/haproxy
nbproc 7
maxcompcpuusage 85
spread-checks 0
ssl-default-bind-options no-sslv3
stats socket /var/run/haproxy.sock mode 400 level admin process 1
stats socket /var/run/haproxy.sock.2 mode 400 level admin process 2
stats socket /var/run/haproxy.sock.3 mode 400 level admin process 3
stats socket /var/run/haproxy.sock.4 mode 400 level admin process 4
stats socket /var/run/haproxy.sock.5 mode 400 level admin process 5
stats socket /var/run/haproxy.sock.6 mode 400 level admin process 6
stats socket /var/run/haproxy.sock.7 mode 400 level admin process 7
master-worker no-exit-on-failure

defaults
log global
timeout http-request 5s
timeout connect  2s
timeout client   125s
timeout server   125s
mode http
option dontlog-normal
option http-server-close
option tcp-smart-connect

frontend fe_http
bind ipv4@:80 backlog 65534
bind ipv6@:80 backlog 65534
bind ipv4@:443 ssl crt /etc/haproxy/ssl/invalid.pem crt /etc/haproxy/ssl/ 
crt /etc/haproxy/customer-ssl/ strict-sni backlog 65534
bind ipv6@:443 ssl crt /etc/haproxy/ssl/invalid.pem crt /etc/haproxy/ssl/ 
crt /etc/haproxy/customer-ssl/ strict-sni backlog 65534
bind-process 1-7
tcp-request inspect-delay 5s
tcp-request content accept if { req_ssl_hello_type 1 }
option forwardfor
acl secure dst_port 443
acl is_acme_request path_beg /.well-known/acme-challenge/
reqadd X-Forwarded-Proto:\ https if secure
default_backend be_reservedpage

Re: Haproxy 1.8.4 crashing workers and increased memory usage

2018-02-22 Thread Willy Tarreau
Hi guys,

On Thu, Feb 22, 2018 at 04:20:07PM +0100, Tim Düsterhus wrote:
> Frank,
> 
> Am 22.02.2018 um 15:33 schrieb Frank Schreuder:
> >> Code 134 implies the worker was killed with SIGABRT. You could check
> >> whether there is a core dump.
> > 
> > I don't have any core dumps.
> 
> Check whether coredumps are enabled using `ulimit -c`, often they are
> disabled by default, because they could contain sensitive information.
> After the next crash you should be able to retrieve a backtrace using
> gdb. Possibly recompile haproxy with debug symbols for it to be useful.

If it happens quickly, another option might be to attach gdb to the
process after it is started. But with multiple processes it's not very
convenient.

A few more things on the core dumps :
  - they are ignored if you have a chroot statement in the global section
  - you must not use "user/uid/group/gid", otherwise the system also
    disables core dumps

There are very few abort() calls in the code :
  - some in the thread debugging code to detect recursive locks ;
  - one in the cache applet which triggers on an impossible case very
likely resulting from cache corruption (hence a bug)
  - a few inside the Lua library
  - a few in the HPACK decompressor, detecting a few possible bugs there

Except for Lua, all of them were added during 1.8, so depending on what the
configuration uses, there are very few possible candidates.

Cheers,
Willy



Re: Haproxy 1.8.4 crashing workers and increased memory usage

2018-02-22 Thread Tim Düsterhus
Frank,

Am 22.02.2018 um 15:33 schrieb Frank Schreuder:
>> Code 134 implies the worker was killed with SIGABRT. You could check
>> whether there is a core dump.
> 
> I don't have any core dumps.

Check whether coredumps are enabled using `ulimit -c`, often they are
disabled by default, because they could contain sensitive information.
After the next crash you should be able to retrieve a backtrace using
gdb. Possibly recompile haproxy with debug symbols for it to be useful.
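
A rough outline (assuming haproxy is started from a shell and the core ends up in the working directory; kernel.core_pattern may put it elsewhere, and chroot/user settings can suppress it entirely):

  ulimit -c unlimited            # enable core dumps before starting haproxy
  # ... after the next crash ...
  gdb /usr/sbin/haproxy core     # load the core file together with the binary
  (gdb) bt full                  # full backtrace of the aborting worker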

Best regards
Tim Düsterhus



Re: Haproxy 1.8.4 crashing workers and increased memory usage

2018-02-22 Thread Frank Schreuder
Hi Tim,

>> I'm running haproxy 1.8.4 with a heavy work load.
>> For some reason some workers die every now and then with the following error 
>> in the log:
>> Feb 22 05:00:42 hostname haproxy[9950]: [ALERT] 052/045759 (9950) : Current 
>> worker 3569 exited with code 134
>>
>
> Code 134 implies the worker was killed with SIGABRT. You could check
> whether there is a core dump.

I don't have any core dumps.

> When grepping through the code I notice one abort() in cache.c that
> could possibly be executed in production. Are you using the new cache
> that was added in haproxy 1.8?

I just checked my configuration file to confirm that we don't use the new cache.

What I do notice is that we use nbproc 7 with 7 stats sockets. In haproxy 1.8, 
nbthread becomes available but is highly experimental according to the 
documentation. Would it be possible that this solves any of my issues?
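
For reference, a minimal sketch of what the global section could look like with threads instead of processes (it assumes a build with USE_THREAD=1; whether this would help with the crashes is exactly the open question):

  global
      nbthread 7
      stats socket /var/run/haproxy.sock mode 400 level admin
      master-worker no-exit-on-failure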

Thanks,
Frank 





Re: Haproxy 1.8.4 crashing workers and increased memory usage

2018-02-22 Thread Tim Düsterhus
Frank,

Am 22.02.2018 um 13:00 schrieb Frank Schreuder:
> I'm running haproxy 1.8.4 with a heavy work load.
> For some reason some workers die every now and then with the following error 
> in the log:
> Feb 22 05:00:42 hostname haproxy[9950]: [ALERT] 052/045759 (9950) : Current 
> worker 3569 exited with code 134


Code 134 implies the worker was killed with SIGABRT. You could check
whether there is a core dump.

When grepping through the code I notice one abort() in cache.c that
could possibly be executed in production. Are you using the new cache
that was added in haproxy 1.8?
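
For example, from a haproxy 1.8 source checkout:

  grep -rn "abort()" src/ include/    # list the abort() call sites in the tree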

Best regards
Tim Düsterhus



Haproxy 1.8.4 crashing workers and increased memory usage

2018-02-22 Thread Frank Schreuder
Hi all,

I'm running haproxy 1.8.4 with a heavy workload.
For some reason some workers die every now and then with the following error in 
the log:
Feb 22 05:00:42 hostname haproxy[9950]: [ALERT] 052/045759 (9950) : Current 
worker 3569 exited with code 134

We never saw this behavior on haproxy 1.7.9; is this a known issue? I'm not able 
to reproduce it as I don't know which request causes this issue.
I'm using "master-worker no-exit-on-failure" now to lower the impact; otherwise 
it would kill all workers, and a restart of haproxy takes some time with a lot 
of SSL certificates.

Another issue I noticed is a much higher memory usage on 1.8.4 compared to 
1.7.9. Maybe growing memory over time is a side effect of seamless reloads?

Thanks,
Frank