haproxy doesn't restart after segfault on systemd

2017-05-18 Thread Patrick Hemmer
So we had an incident today where haproxy segfaulted and our site went
down. Unfortunately we did not capture a core, and the segfault message
logged to dmesg just showed it inside libc. So there's likely not much
we can do here. We'll be making changes to ensure we capture a core in
the future.
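
(For reference, a minimal sketch of the core-capture setup we plan to put
in place; the unit name and paths are illustrative, not our real ones:)

    # systemd drop-in so the haproxy processes may dump core
    # /etc/systemd/system/haproxy.service.d/core.conf
    [Service]
    LimitCORE=infinity

    # kernel side: write cores somewhere predictable
    sysctl -w kernel.core_pattern=/var/crash/core.%e.%p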

However, the issue I am reporting, which is reproducible (on version
1.7.5), is that haproxy did not restart automatically, which would have
minimized the downtime. We use nbproc > 1, so we have multiple haproxy
processes running, and when one of them dies, neither the
"haproxy-master" process nor the "haproxy-systemd-wrapper" process exits,
which prevents systemd from restarting the service.
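
(For context, a stripped-down sketch of the relevant part of our
configuration; the values are illustrative:)

    # haproxy.cfg (illustrative excerpt)
    global
        nbproc 4    # several worker processes; when one dies, neither
                    # the master nor the wrapper exits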

While I think simply exiting so that systemd restarts the whole service
would be fine, a possible alternative would be for the "haproxy-master"
process to restart the dead worker without killing all the other
processes.

Another possible action would be to leave the workers running, but
signal them to stop accepting new connections, and then let the
"haproxy-master" exit so systemd will restart it.

But in any case, I think we need some way of handling this so that site
interruption is minimal.

-Patrick


Re: haproxy doesn't restart after segfault on systemd

2017-05-19 Thread William Lallemand
Hi,

On Thu, May 18, 2017 at 05:25:26PM -0400, Patrick Hemmer wrote:
> So we had an incident today where haproxy segfaulted and our site went
> down. Unfortunately we did not capture a core, and the segfault message
> logged to dmesg just showed it inside libc. So there's likely not much
> we can do here. We'll be making changes to ensure we capture a core in
> the future.
> 
> However, the issue I am reporting, which is reproducible (on version
> 1.7.5), is that haproxy did not restart automatically, which would have
> minimized the downtime. We use nbproc > 1, so we have multiple haproxy
> processes running, and when one of them dies, neither the
> "haproxy-master" process nor the "haproxy-systemd-wrapper" process exits,
> which prevents systemd from restarting the service.
> 
> 

In fact, the systemd wrapper was designed as a hack to allow a 'systemctl
reload'. Without it, systemd would think that everything had crashed.

There wasn't any big architectural change with this feature: the systemd
wrapper is not aware at all of which haproxy processes are started, and
the haproxy-master process only does a waitpid() once it has started.
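
To make that concrete, here is a minimal sketch in C (not the actual
HAProxy source) of the behaviour described above:

    #include <sys/wait.h>
    #include <unistd.h>

    #define NBPROC 4

    int main(void)
    {
        /* the parent forks its workers up front... */
        for (int i = 0; i < NBPROC; i++) {
            if (fork() == 0) {   /* worker */
                pause();         /* stand-in for the worker's event loop */
                _exit(0);
            }
        }
        /* ...then simply reaps children: wait() only returns -1 (ECHILD)
         * once ALL workers are gone, so the death of a single worker
         * among several never makes the parent exit. */
        while (wait(NULL) > 0)
            ;
        return 0;
    }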

In future versions of HAProxy, the systemd wrapper will disappear and be
merged into the haproxy binary, which will give a single binary that
works in a master-worker mode.
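
(For illustration only, assuming the merged mode is exposed as a
command-line switch, the way HAProxy 1.8 eventually did with -W and the
'master-worker' keyword:)

    # one binary, no wrapper: the first process becomes the master and
    # forks the workers itself (illustrative invocation)
    haproxy -W -f /etc/haproxy/haproxy.cfg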

> While I think simply exiting so that systemd restarts the whole service
> would be fine, a possible alternative would be for the "haproxy-master"
> process to restart the dead worker without killing all the other
> processes.

It's not easy to relaunch only one process once everything is parsed and
ready, because a lot of state has already been freed: the master has
already lost the configuration at that point. I don't see how we could
implement this properly.

> 
> Another possible action would be to leave the workers running, but
> signal them to stop accepting new connections, and then let the
> "haproxy-master" exit so systemd will restart it.
> 

Something feasible could be to have a configuration keyword which allows
the master to reload when the number of active processes drops below a
given value.
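
(Purely hypothetical syntax, just to make the idea concrete; no such
keyword exists today:)

    # haproxy.cfg -- 'min-active-procs' is NOT a real keyword,
    # only an illustration of the proposal above
    global
        nbproc 4
        min-active-procs 3   # reload if fewer than 3 workers survive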

But it could be dangerous: for example, if you have a RAM problem on your
server and the HAProxy processes keep getting killed by the OOM killer,
it will reload every time, in an endless loop. So I don't know whether
that's a good idea.

And you don't want your stats to disappear without having asked for a
reload yourself.


> But in any case, I think we need some way of handling this so that site
> interruption is minimal.
> 
> -Patrick

-- 
William Lallemand



Re: haproxy doesn't restart after segfault on systemd

2017-05-19 Thread William Lallemand
Hi again,

On Fri, May 19, 2017 at 11:45:18AM +0200, William Lallemand wrote:
> [...]
> 
> Something feasible could be to have a configuration keyword which allows
> the master to reload when the number of active processes drops below a
> given value.
> 
> [...]
> 
> > But in any case, I think we need some way of handling this so that site
> > interruption is minimal.
> > 

I had a discussion with Willy about that, and we thought it's not a good
idea to let HAProxy restart by itself, because it would reduce the number
of bug reports and could cause other problems.

However, we thought about another way to do it: we could add an option to
the master process which kills the master and every active process when
one of the processes gets a SIGSEGV. That will allow systemd to be
notified of the crash, and you will be able to restart everything with
it, using Restart=always for example.
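
(For example, with a unit along these lines; the ExecStart line is the
stock wrapper invocation, and the StartLimit* settings simply guard
against the crash-loop case mentioned above:)

    # haproxy.service (illustrative excerpt)
    [Unit]
    # stop retrying after 5 crashes within 30 seconds instead of
    # looping forever
    StartLimitIntervalSec=30
    StartLimitBurst=5

    [Service]
    ExecStart=/usr/sbin/haproxy-systemd-wrapper -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
    Restart=always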

-- 
William Lallemand