Hi,

On Thu, May 18, 2017 at 05:25:26PM -0400, Patrick Hemmer wrote:
> So we had an incident today where haproxy segfaulted and our site went
> down. Unfortunately we did not capture a core, and the segfault message
> logged to dmesg just showed it inside libc. So there's likely not much
> we can do here. We'll be making changes to ensure we capture a core in
> the future.
> 
> However the issue I am reporting that is reproducible (on version 1.7.5)
> is that haproxy did not auto restart, which would have minimized the
> downtime to the site. We use nbproc > 1, so we have multiple haproxy
> processes running, and when one of them dies, neither the
> "haproxy-master" process or the "haproxy-systemd-wrapper" process exits,
> which prevents systemd from starting the service back up.
> 

In fact the systemd wrapper was designed as a hack to allow a 'systemd reload'.
Without it, systemd thought that everything had crashed.

This feature didn't involve any big architectural change: the systemd wrapper is
not aware at all of which haproxy processes are started, and the haproxy-master
process only does a waitpid() once it has started.

In future versions of HAProxy, the systemd-wrapper will disappear and be merged
into the haproxy binary itself, giving a single binary that can work in a
master-worker mode.

> While I think this behavior would be fine, a possible alternative would
> be for the "haproxy-master" process to restart the dead worker without
> having to kill all the other processes.

It's not easy to relaunch only one process once everything is parsed and ready,
because a lot of data has already been freed. The master has already lost the
configuration at this point, so I don't see how we could implement this
properly.

> 
> Another possible action would be to leave the workers running, but
> signal them to stop accepting new connections, and then let the
> "haproxy-master" exit so systemd will restart it.
> 

Something feasible would be a configuration keyword which allows the master to
reload when the number of active processes falls below a given value.

But it could be dangerous: for example, if you have a RAM problem on your
server and the HAProxy processes keep being killed by the OOM killer, it would
reload every time, in an infinite loop. So I don't know whether that's a good
idea.

And you don't want your stats to disappear because of a reload you didn't ask
for.


> But in any case, I think we need some way of handling this so that site
> interruption is minimal.
> 
> -Patrick

-- 
William Lallemand
