So with the prevalence of issues lately where haproxy goes unresponsive and consumes 100% CPU, I wanted to see what the thoughts are on implementing systemd watchdog functionality.

In our case, haproxy going unresponsive is extremely problematic, as our clustering software (pacemaker+systemd) sees the service as still running and doesn't realize it needs to restart it or fail over. We could look into implementing some sort of custom check resource in pacemaker, but before going down that route I wanted to explore the systemd watchdog functionality.


The watchdog is driven by periodically sending "WATCHDOG=1" on the systemd notification socket. However, there are a few different ways I can see this being implemented.

We could put this in the master control process, but that only tells us whether the master is functioning, not the workers, which are what really matter.
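
As a point of reference, the master-only variant is basically just the loop below. This is only a minimal sketch, assuming the unit runs with Type=notify and WatchdogSec= set and that we link against libsystemd's sd-daemon API (the same protocol could also be spoken by hand over the NOTIFY_SOCKET datagram socket):

    /* Minimal sketch of a master-only watchdog loop.
     * The unit file would need something like:
     *   [Service]
     *   Type=notify
     *   WatchdogSec=10s
     */
    #include <systemd/sd-daemon.h>
    #include <stdint.h>
    #include <unistd.h>

    static void watchdog_loop(void)
    {
        uint64_t usec = 0;

        /* Returns > 0 and fills 'usec' when WatchdogSec= is set for this unit. */
        if (sd_watchdog_enabled(0, &usec) <= 0)
            return;

        for (;;) {
            /* Ping well within the timeout; half the interval is the usual rule. */
            sd_notify(0, "WATCHDOG=1");
            usleep(usec / 2);
        }
    }

In practice this wouldn't be a blocking loop of its own but a periodic task hooked into the master's existing wakeups; the loop just shows the protocol.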

So the next thought would be for all of the workers to listen on a shared socket. The master would periodically send a request to that socket and, as long as it gets a response, ping the watchdog. This tells us that there is at least one worker able to accept traffic.
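
A rough sketch of what that could look like from the master's side. The socket path and the one-byte ping/pong exchange here are made up purely for illustration, not anything haproxy currently has:

    /* Sketch: master probes a shared worker socket and pings the systemd
     * watchdog only if at least one worker answers. The socket path and
     * the 1-byte ping/pong protocol are hypothetical. */
    #include <systemd/sd-daemon.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <sys/un.h>
    #include <string.h>
    #include <unistd.h>

    static int any_worker_alive(const char *path)
    {
        struct sockaddr_un sun;
        struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };
        char pong;
        int fd, ok = 0;

        fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0)
            return 0;

        memset(&sun, 0, sizeof(sun));
        sun.sun_family = AF_UNIX;
        strncpy(sun.sun_path, path, sizeof(sun.sun_path) - 1);

        /* Don't hang the master if no worker ever answers. */
        setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

        /* Whichever worker accept()s the connection answers the ping. */
        if (connect(fd, (struct sockaddr *)&sun, sizeof(sun)) == 0 &&
            write(fd, "P", 1) == 1 &&
            read(fd, &pong, 1) == 1)
            ok = 1;

        close(fd);
        return ok;
    }

    /* Called from the master's periodic wakeup. */
    static void watchdog_tick(void)
    {
        if (any_worker_alive("/var/run/haproxy-health.sock"))
            sd_notify(0, "WATCHDOG=1");
    }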

However, if a frontend is bound to a specific worker and that worker hangs, the frontend would be non-responsive, yet the watchdog wouldn't restart the service, since some other worker still answers. To catch that, the master would have to send a request to each worker separately and require a response from all of them before it pings the watchdog. This would be better at detecting issues, but people who aren't using any bound-to-process frontends can tolerate the failure of a single worker and might prefer to schedule a restart/reload at a less impactful time.
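
In code terms the stricter variant is the same probe as above, just repeated once per worker with a knob for how strict to be. Another hedged sketch, where worker_alive() and nb_workers are hypothetical stand-ins for a per-worker version of the check:

    /* Sketch: probe each worker on its own socket; only ping the watchdog
     * if the required number of workers answer. */
    #include <systemd/sd-daemon.h>
    #include <stdbool.h>

    extern bool worker_alive(int worker_id);   /* hypothetical per-worker probe */
    extern int  nb_workers;                    /* number of worker processes */

    static void watchdog_tick_all(bool require_all)
    {
        int alive = 0;

        for (int id = 0; id < nb_workers; id++)
            if (worker_alive(id))
                alive++;

        /* require_all == true  : any single stuck worker blocks the ping,
         *                        so systemd eventually restarts the service.
         * require_all == false : one live worker is enough; a stuck worker
         *                        can be dealt with at a less impactful time. */
        if ((require_all && alive == nb_workers) ||
            (!require_all && alive > 0))
            sd_notify(0, "WATCHDOG=1");
    }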

The last idea would be to have the watchdog watch only the master, and the master watch the workers in turn. If a worker stops responding, the master would restart just that one worker.
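
For that last option the systemd side stays trivial (the master pings the watchdog as long as it itself is healthy), and the per-worker handling would look roughly like the sketch below. Again, worker_alive(), worker_pid() and respawn_worker() are hypothetical stand-ins; haproxy's master already has its own machinery for managing workers:

    /* Sketch: the master pings the watchdog for itself, and separately
     * kills and respawns any worker that stops answering its probe. */
    #include <systemd/sd-daemon.h>
    #include <sys/types.h>
    #include <signal.h>
    #include <stdbool.h>

    extern bool  worker_alive(int worker_id);      /* hypothetical probe */
    extern pid_t worker_pid(int worker_id);        /* hypothetical lookup */
    extern void  respawn_worker(int worker_id);    /* hypothetical respawn */
    extern int   nb_workers;

    static void supervise_tick(void)
    {
        /* The watchdog only covers the master itself in this model. */
        sd_notify(0, "WATCHDOG=1");

        for (int id = 0; id < nb_workers; id++) {
            if (!worker_alive(id)) {
                kill(worker_pid(id), SIGTERM);
                respawn_worker(id);
            }
        }
    }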


Any thoughts on the matter? Or do we not want to do this at all, and instead rely on a custom check in the cluster management software?

-Patrick
