On Sun, Feb 01, 2015 at 08:25:24AM +0100, Pavlos Parissis wrote:
> If I understood Bhaskar's suggestion correctly, we could delegate health
> check for backend servers to a single server which does all the health
> checking. Am I right?

Yes that was the idea.

> If this is the case, then the downside of multiple
> health checks when nbproc > 1 is gone! But I would like to see a
> fall-back mechanism, as we have with the agent check, in case that single
> server is gone. Alternatively, we could have Bhaskar's suggestion
> implemented in the agent check.

... or you can use a local proxy which load-balances between multiple
servers.
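
For illustration only, such a local proxy could be expressed in haproxy's
own configuration language. This is a minimal sketch, not something we
shipped; the addresses, port and backend names are all assumptions:

```
# Local proxy on each LB node: check traffic sent to 127.0.0.1:9999
# is load-balanced across a pool of health-checker servers, so losing
# one checker does not lose the check service.
listen check-proxy
    bind 127.0.0.1:9999
    mode http
    balance roundrobin
    server checker1 10.0.0.11:80 check
    server checker2 10.0.0.12:80 check
```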

> I am re-heating the request to delegate health checks to a central
> service with a fall-back mechanism in place because it:
> * Reduces checks in setups where you have servers in multiple backends
> * Reduces checks in setups where you have more than one active HAProxy
> server (HAProxy servers behind a Layer 4 load balancer - ECMP etc.)
> * Reduces checks when the multi-process model is used
> * Reduces CPU stress on firewalls, when they are present between HAProxy
> and the backend servers.

Absolutely. And keeps state across reloads, and ensures that all LBs have
the same view of the service when servers are flapping.

> This assumes that there are enough resources on the 'health-checker'
> server to sustain a huge amount of requests. Which is not a big deal if
> the 'health-checker' solution is designed correctly, meaning that backend
> servers push their availability to that 'health-checker' server, etc.
> Furthermore, the 'health-checker' server should have a check in place to
> detect backend servers not sending their health status and declare them
> down after a certain period of inactivity.

We used to work on exactly such a design a few years ago at HAPTech, and
the principle was for it to be a cache for health checks. That provided
all the benefits of what you mentioned above, including a more consistent
state between LBs when servers are flapping. The idea is that each check
result is associated with a maxage, and any check received while the last
result's age has not maxed out is served from the cache. It happens that
all the stuff added to health checks since then has complicated things
significantly (eg: capture of the last response, sending of the local
info, etc). We've more or less abandoned that work for lack of time and
the need for a redesign. So I'd say that the design is far from obvious,
but the gains to expect are very important. Such a checker should also be
responsible for notifications, and possibly for aggregating states before
returning composite statuses (though that may be one point to reconsider
in the future to limit complexity).
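
To make the maxage idea concrete, here is a minimal sketch in Python of
the caching principle described above. It is not code from that project;
the names (CheckCache, probe) and the monotonic-clock choice are my own
illustration:

```python
import time

class CheckCache:
    """Cache of health-check results, each valid for at most `maxage`.

    A check requested while the last result's age has not maxed out is
    answered from the cache; only an expired entry triggers a real probe,
    so N load balancers asking about the same server cost one check.
    """

    def __init__(self, probe, maxage=2.0):
        self.probe = probe      # callable: server -> bool (the real check)
        self.maxage = maxage    # seconds a cached result stays valid
        self.cache = {}         # server -> (result, timestamp)

    def check(self, server, now=None):
        now = time.monotonic() if now is None else now
        entry = self.cache.get(server)
        if entry is not None and now - entry[1] < self.maxage:
            return entry[0]     # still fresh: serve from the cache
        result = self.probe(server)  # aged out: run the real check once
        self.cache[server] = (result, now)
        return result
```

With maxage set a bit below the LBs' inter-check interval, each backend
server sees roughly one probe per period regardless of how many LBs or
processes are asking.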

> In the case of servers located across multiple vlans, there is an edge
> case where backend servers are reported as healthy but HAProxy fails to
> send traffic to them due to missing network routes, firewall holes, etc.

That's less of an issue, because in general you want a check to fail if
haproxy is not able to reach the target server, whatever the reason. It
has happened to me a few times to discover that a server was suddenly
marked down on a backup LB because someone had changed a firewall rule,
and that was of critical importance, since it told me that switching to
the backup LB would have prevented that service from working anymore.

> The main gain of this solution is that you make backend servers
> responsible for announcing their availability. It is a mindset change,
> as we are used to having LBs perform the health checks and be the
> authoritative source of such information.

It's not necessarily a mindset change, it's a different way of knowing
whether servers are up or down.

Regards,
Willy

