Re: [PATCH 5/5] dynamic health check

2013-02-04 Thread Simon Horman
On Fri, Feb 01, 2013 at 08:22:24AM +0100, Willy Tarreau wrote:
 Hi Simon,
 
 On Fri, Feb 01, 2013 at 01:56:01PM +0900, Simon Horman wrote:
  Hi Malcolm, Hi Willy,
  
  after a bit of a hiatus I'd like to restart this discussion.
 
 Cool, I wanted to ping you on this last week-end but forgot to do so !
 
  On Mon, Dec 24, 2012 at 10:23:15AM +0100, Willy Tarreau wrote:
   Hi Malcolm,
   
   On Mon, Dec 24, 2012 at 09:06:25AM +, Malcolm Turnbull wrote:
    Willy / Simon,
    
    I'm very happy to add a down option, my original thought was that you
    would use the standard health checks as well as the dynamic agent for
    changing the weight.
   
   That's what I thought I initially understood from our discussion a few
   months ago but then your post of the specs last week slightly confused
   me as I understood you needed this as a dedicated check. I think it was
   the same for Simon.
  
  Sorry, I think that the problem here lies in my understanding of what is
  desired.
 
 No problem, several of us got confused.
 
    As you may for example want a specific HAProxy SMTP health check + use
    the dynamic weighting agent.
   
   Exactly. But then we have two options :
     - retrieve the information from the checked port (easy for HTTP or TCP)
     - retrieve the information from a dedicated port; this involves a
       second task to do this, with its own check intervals.
   
   The latter doesn't seem stupid at all, quite the opposite in fact, but
   it will require more settings on the server line. However it comes with
   a benefit: when the agent returns disable, checks are disabled on the
   real port, but the agent could continue to be checked and later return
   a valid result again.
  
    I'm not sure if that would cause some coding issues if the health
    checks say 'Down' and the agent says 50%? (I would assume haproxy
    health checks take priority?)
   
   Status and weights are orthogonal. The real check should have precedence.
   
    Or if the agent says Down but the HAProxy health check says up?
   
   I think it should be ANDed. This could help provide a first implementation
   of multi-port checks after all.
  
  That sounds reasonable.
  
    I'm certainly happy for Down to be added as an option with a
    description string.
    Also I'm assuming that later (the dynamic agent) could easily be
    extended to an http style get check rather than TCP (lb-agent-chk) if
    users prefer to write an HTTP server application to integrate with it
    (Kemp and Barracuda support this method).
  
  On the topic of down: I think that Willy's proposal is
  entirely reasonable. However, it's unclear to me if disable should also
  be supported or not.
 
 The disable mode is very problematic : if a server accidentally returns it,
 there is no way to roll back except a manual intervention on the load
 balancers. Also there is a high risk that the backup LB will be forgotten
 in such an operation. I have no technical worries here, just operational
 ones. If we run agent checks on a dedicated port in parallel to health
 checks, this is different, because we could ensure that such checks could
 still be running when the server is disabled so that the agent can change
 the mode again. So maybe a first version should not support disable and a
 later one could support it ?

This seems reasonable to me.

 Also, I believe that in another thread we discussed supporting a
 new status (eg: STOPPED) which differs from DOWN in that it means the
 service was intentionally stopped and did not crash. We can't support
 this well right now (just map it to down) but I think it's important
 that people can design their agents for this. Similarly, a FAIL
 status could be useful in the usual situations where a server is inoperative
 due to external conditions but could appear valid. The common example is
 the mail server which fails to receive e-mails because the FS is full.
 Everything works except the service cannot be delivered. There is nothing
 to restart, the issue can go away by itself, etc... We'd map this to DOWN
 again, but I think some users may later prefer to have a dedicated status
 in the agent's language. So we should probably plan it in the language in
 order to avoid ugly patches here and there.

Adding stopped and fail, and mapping them both to down seems reasonable to me.
I assume that they also accept reason strings as down does.

   That's what I'm commonly observing too. Even right now, there are a lot
   of users who use httpchk for services that are not HTTP at all, but they
   have a very simple agent responding to checks.
   
   So now we have to decide what to do. I think Simon's code already provides
   some useful features (assuming we support down). It should probably be
   extended later to support combined checks.
   
   In my opinion, this could be done in three steps :
   
 1) we merge Simon's work with the option lb-agent-chk 

Re: [PATCH 5/5] dynamic health check

2013-01-31 Thread Simon Horman
Hi Malcolm, Hi Willy,

after a bit of a hiatus I'd like to restart this discussion.

On Mon, Dec 24, 2012 at 10:23:15AM +0100, Willy Tarreau wrote:
 Hi Malcolm,
 
 On Mon, Dec 24, 2012 at 09:06:25AM +, Malcolm Turnbull wrote:
  Willy / Simon,
  
  I'm very happy to add a down option, my original thought was that you
  would use the standard health checks as well as the dynamic agent for
  changing the weight.
 
 That's what I thought I initially understood from our discussion a few
 months ago but then your post of the specs last week slightly confused
 me as I understood you needed this as a dedicated check. I think it was
 the same for Simon.

Sorry, I think that the problem here lies in my understanding of what is
desired.

  As you may for example want a specific HAProxy SMTP health check + use
  the dynamic weighting agent.
 
 Exactly. But then we have two options :
   - retrieve the information from the checked port (easy for HTTP or TCP)
   - retrieve the information from a dedicated port; this involves a
     second task to do this, with its own check intervals.
 
 The latter doesn't seem stupid at all, quite the opposite in fact, but
 it will require more settings on the server line. However it comes with
 a benefit: when the agent returns disable, checks are disabled on the
 real port, but the agent could continue to be checked and later return
 a valid result again.

  I'm not sure if that would cause some coding issues if the health
  checks say 'Down' and the agent says 50%? (I would assume haproxy
  health checks take priority?)
 
 Status and weights are orthogonal. The real check should have precedence.
 
  Or if the agent says Down but the HAProxy health check says up?
 
 I think it should be ANDed. This could help provide a first implementation
 of multi-port checks after all.

That sounds reasonable.

  I'm certainly happy for Down to be added as an option with a
  description string.
  Also I'm assuming that later (the dynamic agent) could easily be
  extended to an http style get check rather than TCP (lb-agent-chk)  if
  users prefer to write an HTTP server application to integrate with it
  (Kemp and Barracuda support this method).

On the topic of down: I think that Willy's proposal is
entirely reasonable. However, it's unclear to me if disable should also
be supported or not.

 That's what I'm commonly observing too. Even right now, there are a lot
 of users who use httpchk for services that are not HTTP at all, but they
 have a very simple agent responding to checks.
 
 So now we have to decide what to do. I think Simon's code already provides
 some useful features (assuming we support down). It should probably be
 extended later to support combined checks.
 
 In my opinion, this could be done in three steps :
 
   1) we merge Simon's work with the "option lb-agent-chk" directive which
      *replaces* the health check method with this one ;
 
   2) we implement "agent-port" and "agent-interval" on the server lines to
      automatically enable the agent to be run on another port even when a
      different check is running ;
 
   3) we implement "http-check agent-hdr <name>" to retrieve the agent string
      from an HTTP header for HTTP checks ;
 
 That way we always support exactly the same syntax but can retrieve the
 required information at different places depending on the checks. Does
 that sound good to you ?

That sounds entirely reasonable to me.



Re: [PATCH 5/5] dynamic health check

2013-01-31 Thread Willy Tarreau
Hi Simon,

On Fri, Feb 01, 2013 at 01:56:01PM +0900, Simon Horman wrote:
 Hi Malcolm, Hi Willy,
 
 after a bit of a hiatus I'd like to restart this discussion.

Cool, I wanted to ping you on this last week-end but forgot to do so !

 On Mon, Dec 24, 2012 at 10:23:15AM +0100, Willy Tarreau wrote:
  Hi Malcolm,
  
  On Mon, Dec 24, 2012 at 09:06:25AM +, Malcolm Turnbull wrote:
   Willy / Simon,
   
   I'm very happy to add a down option, my original thought was that you
   would use the standard health checks as well as the dynamic agent for
   changing the weight.
  
  That's what I thought I initially understood from our discussion a few
  months ago but then your post of the specs last week slightly confused
  me as I understood you needed this as a dedicated check. I think it was
  the same for Simon.
 
 Sorry, I think that the problem here lies in my understanding of what is
 desired.

No problem, several of us got confused.

   As you may for example want a specific HAProxy SMTP health check + use
   the dynamic weighting agent.
  
  Exactly. But then we have two options :
    - retrieve the information from the checked port (easy for HTTP or TCP)
    - retrieve the information from a dedicated port; this involves a
      second task to do this, with its own check intervals.
  
  The latter doesn't seem stupid at all, quite the opposite in fact, but
  it will require more settings on the server line. However it comes with
  a benefit: when the agent returns disable, checks are disabled on the
  real port, but the agent could continue to be checked and later return
  a valid result again.
 
   I'm not sure if that would cause some coding issues if the health
   checks say 'Down' and the agent says 50%? (I would assume haproxy
   health checks take priority?)
  
  Status and weights are orthogonal. The real check should have precedence.
  
   Or if the agent says Down but the HAProxy health check says up?
  
  I think it should be ANDed. This could help provide a first implementation
  of multi-port checks after all.
 
 That sounds reasonable.
 
   I'm certainly happy for Down to be added as an option with a
   description string.
   Also I'm assuming that later (the dynamic agent) could easily be
   extended to an http style get check rather than TCP (lb-agent-chk)  if
   users prefer to write an HTTP server application to integrate with it
   (Kemp and Barracuda support this method).
 
 On the topic of down: I think that Willy's proposal is
 entirely reasonable. However, it's unclear to me if disable should also
 be supported or not.

The disable mode is very problematic : if a server accidentally returns it,
there is no way to roll back except a manual intervention on the load
balancers. Also there is a high risk that the backup LB will be forgotten
in such an operation. I have no technical worries here, just operational
ones. If we run agent checks on a dedicated port in parallel to health
checks, this is different, because we could ensure that such checks could
still be running when the server is disabled so that the agent can change
the mode again. So maybe a first version should not support disable and a
later one could support it ?
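
To make the operational point concrete, here is a toy sketch of the
"dedicated port" variant, where the agent keeps being polled while the
real-port health checks are paused. The class and method names are
hypothetical, and the rule that any reply other than "disable" leaves
maintenance mode is an assumption for illustration, not something this
thread has settled.

```python
class CheckedServer:
    """Toy model: health checks and agent checks run as separate tasks."""

    def __init__(self) -> None:
        self.maintenance = False
        self.health_checks_sent = 0
        self.agent_checks_sent = 0

    def health_check_tick(self) -> bool:
        """Probe the real port unless the server is in maintenance.

        Returns True if a probe was sent, False if it was skipped.
        """
        if self.maintenance:
            return False
        self.health_checks_sent += 1
        return True

    def agent_check_tick(self, reply: str) -> None:
        """The agent port is always polled, even during maintenance."""
        self.agent_checks_sent += 1
        # Assumption: any reply other than "disable" leaves maintenance.
        self.maintenance = reply.strip().lower() == "disable"
```

With a single combined check there would be nothing left to poll once
the server is disabled, which is exactly the roll-back problem described
above.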

Also, I believe that in another thread we discussed supporting a
new status (eg: STOPPED) which differs from DOWN in that it means the
service was intentionally stopped and did not crash. We can't support
this well right now (just map it to down) but I think it's important
that people can design their agents for this. Similarly, a FAIL
status could be useful in the usual situations where a server is inoperative
due to external conditions but could appear valid. The common example is
the mail server which fails to receive e-mails because the FS is full.
Everything works except the service cannot be delivered. There is nothing
to restart, the issue can go away by itself, etc... We'd map this to DOWN
again, but I think some users may later prefer to have a dedicated status
in the agent's language. So we should probably plan it in the language in
order to avoid ugly patches here and there.
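
A minimal sketch of what "plan it in the language, map it to DOWN for
now" could look like. The enum, the function name and the lower-case
spelling of the status words are assumptions made for this example; only
the three status names and the mapping to DOWN come from the paragraph
above.

```python
from enum import Enum
from typing import Tuple


class AgentStatus(Enum):
    """Status words an agent may send (spelling assumed for this sketch)."""
    DOWN = "down"
    STOPPED = "stopped"
    FAIL = "fail"


def map_agent_status(reply: str) -> Tuple[str, AgentStatus, str]:
    """Return (server state, agent status, reason) for a failure reply.

    Every status currently maps to the single server state "DOWN", but
    the original agent status is kept so it can be logged or given a
    dedicated state later.
    """
    word, _, reason = reply.strip().partition(" ")
    status = AgentStatus(word.lower())  # ValueError for unknown words
    return ("DOWN", status, reason.strip())
```

Keeping the agent's own word next to the mapped state means a later
version could introduce dedicated states without changing the agents.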

  That's what I'm commonly observing too. Even right now, there are a lot
  of users who use httpchk for services that are not HTTP at all, but they
  have a very simple agent responding to checks.
  
  So now we have to decide what to do. I think Simon's code already provides
  some useful features (assuming we support down). It should probably be
  extended later to support combined checks.
  
  In my opinion, this could be done in three steps :
  
    1) we merge Simon's work with the "option lb-agent-chk" directive which
       *replaces* the health check method with this one ;
  
    2) we implement "agent-port" and "agent-interval" on the server lines to
       automatically enable the agent to be run on another port even when a
       different check is running ;
  
    3) we implement "http-check agent-hdr <name>" to retrieve the agent string
  

Re: [PATCH 5/5] dynamic health check

2012-12-24 Thread Willy Tarreau
Hi Malcolm,

On Mon, Dec 24, 2012 at 09:06:25AM +, Malcolm Turnbull wrote:
 Willy / Simon,
 
 I'm very happy to add a down option, my original thought was that you
 would use the standard health checks as well as the dynamic agent for
 changing the weight.

That's what I thought I initially understood from our discussion a few
months ago but then your post of the specs last week slightly confused
me as I understood you needed this as a dedicated check. I think it was
the same for Simon.

 As you may for example want a specific HAProxy SMTP health check + use
 the dynamic weighting agent.

Exactly. But then we have two options :
  - retrieve the information from the checked port (easy for HTTP or TCP)
  - retrieve the information from a dedicated port; this involves a
    second task to do this, with its own check intervals.

The latter doesn't seem stupid at all, quite the opposite in fact, but
it will require more settings on the server line. However it comes with
a benefit: when the agent returns disable, checks are disabled on the
real port, but the agent could continue to be checked and later return
a valid result again.

 I'm not sure if that would cause some coding issues if the health
 checks say 'Down' and the agent says 50%? (I would assume haproxy
 health checks take priority?)

Status and weights are orthogonal. The real check should have precedence.

 Or if the agent says Down but the HAProxy health check says up?

I think it should be ANDed. This could help provide a first implementation
of multi-port checks after all.
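
A small sketch of the two rules just stated: status is the AND of the
real check and the agent, and the weight is computed independently of
status. The class, its field names and the use of integer division for
the percentage are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ServerView:
    health_up: bool           # result of the regular (real-port) check
    agent_up: bool            # False once the agent has reported down
    initial_weight: int       # weight configured at startup
    agent_percent: int = 100  # last percentage reported by the agent

    @property
    def is_up(self) -> bool:
        # ANDed: either check failing takes the server out of rotation.
        return self.health_up and self.agent_up

    @property
    def effective_weight(self) -> int:
        # Orthogonal to status: computed even while the server is down.
        # Rounding down is an assumption of this sketch.
        return self.initial_weight * self.agent_percent // 100
```

So a server whose health check says down while the agent says 50% stays
down, and simply comes back with half its initial weight once the health
check passes again.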

 I'm certainly happy for Down to be added as an option with a
 description string.
 Also I'm assuming that later (the dynamic agent) could easily be
 extended to an http style get check rather than TCP (lb-agent-chk)  if
 users prefer to write an HTTP server application to integrate with it
 (Kemp and Barracuda support this method).

That's what I'm commonly observing too. Even right now, there are a lot
of users who use httpchk for services that are not HTTP at all, but they
have a very simple agent responding to checks.

So now we have to decide what to do. I think Simon's code already provides
some useful features (assuming we support down). It should probably be
extended later to support combined checks.

In my opinion, this could be done in three steps :

  1) we merge Simon's work with the "option lb-agent-chk" directive which
     *replaces* the health check method with this one ;

  2) we implement "agent-port" and "agent-interval" on the server lines to
     automatically enable the agent to be run on another port even when a
     different check is running ;

  3) we implement "http-check agent-hdr <name>" to retrieve the agent string
     from an HTTP header for HTTP checks ;

That way we always support exactly the same syntax but can retrieve the
required information at different places depending on the checks. Does
that sound good to you ?
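
As a rough illustration of step 3, the sketch below pulls the agent
string out of an HTTP check response so that the same syntax can then be
parsed as in the TCP case. The function name and the header name used in
the usage note are hypothetical; no header name has been agreed in this
thread.

```python
import http.client
import io
from typing import Optional


def agent_string_from_http(raw: bytes, header_name: str) -> Optional[str]:
    """Return the agent string carried in header_name, or None if absent.

    raw is a complete HTTP response head: status line, headers and the
    terminating blank line.
    """
    fp = io.BytesIO(raw)
    fp.readline()  # skip the status line, e.g. "HTTP/1.1 200 OK"
    headers = http.client.parse_headers(fp)
    value = headers.get(header_name)  # lookup is case-insensitive
    return value.strip() if value is not None else None
```

For example, a response carrying a (hypothetical) "X-LB-Agent: 50%"
header would yield the string "50%", which is the same text a TCP agent
would have sent on its own port.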

Best regards,
Willy




Re: [PATCH 5/5] dynamic health check

2012-12-24 Thread Malcolm Turnbull
Willy.

Yes. That sounds good to me.

Thanks. And have a nice Christmas...


-- 
Regards,

Malcolm Turnbull.

Loadbalancer.org Ltd.
Phone: +44 (0)870 443 8779
http://www.loadbalancer.org/



Re: [PATCH 5/5] dynamic health check

2012-12-23 Thread Willy Tarreau
Hi Simon,

CCing Malcolm who posted the specs for the check.

On Mon, Dec 24, 2012 at 10:33:57AM +0900, Simon Horman wrote:
 Support a dynamic health check performed by opening a TCP socket to a
 pre-defined port and reading an ascii string. The string should have one of
 the following forms:
 
 i. An ascii representation of a positive integer percentage.
    e.g. "75%"
 
    Values in this format will set the weight proportional to the initial
    weight of a server as configured when haproxy starts.
 
 ii. The string "drain".
 
    This will cause the weight of a server to be set to 0, and thus it will
    not accept any new connections other than those that are accepted via
    persistence.
 
 iii. The string "disable".
 
    Put the server into maintenance mode. The server must be re-enabled
    before any further health checks will be performed.
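
For readers who have not seen the agent side, here is a minimal sketch
of such an agent and of the read performed by the checker, assuming one
ASCII line per connection followed by a close. The current_status()
hook, the trailing newline and the loopback demo are assumptions for
illustration; the patch itself only specifies "open a TCP socket and
read an ascii string".

```python
import socket
import socketserver
import threading


def current_status() -> str:
    """Hypothetical policy hook: return "75%", "drain" or "disable"."""
    return "75%"


class AgentHandler(socketserver.BaseRequestHandler):
    def handle(self) -> None:
        # Send one ASCII line; the server closes the connection afterwards.
        self.request.sendall(current_status().encode("ascii") + b"\n")


def read_agent(host: str, port: int, timeout: float = 2.0) -> str:
    """Checker side: connect, read until EOF, return the stripped string."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        chunks = []
        while True:
            data = conn.recv(256)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii").strip()


def demo() -> str:
    """Run the agent on a free loopback port and query it once."""
    # Port 0 asks the OS for any free port; a real agent would use a
    # fixed, pre-defined port.
    with socketserver.TCPServer(("127.0.0.1", 0), AgentHandler) as server:
        thread = threading.Thread(target=server.serve_forever, daemon=True)
        thread.start()
        try:
            host, port = server.server_address
            return read_agent(host, port)
        finally:
            server.shutdown()
```

Calling demo() starts the agent, performs one check against it and
returns whatever current_status() produced.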

This is more for Malcolm : I'm realizing that there is no way for the agent
to report a failure. I would love to see a "down" statement here. The first
goal obviously is to immediately stop using a temporarily faulty server. One
of the benefits is that a down state raises an alert. Another benefit is that
the reason can be stored, logged and reported on the stats page. For example,
seeing a server marked down with "full length check failed at database"
would be very useful. As you can see, I would like the reason to be the end
of the string. So for example, the response for down would be the string :

    down File system full
or
    down Service not running

The first word "down" indicates the status, the rest of the string the reason.
It seems that this would be compatible with your protocol, don't you think ?
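
A minimal parsing sketch for the reply format discussed so far (the
percentage, drain and disable forms from the patch, plus the proposed
down form with a trailing reason). The AgentReply type, the function
name and the case-insensitive matching are assumptions of this example,
not part of the patch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentReply:
    action: str                    # "weight", "drain", "disable" or "down"
    percent: Optional[int] = None  # only set when action == "weight"
    reason: str = ""               # only set when action == "down"


def parse_agent_reply(line: str) -> AgentReply:
    """Parse one agent reply; raise ValueError if it is not recognised."""
    text = line.strip()
    if not text:
        raise ValueError("empty agent reply")
    # The first word is the status, the rest of the string is the reason.
    word, _, rest = text.partition(" ")
    keyword = word.lower()
    if keyword in ("drain", "disable"):
        return AgentReply(action=keyword)
    if keyword == "down":
        return AgentReply(action="down", reason=rest.strip())
    digits = keyword[:-1]
    if keyword.endswith("%") and digits.isascii() and digits.isdigit():
        return AgentReply(action="weight", percent=int(digits))
    raise ValueError(f"unrecognised agent reply: {text!r}")
```

Because the status is always the first word, adding further status
words later only needs another branch and leaves existing agents
untouched.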

 A dynamic health check may be configured using option dynamic-chk.
 The use of an alternate check-port, used to obtain dynamic health check
 information described above as opposed to the port of the service,
 may be useful in conjunction with this option.

I'm realizing that the name "dynamic" is probably not the most
appropriate as I initially understood it as a modifier for other checks.
For example, when we implement exactly the same thing within an HTTP
header, "dynamic" could be the option combined with "http-chk". After
all, we're relying on a clearly specified agent. Why not call it with
the agent's name (eg: "lb-agent-chk") ?

 +#define PR_O2_FEEDBACK_CHK 0x8000   /* use a TCP connection to obtain a metric of server health */

Then once we agree on a name, let's have the same one in this option.

Otherwise it looks good to me. I'm about to issue dev16 today (in a few
hours), if we can quickly decide what to do above, I could even include
it there.

Cheers,
Willy