Re: [PATCH 5/5] dynamic health check
On Fri, Feb 01, 2013 at 08:22:24AM +0100, Willy Tarreau wrote:
> Hi Simon,
>
> On Fri, Feb 01, 2013 at 01:56:01PM +0900, Simon Horman wrote:
> > Hi Malcolm, Hi Willy,
> >
> > after a bit of a hiatus I'd like to restart this discussion.
>
> Cool, I wanted to ping you on this last weekend but forgot to do so !
>
> > On Mon, Dec 24, 2012 at 10:23:15AM +0100, Willy Tarreau wrote:
> > > Hi Malcolm,
> > >
> > > On Mon, Dec 24, 2012 at 09:06:25AM +, Malcolm Turnbull wrote:
> > > > Willy / Simon,
> > > >
> > > > I'm very happy to add a down option; my original thought was that you would use the standard health checks as well as the dynamic agent for changing the weight.
> > >
> > > That's what I initially understood from our discussion a few months ago, but your post of the specs last week slightly confused me, as I understood you needed this as a dedicated check. I think it was the same for Simon.
> >
> > Sorry, I think that the problem here lies in my understanding of what is desired.
>
> No problem, several of us got confused.
>
> > > > For example, you may want a specific HAProxy SMTP health check and also use the dynamic weighting agent.
> > >
> > > Exactly. But then we have two options :
> > >   - retrieve the information from the checked port (easy for HTTP or TCP)
> > >   - retrieve the information from a dedicated port => this involves a second task to do this, with its own check intervals.
> > >
> > > The latter doesn't seem stupid at all, quite the opposite in fact, but it will require more settings on the server line. However it comes with a benefit: when the agent returns disable, checks are disabled on the real port, but the agent could continue to be checked and later return a valid result again.
> > >
> > > > I'm not sure if that would cause some coding issues if the health checks say 'Down' and the agent says 50%? (I would assume haproxy health checks take priority?)
> > >
> > > Status and weights are orthogonal. The real check should have precedence.
> > >
> > > > Or if the agent says Down but the HAProxy health check says up?
> > >
> > > I think it should be ANDed. This could help provide a first implementation of multi-port checks after all.
> >
> > That sounds reasonable.
> >
> > > > I'm certainly happy for Down to be added as an option with a description string. Also I'm assuming that later it (the dynamic agent) could easily be extended to an HTTP-style GET check rather than TCP (lb-agent-chk) if users prefer to write an HTTP server application to integrate with it (Kemp and Barracuda support this method).
> >
> > On the topic of down. I think that Willy's proposal is entirely reasonable. However it's unclear to me if disable should also be supported or not.
>
> The disable mode is very problematic : if a server accidentally returns it, there is no way to roll back except a manual intervention on the load balancers. Also there is a high risk that the backup LB will be forgotten in such an operation. I have no technical worries here, just operational ones.
>
> If we run agent checks on a dedicated port in parallel to health checks, this is different, because we could ensure that such checks are still running when the server is disabled, so that the agent can change the mode again.
>
> So maybe a first version should not support disable and a later one could support it ?

This seems reasonable to me.

> Also, I believe that in another thread we discussed supporting a new status (eg: STOPPED) which differs from DOWN in that it means the service was intentionally stopped and did not crash. We can't support this well right now (just map it to down) but I think it's important that people can design their agents for this.
>
> Similarly, a FAIL status could be useful in the usual situations where a server is inoperative due to external conditions but could appear valid. The common example is the mail server which fails to receive e-mails because the FS is full. Everything works except the service cannot be delivered. There is nothing to restart, the issue can go away by itself, etc... We'd map this to DOWN again, but I think some users may later prefer to have a dedicated status in the agent's language. So we should probably plan it in the language in order to avoid ugly patches here and there.

Adding stopped and fail, and mapping them both to down seems reasonable to me. I assume that they also accept reason strings as down does.

> > > That's what I'm commonly observing too. Even right now, there are a lot of users who use httpchk for services that are not HTTP at all, but they have a very simple agent responding to checks.
> > >
> > > So now we have to decide what to do. I think Simon's code already provides some useful features (assuming we support down). It should probably be extended later to support combined checks. In my opinion, this could be done in three steps :
> > >
> > >   1) we merge Simon's work with the option lb-agent-chk
Re: [PATCH 5/5] dynamic health check
Hi Malcolm, Hi Willy,

after a bit of a hiatus I'd like to restart this discussion.

On Mon, Dec 24, 2012 at 10:23:15AM +0100, Willy Tarreau wrote:
> Hi Malcolm,
>
> On Mon, Dec 24, 2012 at 09:06:25AM +, Malcolm Turnbull wrote:
> > Willy / Simon,
> >
> > I'm very happy to add a down option; my original thought was that you would use the standard health checks as well as the dynamic agent for changing the weight.
>
> That's what I initially understood from our discussion a few months ago, but your post of the specs last week slightly confused me, as I understood you needed this as a dedicated check. I think it was the same for Simon.

Sorry, I think that the problem here lies in my understanding of what is desired.

> > For example, you may want a specific HAProxy SMTP health check and also use the dynamic weighting agent.
>
> Exactly. But then we have two options :
>   - retrieve the information from the checked port (easy for HTTP or TCP)
>   - retrieve the information from a dedicated port => this involves a second task to do this, with its own check intervals.
>
> The latter doesn't seem stupid at all, quite the opposite in fact, but it will require more settings on the server line. However it comes with a benefit: when the agent returns disable, checks are disabled on the real port, but the agent could continue to be checked and later return a valid result again.
>
> > I'm not sure if that would cause some coding issues if the health checks say 'Down' and the agent says 50%? (I would assume haproxy health checks take priority?)
>
> Status and weights are orthogonal. The real check should have precedence.
>
> > Or if the agent says Down but the HAProxy health check says up?
>
> I think it should be ANDed. This could help provide a first implementation of multi-port checks after all.

That sounds reasonable.

> > I'm certainly happy for Down to be added as an option with a description string. Also I'm assuming that later it (the dynamic agent) could easily be extended to an HTTP-style GET check rather than TCP (lb-agent-chk) if users prefer to write an HTTP server application to integrate with it (Kemp and Barracuda support this method).

On the topic of down. I think that Willy's proposal is entirely reasonable. However it's unclear to me if disable should also be supported or not.

> That's what I'm commonly observing too. Even right now, there are a lot of users who use httpchk for services that are not HTTP at all, but they have a very simple agent responding to checks.
>
> So now we have to decide what to do. I think Simon's code already provides some useful features (assuming we support down). It should probably be extended later to support combined checks. In my opinion, this could be done in three steps :
>
>   1) we merge Simon's work with the "option lb-agent-chk" directive which *replaces* the health check method with this one ;
>
>   2) we implement "agent-port" and "agent-interval" on the server lines to automatically enable the agent to be run on another port even when a different check is running ;
>
>   3) we implement "http-check agent-hdr name" to retrieve the agent string from an HTTP header for HTTP checks ;
>
> That way we always support exactly the same syntax but can retrieve the required information at different places depending on the checks.
>
> Does that sound good to you ?

That sounds entirely reasonable to me.
Re: [PATCH 5/5] dynamic health check
Hi Simon,

On Fri, Feb 01, 2013 at 01:56:01PM +0900, Simon Horman wrote:
> Hi Malcolm, Hi Willy,
>
> after a bit of a hiatus I'd like to restart this discussion.

Cool, I wanted to ping you on this last weekend but forgot to do so !

> On Mon, Dec 24, 2012 at 10:23:15AM +0100, Willy Tarreau wrote:
> > Hi Malcolm,
> >
> > On Mon, Dec 24, 2012 at 09:06:25AM +, Malcolm Turnbull wrote:
> > > Willy / Simon,
> > >
> > > I'm very happy to add a down option; my original thought was that you would use the standard health checks as well as the dynamic agent for changing the weight.
> >
> > That's what I initially understood from our discussion a few months ago, but your post of the specs last week slightly confused me, as I understood you needed this as a dedicated check. I think it was the same for Simon.
>
> Sorry, I think that the problem here lies in my understanding of what is desired.

No problem, several of us got confused.

> > > For example, you may want a specific HAProxy SMTP health check and also use the dynamic weighting agent.
> >
> > Exactly. But then we have two options :
> >   - retrieve the information from the checked port (easy for HTTP or TCP)
> >   - retrieve the information from a dedicated port => this involves a second task to do this, with its own check intervals.
> >
> > The latter doesn't seem stupid at all, quite the opposite in fact, but it will require more settings on the server line. However it comes with a benefit: when the agent returns disable, checks are disabled on the real port, but the agent could continue to be checked and later return a valid result again.
> >
> > > I'm not sure if that would cause some coding issues if the health checks say 'Down' and the agent says 50%? (I would assume haproxy health checks take priority?)
> >
> > Status and weights are orthogonal. The real check should have precedence.
> >
> > > Or if the agent says Down but the HAProxy health check says up?
> >
> > I think it should be ANDed. This could help provide a first implementation of multi-port checks after all.
>
> That sounds reasonable.
>
> > > I'm certainly happy for Down to be added as an option with a description string. Also I'm assuming that later it (the dynamic agent) could easily be extended to an HTTP-style GET check rather than TCP (lb-agent-chk) if users prefer to write an HTTP server application to integrate with it (Kemp and Barracuda support this method).
>
> On the topic of down. I think that Willy's proposal is entirely reasonable. However it's unclear to me if disable should also be supported or not.

The disable mode is very problematic : if a server accidentally returns it, there is no way to roll back except a manual intervention on the load balancers. Also there is a high risk that the backup LB will be forgotten in such an operation. I have no technical worries here, just operational ones.

If we run agent checks on a dedicated port in parallel to health checks, this is different, because we could ensure that such checks are still running when the server is disabled, so that the agent can change the mode again.

So maybe a first version should not support disable and a later one could support it ?

Also, I believe that in another thread we discussed supporting a new status (eg: STOPPED) which differs from DOWN in that it means the service was intentionally stopped and did not crash. We can't support this well right now (just map it to down) but I think it's important that people can design their agents for this.

Similarly, a FAIL status could be useful in the usual situations where a server is inoperative due to external conditions but could appear valid. The common example is the mail server which fails to receive e-mails because the FS is full. Everything works except the service cannot be delivered. There is nothing to restart, the issue can go away by itself, etc... We'd map this to DOWN again, but I think some users may later prefer to have a dedicated status in the agent's language. So we should probably plan it in the language in order to avoid ugly patches here and there.
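To make the STOPPED / FAIL idea above concrete, here is a minimal sketch in Python of how a response could keep its own keyword in the agent's language while still being mapped to DOWN. It is an illustration only and not code from the patch: `map_agent_status`, `MappedStatus` and the lower-case keywords "stopped" and "fail" are assumptions, as is the idea that the rest of the line is a free-form reason (as proposed for "down").

```python
from typing import NamedTuple, Optional

# Hypothetical mapping: every failure-like agent keyword is treated as DOWN
# for now, but the original keyword is kept so it can be reported or logged.
AGENT_TO_SERVER_STATUS = {"down": "DOWN", "stopped": "DOWN", "fail": "DOWN"}


class MappedStatus(NamedTuple):
    agent_status: str   # keyword as sent by the agent, lower-cased
    server_status: str  # status applied to the server
    reason: str         # free-form text after the keyword, may be empty


def map_agent_status(response: str) -> Optional[MappedStatus]:
    """Return the mapped status, or None if the line is not a status keyword."""
    parts = response.strip().split(None, 1)
    if not parts:
        return None
    keyword = parts[0].lower()
    if keyword not in AGENT_TO_SERVER_STATUS:
        return None
    reason = parts[1] if len(parts) > 1 else ""
    return MappedStatus(keyword, AGENT_TO_SERVER_STATUS[keyword], reason)
```

With this shape, adding a dedicated STOPPED or FAIL state later would only mean changing the values in the table, not the agents.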
> > That's what I'm commonly observing too. Even right now, there are a lot of users who use httpchk for services that are not HTTP at all, but they have a very simple agent responding to checks.
> >
> > So now we have to decide what to do. I think Simon's code already provides some useful features (assuming we support down). It should probably be extended later to support combined checks. In my opinion, this could be done in three steps :
> >
> >   1) we merge Simon's work with the "option lb-agent-chk" directive which *replaces* the health check method with this one ;
> >
> >   2) we implement "agent-port" and "agent-interval" on the server lines to automatically enable the agent to be run on another port even when a different check is running ;
> >
> >   3) we implement "http-check agent-hdr name" to retrieve the agent string
Re: [PATCH 5/5] dynamic health check
Hi Malcolm,

On Mon, Dec 24, 2012 at 09:06:25AM +, Malcolm Turnbull wrote:
> Willy / Simon,
>
> I'm very happy to add a down option; my original thought was that you would use the standard health checks as well as the dynamic agent for changing the weight.

That's what I initially understood from our discussion a few months ago, but your post of the specs last week slightly confused me, as I understood you needed this as a dedicated check. I think it was the same for Simon.

> For example, you may want a specific HAProxy SMTP health check and also use the dynamic weighting agent.

Exactly. But then we have two options :
  - retrieve the information from the checked port (easy for HTTP or TCP)
  - retrieve the information from a dedicated port => this involves a second task to do this, with its own check intervals.

The latter doesn't seem stupid at all, quite the opposite in fact, but it will require more settings on the server line. However it comes with a benefit: when the agent returns disable, checks are disabled on the real port, but the agent could continue to be checked and later return a valid result again.

> I'm not sure if that would cause some coding issues if the health checks say 'Down' and the agent says 50%? (I would assume haproxy health checks take priority?)

Status and weights are orthogonal. The real check should have precedence.

> Or if the agent says Down but the HAProxy health check says up?

I think it should be ANDed. This could help provide a first implementation of multi-port checks after all.

> I'm certainly happy for Down to be added as an option with a description string. Also I'm assuming that later it (the dynamic agent) could easily be extended to an HTTP-style GET check rather than TCP (lb-agent-chk) if users prefer to write an HTTP server application to integrate with it (Kemp and Barracuda support this method).

That's what I'm commonly observing too. Even right now, there are a lot of users who use httpchk for services that are not HTTP at all, but they have a very simple agent responding to checks.

So now we have to decide what to do. I think Simon's code already provides some useful features (assuming we support down). It should probably be extended later to support combined checks. In my opinion, this could be done in three steps :

  1) we merge Simon's work with the "option lb-agent-chk" directive which *replaces* the health check method with this one ;

  2) we implement "agent-port" and "agent-interval" on the server lines to automatically enable the agent to be run on another port even when a different check is running ;

  3) we implement "http-check agent-hdr name" to retrieve the agent string from an HTTP header for HTTP checks ;

That way we always support exactly the same syntax but can retrieve the required information at different places depending on the checks.

Does that sound good to you ?

Best regards,
Willy
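One possible reading of "status and weights are orthogonal" and "it should be ANDed" is sketched below in Python. This is an assumption-laden illustration, not HAProxy code: `combine` and `ServerState` are hypothetical names, and integer truncation of the scaled weight is a choice made here for simplicity. The server is usable only when both the regular health check and the agent say it is up, while the agent's percentage scales the configured weight regardless of the status.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerState:
    up: bool      # effective status after combining both checks
    weight: int   # effective weight after applying the agent percentage


def combine(health_up: bool, agent_up: bool,
            initial_weight: int, agent_percent: int) -> ServerState:
    """Combine a health check result with an agent result.

    Status: logical AND of both results, so either side can take the
    server down but neither can force it up on its own.
    Weight: computed independently of the status, relative to the
    weight configured at startup (rounding down is an assumption).
    """
    up = health_up and agent_up
    weight = initial_weight * agent_percent // 100
    return ServerState(up=up, weight=weight)
```

For example, a failed health check combined with an agent reporting 50% yields a server that is down but whose weight is already halved for when it comes back.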
Re: [PATCH 5/5] dynamic health check
Willy,

Yes. That sounds good to me. Thanks.
And have a nice Christmas...

On 24 December 2012 09:23, Willy Tarreau w...@1wt.eu wrote:
> Hi Malcolm,
>
> On Mon, Dec 24, 2012 at 09:06:25AM +, Malcolm Turnbull wrote:
> > Willy / Simon,
> >
> > I'm very happy to add a down option; my original thought was that you would use the standard health checks as well as the dynamic agent for changing the weight.
>
> That's what I initially understood from our discussion a few months ago, but your post of the specs last week slightly confused me, as I understood you needed this as a dedicated check. I think it was the same for Simon.
>
> > For example, you may want a specific HAProxy SMTP health check and also use the dynamic weighting agent.
>
> Exactly. But then we have two options :
>   - retrieve the information from the checked port (easy for HTTP or TCP)
>   - retrieve the information from a dedicated port => this involves a second task to do this, with its own check intervals.
>
> The latter doesn't seem stupid at all, quite the opposite in fact, but it will require more settings on the server line. However it comes with a benefit: when the agent returns disable, checks are disabled on the real port, but the agent could continue to be checked and later return a valid result again.
>
> > I'm not sure if that would cause some coding issues if the health checks say 'Down' and the agent says 50%? (I would assume haproxy health checks take priority?)
>
> Status and weights are orthogonal. The real check should have precedence.
>
> > Or if the agent says Down but the HAProxy health check says up?
>
> I think it should be ANDed. This could help provide a first implementation of multi-port checks after all.
>
> > I'm certainly happy for Down to be added as an option with a description string. Also I'm assuming that later it (the dynamic agent) could easily be extended to an HTTP-style GET check rather than TCP (lb-agent-chk) if users prefer to write an HTTP server application to integrate with it (Kemp and Barracuda support this method).
>
> That's what I'm commonly observing too. Even right now, there are a lot of users who use httpchk for services that are not HTTP at all, but they have a very simple agent responding to checks.
>
> So now we have to decide what to do. I think Simon's code already provides some useful features (assuming we support down). It should probably be extended later to support combined checks. In my opinion, this could be done in three steps :
>
>   1) we merge Simon's work with the "option lb-agent-chk" directive which *replaces* the health check method with this one ;
>
>   2) we implement "agent-port" and "agent-interval" on the server lines to automatically enable the agent to be run on another port even when a different check is running ;
>
>   3) we implement "http-check agent-hdr name" to retrieve the agent string from an HTTP header for HTTP checks ;
>
> That way we always support exactly the same syntax but can retrieve the required information at different places depending on the checks.
>
> Does that sound good to you ?
>
> Best regards,
> Willy

--
Regards,

Malcolm Turnbull.

Loadbalancer.org Ltd.
Phone: +44 (0)870 443 8779
http://www.loadbalancer.org/
Re: [PATCH 5/5] dynamic health check
Hi Simon,

CCing Malcolm who posted the specs for the check.

On Mon, Dec 24, 2012 at 10:33:57AM +0900, Simon Horman wrote:
> Support a dynamic health check performed by opening a TCP socket to a pre-defined port and reading an ASCII string.
>
> The string should have one of the following forms:
>
> i. An ASCII representation of a positive integer percentage, e.g. 75%.
>    Values in this format will set the weight proportional to the initial weight of a server as configured when haproxy starts.
>
> ii. The string "drain".
>    This will cause the weight of a server to be set to 0, and thus it will not accept any new connections other than those that are accepted via persistence.
>
> iii. The string "disable".
>    Put the server into maintenance mode. The server must be re-enabled before any further health checks will be performed.

This is more for Malcolm : I'm realizing that there is no way for the agent to report a failure. I would love to see a "down" statement here. The first goal obviously is to immediately stop using a temporarily faulty server. One of the benefits is that a down state raises an alert. Another benefit is that the reason can be stored, logged and reported on the stats page. For example, seeing a server marked down with "full length check failed at database" would be very useful.

As you can see, I would like the reason to be the end of the string. So for example, the response for down would be the string :

    down File system full
or
    down Service not running

The first word "down" indicates the status, the rest of the string the reason. It seems that this would be compatible with your protocol, don't you think ?

> A dynamic health check may be configured using "option dynamic-chk".
> The use of an alternate check-port, used to obtain the dynamic health check information described above as opposed to the port of the service, may be useful in conjunction with this option.

I'm realizing that the name "dynamic" might not be the most appropriate, as I initially understood it as a modifier for other checks. For example, when we implement exactly the same thing within an HTTP header, "dynamic" could be the option combined with http-chk. After all, we're relying on a clearly specified agent. Why not call it with the agent's name (eg: lb-agent-chk) ?

> +#define PR_O2_FEEDBACK_CHK 0x8000 /* use a TCP connection to obtain a metric of server health */

Then once we agree on a name, let's have the same one in this option.

Otherwise it looks good to me. I'm about to issue dev16 today (in a few hours), if we can quickly decide what to do above, I could even include it there.

Cheers,
Willy
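As a rough illustration of the response format discussed in this message (a percentage, "drain", "disable", or "down" followed by a free-form reason), the following Python sketch parses one agent line. It is not taken from the patch: `parse_agent_response`, `AgentResult` and `scaled_weight` are hypothetical names, and both the "unknown" fallback for unrecognised input and the rounding down of the scaled weight are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentResult:
    action: str                    # "weight", "drain", "disable", "down" or "unknown"
    percent: Optional[int] = None  # only set when action == "weight"
    reason: str = ""               # only set when action == "down"


def parse_agent_response(line: str) -> AgentResult:
    """Parse a single ASCII line returned by the agent."""
    parts = line.strip().split(None, 1)
    if not parts:
        return AgentResult("unknown")
    keyword = parts[0].lower()
    rest = parts[1] if len(parts) > 1 else ""
    digits = keyword[:-1]
    # Form i: an integer percentage such as "75%"
    if keyword.endswith("%") and digits.isascii() and digits.isdigit():
        return AgentResult("weight", percent=int(digits))
    # Forms ii and iii: the bare keywords "drain" and "disable"
    if keyword in ("drain", "disable"):
        return AgentResult(keyword)
    # Proposed addition: "down" followed by an optional reason
    if keyword == "down":
        return AgentResult("down", reason=rest)
    return AgentResult("unknown")


def scaled_weight(initial_weight: int, percent: int) -> int:
    """Weight relative to the one configured at startup (rounded down)."""
    return initial_weight * percent // 100
```

Under this sketch, "drain" is kept as its own action rather than being folded into a 0% weight, so that a caller can still tell the two apart when logging.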