Hi Willy, sorry for not responding earlier.
I will likely have more questions regarding your review but at this point one in particular comes to mind: are you still interested in removing lb-agent-chk? If so I think it is a reasonable idea as it should avoid complexity both in this series and in the future. On Fri, Sep 06, 2013 at 03:05:35PM +0200, Willy Tarreau wrote: > Hi Simon, > > I'm finally done with the review. That was quite long (spread over two > weeks), and obviously less than what it might have taken you to do this > work. Next time if you can send smaller chunks it might help a lot ! > > Due to the difficulty to assign enough continuous time on it, I used > "git notes" to add notes to the commits, which allowed me to go back > and forth in the review. So I have not commented inline the code but > I have so few comments about the code by itself that it should not be > an issue at all, I hope you don't mind. Otherwise ask me for details > and I'll happily go into depth. > > I found one concern at the end about the external check that's probably > worth a separate discussion thread. I'm suspecting that we should probably > get the 20 first patches merged ASAP and discuss the last two after that. > > I'm appending the whole review here. This is a git log --reverse which > shows my comments below the Notes: tags. > > Thanks very much for this work, I hope we can sort out all remaining > points to get it merged into 1.5-dev20. > > Best regards, > Willy > > --- > > commit 08eff21897f3f5d62f1360f309c5a71da6146afa > Author: Simon Horman <ho...@verge.net.au> > Date: Sat Feb 23 08:34:13 2013 +0900 > > DOC: Clarify documentation of option lb-agent-chk > > Avoid referring to check-port as this is not a configuration parameter. > > Signed-off-by: Simon Horman <ho...@verge.net.au> > > Notes: > OK > > commit bff65d578c2e4212dfcbfa14ba549581ca42bc1d > Author: Simon Horman <ho...@verge.net.au> > Date: Sun Feb 24 07:45:56 2013 +0900 > > CLEANUP: Make parameters of srv_downtime and srv_getinter const > > The parameters of srv_downtime and srv_getinter are not modified > and thus may be const. > > Signed-off-by: Simon Horman <ho...@verge.net.au> > > Notes: > OK > > commit 9386a28b61cb7fb011ef2e710616b972f7bfe4e3 > Author: Simon Horman <ho...@verge.net.au> > Date: Sat Feb 23 10:16:43 2013 +0900 > > MEDIUM: Split up struct server's check element > > This is in preparation for associating a agent check > with a server which runs as well as the server's existing check. > > The split has been made by: > * Moving elements of struct server's check element that will > be shared by both checks into a new check_common element > of struct server. > * Moving the remaining elements to a new struct check and > making struct server's check element a struct check. > * Adding a server element to struct check, a back-pointer > to the server element it is a member of. > - At this time the server could be obtained using > container_of, however, this will not be so easy > once a second struct check element is added to struct server > to accommodate an agent health check. > > Signed-off-by: Simon Horman <ho...@verge.net.au> > > --- > > v6 > * Correct use of srv->check_common.xprt ssl_sock_prepare_srv_ctx(). > Previously srv->check.xprt was used which causes a build failure > when compiled with make USE_OPENSSL=1 ... > Thanks to Scott McKeown for discovering this. > > v2 - v5 > * No change > > Notes: > OK. > > commit ed27bc9796c5d0c06e1dc3dad4774510b3f1277a > Author: Simon Horman <ho...@verge.net.au> > Date: Sun Feb 24 09:09:03 2013 +0900 > > MEDIUM: Move {,fast,down}inter to struct check > > Move {,fast,down}inter elements from struct server to struct check. > This allows those elements of a check to be independent of the check's > server. > > This is in preparation for associating a agent check > with a server which runs as well as the server's existing check. > > Signed-off-by: Simon Horman <ho...@verge.net.au> > > Notes: > OK though could probably be merged with previous one > > commit 368f15d3ad805edbbc9070e708ce5b99624a8607 > Author: Simon Horman <ho...@verge.net.au> > Date: Sun Feb 24 07:25:29 2013 +0900 > > MEDIUM: Move result element to struct check > > Move result element from struct server to struct check > This allows check results to be independent of the check's server. > > This is in preparation for associating a agent check > with a server which runs as well as the server's existing check. > > Signed-off-by: Simon Horman <ho...@verge.net.au> > > Notes: > OK. > > commit f0d5a3c54696b4abd284b363adfb616cad0837ff > Author: Simon Horman <ho...@verge.net.au> > Date: Sat Feb 23 15:35:38 2013 +0900 > > MEDIUM: Paramatise functions over the check of a server > > Paramatise the following functions over the check of a server > > * set_server_down > * set_server_up > * srv_getinter > * server_status_printf > * set_server_check_status > * set_server_disabled > > Generally the server parameter of these functions has been removed. > Where it is still needed it is obtained using check->server. > > This is in preparation for associating a agent check > with a server which runs as well as the server's existing check. > By paramatising these functions they may act on each of the checks > without further significant modification. > > Signed-off-by: Simon Horman <ho...@verge.net.au> > > --- > > This is a rather large patch and I would be happy to split it up > on request. > > v4 > * Use check->type in process_check() > > Use check->type instead of s->proxy->options2 & PR_O2_CHK_ANY > as the former is specific to the check being processed whereas > the latter relates only to the primary check of a server. > > * Add type to struct check > > This is used to indicate the type of a check independent of > its server's proxy's check type. > > v2 - v3 > * No Change > > Notes: > - still need to understand the SSP_O_HCHK -> check change > - set_server_disabled() : takes a check arg, converts it to server then > uses &s->check, seems suspicious or at least confusing (why not simply > check ?) > > - also : > for(srv = s->tracknext; srv; srv = srv->tracknext) > - set_server_disabled(srv); > + set_server_disabled(&s->check); > > isn't it &srv->check instead ? > > - it's nice to have a check being the owner of a connection instead of the > server, looks much cleaner than what we used to have. > - it should probably be made it more obvious what case is covered by > !check->type > (if relevant at this point) > > commit 8c7409c8a6ddb4f3629fd877935355e0902ebfa5 > Author: Simon Horman <ho...@verge.net.au> > Date: Sat Feb 23 15:14:19 2013 +0900 > > MEDIUM: cfgparse: Factor out check initialisation > > This is in preparation for struct server having two elements > of type struct check. > > Signed-off-by: Simon Horman <ho...@verge.net.au> > > --- > > v5 > * Remove server argument from init_check. It is not used. > * Set type in init_check > > This allows a zero type to be used to indicate that > a task for a secondary agent check should not be stated. > > Notes: > OK > > commit 73f8958f7c982aa0d376ab5b6d4baeb0892a61c2 > Author: Simon Horman <ho...@verge.net.au> > Date: Sun Feb 24 17:23:38 2013 +0900 > > MEDIUM: Add state to struct check > > Add state to struct check. This is currently used to store one bit, > CHK_RUNNING, which is set if a check is running and clear otherwise. > This bit was previously SRV_CHK_RUNNING of the state element of struct > server. > > This is in preparation for associating a agent check > with a server which runs as well as the server's existing check. > > Signed-off-by: Simon Horman <horms+rene...@verge.net.au> > > Notes: > Please use CHK_STATE_RUNNING instead of CHK_RUNNING so that it's > easier to know it is related to the state attribute of the check. > > Also, we have many HCHK_* and now CHK_* as well, so I wonder how/when > this will become confusing. > > commit 8fe10dc6d602286db0270a0d1d27a3fa2907792e > Author: Simon Horman <ho...@verge.net.au> > Date: Sun Feb 24 21:50:46 2013 +0900 > > MEDIUM: Add name element to struct check > > This is in preparation for associating a agent check > with a server which runs as well as the server's existing check. > > Signed-off-by: Simon Horman <horms+rene...@verge.net.au> > > --- > > v5 > * Rebase > > Notes: > Is it worth adding a name field just to report it in the stats ? Don't > we have everything required to display the proper name when dumping > the stats ? (maybe not). > > commit 59d4d5ae53c2225a6001be9d526de6623dba0abd > Author: Simon Horman <ho...@verge.net.au> > Date: Sun Feb 24 17:23:38 2013 +0900 > > MEDIUM: Move health element to struct check > > This is in preparation for associating a agent check > with a server which runs as well as the server's existing check. > > Signed-off-by: Simon Horman <ho...@verge.net.au> > > Notes: > OK > > commit 4f7700d5da3c22444f16716d105e013ef69ad0b0 > Author: Simon Horman <ho...@verge.net.au> > Date: Sun Feb 24 20:50:33 2013 +0900 > > MEDIUM: Add helper for task creation for checks > > This helper is in preparation for adding a second struct check element > to struct server. > > Signed-off-by: Simon Horman <ho...@verge.net.au> > > --- > > v5 > * Rebase > > Notes: > Cosmetic, but the cfgparse change just changes the previous patch, so it > could > have been merged there. > > Please change the "polarity" of the start_check_task() function or make it > very clear that it returns an error code, or change the check below : > > - if (start_check_task(&s->check, mininter, nbcheck, srvpos++)) > + if (start_check_task(&s->check, mininter, nbcheck, srvpos++) < > 0) > return -1 > > The problem is I'm reading it as "if we could start a check task, return > -1". > And yes, I know that all the kernel's networking code you're dealing with > every day works this way so this can be confusing. > > Last cosmetic point, please keep srvpos++ out of the condition (I thought > we lost it). I tend to discourage people from doing variable assignments > in conditions because these are the places that are commented out the > first > during debugging sessions, causing the accidental loss of the assignment. > > Otherwise OK. > > commit cf63f97ef6fa2ef584caab595c06738262f3ac80 > Author: Simon Horman <ho...@verge.net.au> > Date: Sat Feb 23 15:10:54 2013 +0900 > > MEDIUM: checks: Add supplementary agent checks > > Allow an agent check to be run in conjunction with one other server > health check. > > If the backend for a server check is not lb-agent-chk then an agent > check may also be run using the agent-check parameter to a server, > which sets the TCP port to be used for the agent check. > > e.g. > server web1_1 127.0.0.1:80 check agent-port 10000 > > The agent-inter parameter may also be used to specify the interval > and timeout for agent checks. > > If either the health or agent check determines that a server is down > then it is marked as being down, otherwise it is marked as being up. > > Signed-off-by: Simon Horman <ho...@verge.net.au> > > --- > > v5 > * Rebase for removal of server argument from init_check > * Rebase for setting of type in init_check > > v4 > * Increment global.maxsock for agent-port. > > If agent-port is configured then an extra socket is required. > > * Do not send requests to secondary agent checks > > The request configuration of a proxy relates to the primary health > check and should not be sent to the secondary health check if it > is in operation. > > * Correct usage of PR_O2_LB_AGENT_CHK > > The correct way to check for PR_O2_LB_AGENT_CHK is > not to use x & PR_O2_LB_AGENT_CHK, but rather to use > (x & PR_O2_CHK_ANY ) == PR_O2_LB_AGENT_CHK. > > v2 - v3 > * No change > > Notes: > The commit message confused me but fortunately the doc fixed me :-) > > I'm having an issue with using agent-port to automatically enable the > check. We've done this mistake with the stats in the past. Specifying > any stats setting enables the stats. The end result is that nobody > sets the stats URL or passwords in a defaults section. > > Here we'll get the same issue. On large deployments, it's very likely > that users will want to have "default-server agent-port 10000" in their > defaults section, but then they will have no way to disable it for some > servers. > > Thus I'd rather stick to the same logic we have for current health checks > which consists in having the port being an independant and harmless > setting, > and have an "agent-check" keyword on the server lines to enable it where > relevant. > > Please do not forget to mention the compatibility between agent-port and > default-server in the doc. > > Also, just a stupid question, we've introduced option lb-agent-chk very > recently and we're still in development. Don't you think it would be > useful to remove it now before a release if agent-check can cover all > its needs ? It would avoid having to deal with complex setups in the > future. CCing the lb.org guys here who most likely use it already (and > might probably plan a config migration). > > commit 8abb1b04fcb258a6004d37fe2dd2436cb121007a > Author: Simon Horman <ho...@verge.net.au> > Date: Sun Mar 3 16:46:40 2013 +0900 > > MEDIUM: Add helper for agent check events. > > Break agent check handling out of event_srv_chk_r(). > > This is in preparation for supporting agent check results > returned in an HTTP header. > > Signed-off-by: Simon Horman <ho...@verge.net.au> > > Notes: > OK > > commit 77950728605ced6093a829b08ffc5b7d5ecc0b6d > Author: Simon Horman <ho...@verge.net.au> > Date: Sun Mar 3 16:22:45 2013 +0900 > > MEDIUM: Parser to allow matching of HTTP header > > Replace the current header parser, which simply skips the headers, > with a version that allows matching of a key in a header. > > This is in preparation for supporting agent check results > returned in an HTTP header. > > Signed-off-by: Simon Horman <ho...@verge.net.au> > > Notes: > I don't understand the parser :-( Some comments on the inputs/outputs of > the > function are probably needed since they're clearly not obvious. > > Also I fear that it does not support partial responses anymore, but I may > be wrong (ie: wait for more data to come when the response is not > complete). > > Anyway, this change must be tagged as MAJOR since it will definitely cause > regressions in obscure places : I'm seeing the parser is stricter than it > was, and people who implement script-based health checks tend to emit > awful > HTTP-like responses which certainly don't qualify as valid but which were > accepted. So whatever we can do to optimise compatibility with previous > checks should be considered (especially since we're speaking about checks > and not production traffic). > > commit 0b829e506657e0d18cc8ffd64808f8cb0e379168 > Author: Simon Horman <ho...@verge.net.au> > Date: Sun Mar 3 17:50:39 2013 +0900 > > MEDIUM: Add http-check agent-hdr option > > Allow agent checks to obtain information in > an HTTP header of the response to an http check. > > Signed-off-by: Simon Horman <ho...@verge.net.au> > > --- > > v4 > * Do not duplicate code that is in process_result() > > v2 - v3 > * No change > > Notes: > OK in principle, but what does it do when agent-port is set ? > Is there any conflict or do we face a race between the two competing > checks ? > > In the doc : > > http-check agent-hdr > Use the value of the given http header as the result of an agent check > May be used in sections : defaults | frontend | listen | backend > yes | no | yes | yes > Arguments : > > <header> The header string to use to send the server name > > Please mention "<header>" on the syntax line itself : > > http-check agent-hdr <header> > > commit 410090bf97c2553da33f799c6dc313fb91b13ead > Author: Simon Horman <ho...@verge.net.au> > Date: Wed Jun 5 18:06:07 2013 +0900 > > MEDIUM: Add DRAIN state and report it on the stats page > > Add a DRAIN sub-state for a server which > will be shown on the stats page instead of UP if > an agent check is in use and the agent has most recently returned "drain". > > Signed-off-by: Simon Horman <ho...@verge.net.au> > > Notes: > In stats_dump_sv_stats() we use 'state' to display a new state, but the > CSS is not updated, so we'll end up with any formating or colorset. I > think it would be good to use : > - the same color as "UP 2/3" (yellow) for "DRAIN 2/3" > - a light brown such as #CC9900 when the DRAIN is in effect > > Also please update the comments at the top of the function to reflect > the new input range. > > commit 5a53bc5b4aebbfb499b96a5d3dadde514d885aaf > Author: Simon Horman <ho...@verge.net.au> > Date: Wed Jun 26 23:38:56 2013 +0900 > > MEDIUM: Log agent fail, stopped or down as info > > In the case where an agent check returns fail, stopped or down, > log this as info when logging the server status along with any > trailing message returned by the agent after fail, stopped or down. > > Previously only the trailing message was logged as info and > if omitted no info was logged. > > Signed-off-by: Simon Horman <ho...@verge.net.au> > > Notes: > OK > > commit 5ab56ee1510abfc79ec1f46657a49abb47da6824 > Author: Simon Horman <ho...@verge.net.au> > Date: Wed Jun 5 18:05:05 2013 +0900 > > MEDIUM: Do not mark a server as down if the agent is unavailable > > In the case where agent-port is used and the agent > check is a secondary check to not mark a server as down > if the agent becomes unavailable. > > In this configuration the agent should only cause a server to be marked > as down if the agent returns "fail", "stopped" or "down". > > Signed-off-by: Simon Horman <ho...@verge.net.au> > > --- > > v4 > * Never update health of agent on failure to connect > > Previously the health was recharged on such cases. > But this causes the health to be recharged even > if the most recent result from the agent was "fail", "stopped", or > "down". > > The result was unnecessary logging of "fail", "stopped", or "down" > results from the agent if they were interleaved with timeouts. > Or in other words, if the agent responded only occasionally > but consistently with "fail", "stopped", or "down". > > v3 > * No change > > v2 > * Only log agent server status if the change in status > will affect the server's state > > Notes: > While I absolutely agree with the goal described in the commit message, > I fail to connect it to what I'm seeing in the code. I don't know if it > is a result of the rework of the patches but for me all it does it apply > a second DRAIN state whose purpose I don't understand. I'm seeing that > we first set WILLDRAIN, then update the status, then add DRAIN on top of > that, that remains unclear to me. I'm not sure what situation the 4 > combinations of these flags correspond to. > > commit 3980cce19837da833cf0c5ced030060257b9df0c > Author: Simon Horman <ho...@verge.net.au> > Date: Thu Jun 27 21:15:04 2013 +0900 > > MEDIUM: Set rise and fall of agent checks to 1 > > This is achieved by moving rise and fall from struct server to struct > check. > > After this move the behaviour of the primary check, server->check is > unchanged. However, the secondary agent check, server->agent now has > independent rise and fall values each of which are set to 1. > > The result is that receiving "fail", "stopped" or "down" just once from > the > agent will mark the server as down. And receiving a weight just once will > allow the server to be marked up if its primary check is in good health. > > This opens up the scope to allow the rise and fall values of the agent > check to be configurable, however this has not been implemented at this > stage. > > Signed-off-by: Simon Horman <ho...@verge.net.au> > > --- > > v7 > When checking health against rise in set_server_up the health of a check > should be checked against the rise of the same check. Otherwise the > condition > for the s->agent may be false when it should be true as s->agent's rise > and > thus healthy health value is 1 whereas the default rise for check is > likely to > be 3, as check may be s->check and the the default value for > s->check.rise is 3. > > v5 > * Rebase > > Notes: > One remaining "server" below : > > - int health; /* 0 to server->rise-1 = > bad; > - * rise to > server->rise+server->fall-1 = good */ > + int health; /* 0 to rise-1 = bad; > + * rise to > rise+server->fall-1 = good */ > + int rise, fall; /* time in iterations */ > > Otherwise OK. > > commit fa2d91c61ce78d24143a74508d1dc8c85fca53f1 > Author: Simon Horman <ho...@verge.net.au> > Date: Sat Jun 22 14:36:39 2013 +0900 > > MEDIUM: Add set agent pause|unpause unix socket command > > The syntax of this new command is: > > set agent <backend>/<server> pause|unpause > > This command changes the behaviour of agent checks as follows: > > In the case where an agent check is being run as a secondary check, > due to the agent-port parameter of a server directive, new checks > are only initialised when the agent is in the unpaused state. Thus, > setting the agent to pause will prevent any new agent checks from > begin initiated until the agent is set to unpaused. > > When set pause is in effect the processing of an agent check run as the > primary check, due to either option lb-agent-check or http-check > agent-hdr being set, or processing of a secondary agent check that was > initiated while the agent was set as unpaused is as follows: All results > that would alter the weight, specifically "drain" or a weight returned by > the agent, are ignored. The processing of agent check is otherwise > unchanged. > > The motivation for this option is to allow weight setting is to allow the > weight changing effects of the agent checks to be paused to allow the > weight of a server to be configured using set weight without being > overridden by the agent. > > The default state is unpaused. > > Signed-off-by: Simon Horman <ho...@verge.net.au> > > Notes: > I think that instead of setting the agent status to pause/unpaused, we'd > rather reuse the same principle as we currently have with servers : > > enable agent <backend>/<server> > disable agent <backend>/<server> > > OK for the rest of course. > > commit e4cbb6e2d82ad7d037316ab0dc4932e16a3b4e14 > Author: Simon Horman <ho...@verge.net.au> > Date: Tue Mar 13 14:02:40 2012 +0900 > > MEDIUM: Break out check establishment into establish_chk() > > This is in preparation for adding a new type of check that > uses a process rather than a socket. > > Signed-off-by: Simon Horman <ho...@verge.net.au> > > --- > > v4 > * Use check->type in establish_conn_chk() > > Use check->type instead of s->proxy->options2 & PR_O2_CHK_ANY > as the former is specific to the check being processed whereas > the latter relates only to the primary check of a server. > > v2 - v3 > * No change > > Notes: > I'd rather call it connect_chk() since we already use the verb > "connect" at other places for the same purpose. > > commit 23bc14cb0389d6ecb525c76766181ef01938a6cf > Author: Simon Horman <ho...@verge.net.au> > Date: Tue Mar 13 14:03:22 2012 +0900 > > MEDIUM: Add external check > > Add an external check which makes use of an external process to > check the status of a server. > > --- > > v6 > * Correct implementation and documentation of arguments to external-check > command so that they are consistent with both each other and > ldirectord's > external check. The motivation being to allow the same scripts to be > used > with both haproxy and ldirectord. > > v5 > * Rebase > > v4 > * Remove stray use of s->check in process_chk() > The check parameter should be used throughout process_chk() > * Layer 7 timeouts of agent checks should be ignored > * Ensure that argc is never used uninitialised in prepare_external_check() > > v3 > * Rebase: basically a rewrite of large sections of the code > * Merge with the following patches > + "external-check: Actually execute command" > + "Allow selection of of external-check in configuration file" > > v2 > * If the external command exits normally (WIFEXITED()) is true) > then set the check's code to the exit status (WEXITSTATUS()) > of the process. > * Treat a timeout is a failure case rather than the test having passed > * Remove duplicate getnameinfo() call in start_checks() > * Remove duplicate assignment of sockaddr argument to getnameinfo(9 > which caused the check port and check addr configuration of > a server to be ignored. > > Notes: > Problems : > - don't use "option xxx <arg>" despite the old poor choice introduced > with "option httpchk". Options are meant to be booleans. > - we need some way to ensure that we can globally disable this feature > or better, that it must be explicitly enabled, as it introduces a > major security risk (APIs, ...) > - probably that we need to have a global prefix path setting for all > commands. > - anonymous union does not build with gcc-2.95 (still supported) > - please don't use stdbool (again), it is not portable and has already > broke twice in the past. > - getnameinfo is not portable either, better use str2sa() or equivalent > (and remove include netdb) > - I'm quite worried about the changes with has_conn in process_chk, > they're touching a very sensible place, so I think it would be > much better to have a distinct process_chk function for external > checks (we can even rename process_chk to process_chk_conn) > > -------- >