RE: [PATCH] BUG/MINOR: promex: don't count servers in maintenance

2023-09-12 Thread Cedric Paillet
Hi,

Sorry, I now understand my problem better.

When a server is in maintenance, the health checks are not performed and 
sv->check.state == CHK_ST_PAUSED.
This is checked in the haproxy_server_check_status metric but not in 
backend_agg_check_status.
I will check my new patch and push it afterwards.

Thank you very much.


-Message d'origine-
De : Christopher Faulet  
Envoyé : lundi 11 septembre 2023 20:15
À : Cedric Paillet ; haproxy@formilux.org
Objet : Re: [PATCH] BUG/MINOR: promex: don't count servers in maintenance

Le 07/09/2023 à 16:50, Cedric Paillet a écrit :
> 
>> And I guess we should also check the healthchecks are enabled for the 
>> server. It is not really an issue because call to get_check_status_result() 
>> will exclude neutral and unknown satuses. But there is no reason to count 
>> these servers.
> 
> What we observed is that the health check of servers that have been put into 
> maintenance is no longer updated, and the status returned is the last known 
> one. I need to double-check, but I believe we even saw L7OK when putting a 
> "UP" server into maintenance. (What I'm sure of is that the majority of 
> servers in maintenance were in L7STS and not in UNK).
> 
> If we add the PROMEX_FL_NO_MAINT_SR (which makes sense), we will continue to 
> display an incorrect backend_agg_server_status (and also 
> haproxy_server_check_status) for servers in maintenance for those who don't 
> set (no-maint=empty), and we probably then need another patch so that the 
> status of these servers is HCHK_STATUS_UNKNOWN?"
> 

Health-checks for servers in maintenance are paused. So indeed, the last known 
status does not change anymore in this state. My purpose here was to also 
filter servers to only count those with health-checks enabled and running. When 
the server's metrics are dumped, the check status is already skipped for 
servers in maintenance. Thus it seems logical to not count them for the 
aggregated metric.

In fact, this way, all servers in maintenance are skipped without checking the 
server's admin state. But it is probably cleaner to keep both checks for 
consistency. Except if I missed something. This part is not really clear for me 
anymore...

--
Christopher Faulet



Re: [PATCH] BUG/MINOR: promex: don't count servers in maintenance

2023-09-11 Thread Christopher Faulet

Le 07/09/2023 à 16:50, Cedric Paillet a écrit :



And I guess we should also check the healthchecks are enabled for the server. 
It is not really an issue because call to get_check_status_result() will 
exclude neutral and unknown satuses. But there is no reason to count these 
servers.


What we observed is that the health check of servers that have been put into maintenance 
is no longer updated, and the status returned is the last known one. I need to 
double-check, but I believe we even saw L7OK when putting a "UP" server into 
maintenance. (What I'm sure of is that the majority of servers in maintenance were in 
L7STS and not in UNK).

If we add the PROMEX_FL_NO_MAINT_SR (which makes sense), we will continue to display 
an incorrect backend_agg_server_status (and also haproxy_server_check_status) for 
servers in maintenance for those who don't set (no-maint=empty), and we probably 
then need another patch so that the status of these servers is 
HCHK_STATUS_UNKNOWN?"



Health-checks for servers in maintenance are paused. So indeed, the last known 
status does not change anymore in this state. My purpose here was to also filter 
servers to only count those with health-checks enabled and running. When the 
server's metrics are dumped, the check status is already skipped for servers in 
maintenance. Thus it seems logical to not count them for the aggregated metric.


In fact, this way, all servers in maintenance are skipped without checking the 
server's admin state. But it is probably cleaner to keep both checks for 
consistency. Except if I missed something. This part is not really clear for me 
anymore...


--
Christopher Faulet




Re: [PATCH] BUG/MINOR: promex: don't count servers in maintenance

2023-09-11 Thread Christopher Faulet

Le 11/09/2023 à 15:09, Cedric Paillet a écrit :

The state of servers that were put in maintenance via the runtime API are
reported within the "backend_agg_check_status" metric, which
lead to inconsistent sums when compared to the "haproxy_server_check_status"
metric.

Now excluding them from this computation.
---
  addons/promex/service-prometheus.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/addons/promex/service-prometheus.c 
b/addons/promex/service-prometheus.c
index e02e8c0c7..7b6081fab 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -863,9 +863,12 @@ static int promex_dump_back_metrics(struct appctx *appctx, 
struct htx *htx)
goto next_px;
sv = px->srv;
while (sv) {
+   if ((ctx->flags & PROMEX_FL_NO_MAINT_SRV) 
&& (sv->cur_admin & SRV_ADMF_MAINT))
+   goto next_sv;
srv_check_status = 
sv->check.status;

srv_check_count[srv_check_status] += 1;
-   sv = sv->next;
+   next_sv:
+   sv = sv->next;
}
for (; ctx->obj_state < HCHK_STATUS_SIZE; 
ctx->obj_state++) {
if 
(get_check_status_result(ctx->obj_state) < CHK_RES_FAILED)


Thanks Cedric, I'll merge it ASAP.

--
Christopher Faulet




[PATCH] BUG/MINOR: promex: don't count servers in maintenance

2023-09-11 Thread Cedric Paillet
The state of servers that were put in maintenance via the runtime API are
reported within the "backend_agg_check_status" metric, which
lead to inconsistent sums when compared to the "haproxy_server_check_status"
metric.

Now excluding them from this computation.
---
 addons/promex/service-prometheus.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/addons/promex/service-prometheus.c 
b/addons/promex/service-prometheus.c
index e02e8c0c7..7b6081fab 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -863,9 +863,12 @@ static int promex_dump_back_metrics(struct appctx *appctx, 
struct htx *htx)
goto next_px;
sv = px->srv;
while (sv) {
+   if ((ctx->flags & 
PROMEX_FL_NO_MAINT_SRV) && (sv->cur_admin & SRV_ADMF_MAINT))
+   goto next_sv;
srv_check_status = 
sv->check.status;

srv_check_count[srv_check_status] += 1;
-   sv = sv->next;
+   next_sv:
+   sv = sv->next;
}
for (; ctx->obj_state < 
HCHK_STATUS_SIZE; ctx->obj_state++) {
if 
(get_check_status_result(ctx->obj_state) < CHK_RES_FAILED)
-- 
2.25.1



RE: [PATCH] BUG/MINOR: promex: don't count servers in maintenance

2023-09-07 Thread Cedric Paillet
> I guess you mean "backend_add_check_status". Because 
> "backend_agg_server_check_status" is a deprecated metric. It was replaced by 
> "backend_agg_server_status".
Yes, sorry (and I should know, it's me who deprecated this value! 😊)"
In this patch both will be up updated, as it's the same code !
I will update the commit message.
(we can deprecate it in 2.9 ? do you want a patch ?)

> Here we should check for PROMEX_FL_NO_MAINT_SRV promex flag, no ? I mean:
Yes, sure.

>And I guess we should also check the healthchecks are enabled for the server. 
>It is not really an issue because call to get_check_status_result() will 
>exclude neutral and unknown satuses. But there is no reason to count these 
>servers.

What we observed is that the health check of servers that have been put into 
maintenance is no longer updated, and the status returned is the last known 
one. I need to double-check, but I believe we even saw L7OK when putting a "UP" 
server into maintenance. (What I'm sure of is that the majority of servers in 
maintenance were in L7STS and not in UNK).

If we add the PROMEX_FL_NO_MAINT_SR (which makes sense), we will continue to 
display an incorrect backend_agg_server_status (and also 
haproxy_server_check_status) for servers in maintenance for those who don't set 
(no-maint=empty), and we probably then need another patch so that the status of 
these servers is HCHK_STATUS_UNKNOWN?"

Cédric



Re: [PATCH] BUG/MINOR: promex: don't count servers in maintenance

2023-09-07 Thread Christopher Faulet

Hi,

Thanks Cedric and sorry for the delay. I have few comments.

Le 01/09/2023 à 08:22, Cedric Paillet a écrit :

The state of servers that were put in maintenance via the runtime API are
reported within the "backend_agg_server_check_status" metric, which
I guess you mean "backend_add_check_status". Because 
"backend_agg_server_check_status" is a deprecated metric. It was replaced by 
"backend_agg_server_status".



lead to inconsistent sums when compared to the "haproxy_server_check_status"
metric.

Now excluding them from this computation.
---
  addons/promex/service-prometheus.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/addons/promex/service-prometheus.c 
b/addons/promex/service-prometheus.c
index e02e8c0c7..7b61683dd 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -863,9 +863,12 @@ static int promex_dump_back_metrics(struct appctx *appctx, 
struct htx *htx)
goto next_px;
sv = px->srv;
while (sv) {
+   if (sv->cur_admin & 
SRV_ADMF_MAINT)
+   goto next_sv;


Here we should check for PROMEX_FL_NO_MAINT_SRV promex flag, no ? I mean:

  if ((ctx->flags & PROMEX_FL_NO_MAINT_SRV) && (sv->cur_admin & SRV_ADMF_MAINT))
  goto next_sv;

And I guess we should also check the healthchecks are enabled for the server. It 
is not really an issue because call to get_check_status_result() will exclude 
neutral and unknown satuses. But there is no reason to count these servers.



srv_check_status = 
sv->check.status;

srv_check_count[srv_check_status] += 1;
-   sv = sv->next;
+   next_sv:
+   sv = sv->next;
}
for (; ctx->obj_state < HCHK_STATUS_SIZE; 
ctx->obj_state++) {
if 
(get_check_status_result(ctx->obj_state) < CHK_RES_FAILED)


Otherwise, I'm ok with the fix.

--
Christopher Faulet




[PATCH] BUG/MINOR: promex: don't count servers in maintenance

2023-08-31 Thread Cedric Paillet
The state of servers that were put in maintenance via the runtime API are
reported within the "backend_agg_server_check_status" metric, which
lead to inconsistent sums when compared to the "haproxy_server_check_status"
metric.

Now excluding them from this computation.
---
 addons/promex/service-prometheus.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/addons/promex/service-prometheus.c 
b/addons/promex/service-prometheus.c
index e02e8c0c7..7b61683dd 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -863,9 +863,12 @@ static int promex_dump_back_metrics(struct appctx *appctx, 
struct htx *htx)
goto next_px;
sv = px->srv;
while (sv) {
+   if (sv->cur_admin & 
SRV_ADMF_MAINT)
+   goto next_sv;
srv_check_status = 
sv->check.status;

srv_check_count[srv_check_status] += 1;
-   sv = sv->next;
+   next_sv:
+   sv = sv->next;
}
for (; ctx->obj_state < 
HCHK_STATUS_SIZE; ctx->obj_state++) {
if 
(get_check_status_result(ctx->obj_state) < CHK_RES_FAILED)
-- 
2.25.1