[PATCH] BUG/MINOR: promex: fix backend_agg_check_status
When a server is in maintenance, the check.status is no longer updated.
Therefore, we shouldn't consider check.status if the checks are not
active. This check was properly implemented in the
haproxy_server_check_status metric, but wasn't carried over to
backend_agg_check_status, which introduced inconsistencies between the
two metrics.
---
 addons/promex/service-prometheus.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
index e02e8c0c7..6885d202e 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -863,8 +863,10 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
 					goto next_px;
 				sv = px->srv;
 				while (sv) {
-					srv_check_status = sv->check.status;
-					srv_check_count[srv_check_status] += 1;
+					if ((sv->check.state & (CHK_ST_ENABLED|CHK_ST_PAUSED)) == CHK_ST_ENABLED) {
+						srv_check_status = sv->check.status;
+						srv_check_count[srv_check_status] += 1;
+					}
 					sv = sv->next;
 				}
 				for (; ctx->obj_state < HCHK_STATUS_SIZE; ctx->obj_state++) {
--
2.25.1
RE: [PATCH] BUG/MINOR: promex: don't count servers in maintenance
Hi,

Sorry, I now understand my problem better. When a server is in
maintenance, the health checks are not performed and sv->check.state ==
CHK_ST_PAUSED. This is checked in the haproxy_server_check_status metric
but not in backend_agg_check_status. I will check my new patch and push
it afterwards.

Thank you very much.

-----Original Message-----
From: Christopher Faulet
Sent: Monday, September 11, 2023 20:15
To: Cedric Paillet; haproxy@formilux.org
Subject: Re: [PATCH] BUG/MINOR: promex: don't count servers in maintenance

On 07/09/2023 at 16:50, Cedric Paillet wrote:
>
>> And I guess we should also check that the healthchecks are enabled for
>> the server. It is not really an issue because the call to
>> get_check_status_result() will exclude neutral and unknown statuses.
>> But there is no reason to count these servers.
>
> What we observed is that the health check of servers that have been put
> into maintenance is no longer updated, and the status returned is the
> last known one. I need to double-check, but I believe we even saw L7OK
> when putting an "UP" server into maintenance. (What I'm sure of is that
> the majority of servers in maintenance were in L7STS and not in UNK.)
>
> If we add the PROMEX_FL_NO_MAINT_SRV flag (which makes sense), we will
> continue to display an incorrect backend_agg_server_status (and also
> haproxy_server_check_status) for servers in maintenance for those who
> don't set no-maint=empty, and we probably then need another patch so
> that the status of these servers is HCHK_STATUS_UNKNOWN?
>
Health-checks for servers in maintenance are paused. So indeed, the last
known status does not change anymore in this state. My purpose here was
to also filter servers to only count those with health-checks enabled and
running. When the server's metrics are dumped, the check status is
already skipped for servers in maintenance. Thus it seems logical to not
count them for the aggregated metric. In fact, this way, all servers in
maintenance are skipped without checking the server's admin state. But it
is probably cleaner to keep both checks for consistency. Except if I
missed something. This part is not really clear to me anymore...

--
Christopher Faulet
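The fix this exchange converged on (the patch at the top of the thread)
keys off haproxy's CHK_ST_* bit flags. A minimal standalone sketch of
that flag test, with stand-in flag values (the real ones live in
include/haproxy/check-t.h and may differ):

    #include <stdio.h>

    /* Stand-ins for haproxy's check-state flags; values are illustrative,
     * not copied from check-t.h. */
    #define CHK_ST_ENABLED 0x0004  /* checks are configured and enabled */
    #define CHK_ST_PAUSED  0x0008  /* checks paused, e.g. server in maintenance */

    /* Masking with both bits and comparing against CHK_ST_ENABLED keeps
     * only servers whose checks are enabled AND not paused, so a server
     * in maintenance (ENABLED|PAUSED) is rejected. */
    static int check_is_running(unsigned int check_state)
    {
        return (check_state & (CHK_ST_ENABLED | CHK_ST_PAUSED)) == CHK_ST_ENABLED;
    }

    int main(void)
    {
        printf("%d\n", check_is_running(CHK_ST_ENABLED));                 /* 1 */
        printf("%d\n", check_is_running(CHK_ST_ENABLED | CHK_ST_PAUSED)); /* 0 */
        printf("%d\n", check_is_running(0));                              /* 0 */
        return 0;
    }

This is also why the final patch tests the check state rather than
sv->cur_admin: a paused check covers the maintenance case without
consulting the admin flags, as noted in the reply above.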
[PATCH] MINOR: promex: Introduce 'FULL' status for srv_state metric
This adds a new 'FULL' status to the Prometheus metric 'srv_state'. It
helps identify servers that have exceeded their maxconn limit and cannot
accept new connections.

Also rename server_has_room() to server_is_full(), inverting its logic,
to match what's used at a few places in the doc in association with
servers or backends being "full".
---
 addons/promex/service-prometheus.c | 7 +++++++
 include/haproxy/queue.h            | 6 +++---
 src/backend.c                      | 2 +-
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
index 7b6081fab..40ec00085 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -380,6 +380,7 @@ enum promex_srv_state {
 	PROMEX_SRV_STATE_UP,
 	PROMEX_SRV_STATE_MAINT,
 	PROMEX_SRV_STATE_DRAIN,
+	PROMEX_SRV_STATE_FULL,
 	PROMEX_SRV_STATE_NOLB,
 
 	PROMEX_SRV_STATE_COUNT /* must be last */
@@ -390,10 +391,13 @@ const struct ist promex_srv_st[PROMEX_SRV_STATE_COUNT] = {
 	[PROMEX_SRV_STATE_UP]    = IST("UP"),
 	[PROMEX_SRV_STATE_MAINT] = IST("MAINT"),
 	[PROMEX_SRV_STATE_DRAIN] = IST("DRAIN"),
+	[PROMEX_SRV_STATE_FULL]  = IST("FULL"),
 	[PROMEX_SRV_STATE_NOLB]  = IST("NOLB"),
 };
 
 /* Return the server status. */
+
+
 enum promex_srv_state promex_srv_status(struct server *sv)
 {
 	int state = PROMEX_SRV_STATE_DOWN;
@@ -402,6 +406,9 @@ enum promex_srv_state promex_srv_status(struct server *sv)
 		state = PROMEX_SRV_STATE_UP;
 		if (sv->cur_admin & SRV_ADMF_DRAIN)
 			state = PROMEX_SRV_STATE_DRAIN;
+		if (server_is_full(sv))
+			state = PROMEX_SRV_STATE_FULL;
+
 	}
 	else if (sv->cur_state == SRV_ST_STOPPING)
 		state = PROMEX_SRV_STATE_NOLB;

diff --git a/include/haproxy/queue.h b/include/haproxy/queue.h
index e77370cdd..2817b6c8b 100644
--- a/include/haproxy/queue.h
+++ b/include/haproxy/queue.h
@@ -77,9 +77,9 @@ static inline void pendconn_free(struct stream *s)
 	}
 }
 
-/* Returns 0 if all slots are full on a server, or 1 if there are slots available. */
-static inline int server_has_room(const struct server *s) {
-	return !s->maxconn || s->cur_sess < srv_dynamic_maxconn(s);
+/* Returns 1 if all slots are full on a server, or 0 if there are slots available. */
+static inline int server_is_full(const struct server *s) {
+	return s->maxconn && s->cur_sess >= srv_dynamic_maxconn(s);
 }
 
 /* returns 0 if nothing has to be done for server regarding queued connections,

diff --git a/src/backend.c b/src/backend.c
index 2f9da87ab..9b9e63649 100644
--- a/src/backend.c
+++ b/src/backend.c
@@ -658,7 +658,7 @@ int assign_server(struct stream *s)
 		if (tmpsrv && tmpsrv->proxy == s->be &&
 		    ((s->sess->flags & SESS_FL_PREFER_LAST) ||
 		     (!s->be->max_ka_queue ||
-		      server_has_room(tmpsrv) || (
+		      !server_is_full(tmpsrv) || (
 		      tmpsrv->queue.length + 1 < s->be->max_ka_queue))) &&
 		    srv_currently_usable(tmpsrv)) {
 			list_for_each_entry(conn, &srv_list->conn_list, session_list) {
--
2.25.1
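Since the rename inverts the predicate's polarity, a quick property check
helps convince oneself that no call site changes meaning beyond the added
negation. A small self-contained sketch (fake server struct and a stubbed
srv_dynamic_maxconn, not the real haproxy types):

    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Minimal stand-in for the two fields the predicate reads; the real
     * code calls srv_dynamic_maxconn(), stubbed here as a plain field read. */
    struct fake_server {
        int maxconn;   /* 0 means "no limit" */
        int cur_sess;
    };

    static int dyn_maxconn(const struct fake_server *s) { return s->maxconn; }

    /* old predicate, as removed from queue.h */
    static int server_has_room(const struct fake_server *s) {
        return !s->maxconn || s->cur_sess < dyn_maxconn(s);
    }

    /* new predicate, as introduced by the patch */
    static int server_is_full(const struct fake_server *s) {
        return s->maxconn && s->cur_sess >= dyn_maxconn(s);
    }

    int main(void)
    {
        /* no limit, below, at, and above the limit */
        struct fake_server cases[] = { {0, 50}, {100, 99}, {100, 100}, {100, 150} };
        for (size_t i = 0; i < sizeof(cases) / sizeof(cases[0]); i++)
            assert(server_has_room(&cases[i]) == !server_is_full(&cases[i]));
        puts("server_is_full is the exact complement of server_has_room");
        return 0;
    }

The two bodies are De Morgan duals of each other, so the backend.c call
site only needs the "!" added by the patch.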
[PATCH] BUG/MINOR: promex: don't count servers in maintenance
The state of servers that were put in maintenance via the runtime API is
reported within the "backend_agg_check_status" metric, which leads to
inconsistent sums when compared to the "haproxy_server_check_status"
metric. Now excluding them from this computation.
---
 addons/promex/service-prometheus.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
index e02e8c0c7..7b6081fab 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -863,9 +863,12 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
 					goto next_px;
 				sv = px->srv;
 				while (sv) {
+					if ((ctx->flags & PROMEX_FL_NO_MAINT_SRV) && (sv->cur_admin & SRV_ADMF_MAINT))
+						goto next_sv;
 					srv_check_status = sv->check.status;
 					srv_check_count[srv_check_status] += 1;
-					sv = sv->next;
+				  next_sv:
+					sv = sv->next;
 				}
 				for (; ctx->obj_state < HCHK_STATUS_SIZE; ctx->obj_state++) {
 					if (get_check_status_result(ctx->obj_state) < CHK_RES_FAILED)
--
2.25.1
RE: [PATCH] BUG/MINOR: promex: don't count servers in maintenance
> I guess you mean "backend_agg_check_status". Because
> "backend_agg_server_check_status" is a deprecated metric. It was
> replaced by "backend_agg_server_status".

Yes, sorry (and I should know, it's me who deprecated this value! 😊)
In this patch both will be updated, as it's the same code! I will update
the commit message. (We can deprecate it in 2.9? Do you want a patch?)

> Here we should check for the PROMEX_FL_NO_MAINT_SRV promex flag, no? I mean:

Yes, sure.

> And I guess we should also check that the healthchecks are enabled for
> the server. It is not really an issue because the call to
> get_check_status_result() will exclude neutral and unknown statuses.
> But there is no reason to count these servers.

What we observed is that the health check of servers that have been put
into maintenance is no longer updated, and the status returned is the
last known one. I need to double-check, but I believe we even saw L7OK
when putting an "UP" server into maintenance. (What I'm sure of is that
the majority of servers in maintenance were in L7STS and not in UNK.)

If we add the PROMEX_FL_NO_MAINT_SRV flag (which makes sense), we will
continue to display an incorrect backend_agg_server_status (and also
haproxy_server_check_status) for servers in maintenance for those who
don't set no-maint=empty, and we probably then need another patch so that
the status of these servers is HCHK_STATUS_UNKNOWN?

Cédric
RE: [PATCH 0/1] Introduce 'NOROOM' status for srv_state metric
Hi,

Thanks for your feedback.

> Indeed, so actually you found a pretty valid use case to the edge case
> I was seeing as rare enough not to be of much interest :-)

Just to provide some background, which might help in understanding our
needs: in our real-life scenarios, servers do not have uniform
performance. Our haproxy "servers", in our context, are typically
containers. These containers can reside on very diverse hardware,
depending on the age of the server. We attempt to mitigate these
discrepancies with server weights, but that's not always sufficient; not
all servers respond at the same speed. The performance can also be
compromised by "noisy" neighbors co-hosted on the same server. Sometimes
we even have instances that fail because the underlying code is poorly
written. I don't know if it's common with others, but when a backend
starts to become overloaded, it happens gradually, and during the initial
days, the first symptoms are a few servers becoming FULL during peak
hours.

> Yes that's it. For this specific use case, watching the queue is much
> simpler and provides more than just a boolean

Your discussion about queues is very interesting, and I believe we will
incorporate those into our metrics. My idea is not to present a boolean,
but a percentage of FULL servers by backend. We believe it's more
illustrative to see that around 7 a.m., all servers are functioning, but
by 7:10 a.m., 10% of the servers begin to be "FULL" due to the increase
in traffic. By 7:30 a.m., it's 50%, and by 8 a.m., it's 100%. We think
this curve is simpler to grasp than just seeing, at 8 a.m., a queue
beginning to grow, while all these servers still seem OK
(haproxy_backend_agg_check_status=L7OK, and
haproxy_backend_agg_server_status="UP").

Currently, we compute this metric using this Prometheus calculation:

  count by (proxy, instance) (haproxy_server_current_sessions > 180)

(even though we have a maxconn set at 200). This calculation has several
issues:
- maxconn cannot be retrieved; it has to be hardcoded in Prometheus.
- it forces us to use "/metrics?scope=server", which scrapes all server
  metrics, something we're trying to avoid, because it consumes a lot of
  resources to aggregate on Prometheus.

With my patch, we can achieve the same with

  haproxy_backend_agg_server_status{state="FULL"}

just by using /metrics?scope=backend.

The main question I'm asking myself is whether my calculation method will
work, because a server that starts to saturate will "oscillate" between
FULL and OK. We risk missing it, not necessarily capturing the value at
the right time.

Cedric

-----Original Message-----
From: Willy Tarreau
Sent: Wednesday, September 6, 2023 10:36
To: Cedric Paillet
Cc: haproxy@formilux.org
Subject: Re: [PATCH 0/1] Introduce 'NOROOM' status for srv_state metric

Hi Cedric,

On Tue, Sep 05, 2023 at 01:40:14PM +0000, Cedric Paillet wrote:
> We are using Prometheus to provide feedback to our users about the
> status of backend servers. Currently, we have no means of informing
> them if a server exceeds the maxconn limit, and consequently why it's
> no longer receiving new requests.
>
> Therefore, we would like to be able to display when a server surpasses
> the maxconn limit and is in the "noroom" state. I have prepared a
> patch specifically for Prometheus, but it might be better to include a
> boolean directly in the server structure indicating whether the server
> was considered to have no room the last time server_has_room was
> called. However, this change seems to have a significant impact on
> other parts of the code.

I think that it might be more suitable to use the term "FULL" that we
already use for the listeners, and that also matches what's used at a few
places in the doc in association with servers or backends being "full".

Also, a more accurate metric that is generally watched is the queue (both
server and backend): instead of being a boolean, it directly indicates
how many additional servers are needed to improve the processing time.
Persistent requests are placed into the server's queue but all other
requests go into the backend queue. So if you have a total capacity of
4 servers * 100 connections = 400 outstanding requests, and you see a
queue of 200, you know that you'd need two extra servers to process these
without queuing, and you can even predict that the total processing time
will decrease by 200 * the average queue time, so this allows to even add
just the required number of servers to keep response time below a certain
limit.

The only case where a server will be full without having any queue is
when the total number of outstanding requests on a backend is exactly
equal to the sum of the servers' maxconn values, so as you see, the extra
metric covers very little compared to the queue itself. But there might
be good use cases for this, I'm not denying it, I just wanted to make
sure that you're going to monitor what's really relevant for your use
case ;-)

Willy
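Willy's sizing rule from the quoted paragraph can be written down
directly. A rough sketch under the stated assumptions (uniform maxconn
across servers; extra_servers_needed is a hypothetical helper, not a
haproxy API):

    #include <stdio.h>

    /* Each extra server absorbs up to `maxconn` queued requests, so the
     * number of servers to add is the ceiling of queue_len / maxconn. */
    static unsigned int extra_servers_needed(unsigned int queue_len, unsigned int maxconn)
    {
        if (!maxconn)
            return 0; /* no per-server limit: the queue is not capacity-bound */
        return (queue_len + maxconn - 1) / maxconn; /* ceiling division */
    }

    int main(void)
    {
        /* Willy's example: 4 servers * 100 maxconn = 400 outstanding,
         * backend queue of 200 -> 2 extra servers */
        printf("%u extra servers\n", extra_servers_needed(200, 100));
        return 0;
    }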
RE: [PATCH 0/1] Introduce 'NOROOM' status for srv_state metric
Thanks for your review.

> I think that it might be more suitable to use the term "FULL"

OK, no problem with that. (Perhaps we can also rename server_has_room to
server_is_full, inverting its logic?)

> Also, a more accurate metric that is generally watched is the queue
> (both server and backend):

My first use case is to detect/display one or more server(s) with
problems, not to determine if the entire backend is undersized. As I
understand it, if just one server in the pool is very slow (and the load
balancing method is round-robin), the number of sessions on this server
will increase until it reaches maxconn. At this juncture, the server will
no longer be selected, and requests will be routed to the other servers.
Then, no queue (either backend or server) will start to fill up, correct?
But the slow server will cease receiving requests until its session count
drops below maxconn, right?

The second use case, as you've outlined, is to detect if a backend is
undersized. My understanding is that if the backend is "nearly"
undersized, the first symptom will be some servers reporting "FULL". Only
when ALL servers are reporting "FULL" will the backend queue start to
grow, correct?

Cédric

-----Original Message-----
From: Willy Tarreau
Sent: Wednesday, September 6, 2023 10:36
To: Cedric Paillet
Cc: haproxy@formilux.org
Subject: Re: [PATCH 0/1] Introduce 'NOROOM' status for srv_state metric

Hi Cedric,

On Tue, Sep 05, 2023 at 01:40:14PM +0000, Cedric Paillet wrote:
> We are using Prometheus to provide feedback to our users about the
> status of backend servers. Currently, we have no means of informing
> them if a server exceeds the maxconn limit, and consequently why it's
> no longer receiving new requests.
>
> Therefore, we would like to be able to display when a server surpasses
> the maxconn limit and is in the "noroom" state. I have prepared a
> patch specifically for Prometheus, but it might be better to include a
> boolean directly in the server structure indicating whether the server
> was considered to have no room the last time server_has_room was
> called. However, this change seems to have a significant impact on
> other parts of the code.

I think that it might be more suitable to use the term "FULL" that we
already use for the listeners, and that also matches what's used at a few
places in the doc in association with servers or backends being "full".

Also, a more accurate metric that is generally watched is the queue (both
server and backend): instead of being a boolean, it directly indicates
how many additional servers are needed to improve the processing time.
Persistent requests are placed into the server's queue but all other
requests go into the backend queue. So if you have a total capacity of
4 servers * 100 connections = 400 outstanding requests, and you see a
queue of 200, you know that you'd need two extra servers to process these
without queuing, and you can even predict that the total processing time
will decrease by 200 * the average queue time, so this allows to even add
just the required number of servers to keep response time below a certain
limit.

The only case where a server will be full without having any queue is
when the total number of outstanding requests on a backend is exactly
equal to the sum of the servers' maxconn values, so as you see, the extra
metric covers very little compared to the queue itself. But there might
be good use cases for this, I'm not denying it, I just wanted to make
sure that you're going to monitor what's really relevant for your use
case ;-)

Willy
[PATCH 1/1] MINOR: promex: Introduce 'NOROOM' status for srv_state metric
This adds a new 'noroom' status to the Prometheus metric 'srv_state'. It
helps identify servers that have exceeded their maxconn limit and cannot
accept new connections.
---
 addons/promex/service-prometheus.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
index 7b61683dd..4551009d5 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -380,6 +380,7 @@ enum promex_srv_state {
 	PROMEX_SRV_STATE_UP,
 	PROMEX_SRV_STATE_MAINT,
 	PROMEX_SRV_STATE_DRAIN,
+	PROMEX_SRV_STATE_NOROOM,
 	PROMEX_SRV_STATE_NOLB,
 
 	PROMEX_SRV_STATE_COUNT /* must be last */
@@ -390,10 +391,13 @@ const struct ist promex_srv_st[PROMEX_SRV_STATE_COUNT] = {
 	[PROMEX_SRV_STATE_UP]     = IST("UP"),
 	[PROMEX_SRV_STATE_MAINT]  = IST("MAINT"),
 	[PROMEX_SRV_STATE_DRAIN]  = IST("DRAIN"),
+	[PROMEX_SRV_STATE_NOROOM] = IST("NOROOM"),
 	[PROMEX_SRV_STATE_NOLB]   = IST("NOLB"),
 };
 
 /* Return the server status. */
+
+
 enum promex_srv_state promex_srv_status(struct server *sv)
 {
 	int state = PROMEX_SRV_STATE_DOWN;
@@ -402,6 +406,9 @@ enum promex_srv_state promex_srv_status(struct server *sv)
 		state = PROMEX_SRV_STATE_UP;
 		if (sv->cur_admin & SRV_ADMF_DRAIN)
 			state = PROMEX_SRV_STATE_DRAIN;
+		if (!server_has_room(sv))
+			state = PROMEX_SRV_STATE_NOROOM;
+
 	}
 	else if (sv->cur_state == SRV_ST_STOPPING)
 		state = PROMEX_SRV_STATE_NOLB;
--
2.25.1
[PATCH 0/1] Introduce 'NOROOM' status for srv_state metric
We are using Prometheus to provide feedback to our users about the status
of backend servers. Currently, we have no means of informing them if a
server exceeds the maxconn limit, and consequently why it's no longer
receiving new requests.

Therefore, we would like to be able to display when a server surpasses
the maxconn limit and is in the "noroom" state. I have prepared a patch
specifically for Prometheus, but it might be better to include a boolean
directly in the server structure indicating whether the server was
considered to have no room the last time server_has_room was called.
However, this change seems to have a significant impact on other parts of
the code.

Cedric Paillet (1):
  MINOR: promex: Introduce 'NOROOM' status for srv_state metric

 addons/promex/service-prometheus.c | 7 +++++++
 1 file changed, 7 insertions(+)

--
2.25.1
[PATCH] BUG/MINOR: promex: don't count servers in maintenance
The state of servers that were put in maintenance via the runtime API is
reported within the "backend_agg_server_check_status" metric, which leads
to inconsistent sums when compared to the "haproxy_server_check_status"
metric. Now excluding them from this computation.
---
 addons/promex/service-prometheus.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
index e02e8c0c7..7b61683dd 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -863,9 +863,12 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
 					goto next_px;
 				sv = px->srv;
 				while (sv) {
+					if (sv->cur_admin & SRV_ADMF_MAINT)
+						goto next_sv;
 					srv_check_status = sv->check.status;
 					srv_check_count[srv_check_status] += 1;
-					sv = sv->next;
+				  next_sv:
+					sv = sv->next;
 				}
 				for (; ctx->obj_state < HCHK_STATUS_SIZE; ctx->obj_state++) {
 					if (get_check_status_result(ctx->obj_state) < CHK_RES_FAILED)
--
2.25.1
[PATCH 1/2] BUG/MINOR: promex: create haproxy_backend_agg_server_status
haproxy_backend_agg_server_check_status currently aggregates
haproxy_server_status instead of haproxy_server_check_status. We
deprecate it and create a new metric, haproxy_backend_agg_server_status,
to clarify what it really does.
---
 addons/promex/service-prometheus.c | 4 +++-
 include/haproxy/stats-t.h          | 1 +
 src/stats.c                        | 6 ++++--
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
index b54695c17..9330a860c 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -302,6 +302,7 @@ const struct promex_metric promex_st_metrics[ST_F_TOTAL_FIELDS] = {
 	[ST_F_NEED_CONN_EST]        = { .n = IST("need_connections_current"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_SRV_METRIC) },
 	[ST_F_UWEIGHT]              = { .n = IST("uweight"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC | PROMEX_FL_SRV_METRIC) },
 	[ST_F_AGG_SRV_CHECK_STATUS] = { .n = IST("agg_server_check_status"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC) },
+	[ST_F_AGG_SRV_STATUS]       = { .n = IST("agg_server_status"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC) },
 };
 
 /* Description of overridden stats fields */
@@ -833,7 +834,8 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
 				return -1;
 
 			switch (ctx->field_num) {
-				case ST_F_AGG_SRV_CHECK_STATUS:
+				case ST_F_AGG_SRV_CHECK_STATUS: // DEPRECATED
+				case ST_F_AGG_SRV_STATUS:
 					if (!px->srv)
 						goto next_px;
 					sv = px->srv;

diff --git a/include/haproxy/stats-t.h b/include/haproxy/stats-t.h
index efa0ac383..253fb6c48 100644
--- a/include/haproxy/stats-t.h
+++ b/include/haproxy/stats-t.h
@@ -455,6 +455,7 @@ enum stat_field {
 	ST_F_USED_CONN_CUR,
 	ST_F_NEED_CONN_EST,
 	ST_F_UWEIGHT,
+	ST_F_AGG_SRV_STATUS,
 	ST_F_AGG_SRV_CHECK_STATUS,
 
 	/* must always be the last one */

diff --git a/src/stats.c b/src/stats.c
index 8ed5a7133..c15bc27ec 100644
--- a/src/stats.c
+++ b/src/stats.c
@@ -259,7 +259,8 @@ const struct name_desc stat_fields[ST_F_TOTAL_FIELDS] = {
 	[ST_F_USED_CONN_CUR] = { .name = "used_conn_cur", .desc = "Current number of connections in use" },
 	[ST_F_NEED_CONN_EST] = { .name = "need_conn_est", .desc = "Estimated needed number of connections" },
 	[ST_F_UWEIGHT]       = { .name = "uweight", .desc = "Server's user weight, or sum of active servers' user weights for a backend" },
-	[ST_F_AGG_SRV_CHECK_STATUS] = { .name = "agg_server_check_status", .desc = "Backend's aggregated gauge of servers' state check status" },
+	[ST_F_AGG_SRV_CHECK_STATUS] = { .name = "agg_server_check_status", .desc = "[DEPRECATED] Backend's aggregated gauge of servers' status" },
+	[ST_F_AGG_SRV_STATUS]       = { .name = "agg_server_status", .desc = "Backend's aggregated gauge of servers' status" },
 };
 
 /* one line of info */
@@ -2673,7 +2674,8 @@ int stats_fill_be_stats(struct proxy *px, int flags, struct field *stats, int le
 			chunk_appendf(out, " (%d/%d)", nbup, nbsrv);
 			metric = mkf_str(FO_STATUS, fld);
 			break;
-		case ST_F_AGG_SRV_CHECK_STATUS:
+		case ST_F_AGG_SRV_CHECK_STATUS: // DEPRECATED
+		case ST_F_AGG_SRV_STATUS:
 			metric = mkf_u32(FN_GAUGE, 0);
 			break;
 		case ST_F_WEIGHT:
--
2.25.1
[PATCH 2/2] MINOR: promex: introduce haproxy_backend_agg_check_status
This patch introduces the haproxy_backend_agg_check_status metric as we
wanted in 42d7c402d but with the right data source.
---
 addons/promex/service-prometheus.c | 26 ++++++++++++++++++++++++++
 include/haproxy/stats-t.h          |  1 +
 src/stats.c                        |  4 ++++
 3 files changed, 31 insertions(+)

diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
index 9330a860c..5622b8039 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -303,6 +303,7 @@ const struct promex_metric promex_st_metrics[ST_F_TOTAL_FIELDS] = {
 	[ST_F_UWEIGHT]              = { .n = IST("uweight"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC | PROMEX_FL_SRV_METRIC) },
 	[ST_F_AGG_SRV_CHECK_STATUS] = { .n = IST("agg_server_check_status"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC) },
 	[ST_F_AGG_SRV_STATUS]       = { .n = IST("agg_server_status"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC) },
+	[ST_F_AGG_CHECK_STATUS]     = { .n = IST("agg_check_status"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC) },
 };
 
 /* Description of overridden stats fields */
@@ -812,6 +813,7 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
 	double secs;
 	enum promex_back_state bkd_state;
 	enum promex_srv_state srv_state;
+	enum healthcheck_status srv_check_status;
 
 	for (;ctx->field_num < ST_F_TOTAL_FIELDS; ctx->field_num++) {
 		if (!(promex_st_metrics[ctx->field_num].flags & ctx->flags))
@@ -820,6 +822,8 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
 		while (ctx->px) {
 			struct promex_label labels[PROMEX_MAX_LABELS-1] = {};
 			unsigned int srv_state_count[PROMEX_SRV_STATE_COUNT] = { 0 };
+			unsigned int srv_check_count[HCHK_STATUS_SIZE] = { 0 };
+			const char *check_state;
 
 			px = ctx->px;
 
@@ -854,6 +858,28 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
 				}
 				ctx->obj_state = 0;
 				goto next_px;
+			case ST_F_AGG_CHECK_STATUS:
+				if (!px->srv)
+					goto next_px;
+				sv = px->srv;
+				while (sv) {
+					srv_check_status = sv->check.status;
+					srv_check_count[srv_check_status] += 1;
+					sv = sv->next;
+				}
+				for (; ctx->obj_state < HCHK_STATUS_SIZE; ctx->obj_state++) {
+					if (get_check_status_result(ctx->obj_state) < CHK_RES_FAILED)
+						continue;
+					val = mkf_u32(FO_STATUS, srv_check_count[ctx->obj_state]);
+					check_state = get_check_status_info(ctx->obj_state);
+					labels[1].name  = ist("state");
+					labels[1].value = ist(check_state);
+					if (!promex_dump_metric(appctx, htx, prefix, &promex_st_metrics[ctx->field_num],
+								&val, labels, &out, max))
+						goto full;
+				}
+				ctx->obj_state = 0;
+				goto next_px;
 			case ST_F_STATUS:
 				bkd_state = ((px->lbprm.tot_weight > 0 || !px->srv) ? 1 : 0);
 				for (; ctx->obj_state < PROMEX_BACK_STATE_COUNT; ctx->obj_state++) {

diff --git a/include/haproxy/stats-t.h b/include/haproxy/stats-t.h
index 253fb6c48..c60bf0487 100644
--- a/include/haproxy/stats-t.h
+++ b/include/haproxy/stats-t.h
@@ -457,6 +457,7 @@ enum stat_field {
 	ST_F_UWEIGHT,
 	ST_F_AGG_SRV_STATUS,
 	ST_F_AGG_SRV_CHECK_STATUS,
+	ST_F_AGG_CHECK_STATUS,
 
 	/* must always be the last one */
 	ST_F_TOTAL_FIELDS
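For illustration, with this patch a scrape of /metrics?scope=backend
would expose one gauge per known check status and backend, roughly like
the following (the backend name and counts are made up; the HELP line is
elided since the stats.c hunk above is truncated):

    # TYPE haproxy_backend_agg_check_status gauge
    haproxy_backend_agg_check_status{proxy="be_app",state="L4OK"} 0
    haproxy_backend_agg_check_status{proxy="be_app",state="L7OK"} 3
    haproxy_backend_agg_check_status{proxy="be_app",state="L7STS"} 1

A Prometheus rule can then, for example, sum the non-OK states per
backend without scraping scope=server.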
[PATCH 0/2] BUG/MINOR: promex: create haproxy_backend_agg_check_status
As described in github issue #1312, the first intention of patch
42d7c402d was to aggregate haproxy_server_check_status. But we aggregated
haproxy_server_status instead.

To fix that:
- Deprecate haproxy_backend_agg_server_check_status (modify the metric
  description).
- Create a new haproxy_backend_agg_server_status metric, an aggregation
  of haproxy_server_status (to replace the misnamed
  haproxy_backend_agg_server_check_status).
- Create a new haproxy_backend_agg_check_status metric, an aggregation of
  haproxy_server_check_status.

Cedric Paillet (2):
  BUG/MINOR: promex: create haproxy_backend_agg_server_status
  MINOR: promex: introduce haproxy_backend_agg_check_status

 addons/promex/service-prometheus.c | 30 +++++++++++++++++++++++++++++-
 include/haproxy/stats-t.h          |  2 ++
 src/stats.c                        | 10 ++++++++--
 3 files changed, 39 insertions(+), 3 deletions(-)

--
2.25.1
RE: [PATCH 0/2] BUG/MINOR: promex: fix haproxy_backend_agg_server_check_status
> Sorry, indeed I mentioned the enum ID and not the promex name. I
> proposed to keep "haproxy_backend_agg_server_check_status", altering its
> description to announce it is deprecated. And then to add
> "haproxy_backend_agg_check_status" (to aggregate
> haproxy_server_check_status) and "haproxy_backend_agg_server_status" or
> "haproxy_backend_agg_srv_status" (to aggregate haproxy_server_status).

To summarize, is it OK for you?

- Deprecate haproxy_backend_agg_server_check_status (modify the metric
  description).
- Create a new haproxy_backend_agg_server_status metric, an aggregation
  of haproxy_server_status (to replace the misnamed
  haproxy_backend_agg_server_check_status).
- Create a new haproxy_backend_agg_check_status metric, an aggregation of
  haproxy_server_check_status.

This will be merged/backported to 2.5/2.6/2.7/2.8-dev, and
haproxy_backend_agg_server_check_status will be removed in 2.9 (or 2.10).
RE: [PATCH 0/2] BUG/MINOR: promex: fix haproxy_backend_agg_server_check_status
> Sorry for the delay. We were busy because of the 2.7.0 release and then
> I forgot your patches :) And it was a bit late to add it into the 2.7.0.

No problem.

> Because the metric name is indeed badly chosen but we can't break the
> compatibility, especially in an LTS version.

OK, I understand that. We need to find new names for the metrics.

> IMHO, we should keep the current metric and its description should be
> updated to mention it is deprecated. This way, we will be able to remove
> it in the 2.9 or 3.0. This means we should find new names for the
> aggregated servers status and the aggregated check status.
> "ST_F_AGG_SRV_STATUS" is pretty good for the first one. I may propose
> "ST_F_AGG_CHECK_STATUS" for the second one.

It's not entirely clear to me: is the thing to change, to keep
compatibility, the metric name itself? Do you propose to use:

- haproxy_backend_agg_check_status (to aggregate haproxy_server_check_status)
- haproxy_backend_agg_status (to aggregate haproxy_server_status)

or:

- haproxy_backend_agg_srv_check_status (to aggregate haproxy_server_check_status)
- haproxy_backend_agg_srv_status (to aggregate haproxy_server_status)

As a reminder, we currently have:

- haproxy_backend_agg_server_check_status (which aggregates haproxy_server_status)

(and we also have haproxy_backend_status, for the status of the backend
itself)

Cedric Paillet
RE: [PATCH 0/2] BUG/MINOR: promex: fix haproxy_backend_agg_server_check_status
Hi William,

For me it's not a problem to backport until 2.4. Even if it's a breaking
change for haproxy_backend_agg_server_check_status, it's clearer and
makes more sense if it aggregates server_check_status and not
server_status.

Cedric.

-----Original Message-----
From: William Dauchy
Sent: Tuesday, November 29, 2022 12:39
To: Cedric Paillet
Cc: haproxy@formilux.org; Christopher Faulet
Subject: Re: [PATCH 0/2] BUG/MINOR: promex: fix haproxy_backend_agg_server_check_status

On Tue, Nov 29, 2022 at 11:52 AM William Dauchy wrote:
> On Mon, Nov 28, 2022 at 8:46 AM Cedric Paillet wrote:
> > As described in github issue #1312, the first intention of patch
> > 42d7c402d was to aggregate haproxy_server_check_status. But we
> > aggregated haproxy_server_status instead.
> > To fix that, rename haproxy_backend_agg_server_check_status to
> > haproxy_backend_agg_server_status and use the right data source for
> > haproxy_backend_agg_server_check_status.
>
> Thank you for looking into this. This has been on my todo list for too
> long. I am fine with those changes as long as we backport them until
> the version where this was introduced (this was introduced in v2.5 and
> then backported to v2.4 if my checks are correct).
> I understand that's a breaking change; some people who started to build
> tooling on top of that can't deal with a behaviour change between
> versions (I am thinking about dashboards in my case, where we deal with
> different versions). In that regard, I believe it can be OK to have to
> do a one-off change to fix this past mistake. If we are not aligned
> with a backport, we will need to find another name.

I realised I was not necessarily very clear on my position:
- either we merge this patch, and then backport it until 2.4
- or we find a new name for the new metric and don't introduce a
  breaking change

I am mostly worried about metrics collectors that are not able to deal
with different behaviour from one version to another. In my own case, we
have different haproxy versions running, and there is no global way we
can adapt the collector depending on the software version it is
collecting. The collector expects the same behaviour across versions.
Having a breaking change which affects all versions is fine, as it will
require a unique change on our side, but having a situation in the middle
is not acceptable for us.
--
William
[PATCH 1/2] BUG/MINOR: promex: rename haproxy_backend_agg_server_check_status
haproxy_backend_agg_server_check_status currently aggregates
haproxy_server_status instead of haproxy_server_check_status. Rename the
metric to haproxy_backend_agg_server_status to clarify what it really
does.
---
 addons/promex/service-prometheus.c | 4 ++--
 include/haproxy/stats-t.h          | 2 +-
 src/stats.c                        | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
index d27aefaaa..c24f0895c 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -301,7 +301,7 @@ const struct promex_metric promex_st_metrics[ST_F_TOTAL_FIELDS] = {
 	[ST_F_USED_CONN_CUR]        = { .n = IST("used_connections_current"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_SRV_METRIC) },
 	[ST_F_NEED_CONN_EST]        = { .n = IST("need_connections_current"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_SRV_METRIC) },
 	[ST_F_UWEIGHT]              = { .n = IST("uweight"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC | PROMEX_FL_SRV_METRIC) },
-	[ST_F_AGG_SRV_CHECK_STATUS] = { .n = IST("agg_server_check_status"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC) },
+	[ST_F_AGG_SRV_STATUS]       = { .n = IST("agg_server_status"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC) },
 };
 
 /* Description of overridden stats fields */
@@ -833,7 +833,7 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
 				return -1;
 
 			switch (ctx->field_num) {
-				case ST_F_AGG_SRV_CHECK_STATUS:
+				case ST_F_AGG_SRV_STATUS:
 					if (!px->srv)
 						goto next_px;
 					sv = px->srv;

diff --git a/include/haproxy/stats-t.h b/include/haproxy/stats-t.h
index efa0ac383..babea31e1 100644
--- a/include/haproxy/stats-t.h
+++ b/include/haproxy/stats-t.h
@@ -455,7 +455,7 @@ enum stat_field {
 	ST_F_USED_CONN_CUR,
 	ST_F_NEED_CONN_EST,
 	ST_F_UWEIGHT,
-	ST_F_AGG_SRV_CHECK_STATUS,
+	ST_F_AGG_SRV_STATUS,
 
 	/* must always be the last one */
 	ST_F_TOTAL_FIELDS

diff --git a/src/stats.c b/src/stats.c
index 97e7fd865..dd3044b58 100644
--- a/src/stats.c
+++ b/src/stats.c
@@ -259,7 +259,7 @@ const struct name_desc stat_fields[ST_F_TOTAL_FIELDS] = {
 	[ST_F_USED_CONN_CUR] = { .name = "used_conn_cur", .desc = "Current number of connections in use" },
 	[ST_F_NEED_CONN_EST] = { .name = "need_conn_est", .desc = "Estimated needed number of connections" },
 	[ST_F_UWEIGHT]       = { .name = "uweight", .desc = "Server's user weight, or sum of active servers' user weights for a backend" },
-	[ST_F_AGG_SRV_CHECK_STATUS] = { .name = "agg_server_check_status", .desc = "Backend's aggregated gauge of servers' state check status" },
+	[ST_F_AGG_SRV_STATUS]       = { .name = "agg_server_status", .desc = "Backend's aggregated gauge of servers' status" },
 };
 
 /* one line of info */
@@ -2673,7 +2673,7 @@ int stats_fill_be_stats(struct proxy *px, int flags, struct field *stats, int le
 			chunk_appendf(out, " (%d/%d)", nbup, nbsrv);
 			metric = mkf_str(FO_STATUS, fld);
 			break;
-		case ST_F_AGG_SRV_CHECK_STATUS:
+		case ST_F_AGG_SRV_STATUS:
 			metric = mkf_u32(FN_GAUGE, 0);
 			break;
 		case ST_F_WEIGHT:
--
2.25.1
[PATCH 2/2] MINOR: promex: reintroduce haproxy_backend_agg_server_check_status
This patch reintroduces the haproxy_backend_agg_server_check_status
metric as in 42d7c402d but with the right data source.
---
 addons/promex/service-prometheus.c | 26 ++++++++++++++++++++++++++
 include/haproxy/stats-t.h          |  1 +
 src/stats.c                        |  4 ++++
 3 files changed, 31 insertions(+)

diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
index c24f0895c..265ba97c4 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -302,6 +302,7 @@ const struct promex_metric promex_st_metrics[ST_F_TOTAL_FIELDS] = {
 	[ST_F_NEED_CONN_EST]        = { .n = IST("need_connections_current"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_SRV_METRIC) },
 	[ST_F_UWEIGHT]              = { .n = IST("uweight"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC | PROMEX_FL_SRV_METRIC) },
 	[ST_F_AGG_SRV_STATUS]       = { .n = IST("agg_server_status"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC) },
+	[ST_F_AGG_SRV_CHECK_STATUS] = { .n = IST("agg_server_check_status"), .type = PROMEX_MT_GAUGE, .flags = (PROMEX_FL_BACK_METRIC) },
 };
 
 /* Description of overridden stats fields */
@@ -811,6 +812,7 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
 	double secs;
 	enum promex_back_state bkd_state;
 	enum promex_srv_state srv_state;
+	enum healthcheck_status srv_check_status;
 
 	for (;ctx->field_num < ST_F_TOTAL_FIELDS; ctx->field_num++) {
 		if (!(promex_st_metrics[ctx->field_num].flags & ctx->flags))
@@ -819,6 +821,8 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
 		while (ctx->px) {
 			struct promex_label labels[PROMEX_MAX_LABELS-1] = {};
 			unsigned int srv_state_count[PROMEX_SRV_STATE_COUNT] = { 0 };
+			unsigned int srv_check_count[HCHK_STATUS_SIZE] = { 0 };
+			const char *check_state;
 
 			px = ctx->px;
 
@@ -852,6 +856,28 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
 				}
 				ctx->obj_state = 0;
 				goto next_px;
+			case ST_F_AGG_SRV_CHECK_STATUS:
+				if (!px->srv)
+					goto next_px;
+				sv = px->srv;
+				while (sv) {
+					srv_check_status = sv->check.status;
+					srv_check_count[srv_check_status] += 1;
+					sv = sv->next;
+				}
+				for (; ctx->obj_state < HCHK_STATUS_SIZE; ctx->obj_state++) {
+					if (get_check_status_result(ctx->obj_state) < CHK_RES_FAILED)
+						continue;
+					val = mkf_u32(FO_STATUS, srv_check_count[ctx->obj_state]);
+					check_state = get_check_status_info(ctx->obj_state);
+					labels[1].name  = ist("state");
+					labels[1].value = ist(check_state);
+					if (!promex_dump_metric(appctx, htx, prefix, &promex_st_metrics[ctx->field_num],
+								&val, labels, &out, max))
+						goto full;
+				}
+				ctx->obj_state = 0;
+				goto next_px;
 			case ST_F_STATUS:
 				bkd_state = ((px->lbprm.tot_weight > 0 || !px->srv) ? 1 : 0);
 				for (; ctx->obj_state < PROMEX_BACK_STATE_COUNT; ctx->obj_state++) {

diff --git a/include/haproxy/stats-t.h b/include/haproxy/stats-t.h
index babea31e1..66bacd065 100644
--- a/include/haproxy/stats-t.h
+++ b/include/haproxy/stats-t.h
@@ -456,6 +456,7 @@ enum stat_field {
 	ST_F_NEED_CONN_EST,
 	ST_F_UWEIGHT,
 	ST_F_AGG_SRV_STATUS,
+	ST_F_AGG_SRV_CHECK_STATUS,
 
 	/* must always be the last one */
 	ST_F_TOTAL_FIELDS
[PATCH 0/2] BUG/MINOR: promex: fix haproxy_backend_agg_server_check_status
As described in github issue #1312, the first intention of patch
42d7c402d was to aggregate haproxy_server_check_status. But we aggregated
haproxy_server_status instead. To fix that, rename
haproxy_backend_agg_server_check_status to
haproxy_backend_agg_server_status and use the right data source for
haproxy_backend_agg_server_check_status.

Cedric Paillet (2):
  BUG/MINOR: promex: rename haproxy_backend_agg_server_check_status
  MINOR: promex: reintroduce haproxy_backend_agg_server_check_status

 addons/promex/service-prometheus.c | 28 +++++++++++++++++++++++++++-
 include/haproxy/stats-t.h          |  1 +
 src/stats.c                        |  4 ++++
 3 files changed, 32 insertions(+), 1 deletion(-)

--
2.25.1