Re: native prometheus exporter: retrieving check_status
On 15/11/2019 15:55, Pierre Cheynier wrote:
> * it's great to have new metrics (such as
>   `haproxy_process_current_zlib_memory`), but we also noticed that some very
>   useful ones were not present due to their type, example:
>   [ST_F_CHECK_STATUS] = IST("untyped"),

Hi Pierre,

Here is a patch to add haproxy_server_check_status (ST_F_CHECK_STATUS) and
haproxy_server_check_code (ST_F_CHECK_CODE) metrics for each server. Does this
meet your needs?

--
Christopher Faulet

From 8bf04539cb3113769fce9ae1cc87e2b26868f966 Mon Sep 17 00:00:00 2001
From: Christopher Faulet
Date: Thu, 21 Nov 2019 14:35:46 +0100
Subject: [PATCH] MINOR: contrib/prometheus-exporter: Add healthcheck
 status/code in server metrics

ST_F_CHECK_STATUS and ST_F_CHECK_CODE are now part of exported server metrics:

  * haproxy_server_check_status
  * haproxy_server_check_code

The healthcheck status is an integer corresponding to HCHK_STATUS value.
---
 contrib/prometheus-exporter/README           | 29 +++
 .../prometheus-exporter/service-prometheus.c | 22 ++
 2 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/contrib/prometheus-exporter/README b/contrib/prometheus-exporter/README
index b19acc1bd..a9bd5e7a3 100644
--- a/contrib/prometheus-exporter/README
+++ b/contrib/prometheus-exporter/README
@@ -268,6 +268,8 @@ Exported metrics
 | haproxy_server_client_aborts_total     | Total number of data transfers aborted by the client.                |
 | haproxy_server_server_aborts_total     | Total number of data transfers aborted by the server.                |
 | haproxy_server_weight                  | Service weight.                                                      |
+| haproxy_server_check_status            | Status of last health check, if enabled. (see below for the mapping) |
+| haproxy_server_check_code              | layer5-7 code, if available, of the last health check.               |
 | haproxy_server_check_failures_total    | Total number of failed checks (only when the server is up).          |
 | haproxy_server_check_up_down_total     | Total number of UP->DOWN transitions.                                |
 | haproxy_server_downtime_seconds_total  | Total downtime (in seconds) for the service.                         |
@@ -278,3 +280,30 @@ Exported metrics
 | haproxy_server_idle_connections_current | Current number of idle connections available for reuse. |
 | haproxy_server_idle_connections_limit   | Limit on the number of available idle connections.      |
 +---+
+
+Mapping of health check status:
+
+  0 : HCHK_STATUS_UNKNOWN  (Unknown)
+  1 : HCHK_STATUS_INI      (Initializing)
+
+  4 : HCHK_STATUS_HANA     (Health analyze detected enough consecutive errors)
+
+  5 : HCHK_STATUS_SOCKERR  (Socket error)
+
+  6 : HCHK_STATUS_L4OK     (L4 check passed, for example tcp connect)
+  7 : HCHK_STATUS_L4TOUT   (L4 timeout)
+  8 : HCHK_STATUS_L4CON    (L4 connection problem)
+
+  9 : HCHK_STATUS_L6OK     (L6 check passed)
+ 10 : HCHK_STATUS_L6TOUT   (L6 (SSL) timeout)
+ 11 : HCHK_STATUS_L6RSP    (L6 invalid response - protocol error)
+
+ 12 : HCHK_STATUS_L7TOUT   (L7 (HTTP/SMTP) timeout)
+ 13 : HCHK_STATUS_L7RSP    (L7 invalid response - protocol error)
+ 15 : HCHK_STATUS_L7OKD    (L7 check passed)
+ 16 : HCHK_STATUS_L7OKCD   (L7 check conditionally passed)
+ 17 : HCHK_STATUS_L7STS    (L7 response error, for example HTTP 5xx)
+
+ 18 : HCHK_STATUS_PROCERR  (External process check failure)
+ 19 : HCHK_STATUS_PROCTOUT (External process check timeout)
+ 20 : HCHK_STATUS_PROCOK   (External process check passed)
diff --git a/contrib/prometheus-exporter/service-prometheus.c b/contrib/prometheus-exporter/service-prometheus.c
index 0f178eb64..4cf216a23 100644
--- a/contrib/prometheus-exporter/service-prometheus.c
+++ b/contrib/prometheus-exporter/service-prometheus.c
@@ -367,7 +367,7 @@ const int promex_srv_metrics[ST_F_TOTAL_FIELDS] = {
 	[ST_F_WRETR]          = ST_F_WREDIS,
 	[ST_F_WREDIS]         = ST_F_WREW,
 	[ST_F_STATUS]         = ST_F_SCUR,
-	[ST_F_WEIGHT]         = ST_F_CHKFAIL,
+	[ST_F_WEIGHT]         = ST_F_CHECK_STATUS,
 	[ST_F_ACT]            = 0,
 	[ST_F_BCK]            = 0,
 	[ST_F_CHKFAIL]        = ST_F_CHKDOWN,
@@ -385,8 +385,8 @@ const int promex_srv_metrics[ST_F_TOTAL_FIELDS] = {
 	[ST_F_RATE]           = 0,
 	[ST_F_RATE_LIM]       = 0,
 	[ST_F_RATE_MAX]       = ST_F_LASTSESS,
-	[ST_F_CHECK_STATUS]   = 0,
-	[ST_F_CHECK_CODE]     = 0,
+	[ST_F_CHECK_STATUS]   = ST_F_CHECK_CODE,
+	[ST_F_CHECK_CODE]     = ST_F_CHKFAIL,
 	[ST_F_CHECK_DURATION] = 0,
 	[ST_F_HRSP_1XX]       = ST_F_HRSP_2XX,
 	[ST_F_HRSP_2XX]       = ST_F_HRSP_3XX,
@@ -709,7 +709,7 @@ cons
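On the scraping side, the integer exported by haproxy_server_check_status can be decoded back to its HCHK_STATUS_* name using the mapping from the README excerpt above. A minimal sketch in Python (the helper name is ours, not part of HAProxy):

```python
# Map the haproxy_server_check_status integer back to its HCHK_STATUS_*
# suffix, per the mapping added to the README in the patch above.
HCHK_STATUS = {
    0: "UNKNOWN", 1: "INI", 4: "HANA", 5: "SOCKERR",
    6: "L4OK", 7: "L4TOUT", 8: "L4CON",
    9: "L6OK", 10: "L6TOUT", 11: "L6RSP",
    12: "L7TOUT", 13: "L7RSP", 15: "L7OKD", 16: "L7OKCD", 17: "L7STS",
    18: "PROCERR", 19: "PROCTOUT", 20: "PROCOK",
}

def decode_check_status(value):
    """Return the HCHK_STATUS_* suffix for an exported integer value.
    Unmapped values (e.g. the unused 2 and 3) fall back to UNKNOWN."""
    return HCHK_STATUS.get(int(value), "UNKNOWN")
```

This kind of lookup is typically wired into a dashboard's value-mapping feature rather than into code.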
RE: native prometheus exporter: retrieving check_status
>> My only fear for this point would be to make the code too complicated
>> and harder to maintain.
>
> And slow down the exporter execution. Moreover, everyone will have a
> different opinion on how to aggregate the stats. My first idea was to sum
> all servers counters. But Pierre's reply showed me that it's not what he
> expects.

I agree it's probably too complex and opinionated. Let's see how it goes with server aggregations only, done on the Prometheus side, since it's a server-related field initially. If we identify issues/bottlenecks with the output size, we'll reopen this thread.

--
Pierre
RE: native prometheus exporter: retrieving check_status
> Ok, so it is a new kind of metric. I mean, not exposed by HAProxy. It would
> require an extra loop on all servers for each backend. It is probably
> doable for the check_status. For the code, I don't know, because it is not
> exclusive to HTTP checks; it is also used for SMTP and LDAP checks. In the
> end, I think a better idea would be to have a way to get specific metrics
> in each scope and let Prometheus handle the aggregation. This way, everyone
> is free to choose how to proceed while limiting the number of metrics
> exported.

Fair enough; as stated on the other thread with William, we'll see how it goes doing it this way. If we have issues related to output size, we'll start a new discussion.

Thanks!
--
Pierre
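Leaving the aggregation to Prometheus would boil down to a query along the lines of `count by (proxy) (haproxy_server_check_status == 17)`. As an illustration only, here is a sketch of the equivalent grouping done over raw exporter output; the sample lines and the `proxy`/`sv` label names are assumptions about the exposition format, not taken from the exporter source:

```python
import re
from collections import Counter

# Hypothetical exporter output; label names are assumptions for illustration.
SAMPLE = """\
haproxy_server_check_status{proxy="app",sv="srv1"} 6
haproxy_server_check_status{proxy="app",sv="srv2"} 7
haproxy_server_check_status{proxy="app",sv="srv3"} 6
haproxy_server_check_status{proxy="static",sv="srv1"} 17
"""

def count_status_per_backend(text):
    """Count servers per (backend, status) pair, mimicking what a PromQL
    `count by (proxy, ...)` aggregation would compute server-side."""
    counts = Counter()
    for line in text.splitlines():
        m = re.match(
            r'haproxy_server_check_status\{proxy="([^"]+)".*\}\s+(\d+)', line)
        if m:
            counts[(m.group(1), int(m.group(2)))] += 1
    return counts
```

In practice one would of course run the query in Prometheus itself; the point is only that the per-backend breakdown is recoverable from per-server gauges without any exporter-side aggregation.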
Re: native prometheus exporter: retrieving check_status
On 19/11/2019 17:12, Pierre Cheynier wrote:
>>> * also for `check_status`, there is the case of L7STS and its associated
>>>   values that are present in another field. Most probably it could benefit
>>>   from a better representation in a prometheus output (thanks to labels)?
>>
>> We can also export the metric ST_F_CHECK_CODE. For the use of labels, I
>> have no idea. For now, the labels are static in the exporter. And I don't
>> know if it is pertinent to add dynamic info in labels. If so, what is your
>> idea? Add a "code" label associated to the check_status metric?
>
> Here again, my maybe-not-so-good idea was to keep the ability to retrieve
> all the underlying details at backend level, such as:
>
>   * 100 servers are L7OK
>   * 1 server is L4TOUT
>   * 2 servers are L4CON
>   * 2 servers are L7STS
>     ** 1 due to an HTTP 429
>     ** 1 due to an HTTP 503
>
> But this is maybe overkill in terms of complexity, we could maybe push more
> on our ability to retrieve non-maint servers status.

Ok, so it is a new kind of metric. I mean, not exposed by HAProxy. It would require an extra loop on all servers for each backend. It is probably doable for the check_status. For the code, I don't know, because it is not exclusive to HTTP checks; it is also used for SMTP and LDAP checks. In the end, I think a better idea would be to have a way to get specific metrics in each scope and let Prometheus handle the aggregation. This way, everyone is free to choose how to proceed while limiting the number of metrics exported.

--
Christopher Faulet
Re: native prometheus exporter: retrieving check_status
On 19/11/2019 16:48, William Dauchy wrote:
> On Tue, Nov 19, 2019 at 03:31:28PM +0100, Christopher Faulet wrote:
>>> * also for `check_status`, there is the case of L7STS and its associated
>>>   values that are present in another field. Most probably it could benefit
>>>   from a better representation in a prometheus output (thanks to labels)?
>>
>> We can also export the metric ST_F_CHECK_CODE. For the use of labels, I
>> have no idea. For now, the labels are static in the exporter. And I don't
>> know if it is pertinent to add dynamic info in labels. If so, what is your
>> idea? Add a "code" label associated to the check_status metric?
>
> We need to be very careful here indeed. It's not very clear in my mind how
> many values we are talking about, but labels trigger the creation of a new
> metric for each key/value pair. So it can quickly explode your memory on
> the scraping side.

If there is a different metric for each label, it is probably not the right way to do it. However, I may be wrong, I'm not a Prometheus expert, far from it :) I will probably start by exporting metrics as present in HAProxy, using a mapping to represent the check_status.

>>> * what about getting some backend-level aggregation of server metrics,
>>>   such as the one that was previously mentioned, to avoid retrieving all
>>>   the server metrics but still be able to get some insights? I'm thinking
>>>   about an aggregation of some fields at backend level, which was not
>>>   previously done with the CSV output.
>>
>> It is feasible. But only counters may be aggregated. It may be enabled
>> using a parameter in the query-string. However, it is probably pertinent
>> only when the server metrics are filtered out. Because otherwise,
>> Prometheus can handle the aggregation itself.
>
> My only fear for this point would be to make the code too complicated and
> harder to maintain.

And slow down the exporter execution. Moreover, everyone will have a different opinion on how to aggregate the stats. My first idea was to sum all servers counters. But Pierre's reply showed me that it's not what he expects.

--
Christopher Faulet
RE: native prometheus exporter: retrieving check_status
> Hi Pierre,

Hi!

> I addressed this issue based on William's idea. I also proposed to add a
> filter to exclude all servers in maintenance from the export. Let me know
> if you see a better way to do so. For the moment, from the exporter point
> of view, it is not really hard to do such filtering.

Yes, that's a great addition and should improve things a lot, but I'm still not sure it will be sufficient (meaning that we could end up dropping all server metrics if the endpoint is still too huge, as we used to do with the old exporter). BTW, we also did that since we're using server-templates, and the naming in templates makes the server-name information useless (since we can't modify the name at runtime). So we previously had a sufficient level of info at backend level, thanks to native aggregations.

>> [ST_F_CHECK_STATUS] = IST("untyped"),
>> What could be done to be able to retrieve them? (I thought about something
>> similar to `HRSP_[1-5]XX`, where the different check status could be
>> defined and counted).
>
> Hum, I can add the check status. Mapping all status on integers is
> possible. However, having a metric per status is probably not the right
> solution, because it is not a counter but just a state (a boolean). If we
> do so, all status would be set to 0 except the current status. It is not
> really handy. But a mapping is possible. We already do this for the
> frontend/backend/server status (ST_F_STATUS).

Yes, it would work perfectly. In the end the goal for us would be to be able to retrieve this state. My idea about a counter was more about backend-level aggregations, if consistent (I'm not sure it is actually, hence the feedback request).

>> * also for `check_status`, there is the case of L7STS and its associated
>>   values that are present in another field. Most probably it could benefit
>>   from a better representation in a prometheus output (thanks to labels)?
>
> We can also export the metric ST_F_CHECK_CODE. For the use of labels, I
> have no idea. For now, the labels are static in the exporter. And I don't
> know if it is pertinent to add dynamic info in labels. If so, what is your
> idea? Add a "code" label associated to the check_status metric?

Here again, my maybe-not-so-good idea was to keep the ability to retrieve all the underlying details at backend level, such as:

  * 100 servers are L7OK
  * 1 server is L4TOUT
  * 2 servers are L4CON
  * 2 servers are L7STS
    ** 1 due to an HTTP 429
    ** 1 due to an HTTP 503

But this is maybe overkill in terms of complexity, we could maybe push more on our ability to retrieve non-maint servers status.

> It is feasible. But only counters may be aggregated. It may be enabled
> using a parameter in the query-string. However, it is probably pertinent
> only when the server metrics are filtered out. Because otherwise,
> Prometheus can handle the aggregation itself.

Sure, we should rely on this as much as possible.

--
Pierre
Re: native prometheus exporter: retrieving check_status
On Tue, Nov 19, 2019 at 03:31:28PM +0100, Christopher Faulet wrote:
>> * also for `check_status`, there is the case of L7STS and its associated
>>   values that are present in another field. Most probably it could benefit
>>   from a better representation in a prometheus output (thanks to labels)?
>
> We can also export the metric ST_F_CHECK_CODE. For the use of labels, I
> have no idea. For now, the labels are static in the exporter. And I don't
> know if it is pertinent to add dynamic info in labels. If so, what is your
> idea? Add a "code" label associated to the check_status metric?

We need to be very careful here indeed. It's not very clear in my mind how many values we are talking about, but labels trigger the creation of a new metric for each key/value pair. So it can quickly explode your memory on the scraping side.

>> * what about getting some backend-level aggregation of server metrics,
>>   such as the one that was previously mentioned, to avoid retrieving all
>>   the server metrics but still be able to get some insights? I'm thinking
>>   about an aggregation of some fields at backend level, which was not
>>   previously done with the CSV output.
>
> It is feasible. But only counters may be aggregated. It may be enabled
> using a parameter in the query-string. However, it is probably pertinent
> only when the server metrics are filtered out. Because otherwise,
> Prometheus can handle the aggregation itself.

My only fear for this point would be to make the code too complicated and harder to maintain.

--
William
Re: native prometheus exporter: retrieving check_status
Hi Pierre,

Sorry I missed your email. Thanks to William for the reminder.

On 15/11/2019 15:55, Pierre Cheynier wrote:
> We've recently tried to switch to the native prometheus exporter, but were
> quickly stopped in our initiative given the output on one of our preprod
> servers:
>
>   $ wc -l metrics.out
>   1478543 metrics.out
>   $ ls -lh metrics.out
>   -rw-r--r-- 1 pierre pierre 130M nov. 15 15:33 metrics.out
>
> This is not only due to a large setup, but essentially related to server
> lines, since we extensively use server-templates for server
> addition/deletion at runtime.
>
>   # backend & servers number
>   $ echo "show stat -1 2 -1" | sudo socat stdio /var/lib/haproxy/stats | wc -l
>   1309
>   $ echo "show stat -1 4 -1" | sudo socat stdio /var/lib/haproxy/stats | wc -l
>   36360
>   # But a lot of them are actually "waiting to be provisioned" (especially
>   # on this preprod environment)
>   $ echo "show stat -1 4 -1" | sudo socat stdio /var/lib/haproxy/stats | grep MAINT | wc -l
>   34113
>
> We'll filter out the server metrics as a quick fix, and will hopefully
> submit something to do it natively, but we would also like to get your
> feedback about some use-cases we expected to solve with this native
> exporter.

I addressed this issue based on William's idea. I also proposed to add a filter to exclude all servers in maintenance from the export. Let me know if you see a better way to do so. For the moment, from the exporter point of view, it is not really hard to do such filtering.

> Ultimately, one of them would be of great added value for us: being able to
> count check_status types (and their values in the L7STS case) per backend.
> So, there are 3 associated points:
>
> * it's great to have new metrics (such as
>   `haproxy_process_current_zlib_memory`), but we also noticed that some
>   very useful ones were not present due to their type, example:
>   [ST_F_CHECK_STATUS] = IST("untyped"),
>   What could be done to be able to retrieve them? (I thought about
>   something similar to `HRSP_[1-5]XX`, where the different check status
>   could be defined and counted).

Hum, I can add the check status. Mapping all status on integers is possible. However, having a metric per status is probably not the right solution, because it is not a counter but just a state (a boolean). If we do so, all status would be set to 0 except the current status. It is not really handy. But a mapping is possible. We already do this for the frontend/backend/server status (ST_F_STATUS).

> * also for `check_status`, there is the case of L7STS and its associated
>   values that are present in another field. Most probably it could benefit
>   from a better representation in a prometheus output (thanks to labels)?

We can also export the metric ST_F_CHECK_CODE. For the use of labels, I have no idea. For now, the labels are static in the exporter. And I don't know if it is pertinent to add dynamic info in labels. If so, what is your idea? Add a "code" label associated to the check_status metric?

> * what about getting some backend-level aggregation of server metrics, such
>   as the one that was previously mentioned, to avoid retrieving all the
>   server metrics but still be able to get some insights? I'm thinking about
>   an aggregation of some fields at backend level, which was not previously
>   done with the CSV output.

It is feasible. But only counters may be aggregated. It may be enabled using a parameter in the query-string. However, it is probably pertinent only when the server metrics are filtered out. Because otherwise, Prometheus can handle the aggregation itself.

--
Christopher Faulet
native prometheus exporter: retrieving check_status
Hi list,

We've recently tried to switch to the native prometheus exporter, but were quickly stopped in our initiative given the output on one of our preprod servers:

  $ wc -l metrics.out
  1478543 metrics.out
  $ ls -lh metrics.out
  -rw-r--r-- 1 pierre pierre 130M nov. 15 15:33 metrics.out

This is not only due to a large setup, but essentially related to server lines, since we extensively use server-templates for server addition/deletion at runtime.

  # backend & servers number
  $ echo "show stat -1 2 -1" | sudo socat stdio /var/lib/haproxy/stats | wc -l
  1309
  $ echo "show stat -1 4 -1" | sudo socat stdio /var/lib/haproxy/stats | wc -l
  36360
  # But a lot of them are actually "waiting to be provisioned" (especially on
  # this preprod environment)
  $ echo "show stat -1 4 -1" | sudo socat stdio /var/lib/haproxy/stats | grep MAINT | wc -l
  34113

We'll filter out the server metrics as a quick fix, and will hopefully submit something to do it natively, but we would also like to get your feedback about some use-cases we expected to solve with this native exporter.

Ultimately, one of them would be of great added value for us: being able to count check_status types (and their values in the L7STS case) per backend. So, there are 3 associated points:

* it's great to have new metrics (such as `haproxy_process_current_zlib_memory`), but we also noticed that some very useful ones were not present due to their type, example:

    [ST_F_CHECK_STATUS] = IST("untyped"),

  What could be done to be able to retrieve them? (I thought about something similar to `HRSP_[1-5]XX`, where the different check status could be defined and counted).

* also for `check_status`, there is the case of L7STS and its associated values that are present in another field. Most probably it could benefit from a better representation in a prometheus output (thanks to labels)?

* what about getting some backend-level aggregation of server metrics, such as the one that was previously mentioned, to avoid retrieving all the server metrics but still be able to get some insights? I'm thinking about an aggregation of some fields at backend level, which was not previously done with the CSV output.

Thanks for your feedback,
Pierre
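The "filter out the server metrics as a quick fix" step above amounts to dropping MAINT server lines from the `show stat` CSV before they reach the exporter pipeline, much like the `grep MAINT` above but inverted. A hedged sketch, assuming the usual `show stat` CSV layout where the status column is the 18th field (the sample rows below are fabricated for illustration):

```python
def drop_maint_servers(csv_text):
    """Mimic a `grep -v` on MAINT: drop `show stat` CSV lines whose status
    field reports a server in maintenance (MAINT, MAINT(via), ...)."""
    kept = []
    for line in csv_text.splitlines():
        fields = line.split(',')
        # status is assumed to be the 18th CSV field (index 17) here
        if len(fields) > 17 and fields[17].startswith("MAINT"):
            continue
        kept.append(line)
    return "\n".join(kept)

# Fabricated minimal rows: pxname, svname, 15 zeroed counters, then status.
up_line = "bk,srv1" + ",0" * 15 + ",UP"
maint_line = "bk,srv2" + ",0" * 15 + ",MAINT"
```

This is only client-side triage; the native query-string filter discussed later in the thread makes it unnecessary.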