Re: native prometheus exporter: retrieving check_status

2019-11-28 Thread Christopher Faulet

On 15/11/2019 at 15:55, Pierre Cheynier wrote:

* it's great to have new metrics (such as
`haproxy_process_current_zlib_memory`), but we also noticed that some very
useful ones were not present due to their type, for example:
[ST_F_CHECK_STATUS]   = IST("untyped"),


Hi Pierre,

Here is a patch to add haproxy_server_check_status (ST_F_CHECK_STATUS) and
haproxy_server_check_code (ST_F_CHECK_CODE) metrics for each server. Does this
meet your needs?
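
For illustration only, with this patch applied the two metrics should show up in
the scrape output roughly like this (proxy/sv are the labels the exporter already
uses for the other server metrics; the backend/server names and values below are
made up, 15 being HCHK_STATUS_L7OKD in the mapping documented in the README):

  # TYPE haproxy_server_check_status gauge
  haproxy_server_check_status{proxy="be_app",sv="srv1"} 15
  # TYPE haproxy_server_check_code gauge
  haproxy_server_check_code{proxy="be_app",sv="srv1"} 200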


--
Christopher Faulet
From 8bf04539cb3113769fce9ae1cc87e2b26868f966 Mon Sep 17 00:00:00 2001
From: Christopher Faulet 
Date: Thu, 21 Nov 2019 14:35:46 +0100
Subject: [PATCH] MINOR: contrib/prometheus-exporter: Add healthcheck
 status/code in server metrics

ST_F_CHECK_STATUS and ST_F_CHECK_CODE are now part of exported server metrics:

  * haproxy_server_check_status
  * haproxy_server_check_code

The healthcheck status is an integer corresponding to the HCHK_STATUS value.
---
 contrib/prometheus-exporter/README| 29 +++
 .../prometheus-exporter/service-prometheus.c  | 22 ++
 2 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/contrib/prometheus-exporter/README b/contrib/prometheus-exporter/README
index b19acc1bd..a9bd5e7a3 100644
--- a/contrib/prometheus-exporter/README
+++ b/contrib/prometheus-exporter/README
@@ -268,6 +268,8 @@ Exported metrics
 | haproxy_server_client_aborts_total | Total number of data transfers aborted by the client. |
 | haproxy_server_server_aborts_total | Total number of data transfers aborted by the server. |
 | haproxy_server_weight  | Service weight.   |
+| haproxy_server_check_status| Status of the last health check, if enabled (see below for the mapping). |
+| haproxy_server_check_code  | Layer5-7 code, if available, of the last health check. |
 | haproxy_server_check_failures_total| Total number of failed check (Only when the server is up).|
 | haproxy_server_check_up_down_total | Total number of UP->DOWN transitions. |
 | haproxy_server_downtime_seconds_total  | Total downtime (in seconds) for the service.  |
@@ -278,3 +280,30 @@ Exported metrics
 | haproxy_server_idle_connections_current| Current number of idle connections available for reuse.   |
 | haproxy_server_idle_connections_limit  | Limit on the number of available idle connections.|
 ++---+
+
+Mapping of health check status:
+
+   0 : HCHK_STATUS_UNKNOWN  (Unknown)
+   1 : HCHK_STATUS_INI      (Initializing)
+
+   4 : HCHK_STATUS_HANA     (Health analyze detected enough consecutive errors)
+
+   5 : HCHK_STATUS_SOCKERR  (Socket error)
+
+   6 : HCHK_STATUS_L4OK     (L4 check passed, for example tcp connect)
+   7 : HCHK_STATUS_L4TOUT   (L4 timeout)
+   8 : HCHK_STATUS_L4CON    (L4 connection problem)
+
+   9 : HCHK_STATUS_L6OK     (L6 check passed)
+  10 : HCHK_STATUS_L6TOUT   (L6 (SSL) timeout)
+  11 : HCHK_STATUS_L6RSP    (L6 invalid response - protocol error)
+
+  12 : HCHK_STATUS_L7TOUT   (L7 (HTTP/SMTP) timeout)
+  13 : HCHK_STATUS_L7RSP    (L7 invalid response - protocol error)
+  15 : HCHK_STATUS_L7OKD    (L7 check passed)
+  16 : HCHK_STATUS_L7OKCD   (L7 check conditionally passed)
+  17 : HCHK_STATUS_L7STS    (L7 response error, for example HTTP 5xx)
+
+  18 : HCHK_STATUS_PROCERR  (External process check failure)
+  19 : HCHK_STATUS_PROCTOUT (External process check timeout)
+  20 : HCHK_STATUS_PROCOK   (External process check passed)
diff --git a/contrib/prometheus-exporter/service-prometheus.c b/contrib/prometheus-exporter/service-prometheus.c
index 0f178eb64..4cf216a23 100644
--- a/contrib/prometheus-exporter/service-prometheus.c
+++ b/contrib/prometheus-exporter/service-prometheus.c
@@ -367,7 +367,7 @@ const int promex_srv_metrics[ST_F_TOTAL_FIELDS] = {
 	[ST_F_WRETR]          = ST_F_WREDIS,
 	[ST_F_WREDIS]         = ST_F_WREW,
 	[ST_F_STATUS]         = ST_F_SCUR,
-	[ST_F_WEIGHT]         = ST_F_CHKFAIL,
+	[ST_F_WEIGHT]         = ST_F_CHECK_STATUS,
 	[ST_F_ACT]            = 0,
 	[ST_F_BCK]            = 0,
 	[ST_F_CHKFAIL]        = ST_F_CHKDOWN,
@@ -385,8 +385,8 @@ const int promex_srv_metrics[ST_F_TOTAL_FIELDS] = {
 	[ST_F_RATE]           = 0,
 	[ST_F_RATE_LIM]       = 0,
 	[ST_F_RATE_MAX]       = ST_F_LASTSESS,
-	[ST_F_CHECK_STATUS]   = 0,
-	[ST_F_CHECK_CODE]     = 0,
+	[ST_F_CHECK_STATUS]   = ST_F_CHECK_CODE,
+	[ST_F_CHECK_CODE]     = ST_F_CHKFAIL,
 	[ST_F_CHECK_DURATION] = 0,
 	[ST_F_HRSP_1XX]       = ST_F_HRSP_2XX,
 	[ST_F_HRSP_2XX]       = ST_F_HRSP_3XX,
@@ -709,7 +709,7 @@ 

RE: native prometheus exporter: retrieving check_status

2019-11-20 Thread Pierre Cheynier
>> My only fear for this point would be to make the code too complicated
>> and harder to maintain.
>>
>
> And slow down the exporter execution. Moreover, everyone will have a different
> opinion on how to aggregate the stats. My first idea was to sum all server
> counters. But Pierre's reply showed me that it's not what he expects.

I agree it's probably too complex and opinionated. Let's see how it goes with
server aggregations only, done on the Prometheus side, since it's a
server-related field initially.
If we identify issues/bottlenecks with the output size, we'll reopen this thread.
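
For illustration: assuming the haproxy_server_check_status and
haproxy_server_check_code gauges from Christopher's patch, plus the exporter's
existing proxy label, the backend-level breakdown can be done on the Prometheus
side with queries along these lines (a sketch only, not tested):

  # servers per backend whose last check ended in L7STS (17 in the README mapping)
  count by (proxy) (haproxy_server_check_status == 17)

  # servers per backend, broken down by the code of the last health check (e.g. 429, 503)
  count_values by (proxy) ("code", haproxy_server_check_code)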

-- 
Pierre


RE: native prometheus exporter: retrieving check_status

2019-11-20 Thread Pierre Cheynier
> Ok, so it is a new kind of metric. I mean, not exposed by HAProxy. It would
> require an extra loop on all servers for each backend. It is probably doable for
> the check_status. For the code, I don't know, because it is not exclusive to
> HTTP checks; it is also used for SMTP and LDAP checks. In the end, I think a
> better idea would be to have a way to get specific metrics in each scope and
> let Prometheus handle the aggregation. This way, everyone is free to choose
> how to proceed while limiting the number of metrics exported.

Fair enough; as stated on the other thread with William, we'll see how it goes
doing it this way. If we have issues related to output size, we'll start a new
discussion.

Thanks!

-- 
Pierre




Re: native prometheus exporter: retrieving check_status

2019-11-20 Thread Christopher Faulet

On 19/11/2019 at 17:12, Pierre Cheynier wrote:

* also for `check_status`, there is the case of L7STS and its associated values 
that are present
in another field. Most probably it could benefit from a better representation 
in a prometheus
output (thanks to labels)?


We can also export the metric ST_F_CHECK_CODE. For the use of labels, I have no
idea. For now, the labels are static in the exporter. And I don't know if it is
pertinent to add dynamic info in labels. If so, what is your idea? Add a "code"
label associated with the check_status metric?


Here again, my maybe-not-so-good idea was to keep the ability to retrieve all the
underlying details at backend level, such as:
* 100 servers are L7OK
* 1 server is L4TOUT
* 2 servers are L4CON
* 2 servers are L7STS
** 1 due to an HTTP 429
** 1 due to an HTTP 503

But this is maybe overkill in terms of complexity; we could maybe push more on
our ability to retrieve the status of non-maint servers.



Ok, so it is a new kind of metric. I mean, not exposed by HAProxy. It would
require an extra loop on all servers for each backend. It is probably doable for
the check_status. For the code, I don't know, because it is not exclusive to
HTTP checks; it is also used for SMTP and LDAP checks. In the end, I think a
better idea would be to have a way to get specific metrics in each scope and
let Prometheus handle the aggregation. This way, everyone is free to choose
how to proceed while limiting the number of metrics exported.


--
Christopher Faulet



Re: native prometheus exporter: retrieving check_status

2019-11-20 Thread Christopher Faulet

On 19/11/2019 at 16:48, William Dauchy wrote:

On Tue, Nov 19, 2019 at 03:31:28PM +0100, Christopher Faulet wrote:

* also for `check_status`, there is the case of L7STS and its associated
values that are present in another field. Most probably it could benefit
from a better representation in a prometheus output (thanks to labels)?


We can also export the metric ST_F_CHECK_CODE. For the use of labels, I
have no idea. For now, the labels are static in the exporter. And I don't
know if it is pertinent to add dynamic info in labels. If so, what is your
idea? Add a "code" label associated with the check_status metric?


we need to be very careful here indeed. It's not very clear in my mind
how many values we are talking about, but labels trigger the creation of
a new metric for each key/value pair. So it can quickly explode your
memory on the scraping side.


If there is a different metric for each label, it is probably not the right way
to do it. However, I may be wrong, I'm not a Prometheus expert, far from it :) I
will probably start by exporting the metrics as present in HAProxy, using a
mapping to represent the check_status.





* what about getting some backend-level aggregation of server metrics, such as 
the one that was previously mentioned, to avoid retrieving all the server 
metrics but still be able to get some insights?
I'm thinking about an aggregation of some fields at backend level, which was 
not previously done with the CSV output.


It is feasible. But only counters may be aggregated. It may be enabled using
a parameter in the query-string. However, it is probably pertinent only when
the server metrics are filtered out. Because otherwise, Prometheus can
handle the aggregation itself.


My only fear for this point would be to make the code too complicated
and harder to maintain.


And slow down the exporter execution. Moreover, everyone will have a different
opinion on how to aggregate the stats. My first idea was to sum all server
counters. But Pierre's reply showed me that it's not what he expects.


--
Christopher Faulet



RE: native prometheus exporter: retrieving check_status

2019-11-19 Thread Pierre Cheynier
> Hi Pierre,

Hi!

> I addressed this issue based on an idea from William. I also proposed to add a
> filter to exclude all servers in maintenance from the export. Let me know if you
> see a better way to do so. For the moment, from the exporter point of view, it
> is not really hard to do such filtering.

Yes, that's a great addition and should improve things a lot, but I'm still not
sure it will be sufficient (meaning that we could end up dropping all server
metrics if the endpoint output is still too huge, as we used to do with the old
exporter).

BTW, we also did that since we're using server-templates, and the naming in
templates makes the server-name information useless (since we can't modify the
name at runtime). So we previously had a sufficient level of info at backend
level, thanks to native aggregations.

>> [ST_F_CHECK_STATUS]   = IST("untyped"),
>> What could be done to be able to retrieve them? (I thought about something
>> similar to `HRSP_[1-5]XX`, where the different check statuses could be defined
>> and counted).
>> 
>
> Hum, I can add the check status. Mapping all statuses to integers is possible.
> However, having a metric per status is probably not the right solution, because
> it is not a counter but just a state (a boolean). If we did so, all statuses
> would be set to 0 except the current one. It is not really handy. But a mapping
> is possible. We already do this for the frontend/backend/server status
> (ST_F_STATUS).

Yes, it would work perfectly. In the end, the goal for us would be to be able to
retrieve this state. My idea about a counter was more about backend-level
aggregations, if consistent (I'm not sure it is actually, hence the feedback
request).

>> * also for `check_status`, there is the case of L7STS and its associated 
>> values that are present
>> in another field. Most probably it could benefit from a better 
>> representation in a prometheus
>> output (thanks to labels)?
>>
> We can also export the metric ST_F_CHECK_CODE. For the use of labels, I have no
> idea. For now, the labels are static in the exporter. And I don't know if it is
> pertinent to add dynamic info in labels. If so, what is your idea? Add a "code"
> label associated with the check_status metric?

Here again, my maybe-not-so-good idea was to keep the ability to retrieve all the
underlying details at backend level, such as:
* 100 servers are L7OK
* 1 server is L4TOUT
* 2 servers are L4CON
* 2 servers are L7STS
** 1 due to an HTTP 429
** 1 due to an HTTP 503

But this is maybe overkill in terms of complexity; we could maybe push more on
our ability to retrieve the status of non-maint servers.

> It is feasible. But only counters may be aggregated. It may be enabled using 
> a 
> parameter in the query-string. However, it is probably pertinent only when 
> the 
> server metrics are filtered out. Because otherwise, Prometheus can handle the 
> aggregation itself.

Sure, we should rely on this as much as possible.

--
Pierre


Re: native prometheus exporter: retrieving check_status

2019-11-19 Thread William Dauchy
On Tue, Nov 19, 2019 at 03:31:28PM +0100, Christopher Faulet wrote:
> > * also for `check_status`, there is the case of L7STS and its associated
> > values that are present in another field. Most probably it could benefit
> > from a better representation in a prometheus output (thanks to labels)?
> 
> We can also export the metric ST_F_CHECK_CODE. For the use of labels, I
> have no idea. For now, the labels are static in the exporter. And I don't
> know if it is pertinent to add dynamic info in labels. If so, what is your
> idea? Add a "code" label associated with the check_status metric?

we need to be very careful here indeed. It's not very clear in my mind
how many values we are talking about, but labels trigger the creation of
a new metric for each key/value pair. So it can quickly explode your
memory on the scraping side.
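
To make the cardinality concern concrete: if the check code were exposed as a
label, every distinct code ever observed would become its own time series per
server, e.g. (hypothetical names and values):

  haproxy_server_check_status{proxy="be_app",sv="srv1",code="200"} 1
  haproxy_server_check_status{proxy="be_app",sv="srv1",code="429"} 1
  haproxy_server_check_status{proxy="be_app",sv="srv1",code="503"} 1

multiplied by the number of servers, which is exactly what hurts on the scraping
side.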

> > * what about getting some backend-level aggregation of server metrics, such 
> > as the one that was previously mentioned, to avoid retrieving all the 
> > server metrics but still be able to get some insights?
> > I'm thinking about an aggregation of some fields at backend level, which 
> > was not previously done with the CSV output.
> 
> It is feasible. But only counters may be aggregated. It may be enabled using
> a parameter in the query-string. However, it is probably pertinent only when
> the server metrics are filtered out. Because otherwise, Prometheus can
> handle the aggregation itself.

My only fear for this point would be to make the code too complicated
and harder to maintain.

-- 
William



Re: native prometheus exporter: retrieving check_status

2019-11-19 Thread Christopher Faulet

Hi Pierre,

Sorry, I missed your email. Thanks to William for the reminder.

On 15/11/2019 at 15:55, Pierre Cheynier wrote:

We've recently tried to switch to the native prometheus exporter, but were
quickly stopped in our initiative given the output on one of our preprod servers:

$ wc -l metrics.out
1478543 metrics.out
$ ls -lh metrics.out
-rw-r--r-- 1 pierre pierre 130M nov.  15 15:33 metrics.out

This is not only due to a large setup, but essentially related to server lines,
since we extensively use server-templates for server addition/deletion at
runtime.

# backend & servers number
$ echo "show stat -1 2 -1" | sudo socat stdio /var/lib/haproxy/stats | wc -l
1309
$ echo "show stat -1 4 -1" | sudo socat stdio /var/lib/haproxy/stats | wc -l
36360
# But a lot of them are actually "waiting to be provisioned" (especially on this preprod environment)
$ echo "show stat -1 4 -1" | sudo socat stdio /var/lib/haproxy/stats | grep MAINT | wc -l
34113

We'll filter out the server metrics as a quick fix, and will hopefully submit
something to do it natively, but we would also like to get your feedback about
some use cases we expected to solve with this native exporter.



I addressed this issue based on an idea from William. I also proposed to add a
filter to exclude all servers in maintenance from the export. Let me know if you
see a better way to do so. For the moment, from the exporter point of view, it
is not really hard to do such filtering.
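
As a usage sketch only: assuming the filter ends up exposed as a query-string
parameter on the exporter endpoint (the parameter name, address and port below
are purely illustrative, not final), scraping without the servers in maintenance
could look like:

  $ curl -s 'http://127.0.0.1:8404/metrics?no-maint' | grep -c '^haproxy_server_'

which would give the number of exported server metric lines with MAINT servers
excluded.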



Ultimately, one of them would be of great added value for us: being able to
count check_status types (and their values in the L7STS case) per backend.

So, there are 3 associated points:
* it's great to have new metrics (such as
`haproxy_process_current_zlib_memory`), but we also noticed that some very
useful ones were not present due to their type, for example:
[ST_F_CHECK_STATUS]   = IST("untyped"),
What could be done to be able to retrieve them? (I thought about something
similar to `HRSP_[1-5]XX`, where the different check statuses could be defined
and counted).



Hum, I can add the check status. Mapping all statuses to integers is possible.
However, having a metric per status is probably not the right solution, because
it is not a counter but just a state (a boolean). If we did so, all statuses
would be set to 0 except the current one. It is not really handy. But a mapping
is possible. We already do this for the frontend/backend/server status (ST_F_STATUS).
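
To illustrate the difference with hypothetical metric names: one metric per
status would export one mostly-zero series per possible status for every
server, e.g.

  haproxy_server_check_status_l4ok{proxy="be_app",sv="srv1"} 0
  haproxy_server_check_status_l7okd{proxy="be_app",sv="srv1"} 1
  haproxy_server_check_status_l7sts{proxy="be_app",sv="srv1"} 0

whereas the mapping approach exports a single gauge per server whose integer
value encodes the current status:

  haproxy_server_check_status{proxy="be_app",sv="srv1"} 15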


* also for `check_status`, there is the case of L7STS and its associated values
that are present in another field. Most probably it could benefit from a better
representation in a prometheus output (thanks to labels)?


We can also export the metric ST_F_CHECK_CODE. For the use of labels, I have no
idea. For now, the labels are static in the exporter. And I don't know if it is
pertinent to add dynamic info in labels. If so, what is your idea? Add a "code"
label associated with the check_status metric?



* what about getting some backend-level aggregation of server metrics, such as 
the one that was previously mentioned, to avoid retrieving all the server 
metrics but still be able to get some insights?
I'm thinking about an aggregation of some fields at backend level, which was 
not previously done with the CSV output.



It is feasible. But only counters may be aggregated. It may be enabled using a 
parameter in the query-string. However, it is probably pertinent only when the 
server metrics are filtered out. Because otherwise, Prometheus can handle the 
aggregation itself.


--
Christopher Faulet



native prometheus exporter: retrieving check_status

2019-11-15 Thread Pierre Cheynier
Hi list,

We've recently tried to switch to the native prometheus exporter, but were
quickly stopped in our initiative given the output on one of our preprod servers:

$ wc -l metrics.out 
1478543 metrics.out
$ ls -lh metrics.out 
-rw-r--r-- 1 pierre pierre 130M nov.  15 15:33 metrics.out

This is not only due to a large setup, but essentially related to server lines,
since we extensively use server-templates for server addition/deletion at
runtime.

# backend & servers number
$ echo "show stat -1 2 -1" | sudo socat stdio /var/lib/haproxy/stats | wc -l
1309
$ echo "show stat -1 4 -1" | sudo socat stdio /var/lib/haproxy/stats | wc -l
36360
# But a lot of them are actually "waiting to be provisioned" (especially on this preprod environment)
$ echo "show stat -1 4 -1" | sudo socat stdio /var/lib/haproxy/stats | grep MAINT | wc -l
34113

We'll filter out the server metrics as a quick fix, and will hopefully submit
something to do it natively, but we would also like to get your feedback about
some use cases we expected to solve with this native exporter.

Ultimately, one of them would be of great added value for us: being able to
count check_status types (and their values in the L7STS case) per backend.

So, there are 3 associated points:
* it's great to have new metrics (such as
`haproxy_process_current_zlib_memory`), but we also noticed that some very
useful ones were not present due to their type, for example:
[ST_F_CHECK_STATUS]   = IST("untyped"),
What could be done to be able to retrieve them? (I thought about something
similar to `HRSP_[1-5]XX`, where the different check statuses could be defined
and counted).

* also for `check_status`, there is the case of L7STS and its associated values 
that are present in another field. Most probably it could benefit from a better 
representation in a prometheus output (thanks to labels)?

* what about getting some backend-level aggregation of server metrics, such as 
the one that was previously mentioned, to avoid retrieving all the server 
metrics but still be able to get some insights?
I'm thinking about an aggregation of some fields at backend level, which was 
not previously done with the CSV output.

Thanks for your feedback,

Pierre