Hello,
Please CC me in replies, as I'm not subscribed to the list.
I've been trying to make haproxy send logs to fluent-bit
in such a way that I get an alert when fluent-bit can't keep up
parsing them and logs start getting dropped.
The haproxy_process_dropped_logs_total metric
from haproxy's Prometheus exporter has been very useful for this.
However, that counter is not updated when logging to a TCP socket
through a ring buffer, with a configuration like:
global
    nbthread 2
    [...]
    log ring@syslog-tcp len 1024 format rfc5424 local0
    [...]

ring syslog-tcp
    description "For sending logs over TCP to local fluent-bit"
    format rfc5424
    maxlen 1024
    size 256k
    server local-fluent-bit 127.0.0.1:5141 log-proto legacy
When I throw 8k req/s at haproxy,
while tightly limiting fluent-bit's CPU to force it to fall behind,
logs are getting dropped but the counter does not increase.
(haproxy and fluent-bit run in the same Kubernetes pod
but in separate containers, so they get separate CPU limits.)
I think this might be because `__do_send_log` only counts a dropped log
when `sent < 0` and `errno == EAGAIN` [1],
but for ring targets it uses `sink_write` [2], which can [3] return zero
on error, and it does so when `ring_write` returns zero [4][5][6].
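To illustrate what I mean, here's a tiny standalone mock-up. This is NOT
HAProxy code: the function names mirror haproxy's, but the bodies are
simplified paraphrases of my reading of [1]-[6], only to show where the
drop accounting gets lost:

/* standalone mock-up, not HAProxy source */
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>

static unsigned int dropped_logs;

/* paraphrase of sink_write()/ring_write(): a message that doesn't fit
 * in the ring is dropped and reported with a 0 return, without
 * touching errno */
static ssize_t mock_sink_write(size_t msglen, size_t ring_room)
{
	if (msglen > ring_room)
		return 0;
	return (ssize_t)msglen;
}

/* paraphrase of the accounting in __do_send_log(): a drop is only
 * counted when the send failed with EAGAIN */
static void mock_do_send_log(size_t msglen, size_t ring_room)
{
	ssize_t sent = mock_sink_write(msglen, ring_room);

	if (sent < 0 && errno == EAGAIN)
		dropped_logs++;   /* never reached for the ring's 0 return */
}

int main(void)
{
	mock_do_send_log(1024, 0);                    /* ring full: message lost */
	printf("dropped_logs = %u\n", dropped_logs);  /* still prints 0 */
	return 0;
}

Running this prints dropped_logs = 0 even though the message was lost,
which matches what I'm seeing with the real counter.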
For me this problem happened on haproxy 2.4.17-9f97155,
but it looks like the return-code handling of these functions
hasn't changed much since.
Would a patch to fix this be welcome?
Locally I've managed to fix this (I think) by applying
a patch like the one below (rebased to master):
> diff --git a/src/log.c b/src/log.c
> index a58c6fc3c..e854f3012 100644
> --- a/src/log.c
> +++ b/src/log.c
> @@ -2727,6 +2727,12 @@ static inline void __do_send_log(struct log_target *target, struct log_header hd
>  			e_maxlen -= 1;
>
>  		sent = sink_write(target->sink, hdr, e_maxlen, &msg, 1);
> +		/* sink_write can return zero if there's no space in the ring
> +		 * and the log was dropped; we still want to count that */
> +		if (sent == 0) {
> +			sent = -1;
> +			errno = EAGAIN;
> +		}
>  	}
>  	else if (target->addr->ss_family == AF_CUST_EXISTING_FD) {
>  		struct ist msg;
I'm not sure if this is correct, I haven't tested it on the master branch,
and it obviously lacks many things a proper patch needs.
But I figured I'd first ask whether this is a valid bug
and a valid approach to fixing it.
I can provide more info if needed, but didn't want to make my message
too long.
Regards,
Wojciech Dubiel
[1]: https://github.com/haproxy/haproxy/blob/7fc52032e3a7c95ee6798703738981c64f1c5c5f/src/log.c#L2770-L2774
[2]: https://github.com/haproxy/haproxy/blob/7fc52032e3a7c95ee6798703738981c64f1c5c5f/src/log.c#L2729
[3]: https://github.com/haproxy/haproxy/blob/7fc52032e3a7c95ee6798703738981c64f1c5c5f/include/haproxy/sink.h#L49-L53
[4]: https://github.com/haproxy/haproxy/blob/7fc52032e3a7c95ee6798703738981c64f1c5c5f/src/ring.c#L193
[5]: https://github.com/haproxy/haproxy/blob/7fc52032e3a7c95ee6798703738981c64f1c5c5f/src/ring.c#L231-L232
[6]: https://github.com/haproxy/haproxy/blob/7fc52032e3a7c95ee6798703738981c64f1c5c5f/src/ring.c#L448-L449