Hi,

we use haproxy (1.6.9) to balance very long running POST requests
(around 50 seconds) to backend servers. It generally works like a charm,
but the average queue time and average total session time statistic
values are totally screwed up.

The problem is that the average is calculated like this for every request:

sum = sum * 511 / 512 + value

for a fixed value and enough iterations:

sum = value * 511

the problem is that at every iteration sum will first be multiplied by
511 and therefore the maximum value during the calculation is:

value * 511 * 511

A unsigned int can store a maximum value of 4294967296. Divided by
511*511 results in 16448. That means any backend with average times
above 16448ms will be affected by integer overflow and have wrong values.

The attached patch tries to solve this by storing and calculating sum as
unsigned long long instead of a unsigned int. I don't know if the
attached patch will work in every case, but during my limited testing it
worked. I'm also not sure if the implicit conversion from unsigned long
long to unsigned int in swrate_avg is a good idea. It should never be a
problem, because value is an unsigned int and the average can never be
higher then the maximum value. But I can understand if there are coding
conventions against implicit conversions. Generally I don't have a lot
of c experience, so feel free to improve the patch or solve the problem
another way.

Kind regards
Reinhard
From b298b230b541bb33d1cbea92f5472c926e10b7f0 Mon Sep 17 00:00:00 2001
From: "Vicinus, Reinhard" <r.vici...@metaways.de>
Date: Thu, 24 Nov 2016 21:41:17 +0100
Subject: [PATCH] allow higer averages then 16448ms

---
 include/proto/freq_ctr.h | 4 ++--
 include/types/counters.h | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/proto/freq_ctr.h b/include/proto/freq_ctr.h
index 65388b1..099a58d 100644
--- a/include/proto/freq_ctr.h
+++ b/include/proto/freq_ctr.h
@@ -218,7 +218,7 @@ unsigned int freq_ctr_remain_period(struct freq_ctr_period *ctr, unsigned int pe
 /* Adds sample value <v> to sliding window sum <sum> configured for <n> samples.
  * The sample is returned. Better if <n> is a power of two.
  */
-static inline unsigned int swrate_add(unsigned int *sum, unsigned int n, unsigned int v)
+static inline unsigned long long swrate_add(unsigned long long *sum, unsigned int n, unsigned int v)
 {
 	return *sum = *sum * (n - 1) / n + v;
 }
@@ -227,7 +227,7 @@ static inline unsigned int swrate_add(unsigned int *sum, unsigned int n, unsigne
  * <n> samples. Better if <n> is a power of two. It must be the same <n> as the
  * one used above in all additions.
  */
-static inline unsigned int swrate_avg(unsigned int sum, unsigned int n)
+static inline unsigned int swrate_avg(unsigned long long sum, unsigned int n)
 {
 	return (sum + n - 1) / n;
 }
diff --git a/include/types/counters.h b/include/types/counters.h
index 3e62763..08bcee7 100644
--- a/include/types/counters.h
+++ b/include/types/counters.h
@@ -56,7 +56,7 @@ struct pxcounters {
 	long long redispatches;                 /* retried and redispatched connections (BE only) */
 	long long intercepted_req;              /* number of monitoring or stats requests intercepted by the frontend */
 
-	unsigned int q_time, c_time, d_time, t_time; /* sums of conn_time, queue_time, data_time, total_time */
+	unsigned long long q_time, c_time, d_time, t_time; /* sums of conn_time, queue_time, data_time, total_time */
 
 	union {
 		struct {
@@ -100,7 +100,7 @@ struct srvcounters {
 	long long retries, redispatches;	/* retried and redispatched connections */
 	long long failed_secu;			/* blocked responses because of security concerns */
 
-	unsigned int q_time, c_time, d_time, t_time; /* sums of conn_time, queue_time, data_time, total_time */
+	unsigned long long q_time, c_time, d_time, t_time; /* sums of conn_time, queue_time, data_time, total_time */
 
 	union {
 		struct {
-- 
2.10.2

Reply via email to