Second try for Patch series for New Balancing algorithm (Peak) EWMA

Aleksandar Lazic Mon, 16 Mar 2026 06:19:31 -0700

Hi.

Here is another try, with much more information and less invasive changes.

I'm not sure whether the algorithm helps anyone, because Peak EWMA was designed for many short RPCs, where round-trip time (RTT) is measured per request in real time. HAProxy, however, measures RTT per connection at teardown.

Anyhow, here are the patches. It's up to the team to decide whether it is worth the effort to add another LB algorithm, or to just ignore this mail :-).


Best Regards
Aleks

From 3b540e0ec4c46a4a10d33de434ab1e0fed9f85e6 Mon Sep 17 00:00:00 2001
From: Aleksandar Lazic <[email protected]>
Date: Mon, 16 Mar 2026 13:52:22 +0100
Subject: [PATCH 9/9] DOC: configuration: document the peak-ewma balance
 algorithm

Add an entry for "balance peak-ewma" in the balance directive section,
placed after leastconn as both are connection-based algorithms.

The entry explains the score formula, the peak EWMA update rule, the
latency source selection (d_time for HTTP, t_time for TCP), the
bootstrapping behaviour for new servers, weight handling, and the
known limitation that samples are only collected at connection close.
---
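Reviewer note, not part of the commit: the score formula and the two
rules around it (weights, bootstrapping) can be tried out with the small
sketch below.  It is only an illustration under stated assumptions:
struct score_srv and the demo_* helpers are hypothetical, floating point
is used for readability, and none of this is the code added by the
series.

```c
#include <stddef.h>

/* Hypothetical, simplified view of a server; NOT HAProxy's struct server. */
struct score_srv {
	double rtt_peak_ewma_ms;  /* 0 means "no latency history yet" */
	unsigned int inflight;    /* connections currently in flight */
	unsigned int eweight;     /* effective weight, assumed > 0 here */
};

/* score = rtt_peak_ewma * (inflight + 1) / eweight, as in the doc entry */
static double demo_score(const struct score_srv *s)
{
	return s->rtt_peak_ewma_ms * (s->inflight + 1) / s->eweight;
}

/* Return the index of the lowest-scoring server; ties keep the first one. */
static size_t demo_pick(const struct score_srv *srv, size_t n)
{
	size_t best = 0;
	size_t i;

	for (i = 1; i < n; i++)
		if (demo_score(&srv[i]) < demo_score(&srv[best]))
			best = i;
	return best;
}
```

With equal latency, a server of weight 2 holding 3 connections scores the
same as a server of weight 1 holding 1 connection, and a server whose
estimate is still 0 wins regardless of its connection count.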
 doc/configuration.txt | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/doc/configuration.txt b/doc/configuration.txt
index a1cfd032c1..8003c9f284 100644
--- a/doc/configuration.txt
+++ b/doc/configuration.txt
@@ -6174,6 +6174,45 @@ balance url_param <param> [check_post]
                   the established ones in order to minimize queuing. This
                   algorithm is not usable in LOG mode.
 
+      peak-ewma   The server with the lowest score receives the connection,
+                  where the score is the product of the server's peak EWMA
+                  (peak exponentially-weighted moving average) response
+                  time and its number of inflight connections plus one,
+                  divided by its effective weight:
+
+                    score = rtt_peak_ewma * (inflight + 1) / eweight
+
+                  The peak EWMA estimator reacts to latency increases
+                  immediately: whenever the server's average response time,
+                  as refreshed at connection close, is at least the current
+                  peak estimate, the estimate is raised to that average at
+                  once.  It decays slowly otherwise, using a sliding window
+                  of TIME_STATS_SAMPLES (512) samples, so that a temporary
+                  slowdown does not permanently penalise a server.
+
+                  For HTTP backends, the response time is measured as the
+                  server processing and transfer time (counters.d_time),
+                  excluding queue wait and TCP connection time.  For TCP
+                  backends, the total response time (counters.t_time) is
+                  used because the finer-grained measurement is not
+                  available.
+
+                  Servers with no latency history yet receive score 0 and
+                  are always selected first to bootstrap the estimator.
+                  Server weights are honoured: a server with twice the
+                  weight of another must show twice the product
+                  rtt_peak_ewma * (inflight + 1) to reach the same score.
+
+                  This algorithm is well suited for backends where server
+                  response times vary significantly or where individual
+                  servers may experience temporary slowdowns (e.g. garbage
+                  collection pauses, CPU contention).  It is less useful
+                  for long-lived connections or HTTP/2 multiplexed backends
+                  because latency samples are collected only at connection
+                  close.  This algorithm is dynamic, which means that server
+                  weights may be adjusted on the fly.  It is not usable in
+                  LOG mode.
+
       first       The first server with available connection slots receives the
                   connection. The servers are chosen from the lowest numeric
                   identifier to the highest (see server parameter "id"), which
-- 
2.43.0

From 8dab98bb2378ce22e1f3780455ea548dfc64eeb9 Mon Sep 17 00:00:00 2001
From: Aleksandar Lazic <[email protected]>
Date: Mon, 16 Mar 2026 13:43:57 +0100
Subject: [PATCH 8/9] MINOR: proxy: wire peak-ewma initialization through the
 FWLC tree

Extend the BE_LB_KIND_CB init block in proxy_finalize() to accept
BE_LB_CB_PEWMA alongside BE_LB_CB_LC.  Both algorithms share the FWLC
ebtree infrastructure; the distinction between leastconn scoring and
peak-ewma scoring is handled entirely inside fwlc_init_server_tree().
---
 src/proxy.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/src/proxy.c b/src/proxy.c
index 1b6c8e4dbb..6da653e633 100644
--- a/src/proxy.c
+++ b/src/proxy.c
@@ -2632,7 +2632,11 @@ int proxy_finalize(struct proxy *px, int *err_code)
 		break;
 
 	case BE_LB_KIND_CB:
-		if ((px->lbprm.algo & BE_LB_PARM) == BE_LB_CB_LC) {
+		if ((px->lbprm.algo & BE_LB_PARM) == BE_LB_CB_LC ||
+		    (px->lbprm.algo & BE_LB_PARM) == BE_LB_CB_PEWMA) {
+			/* leastconn and peak-ewma both use the FWLC tree;
+			 * peak-ewma adds latency weighting on top.
+			 */
 			px->lbprm.algo |= BE_LB_LKUP_LCTREE | BE_LB_PROP_DYN;
 			fwlc_init_server_tree(px);
 		} else {
-- 
2.43.0

From c888f0be60bac35dad7b496982e7f4f416321a25 Mon Sep 17 00:00:00 2001
From: Aleksandar Lazic <[email protected]>
Date: Mon, 16 Mar 2026 13:43:52 +0100
Subject: [PATCH 7/9] MINOR: backend: register peak-ewma as a balance algorithm

Add "peak-ewma" to backend_parse_balance() so the algorithm can be
configured via "balance peak-ewma" in backend and listen sections.

Add the reverse mapping in backend_lb_algo_str() so the algorithm name
is correctly reported in "show info", stats and configuration dumps.

Update the error message in the else-branch to list peak-ewma among the
supported options.
---
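Reviewer note, not part of the commit: with the whole series applied, the
new keyword would be used like any other balance algorithm.  The fragment
below is a minimal sketch; the backend name, server names, addresses
(taken from the documentation range) and weights are made up for
illustration.

```
backend app
    mode http
    balance peak-ewma
    server s1 192.0.2.10:8080 weight 100
    server s2 192.0.2.11:8080 weight 50
```

Without the series, the parser rejects the keyword with the "only
supports ..." error message touched by this patch.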
 src/backend.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/src/backend.c b/src/backend.c
index 46eba155db..c88f15bd93 100644
--- a/src/backend.c
+++ b/src/backend.c
@@ -3095,6 +3095,8 @@ const char *backend_lb_algo_str(int algo) {
 		return "static-rr";
 	else if (algo == BE_LB_ALGO_FAS)
 		return "first";
+	else if (algo == BE_LB_ALGO_PEWMA)
+		return "peak-ewma";
 	else if (algo == BE_LB_ALGO_LC)
 		return "leastconn";
 	else if (algo == BE_LB_ALGO_SH)
@@ -3147,6 +3149,10 @@ int backend_parse_balance(const char **args, char **err, struct proxy *curproxy)
 		curproxy->lbprm.algo &= ~BE_LB_ALGO;
 		curproxy->lbprm.algo |= BE_LB_ALGO_LC;
 	}
+	else if (strcmp(args[0], "peak-ewma") == 0) {
+		curproxy->lbprm.algo &= ~BE_LB_ALGO;
+		curproxy->lbprm.algo |= BE_LB_ALGO_PEWMA;
+	}
 	else if (!strncmp(args[0], "random", 6)) {
 		curproxy->lbprm.algo &= ~BE_LB_ALGO;
 		curproxy->lbprm.algo |= BE_LB_ALGO_RND;
@@ -3327,7 +3333,7 @@ int backend_parse_balance(const char **args, char **err, struct proxy *curproxy)
 		curproxy->lbprm.algo |= BE_LB_ALGO_SS;
 	}
 	else {
-		memprintf(err, "only supports 'roundrobin', 'static-rr', 'leastconn', 'source', 'uri', 'url_param', 'hash', 'hdr(name)', 'rdp-cookie(name)', 'log-hash' and 'sticky' options.");
+		memprintf(err, "only supports 'roundrobin', 'static-rr', 'leastconn', 'peak-ewma', 'source', 'uri', 'url_param', 'hash', 'hdr(name)', 'rdp-cookie(name)', 'log-hash' and 'sticky' options.");
 		return -1;
 	}
 	return 0;
-- 
2.43.0

From 7f990b0524a47c1c531cbcb859e16067fabaacbf Mon Sep 17 00:00:00 2001
From: Aleksandar Lazic <[email protected]>
Date: Mon, 16 Mar 2026 13:56:28 +0100
Subject: [PATCH 6/9] MINOR: lb_fwlc: add drop-conn callback and wire peak-ewma
 into fwlc_init_server_tree()

Add fwlc_srv_drop_conn_pewma() which calls fwlc_update_pewma_rtt() to
update the peak EWMA estimator from the just-completed request's timing,
then repositions the server in the FWLC tree via fwlc_srv_reposition().

Patch fwlc_init_server_tree() to install fwlc_srv_drop_conn_pewma as the
server_drop_conn callback when BE_LB_CB_PEWMA is active.  For leastconn
the existing fwlc_srv_reposition callback is used unchanged.
---
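Reviewer note, not part of the commit: the wiring described above boils
down to choosing one function pointer at init time.  The sketch below
shows that pattern in isolation; struct cb_srv, the counters and all
demo_* names are hypothetical stand-ins, not HAProxy symbols.

```c
/* Hypothetical miniature of the callback wiring; names are illustrative. */
struct cb_srv {
	unsigned int updates;      /* times the estimator was refreshed */
	unsigned int repositions;  /* times the server was re-sorted */
};

typedef void (*demo_cb)(struct cb_srv *);

static void demo_reposition(struct cb_srv *s) { s->repositions++; }
static void demo_update_rtt(struct cb_srv *s) { s->updates++; }

/* peak-ewma drop path: refresh the estimator first, so that the
 * reposition step sorts on the fresh value.
 */
static void demo_drop_conn_pewma(struct cb_srv *s)
{
	demo_update_rtt(s);
	demo_reposition(s);
}

/* Choose the drop-connection callback once, at init time. */
static demo_cb demo_select_drop_cb(int is_pewma)
{
	return is_pewma ? demo_drop_conn_pewma : demo_reposition;
}
```

The leastconn path stays a plain reposition, so backends that do not use
peak-ewma pay nothing extra when a connection is dropped.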
 src/lb_fwlc.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/src/lb_fwlc.c b/src/lb_fwlc.c
index 2356e4bcd4..e98cdf5bfb 100644
--- a/src/lb_fwlc.c
+++ b/src/lb_fwlc.c
@@ -592,6 +592,15 @@ static void fwlc_update_pewma_rtt(struct server *s)
 	} while (!HA_ATOMIC_CAS(&s->pewma_sum, &old_sum, new_sum) && __ha_cpu_relax());
 }
 
+/* Drop-connection callback for the Peak EWMA variant: update the latency
+ * estimator from the completed request's timing, then reposition the server.
+ */
+static void fwlc_srv_drop_conn_pewma(struct server *s)
+{
+	fwlc_update_pewma_rtt(s);
+	fwlc_srv_reposition(s);
+}
+
 /* This function updates the server trees according to server <srv>'s new
  * state. It should be called when server <srv>'s status changes to down.
  * It is not important whether the server was already down or not. It is not
@@ -784,11 +793,18 @@ void fwlc_init_server_tree(struct proxy *p)
 	p->lbprm.set_server_status_down = fwlc_set_server_status_down;
 	p->lbprm.update_server_eweight  = fwlc_update_server_weight;
 	p->lbprm.server_take_conn = fwlc_srv_reposition;
-	p->lbprm.server_drop_conn = fwlc_srv_reposition;
 	p->lbprm.server_requeue   = fwlc_srv_reposition;
 	p->lbprm.server_deinit    = fwlc_server_deinit;
 	p->lbprm.proxy_deinit     = fwlc_proxy_deinit;
 
+	/* Peak EWMA variant: update the latency estimator on each completed
+	 * request before repositioning.
+	 */
+	if ((p->lbprm.algo & BE_LB_PARM) == BE_LB_CB_PEWMA)
+		p->lbprm.server_drop_conn = fwlc_srv_drop_conn_pewma;
+	else
+		p->lbprm.server_drop_conn = fwlc_srv_reposition;
+
 	p->lbprm.wdiv = BE_WEIGHT_SCALE;
 	for (srv = p->srv; srv; srv = srv->next) {
 		srv->next_eweight = (srv->uweight * p->lbprm.wdiv + p->lbprm.wmult - 1) / p->lbprm.wmult;
-- 
2.43.0

From d0e4b69e3e744ac64cfb1275eb9133dbff95abbc Mon Sep 17 00:00:00 2001
From: Aleksandar Lazic <[email protected]>
Date: Mon, 16 Mar 2026 13:55:38 +0100
Subject: [PATCH 5/9] MINOR: lb_fwlc: add peak-ewma latency estimator
 fwlc_update_pewma_rtt()

Add fwlc_update_pewma_rtt() which applies the peak EWMA update rule to
s->pewma_sum on every completed request:

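  - if the new value v (the server's current standard EWMA average, as
    read from the latency counter, not the raw per-request latency)
    >= current peak average: spike up immediately by setting
    new_sum = v * TIME_STATS_SAMPLES
  - otherwise: decay slowly using the standard sliding-window step
    new_sum = old_sum - ceil(old_sum/N) + v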

The update is lock-free via HA_ATOMIC_CAS.

The latency source depends on the proxy mode: counters.d_time (server
processing + transfer, excluding queue and connect overhead) is used for
HTTP backends.  For TCP backends d_time is always zero because stream.c
sets t_data = t_connect before the subtraction, so counters.t_time
(total response time) is used as a fallback.
---
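Reviewer note, not part of the commit: the update rule can be followed
step by step with the single-threaded sketch below.  It leaves out the
CAS loop on purpose and assumes that the window average is the rounded-up
quotient sum/N, which matches the ceil() term of the decay step;
DEMO_SAMPLES stands in for TIME_STATS_SAMPLES and the demo_* names are
hypothetical.

```c
#define DEMO_SAMPLES 512u  /* stands in for TIME_STATS_SAMPLES */

/* Rounded-up average of a sliding-window sum (assumed behaviour). */
static unsigned int demo_avg(unsigned int sum)
{
	return (sum + DEMO_SAMPLES - 1) / DEMO_SAMPLES;
}

/* One peak EWMA step: jump to v when v is at least the current peak
 * average, otherwise decay by one sliding-window step.  No atomics here.
 */
static unsigned int demo_pewma_step(unsigned int old_sum, unsigned int v)
{
	if (v >= demo_avg(old_sum))
		return v * DEMO_SAMPLES;
	return old_sum - demo_avg(old_sum) + v;
}
```

Starting from an empty estimator, a first value of 100 ms sets the sum to
51200.  A following value of 10 ms only lowers it to 51110, so the peak
average is still 100 ms, while a later value of 200 ms raises it to 200 ms
in a single step.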
 src/lb_fwlc.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/src/lb_fwlc.c b/src/lb_fwlc.c
index d2f531c0dc..2356e4bcd4 100644
--- a/src/lb_fwlc.c
+++ b/src/lb_fwlc.c
@@ -552,6 +552,46 @@ static void fwlc_srv_reposition(struct server *s)
 	fwlc_check_srv_key(s, new_key);
 }
 
+/* Update the Peak EWMA latency estimator for server <s>.
+ *
+ * Reads the current standard EWMA average from the appropriate latency counter
+ * (already updated by stream_update_time_stats() before this is called) and
+ * applies the peak update rule: spike up immediately if the new value exceeds
+ * the current peak estimate, otherwise decay slowly.
+ *
+ * For HTTP backends, srv->counters.d_time (server processing + transfer time,
+ * excluding queue and connect overhead) is used. For TCP backends, d_time is
+ * always zero (stream.c sets t_data=t_connect before subtraction), so
+ * srv->counters.t_time (total response time) is used instead.
+ *
+ * Only called when the backend uses BE_LB_CB_PEWMA.
+ */
+static void fwlc_update_pewma_rtt(struct server *s)
+{
+	unsigned int rtt_sum, v, old_sum, new_sum;
+
+	if (s->proxy->mode == PR_MODE_HTTP)
+		rtt_sum = _HA_ATOMIC_LOAD(&s->counters.d_time);
+	else
+		rtt_sum = _HA_ATOMIC_LOAD(&s->counters.t_time);
+
+	if (!rtt_sum)
+		return;
+
+	/* v is the current standard EWMA average in milliseconds */
+	v = swrate_avg(rtt_sum, TIME_STATS_SAMPLES);
+
+	do {
+		old_sum = _HA_ATOMIC_LOAD(&s->pewma_sum);
+		if (v >= swrate_avg(old_sum, TIME_STATS_SAMPLES))
+			/* spike up: set peak EWMA average to v immediately */
+			new_sum = v * TIME_STATS_SAMPLES;
+		else
+			/* decay down using standard sliding-window formula */
+			new_sum = old_sum - (old_sum + TIME_STATS_SAMPLES - 1) / TIME_STATS_SAMPLES + v;
+	} while (!HA_ATOMIC_CAS(&s->pewma_sum, &old_sum, new_sum) && __ha_cpu_relax());
+}
+
 /* This function updates the server trees according to server <srv>'s new
  * state. It should be called when server <srv>'s status changes to down.
  * It is not important whether the server was already down or not. It is not
-- 
2.43.0

From 16bd07b4622b7d177479c461d339aec645ae95d7 Mon Sep 17 00:00:00 2001
From: Aleksandar Lazic <[email protected]>
Date: Mon, 16 Mar 2026 13:51:50 +0100
Subject: [PATCH 4/9] MINOR: lb_fwlc: extend key computation for peak-ewma
 scoring

fwlc_queue_srv() and fwlc_get_key() now branch on BE_LB_CB_PEWMA to
compute a latency-weighted sort key:

  key = rtt * (inflight + 1) * SRV_EWGHT_MAX / eweight

instead of the plain connection-count formula.  The rtt is the peak EWMA
average recovered from s->pewma_sum via swrate_avg().  Servers with no
latency history yet (pewma_sum == 0) produce key 0 and are always selected
first, bootstrapping the estimator.

For leastconn the existing formula is unchanged.  The eweight zero-guard
that fwlc_get_key() already had is hoisted so that it covers both
formulas, and the same guard is added to fwlc_queue_srv(), to avoid a
division by zero regardless of the active algorithm, even though callers
guarantee eweight > 0 by invariant.
---
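Reviewer note, not part of the commit: the key computation, including the
64-bit intermediate and the clamp to 32 bits, can be checked in isolation
with the sketch below.  DEMO_EWGHT_MAX is a placeholder for SRV_EWGHT_MAX
(its real value is not shown in this patch) and demo_pewma_key() is a
hypothetical helper, not a function from the series.

```c
#include <stdint.h>

#define DEMO_EWGHT_MAX 4096u  /* placeholder for SRV_EWGHT_MAX (assumption) */

/* key = rtt * (inflight + 1) * DEMO_EWGHT_MAX / eweight, clamped to 32 bits */
static uint32_t demo_pewma_key(unsigned int rtt_ms, unsigned int inflight,
                               unsigned int eweight)
{
	uint64_t k;

	if (!eweight)
		eweight = 1;  /* same zero-guard as in the patch */

	/* widen to 64 bits before multiplying, as the patch does */
	k = (uint64_t)rtt_ms * ((uint64_t)inflight + 1) * DEMO_EWGHT_MAX / eweight;
	return (k > UINT32_MAX) ? UINT32_MAX : (uint32_t)k;
}
```

An rtt of 0 always yields key 0, which is what lets servers without
latency history sort first, and very large products saturate at
UINT32_MAX instead of wrapping around to a small key.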
 src/lb_fwlc.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/src/lb_fwlc.c b/src/lb_fwlc.c
index 54f68b7887..d2f531c0dc 100644
--- a/src/lb_fwlc.c
+++ b/src/lb_fwlc.c
@@ -232,7 +232,21 @@ static inline void fwlc_queue_srv(struct server *s, unsigned int eweight)
 	unsigned int list_nb;
 	u32 key;
 
-	key = inflight ? (inflight + 1) * SRV_EWGHT_MAX / eweight : 0;
+	if (!eweight)
+		eweight = 1;
+
+	if ((s->proxy->lbprm.algo & BE_LB_PARM) == BE_LB_CB_PEWMA) {
+		/* Peak EWMA: key = rtt * (inflight+1) * SRV_EWGHT_MAX / eweight.
+		 * Servers with no latency history (pewma_sum==0) get key 0 and
+		 * are always tried first to seed the estimator.
+		 */
+		unsigned int rtt = swrate_avg(_HA_ATOMIC_LOAD(&s->pewma_sum), TIME_STATS_SAMPLES);
+		uint64_t k = (uint64_t)rtt * (inflight + 1) * SRV_EWGHT_MAX / eweight;
+
+		key = (k > UINT32_MAX) ? UINT32_MAX : (u32)k;
+	} else {
+		key = inflight ? (inflight + 1) * SRV_EWGHT_MAX / eweight : 0;
+	}
 	tree_elt = fwlc_get_tree_elt(s, key);
 	if (tree_elt == NULL) {
 		/*
@@ -292,7 +306,17 @@ static inline unsigned int fwlc_get_key(struct server *s)
 
 	inflight = _HA_ATOMIC_LOAD(&s->served) + _HA_ATOMIC_LOAD(&s->queueslength);
 	eweight = _HA_ATOMIC_LOAD(&s->cur_eweight);
-	new_key = inflight ? (inflight + 1) * SRV_EWGHT_MAX / (eweight ? eweight : 1) : 0;
+	if (!eweight)
+		eweight = 1;
+
+	if ((s->proxy->lbprm.algo & BE_LB_PARM) == BE_LB_CB_PEWMA) {
+		unsigned int rtt = swrate_avg(_HA_ATOMIC_LOAD(&s->pewma_sum), TIME_STATS_SAMPLES);
+		uint64_t k = (uint64_t)rtt * (inflight + 1) * SRV_EWGHT_MAX / eweight;
+
+		new_key = (k > UINT32_MAX) ? UINT32_MAX : (unsigned int)k;
+	} else {
+		new_key = inflight ? (inflight + 1) * SRV_EWGHT_MAX / eweight : 0;
+	}
 
 	return new_key;
 }
-- 
2.43.0

From ae43e2f6b9c3818d383ce82606a212dbff3d35d8 Mon Sep 17 00:00:00 2001
From: Aleksandar Lazic <[email protected]>
Date: Mon, 16 Mar 2026 13:42:48 +0100
Subject: [PATCH 3/9] MINOR: lb_fwlc: add includes for peak-ewma support

Include <haproxy/defaults.h> for TIME_STATS_SAMPLES and SRV_EWGHT_MAX,
and <haproxy/freq_ctr.h> for swrate_avg(), both needed by the peak-ewma
latency estimator added in subsequent commits.
---
 src/lb_fwlc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/lb_fwlc.c b/src/lb_fwlc.c
index 20a679e3af..54f68b7887 100644
--- a/src/lb_fwlc.c
+++ b/src/lb_fwlc.c
@@ -13,6 +13,8 @@
 #include <import/eb32tree.h>
 #include <haproxy/api.h>
 #include <haproxy/backend.h>
+#include <haproxy/defaults.h>
+#include <haproxy/freq_ctr.h>
 #include <haproxy/queue.h>
 #include <haproxy/server-t.h>
 #include <haproxy/task.h>
-- 
2.43.0

From 1f6b388603177552b86709a96f03ac788bd7fe23 Mon Sep 17 00:00:00 2001
From: Aleksandar Lazic <[email protected]>
Date: Mon, 16 Mar 2026 13:42:13 +0100
Subject: [PATCH 2/9] MINOR: server-t: add pewma_sum latency estimator field

Add an unsigned int pewma_sum to struct server to hold the peak EWMA
sliding-window sum of response time in milliseconds.

The field uses the same swrate sliding-window format as the existing
be_counters fields: swrate_avg(pewma_sum, TIME_STATS_SAMPLES) gives the
current peak EWMA average.  A value of zero means no latency history yet;
such servers are always selected first to bootstrap the estimator.
---
 include/haproxy/server-t.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/haproxy/server-t.h b/include/haproxy/server-t.h
index c8f318c5ce..634d0efb27 100644
--- a/include/haproxy/server-t.h
+++ b/include/haproxy/server-t.h
@@ -435,6 +435,7 @@ struct server {
 	int consecutive_errors;			/* current number of consecutive errors */
 	int consecutive_errors_limit;		/* number of consecutive errors that triggers an event */
 	struct be_counters counters;		/* statistics counters */
+	unsigned int pewma_sum;			/* peak EWMA sliding-window sum of response time (ms), used by peak-ewma LB */
 
 	/* Below are some relatively stable settings, only changed under the lock */
 	THREAD_ALIGN();
-- 
2.43.0

From 04d4382a0d6e213bce2d92edfdb0c99697088ac0 Mon Sep 17 00:00:00 2001
From: Aleksandar Lazic <[email protected]>
Date: Mon, 16 Mar 2026 13:42:07 +0100
Subject: [PATCH 1/9] MINOR: backend-t: introduce peak-ewma LB algorithm
 constants

Add BE_LB_CB_PEWMA (0x00000002) as a new connection-based LB parameter
within BE_LB_KIND_CB, and BE_LB_ALGO_PEWMA combining BE_LB_KIND_CB with
the new parameter.

Peak EWMA (peak exponentially-weighted moving average) is a latency-aware
load balancing variant that scores servers by rtt*(inflight+1), reacting
to latency spikes immediately while decaying slowly.  Both leastconn and
peak-ewma share BE_LB_KIND_CB and the FWLC tree infrastructure.
---
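Reviewer note, not part of the commit: the parameter is a small value
stored in the low bits of the algorithm word, not a flag, which is why the
series always compares it after masking.  In the sketch below only the
three DEMO_CB_* values mirror this patch; DEMO_KIND_CB and DEMO_PARM are
assumed placeholder values and every DEMO_/demo_ name is hypothetical.

```c
/* Only the DEMO_CB_* values mirror the patch; the rest are assumptions. */
#define DEMO_KIND_CB    0x00000200u  /* assumed position of the "kind" bits */
#define DEMO_PARM       0x000000FFu  /* assumed mask of the per-kind parameter */
#define DEMO_CB_LC      0x00000000u
#define DEMO_CB_FAS     0x00000001u
#define DEMO_CB_PEWMA   0x00000002u

#define DEMO_ALGO_LC    (DEMO_KIND_CB | DEMO_CB_LC)
#define DEMO_ALGO_PEWMA (DEMO_KIND_CB | DEMO_CB_PEWMA)

/* The parameter must be compared after masking: DEMO_CB_LC is 0, so a
 * plain bit test such as (algo & DEMO_CB_LC) could never be true.
 */
static int demo_is_leastconn(unsigned int algo)
{
	return (algo & DEMO_PARM) == DEMO_CB_LC;
}

static int demo_is_pewma(unsigned int algo)
{
	return (algo & DEMO_PARM) == DEMO_CB_PEWMA;
}
```

Extra property bits set above the parameter field do not disturb the
comparison, because the mask removes them first.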
 include/haproxy/backend-t.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/haproxy/backend-t.h b/include/haproxy/backend-t.h
index da3c10a5a4..12a445791b 100644
--- a/include/haproxy/backend-t.h
+++ b/include/haproxy/backend-t.h
@@ -58,6 +58,7 @@
 /* BE_LB_CB_* is used with BE_LB_KIND_CB */
 #define BE_LB_CB_LC     0x00000000  /* least-connections */
 #define BE_LB_CB_FAS    0x00000001  /* first available server (opposite of leastconn) */
+#define BE_LB_CB_PEWMA  0x00000002  /* peak EWMA latency + inflight */
 
 /* BE_LB_SA_* is used with BE_LB_KIND_SA */
 #define BE_LB_SA_SS     0x00000000  /* stick to server as long as it is available */
@@ -87,7 +88,8 @@
 #define BE_LB_ALGO_RR   (BE_LB_KIND_RR | BE_LB_NEED_NONE)      /* round robin */
 #define BE_LB_ALGO_RND  (BE_LB_KIND_RR | BE_LB_NEED_NONE | BE_LB_RR_RANDOM) /* random value */
 #define BE_LB_ALGO_LC   (BE_LB_KIND_CB | BE_LB_NEED_NONE | BE_LB_CB_LC)    /* least connections */
-#define BE_LB_ALGO_FAS  (BE_LB_KIND_CB | BE_LB_NEED_NONE | BE_LB_CB_FAS)   /* first available server */
+#define BE_LB_ALGO_FAS   (BE_LB_KIND_CB | BE_LB_NEED_NONE | BE_LB_CB_FAS)   /* first available server */
+#define BE_LB_ALGO_PEWMA (BE_LB_KIND_CB | BE_LB_NEED_NONE | BE_LB_CB_PEWMA) /* peak EWMA */
 #define BE_LB_ALGO_SS   (BE_LB_KIND_SA | BE_LB_NEED_NONE | BE_LB_SA_SS)    /* sticky */
 #define BE_LB_ALGO_SRR  (BE_LB_KIND_RR | BE_LB_NEED_NONE | BE_LB_RR_STATIC) /* static round robin */
 #define BE_LB_ALGO_SH	(BE_LB_KIND_HI | BE_LB_NEED_ADDR | BE_LB_HASH_SRC) /* hash: source IP */
-- 
2.43.0
