On 5/18/26 6:39 PM, Rukomoinikova Aleksandra wrote:
> Hi!
> This email is an update regarding stateless load balancing, along with a
> couple of questions before I start working on the implementation.
>
> In my previous email
> (https://mail.openvswitch.org/pipermail/ovs-dev/2026-March/431365.html),
> I described several approaches for how I see the implementation of
> stateless load balancing in OVN.
>
> To summarize the points from the previous email:
>
> I described the idea of deferring conntrack usage from the gateway node
> to the compute node where the virtual machine is located, as well as the
> main issue with this approach: since backend selection happens in a
> stateless manner and we do not store the connection in conntrack on the
> gateway, some connections may break during backend reconfiguration.
>
> Also, thanks to Dumitru for the idea he suggested.
>
> His approach is to use conntrack on the gateway node and store the MAC
> address of the selected backend in conntrack labels, which solves the
> session persistence problem. However, this makes it mandatory for the
> return traffic to pass through the same gateway that initially handled
> the connection. In general, I think this idea could be made more generic
> — instead of doing any L2 balancing with MAC selection on the gateway
> node, we could fully rely on DNAT in this case. The main requirement
> would still be ensuring that the return traffic goes back through the
> same gateway. This is exactly the part I could not figure out for our
> topology: we have two gateways (in the simplest case, there may be more)
> for incoming traffic, while the backend VMs are located on different
> compute nodes.
>
> Here is the topology diagram for your convenience:
> https://s3.ru-msk.k2.cloud/stateless-lb-topology/stateless-lb-topology.drawio
>
> If I understand correctly, I cannot use ecmp-symmetric-reply in such a
> topology. Such a route has to be attached to a router that has a chassis
> assigned to it, meaning it has to be a centralized router bound to some
> chassis, and our topology does not have such a router.
>
> At this point, based on the analysis from the previous email, I would
> like to start implementing the following approach: use L2 stateless
> balancing on the gateway node and then do DNAT, relying on conntrack, on
> the compute node where the virtual machine is located, while using
> rendezvous hashing in OVS.
>
> Currently, rendezvous hashing works only with the hash selection method,
> which has the downside of requiring an upcall for every SYN packet. I
> made a small hack that reuses the code path currently used for dp_hash,
> but selects the backend using rendezvous hashing instead of the Webster
> distribution that dp_hash currently uses
> (https://github.com/Sashhkaa/ovs/commit/ad82205ed4df125e7072c6b2e480c26e4af297ae).
>
> I tested this by incrementally inserting and removing buckets
> (insert-buckets/remove-buckets commands in ovs-ofctl), and it behaves
> as expected, similarly to the existing hash method: when I remove a
> backend, established connections on other backends are not broken, and
> when I add a backend, only 1/(n + 1) of the sessions are rebuilt.
>
> Unfortunately, I do not yet understand this well enough to determine
> whether there are any fundamental limitations with this approach. The
> way I currently see it is the following: I would use dp_hash at the
> datapath level, meaning there would be a single upcall for a group of
> packets going through the load balancer, after which I would get a
> recirculation for each individual hash value that was calculated. Then,
> while processing the upcall for a recirculated packet, I would select
> the backend for the recirculation flow using rendezvous hashing, the
> same way the hash method works.
>
> The most obvious downside I currently see is that rendezvous hashing is
> inherently more expensive than the current dp_hash approach because it
> calculates the hash for each backend, so I plan to measure the
> performance impact under high traffic load and with a large number of
> backends.
>
> I was also considering possible differences in traffic distribution
> across backends, but if my understanding of the math is correct, both
> algorithms should provide roughly similar distribution properties.
>
> So here are my questions:
> Did I miss anything at first glance and how valid would it be to use
> dp_hash in such an implementation? If there are no strict limitations
> requiring this approach, would it make sense to introduce a new
> selection_method, for example something like consistent-dp-hash, which
> would still calculate the packet hash at the datapath level, but would
> use rendezvous hashing instead of the Webster method for backend
> selection?

FWIW, rendezvous hashing, a.k.a. highest random weight (HRW), and
consistent hashing are different things, so we should not mix up the terms.
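
For reference, a minimal sketch of unweighted HRW selection. The h32()
helper and the bucket IDs are made up for illustration; this is not the
OVS implementation. It checks the property described in the quoted
testing: removing a bucket only moves the keys that were on it, and
adding a bucket only moves keys onto the new one.

```python
import hashlib


def h32(*parts: int) -> int:
    """Hypothetical 32-bit hash helper based on SHA-256 (not an OVS API)."""
    data = b"".join(p.to_bytes(8, "little") for p in parts)
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "little")


def hrw_select(key: int, buckets: list[int]) -> int:
    """Pick the bucket with the highest per-(key, bucket) score."""
    # The bucket ID is a tie-breaker so the choice is always deterministic.
    return max(buckets, key=lambda b: (h32(key, b), b))


keys = range(2000)
before = {k: hrw_select(k, [1, 2, 3]) for k in keys}
after_remove = {k: hrw_select(k, [1, 3]) for k in keys}      # bucket 2 gone
after_add = {k: hrw_select(k, [1, 2, 3, 4]) for k in keys}   # bucket 4 added

moved_on_remove = [k for k in keys if before[k] != after_remove[k]]
moved_on_add = [k for k in keys if before[k] != after_add[k]]

print("moved on remove:", len(moved_on_remove))
print("moved on add:   ", len(moved_on_add))
```

With n = 3 buckets, the share of keys that move on the addition should
be close to 1/(n + 1).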

For the implementation, it seems like you're essentially trying to revert
the previous improvement:

  2e3fd24c7c44 ("ofproto-dpif: Improve dp_hash selection method for select
  groups")

While doing that, you need to consider the drawbacks of using dp_hash for
the HRW implementation. As stated in the commit above, the distribution
that you get from HRW running on the masked dp_hash is not great and is
often quite different from the desired one. You want to use the masked
dp_hash to reduce the number of datapath flows and upcalls, but you're
losing accuracy. Without dp_hash, the full length of the hash is used
for the bucket selection and the distribution is much more aligned with
the desired one. The Webster method avoids the problem by pre-computation
that is allowed to use the full size of the hash.

Note that you have to mask the hash during the upcall before starting the
HRW computation, the same way it is masked in the datapath, otherwise
packets going through the datapath and the ones going through userspace
will hit different buckets.
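
A simplified illustration of the mismatch, not a model of how OVS
installs flows: MASK, the bucket IDs, and both *_bucket() functions are
hypothetical. The "datapath" side always selects on the masked hash; the
"upcall" side either applies the same mask or forgets to.

```python
import hashlib


def h32(*parts: int) -> int:
    """Hypothetical 32-bit hash helper based on SHA-256 (not an OVS API)."""
    data = b"".join(p.to_bytes(8, "little") for p in parts)
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "little")


def hrw_select(key: int, buckets: list[int]) -> int:
    """Unweighted HRW: highest per-(key, bucket) score wins."""
    return max(buckets, key=lambda b: (h32(key, b), b))


MASK = 0xF            # assumed 4-bit dp_hash mask
BUCKETS = [1, 2, 3]


def datapath_bucket(dp_hash: int) -> int:
    """Bucket reached via a flow that matches on the masked hash."""
    return hrw_select(dp_hash & MASK, BUCKETS)


def upcall_bucket(dp_hash: int, apply_mask: bool) -> int:
    """Bucket chosen in userspace, with or without the same mask."""
    key = dp_hash & MASK if apply_mask else dp_hash
    return hrw_select(key, BUCKETS)


hashes = [h32(i, 0xABC) for i in range(1000)]
mismatch_masked = sum(
    datapath_bucket(h) != upcall_bucket(h, True) for h in hashes)
mismatch_unmasked = sum(
    datapath_bucket(h) != upcall_bucket(h, False) for h in hashes)

print("mismatches with mask:   ", mismatch_masked)
print("mismatches without mask:", mismatch_unmasked)
```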

You could allocate more bits for the mask and have a better distribution,
but it's a trade-off with the number of upcalls and you'll be risking a
datapath flow explosion. Also, if the mask changes, all connections will
be re-distributed.
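
A small sketch of both effects under assumed numbers (4-bit and 8-bit
masks, three buckets, a made-up h32() helper; not OVS code). The count
of distinct masked values is an upper bound on the per-group datapath
flows in this toy model, and moved_fraction() shows how many flows land
on a different bucket once the mask changes.

```python
import hashlib


def h32(*parts: int) -> int:
    """Hypothetical 32-bit hash helper based on SHA-256 (not an OVS API)."""
    data = b"".join(p.to_bytes(8, "little") for p in parts)
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "little")


def hrw_select(key: int, buckets: list[int]) -> int:
    """Unweighted HRW: highest per-(key, bucket) score wins."""
    return max(buckets, key=lambda b: (h32(key, b), b))


def moved_fraction(old_mask: int, new_mask: int,
                   flows: list[int], buckets: list[int]) -> float:
    """Share of flows whose bucket differs between the two masks."""
    moved = sum(
        hrw_select(f & old_mask, buckets) != hrw_select(f & new_mask, buckets)
        for f in flows)
    return moved / len(flows)


BUCKETS = [1, 2, 3]
FLOWS = [h32(i, 0x5EED) for i in range(5000)]   # stand-in flow hashes

distinct_4 = len({f & 0xF for f in FLOWS})      # at most 2**4 values
distinct_8 = len({f & 0xFF for f in FLOWS})     # at most 2**8 values
frac = moved_fraction(0xF, 0xFF, FLOWS, BUCKETS)

print("distinct masked values:", distinct_4, "vs", distinct_8)
print("fraction moved by the mask change:", round(frac, 3))
```

Each extra mask bit doubles the number of masked values, and in this
model the mask change reshuffles well over half of the flows.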

Best regards, Ilya Maximets.