Hi,

I’d like to bring up an idea for discussion regarding the implementation of
stateless load balancing support, and I would really appreciate your feedback.
I’m also open to alternative approaches or ideas that I may not have considered.
Thank you in advance for your time and input!

We would like to move towards stateless traffic load balancing. Here is how it
would differ from the current approach:
At the moment, when we have a load balancer on a router with DGP ports,
conntrack state is stored directly on the gateway that currently hosts the DGP.
We select a backend for the first packet of a connection, and after that, based
on the existing conntrack entry, we no longer perform backend selection;
instead, we rely entirely on the stored conntrack record for subsequent packets.
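
To make the difference concrete, here is a toy Python sketch of that stateful
path (not OVN code; the addresses are made up and random.choice stands in for
the real selection policy). Selection runs once per connection, and the stored
entry steers everything after that:

```python
import random

# Toy model of the current stateful behaviour: backend selection happens
# only for the first packet of a connection; every later packet is steered
# by the stored conntrack entry on the gateway node.
BACKENDS = ["10.0.0.2", "10.0.0.3"]   # illustrative backend IPs
conntrack = {}                        # 5-tuple -> chosen backend (gateway-local state)

def forward(five_tuple):
    if five_tuple not in conntrack:
        # first packet of the connection: run backend selection once
        conntrack[five_tuple] = random.choice(BACKENDS)
    # all subsequent packets skip selection and follow the stored entry
    return conntrack[five_tuple]
```

The `conntrack` dictionary is exactly the per-node state that blocks horizontal
scaling: a second gateway would have an empty table for the same connection.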

One of the key limitations of this approach is that gateway nodes cannot be
scaled horizontally, since the conntrack state is stored on a single node.
The ability to scale gateway nodes horizontally is actually one of our goals.

Here are a few possible approaches I see:
1) The idea of synchronizing conntrack state already exists in the community,
   but it seems rather outdated and not very promising.
2) Avoid storing conntrack state on the GW node and instead perform stateless
   load balancing for every packet of a connection, while keeping conntrack
   state on the node where the virtual machine resides. This is the approach
   I am currently leaning toward.

OVN already has the use_stateless_nat option for load balancers, but it comes
with several limitations:
1) It works by selecting a backend in a stateless manner for every packet,
   while for return traffic it performs a 1:1 SNAT: for traffic from a backend
   (matched on source IP and port), the source IP is rewritten to lb.vip. This
   only works correctly if a backend belongs to a single lb.vip; otherwise,
   there is a risk that return traffic will be SNATed to the wrong lb.vip.
2) There is also an issue with preserving TCP sessions when the number of
   backends changes. Since backend selection is done using select(), any change
   in the number of backends can break existing TCP sessions. This problem is
   also relevant to the solution I proposed above; I will elaborate on it below.
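
The first limitation can be shown with a tiny Python sketch (the VIPs and
backend address are made up for illustration): when a backend is a member of
two VIPs, the return packet alone does not say which VIP the stateless reverse
SNAT should use.

```python
# Illustrative VIP membership: the same backend serves two load balancers.
vip_backends = {
    "192.0.2.10": ["10.0.0.2"],
    "192.0.2.11": ["10.0.0.2"],
}

def reverse_snat_candidates(backend_src_ip):
    # Stateless 1:1 SNAT must pick "the" VIP for a reply coming from this
    # backend, but with shared membership both VIPs match.
    return [vip for vip, members in vip_backends.items()
            if backend_src_ip in members]
```

With two candidates, a stateless rule can rewrite the reply to the wrong VIP,
which is exactly the risk described above.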

More details about my idea (I will attach some code below; this is not
production-ready, just a prototype for testing [1]):
1. A packet arrives at the gateway looking like this:
   eth.dst == dgp_mac
   eth.src == client_mac
   ip.dst == lb.vip
   ip.src == client_ip
   tcp.dst == lb.port
   tcp.src == client_port
2. We detect that the packet is addressed to a load balancer → perform select
   over the backends.
3. We route the traffic directly to the selected backend by changing eth.dst
   to the backend’s MAC, while keeping `ip.dst == lb.vip`.
4. We pass the packet further down the processing pipeline. At this point, it
   looks like:
   eth.dst == backend_mac
   eth.src == (source port of switch where backend is connected)
   ip.dst == lb.vip
   ip.src == client_ip
   tcp.dst == lb.port
   tcp.src == client_port
5. The packet goes through the ingress pipeline of the switch where the backend
   port resides → then it is sent through the tunnel.
6. The packet arrives at the node hosting the backend VM, where we perform
   egress load balancing and store conntrack state.
7. The server responds, and the return traffic is SNATed at the ingress of the
   switch pipeline.
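
The header rewrites in the steps above can be sketched as a toy Python model
(all names such as backend_mac, switch_port_mac and backend_ip are illustrative
placeholders, not real OVN values). The key point is that the gateway stage is
purely stateless L2 steering, while the stateful DNAT happens only on the
backend’s node:

```python
# Packet as it arrives at the gateway (step 1).
pkt = {
    "eth.dst": "dgp_mac", "eth.src": "client_mac",
    "ip.dst": "vip", "ip.src": "client_ip",
    "tcp.dst": "lb_port", "tcp.src": "client_port",
}

def gateway_stage(p, backend_mac):
    # Steps 2-4: stateless backend selection plus L2 steering only;
    # ip.dst deliberately stays the VIP across the fabric.
    return {**p, "eth.dst": backend_mac, "eth.src": "switch_port_mac"}

def backend_node_stage(p, backend_ip):
    # Step 6: stateful DNAT and the conntrack entry live on the node
    # hosting the backend VM, not on the gateway.
    return {**p, "ip.dst": backend_ip}
```

Because the gateway never stores per-connection state in this model, any
gateway can process any packet of the connection, which is what enables
horizontal scaling.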

This has already been implemented in code for testing, and it works. Storing
conntrack state on the node where the virtual machine resides helps address the
first limitation of the use_stateless_nat option.

Regarding the second issue: we need to ensure session persistence when the
number of backends changes. In general, as I understand it, stateless load
balancing in such systems is typically based on consistent hashing. However,
consistent hashing alone is not sufficient, since it does not preserve 100% of
connections. I see the solution as some kind of additional layer on top of
consistent hashing.
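
A small illustration of why consistent hashing alone is not enough (a toy
Python model, not OVS code): with rendezvous (HRW) hashing, connections mapped
to surviving backends stay put, but every connection that was on a removed
backend still remaps, so those sessions are lost without an extra sticky layer.

```python
import hashlib

def pick(conn, backends):
    # Rendezvous hashing: each connection goes to the backend with the
    # highest hash(conn, backend). Deterministic and order-independent.
    return max(backends, key=lambda b: hashlib.sha256(
        f"{conn}|{b}".encode()).hexdigest())

backends = ["b0", "b1", "b2", "b3"]
before = {c: pick(c, backends) for c in range(1000)}
# Remove one backend and re-evaluate every connection.
after = {c: pick(c, [b for b in backends if b != "b3"]) for c in range(1000)}
moved = [c for c in before if before[c] != after[c]]
# Only connections that were on the removed backend had to move,
# but those connections do break: this is the gap the extra layer fills.
```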

First, about consistent hashing in OVS (correct me if I’m wrong):
Currently, of the two hashing methods in OVS, only `hash` provides consistency,
since it is based on rendezvous hashing and relies on the bucket id. For this
to work correctly, bucket_ids must be preserved when the number of backends
changes. However, OVN recreates the OpenFlow group every time the number of
backends changes: when that happens, we create a bundle whose first piece is an
ADD group message, followed by INSERT_BUCKET/REMOVE messages. If we could
rewrite this part to support granular insertion/removal of backends in the
group (using INSERT_BUCKET/REMOVE without recreating the group), we could make
backend selection consistent for the hash method. I wasn’t able to fully
determine whether there are any limitations to this approach, but I did
manually test session preservation with ovs-ofctl while inserting/removing
buckets: when removing buckets from a group, sessions associated with the other
buckets were preserved.
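
As a toy model of why stable bucket_ids matter (this is an illustration, not
OVS internals): with rendezvous hashing over bucket ids, inserting a bucket
with a new id only pulls some flows onto the new bucket, while every other
flow keeps its old bucket. That is exactly the property we would get from
granular INSERT_BUCKET instead of group re-creation.

```python
import hashlib

def select_bucket(flow_hash, bucket_ids):
    # Rendezvous hashing keyed on the bucket id, mimicking the `hash`
    # selection method's reliance on stable bucket_ids.
    return max(bucket_ids, key=lambda bid: hashlib.sha256(
        f"{flow_hash}:{bid}".encode()).hexdigest())

flows = range(500)
old = {f: select_bucket(f, [1, 2, 3]) for f in flows}
# Granular change: ids 1-3 keep their meaning, id 4 is inserted.
new = {f: select_bucket(f, [1, 2, 3, 4]) for f in flows}
moved = [f for f in flows if old[f] != new[f]]
# Every moved flow landed on the new bucket; nothing else was disturbed.
```

If the group is recreated and the ids renumbered instead, the same computation
runs over a different id set and mappings shuffle even though the backend set
is unchanged.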

At the same time, I understand the downsides of using hash: it is expensive,
since we install dp_flows that match the full 5-tuple of a connection, which
leads to a large number of datapath flows. Additionally, there is an upcall for
every SYN packet. Because of this, I’m not sure how feasible it is, but it
might be worth thinking about ways to make dp_hash consistent. I don’t yet have
a concrete proposal here, but I’d really appreciate any ideas or suggestions in
this direction.

As for the additional layer on top of consistent hashing, I’ve come up with two
potential approaches, and I’m not yet sure which one would be better:
1) Using the learn action
2) Hash-based sticky sessions in OVS

With an additional layer, we can account for hash rebuilds and inaccuracies
introduced by consistent hashing. It’s not entirely clear how to handle these
cases, considering that after some time we need to remove flows when using
`learn`, or clean up connections in the hash table. We also need to manage
connection removal when the number of backends changes.
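
The two-stage lookup I have in mind can be sketched like this (a toy Python
model; `sticky` stands in for either learn-generated flows or a hash-based
sticky table, and a real implementation would also need the ageing/cleanup
this sketch skips):

```python
import hashlib

sticky = {}  # connection -> backend chosen on its first packet

def hrw_pick(conn, backends):
    # Stage 2: consistent (rendezvous) hashing as the fallback.
    return max(backends, key=lambda b: hashlib.sha256(
        f"{conn}|{b}".encode()).hexdigest())

def forward(conn, backends):
    # Stage 1: the sticky entry wins if it exists.
    if conn not in sticky:
        # First packet: consult the hash, then "learn" the result.
        sticky[conn] = hrw_pick(conn, backends)
    return sticky[conn]
```

With this structure, a backend-set change cannot move an established connection
on the gateway that learned it; only packets arriving at a gateway without the
sticky entry fall back to the (possibly changed) hash.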

By using such a two-stage approach, ideally, we would always preserve the
session for a given connection, losing connections only in certain cases. For
example, suppose we have two GW nodes:
* The first SYN packet arrives at the first GW: we record this connection in
  our additional layer (using learn or the OVS hash) and continue routing all
  packets of this connection based on it.
* If a packet from this same connection arrives at the second GW, the
  additional layer there (learn or the OVS hash) will not have the session. If
  the number of backends has changed since the backend was selected for this
  connection on the first GW, there is a chance it will not hit the same
  backend and the session will be lost.

The downsides I see for the hash-based sticky sessions approach are:
1. More complex handling of cleanup for old connections and removal of hashes
   for expired backends.

The downsides of the learn action are:
1. High load on OVS at a high connection rate.
2. If I understand correctly, this also results in a large number of dp_flows
   that track each individual 5-tuple hash.

I would be glad to hear your thoughts on a possible implementation of dp_hash
consistency, as well as feedback on the overall architecture. These are all the
ideas I’ve been able to develop so far. I would greatly appreciate any
criticism, suggestions, or alternative approaches. It would also be interesting
to know whether anyone else is interested in this new optional mode for load
balancers =)

[1] https://github.com/Sashhkaa/ovn/commits/deferred-lb-dnat/ - Here's the 
code, just in case

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
