Hi, I’d like to bring up an idea for discussion regarding the implementation of stateless load balancing support. I would really appreciate your feedback. I’m also open to alternative approaches or ideas that I may not have considered. Thank you in advance for your time and input!
We would like to move towards stateless traffic load balancing. Here is how it would differ from the current approach.

At the moment, when a load balancer sits on a router with DGP ports, conntrack state is stored directly on the gateway that is currently hosting the DGP. We select a backend for the first packet of a connection, and after that, based on the existing conntrack entry, we no longer perform backend selection; we rely entirely on the stored conntrack record for subsequent packets. A key limitation of this design is that gateway nodes cannot be horizontally scaled, since the conntrack state lives on a single node. The ability to horizontally scale gateway nodes is one of our main goals.

Here are a few possible approaches I see:

1) Synchronizing conntrack state. This idea already exists in the community, but it seems rather outdated and not very promising.

2) Avoid storing conntrack state on the GW node and instead perform stateless load balancing for every packet of a connection, while keeping conntrack state on the node where the virtual machine resides. This is the approach I am currently leaning toward.

OVN already has a use_stateless_nat option for load balancers, but it comes with several limitations:

1) It selects a backend in a stateless manner for every packet, and for return traffic it performs a 1:1 SNAT: for traffic from a backend (matched on ip.src && src.port), the source IP is rewritten to lb.vip. This only works correctly if a backend belongs to a single lb.vip; otherwise there is a risk that return traffic will be SNATed to the wrong lb.vip.

2) There is also an issue with preserving TCP sessions when the number of backends changes. Since backend selection is done using select(), any change in the number of backends can break existing TCP sessions. This problem also applies to the solution I propose; I will elaborate on it below.

More details about my idea (I will attach some code below; this is not production-ready, just a prototype for testing [1]):
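To make the first use_stateless_nat limitation concrete, here is a small toy model (plain Python, all names and addresses invented for illustration) of why a stateless reverse SNAT keyed only on the backend address becomes ambiguous once a backend is shared by two VIPs:

```python
# Toy model of the use_stateless_nat return-path problem: the stateless
# reverse SNAT sees only the backend's source address, so when one
# backend serves two VIPs the rewrite can pick the wrong one.
# All names and addresses here are hypothetical.

vip_backends = {
    "10.0.0.100": ["192.168.0.10", "192.168.0.11"],
    "10.0.0.200": ["192.168.0.10"],  # same backend behind a second VIP
}

# Build the reverse-SNAT mapping backend -> VIP. With a shared backend
# the key collides and the later VIP silently overwrites the earlier one.
reverse_snat = {}
for vip, backends in vip_backends.items():
    for be in backends:
        reverse_snat[be] = vip

def snat_return_packet(src_ip):
    """Rewrite ip.src of return traffic to the (assumed) VIP."""
    return reverse_snat[src_ip]

assert snat_return_packet("192.168.0.11") == "10.0.0.100"  # unambiguous
# A client that connected via 10.0.0.100 gets its replies SNATed to
# 10.0.0.200 instead, so its TCP session breaks:
assert snat_return_packet("192.168.0.10") == "10.0.0.200"
```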
1. A packet arrives at the gateway looking like this:

   eth.dst == dgp_mac
   eth.src == client_mac
   ip.dst  == lb.vip
   ip.src  == client_ip
   tcp.dst == lb.port
   tcp.src == client_port

2. We detect that the packet is addressed to a load balancer → perform a select over the backends.

3. We route the traffic directly to the selected backend by changing eth.dst to the backend’s MAC, while keeping `ip.dst == lb.vip`.

4. We pass the packet further down the processing pipeline. At this point, it looks like:

   eth.dst == backend_mac
   eth.src == (MAC of the source port of the switch where the backend is connected)
   ip.dst  == lb.vip
   ip.src  == client_ip
   tcp.dst == lb.port
   tcp.src == client_port

5. The packet goes through the ingress pipeline of the switch where the backend port resides → then it is sent through the tunnel.

6. The packet arrives at the node hosting the backend VM, where we perform egress load balancing and store the conntrack state.

7. The server responds, and the return traffic is SNATed at the ingress of the switch pipeline.

This has already been implemented in code for testing, and it works. Storing conntrack state on the node where the virtual machine resides addresses the first limitation of the use_stateless_nat option.

Regarding the second issue, we need to ensure session persistence when the number of backends changes. As I understand it, stateless load balancing in such systems is typically based on consistent hashing. However, consistent hashing alone is not sufficient, since it does not preserve 100% of connections. I see the solution as some kind of additional layer on top of consistent hashing.

First, about consistent hashing in OVS (correct me if I'm wrong): of the two hashing methods in OVS, only `hash` provides consistency, since it is based on rendezvous hashing and relies on the bucket id. For this to work correctly, bucket_ids must be preserved when the number of backends changes. However, OVN recreates the OpenFlow group every time the number of backends changes.
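The header rewrites in the walk-through above can be sketched as follows (a minimal Python model of the pipeline steps, not the actual OVN flows; field and function names are invented to mirror the description):

```python
# Sketch of the proposed pipeline: the gateway only rewrites eth.dst
# (steps 2-4), the VIP survives the tunnel, and the real DNAT plus
# conntrack creation happen on the node hosting the VM (step 6).
# All names/addresses are hypothetical.

def gw_ingress(pkt, backends, select):
    """Gateway: stateless backend pick; only eth.dst changes."""
    backend = select(pkt, backends)
    return dict(pkt, **{"eth.dst": backend["mac"]}), backend

def hypervisor_egress(pkt, backend):
    """Node hosting the VM: DNAT to the backend; conntrack state
    would be created here, not on the gateway."""
    return dict(pkt, **{"ip.dst": backend["ip"], "tcp.dst": backend["port"]})

pkt = {"eth.dst": "dgp_mac", "eth.src": "client_mac",
       "ip.dst": "10.0.0.100", "ip.src": "172.16.0.5",
       "tcp.dst": 80, "tcp.src": 34567}
backends = [{"mac": "aa:bb:cc:dd:ee:01", "ip": "192.168.0.10", "port": 8080}]

pkt, backend = gw_ingress(pkt, backends, lambda p, b: b[0])
assert pkt["ip.dst"] == "10.0.0.100"      # VIP preserved across the tunnel
assert pkt["eth.dst"] == "aa:bb:cc:dd:ee:01"

pkt = hypervisor_egress(pkt, backend)
assert pkt["ip.dst"] == "192.168.0.10"    # DNAT happens next to the VM
```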
At the moment, when the number of backends changes, we create a bundle where the first message is an ADD group, and the subsequent messages are INSERT_BUCKET/REMOVE messages. If we could rewrite this part to support granular insertion/removal of backends in the group (using INSERT_BUCKET/REMOVE without recreating the group), we could make backend selection consistent for the `hash` method. I wasn’t able to fully determine whether there are any limitations to this approach, but I did manually test session preservation using ovs-ofctl while inserting/removing buckets: when removing buckets from a group, sessions associated with the other buckets were preserved.

At the same time, I understand the downsides of using `hash`: it is expensive, since we install datapath flows that match the full 5-tuple of a connection, which leads to a large number of datapath flows, and there is an upcall for every SYN packet. Because of this, I’m not sure how feasible it is, but it might be worth thinking about ways to make dp_hash consistent. I don’t yet have a concrete proposal here, but I’d really appreciate any ideas or suggestions in this direction.

About the additional layer on top of consistent hashing: I’ve come up with two potential approaches, and I’m not yet sure which one would be better:

1) Using the `learn` action.
2) Hash-based sticky sessions in OVS.

With an additional layer, we can account for hash rebuilds and the inaccuracies introduced by consistent hashing. It’s not entirely clear how to handle these cases, considering that after some time we need to remove flows when using `learn`, or clean up connections in the hash. We also need to manage connection removal when the number of backends changes. With such a two-stage approach, ideally we would always preserve the session for a given connection, losing connections only in certain cases.
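The property the INSERT_BUCKET/REMOVE idea relies on can be demonstrated with a minimal rendezvous (highest-random-weight) hashing sketch; this is a toy model in Python, not the OVS implementation, but it shows why keeping bucket ids stable means removing one bucket only remaps the flows that were on it:

```python
import hashlib

# Rendezvous hashing toy model: each flow picks the bucket with the
# highest per-(flow, bucket_id) score. If bucket ids are preserved,
# removing a bucket changes the answer only for flows whose previous
# maximum was that bucket; everyone else keeps their backend.

def score(flow, bucket_id):
    h = hashlib.sha256(f"{flow}:{bucket_id}".encode()).digest()
    return int.from_bytes(h[:8], "big")

def select_bucket(flow, bucket_ids):
    return max(bucket_ids, key=lambda b: score(flow, b))

flows = [f"10.0.0.{i}:500{i % 10}" for i in range(100)]
before = {f: select_bucket(f, [1, 2, 3, 4]) for f in flows}

# Remove bucket 3 without renumbering the survivors.
after = {f: select_bucket(f, [1, 2, 4]) for f in flows}

moved = [f for f in flows if before[f] != after[f]]
# Only flows that were on bucket 3 move; sessions on 1, 2, 4 survive.
assert all(before[f] == 3 for f in moved)
assert all(after[f] == before[f] for f in flows if before[f] != 3)
```

This is exactly what breaks when the group is recreated: fresh bucket ids change the (flow, bucket_id) scores for every flow, so far more sessions are remapped than necessary.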
For example, suppose we have two GW nodes:

* The first SYN packet arrives at the first GW; we record this connection in our additional layer (using `learn` or an OVS hash) and continue routing all packets of this connection based on it.

* If a packet from this same connection arrives at the second GW, the additional layer there (`learn` or OVS hash) will not have the session. If the number of backends has changed since the backend was selected for this connection on the first GW, there is a chance the packet will not hit the same backend, and the session will be lost.

The downsides I see for the hash-based sticky sessions are:

1. More complex handling of cleanup for old connections and removal of hash entries for expired backends.

The downsides of the `learn` action are:

1. High load on OVS at a high connection rate.
2. If I understand correctly, it also results in a large number of datapath flows that track each individual 5-tuple.

I would be glad to hear your thoughts on a possible implementation of dp_hash consistency, as well as feedback on the overall architecture. These are all the ideas I’ve been able to develop so far. I would greatly appreciate any criticism, suggestions, or alternative approaches. It would also be interesting to know whether anyone else is interested in this new optional mode for load balancers =)

[1] https://github.com/Sashhkaa/ovn/commits/deferred-lb-dnat/ - Here's the code, just in case

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
