On 27.03.2026 12:58, Dumitru Ceara wrote:
> On 3/26/26 10:35 PM, Rukomoinikova Aleksandra wrote:
>> Hi,
>>
> Hi Aleksandra,
Hi Dumitru! Thank you for your answer!

The main reason I didn't consider conntrack on the gateway node is that
I'm afraid return traffic might go through a different gateway: the SYN
would go through one gateway, but the SYN+ACK from the server might come
back through another. In that case, the next client packet would be
dropped at the first gateway because conntrack would consider it
invalid.

Or do you mean using it not as "real" conntrack in the usual sense, but
just as a convenient base to extract packet metadata from? In that case
it is indeed convenient for removing inactive backends, but we would
need to clean up conntrack entries ourselves after some time, because
we can't rely on the conntrack timers anymore since states won't be
updated properly. Updating conntrack entries from OVN probably also
seems crazy.

>> I'd like to bring up an idea for discussion regarding the
>> implementation of stateless load balancing support. I would really
>> appreciate your feedback. I'm also open to alternative approaches or
>> ideas that I may not have considered. Thank you in advance for your
>> time and input!
>>
> Thanks for researching this, it's very nice work!
>
>> So, we would like to move towards stateless traffic load balancing.
>> Here is how it would differ from the current approach:
>> At the moment, when we have a load balancer on a router with DGP
>> ports, conntrack state is stored directly on the gateway that is
>> currently hosting the DGP. So, we select a backend for the first
>> packet of a connection, and after that, based on the existing
>> conntrack entry, we no longer perform backend selection. Instead, we
>> rely entirely on the stored conntrack record for subsequent packets.
>>
>> One of the key limitations of this solution is that gateway nodes
>> cannot be horizontally scaled, since conntrack state is stored on a
>> single node. Achieving the ability to horizontally scale gateway
>> nodes is actually one of our goals.
>>
>> Here are a few possible approaches I see:
>> 1) The idea of synchronizing conntrack state already exists in the
>> community, but it seems rather outdated and not very promising.
> Yeah, this on its own has a lot of potential scalability issues so we
> never really pursued it I guess.
>
>> 2) Avoid storing conntrack state on the GW node and instead perform
>> stateless load balancing for every packet in the connection, while
>> keeping conntrack state on the node where the virtual machine
>> resides. This is the approach I am currently leaning toward.
> Maybe we _also_ need to store conntrack state on the GW node, I'll
> detail below.
>
>> OVN currently has a use_stateless_nat option for load balancers, but
>> it comes with several limitations:
>> 1) It works by selecting a backend in a stateless manner for every
>> packet, while for return traffic it performs 1:1 SNAT. So, for
>> traffic from a backend (ip.src && src.port), the source IP is
>> rewritten to lb.vip. This only works correctly if a backend belongs
>> to a single lb.vip. Otherwise, there is a risk that return traffic
>> will be SNATed to the wrong lb.vip.
>> 2) There is also an issue with preserving TCP sessions when the
>> number of backends changes. Since backend selection is done using
>> select(), any change in the number of backends can break existing
>> TCP sessions. This problem is also relevant to the solution I
>> proposed above; I will elaborate on it below.
>>
>> More details about my idea (I will attach some code below; this is
>> not production-ready, just a prototype for testing [1]):
>> 1. A packet arrives at the gateway looking like this:
>>    eth.dst == dgp_mac
>>    eth.src == client_mac
>>    ip.dst == lb.vip
>>    ip.src == client_ip
>>    tcp.dst == lb.port
>>    tcp.src == client_port
>> 2. We detect that the packet is addressed to the load balancer and
>> perform select over the backends.
>> 3. We route traffic directly to the selected backend by changing
>> eth.dst to the backend's MAC, while keeping `ip.dst == lb.vip`.
> So what needs to be persisted is actually the MAC address of the
> backend that was selected (through whatever hashing method).
>
>> 4. We pass the packet further down the processing pipeline. At this
>> point, it looks like:
>>    eth.dst == backend_mac
>>    eth.src == (source port of the switch where the backend is connected)
>>    ip.dst == lb.vip
>>    ip.src == client_ip
>>    tcp.dst == lb.port
>>    tcp.src == client_port
>> 5. The packet goes through the ingress pipeline of the switch where
>> the backend port resides, then it is sent through the tunnel.
> This is neat indeed because the LS doesn't care about the destination
> IP but it cares about the destination MAC (for the L2 lookup stage)!
>
>> 6. The packet arrives at the node hosting the backend VM, where we
>> perform egress load balancing and store conntrack state.
>> 7. The server responds, and the return traffic is SNATed at the
>> ingress of the switch pipeline.
>>
>> This has already been implemented in code for testing and it is
>> working. Storing conntrack state on the node where the virtual
>> machine resides helps address the first limitation of the
>> use_stateless_nat option.
>>
>> Regarding the second issue: we need to ensure session persistence
>> when the number of backends changes. In general, as I understand it,
>> stateless load balancing in such systems is typically based on
>> consistent hashing. However, consistent hashing alone is not
>> sufficient, since it does not preserve 100% of connections. I see
>> the solution as some kind of additional layer on top of consistent
>> hashing.
>>
>> First, about consistent hashing in OVS (correct me if I'm wrong):
>> currently, out of the two hashing methods in OVS, only `hash`
>> provides consistency, since it is based on rendezvous hashing and
>> relies on the bucket id. For this to work correctly, bucket_ids must
>> be preserved when the number of backends changes. However, OVN
>> recreates the OpenFlow group every time the number of backends
>> changes.
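As a side note, the consistency property that stable bucket ids give
you can be illustrated with a small Python model of rendezvous (HRW)
hashing. This is purely illustrative (not OVS code; the backend names,
flows, and hash construction are made up), but it shows why removing a
bucket while keeping the surviving ids only remaps the removed
backend's flows, whereas recreating the group with renumbered ids also
breaks sessions on a surviving backend:

```python
import hashlib

def weight(flow: str, bucket_id: int) -> int:
    """Deterministic per-(flow, bucket-id) weight, rendezvous-style."""
    h = hashlib.sha256(f"{flow}/{bucket_id}".encode()).digest()
    return int.from_bytes(h[:8], "big")

def select(flow: str, buckets: dict[int, str]) -> str:
    """Return the backend whose bucket id has the highest weight."""
    best_id = max(buckets, key=lambda b: weight(flow, b))
    return buckets[best_id]

flows = [f"10.0.0.{i}:40{i:03}->172.16.0.1:80" for i in range(200)]

initial = {1: "A", 2: "B", 3: "C", 4: "D"}
before = {f: select(f, initial) for f in flows}

# Backend C goes away.  If the surviving bucket ids are preserved
# (e.g. via REMOVE_BUCKET on the existing group), only C's flows move:
kept_ids = {1: "A", 2: "B", 4: "D"}
moved = [f for f in flows if select(f, kept_ids) != before[f]]
assert all(before[f] == "C" for f in moved)

# If instead the group is recreated and D ends up with a new bucket id,
# some of D's existing sessions break as well:
renumbered = {1: "A", 2: "B", 3: "D"}
broken = [f for f in flows
          if before[f] == "D" and select(f, renumbered) != "D"]
print(f"C flows remapped: {len(moved)}, "
      f"D flows broken by renumbering: {len(broken)}")
```

The first assertion holds by construction: with rendezvous hashing, a
flow's winner over a subset of ids is unchanged as long as its original
winner is still in the subset, which is exactly the property that
depends on bucket_ids being preserved.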
>> At the moment, when the number of backends changes, we create a
>> bundle where the first piece is an ADD group message, and the
>> subsequent messages are INSERT_BUCKET/REMOVE_BUCKET messages. If we
>> could rewrite this part to support granular insertion/removal of
>> backends in the group, by using INSERT_BUCKET/REMOVE_BUCKET without
>> recreating the group, we could make backend selection consistent for
>> the hash method. I wasn't able to fully determine whether there are
>> any limitations to this approach, but I did manually test session
>> preservation using ovs-ofctl while inserting/removing buckets. When
>> removing buckets from a group, sessions associated with the other
>> buckets were preserved.
>>
>> At the same time, I understand the downsides of using hash: it is
>> expensive, since we install dp_flows that match the full 5-tuple of
>> the connection, which leads to a large number of datapath flows.
>> Additionally, there is an upcall for every SYN packet. Because of
>> this, I'm not sure how feasible it is, but it might be worth
>> thinking about ways to make dp_hash consistent. I don't yet have a
>> concrete proposal here, but I'd really appreciate any ideas or
>> suggestions in this direction.
>>
>> For the additional layer on top of consistent hashing I've come up
>> with two potential approaches, and I'm not yet sure which one would
>> be better:
>> 1) Using the learn action
>> 2) Hash-based sticky sessions in OVS
>>
> I'd say we can avoid the need for consistent hashing if we add a
> third alternative here:
>
> In step "3" in your idea above, on the GW node, after we selected the
> backend (MAC) somehow (e.g., we could still just use dp-hash) we
> commit that session to conntrack and store the MAC address as
> metadata (e.g. in the ct_label like we do for ecmp-symmetric-reply
> routes).
>
> Subsequent packets on the same session don't actually need to be
> hashed; we'll get the MAC to be used as the destination from the
> conntrack state, so any changes to the set of backends won't be
> problematic.
> We would have to flush conntrack for backends that get removed (we do
> that for regular load balancers too).
>
> I know this might sound counter-intuitive because your proposal was
> to make the load balancing "stateless" on the GW node and I'm
> actually suggesting the GW node processing to rely on conntrack
> (stateful).
>
> But..
>
> You mentioned that one of your goals is:
>
>> One of the key limitations of this solution is that gateway nodes
>> cannot be horizontally scaled, since conntrack state is stored on a
>> single node. Achieving the ability to horizontally scale gateway
>> nodes is actually one of our goals.
>
> With my proposal I think you'll achieve that. We'd have to be careful
> with the case when a DGP moves to a different chassis (HA failover)
> but for that we could combine this solution with using a consistent
> hash on all chassis for the backend selection (something you mention
> you're looking into anyway).
>
> Now, if your goal is to avoid conntrack completely on the gateway
> chassis, my proposal breaks that. Then we could implement a similar
> solution as you suggested with the learn action; that might be
> heavier on the datapath than using conntrack. I don't have numbers to
> back this up but maybe we can find a way to benchmark it.
>
>> With an additional layer, we can account for hash rebuilds and the
>> inaccuracies introduced by consistent hashing. It's not entirely
>> clear how to handle these cases, considering that after some time we
>> need to remove flows when using `learn`, or clean up connections in
>> the hash. We also need to manage connection removal when the number
>> of backends changes.
> Or when backends go away.. which makes it more complex to manage from
> ovn-controller if we use learn flows, I suspect (if we use conntrack
> we have the infra to do that already, we use it for regular load
> balancers that have ct_flush=true).
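To make the "additional layer on top of consistent hashing" idea
concrete, here is a minimal Python sketch of the lookup logic only. It
is a toy model: in reality the sticky layer would be learn flows,
conntrack entries, or an OVS hash table, and all names here are
illustrative. The point is the two stages: an existing session always
wins, and only a miss falls through to the consistent hash.

```python
import hashlib

def hrw_select(flow: str, backends: list[str]) -> str:
    """Consistent (rendezvous) hash: stable choice for a given backend set."""
    def w(b: str) -> int:
        return int.from_bytes(
            hashlib.sha256(f"{flow}/{b}".encode()).digest()[:8], "big")
    return max(backends, key=w)

class StickyLoadBalancer:
    """Sticky layer (modeling learn flows / conntrack) over consistent hashing."""

    def __init__(self, backends: list[str]):
        self.backends = list(backends)
        self.sessions: dict[str, str] = {}   # 5-tuple -> pinned backend

    def pick(self, flow: str) -> str:
        # Stage 1: an existing session always wins, even if the backend
        # set changed underneath it.
        if flow in self.sessions:
            return self.sessions[flow]
        # Stage 2: miss -> consistent hash, then pin the result.
        backend = hrw_select(flow, self.backends)
        self.sessions[flow] = backend
        return backend

    def remove_backend(self, backend: str) -> None:
        # Analogous to flushing conntrack / learn flows for a removed
        # backend: drop only its pinned sessions.
        self.backends.remove(backend)
        self.sessions = {f: b for f, b in self.sessions.items()
                         if b != backend}

lb = StickyLoadBalancer(["A", "B", "C"])
flow = "10.0.0.1:34567->172.16.0.1:80"
first = lb.pick(flow)
# Remove some *other* backend: the pinned session survives the change.
lb.remove_backend(next(b for b in lb.backends if b != first))
assert lb.pick(flow) == first
```

This also models the two-gateway failure mode discussed below: a second
gateway would start with an empty `sessions` table, so a flow landing
there after the backend set changed can fall through to stage 2 and hash
to a different backend.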
>
>> By using such a two-stage approach, ideally, we would always
>> preserve the session for a given connection, losing connections only
>> in certain cases. For example, suppose we have two GW nodes:
>> * The first SYN packet arrives at the first GW; we record this
>>   connection in our additional layer (using learn or the OVS hash)
>>   and continue routing all packets for this connection based on it.
>> * If a packet from this same connection arrives at the second GW,
>>   the additional layer there (learn or the OVS hash) will not have
>>   the session. If the number of backends has changed since the
>>   backend was selected for this connection on the first GW, there is
>>   a chance it will not hit the same backend and the session will be
>>   lost.
>>
>> The downsides I see for the first solution are:
>> 1. More complex handling of cleanup for old connections and removal
>> of hashes for expired backends.
>>
>> The downsides of the learn action are:
>> 1. High load on OVS at a high connection rate.
>> 2. If I understand correctly, this also results in a large number of
>> dp_flows that track each individual 5-tuple hash.
>>
>> I would be glad to hear your thoughts on a possible implementation
>> of dp_hash consistency, as well as feedback on the overall
>> architecture. These are all the ideas I've been able to develop so
>> far. I would greatly appreciate any criticism, suggestions, or
>> alternative approaches. It would also be interesting to know whether
>> anyone else is interested in this new optional mode for load
>> balancers =)
>>
>> [1] https://github.com/Sashhkaa/ovn/commits/deferred-lb-dnat/ -
>> Here's the code, just in case
>>
> Hope my thoughts above make some sense and that I don't just create
> confusion. :)
>
> Regards,
> Dumitru

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
