Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals
Hi everyone, just want to add my experience below.

On Sun, Oct 01, 2023 at 03:49:31PM -0700, Han Zhou wrote:
> On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry wrote:
> >
> > Hi Han,
> >
> > Please see my comments/questions inline.
> >
> > Han Zhou, Sep 30, 2023 at 21:59:
> > > > Distributed mac learning
> > > >
> > > > Use one OVS bridge per logical switch with mac learning enabled. Only
> > > > create the bridge if the logical switch has a port bound to the local
> > > > chassis.
> > > >
> > > > Pros:
> > > >
> > > > - Minimal openflow rules required in each bridge (ACLs and NAT mostly).
> > > > - No central mac binding table required.
> > >
> > > Firstly to clarify the terminology of "mac binding" to avoid confusion,
> > > the mac_binding table currently in SB DB has nothing to do with L2 MAC
> > > learning. It is actually the ARP/Neighbor table of distributed logical
> > > routers. We should probably call it IP_MAC_binding table, or just
> > > Neighbor table.
> >
> > Yes sorry about the confusion. I actually meant the FDB table.
> >
> > > Here what you mean is actually L2 MAC learning, which today is
> > > implemented by the FDB table in SB DB, and it is only for uncommon use
> > > cases when the NB doesn't have the knowledge of a MAC address of a VIF.
> >
> > This is not that uncommon in telco use cases where VNFs can send packets
> > from mac addresses unknown to OVN.
>
> Understood, but VNFs contribute a very small portion of the workloads,
> right? Maybe I should rephrase that: it is uncommon to have "unknown"
> addresses for the majority of ports in a large scale cloud. Is this
> understanding correct?
I can only share numbers for our use case: with ~650 chassis we have the
following distribution of "unknown" in the `addresses` field of
Logical_Switch_Port:

* 23000 with a mac address + ip and without "unknown"
* 250 with a mac address + ip and with "unknown"
* 30 with just "unknown"

The use case is a generic public cloud and we do not have any telco
related things.

> > > The purpose of this proposal is clear - to avoid using a central
> > > table in DB for L2 information but instead using L2 MAC learning to
> > > populate such information on chassis, which is a reasonable
> > > alternative with pros and cons.
> > > However, I don't think it is necessary to use separate OVS bridges
> > > for this purpose. L2 MAC learning can be easily implemented in the
> > > br-int bridge with OVS flows, which is much simpler than managing a
> > > dynamic number of OVS bridges just for the purpose of using the
> > > builtin OVS mac-learning.
> >
> > I agree that this could also be implemented with VLAN tags on the
> > appropriate ports. But since OVS does not support trunk ports, it may
> > require complicated OF pipelines. My intent with this idea was twofold:
> >
> > 1) Avoid a central point of failure for mac learning/aging.
> > 2) Simplify the OF pipeline by making all FDB operations dynamic.
>
> IMHO, the L2 pipeline is not really complex. It is probably the simplest
> part (compared with other features for L3, NAT, ACL, LB, etc.).
> Adding dynamic learning to this part probably makes it *a little* more
> complex, but should still be straightforward. We don't need any VLAN tag
> because the incoming packet has the geneve VNI in the metadata. We just
> need a flow that resubmits to look up a MAC-tunnelSrc mapping table, and
> inject a new flow (with related tunnel endpoint information) if the src
> MAC is not found, with the help of the "learn" action. The entries are
> per-logical_switch (VNI). This would serve your purpose of avoiding a
> central DB for L2.
> At least this looks much simpler to me than managing a dynamic number of
> OVS bridges and the patch pairs between them.
>
> > > Now back to the distributed MAC learning idea itself. Essentially
> > > for two VMs/pods to communicate on L2, say, VM1@Chassis1 needs to
> > > send a packet to VM2@Chassis2, assuming VM1 already has VM2's MAC
> > > address (we will discuss this later), Chassis1 needs to know that
> > > VM2's MAC is located on Chassis2.
> > >
> > > In OVN today this information is conveyed through:
> > >
> > > - MAC and LSP mapping (NB -> northd -> SB -> Chassis)
> > > - LSP and Chassis mapping (Chassis -> SB -> Chassis)
> > >
> > > In your proposal:
> > >
> > > - MAC and Chassis mapping (can be learned through initial L2
> > >   broadcast/flood)
> > >
> > > This indeed would avoid the control plane cost through the
> > > centralized components (for this L2 binding part). Given that today's
> > > SB OVSDB is a bottleneck, this idea may sound attractive. But please
> > > also take into consideration the below improvement that could
> > > mitigate the OVN central scale issue:
> > >
> > > - For MAC and LSP mapping, northd is now capable of incrementally
> > >   processing VIF related L2/L3 changes, so the cost of
> > >   NB -> northd -> SB is very small. For SB -> Chassis, a more
> > >   scalable DB deployment,
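For readers unfamiliar with the "learn" action mentioned above, the
approach could be sketched with flows along these lines. This is an
untested, simplified illustration: the port name geneve0 and the FDB
table number 10 are made up, flow-based tunneling is assumed, and a real
pipeline would integrate with many more stages and also learn from local
VIF traffic:

```
# On tunnel ingress: learn a per-VNI MAC -> tunnel-endpoint mapping,
# then continue the pipeline. The learned flow matches the same tun_id
# plus the learned MAC as eth_dst, sets tun_dst to the chassis the MAC
# was seen from, and outputs to the tunnel port.
table=0, priority=100, in_port=geneve0,
  actions=learn(table=10, hard_timeout=300,
                NXM_NX_TUN_ID[],
                NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],
                load:NXM_NX_TUN_IPV4_SRC[]->NXM_NX_TUN_IPV4_DST[],
                output:NXM_OF_IN_PORT[]),
          resubmit(,10)

# Unknown destination MAC: flood (first packets and BUM traffic only).
table=10, priority=0, actions=FLOOD
```

Aging would come for free from the timeout on the learned flows (an
idle_timeout is probably more appropriate than the hard_timeout shown
here).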
Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals
On Sat, Sep 30, 2023 at 3:50 PM Robin Jarry wrote:

[ snip ]

> > > Also, will such behavior be compatible with HW-offload to
> > > smartnics/DPUs?
> >
> > I am also a bit concerned about this, what would be the typical number
> > of bridges supported by hardware?
>
> As far as I understand, only the datapath flows are offloaded to
> hardware. The OF pipeline is only parsed when there is an upcall for the
> first packet. Once resolved, the datapath flow is reused. OVS bridges
> are only logical constructs, they are neither reflected in the datapath
> nor in hardware.

True, but you never know what odd bugs might pop out when doing things
like this, hence my concern :)

> > >>> Use multicast for overlay networks
> > >>> ==================================
> > >>>
> > >>> [snip]
> > >>>
> > >>> - 24bit VNI allows for more than 16 million logical switches. No need
> > >>>   for extended GENEVE tunnel options.
> > >>
> > >> Note that using vxlan at the moment significantly reduces the ovn
> > >> featureset. This is because the geneve header options are currently
> > >> used for data that would not fit into the vxlan vni.
> > >>
> > >> From ovn-architecture.7.xml:
> > >> ```
> > >> The maximum number of networks is reduced to 4096.
> > >> The maximum number of ports per network is reduced to 2048.
> > >> ACLs matching against logical ingress port identifiers are not
> > >> supported.
> > >> OVN interconnection feature is not supported.
> > >> ```
> > >
> > > In my understanding, the main reason why GENEVE replaced VXLAN is
> > > because Openstack uses full mesh point to point tunnels and that the
> > > sender needs to know behind which chassis any mac address is to send
> > > it into the correct tunnel. GENEVE allowed reducing the lookup time
> > > both on the sender and receiver thanks to ingress/egress port
> > > metadata.
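The tunnel-ID arithmetic behind the limits quoted above from
ovn-architecture.7.xml can be checked quickly. The 12/11-bit split below
is inferred from the quoted 4096/2048 numbers, not taken from the spec
text itself:

```python
# Back-of-the-envelope check of the tunnel ID budget discussed above.
# Geneve carries a 24-bit VNI for the logical datapath plus TLV options
# for logical ingress/egress port keys; VXLAN has only the 24-bit VNI,
# which OVN must split between datapath and port identifiers.
GENEVE_VNI_BITS = 24
VXLAN_DATAPATH_BITS = 12   # inferred from "4096 networks"
VXLAN_PORT_BITS = 11       # inferred from "2048 ports per network"

assert 2 ** GENEVE_VNI_BITS == 16_777_216   # "more than 16 million"
assert 2 ** VXLAN_DATAPATH_BITS == 4096     # max networks with vxlan
assert 2 ** VXLAN_PORT_BITS == 2048         # max ports per network
assert VXLAN_DATAPATH_BITS + VXLAN_PORT_BITS <= GENEVE_VNI_BITS
```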
> > >
> > > https://blog.russellbryant.net/2017/05/30/ovn-geneve-vs-vxlan-does-it-matter/
> > > https://dani.foroselectronica.es/ovn-geneve-encapsulation-541/
> > >
> > > If VXLAN + multicast and address learning was used, the "correct"
> > > tunnel would be established ad-hoc and both sender and receiver
> > > lookups would only be a simple mac forwarding with learning. The
> > > ingress pipeline would probably cost a little more.
> > >
> > > Maybe multicast + address learning could be implemented for GENEVE
> > > as well. But it would not be interoperable with other VTEPs.
> >
> > While it is true that it takes time before switch hardware picks up
> > support for emerging protocols, I do not think it is a valid argument
> > for limiting the development of OVN. Most hardware offload capable
> > NICs already have GENEVE support, and if you survey recent or upcoming
> > releases from top of rack switch vendors you will also find that they
> > have added support for using GENEVE for hardware VTEPs. The fact that
> > SDNs with a large customer footprint (such as NSX and OVN) make use of
> > GENEVE is most likely a deciding factor for their adoption, and I see
> > no reason why we should stop defining the edge of development in this
> > space.
>
> GENEVE could be perfectly suitable with a multicast based control plane
> to establish ad-hoc tunnels without any centralized involvement.
>
> I was only proposing VXLAN since this multicast group system was part of
> the original RFC (supported in Linux since 3.12).
>
> > >>> - Limited and scoped "flooding" with IGMP/MLD snooping enabled in
> > >>>   top-of-rack switches. Multicast is only used for BUM traffic.
> > >>> - Only one VXLAN output port per implemented logical switch on a
> > >>>   given chassis.
> > >>
> > >> Would this actually work with one VXLAN output port?
> > >> Would you not need one port per target node to send unicast traffic
> > >> (as you otherwise flood all packets to all participating nodes)?
> > >
> > > You would need one VXLAN output port per implemented logical switch
> > > on a given chassis. The port would have a VNI (unique per logical
> > > switch) and an associated multicast IP address. Any chassis that
> > > implements this logical switch would subscribe to that multicast
> > > group. The flooding would be limited to first packets and
> > > broadcast/multicast traffic (ARP requests, mostly). Once the
> > > receiver node replies, all communication will happen with unicast.
> > >
> > > https://networkdirection.net/articles/routingandswitching/vxlanoverview/vxlanaddresslearning/#BUM_Traffic
> > >
> > >>> Cons:
> > >>>
> > >>> - OVS does not support VXLAN address learning yet.
> > >>> - The number of usable multicast groups in a fabric network may be
> > >>>   limited?
> > >>> - How to manage seamless upgrades and interoperability with older
> > >>>   OVN versions?
> > >>
> > >> - This pushes all logic related to chassis management to the
> > >>   underlying networking fabric. It thereby places additional
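The per-logical-switch multicast setup described above is the classic
RFC 7348 deployment model that the Linux kernel VXLAN driver already
supports. A hypothetical manual setup for one logical switch might look
like this (device names, VNI, and group address are made up for
illustration; OVS/OVN would of course have to drive this itself):

```
# One VXLAN device per implemented logical switch: VNI 100, with BUM
# traffic sent to a multicast group that every chassis implementing the
# switch joins; unicast endpoints are then learned from return traffic.
ip link add vxlan100 type vxlan id 100 group 239.1.1.100 dev eth0 dstport 4789
ip link set vxlan100 up

# Learned remote endpoints appear in the device FDB:
bridge fdb show dev vxlan100
```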
Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues
Han Zhou, Oct 01, 2023 at 21:30:
> Please note that tunnels are needed not only between nodes related to
> same logical switches, but also when they are related to different
> logical switches connected by logical routers (even multiple LR+LS hops
> away).

Yep.

> To clarify a little more, openstack deployment can have different
> logical topologies. So to evaluate the impact of monitor_all settings
> there should be different test cases to capture different types of
> deployment, e.g. full-mesh topology (monitor_all=true is better) v.s.
> "small islands" topology (monitor_all=false is reasonable).

This is one thing to note for the recent ovn-heater work that adds
openstack test cases.

> FDB and MAC_binding tables are used by ovn-controllers. They are
> essentially the central storage for MAC tables of the distributed
> logical switches (FDB) and ARP/Neighbour tables for distributed logical
> routers (MAC_binding). A record can be populated by one chassis and
> consumed by many other chassis.
>
> monitor_all should work the same way for these tables: if monitor_all =
> false, only rows related to "local datapaths" should be downloaded to
> the chassis. However, for FDB table, the condition is not set for now
> (which may have been a miss in the initial implementation). Perhaps
> this is not noticed because MAC learning is not a very widely used
> feature and no scale impact noticed, but I just proposed a patch to
> enable the conditional monitoring:
> https://patchwork.ozlabs.org/project/ovn/patch/20231001192658.1012806-1-hz...@ovn.org/

Ok thanks!
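Conceptually, the conditional monitoring discussed above amounts to the
chassis sending an OVSDB monitor condition for the FDB table instead of
monitoring every row. A simplified sketch, assuming the FDB table's
dp_key column (the datapath tunnel key) and the [column, function,
value] clause shape of the OVSDB monitor_cond RPC; this only builds the
condition and does not talk to a real database:

```python
def fdb_monitor_condition(local_dp_keys):
    """Build the monitor condition a chassis would request for the SB
    FDB table: only rows whose dp_key belongs to a locally relevant
    datapath. An empty clause list means "monitor everything" (the
    monitor_all=true behavior), so with no local datapaths we return
    [False] to match nothing instead."""
    if not local_dp_keys:
        return [False]
    # A row is sent to the client if any clause matches.
    return [["dp_key", "==", key] for key in local_dp_keys]

# A chassis implementing datapaths with tunnel keys 11 and 42:
print(fdb_monitor_condition([11, 42]))
```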