Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues
Hi Robin,

Thanks a bunch for putting these two emails together. I've read through them
and the replies. I think there's one major issue: a lack of data.

I think the four bullet points you listed below are admirable goals. The
problem is that I think we're putting the cart before the horse with both the
issues and proposals. In other words, before being able to properly evaluate
these emails, we need to see a scenario that:

1) Has clear goals for what scalability metrics are desired.
2) Shows evidence that these scalability goals are not being met.
3) Shows evidence that one or more of the issues listed in this email are
   the cause of the scalability issues in the scenario.
4) Shows evidence that the proposed changes would fix the scalability issues
   in the scenario.

I listed them in this order because without a failing scenario, we can't
claim that scalability is poor. If we do have a failing scenario, it's
possible that the problem and solution are much simpler than any of the
issues or proposals that have been brought up here. It's also possible that
only a subset of the issues listed in this email are contributing to the
failure. Even if the issues identified here are directly causing the
scenario to fail, there may still be simpler solutions than what has been
proposed. And finally, it's possible that the proposed solutions don't
actually result in the expected scale increase.

I want to make sure my tone is coming across clearly here. I don't think the
current OVN architecture is perfect, and I don't want to be dismissive of
the issues you've raised. If there are changes we can make to simplify OVN
and scale better at the same time, I'm all for it. The problem is that, as
you pointed out in your proposal email, most of these proposals result in
difficulties for upgrades/downgrades, as well as code maintenance.
Therefore, if we are going to do any of these, we first need to be certain
that we aren't scaling as well as we would like, and that there are no
simpler paths to reach our scalability targets.

On 9/28/23 11:18, Robin Jarry wrote:

Hello OVN community,

I'm glad the subject of this message has caught your attention :-)

I would like to start a discussion about how we could improve OVN on the
following topics:

* Reduce the memory and CPU footprint of ovn-controller, ovn-northd.
* Support scaling of L2 connectivity across larger clusters.
* Simplify CMS interoperability.
* Allow support for alternative datapath implementations.

This first email will focus on the current issues that (in my view) are
preventing OVN from scaling L2 networks on larger clusters. I will send
another message with some change proposals to remove or fix these issues.

Disclaimer: I am fairly new to this project and my perception and
understanding may be incorrect in some aspects. Please forgive me in advance
if I use the wrong terms and/or make invalid statements. My intent is only
to make things better and not to put the blame on anyone for the current
design choices.

Southbound Design
=================

In the current architecture, both databases contain a mix of state and
configuration. While this does not seem to cause any scaling issues for the
northbound DB, it can become a bottleneck for the southbound DB with large
numbers of chassis and logical network constructs.

The southbound database contains a mix of configuration (logical flows
transformed from the logical network topology) and state (chassis, port
bindings, mac bindings, FDB entries, etc.). The "configuration" part is
consumed by ovn-controller to implement the network on every chassis, and
the "state" part is consumed by ovn-northd to update the northbound "state"
entries and to update logical flows. Some CMSs [1] also depend on the
southbound "state" in order to function properly.
[1] https://opendev.org/openstack/neutron/src/tag/22.0.0/neutron/agent/ovn/metadata/ovsdb.py#L39-L40

Centralized decisions
=====================

Every chassis needs to be "aware" of all other chassis in the cluster. This
requirement mainly comes from overlay networks that are implemented over a
full mesh of point-to-point GENEVE tunnels (or VXLAN, with some
limitations). It is not a scaling issue by itself, but it implies
centralized decision making, which in turn puts pressure on the central
node at scale.

Due to ovsdb monitoring and caching, any change in the southbound DB
(whether by northd or by any of the chassis controllers) is replicated on
every chassis. The monitor_all option is often enabled on large clusters to
avoid the conditional monitoring CPU cost on the central node. This leads
to high memory usage on all chassis, increased control plane traffic, and
possible disruptions of the ovs-vswitchd datapath flow cache.

Unfortunately, I don't have any hard data to back this claim. This is
mainly coming from discussions I had with Neutron contributors and from
brainstorming sessions with colleagues. I hope that the current work on OVN
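[Editor's note: for readers unfamiliar with the monitor_all knob discussed
above, it is a per-chassis setting that ovn-controller reads from the local
Open_vSwitch database; a minimal configuration fragment, with the key name
as documented in ovn-controller(8):]

```
# Replicate the entire southbound DB contents on this chassis instead of
# using conditional monitoring. This trades central-node CPU (no condition
# evaluation) for per-chassis memory and control plane traffic.
ovs-vsctl set open_vswitch . external_ids:ovn-monitor-all=true
```

Setting it back to false re-enables conditional monitoring, shifting the
cost back to the central ovsdb-server.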
Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals
Frode Nordahl, Oct 02, 2023 at 10:05:
> > I must admit I don't have enough field experience operating large
> > network fabrics to state what issues multicast can cause with these.
> > This is why I raised this in the cons list :)
> >
> > What specific issues did you have in mind?
>
> It has been a while since I was overseeing the operations of large metro
> networks, but I have vivid memories of multicast routing being a
> recurring issue. A fabric supporting 10k computes would most likely not
> be one large L2; there would be L3 routing involved, and as a consequence
> your proposal imposes configuration and scale requirements on the fabric.
>
> Datapoints that suggest other people see this as an issue too can be
> found in the fact that popular top-of-rack vendors have chosen control
> plane based MAC learning for their EVPN implementations (RFC 7432).
> There are also multiple papers discussing the scaling issues of
> multicast.

Thanks, I will try to educate myself better about this :)
Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals
Hi all,

Felix Huettner, Oct 02, 2023 at 09:35:
> Hi everyone,
>
> just want to add my experience below
>
> On Sun, Oct 01, 2023 at 03:49:31PM -0700, Han Zhou wrote:
> > On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry wrote:
> > >
> > > Hi Han,
> > >
> > > Please see my comments/questions inline.
> > >
> > > Han Zhou, Sep 30, 2023 at 21:59:
> > > > > Distributed mac learning
> > > > >
> > > > > Use one OVS bridge per logical switch with mac learning enabled.
> > > > > Only create the bridge if the logical switch has a port bound to
> > > > > the local chassis.
> > > > >
> > > > > Pros:
> > > > >
> > > > > - Minimal openflow rules required in each bridge (ACLs and NAT
> > > > >   mostly).
> > > > > - No central mac binding table required.
> > > >
> > > > Firstly, to clarify the terminology of "mac binding" to avoid
> > > > confusion: the mac_binding table currently in the SB DB has
> > > > nothing to do with L2 MAC learning. It is actually the
> > > > ARP/Neighbor table of distributed logical routers. We should
> > > > probably call it IP_MAC_binding table, or just Neighbor table.
> > >
> > > Yes, sorry about the confusion. I actually meant the FDB table.
> > >
> > > > Here what you mean is actually L2 MAC learning, which today is
> > > > implemented by the FDB table in the SB DB, and it is only for
> > > > uncommon use cases when the NB doesn't have the knowledge of
> > > > a MAC address of a VIF.
> > >
> > > This is not that uncommon in telco use cases where VNFs can send
> > > packets from mac addresses unknown to OVN.
> >
> > Understood, but VNFs contribute a very small portion of the
> > workloads, right? Maybe I should rephrase that: it is uncommon to
> > have "unknown" addresses for the majority of ports in a large scale
> > cloud. Is this understanding correct?
> I can only share numbers for our usecase. With ~650 chassis we have the
> following distribution of "unknown" in the `addresses` field of
> Logical_Switch_Port:
>
> * 23000 with a mac address + ip and without "unknown"
> * 250 with a mac address + ip and with "unknown"
> * 30 with just "unknown"
>
> The usecase is a generic public cloud and we do not have any telco
> related things.

I don't have any numbers from telco deployments at hand but I will poke
around.

> > > > The purpose of this proposal is clear - to avoid using a central
> > > > table in the DB for L2 information, but instead using L2 MAC
> > > > learning to populate such information on chassis, which is
> > > > a reasonable alternative with pros and cons.
> > > > However, I don't think it is necessary to use separate OVS
> > > > bridges for this purpose. L2 MAC learning can be easily
> > > > implemented in the br-int bridge with OVS flows, which is much
> > > > simpler than managing a dynamic number of OVS bridges just for
> > > > the purpose of using the builtin OVS mac-learning.
> > >
> > > I agree that this could also be implemented with VLAN tags on the
> > > appropriate ports. But since OVS does not support trunk ports, it
> > > may require complicated OF pipelines. My intent with this idea was
> > > twofold:
> > >
> > > 1) Avoid a central point of failure for mac learning/aging.
> > > 2) Simplify the OF pipeline by making all FDB operations dynamic.
> >
> > IMHO, the L2 pipeline is not really complex. It is probably the
> > simplest part (compared with other features for L3, NAT, ACL, LB,
> > etc.). Adding dynamic learning to this part probably makes it *a
> > little* more complex, but should still be straightforward. We don't
> > need any VLAN tag because the incoming packet has the geneve VNI in
> > the metadata.
> > We just need a flow that resubmits to look up a MAC-tunnelSrc
> > mapping table, and injects a new flow (with the related tunnel
> > endpoint information) if the src MAC is not found, with the help of
> > the "learn" action. The entries are per-logical_switch (VNI). This
> > would serve your purpose of avoiding a central DB for L2. At least
> > this looks much simpler to me than managing a dynamic number of OVS
> > bridges and the patch pairs between them.

Would that also work for non-GENEVE networks (localnet), where there is
no VNI?

> > > > Now back to the distributed MAC learning idea itself.
> > > > Essentially, for two VMs/pods to communicate on L2, say,
> > > > VM1@Chassis1 needs to send a packet to VM2@Chassis2, assuming
> > > > VM1 already has VM2's MAC address (we will discuss this later),
> > > > Chassis1 needs to know that VM2's MAC is located on Chassis2.
> > > >
> > > > In OVN today this information is conveyed through:
> > > >
> > > > - MAC and LSP mapping (NB -> northd -> SB -> Chassis)
> > > > - LSP and Chassis mapping (Chassis -> SB -> Chassis)
> > > >
> > > > In your proposal:
> > > >
> > > > - MAC and Chassis mapping (can be learned through initial L2
> > > >   broadcast/flood)
> > > >
> > > > This indeed would avoid the control plane cost through the
> > > > centralized components (for this L2 binding part)
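[Editor's note: to make the "learn"-based approach discussed in this email
concrete, here is a hypothetical OpenFlow fragment, not taken from any OVN
release and untested. Table numbers are arbitrary; field names are from
ovs-fields(7) and the learn action syntax from ovs-actions(7):]

```
# Sketch of a per-VNI FDB in br-int using the OVS "learn" action.
# For every packet received over a tunnel, install a flow in table 21
# matching (tun_id, eth_dst == this packet's eth_src) whose action sets
# the output tunnel destination to this packet's tunnel source.
table=20, priority=100,
  actions=learn(table=21, hard_timeout=300,
                NXM_NX_TUN_ID[],
                NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],
                load:NXM_NX_TUN_IPV4_SRC[]->NXM_NX_TUN_IPV4_DST[]),
          resubmit(,21)

# Destinations not yet learned fall back to flooding the logical switch
# (placeholder action, details depend on the pipeline).
table=21, priority=0, actions=<flood to all tunnels for this tun_id>
```

Aging would fall out of the hard_timeout/idle_timeout on the learned flows,
so no central FDB table or aging logic is needed. As noted above, localnet
networks carry no VNI, so they would need a different learning key (e.g.
the VLAN id).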