Hi Felix,

Thanks a lot for your message.
Felix Huettner, Sep 29, 2023 at 14:35:

> I can get that when running 10k ovn-controllers the benefits of
> optimizing cpu and memory load are quite significant. However i am
> unsure about reducing the footprint of ovn-northd.
> When running so many nodes i would have assumed that having an
> additional (or maybe two) dedicated machines for ovn-northd would
> be completely acceptable, as long as it can still actually do what
> it should in a reasonable timeframe.
> Would the goal for ovn-northd be more like "Reduce the full/incremental
> recompute time" then?

The main goal of this thread is to get a consensus on the actual issues
that prevent scaling at the moment. We can discuss solutions in the
other thread.

> > * Allow support for alternative datapath implementations.
>
> Does this mean ovs datapaths (e.g. dpdk) or something different?

See the other thread.

> > Southbound Design
> > =================

...

> Note that also ovn-controller consumes the "state" of other chassis to
> e.g. build the tunnels to other chassis. To visualize my understanding:
>
> +----------------+---------------+------------+
> |                | configuration | state      |
> +----------------+---------------+------------+
> | ovn-northd     | write-only    | read-only  |
> +----------------+---------------+------------+
> | ovn-controller | read-only     | read-write |
> +----------------+---------------+------------+
> | some cms       | no access?    | read-only  |
> +----------------+---------------+------------+

I think ovn-controller only consumes the logical flows. The chassis and
port bindings tables are used by northd to update these logical flows.

> > Centralized decisions
> > =====================
> >
> > Every chassis needs to be "aware" of all other chassis in the cluster.
>
> I think we need to accept this as fundamental truth. Independent if you
> look at centralized designs like ovn or the neutron-l2 implementation
> or if you look at decentralized designs like bgp or spanning tree.
> In all cases if we need some kind of organized communication we need to
> know all relevant peers.
> Designs might diverge if you need to be "aware" of all peers or just
> some of them, but that is just a tradeoff between data size and options
> you have to forward data.

> > This requirement mainly comes from overlay networks that are implemented
> > over a full-mesh of point-to-point GENEVE tunnels (or VXLAN with some
> > limitations). It is not a scaling issue by itself, but it implies
> > a centralized decision which in turn puts pressure on the central node
> > at scale.
>
> +1. On the other hand it removes signaling needs between the nodes (like
> you would have with bgp).

Exactly, but was the signaling between the nodes ever an issue?

> > Due to ovsdb monitoring and caching, any change in the southbound DB
> > (either by northd or by any of the chassis controllers) is replicated on
> > every chassis. The monitor_all option is often enabled on large clusters
> > to avoid the conditional monitoring CPU cost on the central node.
>
> This is, i guess, something that should be possible to fix. We have also
> enabled this setting as it gave us stability improvements and we do not
> yet see performance issues with it

So you have enabled monitor_all=true as well? Or did you test at scale
with monitor_all=false?

What I am saying is that without monitor_all=true, the southbound
ovsdb-server must evaluate, for every update, which rows to send to
which client. Since the server is single threaded, this becomes a
bottleneck at scale. I know that there were some significant
improvements made recently, but they will only push the limit further.
Unfortunately, I don't have hard data to prove my point yet.

> > This leads to high memory usage on all chassis, control plane traffic
> > and possible disruptions in the ovs-vswitchd datapath flow cache.
> > Unfortunately, I don't have any hard data to back this claim.
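For context, monitor_all is a per-chassis ovn-controller knob read from
the local Open_vSwitch table (a sketch, assuming a standard OVN install
where the option is named ovn-monitor-all as in ovn-controller(8)):

```shell
# Enable monitor_all on a chassis: ovn-controller then asks the
# southbound ovsdb-server for full replication instead of
# conditional (per-chassis) monitoring.
ovs-vsctl set open_vswitch . external_ids:ovn-monitor-all=true

# Inspect the current value on this chassis.
ovs-vsctl get open_vswitch . external_ids:ovn-monitor-all
```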
> > This is mainly coming from discussions I had with neutron contributors
> > and from brainstorming sessions with colleagues.
>
> Could you maybe elaborate on the datapath flow cache issue, as it sounds
> like it might affect actual live traffic and i am not aware of details
> there.

I may have had a wrong understanding of the OVS mechanisms here. I was
under the impression that any update of the openflow rules would
invalidate all datapath flows. It is far more subtle than that [1]. So
unless there is an actual change in the packet pipeline, live traffic
should not be affected.

[1] https://developers.redhat.com/articles/2022/10/19/open-vswitch-revalidator-process-explained

> The memory usage and the traffic would be fixed by not having to rely on
> monitor_all, right?

The memory usage would be reduced, but I don't know by how much. One of
the main consumers is the logical flows table, which is required
everywhere. Unless there is a way to sync only the portion of that table
relevant to each chassis, disabling monitor_all would only save syncing
the rows that a given ovn-controller does not need: chassis, port
bindings, etc.

> > Dynamic mac learning
> > ====================
> >
> > Logical switch ports on a given chassis are all connected to the same
> > OVS bridge, in the same VLAN. This prevents using local mac address
> > learning and shifts the responsibility to a centralized ovn-northd to
> > create all the required logical flows to properly segment the network.
> >
> > When using mac_address=unknown ports, centralized mac learning is
> > enabled and when a new address is seen entering a port, OVS sends it to
> > the local controller which updates the FDB table and recomputes flow
> > rules accordingly. With logical switches spanning across a large number
> > of chassis, this centralized mac address learning and aging can have an
> > impact on control plane and dataplane performance.
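As a side note, the two learning paths can be inspected directly, which
makes the local/centralized contrast visible (a sketch assuming br-int
as the integration bridge; exact table contents depend on the OVN
version):

```shell
# MACs learned locally by ovs-vswitchd on a bridge (classic L2
# learning, aged out locally, never leaves the node).
ovs-appctl fdb/show br-int

# MACs learned centrally for mac_address=unknown logical ports: they
# land in the southbound FDB table and are replicated to all chassis
# that monitor it.
ovn-sbctl list FDB
```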
> Might one solution for that be to generate such lookup flows without the
> help from northd like it is already done for some other tables? Northd
> could precreate the flow that is then templated for each entry in
> MAC_Bindings relevant to the respective router.

Let's discuss solutions in the other thread.

What do you think?

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss