Hi Robin and everyone else,

On Thu, Sep 28, 2023 at 05:18:19PM +0200, Robin Jarry via discuss wrote:
> Hello OVN community,
>
> I'm glad the subject of this message has caught your attention :-)
>
> I would like to start a discussion about how we could improve OVN on the
> following topics:
>
> * Reduce the memory and CPU footprint of ovn-controller, ovn-northd.

I get that when running 10k ovn-controllers, the benefits of
optimizing CPU and memory load are quite significant. However, I am
unsure about reducing the footprint of ovn-northd.
When running that many nodes, I would have assumed that having one
additional (or maybe two) dedicated machine(s) for ovn-northd would be
completely acceptable, as long as it can still do what it should in a
reasonable timeframe.
Would the goal for ovn-northd then be more like "reduce the
full/incremental recompute time"?

> * Support scaling of L2 connectivity across larger clusters.
> * Simplify CMS interoperability.
> * Allow support for alternative datapath implementations.

Does this mean OVS datapaths (e.g. DPDK) or something different?

>
> This first email will focus on the current issues that (in my view) are
> preventing OVN from scaling L2 networks on larger clusters. I will send
> another message with some change proposals to remove or fix these
> issues.
>
> Disclaimer:
>
> I am fairly new to this project and my perception and understanding may
> be incorrect in some aspects. Please forgive me in advance if I use the
> wrong terms and/or make invalid statements. My intent is only to make
> things better and not to put the blame on anyone for the current design
> choices.

Please apply the same disclaimer to my comments as well.

>
> Southbound Design
> =================
>
> In the current architecture, both databases contain a mix of state and
> configuration. While this does not seem to cause any scaling issues for
> the northbound DB, it can become a bottleneck for the southbound with
> large numbers of chassis and logical network constructs.
>
> The southbound database contains a mix of configuration (logical flows
> transformed from the logical network topology) and state (chassis, port
> bindings, mac bindings, FDB entries, etc.).
>
> The "configuration" part is consumed by ovn-controller to implement the
> network on every chassis and the "state" part is consumed by ovn-northd
> to update the northbound "state" entries and to update logical flows.
> Some CMS's [1] also depend on the southbound "state" in order to
> function properly.

Note that ovn-controller also consumes the "state" of other chassis,
e.g. to build the tunnels towards them. To visualize my understanding:
+----------------+---------------+------------+
|                | configuration |   state    |
+----------------+---------------+------------+
|   ovn-northd   |  write-only   | read-only  |
+----------------+---------------+------------+
| ovn-controller |   read-only   | read-write |
+----------------+---------------+------------+
|    some cms    |  no access?   | read-only  |
+----------------+---------------+------------+
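For illustration, a rough and untested sketch (using the python-ovs
IDL bindings; the Chassis/Encap table and column names are from the
southbound schema, the socket and schema paths are deployment
specific) of how a client consumes just that chassis "state" to derive
the tunnel endpoints:

    # Read the Chassis/Encap "state" from the southbound DB to list
    # the tunnel endpoints that every node needs to know about.
    import ovs.db.idl
    import ovs.poller

    SB_REMOTE = "unix:/var/run/ovn/ovnsb_db.sock"
    SB_SCHEMA = "/usr/share/ovn/ovn-sb.ovsschema"

    helper = ovs.db.idl.SchemaHelper(location=SB_SCHEMA)
    helper.register_columns("Chassis", ["name", "encaps"])
    helper.register_columns("Encap", ["type", "ip"])

    idl = ovs.db.idl.Idl(SB_REMOTE, helper)
    seqno = idl.change_seqno
    while idl.change_seqno == seqno:  # wait for the initial contents
        idl.run()
        if idl.change_seqno != seqno:
            break
        poller = ovs.poller.Poller()
        idl.wait(poller)
        poller.block()

    for chassis in idl.tables["Chassis"].rows.values():
        for encap in chassis.encaps:
            print(chassis.name, encap.type, encap.ip)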

>
> [1] 
> https://opendev.org/openstack/neutron/src/tag/22.0.0/neutron/agent/ovn/metadata/ovsdb.py#L39-L40
>
> Centralized decisions
> =====================
>
> Every chassis needs to be "aware" of all other chassis in the cluster.

I think we need to accept this as a fundamental truth, independent of
whether you look at centralized designs like OVN or the neutron-l2
implementation, or at decentralized designs like BGP or spanning tree.
In all cases, if we need some kind of organized communication, we need
to know all relevant peers.
Designs might diverge in whether you need to be "aware" of all peers
or just some of them, but that is just a tradeoff between data size
and the options you have to forward data.

> This requirement mainly comes from overlay networks that are implemented
> over a full-mesh of point-to-point GENEVE tunnels (or VXLAN with some
> limitations). It is not a scaling issue by itself, but it implies
> a centralized decision which in turn puts pressure on the central node
> at scale.

+1. On the other hand, it removes the signaling needs between the
nodes that you would have with e.g. BGP.
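
Just to put numbers on that tradeoff: with a full mesh of n chassis,
every node maintains n-1 tunnel ports and the cluster as a whole has
n*(n-1)/2 point-to-point links, e.g.:

    # Full-mesh tunnel scaling: each of n chassis peers with the
    # other n-1.
    for n in (100, 1000, 10000):
        print(f"{n} chassis: {n - 1} tunnel ports per node, "
              f"{n * (n - 1) // 2} links cluster-wide")
    # 10000 chassis: 9999 tunnel ports per node,
    # 49995000 links cluster-wide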

>
> Due to ovsdb monitoring and caching, any change in the southbound DB
> (either by northd or by any of the chassis controllers) is replicated on
> every chassis. The monitor_all option is often enabled on large clusters
> to avoid the conditional monitoring CPU cost on the central node.

This is, I guess, something that should be possible to fix. We have
also enabled this setting as it gave us stability improvements, and we
do not yet see performance issues with it.
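
For reference, the knob itself is the ovn-monitor-all option in the
local Open_vSwitch table. On the IDL level, going back to conditional
monitoring boils down to the client installing a condition per table,
roughly like this (untested sketch with the python-ovs bindings,
reusing the idl from the earlier sketch; the condition shown is made
up, ovn-controller derives the real ones from the local datapaths):

    # Ask the server to only replicate Port_Binding rows bound to this
    # chassis instead of monitoring everything (monitor_all). This
    # trades server-side CPU (condition evaluation) for client-side
    # memory and traffic.
    # local_chassis_uuid is a placeholder for the chassis' row UUID.
    idl.cond_change("Port_Binding",
                    [["chassis", "==", ["uuid", local_chassis_uuid]]])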

>
> This leads to high memory usage on all chassis, control plane traffic
> and possible disruptions in the ovs-vswitchd datapath flow cache.
> Unfortunately, I don't have any hard data to back this claim. This is
> mainly coming from discussions I had with neutron contributors and from
> brainstorming sessions with colleagues.

Could you maybe elaborate on the datapath flow cache issue? It sounds
like it might affect actual live traffic, and I am not aware of the
details there.
The memory usage and the control plane traffic would be fixed by not
having to rely on monitor_all, right?

>
> I hope that the current work on OVN heater to integrate openstack
> support [2] will allow getting more insight.
>
> [2] https://github.com/ovn-org/ovn-heater/pull/179
>
> Dynamic mac learning
> ====================
>
> Logical switch ports on a given chassis are all connected to the same
> OVS bridge, in the same VLAN. This prevents using local mac address
> learning and shifts the responsibility to a centralized ovn-northd to
> create all the required logical flows to properly segment the network.
>
> When using mac_address=unknown ports, centralized mac learning is
> enabled and when a new address is seen entering a port, OVS sends it to
> the local controller which updates the FDB table and recomputes flow
> rules accordingly. With logical switches spanning across a large number
> of chassis, this centralized mac address learning and aging can have an
> impact on control plane and dataplane performance.

Might one solution be to generate such lookup flows without the help
of northd, like it is already done for some other tables? Northd could
pre-create a flow template that is then instantiated for each
MAC_Binding entry relevant to the respective router.
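
To make that concrete, a purely hypothetical sketch of what
ovn-controller could do locally: take a per-router lookup-flow
template from northd and expand it for every relevant MAC_Binding row,
with no northd round trip (the stage name and match syntax below are
only loosely modeled on OVN's logical flows):

    # Hypothetical: instantiate a northd-provided flow template locally
    # for every MAC_Binding row, so MAC learning/aging does not need a
    # northd recompute. Illustrative only, not real ovn-controller code.
    FLOW_TEMPLATE = ("table=lr_in_arp_resolve, "
                     "match=(outport == {port} && ip4.dst == {ip}), "
                     "action=(eth.dst = {mac}; next;)")

    def flows_for_router(mac_bindings, router_ports):
        """Expand the template for each binding on this router's ports."""
        for b in mac_bindings:
            if b["logical_port"] in router_ports:
                yield FLOW_TEMPLATE.format(port=b["logical_port"],
                                           ip=b["ip"], mac=b["mac"])

    # Example: one learned neighbor on router port "lrp-ext".
    print(next(flows_for_router(
        [{"logical_port": "lrp-ext", "ip": "10.0.0.5",
          "mac": "52:54:00:aa:bb:cc"}],
        {"lrp-ext"})))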

>
> Closing thoughts
> ================
>
> My understanding of L3 and L4 capabilities of OVN are too limited to
> discuss if there are other issues that would prevent scaling to
> thousands of nodes. My point was mainly focused on L2 network scaling.
>
> I would love to get other opinions on these statements.
>
> Cheers!
>
> --
> Robin Jarry
> Red Hat, Telco/NFV
>

Best Regards
Felix Huettner