Hi Felix,

Thanks a lot for your message.
Felix Huettner, Sep 29, 2023 at 14:35:

> I can get that when running 10k ovn-controllers the benefits of
> optimizing cpu and memory load are quite significant. However i am
> unsure about reducing the footprint of ovn-northd.
> When running so many nodes i would have assumed that having an
> additional (or maybe two) dedicated machines for ovn-northd would
> be completely acceptable, as long as it can still actually do what
> it should in a reasonable timeframe.
> Would the goal for ovn-northd be more like "Reduce the full/incremental
> recompute time" then?

The main goal of this thread is to get a consensus on the actual issues
that prevent scaling at the moment. We can discuss solutions in the
other thread.

> > * Allow support for alternative datapath implementations.
>
> Does this mean ovs datapaths (e.g. dpdk) or something different?

See the other thread.

> > Southbound Design
> > =================

...

> Note that also ovn-controller consumes the "state" of other chassis to
> e.g. build the tunnels to other chassis. To visualize my understanding:
>
> +----------------+---------------+------------+
> |                | configuration | state      |
> +----------------+---------------+------------+
> | ovn-northd     | write-only    | read-only  |
> +----------------+---------------+------------+
> | ovn-controller | read-only     | read-write |
> +----------------+---------------+------------+
> | some cms       | no access?    | read-only  |
> +----------------+---------------+------------+

I think ovn-controller only consumes the logical flows. The chassis and
port bindings tables are used by northd to update these logical flows.

> > Centralized decisions
> > =====================
> >
> > Every chassis needs to be "aware" of all other chassis in the cluster.
>
> I think we need to accept this as fundamental truth. Independent if you
> look at centralized designs like ovn or the neutron-l2 implementation
> or if you look at decentralized designs like bgp or spanning tree.
> In all cases if we need some kind of organized communication we need to
> know all relevant peers.
> Designs might diverge if you need to be "aware" of all peers or just
> some of them, but that is just a tradeoff between data size and options
> you have to forward data.

> > This requirement mainly comes from overlay networks that are implemented
> > over a full-mesh of point-to-point GENEVE tunnels (or VXLAN with some
> > limitations). It is not a scaling issue by itself, but it implies
> > a centralized decision which in turn puts pressure on the central node
> > at scale.
>
> +1. On the other hand it removes signaling needs between the nodes (like
> you would have with bgp).

Exactly, but was the signaling between the nodes ever an issue?

> > Due to ovsdb monitoring and caching, any change in the southbound DB
> > (either by northd or by any of the chassis controllers) is replicated on
> > every chassis. The monitor_all option is often enabled on large clusters
> > to avoid the conditional monitoring CPU cost on the central node.
>
> This is, i guess, something that should be possible to fix. We have also
> enabled this setting as it gave us stability improvements and we do not
> yet see performance issues with it

So you have enabled monitor_all=true as well? Or did you test at scale
with monitor_all=false?

What I am saying is that without monitor_all=true, the southbound
ovsdb-server must evaluate, for every update, which rows to send to
which client. Since the server is single threaded, this becomes a
bottleneck at scale. I know that there were some significant
improvements made recently, but they will only push the limit further.
Unfortunately, I don't have hard data to prove my point yet.

> > This leads to high memory usage on all chassis, control plane traffic
> > and possible disruptions in the ovs-vswitchd datapath flow cache.
> > Unfortunately, I don't have any hard data to back this claim.
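For context, monitor_all is a per-chassis ovn-controller knob read from
the local Open_vSwitch table (a sketch, assuming a standard OVN install
where the option is named ovn-monitor-all as in ovn-controller(8)):

```shell
# Enable monitor_all on a chassis: ovn-controller then asks the
# southbound ovsdb-server for full replication instead of
# conditional (per-chassis) monitoring.
ovs-vsctl set open_vswitch . external_ids:ovn-monitor-all=true

# Inspect the current value on this chassis.
ovs-vsctl get open_vswitch . external_ids:ovn-monitor-all
```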
> > This is mainly coming from discussions I had with neutron contributors
> > and from brainstorming sessions with colleagues.
>
> Could you maybe elaborate on the datapath flow cache issue, as it sounds
> like it might affect actual live traffic and i am not aware of details
> there.

I may have had a wrong understanding of the OVS mechanisms here. I was
under the impression that any update of the openflow rules would
invalidate all datapath flows. It is far more subtle than that [1]. So
unless there is an actual change in the packet pipeline, live traffic
should not be affected.

[1] https://developers.redhat.com/articles/2022/10/19/open-vswitch-revalidator-process-explained

> The memory usage and the traffic would be fixed by not having to rely on
> monitor_all, right?

The memory usage would be reduced, but I don't know by how much. One of
the main consumers is the logical flows table, which is required
everywhere. Unless there is a way to sync only the portion of that table
relevant to each chassis, disabling monitor_all would only save syncing
the rows that a given ovn-controller does not need: chassis, port
bindings, etc.

> > Dynamic mac learning
> > ====================
> >
> > Logical switch ports on a given chassis are all connected to the same
> > OVS bridge, in the same VLAN. This prevents using local mac address
> > learning and shifts the responsibility to a centralized ovn-northd to
> > create all the required logical flows to properly segment the network.
> >
> > When using mac_address=unknown ports, centralized mac learning is
> > enabled and when a new address is seen entering a port, OVS sends it to
> > the local controller which updates the FDB table and recomputes flow
> > rules accordingly. With logical switches spanning across a large number
> > of chassis, this centralized mac address learning and aging can have an
> > impact on control plane and dataplane performance.
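As a side note, the two learning paths can be inspected directly, which
makes the local/centralized contrast visible (a sketch assuming br-int
as the integration bridge; exact table contents depend on the OVN
version):

```shell
# MACs learned locally by ovs-vswitchd on a bridge (classic L2
# learning, aged out locally, never leaves the node).
ovs-appctl fdb/show br-int

# MACs learned centrally for mac_address=unknown logical ports: they
# land in the southbound FDB table and are replicated to all chassis
# that monitor it.
ovn-sbctl list FDB
```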
> Might one solution for that be to generate such lookup flows without the
> help from northd like it is already done for some other tables? Northd
> could precreate the flow that is then templated for each entry in
> MAC_Bindings relevant to the respective router.

Let's discuss solutions in the other thread.

What do you think?

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss