Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues

2023-10-04 Thread Robin Jarry via discuss

Hi Mark,

Mark Michelson, Oct 03, 2023 at 23:09:
> Hi Robin,
>
> Thanks a bunch for putting these two emails together. I've read through
> them and the replies.
>
> I think there's one major issue: a lack of data.

That's my concern as well... The problem is that it is very hard to get
reliable and actionable data at that level of scale. I have been trying
to collect such data and put together realistic scenarios but have
failed until now.

> I think the four bullet points you listed below are admirable goals. The
> problem is that I think we're putting the cart before the horse with
> both the issues and proposals.
>
> In other words, before being able to properly evaluate these emails, we
> need to see a scenario that
>
> 1) Has clear goals for what scalability metrics are desired.
> 2) Shows evidence that these scalability goals are not being met.
> 3) Shows evidence that one or more of the issues listed in this email
>    are the cause for the scalability issues in the scenario.
> 4) Shows evidence that the proposed changes would fix the scalability
>    issues in the scenario.

I hope that the ongoing work on ovn-heater will help in that regard.

> I listed them in this order because without a failing scenario, we can't
> claim the scalability is poor. Then if we have a failing scenario, it's
> possible that the problem and solution are much simpler than any of the
> issues or proposals that have been brought up here. Then, it's also
> possible that only a subset of the issues listed in this email are
> contributing to the failure. Even if the issues identified here are
> directly causing the scenario to fail, there may still be simpler
> solutions than what has been proposed. And finally, it's possible that
> the proposed solutions don't actually result in the expected scale
> increase.
>
> I want to make sure my tone is coming across clearly here. I don't think
> the current OVN architecture is perfect, and I don't want to be
> dismissive of the issues you've raised. If there are changes we can make
> to simplify OVN and scale better at the same time, I'm all for it. The
> problem is that, as you pointed out in your proposal email, most of
> these proposals result in difficulties for upgrades/downgrades, as well
> as code maintenance. Therefore, if we are going to do any of these, we
> need to first be certain that we aren't scaling as well as we would
> like, and that there are not simpler paths to reach our scalability
> targets.


I get your point, and this is specifically why I split the
conversation in two. I did not want my proposals to be mixed up with
the issues.


I will see if I can get hard data that can demonstrate what I claim.

Thanks!

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues

2023-10-03 Thread Mark Michelson via discuss

Hi Robin,

Thanks a bunch for putting these two emails together. I've read through 
them and the replies.


I think there's one major issue: a lack of data.

I think the four bullet points you listed below are admirable goals. The 
problem is that I think we're putting the cart before the horse with 
both the issues and proposals.


In other words, before being able to properly evaluate these emails, we 
need to see a scenario that

1) Has clear goals for what scalability metrics are desired.
2) Shows evidence that these scalability goals are not being met.
3) Shows evidence that one or more of the issues listed in this email 
are the cause for the scalability issues in the scenario.
4) Shows evidence that the proposed changes would fix the scalability 
issues in the scenario.


I listed them in this order because without a failing scenario, we can't 
claim the scalability is poor. Then if we have a failing scenario, it's 
possible that the problem and solution are much simpler than any of the 
issues or proposals that have been brought up here. Then, it's also 
possible that only a subset of the issues listed in this email are 
contributing to the failure. Even if the issues identified here are 
directly causing the scenario to fail, there may still be simpler 
solutions than what has been proposed. And finally, it's possible that 
the proposed solutions don't actually result in the expected scale increase.


I want to make sure my tone is coming across clearly here. I don't think 
the current OVN architecture is perfect, and I don't want to be 
dismissive of the issues you've raised. If there are changes we can make 
to simplify OVN and scale better at the same time, I'm all for it. The 
problem is that, as you pointed out in your proposal email, most of 
these proposals result in difficulties for upgrades/downgrades, as well 
as code maintenance. Therefore, if we are going to do any of these, we 
need to first be certain that we aren't scaling as well as we would 
like, and that there are not simpler paths to reach our scalability targets.


On 9/28/23 11:18, Robin Jarry wrote:

Hello OVN community,

I'm glad the subject of this message has caught your attention :-)

I would like to start a discussion about how we could improve OVN on the
following topics:

* Reduce the memory and CPU footprint of ovn-controller, ovn-northd.
* Support scaling of L2 connectivity across larger clusters.
* Simplify CMS interoperability.
* Allow support for alternative datapath implementations.

This first email will focus on the current issues that (in my view) are
preventing OVN from scaling L2 networks on larger clusters. I will send
another message with some change proposals to remove or fix these
issues.

Disclaimer:

I am fairly new to this project and my perception and understanding may
be incorrect in some aspects. Please forgive me in advance if I use the
wrong terms and/or make invalid statements. My intent is only to make
things better and not to put the blame on anyone for the current design
choices.

Southbound Design
=================

In the current architecture, both databases contain a mix of state and
configuration. While this does not seem to cause any scaling issues for
the northbound DB, it can become a bottleneck for the southbound with
large numbers of chassis and logical network constructs.

The southbound database contains a mix of configuration (logical flows
transformed from the logical network topology) and state (chassis, port
bindings, mac bindings, FDB entries, etc.).

The "configuration" part is consumed by ovn-controller to implement the
network on every chassis and the "state" part is consumed by ovn-northd
to update the northbound "state" entries and to update logical flows.
Some CMSs [1] also depend on the southbound "state" in order to
function properly.

[1] 
https://opendev.org/openstack/neutron/src/tag/22.0.0/neutron/agent/ovn/metadata/ovsdb.py#L39-L40
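To make the configuration/state split concrete, here is a rough classification of a few southbound tables as described above, sketched in Python. The labels reflect this thread's interpretation only (they are not an attribute of the schema), and the table list is not exhaustive:

```python
# Rough classification of some OVN southbound tables, following the
# configuration vs. state distinction made above. This is this
# discussion's interpretation, not an official schema attribute.
SB_TABLES = {
    "Logical_Flow": "configuration",   # compiled by ovn-northd from the NB topology
    "Multicast_Group": "configuration",
    "Chassis": "state",        # registered by each ovn-controller
    "Encap": "state",          # tunnel endpoints advertised per chassis
    "MAC_Binding": "state",    # ARP/ND entries learned at runtime
    "FDB": "state",            # MAC/port entries learned at runtime
    # Port_Binding is both: rows come from the northbound configuration,
    # but the "chassis" column is state set by the owning ovn-controller.
    "Port_Binding": "mixed",
}

# Tables that carry runtime state (fully or partially):
state_tables = sorted(t for t, kind in SB_TABLES.items()
                      if kind != "configuration")
print(state_tables)
# ['Chassis', 'Encap', 'FDB', 'MAC_Binding', 'Port_Binding']
```

As the later replies in this thread clarify, Port_Binding is the interesting case: it starts as configuration but gains state columns at runtime.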

Centralized decisions
=====================

Every chassis needs to be "aware" of all other chassis in the cluster.
This requirement mainly comes from overlay networks that are implemented
over a full-mesh of point-to-point GENEVE tunnels (or VXLAN with some
limitations). It is not a scaling issue by itself, but it implies
a centralized decision which in turn puts pressure on the central node
at scale.
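As a back-of-the-envelope illustration of the full-mesh requirement (a sketch, not a measurement): each chassis keeps one tunnel port per remote chassis, so per-node state grows linearly with cluster size while the cluster-wide tunnel count grows quadratically.

```python
# Cost of a full mesh of point-to-point tunnels between n chassis:
# every chassis needs a tunnel port to every other chassis.

def tunnel_ports_per_chassis(n: int) -> int:
    return n - 1

def total_tunnel_links(n: int) -> int:
    # number of unordered chassis pairs
    return n * (n - 1) // 2

print(tunnel_ports_per_chassis(10_000))  # 9999
print(total_tunnel_links(10_000))        # 49995000
```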

Due to ovsdb monitoring and caching, any change in the southbound DB
(either by northd or by any of the chassis controllers) is replicated on
every chassis. The monitor_all option is often enabled on large clusters
to avoid the conditional monitoring CPU cost on the central node.

This leads to high memory usage on all chassis, control plane traffic
and possible disruptions in the ovs-vswitchd datapath flow cache.
Unfortunately, I don't have any hard data to back this claim. This is
mainly coming from discussions I had with neutron contributors and from
brainstorming sessions with colleagues.
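The fan-out cost being claimed here can at least be sketched as a toy model (all figures below are illustrative assumptions, not measurements):

```python
# Toy model of southbound replication fan-out: with monitor_all=true
# every committed update is replicated to every chassis, while
# conditional monitoring only sends it to interested chassis.

def bytes_pushed(update_bytes: int, n_chassis: int,
                 interested_fraction: float, monitor_all: bool) -> int:
    fanout = n_chassis if monitor_all else int(n_chassis * interested_fraction)
    return update_bytes * fanout

# A 1 KiB row update in a 10k-chassis cluster where only 1% of the
# chassis actually monitor the affected rows:
print(bytes_pushed(1024, 10_000, 0.01, monitor_all=True))   # 10240000
print(bytes_pushed(1024, 10_000, 0.01, monitor_all=False))  # 102400
```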

I hope that the current work on OVN 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues

2023-10-02 Thread Robin Jarry via discuss
Han Zhou, Oct 01, 2023 at 21:30:
> Please note that tunnels are needed not only between nodes related to same
> logical switches, but also when they are related to different logical
> switches connected by logical routers (even multiple LR+LS hops away).

Yep.

> To clarify a little more, openstack deployment can have different logical
> topologies. So to evaluate the impact of monitor_all settings there should
> be different test cases to capture different types of deployment, e.g.
> full-mesh topology (monitor_all=true is better) vs. "small islands"
> topology (monitor_all=false is reasonable).

This is one thing to note for the recent ovn-heater work that adds
openstack test cases.

> FDB and MAC_binding tables are used by ovn-controllers. They are
> essentially the central storage for MAC tables of the distributed logical
> switches (FDB) and ARP/Neighbour tables for distributed logical routers
> (MAC_binding). A record can be populated by one chassis and consumed by many
> other chassis.
>
> monitor_all should work the same way for these tables: if monitor_all =
> false, only rows related to "local datapaths" should be downloaded to the
> chassis. However, for FDB table, the condition is not set for now (which
> may have been a miss in the initial implementation). Perhaps this is not
> noticed because MAC learning is not a very widely used feature and no scale
> impact noticed, but I just proposed a patch to enable the conditional
> monitoring:
> https://patchwork.ozlabs.org/project/ovn/patch/20231001192658.1012806-1-hz...@ovn.org/

Ok thanks!



Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues

2023-10-01 Thread Han Zhou via discuss
On Sun, Oct 1, 2023 at 9:06 AM Robin Jarry  wrote:
>
> Hi Han,
>
> thanks a lot for your detailed answer.
>
> Han Zhou, Sep 30, 2023 at 01:03:
> > > I think ovn-controller only consumes the logical flows. The chassis and
> > > port bindings tables are used by northd to update these logical flows.
> >
> > Felix was right. For example, port-binding is firstly a configuration
> > from north-bound, but the states such as its physical location (the
> > chassis column) are populated by ovn-controller of the owning chassis
> > and consumed by other ovn-controllers that are interested in that
> > port-binding.
>
> I was not aware of this. Thanks.
>
> > > Exactly, but was the signaling between the nodes ever an issue?
> >
> > I am not an expert of BGP, but at least for what I am aware of, there
> > are scaling issues in things like BGP full mesh signaling, and there are
> > solutions such as route reflector (which is again centralized) to solve
> > such issues.
>
> I am not familiar with BGP full mesh signaling. But from what I can
> tell, it looks like the same concept as the full mesh of GENEVE tunnels,
> except that the tunnels are only used when the same logical switch is
> implemented between two nodes.
>
Please note that tunnels are needed not only between nodes related to same
logical switches, but also when they are related to different logical
switches connected by logical routers (even multiple LR+LS hops away).

> > > So you have enabled monitor_all=true as well? Or did you test at scale
> > > with monitor_all=false.
> > >
> > We do use monitor_all=false, primarily to reduce memory footprint (and
> > also CPU cost of IDL processing) on each chassis. There are trade-offs
> > to the SB DB server performance:
> >
> > - On one hand it increases the cost of conditional monitoring, which
> >   is expensive for sure
> > - On the other hand, it reduces the total amount of data for the
> >   server to propagate to clients
> >
> > It really depends on your topology for making the choice. If most of the
> > nodes would anyway monitor most of the DB data (something similar to a
> > full-mesh), it is more reasonable to use monitor_all=true. Otherwise, in
> > a topology like ovn-kubernetes where each node has its dedicated part of
> > the data, or in topologies where you have lots of small "islands" such as
> > a cloud with many small tenants that never talk to each other, using
> > monitor_all=false could make sense (but still needs to be carefully
> > evaluated and tested for your own use cases).
>
> I didn't see recent scale testing for openstack, but in past testing we
> had to set monitor_all=true because the CPU usage of the SB ovsdb was
> a bottleneck.
>
To clarify a little more, OpenStack deployments can have different logical
topologies. So to evaluate the impact of monitor_all settings, there should
be different test cases to capture different types of deployment, e.g.
full-mesh topology (monitor_all=true is better) vs. "small islands"
topology (monitor_all=false is reasonable).
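The topology dependence can be sketched with a toy sizing model (all figures below are made-up assumptions for illustration only):

```python
# Rows a single chassis has to download and cache, depending on the
# monitor_all setting and how many logical datapaths are "local" to it.

def rows_downloaded(n_datapaths: int, rows_per_datapath: int,
                    local_datapaths: int, monitor_all: bool) -> int:
    monitored = n_datapaths if monitor_all else local_datapaths
    return monitored * rows_per_datapath

# "Small islands": a chassis touching 5 of 2000 tenant networks benefits
# hugely from monitor_all=false:
print(rows_downloaded(2000, 500, 5, monitor_all=False))     # 2500
# Full mesh: nearly every datapath is local anyway, so conditional
# monitoring saves little while still costing server CPU:
print(rows_downloaded(2000, 500, 1990, monitor_all=False))  # 995000
print(rows_downloaded(2000, 500, 2000, monitor_all=True))   # 1000000
```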

> > > The memory usage would be reduced but I don't know to which point. One
> > > of the main consumers is the logical flows table which is required
> > > everywhere. Unless there is a way to only sync a portion of this table
> > > depending on the chassis, disabling monitor_all would save syncing the
> > > unneeded tables for ovn-controller: chassis, port bindings, etc.
> >
> > Probably it wasn't what you meant, but I'd like to clarify that it is
> > not about unneeded tables, but unneeded rows in those tables (mainly
> > logical_flow and port_binding).
> > It indeed syncs only a portion of the tables. It does not depend
> > directly on the chassis, but on what port-bindings are on the chassis
> > and what logical connectivity those port-bindings have. So, again, the
> > choice really depends on your use cases.
>
> What about the FDB (mac-port) and MAC binding (ip-mac) tables? I thought
> ovn-controller does not need them. If that is the case, I thought that
> by default, the whole tables (not only some of their rows) were excluded
> from the synchronized data.
>
FDB and MAC_binding tables are used by ovn-controllers. They are
essentially the central storage for MAC tables of the distributed logical
switches (FDB) and ARP/Neighbour tables for distributed logical routers
(MAC_binding). A record can be populated by one chassis and consumed by many
other chassis.

monitor_all should work the same way for these tables: if monitor_all =
false, only rows related to "local datapaths" should be downloaded to the
chassis. However, for FDB table, the condition is not set for now (which
may have been a miss in the initial implementation). Perhaps this was not
noticed because MAC learning is not a very widely used feature and no scale
impact was noticed, but I just proposed a patch to enable the conditional
monitoring:
https://patchwork.ozlabs.org/project/ovn/patch/20231001192658.1012806-1-hz...@ovn.org/

Thanks,
Han

> Thanks!
>

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues

2023-10-01 Thread Robin Jarry via discuss
Hi Han,

thanks a lot for your detailed answer.

Han Zhou, Sep 30, 2023 at 01:03:
> > I think ovn-controller only consumes the logical flows. The chassis and
> > port bindings tables are used by northd to update these logical flows.
>
> Felix was right. For example, port-binding is firstly a configuration from
> north-bound, but the states such as its physical location (the chassis
> column) are populated by ovn-controller of the owning chassis and consumed
> by other ovn-controllers that are interested in that port-binding.

I was not aware of this. Thanks.

> > Exactly, but was the signaling between the nodes ever an issue?
>
> I am not an expert of BGP, but at least for what I am aware of, there are
> scaling issues in things like BGP full mesh signaling, and there are
> solutions such as route reflector (which is again centralized) to solve
> such issues.

I am not familiar with BGP full mesh signaling. But from what I can
tell, it looks like the same concept as the full mesh of GENEVE tunnels,
except that the tunnels are only used when the same logical switch is
implemented between two nodes.

> > So you have enabled monitor_all=true as well? Or did you test at scale
> > with monitor_all=false.
> >
> We do use monitor_all=false, primarily to reduce memory footprint (and also
> CPU cost of IDL processing) on each chassis. There are trade-offs to the SB
> DB server performance:
>
> - On one hand it increases the cost of conditional monitoring, which
>   is expensive for sure
> - On the other hand, it reduces the total amount of data for the
>   server to propagate to clients
>
> It really depends on your topology for making the choice. If most of the
> nodes would anyway monitor most of the DB data (something similar to a
> full-mesh), it is more reasonable to use monitor_all=true. Otherwise, in
> a topology like ovn-kubernetes where each node has its dedicated part of the
> data, or in topologies where you have lots of small "islands" such as a
> cloud with many small tenants that never talk to each other, using
> monitor_all=false could make sense (but still need to be carefully
> evaluated and tested for your own use cases).

I didn't see recent scale testing for openstack, but in past testing we
had to set monitor_all=true because the CPU usage of the SB ovsdb was
a bottleneck.

> > The memory usage would be reduced but I don't know to which point. One
> > of the main consumers is the logical flows table which is required
> > everywhere. Unless there is a way to only sync a portion of this table
> > depending on the chassis, disabling monitor_all would save syncing the
> > unneeded tables for ovn-controller: chassis, port bindings, etc.
>
> Probably it wasn't what you meant, but I'd like to clarify that it is not
> about unneeded tables, but unneeded rows in those tables (mainly
> logical_flow and port_binding).
> It indeed syncs only a portion of the tables. It is not depending directly
> on chassis, but depending on what port-bindings are on the chassis and what
> logical connectivity those port-bindings have. So, again, the choice really
> depends on your use cases.

What about the FDB (mac-port) and MAC binding (ip-mac) tables? I thought
ovn-controller does not need them. If that is the case, I thought that
by default, the whole tables (not only some of their rows) were excluded
from the synchronized data.

Thanks!



Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues

2023-09-29 Thread Han Zhou via discuss
On Fri, Sep 29, 2023 at 7:26 AM Robin Jarry  wrote:
>
> Hi Felix,
>
> Thanks a lot for your message.
>
> Felix Huettner, Sep 29, 2023 at 14:35:
> > I can get that when running 10k ovn-controllers the benefits of
> > optimizing cpu and memory load are quite significant. However I am
> > unsure about reducing the footprint of ovn-northd.
> > When running so many nodes I would have assumed that having an
> > additional (or maybe two) dedicated machines for ovn-northd would
> > be completely acceptable, as long as it can still actually do what
> > it should in a reasonable timeframe.
> > Would the goal for ovn-northd be more like "Reduce the full/incremental
> > recompute time" then?

+1

>
> The main goal of this thread is to get a consensus on the actual issues
> that prevent scaling at the moment. We can discuss solutions in the
> other thread.
>

Thanks for the good discussions!

> > > * Allow support for alternative datapath implementations.
> >
> > Does this mean ovs datapaths (e.g. dpdk) or something different?
>
> See the other thread.
>
> > > Southbound Design
> > > =================
> ...
> > Note that ovn-controller also consumes the "state" of other chassis to
> > e.g. build the tunnels to other chassis. To visualize my understanding:
> >
> > +----------------+---------------+------------+
> > |                | configuration |   state    |
> > +----------------+---------------+------------+
> > |   ovn-northd   |  write-only   | read-only  |
> > +----------------+---------------+------------+
> > | ovn-controller |   read-only   | read-write |
> > +----------------+---------------+------------+
> > |    some cms    |  no access?   | read-only  |
> > +----------------+---------------+------------+
>
> I think ovn-controller only consumes the logical flows. The chassis and
> port bindings tables are used by northd to update these logical flows.
>

Felix was right. For example, port-binding is firstly a configuration from
north-bound, but the states such as its physical location (the chassis
column) are populated by ovn-controller of the owning chassis and consumed
by other ovn-controllers that are interested in that port-binding.

> > > Centralized decisions
> > > =====================
> > >
> > > Every chassis needs to be "aware" of all other chassis in the cluster.
> >
> > I think we need to accept this as a fundamental truth, independent of
> > whether you look at centralized designs like ovn or the neutron-l2
> > implementation or at decentralized designs like bgp or spanning tree. In
> > all cases if we need some kind of organized communication we need to
> > know all relevant peers.
> > Designs might diverge if you need to be "aware" of all peers or just
> > some of them, but that is just a tradeoff between data size and options
> > you have to forward data.
> >
> > > This requirement mainly comes from overlay networks that are
> > > implemented over a full-mesh of point-to-point GENEVE tunnels (or
> > > VXLAN with some limitations). It is not a scaling issue by itself, but
> > > it implies a centralized decision which in turn puts pressure on the
> > > central node at scale.
> >
> > +1. On the other hand it removes signaling needs between the nodes (like
> > you would have with bgp).
>
> Exactly, but was the signaling between the nodes ever an issue?

I am not an expert of BGP, but at least for what I am aware of, there are
scaling issues in things like BGP full mesh signaling, and there are
solutions such as route reflector (which is again centralized) to solve
such issues.

>
> > > Due to ovsdb monitoring and caching, any change in the southbound DB
> > > (either by northd or by any of the chassis controllers) is replicated
> > > on every chassis. The monitor_all option is often enabled on large
> > > clusters to avoid the conditional monitoring CPU cost on the central
> > > node.
> >
> > This is, I guess, something that should be possible to fix. We have also
> > enabled this setting as it gave us stability improvements and we do not
> > yet see performance issues with it
>
> So you have enabled monitor_all=true as well? Or did you test at scale
> with monitor_all=false.
>
We do use monitor_all=false, primarily to reduce memory footprint (and also
CPU cost of IDL processing) on each chassis. There are trade-offs to the SB
DB server performance:
- On one hand it increases the cost of conditional monitoring, which is
expensive for sure
- On the other hand, it reduces the total amount of data for the server to
propagate to clients

It really depends on your topology for making the choice. If most of the
nodes would anyway monitor most of the DB data (something similar to a
full-mesh), it is more reasonable to use monitor_all=true. Otherwise, in
a topology like ovn-kubernetes where each node has its dedicated part of the
data, or in topologies where you have lots of small "islands" such as a
cloud with many small tenants that never talk to each other, using
monitor_all=false could make sense (but 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues

2023-09-29 Thread Robin Jarry via discuss
Hi Felix,

Thanks a lot for your message.

Felix Huettner, Sep 29, 2023 at 14:35:
> I can get that when running 10k ovn-controllers the benefits of
> optimizing cpu and memory load are quite significant. However I am
> unsure about reducing the footprint of ovn-northd.
> When running so many nodes I would have assumed that having an
> additional (or maybe two) dedicated machines for ovn-northd would
> be completely acceptable, as long as it can still actually do what
> it should in a reasonable timeframe.
> Would the goal for ovn-northd be more like "Reduce the full/incremental
> recompute time" then?

The main goal of this thread is to get a consensus on the actual issues
that prevent scaling at the moment. We can discuss solutions in the
other thread.

> > * Allow support for alternative datapath implementations.
>
> Does this mean ovs datapaths (e.g. dpdk) or something different?

See the other thread.

> > Southbound Design
> > =================
...
> Note that ovn-controller also consumes the "state" of other chassis to
> e.g. build the tunnels to other chassis. To visualize my understanding:
>
> +----------------+---------------+------------+
> |                | configuration |   state    |
> +----------------+---------------+------------+
> |   ovn-northd   |  write-only   | read-only  |
> +----------------+---------------+------------+
> | ovn-controller |   read-only   | read-write |
> +----------------+---------------+------------+
> |    some cms    |  no access?   | read-only  |
> +----------------+---------------+------------+

I think ovn-controller only consumes the logical flows. The chassis and
port bindings tables are used by northd to update these logical flows.

> > Centralized decisions
> > =====================
> >
> > Every chassis needs to be "aware" of all other chassis in the cluster.
>
> I think we need to accept this as a fundamental truth, independent of
> whether you look at centralized designs like ovn or the neutron-l2
> implementation or at decentralized designs like bgp or spanning tree. In
> all cases if we need some kind of organized communication we need to
> know all relevant peers.
> Designs might diverge if you need to be "aware" of all peers or just
> some of them, but that is just a tradeoff between data size and options
> you have to forward data.
>
> > This requirement mainly comes from overlay networks that are implemented
> > over a full-mesh of point-to-point GENEVE tunnels (or VXLAN with some
> > limitations). It is not a scaling issue by itself, but it implies
> > a centralized decision which in turn puts pressure on the central node
> > at scale.
>
> +1. On the other hand it removes signaling needs between the nodes (like
> you would have with bgp).

Exactly, but was the signaling between the nodes ever an issue?

> > Due to ovsdb monitoring and caching, any change in the southbound DB
> > (either by northd or by any of the chassis controllers) is replicated on
> > every chassis. The monitor_all option is often enabled on large clusters
> > to avoid the conditional monitoring CPU cost on the central node.
>
> This is, I guess, something that should be possible to fix. We have also
> enabled this setting as it gave us stability improvements and we do not
> yet see performance issues with it

So you have enabled monitor_all=true as well? Or did you test at scale
with monitor_all=false.

What I am saying is that without monitor_all=true, the southbound
ovsdb-server needs to do checks to determine what updates to send to
which client. Since the server is single threaded, it becomes an issue
at scale. I know that there were some significant improvements made
recently but it will only push the limit further. I don't have hard data
to prove my point yet unfortunately.
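The server-side cost referred to here can be sketched as a rough complexity model (the figures are hypothetical, and the recent improvements mentioned above reduce these constants in practice):

```python
# With monitor_all=true the single-threaded ovsdb-server sends each
# update to everyone without per-client checks; with conditional
# monitoring it must evaluate each client's conditions against each
# committed change, so server work grows roughly as clients x updates.

def condition_checks(n_clients: int, n_updates: int, monitor_all: bool) -> int:
    return 0 if monitor_all else n_clients * n_updates

# 10k chassis, each holding monitor conditions, during 1k row updates:
print(condition_checks(10_000, 1_000, monitor_all=False))  # 10000000
print(condition_checks(10_000, 1_000, monitor_all=True))   # 0
```

This is the trade-off: monitor_all=true shifts the cost from server CPU to replication volume and client memory.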

> > This leads to high memory usage on all chassis, control plane traffic
> > and possible disruptions in the ovs-vswitchd datapath flow cache.
> > Unfortunately, I don't have any hard data to back this claim. This is
> > mainly coming from discussions I had with neutron contributors and from
> > brainstorming sessions with colleagues.
>
> Could you maybe elaborate on the datapath flow cache issue, as it sounds
> like it might affect actual live traffic and i am not aware of details
> there.

I may have had a wrong understanding of the mechanisms of OVS here.
I was under the impression that any update of the openflow rules would
invalidate all datapath flows. It is far more subtle than this [1].
So unless there is an actual change in the packet pipeline, live traffic
should not be affected.

[1] 
https://developers.redhat.com/articles/2022/10/19/open-vswitch-revalidator-process-explained
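The revalidation behaviour described in [1] can be illustrated with a minimal model. This is purely conceptual: real megaflow revalidation in ovs-vswitchd works on translated flow attributes and statistics, not on the strings used here.

```python
# Sketch of selective datapath flow revalidation: cached flows are kept
# when re-translating their match through the current OpenFlow tables
# still yields the same actions, and evicted only when it does not.

def revalidate(cache: dict, translate) -> dict:
    return {match: actions for match, actions in cache.items()
            if translate(match) == actions}

cache = {"ip,dst=10.0.0.1": "output:2", "ip,dst=10.0.0.2": "output:3"}
# After a rule change, only the flow whose translation changed is evicted:
lookup = {"ip,dst=10.0.0.1": "output:2", "ip,dst=10.0.0.2": "drop"}.get
print(revalidate(cache, lookup))  # {'ip,dst=10.0.0.1': 'output:2'}
```

So a change in the packet pipeline only disturbs the cached flows it actually affects, which is why live traffic is mostly unaffected.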

> The memory usage and the traffic would be fixed by not having to rely on
> monitor_all, right?

The memory usage would be reduced but I don't know to what extent. One
of the main consumers is the logical flows table which is required
everywhere. Unless there is a way to only sync a portion of this table

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues

2023-09-29 Thread Felix Huettner via discuss
Hi Robin and everyone else,

On Thu, Sep 28, 2023 at 05:18:19PM +0200, Robin Jarry via discuss wrote:
> Hello OVN community,
>
> I'm glad the subject of this message has caught your attention :-)
>
> I would like to start a discussion about how we could improve OVN on the
> following topics:
>
> * Reduce the memory and CPU footprint of ovn-controller, ovn-northd.

I can get that when running 10k ovn-controllers the benefits of
optimizing cpu and memory load are quite significant. However I am
unsure about reducing the footprint of ovn-northd.
When running so many nodes I would have assumed that having an
additional (or maybe two) dedicated machines for ovn-northd would
be completely acceptable, as long as it can still actually do what
it should in a reasonable timeframe.
Would the goal for ovn-northd be more like "Reduce the full/incremental
recompute time" then?

> * Support scaling of L2 connectivity across larger clusters.
> * Simplify CMS interoperability.
> * Allow support for alternative datapath implementations.

Does this mean ovs datapaths (e.g. dpdk) or something different?

>
> This first email will focus on the current issues that (in my view) are
> preventing OVN from scaling L2 networks on larger clusters. I will send
> another message with some change proposals to remove or fix these
> issues.
>
> Disclaimer:
>
> I am fairly new to this project and my perception and understanding may
> be incorrect in some aspects. Please forgive me in advance if I use the
> wrong terms and/or make invalid statements. My intent is only to make
> things better and not to put the blame on anyone for the current design
> choices.

Please apply the same disclaimer to my comments as well.

>
> Southbound Design
> =================
>
> In the current architecture, both databases contain a mix of state and
> configuration. While this does not seem to cause any scaling issues for
> the northbound DB, it can become a bottleneck for the southbound with
> large numbers of chassis and logical network constructs.
>
> The southbound database contains a mix of configuration (logical flows
> transformed from the logical network topology) and state (chassis, port
> bindings, mac bindings, FDB entries, etc.).
>
> The "configuration" part is consumed by ovn-controller to implement the
> network on every chassis and the "state" part is consumed by ovn-northd
> to update the northbound "state" entries and to update logical flows.
> Some CMSs [1] also depend on the southbound "state" in order to
> function properly.

Note that ovn-controller also consumes the "state" of other chassis to
e.g. build the tunnels to other chassis. To visualize my understanding:

+----------------+---------------+------------+
|                | configuration |   state    |
+----------------+---------------+------------+
|   ovn-northd   |  write-only   | read-only  |
+----------------+---------------+------------+
| ovn-controller |   read-only   | read-write |
+----------------+---------------+------------+
|    some cms    |  no access?   | read-only  |
+----------------+---------------+------------+

>
> [1] 
> https://opendev.org/openstack/neutron/src/tag/22.0.0/neutron/agent/ovn/metadata/ovsdb.py#L39-L40
>
> Centralized decisions
> =====================
>
> Every chassis needs to be "aware" of all other chassis in the cluster.

I think we need to accept this as a fundamental truth, independent of
whether you look at centralized designs like ovn or the neutron-l2
implementation or at decentralized designs like bgp or spanning tree. In
all cases if we need some kind of organized communication we need to
know all relevant peers.
Designs might diverge if you need to be "aware" of all peers or just
some of them, but that is just a tradeoff between data size and options
you have to forward data.

> This requirement mainly comes from overlay networks that are implemented
> over a full-mesh of point-to-point GENEVE tunnels (or VXLAN with some
> limitations). It is not a scaling issue by itself, but it implies
> a centralized decision which in turn puts pressure on the central node
> at scale.

+1. On the other hand it removes signaling needs between the nodes (like
you would have with bgp).

>
> Due to ovsdb monitoring and caching, any change in the southbound DB
> (either by northd or by any of the chassis controllers) is replicated on
> every chassis. The monitor_all option is often enabled on large clusters
> to avoid the conditional monitoring CPU cost on the central node.

This is, I guess, something that should be possible to fix. We have also
enabled this setting as it gave us stability improvements and we do not
yet see performance issues with it.

>
> This leads to high memory usage on all chassis, control plane traffic
> and possible disruptions in the ovs-vswitchd datapath flow cache.
> Unfortunately, I don't have any hard data to back this claim. This is
> mainly coming from discussions I had with neutron contributors and from
>