Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues
Hi Mark,

Mark Michelson, Oct 03, 2023 at 23:09:
> Hi Robin,
>
> Thanks a bunch for putting these two emails together. I've read through
> them and the replies. I think there's one major issue: a lack of data.

That's my concern as well... The problem is that it is very hard to get
reliable and actionable data at that level of scale. I have been trying
to collect such data and put together realistic scenarios, but have
failed until now.

> I think the four bullet points you listed below are admirable goals.
> The problem is that I think we're putting the cart before the horse
> with both the issues and proposals. In other words, before being able
> to properly evaluate these emails, we need to see a scenario that
>
> 1) Has clear goals for what scalability metrics are desired.
> 2) Shows evidence that these scalability goals are not being met.
> 3) Shows evidence that one or more of the issues listed in this email
>    are the cause for the scalability issues in the scenario.
> 4) Shows evidence that the proposed changes would fix the scalability
>    issues in the scenario.

I hope that the ongoing work on ovn-heater will help in that regard.

> I listed them in this order because without a failing scenario, we
> can't claim the scalability is poor. Then if we have a failing
> scenario, it's possible that the problem and solution are much simpler
> than any of the issues or proposals that have been brought up here.
> It's also possible that only a subset of the issues listed in this
> email are contributing to the failure. Even if the issues identified
> here are directly causing the scenario to fail, there may still be
> simpler solutions than what has been proposed. And finally, it's
> possible that the proposed solutions don't actually result in the
> expected scale increase.
>
> I want to make sure my tone is coming across clearly here. I don't
> think the current OVN architecture is perfect, and I don't want to be
> dismissive of the issues you've raised. If there are changes we can
> make to simplify OVN and scale better at the same time, I'm all for
> it. The problem is that, as you pointed out in your proposal email,
> most of these proposals result in difficulties for upgrades/downgrades,
> as well as code maintenance. Therefore, if we are going to do any of
> these, we need to first be certain that we aren't scaling as well as we
> would like, and that there are no simpler paths to reach our
> scalability targets.

I get your point, and this is specifically why I split the conversation
in two: I did not want my proposals to be mixed up with the issues.
I will see if I can get hard data that demonstrates what I claim.

Thanks!

___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues
Hi Robin,

Thanks a bunch for putting these two emails together. I've read through
them and the replies. I think there's one major issue: a lack of data.

I think the four bullet points you listed below are admirable goals. The
problem is that I think we're putting the cart before the horse with
both the issues and proposals. In other words, before being able to
properly evaluate these emails, we need to see a scenario that

1) Has clear goals for what scalability metrics are desired.
2) Shows evidence that these scalability goals are not being met.
3) Shows evidence that one or more of the issues listed in this email
   are the cause for the scalability issues in the scenario.
4) Shows evidence that the proposed changes would fix the scalability
   issues in the scenario.

I listed them in this order because without a failing scenario, we can't
claim the scalability is poor. Then if we have a failing scenario, it's
possible that the problem and solution are much simpler than any of the
issues or proposals that have been brought up here. It's also possible
that only a subset of the issues listed in this email are contributing
to the failure. Even if the issues identified here are directly causing
the scenario to fail, there may still be simpler solutions than what has
been proposed. And finally, it's possible that the proposed solutions
don't actually result in the expected scale increase.

I want to make sure my tone is coming across clearly here. I don't think
the current OVN architecture is perfect, and I don't want to be
dismissive of the issues you've raised. If there are changes we can make
to simplify OVN and scale better at the same time, I'm all for it. The
problem is that, as you pointed out in your proposal email, most of
these proposals result in difficulties for upgrades/downgrades, as well
as code maintenance.
Therefore, if we are going to do any of these, we need to first be
certain that we aren't scaling as well as we would like, and that there
are no simpler paths to reach our scalability targets.

On 9/28/23 11:18, Robin Jarry wrote:
> Hello OVN community,
>
> I'm glad the subject of this message has caught your attention :-)
>
> I would like to start a discussion about how we could improve OVN on
> the following topics:
>
> * Reduce the memory and CPU footprint of ovn-controller, ovn-northd.
> * Support scaling of L2 connectivity across larger clusters.
> * Simplify CMS interoperability.
> * Allow support for alternative datapath implementations.
>
> This first email will focus on the current issues that (in my view) are
> preventing OVN from scaling L2 networks on larger clusters. I will send
> another message with some change proposals to remove or fix these
> issues.
>
> Disclaimer:
>
> I am fairly new to this project and my perception and understanding may
> be incorrect in some aspects. Please forgive me in advance if I use the
> wrong terms and/or make invalid statements. My intent is only to make
> things better and not to put the blame on anyone for the current design
> choices.
>
> Southbound Design
> =================
>
> In the current architecture, both databases contain a mix of state and
> configuration. While this does not seem to cause any scaling issues for
> the northbound DB, it can become a bottleneck for the southbound with
> large numbers of chassis and logical network constructs.
>
> The southbound database contains a mix of configuration (logical flows
> transformed from the logical network topology) and state (chassis, port
> bindings, mac bindings, FDB entries, etc.).
>
> The "configuration" part is consumed by ovn-controller to implement the
> network on every chassis and the "state" part is consumed by ovn-northd
> to update the northbound "state" entries and to update logical flows.
> Some CMSs [1] also depend on the southbound "state" in order to
> function properly.
> [1] https://opendev.org/openstack/neutron/src/tag/22.0.0/neutron/agent/ovn/metadata/ovsdb.py#L39-L40
>
> Centralized decisions
> =====================
>
> Every chassis needs to be "aware" of all other chassis in the cluster.
> This requirement mainly comes from overlay networks that are
> implemented over a full-mesh of point-to-point GENEVE tunnels (or VXLAN
> with some limitations). It is not a scaling issue by itself, but it
> implies a centralized decision which in turn puts pressure on the
> central node at scale.
>
> Due to ovsdb monitoring and caching, any change in the southbound DB
> (either by northd or by any of the chassis controllers) is replicated
> on every chassis. The monitor_all option is often enabled on large
> clusters to avoid the conditional monitoring CPU cost on the central
> node. This leads to high memory usage on all chassis, control plane
> traffic and possible disruptions in the ovs-vswitchd datapath flow
> cache.
>
> Unfortunately, I don't have any hard data to back this claim. This is
> mainly coming from discussions I had with neutron contributors and from
> brainstorming sessions with colleagues. I hope that the current work on
> OVN
Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues
Han Zhou, Oct 01, 2023 at 21:30:
> Please note that tunnels are needed not only between nodes related to
> the same logical switches, but also when they are related to different
> logical switches connected by logical routers (even multiple LR+LS hops
> away).

Yep.

> To clarify a little more, openstack deployment can have different
> logical topologies. So to evaluate the impact of monitor_all settings
> there should be different test cases to capture different types of
> deployment, e.g. full-mesh topology (monitor_all=true is better) v.s.
> "small islands" topology (monitor_all=false is reasonable).

This is one thing to note for the recent ovn-heater work that adds
openstack test cases.

> FDB and MAC_binding tables are used by ovn-controllers. They are
> essentially the central storage for MAC tables of the distributed
> logical switches (FDB) and ARP/Neighbour tables for distributed logical
> routers (MAC_binding). A record can be populated by one chassis and
> consumed by many other chassis.
>
> monitor_all should work the same way for these tables: if monitor_all =
> false, only rows related to "local datapaths" should be downloaded to
> the chassis. However, for the FDB table, the condition is not set for
> now (which may have been a miss in the initial implementation). Perhaps
> this was not noticed because MAC learning is not a very widely used
> feature and no scale impact was observed, but I just proposed a patch
> to enable conditional monitoring:
> https://patchwork.ozlabs.org/project/ovn/patch/20231001192658.1012806-1-hz...@ovn.org/

Ok thanks!
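[Editor's note: Han's point about the tunnel full mesh can be made
concrete with some back-of-the-envelope arithmetic. This is a
hypothetical illustration, not data from the thread: with N chassis,
each node maintains N-1 tunnel ports, and the cluster as a whole
contains N(N-1)/2 point-to-point links.]

```python
# Rough illustration (not measured data): how a full mesh of
# point-to-point GENEVE tunnels grows with the number of chassis.

def full_mesh_tunnels(n_chassis: int) -> tuple[int, int]:
    """Return (tunnel ports per chassis, total links in the cluster)."""
    per_chassis = n_chassis - 1               # one tunnel port per peer
    total = n_chassis * (n_chassis - 1) // 2  # each link counted once
    return per_chassis, total

for n in (100, 1_000, 10_000):
    per_node, total = full_mesh_tunnels(n)
    print(f"{n:>6} chassis: {per_node:>5} tunnel ports per node, "
          f"{total:>11} links cluster-wide")
```

At 10k chassis this is 9,999 tunnel ports per node and roughly 50
million links cluster-wide, which is why any per-tunnel bookkeeping in
the central database becomes noticeable at that scale.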
Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues
On Sun, Oct 1, 2023 at 9:06 AM Robin Jarry wrote:
>
> Hi Han,
>
> thanks a lot for your detailed answer.
>
> Han Zhou, Sep 30, 2023 at 01:03:
> > > I think ovn-controller only consumes the logical flows. The chassis
> > > and port bindings tables are used by northd to update these logical
> > > flows.
> >
> > Felix was right. For example, port-binding is firstly a configuration
> > from north-bound, but the states such as its physical location (the
> > chassis column) are populated by ovn-controller of the owning chassis
> > and consumed by other ovn-controllers that are interested in that
> > port-binding.
>
> I was not aware of this. Thanks.
>
> > > Exactly, but was the signaling between the nodes ever an issue?
> >
> > I am not an expert of BGP, but at least for what I am aware of, there
> > are scaling issues in things like BGP full mesh signaling, and there
> > are solutions such as route reflector (which is again centralized) to
> > solve such issues.
>
> I am not familiar with BGP full mesh signaling. But from what I can
> tell, it looks like the same concept as the full mesh of GENEVE
> tunnels. Except that the tunnels are only used when the same logical
> switch is implemented between two nodes.

Please note that tunnels are needed not only between nodes related to
the same logical switches, but also when they are related to different
logical switches connected by logical routers (even multiple LR+LS hops
away).

> > > So you have enabled monitor_all=true as well? Or did you test at
> > > scale with monitor_all=false.
> >
> > We do use monitor_all=false, primarily to reduce memory footprint
> > (and also CPU cost of IDL processing) on each chassis. There are
> > trade-offs to the SB DB server performance:
> >
> > - On one hand it increases the cost of conditional monitoring, which
> >   is expensive for sure
> > - On the other hand, it reduces the total amount of data for the
> >   server to propagate to clients
> >
> > It really depends on your topology for making the choice. If most of
> > the nodes would anyway monitor most of the DB data (something similar
> > to a full-mesh), it is more reasonable to use monitor_all=true.
> > Otherwise, in topology like ovn-kubernetes where each node has its
> > dedicated part of the data, or in topologies where you have lots of
> > small "islands" such as a cloud with many small tenants that never
> > talk to each other, using monitor_all=false could make sense (but
> > still needs to be carefully evaluated and tested for your own use
> > cases).
>
> I didn't see recent scale testing for openstack, but in past testing we
> had to set monitor_all=true because the CPU usage of the SB ovsdb was
> a bottleneck.

To clarify a little more, openstack deployment can have different
logical topologies. So to evaluate the impact of monitor_all settings
there should be different test cases to capture different types of
deployment, e.g. full-mesh topology (monitor_all=true is better) v.s.
"small islands" topology (monitor_all=false is reasonable).

> > > The memory usage would be reduced but I don't know to which point.
> > > One of the main consumers is the logical flows table which is
> > > required everywhere. Unless there is a way to only sync a portion
> > > of this table depending on the chassis, disabling monitor_all would
> > > save syncing the unneeded tables for ovn-controller: chassis, port
> > > bindings, etc.
> >
> > Probably it wasn't what you meant, but I'd like to clarify that it is
> > not about unneeded tables, but unneeded rows in those tables (mainly
> > logical_flow and port_binding). It indeed syncs only a portion of the
> > tables. It is not depending directly on chassis, but depending on
> > what port-bindings are on the chassis and what logical connectivity
> > those port-bindings have. So, again, the choice really depends on
> > your use cases.
>
> What about the FDB (mac-port) and MAC binding (ip-mac) tables? I
> thought ovn-controller does not need them. If that is the case,
> I thought that by default, the whole tables (not only some of their
> rows) were excluded from the synchronized data.

FDB and MAC_binding tables are used by ovn-controllers. They are
essentially the central storage for MAC tables of the distributed
logical switches (FDB) and ARP/Neighbour tables for distributed logical
routers (MAC_binding). A record can be populated by one chassis and
consumed by many other chassis.

monitor_all should work the same way for these tables: if monitor_all =
false, only rows related to "local datapaths" should be downloaded to
the chassis. However, for the FDB table, the condition is not set for
now (which may have been a miss in the initial implementation). Perhaps
this was not noticed because MAC learning is not a very widely used
feature and no scale impact was observed, but I just proposed a patch to
enable conditional monitoring:
https://patchwork.ozlabs.org/project/ovn/patch/20231001192658.1012806-1-hz...@ovn.org/

Thanks,
Han

> Thanks!
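[Editor's note: the conditional monitoring discussed here is expressed
on the wire as an OVSDB "monitor_cond" request (the conditional variant
of the RFC 7047 "monitor" method, documented with ovsdb-server). The
sketch below only builds such a request as JSON for illustration; the
datapath UUID and the column list are made up, and a real client like
ovn-controller's IDL manages these conditions internally.]

```python
import json

# Hypothetical sketch: a monitor_cond request asking the SB server to
# send only Port_Binding rows belonging to one local datapath, instead
# of the whole table (which is what monitor_all=true amounts to).
def make_monitor_cond(db: str, datapath_uuid: str) -> dict:
    return {
        "id": 1,
        "method": "monitor_cond",
        "params": [
            db,
            None,  # monitor id, echoed back in update notifications
            {
                "Port_Binding": [{
                    "columns": ["logical_port", "chassis", "datapath"],
                    "where": [["datapath", "==", ["uuid", datapath_uuid]]],
                }],
            },
        ],
    }

req = make_monitor_cond("OVN_Southbound",
                        "11111111-2222-3333-4444-555555555555")
print(json.dumps(req, indent=2))
```

The server-side cost Han mentions comes from evaluating such "where"
clauses against every committed change, once per client condition.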
Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues
Hi Han,

thanks a lot for your detailed answer.

Han Zhou, Sep 30, 2023 at 01:03:
> > I think ovn-controller only consumes the logical flows. The chassis
> > and port bindings tables are used by northd to update these logical
> > flows.
>
> Felix was right. For example, port-binding is firstly a configuration
> from north-bound, but the states such as its physical location (the
> chassis column) are populated by ovn-controller of the owning chassis
> and consumed by other ovn-controllers that are interested in that
> port-binding.

I was not aware of this. Thanks.

> > Exactly, but was the signaling between the nodes ever an issue?
>
> I am not an expert of BGP, but at least for what I am aware of, there
> are scaling issues in things like BGP full mesh signaling, and there
> are solutions such as route reflector (which is again centralized) to
> solve such issues.

I am not familiar with BGP full mesh signaling. But from what I can
tell, it looks like the same concept as the full mesh of GENEVE tunnels.
Except that the tunnels are only used when the same logical switch is
implemented between two nodes.

> > So you have enabled monitor_all=true as well? Or did you test at
> > scale with monitor_all=false.
>
> We do use monitor_all=false, primarily to reduce memory footprint (and
> also CPU cost of IDL processing) on each chassis. There are trade-offs
> to the SB DB server performance:
>
> - On one hand it increases the cost of conditional monitoring, which
>   is expensive for sure
> - On the other hand, it reduces the total amount of data for the
>   server to propagate to clients
>
> It really depends on your topology for making the choice. If most of
> the nodes would anyway monitor most of the DB data (something similar
> to a full-mesh), it is more reasonable to use monitor_all=true.
> Otherwise, in topology like ovn-kubernetes where each node has its
> dedicated part of the data, or in topologies where you have lots of
> small "islands" such as a cloud with many small tenants that never
> talk to each other, using monitor_all=false could make sense (but
> still needs to be carefully evaluated and tested for your own use
> cases).

I didn't see recent scale testing for openstack, but in past testing we
had to set monitor_all=true because the CPU usage of the SB ovsdb was
a bottleneck.

> > The memory usage would be reduced but I don't know to which point.
> > One of the main consumers is the logical flows table which is
> > required everywhere. Unless there is a way to only sync a portion of
> > this table depending on the chassis, disabling monitor_all would save
> > syncing the unneeded tables for ovn-controller: chassis, port
> > bindings, etc.
>
> Probably it wasn't what you meant, but I'd like to clarify that it is
> not about unneeded tables, but unneeded rows in those tables (mainly
> logical_flow and port_binding). It indeed syncs only a portion of the
> tables. It is not depending directly on chassis, but depending on what
> port-bindings are on the chassis and what logical connectivity those
> port-bindings have. So, again, the choice really depends on your use
> cases.

What about the FDB (mac-port) and MAC binding (ip-mac) tables? I thought
ovn-controller does not need them. If that is the case, I thought that
by default, the whole tables (not only some of their rows) were excluded
from the synchronized data.

Thanks!
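[Editor's note: for readers following along, the monitor_all knob
discussed in this exchange is set per chassis through the local Open
vSwitch database, as the `ovn-monitor-all` external-id documented in
ovn-controller(8). The fragment below only shows the toggle under that
assumption; it is not a recommendation for either value.]

```shell
# Have ovn-controller replicate the entire southbound DB contents
# (no conditional monitoring; lower CPU cost on the SB server, higher
# memory usage on every chassis).
ovs-vsctl set open_vswitch . external_ids:ovn-monitor-all=true

# Revert to conditional monitoring: only rows related to local
# datapaths are downloaded to this chassis.
ovs-vsctl set open_vswitch . external_ids:ovn-monitor-all=false
```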
Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues
On Fri, Sep 29, 2023 at 7:26 AM Robin Jarry wrote:
>
> Hi Felix,
>
> Thanks a lot for your message.
>
> Felix Huettner, Sep 29, 2023 at 14:35:
> > I can get that when running 10k ovn-controllers the benefits of
> > optimizing cpu and memory load are quite significant. However i am
> > unsure about reducing the footprint of ovn-northd. When running so
> > many nodes i would have assumed that having an additional (or maybe
> > two) dedicated machines for ovn-northd would be completely
> > acceptable, as long as it can still actually do what it should in
> > a reasonable timeframe. Would the goal for ovn-northd be more like
> > "Reduce the full/incremental recompute time" then?

+1

> The main goal of this thread is to get a consensus on the actual issues
> that prevent scaling at the moment. We can discuss solutions in the
> other thread.

Thanks for the good discussions!

> > > * Allow support for alternative datapath implementations.
> >
> > Does this mean ovs datapaths (e.g. dpdk) or something different?
>
> See the other thread.
>
> > > Southbound Design
> > > =================
> ...
> > Note that also ovn-controller consumes the "state" of other chassis
> > to e.g. build the tunnels to other chassis. To visualize my
> > understanding
> >
> > +----------------+---------------+------------+
> > |                | configuration | state      |
> > +----------------+---------------+------------+
> > | ovn-northd     | write-only    | read-only  |
> > +----------------+---------------+------------+
> > | ovn-controller | read-only     | read-write |
> > +----------------+---------------+------------+
> > | some cms       | no access?    | read-only  |
> > +----------------+---------------+------------+
>
> I think ovn-controller only consumes the logical flows. The chassis and
> port bindings tables are used by northd to update these logical flows.

Felix was right. For example, port-binding is firstly a configuration
from north-bound, but the states such as its physical location (the
chassis column) are populated by ovn-controller of the owning chassis
and consumed by other ovn-controllers that are interested in that
port-binding.

> > > Centralized decisions
> > > =====================
> > >
> > > Every chassis needs to be "aware" of all other chassis in the
> > > cluster.
> >
> > I think we need to accept this as fundamental truth. Independent of
> > whether you look at centralized designs like ovn or the neutron-l2
> > implementation, or at decentralized designs like bgp or spanning
> > tree. In all cases if we need some kind of organized communication we
> > need to know all relevant peers. Designs might diverge if you need to
> > be "aware" of all peers or just some of them, but that is just a
> > tradeoff between data size and options you have to forward data.
> >
> > > This requirement mainly comes from overlay networks that are
> > > implemented over a full-mesh of point-to-point GENEVE tunnels (or
> > > VXLAN with some limitations). It is not a scaling issue by itself,
> > > but it implies a centralized decision which in turn puts pressure
> > > on the central node at scale.
> >
> > +1. On the other hand it removes signaling needs between the nodes
> > (like you would have with bgp).
>
> Exactly, but was the signaling between the nodes ever an issue?

I am not an expert of BGP, but at least for what I am aware of, there
are scaling issues in things like BGP full mesh signaling, and there are
solutions such as route reflector (which is again centralized) to solve
such issues.

> > > Due to ovsdb monitoring and caching, any change in the southbound
> > > DB (either by northd or by any of the chassis controllers) is
> > > replicated on every chassis. The monitor_all option is often
> > > enabled on large clusters to avoid the conditional monitoring CPU
> > > cost on the central node.
> >
> > This is, i guess, something that should be possible to fix. We have
> > also enabled this setting as it gave us stability improvements and we
> > do not yet see performance issues with it
>
> So you have enabled monitor_all=true as well? Or did you test at scale
> with monitor_all=false.

We do use monitor_all=false, primarily to reduce memory footprint (and
also CPU cost of IDL processing) on each chassis. There are trade-offs
to the SB DB server performance:

- On one hand it increases the cost of conditional monitoring, which is
  expensive for sure
- On the other hand, it reduces the total amount of data for the server
  to propagate to clients

It really depends on your topology for making the choice. If most of the
nodes would anyway monitor most of the DB data (something similar to
a full-mesh), it is more reasonable to use monitor_all=true. Otherwise,
in topology like ovn-kubernetes where each node has its dedicated part
of the data, or in topologies where you have lots of small "islands"
such as a cloud with many small tenants that never talk to each other,
using monitor_all=false could make sense (but
Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues
Hi Felix,

Thanks a lot for your message.

Felix Huettner, Sep 29, 2023 at 14:35:
> I can get that when running 10k ovn-controllers the benefits of
> optimizing cpu and memory load are quite significant. However i am
> unsure about reducing the footprint of ovn-northd. When running so many
> nodes i would have assumed that having an additional (or maybe two)
> dedicated machines for ovn-northd would be completely acceptable, as
> long as it can still actually do what it should in a reasonable
> timeframe. Would the goal for ovn-northd be more like "Reduce the
> full/incremental recompute time" then?

The main goal of this thread is to get a consensus on the actual issues
that prevent scaling at the moment. We can discuss solutions in the
other thread.

> > * Allow support for alternative datapath implementations.
>
> Does this mean ovs datapaths (e.g. dpdk) or something different?

See the other thread.

> > Southbound Design
> > =================
...
> Note that also ovn-controller consumes the "state" of other chassis to
> e.g. build the tunnels to other chassis. To visualize my understanding
>
> +----------------+---------------+------------+
> |                | configuration | state      |
> +----------------+---------------+------------+
> | ovn-northd     | write-only    | read-only  |
> +----------------+---------------+------------+
> | ovn-controller | read-only     | read-write |
> +----------------+---------------+------------+
> | some cms       | no access?    | read-only  |
> +----------------+---------------+------------+

I think ovn-controller only consumes the logical flows. The chassis and
port bindings tables are used by northd to update these logical flows.

> > Centralized decisions
> > =====================
> >
> > Every chassis needs to be "aware" of all other chassis in the
> > cluster.
>
> I think we need to accept this as fundamental truth. Independent of
> whether you look at centralized designs like ovn or the neutron-l2
> implementation, or at decentralized designs like bgp or spanning tree.
> In all cases if we need some kind of organized communication we need to
> know all relevant peers. Designs might diverge if you need to be
> "aware" of all peers or just some of them, but that is just a tradeoff
> between data size and options you have to forward data.
>
> > This requirement mainly comes from overlay networks that are
> > implemented over a full-mesh of point-to-point GENEVE tunnels (or
> > VXLAN with some limitations). It is not a scaling issue by itself,
> > but it implies a centralized decision which in turn puts pressure on
> > the central node at scale.
>
> +1. On the other hand it removes signaling needs between the nodes
> (like you would have with bgp).

Exactly, but was the signaling between the nodes ever an issue?

> > Due to ovsdb monitoring and caching, any change in the southbound DB
> > (either by northd or by any of the chassis controllers) is replicated
> > on every chassis. The monitor_all option is often enabled on large
> > clusters to avoid the conditional monitoring CPU cost on the central
> > node.
>
> This is, i guess, something that should be possible to fix. We have
> also enabled this setting as it gave us stability improvements and we
> do not yet see performance issues with it

So you have enabled monitor_all=true as well? Or did you test at scale
with monitor_all=false?

What I am saying is that without monitor_all=true, the southbound
ovsdb-server needs to do checks to determine what updates to send to
which client. Since the server is single threaded, it becomes an issue
at scale. I know that there were some significant improvements made
recently but it will only push the limit further. I don't have hard data
to prove my point yet unfortunately.

> > This leads to high memory usage on all chassis, control plane traffic
> > and possible disruptions in the ovs-vswitchd datapath flow cache.
> > Unfortunately, I don't have any hard data to back this claim. This is
> > mainly coming from discussions I had with neutron contributors and
> > from brainstorming sessions with colleagues.
>
> Could you maybe elaborate on the datapath flow cache issue, as it
> sounds like it might affect actual live traffic and i am not aware of
> details there.

I may have had a wrong understanding of the mechanisms of OVS here.
I was under the impression that any update of the openflow rules would
invalidate all datapath flows. It is far more subtle than this [1]. So
unless there is an actual change in the packet pipeline, live traffic
should not be affected.

[1] https://developers.redhat.com/articles/2022/10/19/open-vswitch-revalidator-process-explained

> The memory usage and the traffic would be fixed by not having to rely
> on monitor_all, right?

The memory usage would be reduced but I don't know to which point. One
of the main consumers is the logical flows table which is required
everywhere. Unless there is a way to only sync a portion of this table
Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues
Hi Robin and everyone else,

On Thu, Sep 28, 2023 at 05:18:19PM +0200, Robin Jarry via discuss wrote:
> Hello OVN community,
>
> I'm glad the subject of this message has caught your attention :-)
>
> I would like to start a discussion about how we could improve OVN on
> the following topics:
>
> * Reduce the memory and CPU footprint of ovn-controller, ovn-northd.

I can get that when running 10k ovn-controllers the benefits of
optimizing cpu and memory load are quite significant. However i am
unsure about reducing the footprint of ovn-northd. When running so many
nodes i would have assumed that having an additional (or maybe two)
dedicated machines for ovn-northd would be completely acceptable, as
long as it can still actually do what it should in a reasonable
timeframe. Would the goal for ovn-northd be more like "Reduce the
full/incremental recompute time" then?

> * Support scaling of L2 connectivity across larger clusters.
> * Simplify CMS interoperability.
> * Allow support for alternative datapath implementations.

Does this mean ovs datapaths (e.g. dpdk) or something different?

> This first email will focus on the current issues that (in my view) are
> preventing OVN from scaling L2 networks on larger clusters. I will send
> another message with some change proposals to remove or fix these
> issues.
>
> Disclaimer:
>
> I am fairly new to this project and my perception and understanding may
> be incorrect in some aspects. Please forgive me in advance if I use the
> wrong terms and/or make invalid statements. My intent is only to make
> things better and not to put the blame on anyone for the current design
> choices.

Please apply the same disclaimer to my comments as well.

> Southbound Design
> =================
>
> In the current architecture, both databases contain a mix of state and
> configuration. While this does not seem to cause any scaling issues for
> the northbound DB, it can become a bottleneck for the southbound with
> large numbers of chassis and logical network constructs.
>
> The southbound database contains a mix of configuration (logical flows
> transformed from the logical network topology) and state (chassis, port
> bindings, mac bindings, FDB entries, etc.).
>
> The "configuration" part is consumed by ovn-controller to implement the
> network on every chassis and the "state" part is consumed by ovn-northd
> to update the northbound "state" entries and to update logical flows.
> Some CMSs [1] also depend on the southbound "state" in order to
> function properly.

Note that also ovn-controller consumes the "state" of other chassis to
e.g. build the tunnels to other chassis. To visualize my understanding

+----------------+---------------+------------+
|                | configuration | state      |
+----------------+---------------+------------+
| ovn-northd     | write-only    | read-only  |
+----------------+---------------+------------+
| ovn-controller | read-only     | read-write |
+----------------+---------------+------------+
| some cms       | no access?    | read-only  |
+----------------+---------------+------------+

> [1] https://opendev.org/openstack/neutron/src/tag/22.0.0/neutron/agent/ovn/metadata/ovsdb.py#L39-L40
>
> Centralized decisions
> =====================
>
> Every chassis needs to be "aware" of all other chassis in the cluster.

I think we need to accept this as fundamental truth. Independent of
whether you look at centralized designs like ovn or the neutron-l2
implementation, or at decentralized designs like bgp or spanning tree.
In all cases if we need some kind of organized communication we need to
know all relevant peers. Designs might diverge if you need to be "aware"
of all peers or just some of them, but that is just a tradeoff between
data size and options you have to forward data.

> This requirement mainly comes from overlay networks that are
> implemented over a full-mesh of point-to-point GENEVE tunnels (or VXLAN
> with some limitations). It is not a scaling issue by itself, but it
> implies a centralized decision which in turn puts pressure on the
> central node at scale.

+1. On the other hand it removes signaling needs between the nodes (like
you would have with bgp).

> Due to ovsdb monitoring and caching, any change in the southbound DB
> (either by northd or by any of the chassis controllers) is replicated
> on every chassis. The monitor_all option is often enabled on large
> clusters to avoid the conditional monitoring CPU cost on the central
> node.

This is, i guess, something that should be possible to fix. We have also
enabled this setting as it gave us stability improvements and we do not
yet see performance issues with it

> This leads to high memory usage on all chassis, control plane traffic
> and possible disruptions in the ovs-vswitchd datapath flow cache.
> Unfortunately, I don't have any hard data to back this claim. This is
> mainly coming from discussions I had with neutron contributors and from