Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues

2023-10-03 Thread Mark Michelson via discuss

Hi Robin,

Thanks a bunch for putting these two emails together. I've read through 
them and the replies.


I think there's one major issue: a lack of data.

I think the four bullet points you listed below are admirable goals. The 
problem is that I think we're putting the cart before the horse with 
both the issues and proposals.


In other words, before being able to properly evaluate these emails, we 
need to see a scenario that

1) Has clear goals for what scalability metrics are desired.
2) Shows evidence that these scalability goals are not being met.
3) Shows evidence that one or more of the issues listed in this email 
are the cause for the scalability issues in the scenario.
4) Shows evidence that the proposed changes would fix the scalability 
issues in the scenario.


I listed them in this order because without a failing scenario, we 
can't claim that scalability is poor. If we do have a failing scenario, 
it's possible that the problem and solution are much simpler than any 
of the issues or proposals that have been brought up here. It's also 
possible that only a subset of the issues listed in this email is 
contributing to the failure. Even if the issues identified here are 
directly causing the scenario to fail, there may still be simpler 
solutions than what has been proposed. And finally, it's possible that 
the proposed solutions don't actually result in the expected scale increase.


I want to make sure my tone is coming across clearly here. I don't think 
the current OVN architecture is perfect, and I don't want to be 
dismissive of the issues you've raised. If there are changes we can make 
to simplify OVN and scale better at the same time, I'm all for it. The 
problem is that, as you pointed out in your proposal email, most of 
these proposals result in difficulties for upgrades/downgrades, as well 
as code maintenance. Therefore, if we are going to do any of these, we 
need to first be certain that we aren't scaling as well as we would 
like, and that there are not simpler paths to reach our scalability targets.


On 9/28/23 11:18, Robin Jarry wrote:

Hello OVN community,

I'm glad the subject of this message has caught your attention :-)

I would like to start a discussion about how we could improve OVN on the
following topics:

* Reduce the memory and CPU footprint of ovn-controller, ovn-northd.
* Support scaling of L2 connectivity across larger clusters.
* Simplify CMS interoperability.
* Allow support for alternative datapath implementations.

This first email will focus on the current issues that (in my view) are
preventing OVN from scaling L2 networks on larger clusters. I will send
another message with some change proposals to remove or fix these
issues.

Disclaimer:

I am fairly new to this project and my perception and understanding may
be incorrect in some aspects. Please forgive me in advance if I use the
wrong terms and/or make invalid statements. My intent is only to make
things better and not to put the blame on anyone for the current design
choices.

Southbound Design
=================

In the current architecture, both databases contain a mix of state and
configuration. While this does not seem to cause any scaling issues for
the northbound DB, it can become a bottleneck for the southbound with
large numbers of chassis and logical network constructs.

The southbound database contains a mix of configuration (logical flows
transformed from the logical network topology) and state (chassis, port
bindings, mac bindings, FDB entries, etc.).

The "configuration" part is consumed by ovn-controller to implement the
network on every chassis and the "state" part is consumed by ovn-northd
to update the northbound "state" entries and to update logical flows.
Some CMSs [1] also depend on the southbound "state" in order to
function properly.

[1] 
https://opendev.org/openstack/neutron/src/tag/22.0.0/neutron/agent/ovn/metadata/ovsdb.py#L39-L40
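
The configuration/state split described above can be seen directly with
the stock ovn-sbctl tooling on a running deployment (the table names
below come from the southbound schema; output omitted):

    # "configuration": logical flows computed by ovn-northd
    ovn-sbctl lflow-list

    # "state": runtime entries maintained by the chassis controllers
    ovn-sbctl list Chassis
    ovn-sbctl list Port_Binding
    ovn-sbctl list MAC_Binding
    ovn-sbctl list FDB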

Centralized decisions
=====================

Every chassis needs to be "aware" of all other chassis in the cluster.
This requirement mainly comes from overlay networks that are implemented
over a full-mesh of point-to-point GENEVE tunnels (or VXLAN with some
limitations). It is not a scaling issue by itself, but it implies
a centralized decision which in turn puts pressure on the central node
at scale.
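
To put rough numbers on this: a full mesh of N chassis requires N-1
tunnel ports on every chassis and N*(N-1)/2 point-to-point tunnels
cluster-wide. At N = 10,000 that is 9,999 tunnel ports per chassis and
roughly 50 million tunnels in total, all of which must be created and
kept up to date as chassis come and go.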

Due to ovsdb monitoring and caching, any change in the southbound DB
(either by northd or by any of the chassis controllers) is replicated on
every chassis. The monitor_all option is often enabled on large clusters
to avoid the conditional monitoring CPU cost on the central node.
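
For reference, monitor_all is an ovn-controller setting stored in the
local Open_vSwitch table, for example:

    # replicate the entire southbound DB on this chassis instead of
    # using conditional monitoring
    ovs-vsctl set open . external-ids:ovn-monitor-all=true

This trades CPU on the central node (no per-chassis monitor conditions
to evaluate) for memory and control plane traffic on every chassis.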

This leads to high memory usage on all chassis, increased control 
plane traffic, and possible disruptions in the ovs-vswitchd datapath 
flow cache.
Unfortunately, I don't have any hard data to back this claim. This is
mainly coming from discussions I had with neutron contributors and from
brainstorming sessions with colleagues.

I hope that the current work on OVN 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-10-03 Thread Robin Jarry via discuss
Frode Nordahl, Oct 02, 2023 at 10:05:
> > I must admit I don't have enough field experience operating large
> > network fabrics to state what issues multicast can cause with these.
> > This is why I raised this in the cons list :)
> >
> > What specific issues did you have in mind?
>
> It has been a while since I was overseeing the operations of large
> metro networks, but I have vivid memories of multicast routing being a
> recurring issue. A fabric supporting 10k computes would most likely
> not be one large L2, there would be L3 routing involved and as a
> consequence your proposal imposes configuration and scale requirements
> on the fabric.
>
> Datapoints that suggest other people see this as an issue too can be
> found in the fact that popular top of rack vendors have chosen control
> plane based MAC learning for their EVPN implementations (RFC 7432).
> There are also multiple papers discussing the scaling issues of
> Multicast.

Thanks, I will try to educate myself better about this :)



Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-10-03 Thread Robin Jarry via discuss
Hi all,

Felix Huettner, Oct 02, 2023 at 09:35:
> Hi everyone,
>
> just want to add my experience below
> On Sun, Oct 01, 2023 at 03:49:31PM -0700, Han Zhou wrote:
> > On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry wrote:
> > >
> > > Hi Han,
> > >
> > > Please see my comments/questions inline.
> > >
> > > Han Zhou, Sep 30, 2023 at 21:59:
> > > > > Distributed mac learning
> > > > > ========================
> > > > >
> > > > > Use one OVS bridge per logical switch with mac learning
> > > > > enabled. Only create the bridge if the logical switch has
> > > > > a port bound to the local chassis.
> > > > >
> > > > > Pros:
> > > > >
> > > > > - Minimal openflow rules required in each bridge (ACLs and NAT
> > > > >   mostly).
> > > > > - No central mac binding table required.
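
For concreteness, a minimal sketch of this idea with stock OVS tooling
could look as follows (bridge and port names are made up and the aging
timeout is arbitrary):

    # one OVS bridge per logical switch, relying on the builtin OVS
    # MAC learning with a 300 second aging time
    ovs-vsctl add-br ls-blue
    ovs-vsctl set bridge ls-blue other-config:mac-aging-time=300
    # connect it to br-int with a patch port pair
    ovs-vsctl add-port br-int patch-int-to-blue \
        -- set interface patch-int-to-blue type=patch \
           options:peer=patch-blue-to-int
    ovs-vsctl add-port ls-blue patch-blue-to-int \
        -- set interface patch-blue-to-int type=patch \
           options:peer=patch-int-to-blue
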
> > > >
> > > > Firstly to clarify the terminology of "mac binding" to avoid
> > > > confusion, the mac_binding table currently in SB DB has nothing
> > > > to do with L2 MAC learning. It is actually the ARP/Neighbor
> > > > table of distributed logical routers. We should probably call it
> > > > IP_MAC_binding table, or just Neighbor table.
> > >
> > > Yes sorry about the confusion. I actually meant the FDB table.
> > >
> > > > Here what you mean is actually L2 MAC learning, which today is
> > > > implemented by the FDB table in SB DB, and it is only for
> > > > uncommon use cases when the NB doesn't have the knowledge of
> > > > a MAC address of a VIF.
> > >
> > > This is not that uncommon in telco use cases where VNFs can send
> > > packets from mac addresses unknown to OVN.
> > >
> > Understood, but VNFs contribute a very small portion of the
> > workloads, right? Maybe I should rephrase that: it is uncommon to
> > have "unknown" addresses for the majority of ports in a large scale
> > cloud. Is this understanding correct?
>
> I can only share numbers for our usecase with ~650 chassis we have the
> following distribution of "unknown" in the `addresses` field of
> Logical_Switch_Port:
> * 23000 with a mac address + ip and without "unknown"
> * 250 with a mac address + ip and with "unknown"
> * 30 with just "unknown"
>
> The usecase is a generic public cloud and we do not have any telco
> related things.
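
For readers unfamiliar with the "unknown" keyword: it is set in the
addresses column of a Logical_Switch_Port in the northbound DB, for
example (port name hypothetical):

    # known MAC+IP, but also accept traffic from unlisted source MACs
    ovn-nbctl lsp-set-addresses lsp0 "fa:16:3e:12:34:56 10.0.0.5" unknown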

I don't have any numbers from telco deployments at hand but I will poke
around.

> > > > The purpose of this proposal is clear - to avoid using a central
> > > > table in DB for L2 information but instead using L2 MAC learning
> > > > to populate such information on chassis, which is a reasonable
> > > > alternative with pros and cons.
> > > > However, I don't think it is necessary to use separate OVS
> > > > bridges for this purpose. L2 MAC learning can be easily
> > > > implemented in the br-int bridge with OVS flows, which is much
> > > > simpler than managing dynamic number of OVS bridges just for the
> > > > purpose of using the builtin OVS mac-learning.
> > >
> > > I agree that this could also be implemented with VLAN tags on the
> > > appropriate ports. But since OVS does not support trunk ports, it
> > > may require complicated OF pipelines. My intent with this idea was
> > > two fold:
> > >
> > > 1) Avoid a central point of failure for mac learning/aging.
> > > 2) Simplify the OF pipeline by making all FDB operations dynamic.
> >
> > IMHO, the L2 pipeline is not really complex. It is probably the
> > simplest part (compared with other features for L3, NAT, ACL, LB,
> > etc.). Adding dynamic learning to this part probably makes it *a
> > little* more complex, but should still be straightforward. We don't
> > need any VLAN tag because the incoming packet has geneve VNI in the
> > metadata. We just need a flow that resubmits to lookup
> > a MAC-tunnelSrc mapping table, and inject a new flow (with related
> > tunnel endpoint information) if the src MAC is not found, with the
> > help of the "learn" action. The entries are per-logical_switch
> > (VNI). This would serve your purpose of avoiding a central DB for
> > L2. At least this looks much simpler to me than managing dynamic
> > number of OVS bridges and the patch pairs between them.
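
For concreteness, here is a rough sketch of what Han describes, using
the OVS "learn" action. The table numbers are made up and do not
reflect OVN's actual pipeline; assume table 10 sees packets with the
logical switch identifier in tun_id:

    # learn a per-VNI "MAC -> tunnel endpoint" mapping from every
    # packet, then continue the pipeline in table 11
    ovs-ofctl add-flow br-int "table=10, priority=100, \
        actions=learn(table=11, hard_timeout=300, \
                      NXM_NX_TUN_ID[], \
                      NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[], \
                      load:NXM_NX_TUN_IPV4_SRC[]->NXM_NX_TUN_IPV4_DST[], \
                      output:NXM_OF_IN_PORT[]), \
        resubmit(,11)"
    # unknown destinations fall back to flooding
    ovs-ofctl add-flow br-int "table=11, priority=0, actions=FLOOD"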

Would that learn-based approach also work for non-GENEVE (localnet)
networks where there is no VNI?


> >
> > >
> > > > Now back to the distributed MAC learning idea itself.
> > > > Essentially for two VMs/pods to communicate on L2, say,
> > > > VM1@Chassis1 needs to send a packet to VM2@chassis2, assuming
> > > > VM1 already has VM2's MAC address (we will discuss this later),
> > > > Chassis1 needs to know that VM2's MAC is located on Chassis2.
> > > >
> > > > In OVN today this information is conveyed through:
> > > >
> > > > - MAC and LSP mapping (NB -> northd -> SB -> Chassis)
> > > > - LSP and Chassis mapping (Chassis -> SB -> Chassis)
> > > >
> > > > In your proposal:
> > > >
> > > > - MAC and Chassis mapping (can be learned through initial L2
> > > >   broadcast/flood)
> > > >
> > > > This indeed would avoid the control plane cost through the
> > > > centralized components (for this L2 binding