Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-10-02 Thread Felix Huettner via discuss
Hi everyone,

I just want to add my experience below.
On Sun, Oct 01, 2023 at 03:49:31PM -0700, Han Zhou wrote:
> On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry  wrote:
> >
> > Hi Han,
> >
> > Please see my comments/questions inline.
> >
> > Han Zhou, Sep 30, 2023 at 21:59:
> > > > Distributed mac learning
> > > > ========================
> > > >
> > > > Use one OVS bridge per logical switch with mac learning enabled. Only
> > > > create the bridge if the logical switch has a port bound to the local
> > > > chassis.
> > > >
> > > > Pros:
> > > >
> > > > - Minimal openflow rules required in each bridge (ACLs and NAT mostly).
> > > > - No central mac binding table required.
> > >
> > > First, to clarify the terminology of "mac binding" to avoid confusion:
> > > the mac_binding table currently in the SB DB has nothing to do with L2
> > > MAC learning. It is actually the ARP/Neighbor table of distributed
> > > logical routers. We should probably call it the IP_MAC_binding table,
> > > or just the Neighbor table.
> >
Yes, sorry about the confusion. I actually meant the FDB table.
> >
> > > Here what you mean is actually L2 MAC learning, which today is
> > > implemented by the FDB table in the SB DB, and it is only for uncommon
> > > use cases where the NB doesn't know the MAC address of a VIF.
> >
> > This is not that uncommon in telco use cases where VNFs can send packets
> > from mac addresses unknown to OVN.
> >
> Understood, but VNFs contribute a very small portion of the workloads,
> right? Maybe I should rephrase that: it is uncommon to have "unknown"
> addresses for the majority of ports in a large-scale cloud. Is this
> understanding correct?

I can only share numbers for our use case: with ~650 chassis we have the
following distribution of "unknown" in the `addresses` field of
Logical_Switch_Port:
* 23000 with a mac address + ip and without "unknown"
* 250 with a mac address + ip and with "unknown"
* 30 with just "unknown"

The use case is a generic public cloud and we do not have any
telco-related workloads.
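
For readers unfamiliar with what is being counted here: the three cases
correspond to `addresses` values set on Logical_Switch_Port, roughly as
follows (port names and addresses invented):

```
# MAC + IP, no "unknown": OVN knows all addresses behind the port.
ovn-nbctl lsp-set-addresses lsp-a "0a:00:00:00:00:01 10.0.0.11"
# MAC + IP plus "unknown": known addresses, but other source MACs allowed.
ovn-nbctl lsp-set-addresses lsp-b "0a:00:00:00:00:02 10.0.0.12" unknown
# Only "unknown": OVN has no knowledge of the MACs behind the port.
ovn-nbctl lsp-set-addresses lsp-c unknown
```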

>
> > > The purpose of this proposal is clear - to avoid using a central table
> > > in the DB for L2 information but instead using L2 MAC learning to
> > > populate such information on chassis, which is a reasonable alternative
> > > with pros and cons.
> > > However, I don't think it is necessary to use separate OVS bridges for
> > > this purpose. L2 MAC learning can be easily implemented in the br-int
> > > bridge with OVS flows, which is much simpler than managing a dynamic
> > > number of OVS bridges just for the purpose of using the builtin OVS
> > > mac-learning.
> >
> > I agree that this could also be implemented with VLAN tags on the
> > appropriate ports. But since OVS does not support trunk ports, it may
> > require complicated OF pipelines. My intent with this idea was twofold:
> >
> > 1) Avoid a central point of failure for mac learning/aging.
> > 2) Simplify the OF pipeline by making all FDB operations dynamic.
>
> IMHO, the L2 pipeline is not really complex. It is probably the simplest
> part (compared with other features for L3, NAT, ACL, LB, etc.).
> Adding dynamic learning to this part probably makes it *a little* more
> complex, but should still be straightforward. We don't need any VLAN tag
> because the incoming packet has the geneve VNI in the metadata. We just
> need a flow that resubmits to look up a MAC-tunnelSrc mapping table, and
> injects a new flow (with the related tunnel endpoint information) if the
> src MAC is not found, with the help of the "learn" action. The entries are
> per-logical_switch (VNI). This would serve your purpose of avoiding a
> central DB for L2. At least this looks much simpler to me than managing a
> dynamic number of OVS bridges and the patch pairs between them.
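
A minimal sketch of the kind of learn-based flows described above (table
numbers, timeouts, and the flood fallback are invented, not the actual OVN
pipeline):

```
# For a packet arriving over a tunnel: remember that its source MAC, within
# this VNI (tun_id), lives behind the sending tunnel endpoint, then resubmit
# to look the destination MAC up among the learned entries.
ovs-ofctl add-flow br-int "table=30,priority=100,\
  actions=learn(table=31,hard_timeout=300,\
    NXM_NX_TUN_ID[],NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],\
    load:NXM_NX_TUN_IPV4_SRC[]->NXM_NX_TUN_IPV4_DST[]),\
  resubmit(,31)"
# Learned entries in table 31 set tun_dst for known MACs (a later stage
# outputs to the single tunnel port). Unknown destinations fall through
# to a hypothetical flood stage.
ovs-ofctl add-flow br-int "table=31,priority=0,actions=resubmit(,32)"
```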
>
> >
> > > Now back to the distributed MAC learning idea itself. Essentially, for
> > > two VMs/pods to communicate on L2, say, VM1@Chassis1 needs to send a
> > > packet to VM2@Chassis2. Assuming VM1 already has VM2's MAC address (we
> > > will discuss this later), Chassis1 needs to know that VM2's MAC is
> > > located on Chassis2.
> > >
> > > In OVN today this information is conveyed through:
> > >
> > > - MAC and LSP mapping (NB -> northd -> SB -> Chassis)
> > > - LSP and Chassis mapping (Chassis -> SB -> Chassis)
> > >
> > > In your proposal:
> > >
> > > - MAC and Chassis mapping (can be learned through initial L2
> > >   broadcast/flood)
> > >
> > > This indeed would avoid the control plane cost through the centralized
> > > components (for this L2 binding part). Given that today's SB OVSDB is a
> > > bottleneck, this idea may sound attractive. But please also take into
> > > consideration the below improvement that could mitigate the OVN central
> > > scale issue:
> > >
> > > - For MAC and LSP mapping, northd is now capable of incrementally
> > >   processing VIF-related L2/L3 changes, so the cost of NB -> northd ->
> > >   SB is very small. For SB -> Chassis, a more scalable DB deployment,
> > 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-10-02 Thread Frode Nordahl via discuss
On Sat, Sep 30, 2023 at 3:50 PM Robin Jarry  wrote:

[ snip ]

> > > Also, will such behavior be compatible with HW offload to
> > > smartnics/DPUs?
> >
> > I am also a bit concerned about this, what would be the typical number
> > of bridges supported by hardware?
>
> As far as I understand, only the datapath flows are offloaded to
> hardware. The OF pipeline is only parsed when there is an upcall for the
> first packet. Once resolved, the datapath flow is reused. OVS bridges
> are only logical constructs, they are neither reflected in the datapath
> nor in hardware.

True, but you never know what odd bugs might pop out when doing things
like this, hence my concern :)
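
For reference, the split Robin describes can be observed on a running
system; the OpenFlow tables stay in userspace while the datapath flows
carry the offload state:

```
# Datapath flows that were actually offloaded to hardware (if any):
ovs-appctl dpctl/dump-flows type=offloaded
# The OpenFlow pipeline itself, consulted only on upcalls:
ovs-ofctl dump-flows br-int
```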

> > > >>> Use multicast for overlay networks
> > > >>> ==================================
> > > > [snip]
> > > >>> - 24-bit VNI allows for more than 16 million logical switches. No need
> > > >>>  for extended GENEVE tunnel options.
> > > >> Note that using vxlan at the moment significantly reduces the ovn
> > > >> featureset. This is because the geneve header options are currently
> > > >> used for data that would not fit into the vxlan vni.
> > > >>
> > > >> From ovn-architecture.7.xml:
> > > >> ```
> > > >> The maximum number of networks is reduced to 4096.
> > > >> The maximum number of ports per network is reduced to 2048.
> > > >> ACLs matching against logical ingress port identifiers are not supported.
> > > >> OVN interconnection feature is not supported.
> > > >> ```
> > > >
> > > > In my understanding, the main reason why GENEVE replaced VXLAN is
> > > > that OpenStack uses full-mesh point-to-point tunnels, and the sender
> > > > needs to know behind which chassis any mac address is located to
> > > > send packets into the correct tunnel. GENEVE allowed reducing the
> > > > lookup time both on the sender and receiver thanks to ingress/egress
> > > > port metadata.
> > > >
> > > > https://blog.russellbryant.net/2017/05/30/ovn-geneve-vs-vxlan-does-it-matter/
> > > > https://dani.foroselectronica.es/ovn-geneve-encapsulation-541/
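
For context, the metadata in question is roughly the following (per
ovn-architecture(7); the 12/11-bit VXLAN split is inferred from the
4096/2048 limits quoted earlier):

```
# GENEVE: the 24-bit VNI carries the logical datapath (switch/router) id;
#   a 32-bit option (class 0x0102, type 0x80) carries the logical ingress
#   port (15 bits) and logical egress port (16 bits) keys.
# VXLAN: only the 24-bit VNI is available, which OVN must split between
#   network id (12 bits) and port key (11 bits), hence the reduced
#   feature set quoted from ovn-architecture.7.xml above.
```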
> > > >
> > > > If VXLAN + multicast and address learning were used, the "correct" tunnel
> > > > would be established ad-hoc and both sender and receiver lookups would
> > > > only be a simple mac forwarding with learning. The ingress pipeline
> > > > would probably cost a little more.
> > > >
> > > > Maybe multicast + address learning could be implemented for GENEVE as
> > > > well. But it would not be interoperable with other VTEPs.
> >
> > While it is true that it takes time before switch hardware picks up
> > support for emerging protocols, I do not think it is a valid argument
> > for limiting the development of OVN. Most hardware offload capable
> > NICs already have GENEVE support, and if you survey recent or upcoming
> > releases from top of rack switch vendors you will also find that they
> > have added support for using GENEVE for hardware VTEPs. The fact that
> > SDNs with a large customer footprint (such as NSX and OVN) make use of
> > GENEVE is most likely a deciding factor for their adoption, and I see
> > no reason why we should stop defining the edge of development in this
> > space.
>
> GENEVE could perfectly well be used with a multicast-based control plane
> to establish ad-hoc tunnels without any centralized involvement.
>
> I was only proposing VXLAN since this multicast group system was part of
> the original RFC (supported in Linux since 3.12).
>
> > > >>> - Limited and scoped "flooding" with IGMP/MLD snooping enabled in
> > > >>>  top-of-rack switches. Multicast is only used for BUM traffic.
> > > >>> - Only one VXLAN output port per implemented logical switch on a given
> > > >>>  chassis.
> > > >>
> > > >> Would this actually work with one VXLAN output port? Would you not
> > > >> need one port per target node to send unicast traffic (as you
> > > >> otherwise flood all packets to all participating nodes)?
> > > >
> > > > You would need one VXLAN output port per implemented logical switch
> > > > on a given chassis. The port would have a VNI (unique per logical
> > > > switch) and an associated multicast IP address. Any chassis that
> > > > implements this logical switch would subscribe to that multicast
> > > > group. The flooding would be limited to first packets and
> > > > broadcast/multicast traffic (ARP requests, mostly). Once the receiver
> > > > node replies, all communication will happen over unicast.
> > > >
> > > > https://networkdirection.net/articles/routingandswitching/vxlanoverview/vxlanaddresslearning/#BUM_Traffic
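
As a rough sketch of the kernel-native mechanism being described here (VNI,
group address, and device name invented):

```
# One VXLAN device per logical switch; BUM traffic goes to the multicast
# group that all chassis implementing this logical switch subscribe to.
ip link add vxlan100 type vxlan id 100 group 239.1.1.100 \
    dev eth0 dstport 4789
# Unicast destinations are then learned from return traffic: the FDB ends
# up pointing at the concrete remote VTEP after the first reply.
bridge fdb show dev vxlan100
```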
> > > >
> > > >>> Cons:
> > > >>>
> > > >>> - OVS does not support VXLAN address learning yet.
> > > >>> - The number of usable multicast groups in a fabric network may be
> > > >>>  limited?
> > > >>> - How to manage seamless upgrades and interoperability with older OVN
> > > >>>  versions?
> > > >> - This pushes all logic related to chassis management to the
> > > >>  underlying networking fabric. It thereby places additional
> > > >>

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues

2023-10-02 Thread Robin Jarry via discuss
Han Zhou, Oct 01, 2023 at 21:30:
> Please note that tunnels are needed not only between nodes related to same
> logical switches, but also when they are related to different logical
> switches connected by logical routers (even multiple LR+LS hops away).

Yep.

> To clarify a little more, OpenStack deployments can have different logical
> topologies. So to evaluate the impact of monitor_all settings there should
> be different test cases to capture different types of deployment, e.g.
> full-mesh topology (monitor_all=true is better) vs. "small islands"
> topology (monitor_all=false is reasonable).
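
For readers following along, monitor_all is a per-chassis setting read by
ovn-controller from the local Open_vSwitch table, e.g.:

```
# Full-mesh style deployments: download everything, skip conditional monitoring.
ovs-vsctl set open . external_ids:ovn-monitor-all=true
# "Small islands" topologies: only rows related to local datapaths.
ovs-vsctl set open . external_ids:ovn-monitor-all=false
```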

This is one thing to note for the recent ovn-heater work that adds
OpenStack test cases.

> FDB and MAC_binding tables are used by ovn-controllers. They are
> essentially the central storage for the MAC tables of distributed logical
> switches (FDB) and the ARP/Neighbour tables of distributed logical routers
> (MAC_binding). A record can be populated by one chassis and consumed by
> many other chassis.
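
Roughly, the rows in question look like this (values invented; see
ovn-sb(5) for the actual schemas):

```
$ ovn-sbctl list FDB           # L2: MAC learned on a logical switch
mac      : "0a:00:00:00:00:42"
dp_key   : 5    # tunnel key of the logical datapath (logical switch)
port_key : 7    # tunnel key of the logical port the MAC was seen on

$ ovn-sbctl list MAC_Binding   # L3: neighbour entry of a logical router
logical_port : "lrp-ext"
ip           : "172.16.0.10"
mac          : "0a:00:00:00:00:43"
datapath     : <logical router datapath uuid>
```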
>
> monitor_all should work the same way for these tables: if monitor_all =
> false, only rows related to "local datapaths" should be downloaded to the
> chassis. However, for the FDB table the condition is not set for now
> (which may have been an oversight in the initial implementation). Perhaps
> this went unnoticed because MAC learning is not a widely used feature and
> no scale impact was observed, but I just proposed a patch to enable
> conditional monitoring:
> https://patchwork.ozlabs.org/project/ovn/patch/20231001192658.1012806-1-hz...@ovn.org/

Ok thanks!
