Thanks a lot for starting this discussion.

On Sat, Sep 30, 2023 at 9:43 AM Vladislav Odintsov via discuss
<ovs-discuss@openvswitch.org> wrote:
>
> Hi Robin,
>
> Please, see inline.
>
> regards,
> Vladislav Odintsov
>
> > On 29 Sep 2023, at 18:14, Robin Jarry via discuss 
> > <ovs-discuss@openvswitch.org> wrote:
> >
> > Felix Huettner, Sep 29, 2023 at 15:23:
> >>> Distributed mac learning
> >>> ========================
> > [snip]
> >>>
> >>> Cons:
> >>>
> >>> - How to manage seamless upgrades?
> >>> - Requires ovn-controller to move/plug ports in the correct bridge.
> >>> - Multiple openflow connections (one per managed bridge).
> >>> - Requires ovn-trace to be reimplemented differently (maybe other tools
> >>>  as well).
> >>
> >> - No central information anymore on mac bindings. All nodes need to
> >>  update their data individually
> >> - Each bridge also creates a Linux network interface. I do not know if
> >>  there is some limit on the number of Linux interfaces or OVS bridges
> >>  somewhere.
> >
> > That's a good point. However, only the bridges related to one
> > implemented logical network would need to be created on a single
> > chassis. Even with the largest OVN deployments, I doubt this would be
> > a limitation.
> >
> >> Would you still preprovision static mac addresses on the bridge for all
> >> port_bindings we know the mac address from, or would you rather leave
> >> that up for learning as well?
> >
> > I would leave everything dynamic.
> >
> >> I do not know if there is some kind of performance/optimization penalty
> >> for moving packets between different bridges.
> >
> > As far as I know, once the openflow pipeline has been resolved into
> > a datapath flow, there is no penalty.
> >
> >> You also cannot only create the logical switches that have a local port
> >> bound. Assume the following topology:
> >> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> >> |vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
> >> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> >> vm1 and vm2 are both running on the same hypervisor. Creating only local
> >> logical switches would mean only ls1 and ls3 are available on that
> >> hypervisor. This would break the connection between the two vms which
> >> would in the current implementation just traverse the two logical
> >> routers.
> >> I guess we would need to create bridges for each locally reachable
> >> logical switch. I am concerned about the potentially significant
> >> increase in bridges and openflow connections this brings.
> >
> > That is one of the concerns I raised in the last point. In my opinion
> > this is a trade off. You remove centralization and require more local
> > processing. But overall, the processing cost should remain equivalent.
>
> Just want to clarify.
> For the topology described by Felix above, you propose to create 2 OVS bridges,
> right? How will the packet traverse from vm1 to vm2?
>
> Currently, when the packet enters OVS, all the logical switching and routing
> openflow calculation is done without the packet re-entering OVS, and this
> results in one DP flow match to deliver the packet from vm1 to vm2 (if no
> conntrack is used, which could introduce recirculations).
> Do I understand correctly that in this proposal OVS needs to receive the packet
> from the “ls1” bridge, then run it through the “lr1” logical router OpenFlow
> pipeline, then output the packet to the “ls2” OVS bridge for mac learning
> between logical routers (should we have an OF flow with a learn action here?),
> then send the packet again to OVS, calculate the “lr2” OpenFlow pipeline and
> finally reach the destination OVS bridge “ls3” to send the packet to vm2?
>
> Also, will such behavior be compatible with HW offload to smartnics/DPUs?

I am also a bit concerned about this. What would be the typical number
of bridges supported by hardware?
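
To make the per-bridge model discussed above a bit more concrete, here is a
minimal sketch of what distributed MAC learning could look like with plain OVS
bridges, patch ports and the NORMAL action (all names are made up, and OVN's
actual implementation would of course differ):

```
# Hypothetical bridge names matching Felix's topology (ls1, lr1).
ovs-vsctl add-br ls1
ovs-vsctl add-br lr1
# Connect the logical switch bridge to the logical router bridge
# with a patch port pair.
ovs-vsctl add-port ls1 p-ls1-lr1 -- set interface p-ls1-lr1 \
    type=patch options:peer=p-lr1-ls1
ovs-vsctl add-port lr1 p-lr1-ls1 -- set interface p-lr1-ls1 \
    type=patch options:peer=p-ls1-lr1
# With the NORMAL action, OVS does MAC learning per bridge.
ovs-ofctl add-flow ls1 "priority=0,actions=NORMAL"
# Inspect the locally learned MAC table on the switch bridge.
ovs-appctl fdb/show ls1
```

As Robin noted above, traffic crossing patch ports between bridges should still
collapse into a single datapath flow once the OpenFlow pipelines have been
resolved.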

> >
> >>> Use multicast for overlay networks
> >>> ==================================
> > [snip]
> >>> - 24bit VNI allows for more than 16 million logical switches. No need
> >>>  for extended GENEVE tunnel options.
> >> Note that using vxlan at the moment significantly reduces the ovn
> >> featureset. This is because the geneve header options are currently used
> >> for data that would not fit into the vxlan vni.
> >>
> >> From ovn-architecture.7.xml:
> >> ```
> >> The maximum number of networks is reduced to 4096.
> >> The maximum number of ports per network is reduced to 2048.
> >> ACLs matching against logical ingress port identifiers are not supported.
> >> OVN interconnection feature is not supported.
> >> ```
> >
> > In my understanding, the main reason why GENEVE replaced VXLAN is that
> > OpenStack uses full-mesh point-to-point tunnels and the sender needs to
> > know behind which chassis a given mac address is in order to send it
> > into the correct tunnel. GENEVE made it possible to reduce the lookup time
> > both on the sender and receiver thanks to ingress/egress port metadata.
> >
> > https://blog.russellbryant.net/2017/05/30/ovn-geneve-vs-vxlan-does-it-matter/
> > https://dani.foroselectronica.es/ovn-geneve-encapsulation-541/
> >
> > If VXLAN + multicast and address learning were used, the "correct" tunnel
> > would be established ad hoc and both sender and receiver lookups would
> > only be simple mac forwarding with learning. The ingress pipeline
> > would probably cost a little more.
> >
> > Maybe multicast + address learning could be implemented for GENEVE as
> > well. But it would not be interoperable with other VTEPs.

While it is true that it takes time before switch hardware picks up
support for emerging protocols, I do not think it is a valid argument
for limiting the development of OVN. Most hardware offload capable
NICs already have GENEVE support, and if you survey recent or upcoming
releases from top of rack switch vendors you will also find that they
have added support for using GENEVE for hardware VTEPs. The fact that
SDNs with a large customer footprint (such as NSX and OVN) make use of
GENEVE is most likely a deciding factor for their adoption, and I see
no reason why we should stop defining the edge of development in this
space.
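
For reference, this is roughly what an OVN-style GENEVE tunnel port looks like
on the integration bridge today, with per-flow tunnel metadata enabled (a
minimal sketch; the port name and remote IP are made up):

```
# Hypothetical peer chassis "chassis-2" reachable at 192.0.2.12.
ovs-vsctl add-port br-int ovn-chassis-2 -- \
    set interface ovn-chassis-2 type=geneve \
    options:remote_ip=192.0.2.12 options:key=flow options:csum=true
```

It is the extensible GENEVE option TLVs that carry the logical ingress/egress
port metadata mentioned above, which is exactly the data that does not fit into
a VXLAN VNI.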

> >>> - Limited and scoped "flooding" with IGMP/MLD snooping enabled in
> >>>  top-of-rack switches. Multicast is only used for BUM traffic.
> >>> - Only one VXLAN output port per implemented logical switch on a given
> >>>  chassis.
> >>
> >> Would this actually work with one VXLAN output port? Would you not need
> >> one port per target node to send unicast traffic (as you otherwise flood
> >> all packets to all participating nodes)?
> >
> > You would need one VXLAN output port per implemented logical switch on
> > a given chassis. The port would have a VNI (unique per logical switch)
> > and an associated multicast IP address. Any chassis that implement this
> > logical switch would subscribe to that multicast group. The flooding
> > would be limited to first packets and broadcast/multicast traffic (ARP
> > requests, mostly). Once the receiver node replies, all communication
> > will happen with unicast.
> >
> > https://networkdirection.net/articles/routingandswitching/vxlanoverview/vxlanaddresslearning/#BUM_Traffic
> >
> >>> Cons:
> >>>
> >>> - OVS does not support VXLAN address learning yet.
> >>> - The number of usable multicast groups in a fabric network may be
> >>>  limited?
> >>> - How to manage seamless upgrades and interoperability with older OVN
> >>>  versions?
> >> - This pushes all logic related to chassis management to the
> >>  underlying networking fabric. It thereby places additional
> >>  requirements on the network fabric that were not there before and
> >>  that might not be available to all users.
> >
> > Are you aware of any fabric that does not support IGMP/MLD snooping?

Have you ever operated a network without having issues with multicast? ;)
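
That said, for completeness, the multicast + learning model Robin describes
above is essentially what the plain Linux VXLAN device already implements; a
minimal sketch, with made-up interface name, VNI and group address:

```
# One VXLAN device per logical switch, joined to a multicast group.
ip link add vxlan100 type vxlan id 100 group 239.1.1.100 \
    dstport 4789 dev eth0
ip link set vxlan100 up
# BUM traffic is flooded to the group; learned remote VTEPs/MACs
# show up in the FDB.
bridge fdb show dev vxlan100
```

As noted in the cons above, OVS itself does not implement this VXLAN address
learning mode today.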

> >> - The BFD sessions between chassis are no longer possible, thereby
> >>  preventing fast failover of gateway chassis.
> >
> > I don't know what these BFD sessions are used for. But we could imagine
> > an ad-hoc establishment of them when a tunnel is created.
> >
> >> As this idea requires VXLAN, and all current limitations would apply to
> >> this solution as well, this is probably not a general solution but rather
> >> a deployment option.
> >
> > Yes, for backward compatibility, it would probably need to be opt-in.

Would an alternative be to look at how we can make the existing
communication infrastructure that OVN provides between the
ovn-controllers more efficient for this use case? If you think about
it, could it be used for "multicast"-like operation? One of the issues
with large L2s in OVN today is that every known mac address in the
network is populated to every chassis in the cloud. Would an
alternative be to:
- Have each ovn-controller preprogram only the mac bindings for logical
switch ports residing on the hypervisor.
- When learning of a remote MAC address is necessary, broadcast the
request only to tunnel endpoints where we know there are logical
switch ports for the same logical switch.
- Add a local OVSDB instance for ovn-controller to store things such
as learned mac addresses instead of using the central DB for this
information (a rough sketch of that last part follows below).
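
A minimal sketch of what that chassis-local OVSDB could look like, assuming a
hypothetical schema file and paths that do not exist in OVN today:

```
# "mac-cache.ovsschema" is a made-up schema, not an existing OVN artifact.
ovsdb-tool create /var/lib/ovn/mac-cache.db mac-cache.ovsschema
ovsdb-server /var/lib/ovn/mac-cache.db \
    --remote=punix:/var/run/ovn/mac-cache.sock --pidfile --detach
# ovn-controller (or an operator debugging it) could then read and write
# learned MACs over the local socket instead of the central Southbound DB:
ovsdb-client dump unix:/var/run/ovn/mac-cache.sock
```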


> >>> Connect ovn-controller to the northbound DB
> >>> ===========================================
> > [snip]
> >>> For other components that require access to the southbound DB (e.g.
> >>> neutron metadata agent), ovn-controller should provide an interface to
> >>> expose state and configuration data for local consumption.
> >>
> >> Note that ovn-interconnect also uses access to the southbound DB to add
> >> chassis of the interconnected site (and potentially some more magic).
> >
> > I was not aware of this. Thanks for the heads up.
> >
> >>> Pros:
> > [snip]
> >>
> >> - one less codebase with northd gone
> >>
> >>> Cons:
> >>>
> >>> - This would be a serious API breakage for systems that depend on the
> >>>  southbound DB.
> >>> - Can all OVN constructs be implemented without a southbound DB?
> >>> - Is the community interested in alternative datapaths?
> >>
> >> - It requires each ovn-controller to do that translation of a given
> >>  construct (e.g. a logical switch) thereby probably increasing the cpu
> >>  load and recompute time
> >
> > We cannot get this for free :) The CPU load that is gone from the
> > central node needs to be shared across all chassis.
> >
> >> - The complexity of the ovn-controller grows as it gains nearly all
> >>  logic of northd
> >
> > Agreed, but the complexity may not be that high, since ovn-controller
> > would not need to do a two-staged translation from the NB model to logical
> > flows to openflow.
> >
> > Also, if we reuse OVS bridges to implement logical switches, there would
> > be fewer flows to compute.
> >
> >> I now understand what you meant by the alternative datapaths in your
> >> first mail. While I find the option interesting, I'm not sure how much
> >> value would actually come out of that.
> >
> > I have resource-constrained environments in mind (DPUs/IPUs, edge
> > nodes, etc). When available memory and CPU are limited, OVS may not be
> > the most efficient choice. Maybe using the plain Linux (or BSD) networking
> > stack would be perfectly suitable and more lightweight.

I honestly do not think using linuxbridge as a datapath is a
desirable option, for multiple reasons:
1) There is no performant and hardware offloadable way to express ACLs
for linuxbridges.
2) There is no way to express L3 constructs for linuxbridges.
3) The current OVS OpenFlow bridge model is a perfect fit for
translating the intent into flows programmed directly into the
hardware switch on the NIC, and from my perspective this is one of the
main reasons why we are migrating the world onto OVS/OVN and away from
legacy implementations based on linuxbridges and network namespaces.
Accelerator cards/DPUs/IPUs are usually equipped with such hardware
switches (implemented in ASIC or FPGA).
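
To illustrate point 3, this is roughly what enabling that offload path looks
like on a chassis with an offload-capable NIC today (a minimal sketch; the
interface name is made up and the service name varies per distribution):

```
# Enable TC flower offload on the NIC (interface name is made up).
ethtool -K enp3s0f0 hw-tc-offload on
# Tell OVS to offload datapath flows to hardware via TC.
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
systemctl restart openvswitch-switch   # service name varies per distro
# Flows that actually hit the NIC can be inspected with:
ovs-appctl dpctl/dump-flows type=offloaded
```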

-- 
Frode Nordahl

> > Also, since the northbound database doesn't know anything about flows,
> > it could make OVN interoperable with any network-capable element that is
> > able to implement the network intent as described in the NB DB (<insert
> > the name of your vrouter here>, etc.).
> >
> >> For me it feels like this would make OVN significantly harder to debug.
> >
> > Are you talking about ovn-trace? Or in general?
> >
> > Thanks for your comments.
> >
