regards,
Vladislav Odintsov

> On 30 Sep 2023, at 16:50, Robin Jarry <rja...@redhat.com> wrote:
> 
> Hi Vladislav, Frode,
> 
> Thanks for your replies.
> 
> Frode Nordahl, Sep 30, 2023 at 10:55:
>> On Sat, Sep 30, 2023 at 9:43 AM Vladislav Odintsov via discuss
>> <ovs-discuss@openvswitch.org> wrote:
>>>> On 29 Sep 2023, at 18:14, Robin Jarry via discuss 
>>>> <ovs-discuss@openvswitch.org> wrote:
>>>> 
>>>> Felix Huettner, Sep 29, 2023 at 15:23:
>>>>>> Distributed mac learning
>>>>>> ========================
>>>> [snip]
>>>>>> 
>>>>>> Cons:
>>>>>> 
>>>>>> - How to manage seamless upgrades?
>>>>>> - Requires ovn-controller to move/plug ports in the correct bridge.
>>>>>> - Multiple openflow connections (one per managed bridge).
>>>>>> - Requires ovn-trace to be reimplemented differently (maybe other tools
>>>>>> as well).
>>>>> 
>>>>> - No central information anymore on mac bindings. All nodes need to
>>>>> update their data individually
>>>>> - Each bridge also generates a Linux network interface. I do not know if
>>>>> there is some kind of limit to the linux interfaces or the ovs bridges
>>>>> somewhere.
>>>> 
>>>> That's a good point. However, only the bridges related to one
>>>> implemented logical network would need to be created on a single
>>>> chassis. Even with the largest OVN deployments, I doubt this would be
>>>> a limitation.
>>>> 
>>>>> Would you still preprovision static mac addresses on the bridge for all
>>>>> port_bindings we know the mac address from, or would you rather leave
>>>>> that up for learning as well?
>>>> 
>>>> I would leave everything dynamic.
>>>> 
>>>>> I do not know if there is some kind of performance/optimization penalty
>>>>> for moving packets between different bridges.
>>>> 
>>>> As far as I know, once the openflow pipeline has been resolved into
>>>> a datapath flow, there is no penalty.
>>>> 
>>>>> You also cannot use only the logical switches that have a local port
>>>>> bound. Assume the following topology:
>>>>> +---+ +---+ +---+ +---+ +---+ +---+ +---+
>>>>> |vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
>>>>> +---+ +---+ +---+ +---+ +---+ +---+ +---+
>>>>> vm1 and vm2 are both running on the same hypervisor. Creating only local
>>>>> logical switches would mean only ls1 and ls3 are available on that
>>>>> hypervisor. This would break the connection between the two vms, which
>>>>> in the current implementation would just traverse the two logical
>>>>> routers.
>>>>> I guess we would need to create bridges for each locally reachable
>>>>> logical switch. I am concerned about the potentially significant
>>>>> increase in bridges and openflow connections this brings.
>>>> 
>>>> That is one of the concerns I raised in the last point. In my opinion
>>>> this is a trade off. You remove centralization and require more local
>>>> processing. But overall, the processing cost should remain equivalent.
>>> 
>>> Just want to clarify.
>>> 
>>> For topology described by Felix above, you propose to create 2 OVS
>>> bridges, right? How will the packet traverse from vm1 to vm2?
> 
> In this particular case, there would be 3 OVS bridges, one for each
> logical switch.

Yeah, agreed, that was a typo. Below I named three bridges :).

> 
>>> Currently, when the packet enters OVS, all the logical switching and
>>> routing OpenFlow processing is done without the packet re-entering OVS,
>>> and this results in one DP flow match to deliver the packet from
>>> vm1 to vm2 (if no conntrack is used, which could introduce
>>> recirculations).
>>> 
>>> Do I understand correctly that in this proposal OVS needs to
>>> receive the packet on the “ls1” bridge, run it through the “lr1”
>>> logical router OpenFlow pipeline, then output the packet to the “ls2”
>>> OVS bridge for mac learning between logical routers (should we have
>>> an OF flow with a learn action here?), then re-enter OVS, run the
>>> “lr2” OpenFlow pipeline and finally reach the destination OVS bridge
>>> “ls3” to send the packet to vm2?
> 
> What I am proposing is to implement the northbound L2 network intent
> with actual OVS bridges and builtin OVS mac learning. The L3/L4 network
> constructs and ACLs would require patch ports and specific OF pipelines.
> 
> We could even think of adding more advanced L3 capabilities (RIB) into
> OVS to simplify the OF pipelines.

But this will make the OVS<->kernel interaction more complex, even if we set
aside DPDK environments…
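
Just to check my understanding of the proposal: Felix's topology above would
map to something roughly like this on a chassis hosting vm1 and vm2 (only a
sketch; all bridge and port names are invented, and the lr1/lr2 OpenFlow
pipelines plus ACLs would then have to be attached to the patch ports):

```
# one OVS bridge per locally implemented logical switch
ovs-vsctl add-br ls1
ovs-vsctl add-br ls2
ovs-vsctl add-br ls3

# VM ports plugged into their logical switch bridge
ovs-vsctl add-port ls1 vm1-tap
ovs-vsctl add-port ls3 vm2-tap

# lr1 modelled as a patch port pair between ls1 and ls2
ovs-vsctl add-port ls1 ls1-lr1 -- set interface ls1-lr1 type=patch options:peer=lr1-ls1
ovs-vsctl add-port ls2 lr1-ls1 -- set interface lr1-ls1 type=patch options:peer=ls1-lr1

# lr2 modelled as a patch port pair between ls2 and ls3
ovs-vsctl add-port ls2 ls2-lr2 -- set interface ls2-lr2 type=patch options:peer=lr2-ls2
ovs-vsctl add-port ls3 lr2-ls2 -- set interface lr2-ls2 type=patch options:peer=ls2-lr2
```

Please correct me if this is not what you have in mind.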

> 
>>> Also, will such behavior be compatible with HW offload to
>>> SmartNICs/DPUs?
>> 
>> I am also a bit concerned about this, what would be the typical number
>> of bridges supported by hardware?
> 
> As far as I understand, only the datapath flows are offloaded to
> hardware. The OF pipeline is only parsed when there is an upcall for the
> first packet. Once resolved, the datapath flow is reused. OVS bridges
> are only logical constructs, they are neither reflected in the datapath
> nor in hardware.

As far as I remember from my tests against ConnectX-5/6 SmartNICs in ASAP^2
mode, HW offload does not support OVS patch ports. At least that was the case
2 years ago.
If that is still true, this will degrade the ability to offload the datapath.
Maybe @Han Zhou can comment on this in more detail.
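
If somebody with such a NIC at hand wants to re-check, something like the
following should show whether datapath flows that cross patch ports still get
offloaded (a rough sketch, assuming an OVS recent enough to support the type
filter of dpctl/dump-flows):

```
# enable hardware offload (requires restarting ovs-vswitchd)
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true

# after sending traffic between the VMs, compare what is offloaded
# with what stays in the software datapath
ovs-appctl dpctl/dump-flows type=offloaded
ovs-appctl dpctl/dump-flows type=ovs
```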

> 
>>>>>> Use multicast for overlay networks
>>>>>> ==================================
>>>> [snip]
>>>>>> - 24bit VNI allows for more than 16 million logical switches. No need
>>>>>> for extended GENEVE tunnel options.
>>>>> Note that using vxlan at the moment significantly reduces the ovn
>>>>> featureset. This is because the geneve header options are currently used
>>>>> for data that would not fit into the vxlan vni.
>>>>> 
>>>>> From ovn-architecture.7.xml:
>>>>> ```
>>>>> The maximum number of networks is reduced to 4096.
>>>>> The maximum number of ports per network is reduced to 2048.
>>>>> ACLs matching against logical ingress port identifiers are not supported.
>>>>> OVN interconnection feature is not supported.
>>>>> ```
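
(Those limits follow from simple arithmetic, if I read ovn-architecture
correctly: the 24-bit VNI has to carry both a datapath id and a port id, and
2^12 = 4096 networks plus 2^11 = 2048 ports already consume 23 of the 24 bits,
leaving no room for the ingress port metadata that Geneve carries in its
option TLVs.)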
>>>> 
>>>> In my understanding, the main reason why GENEVE replaced VXLAN is
>>>> because OpenStack uses full-mesh point-to-point tunnels and the sender
>>>> needs to know behind which chassis each mac address sits in order to
>>>> send it into the correct tunnel. GENEVE allowed reducing the lookup time
>>>> on both the sender and receiver thanks to ingress/egress port metadata.
>>>> 
>>>> https://blog.russellbryant.net/2017/05/30/ovn-geneve-vs-vxlan-does-it-matter/
>>>> https://dani.foroselectronica.es/ovn-geneve-encapsulation-541/
>>>> 
>>>> If VXLAN + multicast and address learning was used, the "correct" tunnel
>>>> would be established ad-hoc and both sender and receiver lookups would
>>>> only be a simple mac forwarding with learning. The ingress pipeline
>>>> would probably cost a little more.
>>>> 
>>>> Maybe multicast + address learning could be implemented for GENEVE as
>>>> well. But it would not be interoperable with other VTEPs.
>> 
>> While it is true that it takes time before switch hardware picks up
>> support for emerging protocols, I do not think it is a valid argument
>> for limiting the development of OVN. Most hardware offload capable
>> NICs already have GENEVE support, and if you survey recent or upcoming
>> releases from top of rack switch vendors you will also find that they
>> have added support for using GENEVE for hardware VTEPs. The fact that
>> SDNs with a large customer footprint (such as NSX and OVN) make use of
>> GENEVE is most likely a deciding factor for their adoption, and I see
>> no reason why we should stop defining the edge of development in this
>> space.
> 
> GENEVE could perfectly be suitable with a multicast based control plane
> to establish ad-hoc tunnels without any centralized involvement.
> 
> I was only proposing VXLAN since this multicast group system was part of
> the original RFC (supported in Linux since 3.12).
> 
>>>>>> - Limited and scoped "flooding" with IGMP/MLD snooping enabled in
>>>>>> top-of-rack switches. Multicast is only used for BUM traffic.
>>>>>> - Only one VXLAN output port per implemented logical switch on a given
>>>>>> chassis.
>>>>> 
>>>>> Would this actually work with one VXLAN output port? Would you not need
>>>>> one port per target node to send unicast traffic (as you otherwise flood
>>>>> all packets to all participating nodes)?
>>>> 
>>>> You would need one VXLAN output port per implemented logical switch on
>>>> a given chassis. The port would have a VNI (unique per logical switch)
>>>> and an associated multicast IP address. Any chassis that implements this
>>>> logical switch would subscribe to that multicast group. The flooding
>>>> would be limited to first packets and broadcast/multicast traffic (ARP
>>>> requests, mostly). Once the receiver node replies, all communication
>>>> will happen with unicast.
>>>> 
>>>> https://networkdirection.net/articles/routingandswitching/vxlanoverview/vxlanaddresslearning/#BUM_Traffic
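
For reference, this is essentially the original multicast learning mode of the
Linux VXLAN driver. A rough illustration (VNI, group address and device names
are made up):

```
# one VXLAN device per logical switch, with a VNI and a multicast group
# instead of a full mesh of point-to-point tunnels
ip link add vxlan-ls1 type vxlan id 1001 group 239.1.1.1 dev eth0 dstport 4789
```

The fabric then only needs to deliver BUM traffic to the chassis that joined
the group; known unicast goes over learned point-to-point entries in the FDB.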
>>>> 
>>>>>> Cons:
>>>>>> 
>>>>>> - OVS does not support VXLAN address learning yet.
>>>>>> - The number of usable multicast groups in a fabric network may be
>>>>>> limited?
>>>>>> - How to manage seamless upgrades and interoperability with older OVN
>>>>>> versions?
>>>>> - This pushes all logic related to chassis management to the
>>>>> underlying networking fabric. It thereby places additional
>>>>> requirements on the network fabric that did not exist before and
>>>>> that might not be available for all users.
>>>> 
>>>> Are you aware of any fabric that does not support IGMP/MLD snooping?
>> 
>> Have you ever operated a network without having issues with multicast? ;)
> 
> I must admit I don't have enough field experience operating large
> network fabrics to state what issues multicast can cause with these.
> This is why I raised this in the cons list :)
> 
> What specific issues did you have in mind?
> 
>>>>> - The bfd sessions between chassis are no longer possible thereby
>>>>> preventing fast failover of gateway chassis.
>>>> 
>>>> I don't know what these BFD sessions are used for. But we could imagine
>>>> an ad-hoc establishment of them when a tunnel is created.
>>>> 
>>>>> As this idea requires VXLAN, and all current limitations would apply to
>>>>> this solution as well, this is probably not a general solution but rather a
>>>>> deployment option.
>>>> 
>>>> Yes, for backward compatibility, it would probably need to be opt-in.
>> 
>> Would an alternative be to look at how we can make the existing
>> communication infrastructure that OVN provides between the
>> ovn-controllers more efficient for this use case? If you think about
>> it, could it be used for "multicast" like operation? One of the issues
>> with large L2s for OVN today is the population of every known mac
>> address in the network to every chassis in the cloud. Would an
>> alternative be to:
>> 
>> - Each ovn-controller preprograms only the mac bindings for logical
>>  switch ports residing on the hypervisor.
>> - When learning of a remote MAC address is necessary, broadcast the
>>  request only to tunnel endpoints where we know there are logical
>>  switch ports for the same logical switch.
>> - Add a local OVSDB instance for ovn-controller to store things such
>>  as learned mac addresses instead of using the central DB for this
>>  information.
> 
> I am afraid that it would complicate the current OVN even more. Why
> reimplement existing network stack features in OVN?
> 
> I am eager to know if real multicast operation was ever considered, and
> if so, why it was discarded as a viable option. If not, could we consider
> it?
> 
>>>>>> Connect ovn-controller to the northbound DB
>>>>>> ===========================================
>>>> [snip]
>>>>>> For other components that require access to the southbound DB (e.g.
>>>>>> neutron metadata agent), ovn-controller should provide an interface to
>>>>>> expose state and configuration data for local consumption.
>>>>> 
>>>>> Note that also ovn-interconnect uses access to the southbound DB to add
>>>>> chassis of the interconnected site (and potentially some more magic).
>>>> 
>>>> I was not aware of this. Thanks for the heads up.
>>>> 
>>>>>> Pros:
>>>> [snip]
>>>>> 
>>>>> - one less codebase with northd gone
>>>>> 
>>>>>> Cons:
>>>>>> 
>>>>>> - This would be a serious API breakage for systems that depend on the
>>>>>> southbound DB.
>>>>>> - Can all OVN constructs be implemented without a southbound DB?
>>>>>> - Is the community interested in alternative datapaths?
>>>>> 
>>>>> - It requires each ovn-controller to do that translation of a given
>>>>> construct (e.g. a logical switch) thereby probably increasing the cpu
>>>>> load and recompute time
>>>> 
>>>> We cannot get this for free :) The CPU load that is gone from the
>>>> central node needs to be shared across all chassis.
>>>> 
>>>>> - The complexity of the ovn-controller grows as it gains nearly all
>>>>> logic of northd
>>>> 
>>>> Agreed, but the complexity may not be that high, since ovn-controller
>>>> would not need to do a two-stage translation from the NB model to logical
>>>> flows to OpenFlow.
>>>> 
>>>> Also, if we reuse OVS bridges to implement logical switches, there would
>>>> be a reduced number of flows to compute.
>>>> 
>>>>> I now understand what you meant with the alternative datapaths in your
>>>>> first mail. While I find the option interesting, I'm not sure how much
>>>>> value would actually come out of that.
>>>> 
>>>> I have resource-constrained environments in mind (DPUs/IPUs, edge
>>>> nodes, etc.). When available memory and CPU are limited, OVS may not be
>>>> the most efficient. Maybe using the plain Linux (or BSD) networking stack
>>>> would be perfectly suitable and more lightweight.
>> 
>> I honestly do not think using linuxbridge as a datapath is a
>> desirable option for multiple reasons:
>> 
>> 1) There is no performant and hardware offloadable way to express ACLs
>>   for linuxbridges.
>> 2) There is no way to express L3 constructs for linuxbridges.
>> 3) The current OVS OpenFlow bridge model is a perfect fit for
>>   translating the intent into flows programmed directly into the
>>   hardware switch on the NIC, and from my perspective this is one of
>>   the main reasons why we are migrating the world onto OVS/OVN and
>>   away from legacy implementations based on linuxbridges and network
>>   namespaces.
>> 
>> Accelerator cards/DPUs/IPUs are usually equipped with such hardware
>> switches (implemented in ASIC or FPGA).
> 
> Let me first clarify one point, I am *not* suggesting to use linux
> bridges and network namespaces as a first class replacement for OVS.
> I am aware that the linux network stack has neither the level of
> performance nor the determinism required for cloud and telco use cases.
> 
> What I am proposing is to make OVN more inclusive and decouple it from
> the flow-based paradigm by allowing alternative implementations of the
> northbound network intent.
> 
> I realize that this idea may be controversial for the community since
> OVN has been closely tied to OVS since the start. However I am convinced
> that this is a direction worth exploring, or at least discussing :)
> 
>>>> Also, since the northbound database doesn't know anything about flows,
>>>> it could make OVN interoperable with any network capable element that is
>>>> able to implement the network intent as described in the NB DB (<insert
>>>> the name of your vrouter here>, etc.).
>>>> 
>>>>> For me it feels like this would make OVN significantly harder to debug.
>>>> 
>>>> Are you talking about ovn-trace? Or in general?
>>>> 
>>>> Thanks for your comments.
> 
> Thanks everyone for the constructive discussion so far!
> 