Hi all,
> On 30 Sep 2023, at 15:51, Robin Jarry via discuss <ovs-discuss@openvswitch.org> wrote:
>
> Hi Vladislav, Frode,
>
> Thanks for your replies.
>
> Frode Nordahl, Sep 30, 2023 at 10:55:
>> On Sat, Sep 30, 2023 at 9:43 AM Vladislav Odintsov via discuss <ovs-discuss@openvswitch.org> wrote:
>>>> On 29 Sep 2023, at 18:14, Robin Jarry via discuss <ovs-discuss@openvswitch.org> wrote:
>>>>
>>>> Felix Huettner, Sep 29, 2023 at 15:23:
>>>>>> Distributed mac learning
>>>>>> ========================
>>>> [snip]
>>>>>>
>>>>>> Cons:
>>>>>>
>>>>>> - How to manage seamless upgrades?
>>>>>> - Requires ovn-controller to move/plug ports in the correct bridge.
>>>>>> - Multiple openflow connections (one per managed bridge).
>>>>>> - Requires ovn-trace to be reimplemented differently (maybe other tools as well).
>>>>>
>>>>> - No central information anymore on mac bindings. All nodes need to update their data individually.
>>>>> - Each bridge also generates a Linux network interface. I do not know if there is some kind of limit to the Linux interfaces or the OVS bridges somewhere.
>>>>
>>>> That's a good point. However, only the bridges related to one implemented logical network would need to be created on a single chassis. Even with the largest OVN deployments, I doubt this would be a limitation.
>>>>
>>>>> Would you still preprovision static mac addresses on the bridge for all port_bindings we know the mac address from, or would you rather leave that up for learning as well?
>>>>
>>>> I would leave everything dynamic.
>>>>
>>>>> I do not know if there is some kind of performance/optimization penalty for moving packets between different bridges.
>>>>
>>>> As far as I know, once the openflow pipeline has been resolved into a datapath flow, there is no penalty.
>>>>
>>>>> You can also not only use the logical switches that have a local port bound. Assume the following topology:
>>>>> +---+ +---+ +---+ +---+ +---+ +---+ +---+
>>>>> |vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
>>>>> +---+ +---+ +---+ +---+ +---+ +---+ +---+
>>>>> vm1 and vm2 are both running on the same hypervisor. Creating only local logical switches would mean only ls1 and ls3 are available on that hypervisor. This would break the connection between the two vms, which would in the current implementation just traverse the two logical routers.
>>>>> I guess we would need to create bridges for each locally reachable logical switch. I am concerned about the potentially significant increase in bridges and openflow connections this brings.
>>>>
>>>> That is one of the concerns I raised in the last point. In my opinion this is a trade-off. You remove centralization and require more local processing. But overall, the processing cost should remain equivalent.
>>>
>>> Just want to clarify.
>>>
>>> For the topology described by Felix above, you propose to create 2 OVS bridges, right? How will the packet traverse from vm1 to vm2?
>
> In this particular case, there would be 3 OVS bridges, one for each logical switch.
>
>>> Currently when the packet enters OVS, all the logical switching and routing openflow calculation is done with no packet re-entering OVS, and this results in one DP flow match to deliver this packet from vm1 to vm2 (if no conntrack is used, which could introduce recirculations).
>>>
>>> Do I understand correctly that in this proposal OVS needs to receive the packet from the “ls1” bridge, next run it through the lrouter “lr1” OpenFlow pipelines, then output the packet to the “ls2” OVS bridge for mac learning between logical routers (should we have here an OF flow with a learn action?), then send the packet again to OVS, calculate the “lr2” OpenFlow pipeline and finally reach the destination OVS bridge “ls3” to send the packet to vm2?
>
> What I am proposing is to implement the northbound L2 network intent with actual OVS bridges and builtin OVS mac learning. The L3/L4 network constructs and ACLs would require patch ports and specific OF pipelines.
>
> We could even think of adding more advanced L3 capabilities (RIB) into OVS to simplify the OF pipelines.
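
To make sure I read this right, here is a rough sketch of what that could look like on one chassis for Felix's topology above. All bridge and port names are made up, and the lr1/lr2 pipelines are only hinted at:

  # one OVS bridge per locally implemented logical switch
  ovs-vsctl add-br br-ls1
  ovs-vsctl add-br br-ls2
  ovs-vsctl add-br br-ls3

  # lr1 modelled as a pair of patch ports between br-ls1 and br-ls2;
  # the routing/ACL OpenFlow pipeline would match on these ports
  ovs-vsctl add-port br-ls1 ls1-to-lr1 \
      -- set interface ls1-to-lr1 type=patch options:peer=lr1-to-ls2
  ovs-vsctl add-port br-ls2 lr1-to-ls2 \
      -- set interface lr1-to-ls2 type=patch options:peer=ls1-to-lr1

  # same for lr2 between br-ls2 and br-ls3 (omitted)

  # builtin mac learning on the switch bridges instead of logical flows
  ovs-ofctl add-flow br-ls1 "priority=0,actions=NORMAL"

If that matches your intent, a packet from vm1 to vm2 crosses two patch port pairs, but, as Robin says above, it should still collapse into a single datapath flow after the first upcall.
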
>>> Also, will such behavior be compatible with HW-offload-capable smartnics/DPUs?
>>
>> I am also a bit concerned about this, what would be the typical number of bridges supported by hardware?
>
> As far as I understand, only the datapath flows are offloaded to hardware. The OF pipeline is only parsed when there is an upcall for the first packet. Once resolved, the datapath flow is reused. OVS bridges are only logical constructs, they are neither reflected in the datapath nor in hardware.
>
>>>>>> Use multicast for overlay networks
>>>>>> ==================================
>>>> [snip]
>>>>>> - 24bit VNI allows for more than 16 million logical switches. No need for extended GENEVE tunnel options.
>>>>> Note that using vxlan at the moment significantly reduces the ovn featureset. This is because the geneve header options are currently used for data that would not fit into the vxlan vni.
>>>>>
>>>>> From ovn-architecture.7.xml:
>>>>> ```
>>>>> The maximum number of networks is reduced to 4096.
>>>>> The maximum number of ports per network is reduced to 2048.
>>>>> ACLs matching against logical ingress port identifiers are not supported.
>>>>> OVN interconnection feature is not supported.
>>>>> ```
>>>>
>>>> In my understanding, the main reason why GENEVE replaced VXLAN is that OpenStack uses full-mesh point-to-point tunnels, and the sender needs to know behind which chassis any mac address is in order to send it into the correct tunnel. GENEVE allowed reducing the lookup time both on the sender and receiver thanks to ingress/egress port metadata.
>>>>
>>>> https://blog.russellbryant.net/2017/05/30/ovn-geneve-vs-vxlan-does-it-matter/
>>>> https://dani.foroselectronica.es/ovn-geneve-encapsulation-541/
>>>>
>>>> If VXLAN + multicast and address learning were used, the "correct" tunnel would be established ad-hoc and both sender and receiver lookups would only be a simple mac forwarding with learning. The ingress pipeline would probably cost a little more.
>>>>
>>>> Maybe multicast + address learning could be implemented for GENEVE as well. But it would not be interoperable with other VTEPs.
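
For reference, the multicast + learning model described here is what the Linux vxlan driver has done since 3.12. A minimal sketch, with made-up VNI, group and device names:

  # one vxlan device per logical switch; BUM traffic is flooded to the
  # multicast group, remote VTEP/mac pairs are learned from return traffic
  ip link add vxlan-ls1 type vxlan id 10001 group 239.1.1.1 \
      dev eth0 dstport 4789 learning

  # the learned remote macs end up in the device fdb
  bridge fdb show dev vxlan-ls1

OVS tunnel ports cannot do this today (the first con in the list further down), so this is only to illustrate the datapath behaviour being discussed.
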
>> While it is true that it takes time before switch hardware picks up support for emerging protocols, I do not think it is a valid argument for limiting the development of OVN. Most hardware offload capable NICs already have GENEVE support, and if you survey recent or upcoming releases from top-of-rack switch vendors you will also find that they have added support for using GENEVE for hardware VTEPs. The fact that SDNs with a large customer footprint (such as NSX and OVN) make use of GENEVE is most likely a deciding factor for their adoption, and I see no reason why we should stop defining the edge of development in this space.
>
> GENEVE could perfectly be suitable with a multicast based control plane to establish ad-hoc tunnels without any centralized involvement.
>
> I was only proposing VXLAN since this multicast group system was part of the original RFC (supported in Linux since 3.12).
>
>>>>>> - Limited and scoped "flooding" with IGMP/MLD snooping enabled in top-of-rack switches. Multicast is only used for BUM traffic.
>>>>>> - Only one VXLAN output port per implemented logical switch on a given chassis.
>>>>>
>>>>> Would this actually work with one VXLAN output port? Would you not need one port per target node to send unicast traffic (as you otherwise flood all packets to all participating nodes)?
>>>>
>>>> You would need one VXLAN output port per implemented logical switch on a given chassis. The port would have a VNI (unique per logical switch) and an associated multicast IP address. Any chassis that implements this logical switch would subscribe to that multicast group. The flooding would be limited to first packets and broadcast/multicast traffic (ARP requests, mostly). Once the receiver node replies, all communication will happen with unicast.
>>>>
>>>> https://networkdirection.net/articles/routingandswitching/vxlanoverview/vxlanaddresslearning/#BUM_Traffic
>>>>
>>>>>> Cons:
>>>>>>
>>>>>> - OVS does not support VXLAN address learning yet.
>>>>>> - The number of usable multicast groups in a fabric network may be limited?
>>>>>> - How to manage seamless upgrades and interoperability with older OVN versions?
>>>>> - This pushes all logic related to chassis management to the underlying networking fabric. It thereby places additional requirements on the network fabric that have not been here before and that might not be available for all users.
>>>>
>>>> Are you aware of any fabric that does not support IGMP/MLD snooping?
>>
>> Have you ever operated a network without having issues with multicast? ;)
>
> I must admit I don't have enough field experience operating large network fabrics to state what issues multicast can cause with these. This is why I raised this in the cons list :)
>
> What specific issues did you have in mind?
>
>>>>> - The bfd sessions between chassis are no longer possible, thereby preventing fast failover of gateway chassis.
>>>>
>>>> I don't know what these BFD sessions are used for. But we could imagine an ad-hoc establishment of them when a tunnel is created.
>>>>
>>>>> As this idea requires VXLAN, and all current limitations would apply to this solution as well, this is probably no general solution but rather a deployment option.
>>>>
>>>> Yes, for backward compatibility, it would probably need to be opt-in.
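
On the BFD point a few messages up: OVS can already run BFD on individual interfaces, so establishing a session when a tunnel is created ad hoc does not look far-fetched. Roughly (interface name made up):

  # enable BFD on a tunnel interface and check the session state
  ovs-vsctl set interface vxlan-ls1 bfd:enable=true bfd:min_rx=100 bfd:min_tx=100
  ovs-vsctl get interface vxlan-ls1 bfd_status

As far as I know, this is the same per-interface BFD that ovn-controller already drives on its tunnel ports today.
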
>> Would an alternative be to look at how we can make the existing communication infrastructure that OVN provides between the ovn-controllers more efficient for this use case? If you think about it, could it be used for “multicast”-like operation? One of the issues with large L2s for OVN today is the population of every known mac address in the network to every chassis in the cloud. Would an alternative be to:
>>
>> - Each ovn-controller preprograms only the mac bindings for logical switch ports residing on the hypervisor.
>> - When learning of a remote MAC address is necessary, broadcast the request only to tunnel endpoints where we know there are logical switch ports for the same logical switch.
>> - Add a local OVSDB instance for ovn-controller to store things such as learned mac addresses instead of using the central DB for this information.
>
> I am afraid that it would complicate the current OVN even more. Why reimplement existing network stack features in OVN?
>
> I am eager to know if real multicast operation was ever considered and, if so, why it was discarded as a viable option. If not, could we consider it?
>
>>>>>> Connect ovn-controller to the northbound DB
>>>>>> ===========================================
>>>> [snip]
>>>>>> For other components that require access to the southbound DB (e.g. the neutron metadata agent), ovn-controller should provide an interface to expose state and configuration data for local consumption.
>>>>>
>>>>> Note that also ovn-interconnect uses access to the southbound DB to add chassis of the interconnected site (and potentially some more magic).
>>>>
>>>> I was not aware of this. Thanks for the heads up.
>>>>
>>>>>> Pros:
>>>> [snip]
>>>>>
>>>>> - one less codebase with northd gone
>>>>>
>>>>>> Cons:
>>>>>>
>>>>>> - This would be a serious API breakage for systems that depend on the southbound DB.
>>>>>> - Can all OVN constructs be implemented without a southbound DB?
>>>>>> - Is the community interested in alternative datapaths?
>>>>>
>>>>> - It requires each ovn-controller to do the translation of a given construct (e.g. a logical switch), thereby probably increasing the cpu load and recompute time.
>>>>
>>>> We cannot get this for free :) The CPU load that is gone from the central node needs to be shared across all chassis.
>>>>
>>>>> - The complexity of the ovn-controller grows as it gains nearly all logic of northd.
>>>>
>>>> Agreed, but the complexity may not be that high, since ovn-controller would not need to do a two-staged translation from the NB model to logical flows to openflow.
>>>>
>>>> Also, if we reuse OVS bridges to implement logical switches, there would be a reduced number of flows to compute.
>>>>
>>>>> I now understand what you meant with the alternative datapaths in your first mail. While I find the option interesting, I'm not sure how much value would actually come out of that.
>>>>
>>>> I have resource-constrained environments in mind (DPUs/IPUs, edge nodes, etc). When available memory and CPU are limited, OVS may not be the most efficient. Maybe using the plain Linux (or BSD) networking stack would be perfectly suitable and more lightweight.
>>
>> I honestly do not think using linuxbridge as a datapath is a desirable option, for multiple reasons:
>>
>> 1) There is no performant and hardware-offloadable way to express ACLs for linuxbridges.
>> 2) There is no way to express L3 constructs for linuxbridges.
>> 3) The current OVS OpenFlow bridge model is a perfect fit for translating the intent into flows programmed directly into the hardware switch on the NIC, and from my perspective this is one of the main reasons why we are migrating the world onto OVS/OVN and away from legacy implementations based on linuxbridges and network namespaces.
>>
>> Accelerator cards/DPUs/IPUs are usually equipped with such hardware switches (implemented in ASIC or FPGA).
>
> Let me first clarify one point: I am *not* suggesting to use Linux bridges and network namespaces as a first class replacement for OVS. I am aware that the Linux network stack has neither the level of performance nor the determinism required for cloud and telco use cases.
>
> What I am proposing is to make OVN more inclusive and decouple it from the flow-based paradigm by allowing alternative implementations of the northbound network intent.
>
> I realize that this idea may be controversial for the community since OVN has been closely tied to OVS since the start. However I am convinced that this is a direction worth exploring, or at least discussing :)

Separate bridges, no southbound database, VXLAN multicast, RIB, changes in the existing tooling, VXLAN address learning, upgrade paths, backwards compatibility… doesn't this sound like a separate project altogether?

>
>>>> Also, since the northbound database doesn't know anything about flows, it could make OVN interoperable with any network-capable element that is able to implement the network intent as described in the NB DB (<insert the name of your vrouter here>, etc.).
>>>>
>>>>> For me it feels like this would make ovn significantly harder to debug.
>>>>
>>>> Are you talking about ovn-trace? Or in general?
>>>>
>>>> Thanks for your comments.
>
> Thanks everyone for the constructive discussion so far!