However, IMHO, as mentioned in my other response, separating OVS bridges doesn't sound necessary for the idea of distributed MAC learning.
Thanks,
Han
> >
> >>>>>> Use multicast for overlay networks
> >>>>>> ==================================
> >>>> [snip]
> >>>>>> - 24bit VNI allows for more than 16 million logical switches. No need
> >>>>>> for extended GENEVE tunnel options.
> >>>>> Note that using VXLAN at the moment significantly reduces the OVN
> >>>>> feature set. This is because the GENEVE header options are currently
> >>>>> used for data that would not fit into the VXLAN VNI.
> >>>>>
> >>>>> From ovn-architecture.7.xml:
> >>>>> ```
> >>>>> The maximum number of networks is reduced to 4096.
> >>>>> The maximum number of ports per network is reduced to 2048.
> >>>>> ACLs matching against logical ingress port identifiers are not supported.
> >>>>> OVN interconnection feature is not supported.
> >>>>> ```
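To make the quoted limits concrete: 4096 networks (12 bits) and 2048 ports per network (11 bits) together consume 23 of the VNI's 24 bits. A minimal sketch of such a packing (the field layout here is illustrative, not necessarily OVN's actual encoding):

```python
def pack_vni(network_id, port_id):
    """Pack a network ID and a port ID into a single 24-bit VNI.

    Illustrative layout only: 12 bits for the network (4096 networks)
    and 11 bits for the port (2048 ports per network), matching the
    limits quoted above from ovn-architecture.7.xml.
    """
    assert 0 <= network_id < 4096 and 0 <= port_id < 2048
    return (network_id << 11) | port_id


def unpack_vni(vni):
    """Recover the (network ID, port ID) pair from a packed VNI."""
    return vni >> 11, vni & 0x7FF
```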
> >>>>
> >>>> In my understanding, the main reason why GENEVE replaced VXLAN is
> >>>> that OpenStack uses full-mesh point-to-point tunnels, so the sender
> >>>> needs to know behind which chassis each MAC address sits in order to
> >>>> send traffic into the correct tunnel. GENEVE made it possible to
> >>>> reduce the lookup time on both the sender and the receiver thanks to
> >>>> ingress/egress port metadata.
> >>>>
> >>>>
> >>>> https://blog.russellbryant.net/2017/05/30/ovn-geneve-vs-vxlan-does-it-matter/
> >>>> https://dani.foroselectronica.es/ovn-geneve-encapsulation-541/
> >>>> If VXLAN + multicast and address learning were used, the "correct"
> >>>> tunnel would be established ad hoc, and both sender and receiver
> >>>> lookups would be reduced to simple MAC forwarding with learning. The
> >>>> ingress pipeline would probably cost a little more.
> >>>>
> >>>> Maybe multicast + address learning could be implemented for GENEVE as
> >>>> well. But it would not be interoperable with other VTEPs.
> >>
> >> While it is true that it takes time before switch hardware picks up
> >> support for emerging protocols, I do not think it is a valid argument
> >> for limiting the development of OVN. Most hardware offload capable
> >> NICs already have GENEVE support, and if you survey recent or upcoming
> >> releases from top of rack switch vendors you will also find that they
> >> have added support for using GENEVE for hardware VTEPs. The fact that
> >> SDNs with a large customer footprint (such as NSX and OVN) make use of
> >> GENEVE is most likely a deciding factor for their adoption, and I see
> >> no reason why we should stop defining the edge of development in this
> >> space.
> >
> > GENEVE could work perfectly well with a multicast-based control plane
> > to establish ad-hoc tunnels without any centralized involvement.
> >
> > I was only proposing VXLAN since this multicast group system was part of
> > the original RFC (supported in Linux since 3.12).
> >
> >>>>>> - Limited and scoped "flooding" with IGMP/MLD snooping enabled in
> >>>>>> top-of-rack switches. Multicast is only used for BUM traffic.
> >>>>>> - Only one VXLAN output port per implemented logical switch on a given
> >>>>>> chassis.
> >>>>>
> >>>>> Would this actually work with one VXLAN output port? Would you not need
> >>>>> one port per target node to send unicast traffic (as you otherwise flood
> >>>>> all packets to all participating nodes)?
> >>>>
> >>>> You would need one VXLAN output port per implemented logical switch
> >>>> on a given chassis. The port would have a VNI (unique per logical
> >>>> switch) and an associated multicast IP address. Any chassis that
> >>>> implements this logical switch would subscribe to that multicast
> >>>> group. Flooding would be limited to first packets and
> >>>> broadcast/multicast traffic (mostly ARP requests). Once the receiver
> >>>> node replies, all communication happens over unicast.
> >>>>
> >>>>
> >>>> https://networkdirection.net/articles/routingandswitching/vxlanoverview/vxlanaddresslearning/#BUM_Traffic
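As a purely illustrative example of the VNI/group association: each chassis could derive the group address deterministically from the VNI (an assumed scheme, not part of any spec), so no central coordination would be needed:

```python
def vni_to_group(vni):
    """Map a 24-bit VNI to an administratively scoped IPv4 multicast group.

    Assumed scheme for illustration only: embed the VNI in the low
    24 bits of 239.0.0.0/8, giving one distinct group per logical switch.
    """
    assert 0 <= vni < 2**24
    return "239.{}.{}.{}".format((vni >> 16) & 0xFF, (vni >> 8) & 0xFF, vni & 0xFF)
```

The corresponding kernel VXLAN device would then be created along the lines of `ip link add vxlan42 type vxlan id 42 group 239.0.0.42 dstport 4789 dev eth0`.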
> >>>>>> Cons:
> >>>>>>
> >>>>>> - OVS does not support VXLAN address learning yet.
> >>>>>> - The number of usable multicast groups in a fabric network may be
> >>>>>> limited?
> >>>>>> - How to manage seamless upgrades and interoperability with older OVN
> >>>>>> versions?
> >>>>> - This pushes all logic related to chassis management down to the
> >>>>> underlying networking fabric. It thereby places additional
> >>>>> requirements on the network fabric that were not there before and
> >>>>> that might not be available to all users.
> >>>>
> >>>> Are you aware of any fabric that does not support IGMP/MLD snooping?
> >>
> >> Have you ever operated a network without having issues with multicast? ;)
> >
> > I must admit I don't have enough field experience operating large
> > network fabrics to state what issues multicast can cause with these.
> > This is why I raised this in the cons list :)
> >
> > What specific issues did you have in mind?
> >
> >>>>> - BFD sessions between chassis would no longer be possible, thereby
> >>>>> preventing fast failover of gateway chassis.
> >>>>
> >>>> I don't know what these BFD sessions are used for, but we could
> >>>> imagine establishing them ad hoc when a tunnel is created.
> >>>>
> >>>>> As this idea requires VXLAN, and all current limitations would apply
> >>>>> to this solution as well, this is probably not a general solution but
> >>>>> rather a deployment option.
> >>>>
> >>>> Yes, for backward compatibility, it would probably need to be opt-in.
> >>
> >> Would an alternative be to look at how we can make the existing
> >> communication infrastructure that OVN provides between the
> >> ovn-controllers more efficient for this use case? If you think about
> >> it, could it be used for "multicast" like operation? One of the issues
> >> with large L2s for OVN today is the population of every known mac
> >> address in the network to every chassis in the cloud. Would an
> >> alternative be to:
> >>
> >> - Each ovn-controller preprograms only the mac bindings for logical
> >> switch ports residing on the hypervisor.
> >> - When learning of a remote MAC address is necessary, broadcast the
> >> request only to tunnel endpoints where we know there are logical
> >> switch ports for the same logical switch.
> >> - Add a local OVSDB instance for ovn-controller to store things such
> >> as learned mac addresses instead of using the central DB for this
> >> information.
> >
> > I am afraid that this would complicate OVN even further. Why
> > reimplement existing network stack features in OVN?
> >
> > I am eager to know whether real multicast operation was ever
> > considered and, if so, why it was discarded as a viable option. If
> > not, could we consider it?
> >
> >>>>>> Connect ovn-controller to the northbound DB
> >>>>>> ===========================================
> >>>> [snip]
> >>>>>> For other components that require access to the southbound DB (e.g.
> >>>>>> neutron metadata agent), ovn-controller should provide an interface to
> >>>>>> expose state and configuration data for local consumption.
> >>>>>
> >>>>> Note that also ovn-interconnect uses access to the southbound DB to add
> >>>>> chassis of the interconnected site (and potentially some more magic).
> >>>>
> >>>> I was not aware of this. Thanks for the heads up.
> >>>>
> >>>>>> Pros:
> >>>> [snip]
> >>>>>
> >>>>> - one less codebase with northd gone
> >>>>>
> >>>>>> Cons:
> >>>>>>
> >>>>>> - This would be a serious API breakage for systems that depend on the
> >>>>>> southbound DB.
> >>>>>> - Can all OVN constructs be implemented without a southbound DB?
> >>>>>> - Is the community interested in alternative datapaths?
> >>>>>
> >>>>> - It requires each ovn-controller to perform the translation of a
> >>>>> given construct (e.g. a logical switch) itself, thereby probably
> >>>>> increasing CPU load and recompute time.
> >>>>
> >>>> We cannot get this for free :) The CPU load that is gone from the
> >>>> central node needs to be shared across all chassis.
> >>>>
> >>>>> - The complexity of the ovn-controller grows as it gains nearly all
> >>>>> logic of northd
> >>>>
> >>>> Agreed, but the complexity may not be that high, since ovn-controller
> >>>> would not need to do a two-stage translation from the NB model to
> >>>> logical flows to OpenFlow.
> >>>>
> >>>> Also, if we reused OVS bridges to implement logical switches, there
> >>>> would be fewer flows to compute.
> >>>>
> >>>>> I now understand what you meant by the alternative datapaths in your
> >>>>> first mail. While I find the option interesting, I am not sure how
> >>>>> much value would actually come out of it.
> >>>>
> >>>> I have resource-constrained environments in mind (DPUs/IPUs, edge
> >>>> nodes, etc.). When available memory and CPU are limited, OVS may not
> >>>> be the most efficient option. Maybe using the plain Linux (or BSD)
> >>>> networking stack would be perfectly suitable and more lightweight.
> >>
> >> I honestly do not think using linuxbridge as a datapath is a
> >> desirable option, for multiple reasons:
> >>
> >> 1) There is no performant and hardware offloadable way to express ACLs
> >> for linuxbridges.
> >> 2) There is no way to express L3 constructs for linuxbridges.
> >> 3) The current OVS OpenFlow bridge model is a perfect fit for
> >> translating the intent into flows programmed directly into the
> >> hardware switch on the NIC, and from my perspective this is one of
> >> the main reasons why we are migrating the world onto OVS/OVN and
> >> away from legacy implementations based on linuxbridges and network
> >> namespaces.
> >>
> >> Accelerator cards/DPUs/IPUs are usually equipped with such hardware
> >> switches (implemented in ASIC or FPGA).
> >
> > Let me first clarify one point, I am *not* suggesting to use linux
> > bridges and network namespaces as a first class replacement for OVS.
> > I am aware that the linux network stack has neither the level of
> > performance nor the determinism required for cloud and telco use cases.
> >
> > What I am proposing is to make OVN more inclusive and decouple it from
> > the flow-based paradigm by allowing alternative implementations of the
> > northbound network intent.
> >
> > I realize that this idea may be controversial for the community since
> > OVN has been closely tied to OVS since the start. However I am convinced
> > that this is a direction worth exploring, or at least discussing :)
> >
> >>>> Also, since the northbound database doesn't know anything about
> >>>> flows, this could make OVN interoperable with any network-capable
> >>>> element that is able to implement the network intent as described
> >>>> in the NB DB (<insert the name of your vrouter here>, etc.).
> >>>>
> >>>>> For me it feels like this would make OVN significantly harder to debug.
> >>>>
> >>>> Are you talking about ovn-trace? Or in general?
> >>>>
> >>>> Thanks for your comments.
> >
> > Thanks everyone for the constructive discussion so far!
> >