Hi Robin,

I'll try to answer what I can.

On Tue, Oct 03, 2023 at 09:22:53AM +0200, Robin Jarry via discuss wrote:
> Hi all,
>
> Felix Huettner, Oct 02, 2023 at 09:35:
> > Hi everyone,
> >
> > just want to add my experience below
> > On Sun, Oct 01, 2023 at 03:49:31PM -0700, Han Zhou wrote:
> > > On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry <rja...@redhat.com> wrote:
> > > >
> > > > Hi Han,
> > > >
> > > > Please see my comments/questions inline.
> > > >
> > > > Han Zhou, Sep 30, 2023 at 21:59:
> > > > > > Distributed mac learning
> > > > > > ========================
> > > > > >
> > > > > > Use one OVS bridge per logical switch with mac learning
> > > > > > enabled. Only create the bridge if the logical switch has
> > > > > > a port bound to the local chassis.
> > > > > >
> > > > > > Pros:
> > > > > >
> > > > > > - Minimal openflow rules required in each bridge (ACLs and NAT
> > > > > >   mostly).
> > > > > > - No central mac binding table required.
> > > > >
> > > > > Firstly, to clarify the terminology of "mac binding" to avoid
> > > > > confusion: the mac_binding table currently in the SB DB has
> > > > > nothing to do with L2 MAC learning. It is actually the
> > > > > ARP/Neighbor table of distributed logical routers. We should
> > > > > probably call it IP_MAC_binding table, or just Neighbor table.
> > > >
> > > > Yes sorry about the confusion. I actually meant the FDB table.
> > > >
> > > > > Here what you mean is actually L2 MAC learning, which today is
> > > > > implemented by the FDB table in the SB DB, and it is only for
> > > > > uncommon use cases where the NB doesn't have knowledge of
> > > > > a VIF's MAC address.
> > > >
> > > > This is not that uncommon in telco use cases where VNFs can send
> > > > packets from mac addresses unknown to OVN.
> > > >
> > > Understand, but VNFs contribute a very small portion of the
> > > workloads, right? Maybe I should rephrase that: it is uncommon to
> > > have "unknown" addresses for the majority of ports in a large-scale
> > > cloud. Is this understanding correct?
> >
> > I can only share numbers for our use case: with ~650 chassis we have
> > the following distribution of "unknown" in the `addresses` field of
> > Logical_Switch_Port:
> > * 23000 with a MAC address + IP and without "unknown"
> > * 250 with a MAC address + IP and with "unknown"
> > * 30 with just "unknown"
> >
> > The use case is a generic public cloud and we do not have any
> > telco-related things.
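> >
> > In case anyone wants to pull similar numbers from their own
> > deployment, a rough sketch (counting the rows whose addresses
> > contain "unknown") could be something along these lines:
> >
> >   ovn-nbctl --bare --columns=addresses list Logical_Switch_Port \
> >     | grep -c unknown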
>
> I don't have any numbers from telco deployments at hand but I will poke
> around.
>
> > > > > The purpose of this proposal is clear - to avoid using a central
> > > > > table in the DB for L2 information and instead use L2 MAC
> > > > > learning to populate such information on the chassis, which is
> > > > > a reasonable alternative with pros and cons.
> > > > > However, I don't think it is necessary to use separate OVS
> > > > > bridges for this purpose. L2 MAC learning can be easily
> > > > > implemented in the br-int bridge with OVS flows, which is much
> > > > > simpler than managing a dynamic number of OVS bridges just to
> > > > > use the built-in OVS MAC learning.
> > > >
> > > > I agree that this could also be implemented with VLAN tags on the
> > > > appropriate ports. But since OVS does not support trunk ports, it
> > > > may require complicated OF pipelines. My intent with this idea was
> > > > twofold:
> > > >
> > > > 1) Avoid a central point of failure for mac learning/aging.
> > > > 2) Simplify the OF pipeline by making all FDB operations dynamic.
> > >
> > > IMHO, the L2 pipeline is not really complex. It is probably the
> > > simplest part (compared with other features for L3, NAT, ACL, LB,
> > > etc.). Adding dynamic learning to this part probably makes it *a
> > > little* more complex, but should still be straightforward. We don't
> > > need any VLAN tag because the incoming packet has the Geneve VNI in
> > > the metadata. We just need a flow that resubmits to look up
> > > a MAC-to-tunnel-source mapping table, and injects a new flow (with
> > > the related tunnel endpoint information) if the src MAC is not
> > > found, with the help of the "learn" action. The entries are
> > > per-logical_switch (VNI). This would serve your purpose of avoiding
> > > a central DB for L2. At least this looks much simpler to me than
> > > managing a dynamic number of OVS bridges and the patch port pairs
> > > between them.
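> > >
> > > For illustration, a minimal sketch of what such flows could look
> > > like (hypothetical table numbers; only the tunnel side is shown,
> > > and the learned flows set the tunnel destination from the observed
> > > tunnel source):
> > >
> > >   # Learn "src MAC -> tunnel source" per VNI, then look up dst MAC.
> > >   ovs-ofctl add-flow br-int "table=30,priority=100,\
> > >     actions=learn(table=31,hard_timeout=300,\
> > >       NXM_NX_TUN_ID[],NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],\
> > >       load:NXM_NX_TUN_IPV4_SRC[]->NXM_NX_TUN_IPV4_DST[]),\
> > >     resubmit(,31)"
> > >   # Misses in table 31 fall through to a flood table.
> > >   ovs-ofctl add-flow br-int "table=31,priority=0,actions=resubmit(,32)"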
>
> Would that work for non-GENEVE networks (localnet) when there is no VNI?
> Does that apply as well?
>
>
> > >
> > > >
> > > > > Now back to the distributed MAC learning idea itself.
> > > > > Essentially for two VMs/pods to communicate on L2, say,
> > > > > VM1@Chassis1 needs to send a packet to VM2@Chassis2, assuming
> > > > > VM1 already has VM2's MAC address (we will discuss this later),
> > > > > Chassis1 needs to know that VM2's MAC is located on Chassis2.
> > > > >
> > > > > In OVN today this information is conveyed through:
> > > > >
> > > > > - MAC and LSP mapping (NB -> northd -> SB -> Chassis)
> > > > > - LSP and Chassis mapping (Chassis -> SB -> Chassis)
> > > > >
> > > > > In your proposal:
> > > > >
> > > > > - MAC and Chassis mapping (can be learned through initial L2
> > > > >   broadcast/flood)
> > > > >
> > > > > This indeed would avoid the control plane cost through the
> > > > > centralized components (for this L2 binding part). Given that
> > > > > today's SB OVSDB is a bottleneck, this idea may sound
> > > > > attractive. But please also take into consideration the
> > > > > improvements below that could mitigate the OVN central scale
> > > > > issue:
> > > > >
> > > > > - For MAC and LSP mapping, northd is now capable of
> > > > >   incrementally processing VIF-related L2/L3 changes, so the
> > > > >   cost of NB -> northd -> SB is very small. For SB -> Chassis,
> > > > >   a more scalable DB deployment, such as the OVSDB relays, may
> > > > >   largely help.
> > > >
> > > > But using relays will only help with read-only operations (SB ->
> > > > chassis). Write operations (from dynamically learned mac
> > > > addresses) will be equivalent.
> > > >
> > > OVSDB relay supports write operations, too. It scales better
> > > because each ovsdb-server process handles a smaller number of
> > > clients/connections. It may still perform worse when there are too
> > > many write operations from many clients, but I think it should
> > > scale better than without a relay. This is only based on my
> > > knowledge of the ovsdb-server relay; I haven't tested it at scale
> > > yet. People who have actually deployed it may comment more.
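> > >
> > > For reference, a relay is just an ovsdb-server instance pointed at
> > > the main SB server, and the chassis connect to it instead (the
> > > addresses here are hypothetical):
> > >
> > >   # On the relay node: serve OVN_Southbound, proxying transactions
> > >   # to the real SB server/cluster at 10.0.0.1.
> > >   ovsdb-server --remote=ptcp:6642 relay:OVN_Southbound:tcp:10.0.0.1:6642
> > >
> > >   # On each chassis: point ovn-controller at the relay.
> > >   ovs-vsctl set open_vswitch . external_ids:ovn-remote=tcp:10.0.1.1:6642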
> >
> > From our experience I would agree with that. I think removing the
> > large amount of updates that need to be sent out after a write is the
> > most helpful thing here. However, if you are limited by raw write
> > throughput and already hit that limit without any readers, then
> > I guess the only options are to make this decentralized or to improve
> > OVSDB.
>
> OK thanks.
>
> >
> > >
> > > > > - For LSP and Chassis mapping, the round trip through a central
> > > > >   DB obviously costs more than a direct L2 broadcast (the
> > > > >   targets are the same). But this can be optimized if the MAC
> > > > >   and Chassis are known by the CMS (which is true for most
> > > > >   OpenStack/k8s environments, I believe). Instead of updating
> > > > >   the binding from each Chassis, the CMS can convey this
> > > > >   information through the same NB -> northd -> SB -> Chassis
> > > > >   path, and the Chassis can just read the SB without updating it.
> > > >
> > > > This is only applicable for known L2 addresses.  Maybe telco use
> > > > cases are very specific, but being able to have ports that send
> > > > packets from unknown addresses is a strong requirement.
> > > >
> > > Understand. But in terms of scale, the assumption is that the
> > > majority of ports' addresses are known by the CMS. Is this
> > > assumption correct in telco use cases?
>
> I will reach out to our field teams to try and get an answer to this
> question.
>
> > >
> > > > > On the other hand, the dynamic MAC learning approach has its own
> > > > > drawbacks.
> > > > >
> > > > > - It is simple to consider L2 only, but when considering more
> > > > >   SDN features, a central DB is more flexible to extend and
> > > > >   implement new features than a network-protocol-based approach.
> > > > > - It is more predictable and easier to debug with pre-populated
> > > > >   information from the CMS than with states learned dynamically
> > > > >   in the data plane.
> > > > > - With the DB approach we can suppress most L2
> > > > >   broadcast/flooding, while with distributed MAC learning
> > > > >   broadcast/flooding can't be avoided. Although it may happen
> > > > >   mostly when a new workload is launched, it can also happen
> > > > >   when entries age out. The cost of broadcast in a large L2
> > > > >   domain is also a potential threat to scale.
> > > >
> > > > I may lack the field experience of operating large datacenter
> > > > networks but I was not aware of any scaling issues because of ARP
> > > > and/or other L2 broadcasts.  Is this an actual problem that was
> > > > reported by cloud/telco operators and which influenced the
> > > > centralized decisions?
> > > >
> > > I haven't heard any cloud operator report such a problem, but
> > > I have heard people express their concerns about this problem in
> > > many situations. And if you google "ARP suppression" there are lots
> > > of implementations by different vendors. I believe it is a real
> > > problem if not well managed, e.g. an extremely large L2 domain
> > > without ARP suppression. But I also believe it shouldn't be a big
> > > concern if L2 segments are small.
>
> I was not aware that it could be a real issue at scale. I was expecting
> ARP traffic to be completely negligible.
>
> >
> > I think for large L2 domains having ARP requests broadcast is purely
> > a question of network bandwidth. If they get dropped before they
> > reach the destination they can just be retried, and during normal
> > communication the ARP cache will not expire anyway.
> >
> > However, this is different if you have some kind of L2-based
> > failover, e.g. if you move a MAC address between hosts and then send
> > out GARPs. In this case you must be certain that these GARPs have
> > reached every single switch for it to update its forwarding table.
> > Without central state this is impossible to guarantee since packets
> > might be randomly dropped. With central state you need just one node
> > to pick up the information and write it to some central store (just
> > like it works for virtual Port_Bindings, IIRC).
> >
> > Note that this is just one benefit of the central store, which also
> > has the drawbacks that you already mentioned.
>
> Are you thinking about MLAG?

No, the central store is what we currently have with e.g. the
Port_Binding table in the southbound DB (at least from my
understanding).

>
>
> >
> > >
> > > > > > Use multicast for overlay networks
> > > > > > ==================================
> > > > > >
> > > > > > Use a unique 24-bit VNI per overlay network. Derive
> > > > > > a multicast group address from that VNI. Use VXLAN address
> > > > > > learning [2] to remove the need for ovn-controller to know the
> > > > > > destination chassis for every MAC address in advance.
> > > > > >
> > > > > > [2] https://datatracker.ietf.org/doc/html/rfc7348#section-4.2
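> > > > > >
> > > > > > As a sketch, this is roughly what the kernel VXLAN device
> > > > > > already offers today (as far as I know, OVS's own vxlan ports
> > > > > > do not do multicast-based learning); here VNI 100 is mapped to
> > > > > > the hypothetical group 239.0.0.100:
> > > > > >
> > > > > >   # Kernel VXLAN port with multicast flood and address learning.
> > > > > >   ip link add vxlan100 type vxlan id 100 group 239.0.0.100 \
> > > > > >     dstport 4789 dev eth0 learning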
> > > > > >
> > > > > > Pros:
> > > > > >
> > > > > > - Nodes do not need to know about others in advance. The
> > > > > >   control plane load is distributed across the cluster.
> > > > >
> > > > > I don't think that nodes knowing each other (at node/chassis
> > > > > level) in advance is a big scaling problem. Thinking about the
> > > > > 10k nodes scale, it is just 10k entries on each node. And node
> > > > > addition/removal is not a very frequent operation. So does it
> > > > > really matter?
> > > >
> > > > If I'm not mistaken, with the full mesh design, scaling to 10k
> > > > nodes implies 9999 GENEVE ports on every chassis.  Can OVS handle
> > > > that kind of setup?  Could this have an impact on datapath
> > > > performance?
> > > >
> > > I didn't test this scale myself, but according to the OVS
> > > documentation [8] (in the LIMITS section) the limit is determined
> > > only by the file descriptor limit. It is a good point regarding
> > > datapath performance. In the early days there was a statement
> > > "Performance will degrade beyond 1,024 ports per bridge due to
> > > fixed hash table sizing.", but this was removed in 2019 [9]. It
> > > would be great if someone could share a real test result at this
> > > scale (one can just simulate with a sufficient number of tunnel
> > > ports, e.g. as sketched below).
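> > >
> > > A quick way to simulate it (hypothetical port names and addresses):
> > >
> > >   # Create ~10k geneve tunnel ports and observe datapath behaviour.
> > >   for i in $(seq 1 9999); do
> > >     ovs-vsctl add-port br-int "tun$i" -- \
> > >       set interface "tun$i" type=geneve \
> > >       options:remote_ip="10.0.$((i / 256)).$((i % 256))"
> > >   done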
>
> That would be interesting to get this kind of data. I will check if we
> run such tests in our labs.
>
> > >
> > > > > If I understand correctly, the major point of this approach is
> > > > > to form the Chassis groups for BUM traffic of each L2. For
> > > > > example, for the MAC learning to work, the initial broadcast
> > > > > (usually ARP request) needs to be sent out to the group of
> > > > > Chassis that is related to that specific logical L2. However, as
> > > > > also mentioned by Felix and Frode, requiring multicast support
> > > > > in the infrastructure may exclude a lot of users.
> > > >
> > > > Please excuse my candid question, but was multicast traffic in the
> > > > fabric ever raised as a problem?
> > > >
> > > > Most (all?) top-of-rack switches have IGMP/MLD support built-in.
> > > > If that were not the case, IPv6 would not work since it requires
> > > > multicast to function properly.
> > > >
> > > Having devices that support IGMP/MLD might still be different from
> > > being willing to operate multicast. This is not my domain, so
> > > I will let people with more operator experience comment.
> > > Regarding IPv6, I think the basic IPv6 operations require
> > > multicast, but they can use well-known, static multicast addresses
> > > (e.g. ff02::1 for all nodes, or the solicited-node prefix
> > > ff02::1:ff00:0/104) that don't require the dynamic group management
> > > provided by MLD.
> > > Anyway, I just wanted to provide an alternative option that may put
> > > fewer requirements on the infrastructure.
> >
> > One of my concerns with using additional features of switches is
> > that in most cases you cannot easily fix bugs in them yourself. If
> > there is some kind of bug in OVN, I have the possibility to find and
> > fix it myself and thereby quickly resolve a potential outage. If
> > I use a feature of a switch and find issues in it, I am most of the
> > time dependent on some third party that needs to find and fix the
> > issue and then distribute the fix to me. For the normal switching and
> > routing features this concern is also valid, but they are normally
> > extremely widely used. Multicast features are generally used
> > significantly less, so for me this issue would be more significant.
>
> That is indeed a strong point in favor of doing it all in software.
>
> I was expecting IGMP/MLD snooping to be a very basic feature though.

I guess it might be. However, a lot of deployments I have heard of run
e.g. EVPN in their network underlay. In this case you not only need
IGMP/MLD support, but IGMP/MLD in combination with EVPN.

Thanks
Felix

>
> > >
> > > > > On the other hand, I would propose something else that can
> > > > > achieve the same with less cost on the central SB. We can still
> > > > > let Chassis join "multicast groups", but instead of relying on
> > > > > IP multicast, we can populate this information into the SB. It
> > > > > is different from today's LSP-Chassis mapping (port_binding) in
> > > > > the SB; it is a more coarse-grained Datapath-Chassis mapping,
> > > > > which is sufficient to support the BUM traffic for the
> > > > > distributed MAC learning purpose and (relatively) lightweight
> > > > > for the central SB.
> > > >
> > > > But it would require the chassis to clone broadcast traffic
> > > > explicitly in OVS.  The main benefit of using IP multicast is that
> > > > BUM traffic duplication is handled by the fabric switches.
> > > >
> > > Agreed.
> > >
> > > > > In addition, if you still need distributed L3 routing, each
> > > > > node not only has to join the L2 groups that have workloads
> > > > > running locally, but also needs to join indirectly connected L2
> > > > > groups (e.g. LS1 - LR - LS2) to receive broadcasts and perform
> > > > > MAC learning for L3-connected remotes. The "state" learned by
> > > > > each chassis should be no different from the one achieved by
> > > > > conditional monitoring (ovn-monitor-all=false).
> > > > >
> > > > > Overall, for the above two points, the primary goal is to reduce
> > > > > dependence on the centralized control plane (especially SB DB).
> > > > > I think it may be worth a prototype (not a small change) for
> > > > > special use cases that require extremely large scale but
> > > > > simpler features (and without a big concern about L2 flooding),
> > > > > as a good tradeoff.
> > > > >
> > > > > I'd also like to point out that the L2-related scale issue is
> > > > > more relevant to OpenStack; it is not a problem for Kubernetes
> > > > > (at least not for ovn-kubernetes). ovn-kubernetes solves the
> > > > > problem by using L3 routing instead of L2. L2 is confined within
> > > > > each node, and between the nodes there are only routes exchanged
> > > > > (through SB DB), which is O(N) (N = nodes) instead of O(P) (P
> > > > > = ports). This is discussed in "Trade IP mobility for
> > > > > scalability" (page 4 - 13) of my presentation in OVSCON2021 [7].
> > > > >
> > > > > Also remember that all the other features still require
> > > > > a centralized DB, including L3 routing, NAT, ACL, LB, and so on.
> > > > > SB DB optimizations (such as using relay) may still be required
> > > > > when scaling to 10k nodes.
> > > >
> > > > I agree that this is not a small change :) And for it to be worth
> > > > it, it would probably need to go along with removing the
> > > > southbound to push the decentralization further.
> > > >
> > > I'd rather consider them separate prototypes, because they are
> > > somehow independent changes.
> > >
> > > > > > Connect ovn-controller to the northbound DB
> > > > > > ===========================================
> > > > > >
> > > > > > This idea extends a previous proposal to migrate logical
> > > > > > flow creation into ovn-controller [3].
> > > > > >
> > > > > > [3] 
> > > > > > https://patchwork.ozlabs.org/project/ovn/patch/20210625233130.3347463-1-num...@ovn.org/
> > > > > >
> > > > > > If the first two proposals are implemented, the southbound
> > > > > > database can be removed from the picture. ovn-controller can
> > > > > > directly translate the northbound schema into OVS
> > > > > > configuration bridges, ports and flow rules.
> > >
> > > I forgot to mention in my earlier reply that I don't think "if the
> > > first two proposals are implemented" matters here.
> > > Firstly, the first two proposals are primarily about L2
> > > distribution, which is a small part (both in code base and
> > > features) of OVN. Most other features still rely on the central DB.
> > > Secondly, even without the first two proposals, it is still a valid
> > > attempt to remove the SB (primarily removing the logical flow
> > > layer). The L2 OVS flows, together with all the other flows for
> > > other features, can still be generated by ovn-controller according
> > > to a central DB (probably a combined DB of the current NB and SB).
>
> OK, that makes sense.
>
> Thanks folks!
>
>
>
> > >
> > > > > >
> > > > > > For other components that require access to the southbound DB
> > > > > > (e.g. neutron metadata agent), ovn-controller should provide
> > > > > > an interface to expose state and configuration data for local
> > > > > > consumption.
> > > > > >
> > > > > > All state information present in the NB DB should be moved to
> > > > > > a separate state database [4] for CMS consumption.
> > > > > >
> > > > > > [4] 
> > > > > > https://mail.openvswitch.org/pipermail/ovs-dev/2023-April/403675.html
> > > > > >
> > > > > > For those who like visuals, I have started working on basic
> > > > > > use cases and how they would be implemented without
> > > > > > a southbound database [5].
> > > > > >
> > > > > > [5] https://link.excalidraw.com/p/readonly/jwZgJlPe4zhGF8lE5yY3
> > > > > >
> > > > > > Pros:
> > > > > >
> > > > > > - The northbound DB is smaller by design: reduced network
> > > > > >   bandwidth and memory usage in all chassis.
> > > > > > - If we keep the northbound read-only for ovn-controller, it
> > > > > >   removes scaling issues when one controller updates one row
> > > > > >   that needs to be replicated everywhere.
> > > > > > - The northbound schema knows nothing about flows. We could
> > > > > >   introduce alternative dataplane backends configured by
> > > > > >   ovn-controller via plugins. I have done a minimal PoC to
> > > > > >   check if it could work with the linux network stack [6].
> > > > > >
> > > > > > [6] 
> > > > > > https://github.com/rjarry/ovn-nb-agent/blob/main/backend/linux/bridge.go
> > > > > >
> > > > > > Cons:
> > > > > >
> > > > > > - This would be a serious API breakage for systems that depend
> > > > > >   on the southbound DB.
> > > > > > - Can all OVN constructs be implemented without a southbound
> > > > > >   DB?
> > > > > > - Is the community interested in alternative datapaths?
> > > > > >
> > > > >
> > > > > This idea was also discussed briefly in [7] (pages 16-17). The
> > > > > main motivation was to avoid the cost of the intermediate
> > > > > logical flow layer. The above-mentioned patch was abandoned
> > > > > because it kept the logical flow translation layer, just moved
> > > > > from northd to ovn-controller. The major benefit of the logical
> > > > > flow layer in northd is that it performs common calculations
> > > > > that are required for every (or many) chassis at once, so that
> > > > > they don't need to be repeated on each chassis. It is also very
> > > > > helpful for troubleshooting. However, the logical flow layer
> > > > > itself has a significant cost.
> > > > >
> > > > > There have been lots of improvements to reduce this cost, e.g.:
> > > > >
> > > > > - incremental lflow processing in ovn-controller and partially
> > > > >   in ovn-northd
> > > > > - offloading node-local flow generation to ovn-controller (such
> > > > >   as port-security, LB hairpin flows, etc.)
> > > > > - flow-tagging
> > > > > - ...
> > > > >
> > > > > Since then the motivation to remove northd/SB has diminished,
> > > > > but it is still a valid alternative (with its pros and cons).
> > > > > (I believe this change is even bigger than the distributed MAC
> > > > > learning one, but prototypes are always welcome.)
> > > >
> > > > I must be honest here and admit that I would not know where to
> > > > start for prototyping such a change in the OVN code base. This is
> > > > the reason why I reached out to the community to see if my ideas
> > > > (or at least some of them) make sense to others.
> > > >
> > >
> > > It is just a big change, almost a rewrite of OVN. The major part
> > > of the code in ovn-northd generates logical flows, and
> > > a significant part of the code in ovn-controller translates logical
> > > flows to OVS flows. I would not be surprised if this became
> > > a different project altogether (while it is true that some parts of
> > > ovn-controller could still be reused, such as physical.c,
> > > chassis.c, encap.c, etc.).
> > >
> > > Thanks,
> > > Han
> > >
> > > [8] http://www.openvswitch.org/support/dist-docs/ovs-vswitchd.8.txt
> > > [9] 
> > > https://github.com/openvswitch/ovs/commit/4224b9cf8fdba23fa35c1894eae42fd953a3780b
> > >
> > > > > Regarding the alternative datapath, I personally don't think
> > > > > it is a strong argument here. OVN with its NB schema alone (and
> > > > > OVSDB itself) is not an obvious advantage compared with other
> > > > > SDN solutions. OVN exists primarily to program OVS (or any
> > > > > OpenFlow-based datapath). The logical flow table (and other SB
> > > > > data) exists primarily for this purpose. If someone finds
> > > > > another datapath more attractive for their use cases than
> > > > > OVS/OpenFlow, it is probably better to switch to that
> > > > > datapath's own control plane (probably using a more
> > > > > popular/scalable database with its own schema).
> > > > >
> > > > > Best regards,
> > > > > Han
> > > > >
> > > > > [7] 
> > > > > https://www.openvswitch.org/support/ovscon2021/slides/scale_ovn.pdf
> > > >
> > > > This is a fair point and I understand your position.
>