Hi all,

Felix Huettner, Oct 02, 2023 at 09:35:
> Hi everyone,
>
> just want to add my experience below
> On Sun, Oct 01, 2023 at 03:49:31PM -0700, Han Zhou wrote:
> > On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry <rja...@redhat.com> wrote:
> > >
> > > Hi Han,
> > >
> > > Please see my comments/questions inline.
> > >
> > > Han Zhou, Sep 30, 2023 at 21:59:
> > > > > Distributed mac learning
> > > > > ========================
> > > > >
> > > > > Use one OVS bridge per logical switch with mac learning
> > > > > enabled. Only create the bridge if the logical switch has
> > > > > a port bound to the local chassis.
> > > > >
> > > > > Pros:
> > > > >
> > > > > - Minimal openflow rules required in each bridge (ACLs and NAT
> > > > >   mostly).
> > > > > - No central mac binding table required.
> > > >
> > > > Firstly to clarify the terminology of "mac binding" to avoid
> > > > confusion, the mac_binding table currently in SB DB has nothing
> > > > to do with L2 MAC learning. It is actually the ARP/Neighbor
> > > > table of distributed logical routers. We should probably call it
> > > > IP_MAC_binding table, or just Neighbor table.
> > >
> > > Yes sorry about the confusion. I actually meant the FDB table.
> > >
> > > > Here what you mean is actually L2 MAC learning, which today is
> > > > implemented by the FDB table in SB DB, and it is only for
> > > > uncommon use cases when the NB doesn't have the knowledge of
> > > > a MAC address of a VIF.
> > >
> > > This is not that uncommon in telco use cases where VNFs can send
> > > packets from mac addresses unknown to OVN.
> > >
> > Understood, but VNFs contribute a very small portion of the
> > workloads, right? Maybe I should rephrase that: it is uncommon to
> > have "unknown" addresses for the majority of ports in a large-scale
> > cloud. Is this understanding correct?
>
> I can only share numbers for our use case: with ~650 chassis we have
> the following distribution of "unknown" in the `addresses` field of
> Logical_Switch_Port:
> * 23000 with a mac address + ip and without "unknown"
> * 250 with a mac address + ip and with "unknown"
> * 30 with just "unknown"
>
> The use case is a generic public cloud and we do not have any
> telco-related workloads.

I don't have any numbers from telco deployments at hand but I will poke
around.
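
For reference, if anyone wants to extract comparable numbers from their
own deployment, something along these lines should work (written from
memory, the exact representation of the addresses column may need
adjusting):

  # count logical switch ports whose addresses contain "unknown"
  ovn-nbctl --format=csv --columns=addresses list Logical_Switch_Port \
      | grep -c unknown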

> > > > The purpose of this proposal is clear - to avoid using a central
> > > > table in DB for L2 information but instead using L2 MAC learning
> > > > to populate such information on chassis, which is a reasonable
> > > > alternative with pros and cons.
> > > > However, I don't think it is necessary to use separate OVS
> > > > bridges for this purpose. L2 MAC learning can be easily
> > > > implemented in the br-int bridge with OVS flows, which is much
> > > > simpler than managing a dynamic number of OVS bridges just for
> > > > the purpose of using the built-in OVS mac-learning.
> > >
> > > I agree that this could also be implemented with VLAN tags on the
> > > appropriate ports. But since OVS does not support trunk ports, it
> > > may require complicated OF pipelines. My intent with this idea was
> > > twofold:
> > >
> > > 1) Avoid a central point of failure for mac learning/aging.
> > > 2) Simplify the OF pipeline by making all FDB operations dynamic.
> >
> > IMHO, the L2 pipeline is not really complex. It is probably the
> > simplest part (compared with other features for L3, NAT, ACL, LB,
> > etc.). Adding dynamic learning to this part probably makes it *a
> > little* more complex, but should still be straightforward. We don't
> > need any VLAN tag because the incoming packet has the geneve VNI in
> > the metadata. We just need a flow that resubmits to look up
> > a MAC-tunnelSrc mapping table, and injects a new flow (with the
> > related tunnel endpoint information) if the src MAC is not found,
> > with the help of the "learn" action. The entries are
> > per-logical_switch (VNI). This would serve your purpose of avoiding
> > a central DB for L2. At least this looks much simpler to me than
> > managing a dynamic number of OVS bridges and the patch port pairs
> > between them.
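
To make sure I understand, the flow you describe would look roughly
like this in ovs-ofctl syntax (an untested sketch on my side, table
numbers and timeouts arbitrary, line-wrapped for readability, and
hand-waving over priorities and local delivery):

  table=10, priority=10,
    actions=learn(table=20, hard_timeout=300,
                  NXM_NX_TUN_ID[],
                  NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],
                  load:NXM_NX_TUN_IPV4_SRC[]->NXM_NX_TUN_IPV4_DST[]),
            resubmit(,20)

Table 20 would then hold the learned mac -> tunnel endpoint entries
(each of which also needs to output to a flow-based tunnel port), and
a miss in table 20 would fall back to flooding the logical switch.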

Would that work for non-GENEVE networks (localnet) where there is no
VNI? Does the same approach apply there as well?


> >
> > >
> > > > Now back to the distributed MAC learning idea itself.
> > > > Essentially for two VMs/pods to communicate on L2, say,
> > > > VM1@Chassis1 needs to send a packet to VM2@Chassis2, assuming
> > > > VM1 already has VM2's MAC address (we will discuss this later),
> > > > Chassis1 needs to know that VM2's MAC is located on Chassis2.
> > > >
> > > > In OVN today this information is conveyed through:
> > > >
> > > > - MAC and LSP mapping (NB -> northd -> SB -> Chassis)
> > > > - LSP and Chassis mapping (Chassis -> SB -> Chassis)
> > > >
> > > > In your proposal:
> > > >
> > > > - MAC and Chassis mapping (can be learned through initial L2
> > > >   broadcast/flood)
> > > >
> > > > This indeed would avoid the control plane cost through the
> > > > centralized components (for this L2 binding part). Given that
> > > > today's SB OVSDB is a bottleneck, this idea may sound
> > > > attractive. But please also take into consideration the
> > > > improvements below that could mitigate the OVN central scale
> > > > issue:
> > > >
> > > > - For MAC and LSP mapping, northd is now capable of
> > > >   incrementally processing VIF related L2/L3 changes, so the
> > > >   cost of NB -> northd -> SB is very small. For SB -> Chassis,
> > > >   a more scalable DB deployment, such as the OVSDB relays, may
> > > >   largely help.
> > >
> > > But using relays will only help with read-only operations (SB ->
> > > chassis). Write operations (from dynamically learned mac
> > > addresses) will be equivalent.
> > >
> > OVSDB relay supports write operations, too. It scales better because
> > each ovsdb-server process handles a smaller number of
> > clients/connections. It may still perform worse when there are too
> > many write operations from many clients, but I think it should
> > scale better than without a relay. This is only based on my
> > knowledge of the ovsdb-server relay; I haven't tested it at scale
> > yet. People who have actually deployed it may comment more.
>
> From our experience I would agree with that. I think removing the
> large amount of updates that need to be sent out after a write is the
> most helpful thing here. However, if you are limited by raw write
> throughput and already hit that limit without any readers, then
> I guess the only options are to make this decentralized or to improve
> OVSDB itself.

OK thanks.
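
For my own notes, my understanding is that putting a relay in front of
the SB boils down to something like the following (from memory, the
exact options should be double-checked against ovsdb-server(1)):

  # on a relay node, pointing at the central SB database
  ovsdb-server --remote=ptcp:6642 relay:OVN_Southbound:tcp:<sb-ip>:6642

with each ovn-controller then pointed at a relay through the usual
external_ids:ovn-remote setting.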

>
> >
> > > > - For LSP and Chassis mapping, the round trip through a central
> > > >   DB obviously costs higher than a direct L2 broadcast (the
> > > >   targets are the same). But this can be optimized if the MAC
> > > >   and Chassis are known by the CMS system (which is true for most
> > > >   openstack/k8s env I believe). Instead of updating the binding
> > > >   from each Chassis, CMS can tell this information through the
> > > >   same NB -> northd -> SB -> Chassis path, and the Chassis can
> > > >   just read the SB without updating it.
> > >
> > > This is only applicable for known L2 addresses.  Maybe telco use
> > > cases are very specific, but being able to have ports that send
> > > packets from unknown addresses is a strong requirement.
> > >
> > Understand. But in terms of scale, the assumption is that the
> > majority of ports' addresses are known by the CMS. Is this assumption
> > correct in telco use cases?

I will reach out to our field teams to try and get an answer to this
question.

> >
> > > > On the other hand, the dynamic MAC learning approach has its own
> > > > drawbacks.
> > > >
> > > > - It is simple to consider L2 only, but when considering more
> > > >   SDN features, a central DB is more flexible to extend and
> > > >   implement new features than a network-protocol-based approach.
> > > > - It is more predictable and easier to debug with pre-populated
> > > >   information through CMS than states learned dynamically in
> > > >   data-plane.
> > > > - With the DB approach we can suppress most of L2
> > > >   broadcast/flood, while with the distributed MAC learning
> > > >   broadcast/flood can't be avoided. Although it may happen
> > > >   mostly when a new workload is launched, it can also happen
> > > >   when aging. The cost of broadcast in large L2 is also
> > > >   a potential threat to scale.
> > >
> > > I may lack the field experience of operating large datacenter
> > > networks but I was not aware of any scaling issues because of ARP
> > > and/or other L2 broadcasts.  Is this an actual problem that was
> > > reported by cloud/telco operators and which influenced the
> > > centralized decisions?
> > >
> > I didn't hear any cloud operator reporting such a problem, but
> > I did hear people express their concerns about it in many
> > situations. And if you google "ARP suppression" there are lots of
> > implementations by different vendors. I believe it is a real
> > problem if not well managed, e.g. when using an extremely large L2
> > domain without ARP suppression. But I also believe it shouldn't be
> > a big concern if L2 segments are small.

I was not aware that it could be a real issue at scale. I was expecting
ARP traffic to be completely negligible.

>
> I think for large L2 domains, broadcasting ARP requests is purely
> a question of network bandwidth. If they get dropped before they
> reach the destination they can just be retried, and in normal
> communication the ARP cache will not expire anyway.
>
> However, this is different if you have some kind of L2-based
> failover, e.g. if you move a mac address between hosts and then send
> out GARPs. In this case you must be certain that these GARPs have
> reached every single switch for it to update its forwarding table.
> Without a central state this is impossible to guarantee since packets
> might be randomly dropped. With a central state you need just one
> node to pick up the information and write it to some central store
> (just like it works for virtual Port_Bindings, IIRC).
>
> Note that this would be just a benefit of the central store which also
> has drawbacks that you already mentioned.

Are you thinking about MLAG?


>
> >
> > > > > Use multicast for overlay networks
> > > > > ==================================
> > > > >
> > > > > Use a unique 24bit VNI per overlay network. Derive a multicast
> > > > > group address from that VNI. Use VXLAN address learning [2] to
> > > > > remove the need for ovn-controller to know the destination
> > > > > chassis for every mac address in advance.
> > > > >
> > > > > [2] https://datatracker.ietf.org/doc/html/rfc7348#section-4.2
> > > > >
> > > > > Pros:
> > > > >
> > > > > - Nodes do not need to know about others in advance. The
> > > > >   control plane load is distributed across the cluster.
> > > >
> > > > I don't think that nodes knowing each other (at node/chassis
> > > > level) in advance is a big scaling problem. Thinking about the
> > > > 10k nodes scale, it is just 10k entries on each node. And node
> > > > addition/removal is not a very frequent operation. So does it
> > > > really matter?
> > >
> > > If I'm not mistaken, with the full mesh design, scaling to 10k
> > > nodes implies 9999 GENEVE ports on every chassis.  Can OVS handle
> > > that kind of setup?  Could this have an impact on datapath
> > > performance?
> > >
> > I didn't test this scale myself, but according to the OVS
> > documentation [8] (in the LIMITS section) the limit is determined
> > only by the number of available file descriptors. It is a good
> > point regarding datapath performance. In the early days there was
> > a statement "Performance will degrade beyond 1,024 ports per bridge
> > due to fixed hash table sizing.", but this was removed in 2019 [9].
> > It would be great if someone could share a real test result at this
> > scale (it can just be simulated with a large enough number of
> > tunnel ports).

It would be interesting to get this kind of data. I will check if we
run such tests in our labs.
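
If nothing turns up, a crude way to get a first datapoint could be to
create a few thousand tunnel ports by hand on a scratch bridge and
measure from there (sketch only, bridge name and addressing are
arbitrary):

  for i in $(seq 1 5000); do
      ovs-vsctl add-port br-scale tun$i -- \
          set interface tun$i type=geneve \
          options:remote_ip=10.$(( i / 250 % 250 )).$(( i % 250 )).1
  done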

> >
> > > > If I understand correctly, the major point of this approach is
> > > > to form the Chassis groups for BUM traffic of each L2. For
> > > > example, for the MAC learning to work, the initial broadcast
> > > > (usually ARP request) needs to be sent out to the group of
> > > > Chassis that is related to that specific logical L2. However, as
> > > > also mentioned by Felix and Frode, requiring multicast support
> > > > in infrastructure may exclude a lot of users.
> > >
> > > Please excuse my candid question, but was multicast traffic in the
> > > fabric ever raised as a problem?
> > >
> > > Most (all?) top of rack switches have IGMP/MLD support built-in.
> > > If that was not the case, IPv6 would not work since it requires
> > > multicast to function properly.
> > >
> > Having devices that support IGMP/MLD might still be different from
> > being willing to operate multicast. This is not my domain, so
> > I would let people with more operator experience comment.
> > Regarding IPv6, I think the basic IPv6 operations require multicast,
> > but they can use well-defined, static multicast addresses that don't
> > require the dynamic group management provided by MLD.
> > Anyway, I just wanted to provide an alternative option that may put
> > fewer requirements on the infrastructure.
>
> One of my concerns with using additional features of switches is that
> in most cases you cannot easily fix bugs in them yourself. If there is
> some kind of bug in OVN I have the possibility to find and fix it
> myself and thereby quickly resolve a potential outage. If I use
> a feature of a switch and find issues there, I am most of the time
> dependent on some third party that needs to find and fix the issue
> and then distribute the fix to me. For the normal switching and
> routing features this concern is also valid, but they are normally
> extremely widely used. Multicast features are generally used
> significantly less, so for me this issue would be more significant.

That is indeed a strong point in favor of doing it all in software.

I was expecting IGMP/MLD snooping to be a very basic feature though.
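
For clarity, the VNI to multicast group mapping I had in mind was
nothing fancy, just one possible scheme: embed the 24-bit VNI into the
administratively scoped 239.0.0.0/8 range, for example:

  vni=4660   # example VNI (0x001234)
  group=239.$(( (vni >> 16) & 0xff )).$(( (vni >> 8) & 0xff )).$(( vni & 0xff ))

so that every chassis can compute the group locally without any extra
coordination. The exact range is of course open for discussion.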

> >
> > > > On the other hand, I would propose something else that can
> > > > achieve the same with less cost on the central SB. We can still
> > > > let Chassis join "multicast groups", but instead of relying on
> > > > IP multicast, we can populate this information in the SB. It is
> > > > different from today's LSP-Chassis mapping (port_binding) in SB,
> > > > but a more coarse-grained Datapath-Chassis mapping, which is
> > > > sufficient to support the BUM traffic for the distributed MAC
> > > > learning purpose and relatively lightweight for the central SB.
> > >
> > > But it would require the chassis to clone broadcast traffic
> > > explicitly in OVS.  The main benefit of using IP multicast is that
> > > BUM traffic duplication is handled by the fabric switches.
> > >
> > Agreed.
> >
> > > > In addition, if you still need distributed L3 routing, each node
> > > > not only has to join the L2 groups that have workloads running
> > > > locally, but also needs to join indirectly connected L2 groups
> > > > (e.g. LS1 - LR - LS2) to receive the broadcasts needed to
> > > > perform MAC learning for L3-connected remotes. The "state"
> > > > learned by each chassis should be no different from what is
> > > > achieved by conditional monitoring (ovn-monitor-all=false).
> > > >
> > > > Overall, for the above two points, the primary goal is to reduce
> > > > dependence on the centralized control plane (especially SB DB).
> > > > I think it may be worth some prototype (not a small change) for
> > > > special use cases that require extremely large scale but simpler
> > > > features (and without a big concern of L2 flooding) for a good
> > > > tradeoff.
> > > >
> > > > I'd also like to remind that the L2 related scale issue is more
> > > > relevant to OpenStack, but it is not a problem for kubernetes
> > > > (at least not for ovn-kubernetes). ovn-kubernetes solves the
> > > > problem by using L3 routing instead of L2. L2 is confined within
> > > > each node, and between the nodes there are only routes exchanged
> > > > (through SB DB), which is O(N) (N = nodes) instead of O(P) (P
> > > > = ports). This is discussed in "Trade IP mobility for
> > > > scalability" (page 4 - 13) of my presentation in OVSCON2021 [7].
> > > >
> > > > Also remember that all the other features still require
> > > > centralized DB, including L3 routing, NAT, ACL, LB, and so on.
> > > > SB DB optimizations (such as using relay) may still be required
> > > > when scaling to 10k nodes.
> > >
> > > I agree that this is not a small change :) And for it to be worth
> > > it, it would probably need to go along with removing the
> > > southbound to push the decentralization further.
> > >
> > I'd rather consider them separate prototypes, because they are
> > somehow independent changes.
> >
> > > > > Connect ovn-controller to the northbound DB
> > > > > ===========================================
> > > > >
> > > > > This idea extends on a previous proposal to migrate the
> > > > > logical flows creation in ovn-controller [3].
> > > > >
> > > > > [3] 
> > > > > https://patchwork.ozlabs.org/project/ovn/patch/20210625233130.3347463-1-num...@ovn.org/
> > > > >
> > > > > If the first two proposals are implemented, the southbound
> > > > > database can be removed from the picture. ovn-controller can
> > > > > directly translate the northbound schema into OVS
> > > > > configuration bridges, ports and flow rules.
> >
> > I forgot to mention in my earlier reply that I don't think "If the
> > first two proposals are implemented" matters here.
> > Firstly, the first two proposals are primarily for L2 distribution,
> > which is a small part (both in code base and in features) of OVN.
> > Most other features still rely on the central DB.
> > Secondly, even without the first two proposals, it is still a valid
> > attempt to remove the SB (primarily removing the logical flow
> > layer). The L2 OVS flows, together with all the other flows for
> > other features, can still be generated by ovn-controller according
> > to the central DB (probably a combined DB of the current NB and SB).

OK, that makes sense.

Thanks folks!



> >
> > > > >
> > > > > For other components that require access to the southbound DB
> > > > > (e.g. neutron metadata agent), ovn-controller should provide
> > > > > an interface to expose state and configuration data for local
> > > > > consumption.
> > > > >
> > > > > All state information present in the NB DB should be moved to
> > > > > a separate state database [4] for CMS consumption.
> > > > >
> > > > > [4] 
> > > > > https://mail.openvswitch.org/pipermail/ovs-dev/2023-April/403675.html
> > > > >
> > > > > For those who like visuals, I have started working on basic
> > > > > use cases and how they would be implemented without
> > > > > a southbound database [5].
> > > > >
> > > > > [5] https://link.excalidraw.com/p/readonly/jwZgJlPe4zhGF8lE5yY3
> > > > >
> > > > > Pros:
> > > > >
> > > > > - The northbound DB is smaller by design: reduced network
> > > > >   bandwidth and memory usage in all chassis.
> > > > > - If we keep the northbound read-only for ovn-controller, it
> > > > >   removes scaling issues when one controller updates one row
> > > > >   that needs to be replicated everywhere.
> > > > > - The northbound schema knows nothing about flows. We could
> > > > >   introduce alternative dataplane backends configured by
> > > > >   ovn-controller via plugins. I have done a minimal PoC to
> > > > >   check if it could work with the linux network stack [6].
> > > > >
> > > > > [6] 
> > > > > https://github.com/rjarry/ovn-nb-agent/blob/main/backend/linux/bridge.go
> > > > >
> > > > > Cons:
> > > > >
> > > > > - This would be a serious API breakage for systems that depend
> > > > >   on the southbound DB.
> > > > > - Can all OVN constructs be implemented without a southbound
> > > > >   DB?
> > > > > - Is the community interested in alternative datapaths?
> > > > >
> > > >
> > > > This idea was also discussed briefly in [7] (page 16-17). The
> > > > main motivation was to avoid the cost of the intermediate
> > > > logical flow layer. The above-mentioned patch was abandoned
> > > > because it kept the logical flow translation layer, just moved
> > > > from northd to ovn-controller. The major benefit of the logical
> > > > flow layer in northd is that it performs, once, the common
> > > > calculations that are required for every (or many) chassis, so
> > > > that they don't need to be repeated on each chassis. It is also
> > > > very helpful for troubleshooting. However, the logical flow
> > > > layer itself has a significant cost.
> > > >
> > > > There has been lots of improvement done against the cost, e.g.:
> > > >
> > > > - incremental lflow processing in ovn-controller and partially
> > > >   in ovn-northd
> > > > - offloading node-local flow generation in ovn-controller (such
> > > >   as port-security, LB hairpin flows, etc.)
> > > > - flow-tagging
> > > > - ...
> > > >
> > > > Since then the motivation to remove north/SB has reduced, but it
> > > > is still a valid alternative (with its pros and cons).
> > > > (I believe this change is even bigger than the distributed MAC
> > > > learning, but prototypes are always welcome)
> > >
> > > I must be honest here and admit that I would not know where to
> > > start for prototyping such a change in the OVN code base. This is
> > > the reason why I reached out to the community to see if my ideas
> > > (or at least some of them) make sense to others.
> > >
> >
> > It is just a big change, almost a rewrite of OVN. The major part
> > of the code in ovn-northd generates logical flows, and
> > a significant part of the code in ovn-controller translates
> > logical flows to OVS flows. I would not be surprised if this
> > became just a different project (while it is true that some parts
> > of ovn-controller could still be reused, such as physical.c,
> > chassis.c, encap.c, etc.).
> >
> > Thanks,
> > Han
> >
> > [8] http://www.openvswitch.org/support/dist-docs/ovs-vswitchd.8.txt
> > [9] 
> > https://github.com/openvswitch/ovs/commit/4224b9cf8fdba23fa35c1894eae42fd953a3780b
> >
> > > > Regarding the alternative datapath, I personally don't think it
> > > > is a strong argument here. OVN with its NB schema alone (and
> > > > OVSDB itself) is not an obvious advantage compared with other
> > > > SDN solutions. OVN exists primarily to program OVS (or any
> > > > OpenFlow based datapath). Logical flow table (and other SB data)
> > > > exists primarily for this purpose. If someone finds another
> > > > datapath is more attractive to their use cases than
> > > > OVS/OpenFlow, it is probably better to switch to its own control
> > > > plane (probably using a more popular/scalable database with
> > > > their own schema).
> > > >
> > > > Best regards,
> > > > Han
> > > >
> > > > [7] https://www.openvswitch.org/support/ovscon2021/slides/scale_ovn.pdf
> > >
> > > This is a fair point and I understand your position.
