Hi everyone,

I just want to add my experience inline below.
On Sun, Oct 01, 2023 at 03:49:31PM -0700, Han Zhou wrote:
> On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry <rja...@redhat.com> wrote:
> >
> > Hi Han,
> >
> > Please see my comments/questions inline.
> >
> > Han Zhou, Sep 30, 2023 at 21:59:
> > > > Distributed mac learning
> > > > ========================
> > > >
> > > > Use one OVS bridge per logical switch with mac learning enabled. Only
> > > > create the bridge if the logical switch has a port bound to the local
> > > > chassis.
> > > >
> > > > Pros:
> > > >
> > > > - Minimal openflow rules required in each bridge (ACLs and NAT
> mostly).
> > > > - No central mac binding table required.
> > >
> > > Firstly to clarify the terminology of "mac binding" to avoid confusion,
> the
> > > mac_binding table currently in SB DB has nothing to do with L2 MAC
> > > learning. It is actually the ARP/Neighbor table of distributed logical
> > > routers. We should probably call it IP_MAC_binding table, or just
> Neighbor
> > > table.
> >
> > Yes sorry about the confusion. I actually meant the FDB table.
> >
> > > Here what you mean is actually L2 MAC learning, which today is
> implemented
> > > by the FDB table in SB DB, and it is only for uncommon use cases when
> the
> > > NB doesn't have the knowledge of a MAC address of a VIF.
> >
> > This is not that uncommon in telco use cases where VNFs can send packets
> > from mac addresses unknown to OVN.
> >
> Understand, but VNFs contributes a very small portion of the workloads,
> right? Maybe I should rephrase that: it is uncommon to have "unknown"
> addresses for the majority of ports in a large scale cloud. Is this
> understanding correct?

I can only share numbers for our use case: with ~650 chassis we have
the following distribution of "unknown" in the `addresses` field of
Logical_Switch_Port:
* 23000 ports with a MAC address + IP and without "unknown"
* 250 ports with a MAC address + IP and with "unknown"
* 30 ports with just "unknown"

Our use case is a generic public cloud; we do not run any telco-related
workloads.
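To put those counts in proportion, a quick back-of-the-envelope
calculation:

```python
# Rough proportions of the LSP address distribution quoted above
# (rounded counts from our ~650-chassis deployment).
known_only = 23000      # mac + ip, without "unknown"
mac_plus_unknown = 250  # mac + ip, with "unknown"
unknown_only = 30       # just "unknown"

total = known_only + mac_plus_unknown + unknown_only
unknown_share = (mac_plus_unknown + unknown_only) / total
print(f"{unknown_share:.2%} of ports involve 'unknown'")  # about 1.2%
```

So "unknown" is indeed a small minority here, but it is not zero and we
still have to support it.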

>
> > > The purpose of this proposal is clear - to avoid using a central table
> in
> > > DB for L2 information but instead using L2 MAC learning to populate such
> > > information on chassis, which is a reasonable alternative with pros and
> > > cons.
> > > However, I don't think it is necessary to use separate OVS bridges for
> this
> > > purpose. L2 MAC learning can be easily implemented in the br-int bridge
> > > with OVS flows, which is much simpler than managing dynamic number of
> OVS
> > > bridges just for the purpose of using the builtin OVS mac-learning.
> >
> > I agree that this could also be implemented with VLAN tags on the
> > appropriate ports. But since OVS does not support trunk ports, it may
> > require complicated OF pipelines. My intent with this idea was two fold:
> >
> > 1) Avoid a central point of failure for mac learning/aging.
> > 2) Simplify the OF pipeline by making all FDB operations dynamic.
>
> IMHO, the L2 pipeline is not really complex. It is probably the simplest
> part (compared with other features for L3, NAT, ACL, LB, etc.).
> Adding dynamic learning to this part probably makes it *a little* more
> complex, but should still be straightforward. We don't need any VLAN tag
> because the incoming packet has geneve VNI in the metadata. We just need a
> flow that resubmits to lookup a MAC-tunnelSrc mapping table, and inject a
> new flow (with related tunnel endpont information) if the src MAC is not
> found, with the help of the "learn" action. The entries are
> per-logical_switch (VNI). This would serve your purpose of avoiding a
> central DB for L2. At least this looks much simpler to me than managing
> dynamic number of OVS bridges and the patch pairs between them.
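As a reference for what Han describes: the classic MAC-learning example
from the OVS documentation could be adapted to key on the tunnel
metadata. The table numbers, timeout, and field choices below are only
illustrative and untested, not a worked-out design:

```
# Illustrative sketch only: learn (tun_id, eth_src) -> tunnel-endpoint
# entries into table 11, then resubmit to look up the destination there.
table=10, priority=100,
    actions=learn(table=11, hard_timeout=300,
                  NXM_NX_TUN_ID[],
                  NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],
                  load:NXM_NX_TUN_IPV4_SRC[]->NXM_NX_TUN_IPV4_DST[]),
            resubmit(,11)
```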
>
> >
> > > Now back to the distributed MAC learning idea itself. Essentially for
> two
> > > VMs/pods to communicate on L2, say, VM1@Chassis1 needs to send a packet
> to
> > > VM2@chassis2, assuming VM1 already has VM2's MAC address (we will
> discuss
> > > this later), Chassis1 needs to know that VM2's MAC is located on
> Chassis2.
> > >
> > > In OVN today this information is conveyed through:
> > >
> > > - MAC and LSP mapping (NB -> northd -> SB -> Chassis)
> > > - LSP and Chassis mapping (Chassis -> SB -> Chassis)
> > >
> > > In your proposal:
> > >
> > > - MAC and Chassis mapping (can be learned through initial L2
> > >   broadcast/flood)
> > >
> > > This indeed would avoid the control plane cost through the centralized
> > > components (for this L2 binding part). Given that today's SB OVSDB is a
> > > bottleneck, this idea may sound attractive. But please also take into
> > > consideration the below improvement that could mitigate the OVN central
> > > scale issue:
> > >
> > > - For MAC and LSP mapping, northd is now capable of incrementally
> > >   processing VIF related L2/L3 changes, so the cost of NB -> northd ->
> > >   SB is very small. For SB -> Chassis, a more scalable DB deployment,
> > >   such as the OVSDB relays, may largely help.
> >
> > But using relays will only help with read-only operations (SB ->
> > chassis). Write operations (from dynamically learned mac addresses) will
> > be equivalent.
> >
> OVSDB relay supports write operations, too. It scales better because each
> ovsdb-server process handles smaller number of clients/connections. It may
> still perform worse when there are too many write operations from many
> clients, but I think it should scale better than without relay. This is
> only based on my knowledge of the ovsdb-server relay, but I haven't tested
> it at scale, yet. People who actually deployed it may comment more.

From our experience I would agree with that. I think removing the large
amount of updates that need to be sent out after each write is the most
helpful part here.
However, if you are limited by raw write throughput and already hit that
limit without any readers, then I guess the only options are to make
this decentralized or to improve ovsdb itself.
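For context, an SB relay process in such a setup is started roughly like
this (the leader address is a placeholder, and the exact options will
depend on the deployment):

```
ovsdb-server --remote=ptcp:6642 \
    relay:OVN_Southbound:tcp:<sb-leader-ip>:6642
```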

>
> > > - For LSP and Chassis mapping, the round trip through a central DB
> > >   obviously costs higher than a direct L2 broadcast (the targets are
> > >   the same). But this can be optimized if the MAC and Chassis is known
> > >   by the CMS system (which is true for most openstack/k8s env
> > >   I believe). Instead of updating the binding from each Chassis, CMS
> > >   can tell this information through the same NB -> northd -> SB ->
> > >   Chassis path, and the Chassis can just read the SB without updating
> > >   it.
> >
> > This is only applicable for known L2 addresses.  Maybe telco use cases
> > are very specific, but being able to have ports that send packets from
> > unknown addresses is a strong requirement.
> >
> Understand. But in terms of scale, the assumption is that the majority of
> ports' address are known by CMS. Is this assumption correct in telco use
> cases?
>
> > > On the other hand, the dynamic MAC learning approach has its own
> drawbacks.
> > >
> > > - It is simple to consider L2 only, but if considering more SDB
> > >   features, a central DB is more flexible to extend and implement new
> > >   features than a network protocol based approach.
> > > - It is more predictable and easier to debug with pre-populated
> > >   information through CMS than states learned dynamically in
> > >   data-plane.
> > > - With the DB approach we can suppress most of L2 broadcast/flood,
> > >   while with the distributed MAC learning broadcast/flood can't be
> > >   avoided. Although it may happen mostly when a new workload is
> > >   launched, it can also happen when aging. The cost of broadcast in
> > >   large L2 is also a potential threat to scale.
> >
> > I may lack the field experience of operating large datacenter networks
> > but I was not aware of any scaling issues because of ARP and/or other L2
> > broadcasts.  Is this an actual problem that was reported by cloud/telco
> > operators and which influenced the centralized decisions?
> >
> I didn't hear any cloud operator reporting such problem, but I did hear in
> many situation people expressed their concerns to this problem. And if you
> google "ARP suppression" there are lots of implementations by different
> vendors. And I believe it is a real problem if not well managed, e.g. using
> an extremely large L2 domain without ARP suppression. But I also believe it
> shouldn't be a big concern if L2 segments are small.

I think for large L2 domains, having ARP requests broadcast is purely a
question of network bandwidth. If they get dropped before they reach
the destination they can simply be retried, and during normal
communication the ARP cache will not expire anyway.
However, this is different if you have some kind of L2-based failover,
e.g. if you move a MAC address between hosts and then send out GARPs.
In this case you must be certain that these GARPs have reached every
single switch so that it updates its FDB. Without central state this is
impossible to guarantee, since packets might be randomly dropped. With
central state you need just one node to pick up the information and
write it to some central store (just like it works for virtual
Port_Bindings, IIRC).

Note that this is just one benefit of the central store, which also has
the drawbacks you already mentioned.

>
> > > > Use multicast for overlay networks
> > > > ==================================
> > > >
> > > > Use a unique 24bit VNI per overlay network. Derive a multicast group
> > > > address from that VNI. Use VXLAN address learning [2] to remove the
> need
> > > > for ovn-controller to know the destination chassis for every mac
> address
> > > > in advance.
> > > >
> > > > [2] https://datatracker.ietf.org/doc/html/rfc7348#section-4.2
> > > >
> > > > Pros:
> > > >
> > > > - Nodes do not need to know about others in advance. The control plane
> > > >   load is distributed across the cluster.
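On the VNI-to-group derivation mentioned above: this mapping is
mechanically simple. A minimal sketch, where the use of the
administratively scoped 239.0.0.0/8 range and the direct bit mapping are
my own assumptions, not part of the proposal:

```python
def vni_to_mcast_group(vni: int) -> str:
    """Map a 24-bit VNI into the administratively scoped 239.0.0.0/8
    range (assumption: one group address per VNI, bits mapped directly)."""
    if not 0 <= vni < 1 << 24:
        raise ValueError("VNI must fit in 24 bits")
    return f"239.{(vni >> 16) & 0xFF}.{(vni >> 8) & 0xFF}.{vni & 0xFF}"

# Example: VNI 0x0A0B0C maps to 239.10.11.12
print(vni_to_mcast_group(0x0A0B0C))
```

Any collision-free scheme would do; the point is only that each overlay
network gets a deterministic group address.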
> > >
> > > I don't think that nodes knowing each other (at node/chassis level) in
> > > advance is a big scaling problem. Thinking about the 10k nodes scale,
> it is
> > > just 10k entries on each node. And node addition/removal is not a very
> > > frequent operation. So does it really matter?
> >
> > If I'm not mistaken, with the full mesh design, scaling to 10k nodes
> > implies 9999 GENEVE ports on every chassis.  Can OVS handle that kind of
> > setup?  Could this have an impact on datapath performance?
> >
> I didn't test this scale myself, but according to the OVS documentation [8]
> (in the LIMITS section) the limit is determined by the file descriptor only.
> It is a good point regrading datapath performance. In the early days there
> was a statement "Performance will degrade beyond 1,024 ports per bridge due
> to fixed hash table sizing.", but this was removed in 2019 [9].
> It would be great if someone can share a real test result at this scale
> (can just simulate with enough number of tunnel ports).
>
> > > If I understand correctly, the major point of this approach is to form
> the
> > > Chassis groups for BUM traffic of each L2. For example, for the MAC
> > > learning to work, the initial broadcast (usually ARP request) needs to
> be
> > > sent out to the group of Chassis that is related to that specific
> logical
> > > L2. However, as also mentioned by Felix and Frode, requiring multicast
> > > support in infrastructure may exclude a lot of users.
> >
> > Please excuse my candid question, but was multicast traffic in the
> > fabric ever raised as a problem?
> >
> > Most (all?) top of rack switches have IGMP/MLD support built-in. If that
> > was not the case, IPv6 would not work since it requires multicast to
> > function properly.
> >
> Having devices supporting IGMP/MLD might still be different from willing to
> operate multicast. This is not my domain, so I would let people with more
> operator experience comment.
> Regarding IPv6, I think the basic IPv6 operations require multicast but it
> can use well-defined and static multicast addresses that don't require
> dynamic group management provided by MLD.
> Anyway, I just wanted to provide an alternative option that may have less
> requirement for infrastructure.

One of my concerns with using additional features of switches is that in
most cases you cannot easily fix bugs in them yourself. If there is some
kind of bug in OVN, I have the possibility to find and fix it myself and
thereby quickly resolve a potential outage.
If I use a feature of a switch and find issues in it, I am most of the
time dependent on some third party that needs to find and fix the issue
and then distribute the fix to me. For the normal switching and routing
features this concern is also valid, but those are normally extremely
widely used. Multicast features are generally used significantly less,
so for me this issue would be more significant.

Thanks
Felix

>
> > > On the other hand, I would propose something else that can achieve the
> same
> > > with less cost on the central SB. We can still let Chassis join
> "multicast
> > > groups" but instead of relying on IP mulitcast, we can populate this
> > > information to SB. It is different from today's LSP-Chassis mapping
> > > (port_binding) in SB, but a more coarse-grained mapping of
> > > Datapath-Chassis, which is sufficient to support the BUM traffic for the
> > > distributed MAC learning purpose and lightweight (relatively) to the
> > > central SB.
> >
> > But it would require the chassis to clone broadcast traffic explicitly
> > in OVS.  The main benefit of using IP multicast is that BUM traffic
> > duplication is handled by the fabric switches.
> >
> Agreed.
>
> > > In addition, if you still need L3 distributed routing, each node not
> only
> > > have to join L2 groups that has workloads running locally, but also
> needs
> > > to join indirectly connected L2 groups (e.g. LS1 - LR - LS2) to receive
> > > broadcast to perform MAC learning for L3 connected remotes. The "states"
> > > learned by each chassis should be no different than the one achieved by
> > > conditional monitoring (ovn-monitor-all=false).
> > >
> > > Overall, for the above two points, the primary goal is to reduce
> dependence
> > > on the centralized control plane (especially SB DB). I think it may be
> > > worth some prototype (not a small change) for special use cases that
> > > require extremely large scale but simpler features (and without a big
> > > concern of L2 flooding) for a good tradeoff.
> > >
> > > I'd also like to remind that the L2 related scale issue is more
> relevant to
> > > OpenStack, but it is not a problem for kubernetes (at least not for
> > > ovn-kubernetes). ovn-kubernetes solves the problem by using L3 routing
> > > instead of L2. L2 is confined within each node, and between the nodes
> there
> > > are only routes exchanged (through SB DB), which is O(N) (N = nodes)
> > > instead of O(P) (P = ports). This is discussed in "Trade IP mobility for
> > > scalability" (page 4 - 13) of my presentation in OVSCON2021 [7].
> > >
> > > Also remember that all the other features still require centralized DB,
> > > including L3 routing, NAT, ACL, LB, and so on. SB DB optimizations
> (such as
> > > using relay) may still be required when scaling to 10k nodes.
> >
> > I agree that this is not a small change :) And for it to be worth it, it
> > would probably need to go along with removing the southbound to push the
> > decentralization further.
> >
> I'd rather consider them separate prototypes, because they are somehow
> independent changes.
>
> > > > Connect ovn-controller to the northbound DB
> > > > ===========================================
> > > >
> > > > This idea extends on a previous proposal to migrate the logical flows
> > > > creation in ovn-controller [3].
> > > >
> > > > [3]
> https://patchwork.ozlabs.org/project/ovn/patch/20210625233130.3347463-1-num...@ovn.org/
> > > >
> > > > If the first two proposals are implemented, the southbound database
> can
> > > > be removed from the picture. ovn-controller can directly translate the
> > > > northbound schema into OVS configuration bridges, ports and flow
> rules.
>
> I forgot to mention in my earlier reply, that I don't think "If the first
> two proposals are implemented" matter here.
> Firstly, the first two proposals are primarily for L2 distribution, but it
> is a small part (both in code base and features) of OVN. Most other
> features still rely on the central DB.
> Secondly, even without the first two proposals, it is still a valid attempt
> to remove SB (primarily removing the logical flow layer). The L2 OVS flows,
> together with all the other flows for other features, can still be
> generated by ovn-controller according to central DB (probably a combined DB
> of current NB and SB).
>
> > > >
> > > > For other components that require access to the southbound DB (e.g.
> > > > neutron metadata agent), ovn-controller should provide an interface to
> > > > expose state and configuration data for local consumption.
> > > >
> > > > All state information present in the NB DB should be moved to a
> separate
> > > > state database [4] for CMS consumption.
> > > >
> > > > [4]
> https://mail.openvswitch.org/pipermail/ovs-dev/2023-April/403675.html
> > > >
> > > > For those who like visuals, I have started working on basic use cases
> > > > and how they would be implemented without a southbound database [5].
> > > >
> > > > [5] https://link.excalidraw.com/p/readonly/jwZgJlPe4zhGF8lE5yY3
> > > >
> > > > Pros:
> > > >
> > > > - The northbound DB is smaller by design: reduced network bandwidth
> and
> > > >   memory usage in all chassis.
> > > > - If we keep the northbound read-only for ovn-controller, it removes
> > > >   scaling issues when one controller updates one row that needs to be
> > > >   replicated everywhere.
> > > > - The northbound schema knows nothing about flows. We could introduce
> > > >   alternative dataplane backends configured by ovn-controller via
> > > >   plugins. I have done a minimal PoC to check if it could work with
> the
> > > >   linux network stack [6].
> > > >
> > > > [6]
> https://github.com/rjarry/ovn-nb-agent/blob/main/backend/linux/bridge.go
> > > >
> > > > Cons:
> > > >
> > > > - This would be a serious API breakage for systems that depend on the
> > > >   southbound DB.
> > > > - Can all OVN constructs be implemented without a southbound DB?
> > > > - Is the community interested in alternative datapaths?
> > > >
> > >
> > > This idea was also discussed briefly in [7] (page 16-17). The main
> > > motivation was to avoid the cost of the intermediate logical flow layer.
> > > The above mentioned patch was abandoned because it still has the logical
> > > flow translation layer but just moved from northd to ovn-controller.
> > > The major benefit of logical flow layer in northd is that it performs
> > > common calculations that is required for every (or a lot of) chassis at
> > > once, so that they don't need to be repeated on the chassis. It is also
> > > very helpful for trouble-shooting. However, the logical flow layer
> itself
> > > has a significant cost.
> > >
> > > There has been lots of improvement done against the cost, e.g.:
> > >
> > > - incremental lflow processing in ovn-controller and partially in
> ovn-northd
> > > - offloading node-local flow generation in ovn-controller (such as
> > >   port-security, LB hairpin flows, etc.)
> > > - flow-tagging
> > > - ...
> > >
> > > Since then the motivation to remove north/SB has reduced, but it is
> still a
> > > valid alternative (with its pros and cons).
> > > (I believe this change is even bigger than the distributed MAC learning,
> > > but prototypes are always welcome)
> >
> > I must be honest here and admit that I would not know where to start for
> > prototyping such a change in the OVN code base. This is the reason why
> > I reached out to the community to see if my ideas (or at least some of
> > them) make sense to others.
> >
>
> It is just a big change, almost rewriting OVN. The major part of code in
> ovn-northd is to generate logical flows, and a significant part of code in
> ovn-controller is translating logical flow to OVS flows.
> I would not be surprised if this becomes just a different project. (while
> it is true that some part of ovn-controller can still be reused, such as
> physical.c, chassis.c, encap.c, etc.)
>
> Thanks,
> Han
>
> [8] http://www.openvswitch.org/support/dist-docs/ovs-vswitchd.8.txt
> [9]
> https://github.com/openvswitch/ovs/commit/4224b9cf8fdba23fa35c1894eae42fd953a3780b
>
> > > Regarding the alternative datapath, I personally don't think it is a
> strong
> > > argument here. OVN with its NB schema alone (and OVSDB itself) is not an
> > > obvious advantage compared with other SDN solutions. OVN exists
> primarily
> > > to program OVS (or any OpenFlow based datapath). Logical flow table (and
> > > other SB data) exists primarily for this purpose. If someone finds
> another
> > > datapath is more attractive to their use cases than OVS/OpenFlow, it is
> > > probably better to switch to its own control plane (probably using a
> more
> > > popular/scalable database with their own schema).
> > >
> > > Best regards,
> > > Han
> > >
> > > [7] https://www.openvswitch.org/support/ovscon2021/slides/scale_ovn.pdf
> >
> > This is a fair point and I understand your position.
> >
> > Thanks for taking the time to consider my ideas!
> >