On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry <rja...@redhat.com> wrote:
>
> Hi Han,
>
> Please see my comments/questions inline.
>
> Han Zhou, Sep 30, 2023 at 21:59:
> > > Distributed mac learning
> > > ========================
> > >
> > > Use one OVS bridge per logical switch with mac learning enabled. Only
> > > create the bridge if the logical switch has a port bound to the local
> > > chassis.
> > >
> > > Pros:
> > >
> > > - Minimal openflow rules required in each bridge (ACLs and NAT mostly).
> > > - No central mac binding table required.
> >
> > Firstly to clarify the terminology of "mac binding" to avoid confusion, the
> > mac_binding table currently in SB DB has nothing to do with L2 MAC
> > learning. It is actually the ARP/Neighbor table of distributed logical
> > routers. We should probably call it IP_MAC_binding table, or just Neighbor
> > table.
>
> Yes sorry about the confusion. I actually meant the FDB table.
>
> > Here what you mean is actually L2 MAC learning, which today is implemented
> > by the FDB table in SB DB, and it is only for uncommon use cases when the
> > NB doesn't have the knowledge of a MAC address of a VIF.
>
> This is not that uncommon in telco use cases where VNFs can send packets
> from mac addresses unknown to OVN.
>
Understood, but VNFs contribute only a small portion of the workloads,
right? Maybe I should rephrase that: it is uncommon to have "unknown"
addresses for the majority of ports in a large-scale cloud. Is this
understanding correct?

> > The purpose of this proposal is clear - to avoid using a central table in
> > DB for L2 information but instead using L2 MAC learning to populate such
> > information on chassis, which is a reasonable alternative with pros and
> > cons.
> > However, I don't think it is necessary to use separate OVS bridges for this
> > purpose. L2 MAC learning can be easily implemented in the br-int bridge
> > with OVS flows, which is much simpler than managing dynamic number of OVS
> > bridges just for the purpose of using the builtin OVS mac-learning.
>
> I agree that this could also be implemented with VLAN tags on the
> appropriate ports. But since OVS does not support trunk ports, it may
> require complicated OF pipelines. My intent with this idea was two fold:
>
> 1) Avoid a central point of failure for mac learning/aging.
> 2) Simplify the OF pipeline by making all FDB operations dynamic.

IMHO, the L2 pipeline is not really complex. It is probably the simplest
part (compared with other features such as L3, NAT, ACL, LB, etc.).
Adding dynamic learning to this part probably makes it *a little* more
complex, but it should still be straightforward. We don't need any VLAN tag
because the incoming packet carries the geneve VNI in its metadata. We just
need a flow that resubmits to look up a MAC-to-tunnel-source mapping table
and injects a new flow (with the related tunnel endpoint information) if the
src MAC is not found, with the help of the "learn" action. The entries are
per-logical_switch (VNI). This would serve your purpose of avoiding a
central DB for L2. At least this looks much simpler to me than managing a
dynamic number of OVS bridges and the patch port pairs between them.
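
Just to illustrate what I have in mind, here is a rough sketch using the
"learn" action (the table numbers, the timeout, and the assumption of a
single flow-based geneve port are made up for illustration, not how the
current pipeline is organized):

  # On the tunnel-ingress path (so that tun_src is valid): learn that the
  # packet's source MAC sits behind tun_src, scoped to the logical switch
  # (tun_id), then continue to the lookup stage.
  table=30, priority=100,
      actions=learn(table=31, hard_timeout=300,
                    NXM_NX_TUN_ID[],
                    NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],
                    load:NXM_NX_TUN_IPV4_SRC[]->NXM_NX_TUN_IPV4_DST[],
                    output:NXM_OF_IN_PORT[]),
              resubmit(,31)

  # In table 31 a hit forwards directly to the learned chassis over the
  # tunnel; a miss falls back to flooding the logical switch's BUM group.
  table=31, priority=0, actions=<flood to the per-datapath BUM group>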

>
> > Now back to the distributed MAC learning idea itself. Essentially for two
> > VMs/pods to communicate on L2, say, VM1@Chassis1 needs to send a packet to
> > VM2@chassis2, assuming VM1 already has VM2's MAC address (we will discuss
> > this later), Chassis1 needs to know that VM2's MAC is located on Chassis2.
> >
> > In OVN today this information is conveyed through:
> >
> > - MAC and LSP mapping (NB -> northd -> SB -> Chassis)
> > - LSP and Chassis mapping (Chassis -> SB -> Chassis)
> >
> > In your proposal:
> >
> > - MAC and Chassis mapping (can be learned through initial L2
> >   broadcast/flood)
> >
> > This indeed would avoid the control plane cost through the centralized
> > components (for this L2 binding part). Given that today's SB OVSDB is a
> > bottleneck, this idea may sound attractive. But please also take into
> > consideration the below improvement that could mitigate the OVN central
> > scale issue:
> >
> > - For MAC and LSP mapping, northd is now capable of incrementally
> >   processing VIF related L2/L3 changes, so the cost of NB -> northd ->
> >   SB is very small. For SB -> Chassis, a more scalable DB deployment,
> >   such as the OVSDB relays, may largely help.
>
> But using relays will only help with read-only operations (SB ->
> chassis). Write operations (from dynamically learned mac addresses) will
> be equivalent.
>
OVSDB relay supports write operations, too. It scales better because each
ovsdb-server process handles a smaller number of clients/connections. It may
still perform worse when there are too many write operations from many
clients, but I think it should scale better than no relay at all. This is
only based on my knowledge of the ovsdb-server relay; I haven't tested it at
scale yet. People who have actually deployed it may comment further.
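To illustrate the kind of deployment I mean (the addresses, ports and the
per-rack placement below are only examples):

  # a relay process (e.g. one per rack/zone) forwarding to the central SB
  # cluster:
  ovsdb-server --remote=ptcp:6642 \
      relay:OVN_Southbound:tcp:10.0.0.1:6642,tcp:10.0.0.2:6642,tcp:10.0.0.3:6642

  # each chassis then points ovn-controller at its local relay instead of
  # the central cluster:
  ovs-vsctl set open_vswitch . external_ids:ovn-remote=tcp:<relay-ip>:6642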

> > - For LSP and Chassis mapping, the round trip through a central DB
> >   obviously costs higher than a direct L2 broadcast (the targets are
> >   the same). But this can be optimized if the MAC and Chassis is known
> >   by the CMS system (which is true for most openstack/k8s env
> >   I believe). Instead of updating the binding from each Chassis, CMS
> >   can tell this information through the same NB -> northd -> SB ->
> >   Chassis path, and the Chassis can just read the SB without updating
> >   it.
>
> This is only applicable for known L2 addresses.  Maybe telco use cases
> are very specific, but being able to have ports that send packets from
> unknown addresses is a strong requirement.
>
Understood. But in terms of scale, the assumption is that the majority of
ports' addresses are known by the CMS. Is this assumption correct in telco
use cases?

> > On the other hand, the dynamic MAC learning approach has its own drawbacks.
> >
> > - It is simple to consider L2 only, but if considering more SDN
> >   features, a central DB is more flexible to extend and implement new
> >   features than a network protocol based approach.
> > - It is more predictable and easier to debug with pre-populated
> >   information through CMS than states learned dynamically in
> >   data-plane.
> > - With the DB approach we can suppress most of L2 broadcast/flood,
> >   while with the distributed MAC learning broadcast/flood can't be
> >   avoided. Although it may happen mostly when a new workload is
> >   launched, it can also happen when aging. The cost of broadcast in
> >   large L2 is also a potential threat to scale.
>
> I may lack the field experience of operating large datacenter networks
> but I was not aware of any scaling issues because of ARP and/or other L2
> broadcasts.  Is this an actual problem that was reported by cloud/telco
> operators and which influenced the centralized decisions?
>
I haven't heard any cloud operator reporting such a problem, but I have
heard people express concerns about it in many situations. If you google
"ARP suppression" there are lots of implementations by different vendors,
and I believe it is a real problem if not well managed, e.g. when using an
extremely large L2 domain without ARP suppression. But I also believe it
shouldn't be a big concern if L2 segments are small.

> > > Use multicast for overlay networks
> > > ==================================
> > >
> > > Use a unique 24bit VNI per overlay network. Derive a multicast group
> > > address from that VNI. Use VXLAN address learning [2] to remove the need
> > > for ovn-controller to know the destination chassis for every mac address
> > > in advance.
> > >
> > > [2] https://datatracker.ietf.org/doc/html/rfc7348#section-4.2
> > >
> > > Pros:
> > >
> > > - Nodes do not need to know about others in advance. The control plane
> > >   load is distributed across the cluster.
> >
> > I don't think that nodes knowing each other (at node/chassis level) in
> > advance is a big scaling problem. Thinking about the 10k nodes scale, it is
> > just 10k entries on each node. And node addition/removal is not a very
> > frequent operation. So does it really matter?
>
> If I'm not mistaken, with the full mesh design, scaling to 10k nodes
> implies 9999 GENEVE ports on every chassis.  Can OVS handle that kind of
> setup?  Could this have an impact on datapath performance?
>
I didn't test this scale myself, but according to the OVS documentation [8]
(in the LIMITS section) the limit is determined only by the number of
available file descriptors.
It is a good point regarding datapath performance. In the early days there
was a statement "Performance will degrade beyond 1,024 ports per bridge due
to fixed hash table sizing.", but it was removed in 2019 [9].
It would be great if someone could share a real test result at this scale
(it can be simulated with a sufficient number of tunnel ports).
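For example, something like the following could create the ports on a test
node (the bridge name, port names and remote IP numbering are made up purely
to get unique tunnels):

  # create ~10k geneve tunnel ports, each pointing at a distinct (fake) remote
  for i in $(seq 1 9999); do
      ovs-vsctl add-port br-int sim-geneve-$i -- \
          set interface sim-geneve-$i type=geneve \
          options:remote_ip=10.$((i/256)).$((i%256)).1 options:key=flow
  done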

> > If I understand correctly, the major point of this approach is to form the
> > Chassis groups for BUM traffic of each L2. For example, for the MAC
> > learning to work, the initial broadcast (usually ARP request) needs to be
> > sent out to the group of Chassis that is related to that specific logical
> > L2. However, as also mentioned by Felix and Frode, requiring multicast
> > support in infrastructure may exclude a lot of users.
>
> Please excuse my candid question, but was multicast traffic in the
> fabric ever raised as a problem?
>
> Most (all?) top of rack switches have IGMP/MLD support built-in. If that
> was not the case, IPv6 would not work since it requires multicast to
> function properly.
>
Having devices that support IGMP/MLD might still be different from being
willing to operate multicast. This is not my domain, so I will let people
with more operator experience comment.
Regarding IPv6, I think basic IPv6 operation requires multicast, but it can
use well-defined, static multicast addresses that don't require the dynamic
group management provided by MLD.
Anyway, I just wanted to provide an alternative option that may put fewer
requirements on the infrastructure.

> > On the other hand, I would propose something else that can achieve the same
> > with less cost on the central SB. We can still let Chassis join "multicast
> > groups" but instead of relying on IP multicast, we can populate this
> > information to SB. It is different from today's LSP-Chassis mapping
> > (port_binding) in SB, but a more coarse-grained mapping of
> > Datapath-Chassis, which is sufficient to support the BUM traffic for the
> > distributed MAC learning purpose and lightweight (relatively) to the
> > central SB.
>
> But it would require the chassis to clone broadcast traffic explicitly
> in OVS.  The main benefit of using IP multicast is that BUM traffic
> duplication is handled by the fabric switches.
>
Agreed.

> > In addition, if you still need L3 distributed routing, each node not only
> > has to join L2 groups that have workloads running locally, but also needs
> > to join indirectly connected L2 groups (e.g. LS1 - LR - LS2) to receive
> > broadcasts to perform MAC learning for L3 connected remotes. The "states"
> > learned by each chassis should be no different than the ones achieved by
> > conditional monitoring (ovn-monitor-all=false).
> >
> > Overall, for the above two points, the primary goal is to reduce dependence
> > on the centralized control plane (especially SB DB). I think it may be
> > worth some prototype (not a small change) for special use cases that
> > require extremely large scale but simpler features (and without a big
> > concern of L2 flooding) for a good tradeoff.
> >
> > I'd also like to remind that the L2 related scale issue is more relevant to
> > OpenStack, but it is not a problem for kubernetes (at least not for
> > ovn-kubernetes). ovn-kubernetes solves the problem by using L3 routing
> > instead of L2. L2 is confined within each node, and between the nodes there
> > are only routes exchanged (through SB DB), which is O(N) (N = nodes)
> > instead of O(P) (P = ports). This is discussed in "Trade IP mobility for
> > scalability" (page 4 - 13) of my presentation in OVSCON2021 [7].
> >
> > Also remember that all the other features still require centralized DB,
> > including L3 routing, NAT, ACL, LB, and so on. SB DB optimizations (such as
> > using relay) may still be required when scaling to 10k nodes.
>
> I agree that this is not a small change :) And for it to be worth it, it
> would probably need to go along with removing the southbound to push the
> decentralization further.
>
I'd rather consider them separate prototypes, because they are largely
independent changes.

> > > Connect ovn-controller to the northbound DB
> > > ===========================================
> > >
> > > This idea extends on a previous proposal to migrate the logical flows
> > > creation in ovn-controller [3].
> > >
> > > [3] https://patchwork.ozlabs.org/project/ovn/patch/20210625233130.3347463-1-num...@ovn.org/
> > >
> > > If the first two proposals are implemented, the southbound database can
> > > be removed from the picture. ovn-controller can directly translate the
> > > northbound schema into OVS configuration bridges, ports and flow rules.

I forgot to mention in my earlier reply that I don't think "If the first
two proposals are implemented" matters here.
Firstly, the first two proposals are primarily about L2 distribution, which
is a small part (in both code base and features) of OVN. Most other
features still rely on the central DB.
Secondly, even without the first two proposals, it is still a valid attempt
to remove the SB (primarily removing the logical flow layer). The L2 OVS
flows, together with all the flows for the other features, can still be
generated by ovn-controller according to a central DB (probably a combined
DB of the current NB and SB).

> > >
> > > For other components that require access to the southbound DB (e.g.
> > > neutron metadata agent), ovn-controller should provide an interface to
> > > expose state and configuration data for local consumption.
> > >
> > > All state information present in the NB DB should be moved to a separate
> > > state database [4] for CMS consumption.
> > >
> > > [4] https://mail.openvswitch.org/pipermail/ovs-dev/2023-April/403675.html
> > >
> > > For those who like visuals, I have started working on basic use cases
> > > and how they would be implemented without a southbound database [5].
> > >
> > > [5] https://link.excalidraw.com/p/readonly/jwZgJlPe4zhGF8lE5yY3
> > >
> > > Pros:
> > >
> > > - The northbound DB is smaller by design: reduced network bandwidth and
> > >   memory usage in all chassis.
> > > - If we keep the northbound read-only for ovn-controller, it removes
> > >   scaling issues when one controller updates one row that needs to be
> > >   replicated everywhere.
> > > - The northbound schema knows nothing about flows. We could introduce
> > >   alternative dataplane backends configured by ovn-controller via
> > >   plugins. I have done a minimal PoC to check if it could work with the
> > >   linux network stack [6].
> > >
> > > [6] https://github.com/rjarry/ovn-nb-agent/blob/main/backend/linux/bridge.go
> > >
> > > Cons:
> > >
> > > - This would be a serious API breakage for systems that depend on the
> > >   southbound DB.
> > > - Can all OVN constructs be implemented without a southbound DB?
> > > - Is the community interested in alternative datapaths?
> > >
> >
> > This idea was also discussed briefly in [7] (page 16-17). The main
> > motivation was to avoid the cost of the intermediate logical flow layer.
> > The above mentioned patch was abandoned because it still has the logical
> > flow translation layer but just moved from northd to ovn-controller.
> > The major benefit of the logical flow layer in northd is that it performs
> > common calculations that are required for every (or a lot of) chassis at
> > once, so that they don't need to be repeated on the chassis. It is also
> > very helpful for trouble-shooting. However, the logical flow layer itself
> > has a significant cost.
> >
> > There has been lots of improvement done against the cost, e.g.:
> >
> > - incremental lflow processing in ovn-controller and partially in ovn-northd
> > - offloading node-local flow generation in ovn-controller (such as
> >   port-security, LB hairpin flows, etc.)
> > - flow-tagging
> > - ...
> >
> > Since then the motivation to remove north/SB has reduced, but it is still a
> > valid alternative (with its pros and cons).
> > (I believe this change is even bigger than the distributed MAC learning,
> > but prototypes are always welcome)
>
> I must be honest here and admit that I would not know where to start for
> prototyping such a change in the OVN code base. This is the reason why
> I reached out to the community to see if my ideas (or at least some of
> them) make sense to others.
>

It is just a big change, almost a rewrite of OVN. The major part of the
code in ovn-northd generates logical flows, and a significant part of the
code in ovn-controller translates logical flows to OVS flows.
I would not be surprised if this became just a different project (although
it is true that some parts of ovn-controller could still be reused, such as
physical.c, chassis.c, encap.c, etc.).

Thanks,
Han

[8] http://www.openvswitch.org/support/dist-docs/ovs-vswitchd.8.txt
[9] https://github.com/openvswitch/ovs/commit/4224b9cf8fdba23fa35c1894eae42fd953a3780b

> > Regarding the alternative datapath, I personally don't think it is a strong
> > argument here. OVN with its NB schema alone (and OVSDB itself) is not an
> > obvious advantage compared with other SDN solutions. OVN exists primarily
> > to program OVS (or any OpenFlow based datapath). Logical flow table (and
> > other SB data) exists primarily for this purpose. If someone finds another
> > datapath is more attractive to their use cases than OVS/OpenFlow, it is
> > probably better to switch to its own control plane (probably using a more
> > popular/scalable database with their own schema).
> >
> > Best regards,
> > Han
> >
> > [7] https://www.openvswitch.org/support/ovscon2021/slides/scale_ovn.pdf
>
> This is a fair point and I understand your position.
>
> Thanks for taking the time to consider my ideas!
>
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
