On Thu, Sep 28, 2023 at 9:28 AM Robin Jarry <rja...@redhat.com> wrote:
>
> Hello OVN community,
>
> This is a follow up on the message I have sent today [1]. That second
> part focuses on some ideas I have to remove the limitations that were
> mentioned in the previous email.
>
> [1]
> https://mail.openvswitch.org/pipermail/ovs-discuss/2023-September/052695.html
>
> If you didn't read it, my goal is to start a discussion about how we
> could improve OVN on the following topics:
>
> - Reduce the memory and CPU footprint of ovn-controller, ovn-northd.
> - Support scaling of L2 connectivity across larger clusters.
> - Simplify CMS interoperability.
> - Allow support for alternative datapath implementations.
>
> Disclaimer:
>
> This message does not mention anything about L3/L4 features of OVN.
> I didn't have time to work on these, yet. I hope we can discuss how
> these fit with my ideas.
>

Hi Robin and folks, thanks for the great discussions!
I have read the replies in the two other threads of this email, but I am
replying directly here to comment on some of the original statements. I
will reply in the other threads on specific points.

> Distributed mac learning
> ========================
>
> Use one OVS bridge per logical switch with mac learning enabled. Only
> create the bridge if the logical switch has a port bound to the local
> chassis.
>
> Pros:
>
> - Minimal openflow rules required in each bridge (ACLs and NAT mostly).
> - No central mac binding table required.

First, to clarify the terminology of "mac binding" and avoid confusion:
the MAC_Binding table currently in the SB DB has nothing to do with L2
MAC learning. It is actually the ARP/neighbor table of distributed
logical routers. We should probably call it the IP_MAC_Binding table, or
just the Neighbor table.
What you mean here is actually L2 MAC learning, which today is
implemented by the FDB table in the SB DB and is only used in the
uncommon case where the NB doesn't know the MAC address of a VIF.
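For readers less familiar with the SB schema, the two tables can be
inspected directly on a deployment (a sketch; assumes a reachable SB DB):

```shell
# Logical-router neighbor (ARP/ND) entries - not L2 FDB state:
ovn-sbctl list MAC_Binding

# Dynamically learned L2 addresses, used only for "unknown" ports:
ovn-sbctl list FDB
```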

> - Mac table aging comes for free.
> - Zero access to southbound DB for learned addresses nor for aging.
>
> Cons:
>
> - How to manage seamless upgrades?
> - Requires ovn-controller to move/plug ports in the correct bridge.
> - Multiple openflow connections (one per managed bridge).
> - Requires ovn-trace to be reimplemented differently (maybe other tools
>   as well).
>

The purpose of this proposal is clear - to avoid a central DB table for
L2 information and instead use L2 MAC learning to populate that
information on each chassis, which is a reasonable alternative with pros
and cons.
However, I don't think separate OVS bridges are necessary for this
purpose. L2 MAC learning can easily be implemented in the br-int bridge
with OVS flows, which is much simpler than managing a dynamic number of
OVS bridges just to use the built-in OVS MAC learning.
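For illustration, OVS's learn() action can implement MAC learning in
br-int without extra bridges. A minimal sketch (table numbers and
timeouts are arbitrary, and per-logical-switch scoping by metadata or
tunnel key, port security, etc. are omitted):

```shell
# For every packet, install a flow in table 20 that matches the observed
# source MAC as destination and forwards straight to the ingress port it
# was learned on. hard_timeout gives free aging.
ovs-ofctl add-flow br-int "table=10,priority=100,actions=learn(table=20,hard_timeout=300,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],output:NXM_OF_IN_PORT[]),resubmit(,20)"

# Destinations not learned yet fall through to a flood rule.
ovs-ofctl add-flow br-int "table=20,priority=0,actions=flood"
```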

Now back to the distributed MAC learning idea itself. Essentially, for
two VMs/pods to communicate on L2, say VM1@Chassis1 sending a packet to
VM2@Chassis2 (assuming VM1 already has VM2's MAC address; we will
discuss this later), Chassis1 needs to know that VM2's MAC is located on
Chassis2. In OVN today this information is conveyed through:
- MAC and LSP mapping (NB -> northd -> SB -> Chassis)
- LSP and Chassis mapping (Chassis -> SB -> Chassis)

In your proposal:
- MAC and Chassis mapping (can be learned through initial L2
broadcast/flood)

This would indeed avoid the control-plane cost through the centralized
components (for this L2 binding part). Given that today's SB OVSDB is a
bottleneck, the idea may sound attractive. But please also take into
consideration the improvements below, which could mitigate the OVN
central scale issue:
- For the MAC and LSP mapping, northd is now capable of incrementally
processing VIF-related L2/L3 changes, so the cost of NB -> northd -> SB
is very small. For SB -> Chassis, a more scalable DB deployment, such as
OVSDB relays, may largely help.
- For the LSP and Chassis mapping, the round trip through a central DB
obviously costs more than a direct L2 broadcast (the targets are the
same). But this can be optimized if the MAC and Chassis mapping is known
by the CMS (which is true for most OpenStack/k8s environments, I
believe). Instead of each chassis updating the binding, the CMS can
convey this information through the same NB -> northd -> SB -> Chassis
path, and each chassis can just read the SB without updating it.

On the other hand, the dynamic MAC learning approach has its own
drawbacks:
- It is simple when considering L2 only, but when considering more SDN
features, a central DB is more flexible to extend and to implement new
features in than a network-protocol-based approach.
- Information pre-populated through the CMS is more predictable and
easier to debug than state learned dynamically in the data plane.
- With the DB approach we can suppress most L2 broadcast/flood traffic,
while with distributed MAC learning broadcast/flood can't be avoided.
Although it mostly happens when a new workload is launched, it can also
happen after aging. The cost of broadcast in a large L2 domain is also a
potential threat to scale.
- For the initial broadcast/flood, the source chassis needs to know all
the destinations, which requires either a pre-populated LSP and Chassis
mapping with the same centralized SB mechanism, or something else such
as what is discussed in your proposal "Use multicast for overlay
networks". (So let's continue the discussion below.)

> Use multicast for overlay networks
> ==================================
>
> Use a unique 24bit VNI per overlay network. Derive a multicast group
> address from that VNI. Use VXLAN address learning [2] to remove the need
> for ovn-controller to know the destination chassis for every mac address
> in advance.
>
> [2] https://datatracker.ietf.org/doc/html/rfc7348#section-4.2
>
> Pros:
>
> - Nodes do not need to know about others in advance. The control plane
>   load is distributed across the cluster.

I don't think that nodes knowing about each other (at the node/chassis
level) in advance is a big scaling problem. At the 10k-node scale, it is
just 10k entries on each node, and node addition/removal is not a very
frequent operation. So does it really matter?

> - 24bit VNI allows for more than 16 million logical switches. No need
>   for extended GENEVE tunnel options.
> - Limited and scoped "flooding" with IGMP/MLD snooping enabled in
>   top-of-rack switches. Multicast is only used for BUM traffic.
> - Only one VXLAN output port per implemented logical switch on a given
>   chassis.
>
> Cons:
>
> - OVS does not support VXLAN address learning yet.
> - The number of usable multicast groups in a fabric network may be
>   limited?
> - How to manage seamless upgrades and interoperability with older OVN
>   versions?
>

If I understand correctly, the major point of this approach is to form
chassis groups for the BUM traffic of each L2 domain. For example, for
MAC learning to work, the initial broadcast (usually an ARP request)
needs to be sent out to the group of chassis related to that specific
logical L2. However, as also mentioned by Felix and Frode, requiring
multicast support in the infrastructure may exclude a lot of users.

On the other hand, I would propose something else that can achieve the
same with less cost on the central SB. We can still let chassis join
"multicast groups", but instead of relying on IP multicast, we can
populate this information into the SB. It is different from today's
LSP-Chassis mapping (port_binding) in the SB, being a more
coarse-grained Datapath-Chassis mapping, which is sufficient to support
the BUM traffic needed for distributed MAC learning and is (relatively)
lightweight for the central SB.

In addition, if you still need distributed L3 routing, each node not
only has to join the L2 groups that have workloads running locally, but
also needs to join indirectly connected L2 groups (e.g. LS1 - LR - LS2)
to receive broadcasts and perform MAC learning for L3-connected remotes.
The "states" learned by each chassis should be no different from what is
achieved today by conditional monitoring (ovn-monitor-all=false).
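For reference, the conditional-monitoring knob mentioned above is
configured per chassis in the local Open_vSwitch table (false is already
the default in most deployments):

```shell
# Only monitor SB rows relevant to locally bound ports, instead of
# downloading the whole southbound contents.
ovs-vsctl set open_vswitch . external_ids:ovn-monitor-all=false
```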

Overall, for the above two points, the primary goal is to reduce
dependence on the centralized control plane (especially the SB DB). I
think it may be worth a prototype (not a small change) for special use
cases that require extremely large scale but only simpler features (and
without a big concern about L2 flooding), as a good tradeoff.

I'd also like to point out that the L2-related scale issue is more
relevant to OpenStack; it is not a problem for Kubernetes (at least not
for ovn-kubernetes). ovn-kubernetes solves the problem by using L3
routing instead of L2. L2 is confined within each node, and between the
nodes only routes are exchanged (through the SB DB), which is O(N)
(N = nodes) instead of O(P) (P = ports). This is discussed in "Trade IP
mobility for scalability" (pages 4 - 13) of my presentation at OVSCON2021
[7].
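A rough back-of-envelope version of that O(N) vs O(P) comparison, using
hypothetical round numbers (10k nodes, 50 ports per node):

```shell
# Hypothetical cluster size for illustration only.
nodes=10000
ports_per_node=50

# Per-node L3 routing (ovn-kubernetes style): one route per node.
routes=$nodes

# Flat L2 with per-port bindings: one SB entry per port.
bindings=$((nodes * ports_per_node))

echo "routes exchanged: $routes"      # 10000
echo "port bindings:    $bindings"    # 500000
```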

Also remember that all the other features still require the centralized
DB, including L3 routing, NAT, ACLs, LB, and so on. SB DB optimizations
(such as using relays) may still be required when scaling to 10k nodes.
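For completeness, an OVSDB relay is just an ovsdb-server in relay mode
pointed at the central SB. A sketch with hypothetical addresses (10.0.0.1
being the central SB; chassis would then connect to the relay instead):

```shell
# Serve a relayed copy of OVN_Southbound on the local port 6642.
ovsdb-server --remote=ptcp:6642 relay:OVN_Southbound:tcp:10.0.0.1:6642
```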

> Connect ovn-controller to the northbound DB
> ===========================================
>
> This idea extends on a previous proposal to migrate the logical flows
> creation in ovn-controller [3].
>
> [3]
> https://patchwork.ozlabs.org/project/ovn/patch/20210625233130.3347463-1-num...@ovn.org/
>
> If the first two proposals are implemented, the southbound database can
> be removed from the picture. ovn-controller can directly translate the
> northbound schema into OVS configuration bridges, ports and flow rules.
>
> For other components that require access to the southbound DB (e.g.
> neutron metadata agent), ovn-controller should provide an interface to
> expose state and configuration data for local consumption.
>
> All state information present in the NB DB should be moved to a separate
> state database [4] for CMS consumption.
>
> [4] https://mail.openvswitch.org/pipermail/ovs-dev/2023-April/403675.html
>
> For those who like visuals, I have started working on basic use cases
> and how they would be implemented without a southbound database [5].
>
> [5] https://link.excalidraw.com/p/readonly/jwZgJlPe4zhGF8lE5yY3
>
> Pros:
>
> - The northbound DB is smaller by design: reduced network bandwidth and
>   memory usage in all chassis.
> - If we keep the northbound read-only for ovn-controller, it removes
>   scaling issues when one controller updates one row that needs to be
>   replicated everywhere.
> - The northbound schema knows nothing about flows. We could introduce
>   alternative dataplane backends configured by ovn-controller via
>   plugins. I have done a minimal PoC to check if it could work with the
>   linux network stack [6].
>
> [6]
> https://github.com/rjarry/ovn-nb-agent/blob/main/backend/linux/bridge.go
>
> Cons:
>
> - This would be a serious API breakage for systems that depend on the
>   southbound DB.
> - Can all OVN constructs be implemented without a southbound DB?
> - Is the community interested in alternative datapaths?
>

This idea was also discussed briefly in [7] (pages 16 - 17). The main
motivation was to avoid the cost of the intermediate logical flow layer.
The patch mentioned above was abandoned because it kept the logical flow
translation layer and just moved it from northd to ovn-controller.
The major benefit of the logical flow layer in northd is that it
performs the common calculations required by every (or many) chassis
once, so that they don't need to be repeated on each chassis. It is also
very helpful for troubleshooting. However, the logical flow layer itself
has a significant cost.
There have been lots of improvements against that cost, e.g.:
- incremental lflow processing in ovn-controller and partially in
  ovn-northd
- offloading node-local flow generation to ovn-controller (such as
  port-security and LB hairpin flows)
- flow-tagging
- ...
Since then the motivation to remove northd/SB has diminished, but it is
still a valid alternative (with its pros and cons).
(I believe this change is even bigger than the distributed MAC learning,
but prototypes are always welcome.)

Regarding alternative datapaths, I personally don't think that is a
strong argument here. OVN with its NB schema alone (and OVSDB itself)
has no obvious advantage over other SDN solutions. OVN exists primarily
to program OVS (or any OpenFlow-based datapath), and the logical flow
table (and the other SB data) exists primarily for this purpose. If
someone finds another datapath more attractive for their use cases than
OVS/OpenFlow, they are probably better off building their own control
plane (probably using a more popular/scalable database with their own
schema).

Best regards,
Han

[7] https://www.openvswitch.org/support/ovscon2021/slides/scale_ovn.pdf

> Closing thoughts
> ================
>
> I mainly focused on OpenStack use cases for now, but I think these
> propositions could benefit Kubernetes as well.
>
> I hope I didn't bore everyone to death. Let me know what you think.
>
> Cheers!
>
> --
> Robin Jarry
> Red Hat, Telco/NFV
>
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
