[ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals
Hello OVN community,

This is a follow up on the message I have sent today [1]. This second part focuses on some ideas I have to remove the limitations that were mentioned in the previous email.

[1] https://mail.openvswitch.org/pipermail/ovs-discuss/2023-September/052695.html

If you didn't read it, my goal is to start a discussion about how we could improve OVN on the following topics:

- Reduce the memory and CPU footprint of ovn-controller and ovn-northd.
- Support scaling of L2 connectivity across larger clusters.
- Simplify CMS interoperability.
- Allow support for alternative datapath implementations.

Disclaimer: This message does not mention anything about the L3/L4 features of OVN. I have not had time to work on these yet. I hope we can discuss how they fit with my ideas.

Distributed mac learning
========================

Use one OVS bridge per logical switch with mac learning enabled. Only create the bridge if the logical switch has a port bound to the local chassis.

Pros:

- Minimal openflow rules required in each bridge (ACLs and NAT mostly).
- No central mac binding table required.
- Mac table aging comes for free.
- Zero access to the southbound DB for learned addresses or for aging.

Cons:

- How to manage seamless upgrades?
- Requires ovn-controller to move/plug ports in the correct bridge.
- Multiple openflow connections (one per managed bridge).
- Requires ovn-trace to be reimplemented differently (maybe other tools as well).

Use multicast for overlay networks
==================================

Use a unique 24bit VNI per overlay network. Derive a multicast group address from that VNI. Use VXLAN address learning [2] to remove the need for ovn-controller to know the destination chassis for every mac address in advance.

[2] https://datatracker.ietf.org/doc/html/rfc7348#section-4.2

Pros:

- Nodes do not need to know about others in advance. The control plane load is distributed across the cluster.
- A 24bit VNI allows for more than 16 million logical switches. No need for extended GENEVE tunnel options.
- Limited and scoped "flooding" with IGMP/MLD snooping enabled in top-of-rack switches. Multicast is only used for BUM traffic.
- Only one VXLAN output port per implemented logical switch on a given chassis.

Cons:

- OVS does not support VXLAN address learning yet.
- The number of usable multicast groups in a fabric network may be limited?
- How to manage seamless upgrades and interoperability with older OVN versions?

Connect ovn-controller to the northbound DB
===========================================

This idea extends a previous proposal to move logical flow creation into ovn-controller [3].

[3] https://patchwork.ozlabs.org/project/ovn/patch/20210625233130.3347463-1-num...@ovn.org/

If the first two proposals are implemented, the southbound database can be removed from the picture. ovn-controller can directly translate the northbound schema into OVS configuration: bridges, ports and flow rules.

For other components that require access to the southbound DB (e.g. the neutron metadata agent), ovn-controller should provide an interface to expose state and configuration data for local consumption. All state information present in the NB DB should be moved to a separate state database [4] for CMS consumption.

[4] https://mail.openvswitch.org/pipermail/ovs-dev/2023-April/403675.html

For those who like visuals, I have started working on basic use cases and how they would be implemented without a southbound database [5].

[5] https://link.excalidraw.com/p/readonly/jwZgJlPe4zhGF8lE5yY3

Pros:

- The northbound DB is smaller by design: reduced network bandwidth and memory usage on all chassis.
- If we keep the northbound read-only for ovn-controller, it removes scaling issues when one controller updates one row that needs to be replicated everywhere.
- The northbound schema knows nothing about flows. We could introduce alternative dataplane backends configured by ovn-controller via plugins. I have done a minimal PoC to check whether it could work with the linux network stack [6].
[6] https://github.com/rjarry/ovn-nb-agent/blob/main/backend/linux/bridge.go

Cons:

- This would be a serious API breakage for systems that depend on the southbound DB.
- Can all OVN constructs be implemented without a southbound DB?
- Is the community interested in alternative datapaths?

Closing thoughts
================

I mainly focused on OpenStack use cases for now, but I think these propositions could benefit Kubernetes as well.

I hope I didn't bore everyone to death. Let me know what you think.

Cheers!

-- 
Robin Jarry
Red Hat, Telco/NFV
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
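The multicast proposal above suggests deriving a multicast group address from each 24-bit VNI but does not fix a mapping. One possible scheme (an assumption for illustration, not part of the proposal) embeds the VNI directly into the organization-local 239.0.0.0/8 scope:

```python
def vni_to_mcast_group(vni: int) -> str:
    """Map a 24-bit VXLAN VNI to an IPv4 multicast group address.

    Assumption: the VNI is embedded verbatim in the organization-local
    scope 239.0.0.0/8 (RFC 2365), so every overlay network gets a
    distinct group without any central allocation.
    """
    if not 0 <= vni <= 0xFFFFFF:
        raise ValueError("VNI must fit in 24 bits")
    return f"239.{(vni >> 16) & 0xFF}.{(vni >> 8) & 0xFF}.{vni & 0xFF}"
```

With a scheme like this, the number of multicast groups a fabric must support equals the number of logical switches actually implemented, which ties back to the "limited multicast groups" concern listed in the cons.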
[ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues
Hello OVN community,

I'm glad the subject of this message has caught your attention :-) I would like to start a discussion about how we could improve OVN on the following topics:

* Reduce the memory and CPU footprint of ovn-controller, ovn-northd.
* Support scaling of L2 connectivity across larger clusters.
* Simplify CMS interoperability.
* Allow support for alternative datapath implementations.

This first email will focus on the current issues that (in my view) are preventing OVN from scaling L2 networks on larger clusters. I will send another message with some change proposals to remove or fix these issues.

Disclaimer: I am fairly new to this project and my perception and understanding may be incorrect in some aspects. Please forgive me in advance if I use the wrong terms and/or make invalid statements. My intent is only to make things better and not to put the blame on anyone for the current design choices.

Southbound Design
=================

In the current architecture, both databases contain a mix of state and configuration. While this does not seem to cause any scaling issues for the northbound DB, it can become a bottleneck for the southbound DB with large numbers of chassis and logical network constructs.

The southbound database contains a mix of configuration (logical flows transformed from the logical network topology) and state (chassis, port bindings, mac bindings, FDB entries, etc.). The "configuration" part is consumed by ovn-controller to implement the network on every chassis, and the "state" part is consumed by ovn-northd to update the northbound "state" entries and to update logical flows. Some CMS's [1] also depend on the southbound "state" in order to function properly.

[1] https://opendev.org/openstack/neutron/src/tag/22.0.0/neutron/agent/ovn/metadata/ovsdb.py#L39-L40

Centralized decisions
=====================

Every chassis needs to be "aware" of all other chassis in the cluster.
This requirement mainly comes from overlay networks that are implemented over a full mesh of point-to-point GENEVE tunnels (or VXLAN with some limitations). It is not a scaling issue by itself, but it implies a centralized decision, which in turn puts pressure on the central node at scale.

Due to ovsdb monitoring and caching, any change in the southbound DB (either by northd or by any of the chassis controllers) is replicated to every chassis. The monitor_all option is often enabled on large clusters to avoid the conditional monitoring CPU cost on the central node. This leads to high memory usage on all chassis, control plane traffic and possible disruptions in the ovs-vswitchd datapath flow cache.

Unfortunately, I don't have any hard data to back this claim. It mainly comes from discussions I had with neutron contributors and from brainstorming sessions with colleagues. I hope that the current work on OVN heater to integrate openstack support [2] will allow getting more insight.

[2] https://github.com/ovn-org/ovn-heater/pull/179

Dynamic mac learning
====================

Logical switch ports on a given chassis are all connected to the same OVS bridge, in the same VLAN. This prevents the use of local mac address learning and shifts the responsibility to a centralized ovn-northd to create all the required logical flows to properly segment the network.

When using mac_address=unknown ports, centralized mac learning is enabled: when a new address is seen entering a port, OVS sends it to the local controller, which updates the FDB table and recomputes flow rules accordingly. With logical switches spanning a large number of chassis, this centralized mac address learning and aging can have an impact on control plane and dataplane performance.

Closing thoughts
================

My understanding of the L3 and L4 capabilities of OVN is too limited to discuss whether there are other issues that would prevent scaling to thousands of nodes. My point was mainly focused on L2 network scaling.
I would love to get other opinions on these statements.

Cheers!

-- 
Robin Jarry
Red Hat, Telco/NFV
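The dynamic mac learning issue described above (OVS punting each unknown source address to the controller, which updates the FDB table and recomputes flows) contrasts with what an ordinary learning bridge does entirely locally. A minimal, hypothetical sketch of local learning with aging (illustrative only, not OVN or OVS code):

```python
import time


class Fdb:
    """Minimal MAC forwarding table with aging, as a local learning
    bridge would maintain it autonomously (hypothetical sketch)."""

    def __init__(self, max_age: float = 300.0):
        self.max_age = max_age
        # mac -> (egress port, last time the address was seen)
        self.table: dict[str, tuple[str, float]] = {}

    def learn(self, mac: str, port: str) -> None:
        # Called on every ingress frame: refresh the entry in place,
        # with no controller round-trip and no flow recomputation.
        self.table[mac] = (port, time.monotonic())

    def lookup(self, mac: str) -> "str | None":
        # Expired entries are treated as unknown, so the frame is
        # flooded until the address is learned again.
        entry = self.table.get(mac)
        if entry is None:
            return None
        port, last_seen = entry
        if time.monotonic() - last_seen > self.max_age:
            del self.table[mac]
            return None
        return port
```

The point of the sketch is the cost model: learning and aging are O(1) local table updates, whereas the centralized scheme turns each of them into southbound DB traffic replicated across the cluster.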
Re: [ovs-discuss] Does dp:tc always mean adequate HW offload
On 28 Sep 2023, at 5:35, Levente Csikor via discuss wrote:

> Hi all,
>
> I am playing around with HW offloading and I have observed something
> strange that I cannot explain (as of now).
>
> I have two OvS switches (both supporting and having hw-offload:true)
> connected back-to-back through two physical ports.
>
> I am actually measuring latency with Pktgen-DPDK, so I send latency
> measuring packets through my first OvS, which will be returned by the
> second OvS. Both are running on a SmartNIC, but the scenario is
> somewhat similar to having them as hypervisor switches using kernel
> drivers, with the pktgen-dpdk app running in a VM "above" the first OvS.
>
> The flow rules are based on Layer-3 information (instead of a simple
> port forward).
>
> OvS 1 flow rules (pfXhpf face the VM, pX face the other OvS):
>
> - ip,in_port=pf0hpf,nw_src=192.168.0.1,nw_dst=192.168.1.1 actions=output:p0
> - ip,in_port=p1,nw_src=192.168.0.1,nw_dst=192.168.1.1 actions=output:pf1hpf
>
> OvS 2 flow rules:
>
> - ip,in_port=p0,nw_src=192.168.0.1,nw_dst=192.168.1.1 actions=output:p1
>
> The routing is working as expected, I get back all the packets, but due
> to the relatively bad latency I measured, I checked whether these flow
> rules, or in fact their corresponding flow caches, are indeed offloaded
> to the HW.
>
> In the case of the first OvS, the following command gives the below
> output:
>
> # ovs-appctl dpctl/dump-flows -m | grep offloaded
>
> ufid:66bfe654-11ac-4f01-8dec-414177df78c4, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(p1),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=192.168.0.1,dst=192.168.1.1,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:368243396, bytes:29459460538, used:0.120s, offloaded:yes, dp:tc, actions:pf1hpf
> ufid:89516b0d-bb6f-4df0-ba91-973ef79e6fa3, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(pf0hpf),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=192.168.0.1,dst=192.168.1.1,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:3630728485, bytes:275935348872, used:0.080s, offloaded:yes, dp:tc, actions:p0
> ufid:3fbc4e09-9d78-4e3b-8711-541d36828959, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(pf0hpf),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/128.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:233401993, bytes:17738549256, used:0.080s, offloaded:yes, dp:tc, actions:drop
>
> The key observation here is that the offloaded entries are using tc,
> and the ones related to my layer-3 forwarding rules (1st and 2nd
> entries) are indeed offloaded to the hardware.
>
> However, on the second OvS, I see this:
>
> ufid:b88192f5-17d5-461d-ba61-2bd0b7e49b39, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(p1),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x88cc), packets:0, bytes:0, used:never, offloaded:yes, dp:tc, actions:drop
> ufid:82f41402-9d5b-4229-a1c0-d75a35138ebe, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(p0),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=192.168.0.1,dst=192.168.1.1,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:432344072, bytes:26805332464, used:0.000s, dp:tc, actions:p1
> ufid:da660698-a31e-47e5-bf30-34ee90486f57, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(p0),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x88cc), packets:0, bytes:0, used:never, offloaded:yes, dp:tc, actions:drop
> ufid:68dc355c-0823-4324-8a02-fa104597b8d8, recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(p0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=18:5a:58:0c:c8:ca,dst=01:80:c2:00:00:00),eth_type(0/0x), packets:67094, bytes:4025640, used:0.313s, dp:ovs, actions:drop
> ufid:39f2ee6d-4a1b-4781-a73e-2d0a8d045df3, recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(p1),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0/0x), packets:4112, bytes:246720, used:1.725s, dp:ovs, actions:drop
>
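To make the distinction in the output above concrete: as I understand it, dp:tc only tells you which datapath holds the flow, while hardware offload is reported separately by the offloaded:yes attribute, so a dp:tc flow without it (like the second entry on the second OvS) may still be handled by the tc software path. A minimal sketch of filtering for such flows (hypothetical helper, with simplified sample lines, not real dump output):

```python
def hw_offloaded(flow_line: str) -> bool:
    """True only when a dump-flows line explicitly reports hw offload.

    Assumption: dp:tc alone means the flow sits in the tc datapath,
    which can still process it in software if the driver rejected it.
    """
    return "offloaded:yes" in flow_line


def tc_but_software(dump: str) -> list:
    """Return the ufid prefixes of tc flows lacking hw offload."""
    hits = []
    for line in dump.splitlines():
        if "dp:tc" in line and not hw_offloaded(line):
            hits.append(line.split(",", 1)[0])
    return hits


# Simplified, hypothetical sample modeled on the output above.
sample = """\
ufid:aaaa, eth_type(0x0800), used:0.000s, dp:tc, actions:p1
ufid:bbbb, eth_type(0x88cc), used:never, offloaded:yes, dp:tc, actions:drop
ufid:cccc, used:0.313s, dp:ovs, actions:drop
"""
```

Running `tc_but_software(sample)` would flag only the first line, mirroring the layer-3 flow on the second OvS that shows dp:tc without offloaded:yes.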