[ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-28 Thread Robin Jarry via discuss
Hello OVN community,

This is a follow-up to the message I sent earlier today [1]. This
second part focuses on some ideas I have to remove the limitations
mentioned in the previous email.

[1] https://mail.openvswitch.org/pipermail/ovs-discuss/2023-September/052695.html

In case you haven't read it: my goal is to start a discussion about
how we could improve OVN on the following topics:

- Reduce the memory and CPU footprint of ovn-controller, ovn-northd.
- Support scaling of L2 connectivity across larger clusters.
- Simplify CMS interoperability.
- Allow support for alternative datapath implementations.

Disclaimer:

This message does not cover the L3/L4 features of OVN. I have not had
time to work on these yet. I hope we can discuss how they fit with my
ideas.

Distributed MAC learning
========================

Use one OVS bridge per logical switch with MAC learning enabled. Only
create the bridge if the logical switch has a port bound to the local
chassis.
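
As a rough sketch, here is what ovn-controller could drive for each
logical switch implemented on the chassis, using only standard OVS
commands. The bridge and port names are hypothetical; a plain OVS
bridge forwarding with the NORMAL action already behaves as a learning
switch:

    # one bridge per logical switch with a local port binding
    ovs-vsctl add-br br-ls1
    # plug the logical switch port's interface into that bridge
    ovs-vsctl add-port br-ls1 vm-port1
    # lowest-priority rule: behave as a plain learning switch
    ovs-ofctl add-flow br-ls1 "priority=0,actions=NORMAL"
    # aging is handled by OVS itself
    ovs-vsctl set bridge br-ls1 other-config:mac-aging-time=300
    # inspect the locally learned MAC table
    ovs-appctl fdb/show br-ls1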

Pros:

- Minimal OpenFlow rules required in each bridge (mostly ACLs and NAT).
- No central MAC binding table required.
- MAC table aging comes for free.
- No southbound DB access required for learned addresses or aging.

Cons:

- How to manage seamless upgrades?
- Requires ovn-controller to move/plug ports into the correct bridge.
- Multiple OpenFlow connections (one per managed bridge).
- Requires ovn-trace to be reimplemented differently (maybe other tools
  as well).

Use multicast for overlay networks
==================================

Use a unique 24-bit VNI per overlay network. Derive a multicast group
address from that VNI. Use VXLAN address learning [2] to remove the
need for ovn-controller to know the destination chassis of every MAC
address in advance.

[2] https://datatracker.ietf.org/doc/html/rfc7348#section-4.2
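
As a sketch of what this could look like, the Linux kernel VXLAN
device already implements RFC 7348 address learning (OVS does not, see
the cons below). The VNI-to-group mapping and the interface names here
are only an assumed convention:

    # derive the group by mapping the 24-bit VNI into 239.0.0.0/8:
    # VNI 0x123456 (1193046) -> group 239.18.52.86
    ip link add vx-ls1 type vxlan id 1193046 group 239.18.52.86 \
        dev eth0 dstport 4789
    ip link set vx-ls1 up
    # remote MACs are learned from the source addresses of received
    # tunnel traffic; BUM traffic is flooded to the multicast group
    bridge fdb show dev vx-ls1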

Pros:

- Nodes do not need to know about other nodes in advance. The control
  plane load is distributed across the cluster.
- A 24-bit VNI allows for more than 16 million logical switches. No
  need for extended GENEVE tunnel options.
- Limited and scoped "flooding" with IGMP/MLD snooping enabled in
  top-of-rack switches. Multicast is only used for BUM traffic.
- Only one VXLAN output port per implemented logical switch on a given
  chassis.

Cons:

- OVS does not support VXLAN address learning yet.
- The number of usable multicast groups in a fabric network may be
  limited?
- How to manage seamless upgrades and interoperability with older OVN
  versions?

Connect ovn-controller to the northbound DB
===========================================

This idea extends a previous proposal to move logical flow creation
into ovn-controller [3].

[3] 
https://patchwork.ozlabs.org/project/ovn/patch/20210625233130.3347463-1-num...@ovn.org/

If the first two proposals are implemented, the southbound database
can be removed from the picture. ovn-controller can directly translate
the northbound schema into OVS configuration: bridges, ports and flow
rules.

For other components that require access to the southbound DB (e.g.
the Neutron metadata agent), ovn-controller should provide an
interface to expose state and configuration data for local
consumption.

All state information present in the NB DB should be moved to a separate
state database [4] for CMS consumption.

[4] https://mail.openvswitch.org/pipermail/ovs-dev/2023-April/403675.html

For those who like visuals, I have started working on basic use cases
and how they would be implemented without a southbound database [5].

[5] https://link.excalidraw.com/p/readonly/jwZgJlPe4zhGF8lE5yY3
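
To give an idea of what ovn-controller would consume in this model,
the northbound contents can already be watched with plain OVSDB
monitoring (the NB DB address and the column selection below are just
an example):

    ovsdb-client monitor tcp:127.0.0.1:6641 OVN_Northbound \
        Logical_Switch_Port name,addresses,type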

Pros:

- The northbound DB is smaller by design: reduced network bandwidth
  and memory usage on all chassis.
- If we keep the northbound DB read-only for ovn-controller, we avoid
  the scaling issues that occur when one controller updates a row that
  must then be replicated everywhere.
- The northbound schema knows nothing about flows. We could introduce
  alternative dataplane backends configured by ovn-controller via
  plugins. I have done a minimal PoC to check that it could work with
  the Linux network stack [6]; see the sketch after this list.

[6] https://github.com/rjarry/ovn-nb-agent/blob/main/backend/linux/bridge.go
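
For the Linux backend, the idea boils down to driving native kernel
constructs instead of OpenFlow rules. A minimal sketch with iproute2
(hypothetical names, roughly what the bridge backend of the PoC
automates):

    # one kernel bridge per logical switch, MAC learning included
    ip link add br-ls1 type bridge
    ip link set br-ls1 up
    # attach the local logical switch port
    ip link set vm-port1 master br-ls1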

Cons:

- This would be a serious API breakage for systems that depend on the
  southbound DB.
- Can all OVN constructs be implemented without a southbound DB?
- Is the community interested in alternative datapaths?

Closing thoughts
================

I mainly focused on OpenStack use cases for now, but I think these
proposals could benefit Kubernetes as well.

I hope I didn't bore everyone to death. Let me know what you think.

Cheers!

-- 
Robin Jarry
Red Hat, Telco/NFV



[ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues

2023-09-28 Thread Robin Jarry via discuss
Hello OVN community,

I'm glad the subject of this message has caught your attention :-)

I would like to start a discussion about how we could improve OVN on the
following topics:

* Reduce the memory and CPU footprint of ovn-controller, ovn-northd.
* Support scaling of L2 connectivity across larger clusters.
* Simplify CMS interoperability.
* Allow support for alternative datapath implementations.

This first email will focus on the current issues that (in my view) are
preventing OVN from scaling L2 networks on larger clusters. I will send
another message with some change proposals to remove or fix these
issues.

Disclaimer:

I am fairly new to this project and my perception and understanding may
be incorrect in some aspects. Please forgive me in advance if I use the
wrong terms and/or make invalid statements. My intent is only to make
things better and not to put the blame on anyone for the current design
choices.

Southbound Design
=================

In the current architecture, both databases contain a mix of state
and configuration. While this does not seem to cause any scaling
issues for the northbound DB, it can become a bottleneck for the
southbound DB with large numbers of chassis and logical network
constructs.

The southbound database contains a mix of configuration (logical
flows transformed from the logical network topology) and state
(chassis, port bindings, MAC bindings, FDB entries, etc.).

The "configuration" part is consumed by ovn-controller to implement the
network on every chassis and the "state" part is consumed by ovn-northd
to update the northbound "state" entries and to update logical flows.
Some CMS's [1] also depend on the southbound "state" in order to
function properly.

[1] https://opendev.org/openstack/neutron/src/tag/22.0.0/neutron/agent/ovn/metadata/ovsdb.py#L39-L40
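
To make the split concrete, both kinds of data can be inspected from
the southbound DB (the FDB table only exists in recent OVN schemas):

    # "configuration": logical flows compiled by ovn-northd
    ovn-sbctl lflow-list
    # "state": runtime data reported by the chassis
    ovn-sbctl list Chassis
    ovn-sbctl list Port_Binding
    ovn-sbctl list MAC_Binding
    ovn-sbctl list FDB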

Centralized decisions
=====================

Every chassis needs to be "aware" of all other chassis in the
cluster. This requirement mainly comes from overlay networks that are
implemented over a full mesh of point-to-point GENEVE tunnels (or
VXLAN, with some limitations). It is not a scaling issue by itself,
but it implies centralized decisions, which in turn put pressure on
the central node at scale.

Due to OVSDB monitoring and caching, any change in the southbound DB
(either by northd or by any of the chassis controllers) is replicated
to every chassis. The monitor_all option is often enabled on large
clusters to avoid the CPU cost of conditional monitoring on the
central node.
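
For reference, monitor_all is a per-chassis setting stored in the
local Open vSwitch database:

    ovs-vsctl set open_vswitch . external_ids:ovn-monitor-all=true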

This leads to high memory usage on all chassis, increased control
plane traffic, and possible disruptions of the ovs-vswitchd datapath
flow cache. Unfortunately, I don't have any hard data to back this
claim. It mainly comes from discussions I had with Neutron
contributors and from brainstorming sessions with colleagues.

I hope that the ongoing work to integrate OpenStack support into
ovn-heater [2] will provide more insight.

[2] https://github.com/ovn-org/ovn-heater/pull/179

Dynamic MAC learning
====================

Logical switch ports on a given chassis are all connected to the same
OVS bridge, in the same VLAN. This prevents the use of local MAC
address learning and shifts the responsibility onto a centralized
ovn-northd, which must create all the logical flows required to
properly segment the network.

When ports have mac_address=unknown, centralized MAC learning is
enabled: when a new address is seen entering a port, OVS sends it to
the local ovn-controller, which updates the FDB table and recomputes
flow rules accordingly. With logical switches spanning a large number
of chassis, this centralized MAC address learning and aging can
affect both control plane and dataplane performance.
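
For reference, this behaviour is enabled per port from the northbound
side (the port name is hypothetical), and the learned addresses end
up in the southbound FDB table:

    ovn-nbctl lsp-set-addresses lsp0 unknown
    ovn-sbctl list FDB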

Closing thoughts
================

My understanding of the L3 and L4 capabilities of OVN is too limited
to tell whether there are other issues that would prevent scaling to
thousands of nodes. My focus here was mainly on L2 network scaling.

I would love to get other opinions on these statements.

Cheers!

-- 
Robin Jarry
Red Hat, Telco/NFV



Re: [ovs-discuss] Does dp:tc always mean adequate HW offload

2023-09-28 Thread Eelco Chaudron via discuss


On 28 Sep 2023, at 5:35, Levente Csikor via discuss wrote:

> Hi all,
>
> I am playing around with HW offloading and I have observed something
> strange that I cannot explain (as of now).
>
> I have two OvS switches (both supporting HW offload and having
> hw-offload:true set) connected back-to-back through two physical
> ports.
>
> I am actually measuring latency with Pktgen-DPDK: I send latency
> measurement packets through my first OvS, and they are returned by
> the second OvS. Both are running on a SmartNIC, but the scenario is
> somewhat similar to having them as hypervisor switches using kernel
> drivers, with the pktgen-dpdk app running in a VM "above" the first
> OvS.
>
> The flow rules are based on Layer-3 information (instead of simple
> port forwarding).
>
> OvS 1 flow rules (pfXhpf face the VM, pX face the other OvS):
> - ip,in_port=pf0hpf,nw_src=192.168.0.1,nw_dst=192.168.1.1
> actions=output:p0
> - ip,in_port=p1,nw_src=192.168.0.1,nw_dst=192.168.1.1
> actions=output:pf1hpf
>
> OvS 2 flow rules:
> - ip,in_port=p0,nw_src=192.168.0.1,nw_dst=192.168.1.1 actions=output:p1
>
> The routing is working as expected and I get back all the packets,
> but due to the relatively high latency I measured, I checked whether
> these flow rules, or rather their corresponding flow cache entries,
> are indeed offloaded to the HW.
>
> In the case of the first OvS, the following command gives the below
> output:
> # ovs-appctl dpctl/dump-flows -m |grep offloaded
>
> ufid:66bfe654-11ac-4f01-8dec-414177df78c4, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(p1),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=192.168.0.1,dst=192.168.1.1,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:368243396, bytes:29459460538, used:0.120s, offloaded:yes, dp:tc, actions:pf1hpf
> ufid:89516b0d-bb6f-4df0-ba91-973ef79e6fa3, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(pf0hpf),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=192.168.0.1,dst=192.168.1.1,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:3630728485, bytes:275935348872, used:0.080s, offloaded:yes, dp:tc, actions:p0
> ufid:3fbc4e09-9d78-4e3b-8711-541d36828959, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(pf0hpf),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/128.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:233401993, bytes:17738549256, used:0.080s, offloaded:yes, dp:tc, actions:drop
>
> The key observation here is that the offloaded entries are using tc,
> and the ones related to my Layer-3 forwarding rules (1st and 2nd
> entry) are indeed offloaded to the hardware.
>
> However, on the second OvS, I see this:
> ufid:b88192f5-17d5-461d-ba61-2bd0b7e49b39, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(p1),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x88cc), packets:0, bytes:0, used:never, offloaded:yes, dp:tc, actions:drop
> ufid:82f41402-9d5b-4229-a1c0-d75a35138ebe, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(p0),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=192.168.0.1,dst=192.168.1.1,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:432344072, bytes:26805332464, used:0.000s, dp:tc, actions:p1
> ufid:da660698-a31e-47e5-bf30-34ee90486f57, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(p0),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x88cc), packets:0, bytes:0, used:never, offloaded:yes, dp:tc, actions:drop
> ufid:68dc355c-0823-4324-8a02-fa104597b8d8, recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(p0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=18:5a:58:0c:c8:ca,dst=01:80:c2:00:00:00),eth_type(0/0x), packets:67094, bytes:4025640, used:0.313s, dp:ovs, actions:drop
> ufid:39f2ee6d-4a1b-4781-a73e-2d0a8d045df3, recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(p1),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0/0x), packets:4112, bytes:246720, used:1.725s, dp:ovs, actions:drop
>