Re: [PATCH v2 2/2] openvswitch: use percpu flow stats
From: Thadeu Lima de Souza Cascardo
Date: Thu, 15 Sep 2016 19:11:53 -0300

> Instead of using flow stats per NUMA node, use them per CPU. When using
> megaflows, the stats lock can be a bottleneck in scalability.
>
> On an E5-2690 12-core system, usual throughput went from ~4Mpps to
> ~15Mpps when forwarding between two 40GbE ports with a single flow
> configured on the datapath.
>
> This has been tested on a system with possible CPUs 0-7,16-23. After
> module removal, there was no corruption on the slab cache.
>
> Signed-off-by: Thadeu Lima de Souza Cascardo

Also applied to net-next, thanks.
Re: [PATCH v2 2/2] openvswitch: use percpu flow stats
On Thu, Sep 15, 2016 at 3:11 PM, Thadeu Lima de Souza Cascardo wrote:
> Instead of using flow stats per NUMA node, use them per CPU. When using
> megaflows, the stats lock can be a bottleneck in scalability.
>
> On an E5-2690 12-core system, usual throughput went from ~4Mpps to
> ~15Mpps when forwarding between two 40GbE ports with a single flow
> configured on the datapath.
>
> This has been tested on a system with possible CPUs 0-7,16-23. After
> module removal, there was no corruption on the slab cache.
>
> Signed-off-by: Thadeu Lima de Souza Cascardo
> Cc: pravin shelar

Looks good.

Acked-by: Pravin B Shelar
Re: [PATCH v2 2/2] openvswitch: use percpu flow stats
On Thu, Sep 15, 2016 at 04:09:26PM -0700, Eric Dumazet wrote:
> On Thu, 2016-09-15 at 19:11 -0300, Thadeu Lima de Souza Cascardo wrote:
> > Instead of using flow stats per NUMA node, use them per CPU. When using
> > megaflows, the stats lock can be a bottleneck in scalability.
> >
> > On an E5-2690 12-core system, usual throughput went from ~4Mpps to
> > ~15Mpps when forwarding between two 40GbE ports with a single flow
> > configured on the datapath.
> >
> > This has been tested on a system with possible CPUs 0-7,16-23. After
> > module removal, there was no corruption on the slab cache.
> >
> > Signed-off-by: Thadeu Lima de Souza Cascardo
> > Cc: pravin shelar
> > ---
> >
> > +	/* We open code this to make sure cpu 0 is always considered */
> > +	for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next(cpu, cpu_possible_mask))
> > +		if (flow->stats[cpu])
> >  			kmem_cache_free(flow_stats_cache,
> > -					(struct flow_stats __force *)flow->stats[node]);
> > +					(struct flow_stats __force *)flow->stats[cpu]);
> >  	kmem_cache_free(flow_cache, flow);
> >  }
> >
> > @@ -757,7 +749,7 @@ int ovs_flow_init(void)
> >  	BUILD_BUG_ON(sizeof(struct sw_flow_key) % sizeof(long));
> >
> >  	flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow)
> > -				       + (nr_node_ids
> > +				       + (nr_cpu_ids
> >  					  * sizeof(struct flow_stats *)),
> >  				       0, 0, NULL);
> >  	if (flow_cache == NULL)
>
> Well, if you switch to percpu stats, better use normal
> alloc_percpu(struct flow_stats)
>
> The code was dealing with per node allocation so could not use existing
> helper.
>
> No need to keep this forever.

The problem is that alloc_percpu uses a global spinlock, and that affects
some OVS workloads that create lots of flows, as described in commit
9ac56358dec1a5aa7f4275a42971f55fad1f7f35 ("datapath: Per NUMA node flow
stats."). That problem does not happen with this version, as the flow
allocation does not suffer from the same scalability problem as
alloc_percpu.

Cascardo.
Re: [PATCH v2 2/2] openvswitch: use percpu flow stats
On Thu, 2016-09-15 at 19:11 -0300, Thadeu Lima de Souza Cascardo wrote:
> Instead of using flow stats per NUMA node, use them per CPU. When using
> megaflows, the stats lock can be a bottleneck in scalability.
>
> On an E5-2690 12-core system, usual throughput went from ~4Mpps to
> ~15Mpps when forwarding between two 40GbE ports with a single flow
> configured on the datapath.
>
> This has been tested on a system with possible CPUs 0-7,16-23. After
> module removal, there was no corruption on the slab cache.
>
> Signed-off-by: Thadeu Lima de Souza Cascardo
> Cc: pravin shelar
> ---
>
> +	/* We open code this to make sure cpu 0 is always considered */
> +	for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next(cpu, cpu_possible_mask))
> +		if (flow->stats[cpu])
>  			kmem_cache_free(flow_stats_cache,
> -					(struct flow_stats __force *)flow->stats[node]);
> +					(struct flow_stats __force *)flow->stats[cpu]);
>  	kmem_cache_free(flow_cache, flow);
>  }
>
> @@ -757,7 +749,7 @@ int ovs_flow_init(void)
>  	BUILD_BUG_ON(sizeof(struct sw_flow_key) % sizeof(long));
>
>  	flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow)
> -				       + (nr_node_ids
> +				       + (nr_cpu_ids
>  					  * sizeof(struct flow_stats *)),
>  				       0, 0, NULL);
>  	if (flow_cache == NULL)

Well, if you switch to percpu stats, better use normal
alloc_percpu(struct flow_stats)

The code was dealing with per node allocation so could not use existing
helper.

No need to keep this forever.
[PATCH v2 2/2] openvswitch: use percpu flow stats
Instead of using flow stats per NUMA node, use them per CPU. When using
megaflows, the stats lock can be a bottleneck in scalability.

On an E5-2690 12-core system, usual throughput went from ~4Mpps to
~15Mpps when forwarding between two 40GbE ports with a single flow
configured on the datapath.

This has been tested on a system with possible CPUs 0-7,16-23. After
module removal, there was no corruption on the slab cache.

Signed-off-by: Thadeu Lima de Souza Cascardo
Cc: pravin shelar
---
v2:
* use smp_processor_id as ovs_flow_stats_update is always called from
  BH context
* use kmem_cache_zalloc to allocate flow
---
 net/openvswitch/flow.c       | 42 ++
 net/openvswitch/flow.h       |  4 ++--
 net/openvswitch/flow_table.c | 26 +-
 3 files changed, 33 insertions(+), 39 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 5b80612..0fa45439 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -29,6 +29,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -72,32 +73,33 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags,
 {
 	struct flow_stats *stats;
-	int node = numa_node_id();
+	int cpu = smp_processor_id();
 	int len = skb->len + (skb_vlan_tag_present(skb) ? VLAN_HLEN : 0);
 
-	stats = rcu_dereference(flow->stats[node]);
+	stats = rcu_dereference(flow->stats[cpu]);
 
-	/* Check if already have node-specific stats. */
+	/* Check if already have CPU-specific stats. */
 	if (likely(stats)) {
 		spin_lock(&stats->lock);
 		/* Mark if we write on the pre-allocated stats. */
-		if (node == 0 && unlikely(flow->stats_last_writer != node))
-			flow->stats_last_writer = node;
+		if (cpu == 0 && unlikely(flow->stats_last_writer != cpu))
+			flow->stats_last_writer = cpu;
 	} else {
 		stats = rcu_dereference(flow->stats[0]); /* Pre-allocated. */
 		spin_lock(&stats->lock);
 
-		/* If the current NUMA-node is the only writer on the
+		/* If the current CPU is the only writer on the
 		 * pre-allocated stats keep using them.
 		 */
-		if (unlikely(flow->stats_last_writer != node)) {
+		if (unlikely(flow->stats_last_writer != cpu)) {
 			/* A previous locker may have already allocated the
-			 * stats, so we need to check again.  If node-specific
+			 * stats, so we need to check again.  If CPU-specific
 			 * stats were already allocated, we update the pre-
 			 * allocated stats as we have already locked them.
 			 */
-			if (likely(flow->stats_last_writer != NUMA_NO_NODE)
-			    && likely(!rcu_access_pointer(flow->stats[node]))) {
-				/* Try to allocate node-specific stats. */
+			if (likely(flow->stats_last_writer != -1) &&
+			    likely(!rcu_access_pointer(flow->stats[cpu]))) {
+				/* Try to allocate CPU-specific stats. */
 				struct flow_stats *new_stats;
 
 				new_stats =
@@ -114,12 +116,12 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags,
 					new_stats->tcp_flags = tcp_flags;
 					spin_lock_init(&new_stats->lock);
 
-					rcu_assign_pointer(flow->stats[node],
+					rcu_assign_pointer(flow->stats[cpu],
 							   new_stats);
 					goto unlock;
 				}
 			}
-			flow->stats_last_writer = node;
+			flow->stats_last_writer = cpu;
 		}
 	}
 
@@ -136,15 +138,15 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
 			struct ovs_flow_stats *ovs_stats,
 			unsigned long *used, __be16 *tcp_flags)
 {
-	int node;
+	int cpu;
 
 	*used = 0;
 	*tcp_flags = 0;
 	memset(ovs_stats, 0, sizeof(*ovs_stats));
 
-	/* We open code this to make sure node 0 is always considered */
-	for (node = 0; node < MAX_NUMNODES;
-	     node = next_node(node, node_possible_map)) {
-		struct flow_stats *stats = rcu_dereference_ovsl(flow->stats[node]);
+	/* We open code this to make sure cpu 0 is always considered */
+	for (cpu = 0; cpu < nr_cpu_ids;
+	     cpu = cpumask_next(cpu, cpu_possible_mask)) {
+		struct flow_stats *stats = rcu_dereference_ovsl(flow->stats[cpu]);