Re: [PATCH v2 2/2] openvswitch: use percpu flow stats

2016-09-18 Thread David Miller
From: Thadeu Lima de Souza Cascardo 
Date: Thu, 15 Sep 2016 19:11:53 -0300

> Instead of using flow stats per NUMA node, use them per CPU. When using
> megaflows, the stats lock can be a scalability bottleneck.
> 
> On an E5-2690 12-core system, usual throughput went from ~4Mpps to
> ~15Mpps when forwarding between two 40GbE ports with a single flow
> configured on the datapath.
> 
> This has been tested on a system with possible CPUs 0-7,16-23. After
> module removal, there was no corruption of the slab cache.
> 
> Signed-off-by: Thadeu Lima de Souza Cascardo 

Also applied to net-next, thanks.


Re: [PATCH v2 2/2] openvswitch: use percpu flow stats

2016-09-17 Thread pravin shelar
On Thu, Sep 15, 2016 at 3:11 PM, Thadeu Lima de Souza Cascardo
 wrote:
> Instead of using flow stats per NUMA node, use them per CPU. When using
> megaflows, the stats lock can be a scalability bottleneck.
>
> On an E5-2690 12-core system, usual throughput went from ~4Mpps to
> ~15Mpps when forwarding between two 40GbE ports with a single flow
> configured on the datapath.
>
> This has been tested on a system with possible CPUs 0-7,16-23. After
> module removal, there was no corruption of the slab cache.
>
> Signed-off-by: Thadeu Lima de Souza Cascardo 
> Cc: pravin shelar 

Looks good.

Acked-by: Pravin B Shelar 


Re: [PATCH v2 2/2] openvswitch: use percpu flow stats

2016-09-15 Thread Thadeu Lima de Souza Cascardo
On Thu, Sep 15, 2016 at 04:09:26PM -0700, Eric Dumazet wrote:
> On Thu, 2016-09-15 at 19:11 -0300, Thadeu Lima de Souza Cascardo wrote:
> > Instead of using flow stats per NUMA node, use them per CPU. When using
> > megaflows, the stats lock can be a scalability bottleneck.
> > 
> > On an E5-2690 12-core system, usual throughput went from ~4Mpps to
> > ~15Mpps when forwarding between two 40GbE ports with a single flow
> > configured on the datapath.
> > 
> > This has been tested on a system with possible CPUs 0-7,16-23. After
> > module removal, there was no corruption of the slab cache.
> > 
> > Signed-off-by: Thadeu Lima de Souza Cascardo 
> > Cc: pravin shelar 
> > ---
> 
> > +   /* We open code this to make sure cpu 0 is always considered */
> > +   for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next(cpu, cpu_possible_mask))
> > +   if (flow->stats[cpu])
> > kmem_cache_free(flow_stats_cache,
> > -   (struct flow_stats __force *)flow->stats[node]);
> > +   (struct flow_stats __force *)flow->stats[cpu]);
> > kmem_cache_free(flow_cache, flow);
> >  }
> >  
> > @@ -757,7 +749,7 @@ int ovs_flow_init(void)
> > BUILD_BUG_ON(sizeof(struct sw_flow_key) % sizeof(long));
> >  
> > flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow)
> > -  + (nr_node_ids
> > +  + (nr_cpu_ids
> >   * sizeof(struct flow_stats *)),
> >0, 0, NULL);
> > if (flow_cache == NULL)
> 
> Well, if you switch to percpu stats, better use the normal
> alloc_percpu(struct flow_stats).
> 
> The code was dealing with per-node allocation, so it could not use the
> existing helper.
> 
> No need to keep this forever.

The problem is that alloc_percpu uses a global spinlock, and that hurts
some OVS workloads that create lots of flows, as described in commit
9ac56358dec1a5aa7f4275a42971f55fad1f7f35 ("datapath: Per NUMA node flow
stats.").

That problem does not come back with this version, because allocating each
flow's stats from the kmem_cache does not go through alloc_percpu and so
avoids its scalability bottleneck.

Cascardo.


Re: [PATCH v2 2/2] openvswitch: use percpu flow stats

2016-09-15 Thread Eric Dumazet
On Thu, 2016-09-15 at 19:11 -0300, Thadeu Lima de Souza Cascardo wrote:
> Instead of using flow stats per NUMA node, use them per CPU. When using
> megaflows, the stats lock can be a scalability bottleneck.
> 
> On an E5-2690 12-core system, usual throughput went from ~4Mpps to
> ~15Mpps when forwarding between two 40GbE ports with a single flow
> configured on the datapath.
> 
> This has been tested on a system with possible CPUs 0-7,16-23. After
> module removal, there was no corruption of the slab cache.
> 
> Signed-off-by: Thadeu Lima de Souza Cascardo 
> Cc: pravin shelar 
> ---

> + /* We open code this to make sure cpu 0 is always considered */
> + for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next(cpu, cpu_possible_mask))
> + if (flow->stats[cpu])
>   kmem_cache_free(flow_stats_cache,
> - (struct flow_stats __force *)flow->stats[node]);
> + (struct flow_stats __force *)flow->stats[cpu]);
>   kmem_cache_free(flow_cache, flow);
>  }
>  
> @@ -757,7 +749,7 @@ int ovs_flow_init(void)
>   BUILD_BUG_ON(sizeof(struct sw_flow_key) % sizeof(long));
>  
>   flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow)
> -+ (nr_node_ids
> ++ (nr_cpu_ids
> * sizeof(struct flow_stats *)),
>  0, 0, NULL);
>   if (flow_cache == NULL)

Well, if you switch to percpu stats, better use the normal
alloc_percpu(struct flow_stats).

The code was dealing with per-node allocation, so it could not use the
existing helper.

No need to keep this forever.




[PATCH v2 2/2] openvswitch: use percpu flow stats

2016-09-15 Thread Thadeu Lima de Souza Cascardo
Instead of using flow stats per NUMA node, use them per CPU. When using
megaflows, the stats lock can be a scalability bottleneck.

On an E5-2690 12-core system, usual throughput went from ~4Mpps to
~15Mpps when forwarding between two 40GbE ports with a single flow
configured on the datapath.

This has been tested on a system with possible CPUs 0-7,16-23. After
module removal, there was no corruption of the slab cache.

Signed-off-by: Thadeu Lima de Souza Cascardo 
Cc: pravin shelar 
---

v2:
* use smp_processor_id as ovs_flow_stats_update is always called from BH context
* use kmem_cache_zalloc to allocate flow

---
 net/openvswitch/flow.c   | 42 ++
 net/openvswitch/flow.h   |  4 ++--
 net/openvswitch/flow_table.c | 26 +-
 3 files changed, 33 insertions(+), 39 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 5b80612..0fa45439 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -72,32 +73,33 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags,
 {
struct flow_stats *stats;
int node = numa_node_id();
+   int cpu = smp_processor_id();
int len = skb->len + (skb_vlan_tag_present(skb) ? VLAN_HLEN : 0);
 
-   stats = rcu_dereference(flow->stats[node]);
+   stats = rcu_dereference(flow->stats[cpu]);
 
-   /* Check if already have node-specific stats. */
+   /* Check if already have CPU-specific stats. */
if (likely(stats)) {
spin_lock(&stats->lock);
/* Mark if we write on the pre-allocated stats. */
-   if (node == 0 && unlikely(flow->stats_last_writer != node))
-   flow->stats_last_writer = node;
+   if (cpu == 0 && unlikely(flow->stats_last_writer != cpu))
+   flow->stats_last_writer = cpu;
} else {
stats = rcu_dereference(flow->stats[0]); /* Pre-allocated. */
spin_lock(&stats->lock);
 
-   /* If the current NUMA-node is the only writer on the
+   /* If the current CPU is the only writer on the
 * pre-allocated stats keep using them.
 */
-   if (unlikely(flow->stats_last_writer != node)) {
+   if (unlikely(flow->stats_last_writer != cpu)) {
/* A previous locker may have already allocated the
-* stats, so we need to check again.  If node-specific
+* stats, so we need to check again.  If CPU-specific
 * stats were already allocated, we update the pre-
 * allocated stats as we have already locked them.
 */
-   if (likely(flow->stats_last_writer != NUMA_NO_NODE)
-   && likely(!rcu_access_pointer(flow->stats[node]))) {
-   /* Try to allocate node-specific stats. */
+   if (likely(flow->stats_last_writer != -1) &&
+   likely(!rcu_access_pointer(flow->stats[cpu]))) {
+   /* Try to allocate CPU-specific stats. */
struct flow_stats *new_stats;
 
new_stats =
@@ -114,12 +116,12 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags,
new_stats->tcp_flags = tcp_flags;
spin_lock_init(&new_stats->lock);
 
-   rcu_assign_pointer(flow->stats[node],
+   rcu_assign_pointer(flow->stats[cpu],
   new_stats);
goto unlock;
}
}
-   flow->stats_last_writer = node;
+   flow->stats_last_writer = cpu;
}
}
 
@@ -136,15 +138,15 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
struct ovs_flow_stats *ovs_stats,
unsigned long *used, __be16 *tcp_flags)
 {
-   int node;
+   int cpu;
 
*used = 0;
*tcp_flags = 0;
memset(ovs_stats, 0, sizeof(*ovs_stats));
 
-   /* We open code this to make sure node 0 is always considered */
-   for (node = 0; node < MAX_NUMNODES; node = next_node(node, node_possible_map)) {
-   struct flow_stats *stats = rcu_dereference_ovsl(flow->stats[node]);
+   /* We open code this to make sure cpu 0 is always considered */
+   for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next(cpu, cpu_possible_mask)) {
+   struct flow_stats *stats = rcu_dereference_ovsl(flow->stats[cpu]);