Re: [RFC PATCH] net: frag limit checks need to use percpu_counter_compare

2017-09-01 Thread Michal Kubecek
On Fri, Sep 01, 2017 at 09:41:56AM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 31 Aug 2017 18:23:49 +0200 Michal Kubecek  wrote:
> 
> > If we go this way (which would IMHO require some benchmarks to make sure
> > it doesn't harm performance too much) we can drop the explicit checks
> > for zero thresholds which were added to work around the unreliability of
> > fast checks of percpu counters (or at least the second one was by commit
> > 30759219f562 ("net: disable fragment reassembly if high_thresh is zero")).
>   
> After much consideration, together with Florian, I'm now instead
> looking at reverting the use of percpu_counter for this memory
> accounting use-case.  The complexity and maintenance cost are not worth
> it.  And I'm of course testing the perf effect, and currently I'm _not_
> seeing any perf regression on my 10G + 100G testlab (although this is
> not a NUMA system, which was my original optimization case).

This sounds reasonable to me. It is indeed questionable if percpu
counters are still worth the complexity if all checks have to be changed
to the exact version.

Perhaps there would be some gain for many CPUs if thresholds are large
enough to (almost) always avoid the need to calculate the sum. But once
we leave that safe area, I would be surprised if a simple atomic_t
wouldn't be more efficient.

Michal Kubecek



Re: [RFC PATCH] net: frag limit checks need to use percpu_counter_compare

2017-09-01 Thread Jesper Dangaard Brouer

On Thu, 31 Aug 2017 18:23:49 +0200 Michal Kubecek  wrote:

> If we go this way (which would IMHO require some benchmarks to make sure
> it doesn't harm performance too much) we can drop the explicit checks
> for zero thresholds which were added to work around the unreliability of
> fast checks of percpu counters (or at least the second one was by commit
> 30759219f562 ("net: disable fragment reassembly if high_thresh is zero")).
  
After much consideration, together with Florian, I'm now instead
looking at reverting the use of percpu_counter for this memory
accounting use-case.  The complexity and maintenance cost are not worth
it.  And I'm of course testing the perf effect, and currently I'm _not_
seeing any perf regression on my 10G + 100G testlab (although this is
not a NUMA system, which was my original optimization case).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC PATCH] net: frag limit checks need to use percpu_counter_compare

2017-09-01 Thread Jesper Dangaard Brouer

On Fri, 1 Sep 2017 02:25:32 + "liujian (CE)" <liujia...@huawei.com> wrote:

> > -Original Message-
> > From: Michal Kubecek [mailto:mkube...@suse.cz]
> > Sent: Friday, September 01, 2017 12:24 AM
> > To: Jesper Dangaard Brouer
> > Cc: liujian (CE); netdev@vger.kernel.org; Florian Westphal
> > Subject: Re: [RFC PATCH] net: frag limit checks need to use
> > percpu_counter_compare
> > 
> > On Thu, Aug 31, 2017 at 12:20:19PM +0200, Jesper Dangaard Brouer wrote:  
> > > To Liujian: can you please test this patch?
> > >  I want to understand if using __percpu_counter_compare() solves the
> > > problem correctness-wise (even though this will be slower than using
> > > a simple atomic_t on your big system).
> 
> I have tested the patch; it works.

Thanks for confirming this.

> 1. make sure frag_mem_limit reaches the thresh
>   ===>FRAG: inuse 0 memory 0 frag_mem_limit 5386864  
> 2. change NIC rx irq's affinity to a fixed CPU

If you pin the NIC RX queue to a single CPU, then the error
basically cannot happen.  Different CPUs need to have a chance to "own"
part of the percpu_counter.  I guess a default setup with irqbalance
could eventually screw the percpu_counter enough given enough CPUs, or
a network load with enough different L2-headers to hit different RX
queues.
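
For readers following along, the drift comes from the batching scheme in
lib/percpu_counter: each CPU accumulates a local delta and only folds it
into the shared count when the delta reaches the batch size.  A minimal
userspace toy model (illustration only, not kernel code; the toy_* names
are made up for this sketch) shows how 24 CPUs can each hide almost one
batch, while a single pinned CPU can hide at most one:

/* Toy userspace model of the lib/percpu_counter batching (illustration
 * only, not kernel code; all names below are made up for this sketch).
 * Each "CPU" keeps a private delta and only folds it into the shared
 * count once the delta reaches the batch size.
 */
#include <stdio.h>

#define NCPUS 24
#define BATCH 130000			/* like frag_percpu_counter_batch */

struct toy_counter {
	long long count;		/* the shared fbc->count equivalent */
	long long pcpu[NCPUS];		/* per-CPU unflushed deltas */
};

static void toy_add(struct toy_counter *c, int cpu, long long amount)
{
	long long delta = c->pcpu[cpu] + amount;

	if (delta >= BATCH || delta <= -BATCH) {
		c->count += delta;	/* fold into the shared count */
		c->pcpu[cpu] = 0;
	} else {
		c->pcpu[cpu] = delta;	/* stays CPU-local, invisible to readers */
	}
}

int main(void)
{
	struct toy_counter c = { 0 };
	int cpu;

	/* Every CPU holds just under one batch of not-yet-flushed memory. */
	for (cpu = 0; cpu < NCPUS; cpu++)
		toy_add(&c, cpu, BATCH - 1);

	printf("fast read sees: %lld\n", c.count);		/* 0 */
	printf("real usage is:  %lld\n",
	       (long long)NCPUS * (BATCH - 1));			/* ~3.1 MB */
	return 0;
}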

> 3. iperf -u -c 9.83.1.41 -l 1 -i 1 -t 1000 -P 10 -b 20M
>   And check /proc/net/snmp, there are no ReasmFails.

My quick check command is:
 nstat > /dev/null && sleep 1 && nstat && grep FRAG /proc/net/sockstat

> And I think adding some counter sync points, as you said, would be
> a better way.

I've discussed this offlist with Florian; while it is doable, we are
adding too much complexity for something that can be solved much more
simply with an atomic_t (as before my patch).  Thus, I'm now looking
at reverting my original change (commit 6d7b857d541e ("net: use
lib/percpu_counter API for fragmentation mem accounting")).
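
For context, the accounting a revert would bring back is the plain
atomic_t scheme used before commit 6d7b857d541e.  A rough sketch of those
helpers (reconstructed here for illustration; the exact pre-percpu_counter
code may differ slightly):

/* Sketch of the simpler atomic_t accounting a revert would restore
 * (reconstructed for illustration; not the actual revert patch).
 */
struct netns_frags {
	/* ... */
	atomic_t		mem;	/* bytes used by this netns' frag queues */
	int			high_thresh;
	int			low_thresh;
};

static inline int frag_mem_limit(struct netns_frags *nf)
{
	return atomic_read(&nf->mem);	/* always exact, no per-CPU drift */
}

static inline void add_frag_mem_limit(struct netns_frags *nf, int i)
{
	atomic_add(i, &nf->mem);	/* one shared cacheline, but simple */
}

static inline void sub_frag_mem_limit(struct netns_frags *nf, int i)
{
	atomic_sub(i, &nf->mem);
}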

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


RE: [RFC PATCH] net: frag limit checks need to use percpu_counter_compare

2017-08-31 Thread liujian (CE)



Best Regards,
liujian


> -Original Message-
> From: Michal Kubecek [mailto:mkube...@suse.cz]
> Sent: Friday, September 01, 2017 12:24 AM
> To: Jesper Dangaard Brouer
> Cc: liujian (CE); netdev@vger.kernel.org; Florian Westphal
> Subject: Re: [RFC PATCH] net: frag limit checks need to use
> percpu_counter_compare
> 
> On Thu, Aug 31, 2017 at 12:20:19PM +0200, Jesper Dangaard Brouer wrote:
> > To Liujian: can you please test this patch?
> >  I want to understand if using __percpu_counter_compare() solves the
> > problem correctness-wise (even though this will be slower than using
> > a simple atomic_t on your big system).

I have tested the patch; it works.
1. make sure frag_mem_limit reaches the thresh
  ===>FRAG: inuse 0 memory 0 frag_mem_limit 5386864
2. change NIC rx irq's affinity to a fixed CPU
3. iperf -u -c 9.83.1.41 -l 1 -i 1 -t 1000 -P 10 -b 20M
  And check /proc/net/snmp, there are no ReasmFails.

And I think adding some counter sync points, as you said, would be a better way.

> > Fix a bug in the fragmentation code's use of the percpu_counter API
> > that causes issues on systems with many CPUs.
> >
> > The frag_mem_limit() just reads the global counter (fbc->count),
> > without considering that other CPUs can each hold up to the batch
> > size (130K) of memory that hasn't been subtracted yet.  Due to the
> > 3MBytes lower thresh limit, this becomes dangerous at >=24 CPUs
> > (3*1024*1024/130000 = 24).
> >
> > The __percpu_counter_compare() does the right thing: it takes the
> > number of (online) CPUs and the batch size into account, and calls
> > __percpu_counter_sum() when needed.
> >
> > On systems with many CPUs this will unfortunately always result in the
> > heavier fully locked __percpu_counter_sum() which touches the
> > per_cpu_ptr of all (online) CPUs.
> >
> > On systems with a smaller number of CPUs this solution is also not
> > optimal, because __percpu_counter_compare()/__percpu_counter_sum()
> > doesn't help synchronize the global counter.
> > Florian Westphal has an idea of adding some counter sync points,
> > which should help address this issue.
> > ---
> >  include/net/inet_frag.h  |   16 ++--
> >  net/ipv4/inet_fragment.c |6 +++---
> >  2 files changed, 17 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
> > index 6fdcd2427776..b586e320783d 100644
> > --- a/include/net/inet_frag.h
> > +++ b/include/net/inet_frag.h
> > @@ -147,9 +147,21 @@ static inline bool inet_frag_evicting(struct inet_frag_queue *q)
> >   */
> >  static unsigned int frag_percpu_counter_batch = 130000;
> >
> > -static inline int frag_mem_limit(struct netns_frags *nf)
> > +static inline bool frag_mem_over_limit(struct netns_frags *nf, int thresh)
> >  {
> > -	return percpu_counter_read(&nf->mem);
> > +	/* When reading the counter here, the __percpu_counter_compare()
> > +	 * call will invoke __percpu_counter_sum() when needed.  Whether
> > +	 * that happens depends on num_online_cpus()*batch size, as each
> > +	 * CPU can potentially hold a batch count.
> > +	 *
> > +	 * With many CPUs this heavier sum operation will
> > +	 * unfortunately always occur.
> > +	 */
> > +	if (__percpu_counter_compare(&nf->mem, thresh,
> > +				     frag_percpu_counter_batch) > 0)
> > +		return true;
> > +	else
> > +		return false;
> >  }
> >
> >  static inline void sub_frag_mem_limit(struct netns_frags *nf, int i)
> > diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> > index 96e95e83cc61..ee2cf56900e6 100644
> > --- a/net/ipv4/inet_fragment.c
> > +++ b/net/ipv4/inet_fragment.c
> > @@ -120,7 +120,7 @@ static void inet_frag_secret_rebuild(struct inet_frags *f)
> >  static bool inet_fragq_should_evict(const struct inet_frag_queue *q)
> >  {
> >  	return q->net->low_thresh == 0 ||
> > -	       frag_mem_limit(q->net) >= q->net->low_thresh;
> > +	       frag_mem_over_limit(q->net, q->net->low_thresh);
> >  }
> >
> >  static unsigned int
> > @@ -355,7 +355,7 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
> >  {
> >  	struct inet_frag_queue *q;
> > 
> > -	if (!nf->high_thresh || frag_mem_limit(nf) > nf->high_thresh) {
> > +	if (!nf->high_thresh || frag_mem_over_limit(nf, nf->high_thresh)) {
> >  		inet_frag_schedule_worker(f);
> >  		return NULL;
> >  	}
> 
> If we go this way (which would IMHO require some benchmarks to make sure
> it doesn't harm performance too much) we can drop the explicit checks
> for zero thresholds which were added to work around the unreliability of
> fast checks of percpu counters (or at least the second one was by commit
> 30759219f562 ("net: disable fragment reassembly if high_thresh is zero")).

Re: [RFC PATCH] net: frag limit checks need to use percpu_counter_compare

2017-08-31 Thread Michal Kubecek
On Thu, Aug 31, 2017 at 12:20:19PM +0200, Jesper Dangaard Brouer wrote:
> To Liujian: can you please test this patch?
>  I want to understand if using __percpu_counter_compare() solves
>  the problem correctness-wise (even though this will be slower
>  than using a simple atomic_t on your big system).
> 
> Fix a bug in the fragmentation code's use of the percpu_counter API
> that causes issues on systems with many CPUs.
> 
> The frag_mem_limit() just reads the global counter (fbc->count),
> without considering that other CPUs can each hold up to the batch
> size (130K) of memory that hasn't been subtracted yet.  Due to the
> 3MBytes lower thresh limit, this becomes dangerous at >=24 CPUs
> (3*1024*1024/130000 = 24).
> 
> The __percpu_counter_compare() does the right thing: it takes the
> number of (online) CPUs and the batch size into account, and calls
> __percpu_counter_sum() when needed.
> 
> On systems with many CPUs this will unfortunately always result in the
> heavier fully locked __percpu_counter_sum() which touches the
> per_cpu_ptr of all (online) CPUs.
> 
> On systems with a smaller number of CPUs this solution is also not
> optimal, because __percpu_counter_compare()/__percpu_counter_sum()
> doesn't help synchronize the global counter.
>  Florian Westphal has an idea of adding some counter sync points,
> which should help address this issue.
> ---
>  include/net/inet_frag.h  |   16 ++--
>  net/ipv4/inet_fragment.c |6 +++---
>  2 files changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
> index 6fdcd2427776..b586e320783d 100644
> --- a/include/net/inet_frag.h
> +++ b/include/net/inet_frag.h
> @@ -147,9 +147,21 @@ static inline bool inet_frag_evicting(struct inet_frag_queue *q)
>   */
>  static unsigned int frag_percpu_counter_batch = 130000;
>  
> -static inline int frag_mem_limit(struct netns_frags *nf)
> +static inline bool frag_mem_over_limit(struct netns_frags *nf, int thresh)
>  {
> -	return percpu_counter_read(&nf->mem);
> +	/* When reading the counter here, the __percpu_counter_compare()
> +	 * call will invoke __percpu_counter_sum() when needed.  Whether
> +	 * that happens depends on num_online_cpus()*batch size, as each
> +	 * CPU can potentially hold a batch count.
> +	 *
> +	 * With many CPUs this heavier sum operation will
> +	 * unfortunately always occur.
> +	 */
> +	if (__percpu_counter_compare(&nf->mem, thresh,
> +				     frag_percpu_counter_batch) > 0)
> +		return true;
> +	else
> +		return false;
>  }
>  
>  static inline void sub_frag_mem_limit(struct netns_frags *nf, int i)
> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index 96e95e83cc61..ee2cf56900e6 100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
> @@ -120,7 +120,7 @@ static void inet_frag_secret_rebuild(struct inet_frags *f)
>  static bool inet_fragq_should_evict(const struct inet_frag_queue *q)
>  {
>   return q->net->low_thresh == 0 ||
> -frag_mem_limit(q->net) >= q->net->low_thresh;
> + frag_mem_over_limit(q->net, q->net->low_thresh);
>  }
>  
>  static unsigned int
> @@ -355,7 +355,7 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
>  {
>   struct inet_frag_queue *q;
>  
> - if (!nf->high_thresh || frag_mem_limit(nf) > nf->high_thresh) {
> + if (!nf->high_thresh || frag_mem_over_limit(nf, nf->high_thresh)) {
>   inet_frag_schedule_worker(f);
>   return NULL;
>   }

If we go this way (which would IMHO require some benchmarks to make sure
it doesn't harm performance too much) we can drop the explicit checks
for zero thresholds which were added to work around the unreliability of
fast checks of percpu counters (or at least the second one was by commit
30759219f562 ("net: disable fragment reassembly if high_thresh is zero").
 
Michal Kubecek

> @@ -396,7 +396,7 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags 
> *nf,
>   struct inet_frag_queue *q;
>   int depth = 0;
>  
> - if (frag_mem_limit(nf) > nf->low_thresh)
> + if (frag_mem_over_limit(nf, nf->low_thresh))
>   inet_frag_schedule_worker(f);
>  
>   hash &= (INETFRAGS_HASHSZ - 1);
> 


Re: [RFC PATCH] net: frag limit checks need to use percpu_counter_compare

2017-08-31 Thread Stephen Hemminger
On Thu, 31 Aug 2017 12:20:19 +0200
Jesper Dangaard Brouer  wrote:

> +static inline bool frag_mem_over_limit(struct netns_frags *nf, int thresh)
>  {
> -	return percpu_counter_read(&nf->mem);
> +	/* When reading the counter here, the __percpu_counter_compare()
> +	 * call will invoke __percpu_counter_sum() when needed.  Whether
> +	 * that happens depends on num_online_cpus()*batch size, as each
> +	 * CPU can potentially hold a batch count.
> +	 *
> +	 * With many CPUs this heavier sum operation will
> +	 * unfortunately always occur.
> +	 */
> +	if (__percpu_counter_compare(&nf->mem, thresh,
> +				     frag_percpu_counter_batch) > 0)
> +		return true;
> +	else
> +		return false;

You don't need an if() here.

	return __percpu_counter_compare(&nf->mem, thresh,
					frag_percpu_counter_batch) > 0;
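
Folding that simplification into the RFC helper gives roughly the
following (an untested sketch combining the RFC patch with Stephen's
suggestion, not a posted patch):

static inline bool frag_mem_over_limit(struct netns_frags *nf, int thresh)
{
	/* __percpu_counter_compare() falls back to the exact sum whenever
	 * the fast count is within batch * num_online_cpus() of thresh.
	 */
	return __percpu_counter_compare(&nf->mem, thresh,
					frag_percpu_counter_batch) > 0;
}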