Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-20 Thread Jesper Dangaard Brouer
On Tue, 19 Dec 2017 18:56:51 -0600 (CST)
Christopher Lameter  wrote:

> On Tue, 19 Dec 2017, Rao Shoaib wrote:
> 
> > > > mm/slab_common.c  
> > > It would be great to have separate patches so that we can review it
> > > properly:
> > >
> > > 1. Move the code into slab_common.c
> > > 2. The actual code changes to the kfree rcu mechanism
> > > 3. The whitespace changes  
> 
> > I can certainly break down the patch and submit smaller patches as you have
> > suggested.
> >
> > BTW -- This is my first ever patch to Linux, so I am still learning the
> > etiquette.  
> 
> You are doing great. Keep at improving the patches and we will get your
> changes into the kernel source. If you want to send your first patchset
> then a tool like "quilt" or "git" might be helpful.

When working with patchsets (multiple separate patches, as requested
here), I personally prefer the Stacked Git[1] (StGit) command line
tool 'stg', as it allows me to easily adjust patches in the middle of
the patchset "stack".  It is similar to quilt, just based on git
itself.

I guess most people on this list use 'git rebase --interactive' when
updating their patchsets (?)

[1] http://procode.org/stgit/doc/tutorial.html
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-20 Thread Michal Hocko
On Tue 19-12-17 12:02:03, Rao Shoaib wrote:
> BTW -- This is my first ever patch to Linux, so I am still learning the
> etiquette.

Reading through Documentation/process/submitting-patches.rst might be
really helpful.

Good luck!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Jesper Dangaard Brouer

On Tue, 19 Dec 2017 13:20:43 -0800 Rao Shoaib  wrote:

> On 12/19/2017 12:41 PM, Jesper Dangaard Brouer wrote:
> > On Tue, 19 Dec 2017 09:52:27 -0800 rao.sho...@oracle.com wrote:
> >  
> >> +/* Main RCU function that is called to free RCU structures */
> >> +static void
> >> +__rcu_bulk_free(struct rcu_head *head, rcu_callback_t func, int cpu, bool 
> >> lazy)
> >> +{
> >> +  unsigned long offset;
> >> +  void *ptr;
> >> +  struct rcu_bulk_free *rbf;
> >> +  struct rcu_bulk_free_container *rbfc = NULL;
> >> +
> >> +  rbf = this_cpu_ptr(_rbf);
> >> +
> >> +  if (unlikely(!rbf->rbf_init)) {
> >> +  spin_lock_init(&rbf->rbf_lock);
> >> +  rbf->rbf_cpu = smp_processor_id();
> >> +  rbf->rbf_init = true;
> >> +  }
> >> +
> >> +  /* hold lock to protect against other cpu's */
> >> +  spin_lock_bh(&rbf->rbf_lock);
> >
> > I'm not sure this will be faster.  Having to take a cross CPU lock here
> > (+ BH-disable) could cause scaling issues.   Hopefully this lock will
> > not be used intensively by other CPUs, right?
> >
[...]
> 
> As Paul has pointed out, the lock is a per-cpu lock; the only reason for
> another CPU to access this lock is if the RCU callbacks run on a
> different CPU, and there is nothing the code can do to avoid that, but
> that should be rare anyway.

(loop in Paul's comment)
On Tue, 19 Dec 2017 12:56:29 -0800
"Paul E. McKenney"  wrote:

> Isn't this lock in a per-CPU object?  It -might- go cross-CPU in response
> to CPU-hotplug operations, but that should be rare.

Point taken.  If this lock is very unlikely to be taken on another CPU
then I withdraw my performance concerns (the cacheline can hopefully
stay in Modified(M) state on this CPU, and all other CPUs will have it
in Invalid(I) state, in the MESI cache coherence protocol view[1]).

The lock's atomic operation does have some overhead; _later_ we could
get fancy and use a seqlock (include/linux/seqlock.h) to remove that.
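
To make that concrete, here is a minimal sketch of the plain seqlock_t
pattern from include/linux/seqlock.h that such a change could build on.
The demo_* names are invented for illustration and are not part of any
posted patch, and whether a seqlock really fits the rbf bookkeeping is
an open question:

#include <linux/seqlock.h>

static DEFINE_SEQLOCK(demo_lock);
static unsigned long demo_counter;

static void demo_update(unsigned long val)
{
	write_seqlock(&demo_lock);	/* writers still serialize among themselves */
	demo_counter = val;
	write_sequnlock(&demo_lock);
}

static unsigned long demo_read(void)
{
	unsigned long val;
	unsigned int seq;

	do {	/* lockless read side, retried if it races with a writer */
		seq = read_seqbegin(&demo_lock);
		val = demo_counter;
	} while (read_seqretry(&demo_lock, seq));

	return val;
}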

[1] https://en.wikipedia.org/wiki/MESI_protocol
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Jesper Dangaard Brouer
On Tue, 19 Dec 2017 16:20:51 -0800
"Paul E. McKenney"  wrote:

> On Tue, Dec 19, 2017 at 02:12:06PM -0800, Matthew Wilcox wrote:
> > On Tue, Dec 19, 2017 at 09:41:58PM +0100, Jesper Dangaard Brouer wrote:  
> > > If I had to implement this: I would choose to do the optimization in
> > > __rcu_process_callbacks() create small on-call-stack ptr-array for
> > > kfree_bulk().  I would only optimize the case that call kfree()
> > > directly.  In the while(list) loop I would defer calling
> > > __rcu_reclaim() for __is_kfree_rcu_offset(head->func), and instead add
> > > them to the ptr-array (and flush if the array is full in loop, and
> > > kfree_bulk flush after loop).
> > > 
> > > The real advantage of kfree_bulk() comes from amortizing the per kfree
> > > (behind-the-scenes) sync cost.  There is an additional benefit, because
> > > objects comes from RCU and will hit a slower path in SLUB.   The SLUB
> > > allocator is very fast for objects that gets recycled quickly (short
> > > lifetime), non-locked (cpu-local) double-cmpxchg.  But slower for
> > > longer-lived/more-outstanding objects, as this hits a slower code-path,
> > > fully locked (cross-cpu) double-cmpxchg.
> > 
> > Something like this ...  (compile tested only)

Yes, exactly.

> > Considerably less code; Rao, what do you think?  
> 
> I am sorry, but I am not at all a fan of this approach.
> 
> If we are going to make this sort of change, we should do so in a way
> that allows the slab code to actually do the optimizations that might
> make this sort of thing worthwhile.  After all, if the main goal was small
> code size, the best approach is to drop kfree_bulk() and get on with life
> in the usual fashion.
> 
> I would prefer to believe that something like kfree_bulk() can help,
> and if that is the case, we should give it a chance to do things like
> group kfree_rcu() requests by destination slab and so forth, allowing
> batching optimizations that might provide more significant increases
> in performance.  Furthermore, having this in slab opens the door to
> slab taking emergency action when memory is low.

I agree with your argument.  In the (slub) code I do handle different
destination slabs, but only with a limited look-ahead to find objects
from the same dest-slab, which gives the speedup (see
build_detached_freelist).

We do have a larger and more consistent speedup potential if we add
infrastructure that allows us to pre-sort by destination slab before
invoking kfree_bulk().  In that respect, Rao's patch is a better
approach.
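
As a sketch (not from any posted patch) of what that grouping could
look like: walk a vector of pointers, gather the objects that live in
the same kmem_cache, and hand each group to kmem_cache_free_bulk() in
one call.  Locking and the handling of objects not backed by a slab
(large kmalloc allocations) are deliberately left out:

#include <linux/mm.h>
#include <linux/slab.h>

static void free_grouped_by_slab(void **objs, size_t nr)
{
	size_t i, j, grouped;

	for (i = 0; i < nr; i += grouped) {
		struct kmem_cache *s =
			virt_to_head_page(objs[i])->slab_cache;

		/* Move every object from cache 's' next to objs[i]. */
		grouped = 1;
		for (j = i + 1; j < nr; j++) {
			void *tmp;

			if (virt_to_head_page(objs[j])->slab_cache != s)
				continue;
			tmp = objs[i + grouped];
			objs[i + grouped] = objs[j];
			objs[j] = tmp;
			grouped++;
		}
		/* One bulk-free transaction per destination slab. */
		kmem_cache_free_bulk(s, grouped, &objs[i]);
	}
}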

> But for the patch below, NAK.
> 
>   Thanx, Paul
> 
> > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > index 59c471de342a..5ac4ed077233 100644
> > --- a/kernel/rcu/rcu.h
> > +++ b/kernel/rcu/rcu.h
> > @@ -174,20 +174,19 @@ static inline void debug_rcu_head_unqueue(struct 
> > rcu_head *head)
> >  }
> >  #endif /* #else !CONFIG_DEBUG_OBJECTS_RCU_HEAD */
> > 
> > -void kfree(const void *);
> > -
> >  /*
> >   * Reclaim the specified callback, either by invoking it (non-lazy case)
> >   * or freeing it directly (lazy case).  Return true if lazy, false 
> > otherwise.
> >   */
> > -static inline bool __rcu_reclaim(const char *rn, struct rcu_head *head)
> > +static inline bool __rcu_reclaim(const char *rn, struct rcu_head *head, 
> > void **kfree,
> > +   unsigned int *idx)
> >  {
> > unsigned long offset = (unsigned long)head->func;
> > 
> > rcu_lock_acquire(&rcu_callback_map);
> > if (__is_kfree_rcu_offset(offset)) {
> > RCU_TRACE(trace_rcu_invoke_kfree_callback(rn, head, offset);)
> > -   kfree((void *)head - offset);
> > +   kfree[*idx++] = (void *)head - offset;
> > rcu_lock_release(&rcu_callback_map);
> > return true;
> > } else {
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index f9c0ca2ccf0c..7e13979b4697 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -2725,6 +2725,8 @@ static void rcu_do_batch(struct rcu_state *rsp, 
> > struct rcu_data *rdp)
> > struct rcu_head *rhp;
> > struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl);
> > long bl, count;
> > +   void *to_free[16];
> > +   unsigned int to_free_idx = 0;
> > 
> > /* If no callbacks are ready, just return. */
> > if (!rcu_segcblist_ready_cbs(&rdp->cblist)) {
> > @@ -2755,8 +2757,10 @@ static void rcu_do_batch(struct rcu_state *rsp, 
> > struct rcu_data *rdp)
> > rhp = rcu_cblist_dequeue(&rcl);
> > for (; rhp; rhp = rcu_cblist_dequeue(&rcl)) {
> > debug_rcu_head_unqueue(rhp);
> > -   if (__rcu_reclaim(rsp->name, rhp))
> > +   if (__rcu_reclaim(rsp->name, rhp, to_free, &to_free_idx))
> > rcu_cblist_dequeued_lazy();
> > +   if (to_free_idx == 16)
> > +   kfree_bulk(16, to_free);
> > /*
> >  * Stop only if limit reached and CPU has something to do.
> >

Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Paul E. McKenney
On Tue, Dec 19, 2017 at 05:53:36PM -0800, Matthew Wilcox wrote:
> On Tue, Dec 19, 2017 at 04:20:51PM -0800, Paul E. McKenney wrote:
> > If we are going to make this sort of change, we should do so in a way
> > that allows the slab code to actually do the optimizations that might
> > make this sort of thing worthwhile.  After all, if the main goal was small
> > code size, the best approach is to drop kfree_bulk() and get on with life
> > in the usual fashion.
> > 
> > I would prefer to believe that something like kfree_bulk() can help,
> > and if that is the case, we should give it a chance to do things like
> > group kfree_rcu() requests by destination slab and so forth, allowing
> > batching optimizations that might provide more significant increases
> > in performance.  Furthermore, having this in slab opens the door to
> > slab taking emergency action when memory is low.
> 
> kfree_bulk does sort by destination slab; look at build_detached_freelist.

Understood, but beside the point.  I suspect that giving it larger
scope makes it more efficient, similar to disk drives in the old days.
Grouping on the stack when processing RCU callbacks limits what can
reasonably be done.  Furthermore, using the vector approach going into the
grace period is much more cache-efficient than the linked-list approach,
given that the blocks have a reasonable chance of going cache-cold during
the grace period.
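
As a toy, non-kernel illustration of that data-layout point: with a
vector, the objects to free are reached by independent sequential array
reads, while walking rcu_head-style links makes every step a dependent
pointer load that is likely to be cache-cold after the grace period.
The struct and function names below are made up for the example:

struct node {
	struct node *next;	/* stands in for rcu_head.next */
	char payload[56];
};

static void free_from_vector(void **v, unsigned int n,
			     void (*free_fn)(void *))
{
	unsigned int i;

	/* Independent, prefetch-friendly loads. */
	for (i = 0; i < n; i++)
		free_fn(v[i]);
}

static void free_from_list(struct node *head, void (*free_fn)(void *))
{
	/* Each iteration waits on the previous pointer load. */
	while (head) {
		struct node *next = head->next;

		free_fn(head);
		head = next;
	}
}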

And the slab-related operations should really be in the slab code in any
case rather than within RCU.

Thanx, Paul



Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Matthew Wilcox
On Tue, Dec 19, 2017 at 04:20:51PM -0800, Paul E. McKenney wrote:
> If we are going to make this sort of change, we should do so in a way
> that allows the slab code to actually do the optimizations that might
> make this sort of thing worthwhile.  After all, if the main goal was small
> code size, the best approach is to drop kfree_bulk() and get on with life
> in the usual fashion.
> 
> I would prefer to believe that something like kfree_bulk() can help,
> and if that is the case, we should give it a chance to do things like
> group kfree_rcu() requests by destination slab and so forth, allowing
> batching optimizations that might provide more significant increases
> in performance.  Furthermore, having this in slab opens the door to
> slab taking emergency action when memory is low.

kfree_bulk does sort by destination slab; look at build_detached_freelist.
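
For readers who have not seen the interface: kfree_bulk(size, p) frees
an array of kmalloc'ed pointers in one call and is a thin wrapper
around kmem_cache_free_bulk().  A trivial usage sketch, with made-up
names and sizes:

#include <linux/slab.h>

static void demo_bulk_free(void)
{
	void *objs[8];
	size_t i;

	for (i = 0; i < 8; i++) {
		objs[i] = kmalloc(64, GFP_KERNEL);
		if (!objs[i])
			break;	/* only free what was actually allocated */
	}

	if (i)
		/* One call frees the whole array; the grouping by
		 * destination slab happens inside the slab code. */
		kfree_bulk(i, objs);
}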


Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Christopher Lameter
On Tue, 19 Dec 2017, Rao Shoaib wrote:

> > > mm/slab_common.c
> > It would be great to have separate patches so that we can review it
> > properly:
> >
> > 1. Move the code into slab_common.c
> > 2. The actual code changes to the kfree rcu mechanism
> > 3. The whitespace changes

> I can certainly break down the patch and submit smaller patches as you have
> suggested.
>
> BTW -- This is my first ever patch to Linux, so I am still learning the
> etiquette.

You are doing great. Keep at improving the patches and we will get your
changes into the kernel source. If you want to send your first patchset
then a tool like "quilt" or "git" might be helpful.





Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Paul E. McKenney
On Tue, Dec 19, 2017 at 02:12:06PM -0800, Matthew Wilcox wrote:
> On Tue, Dec 19, 2017 at 09:41:58PM +0100, Jesper Dangaard Brouer wrote:
> > If I had to implement this: I would choose to do the optimization in
> > __rcu_process_callbacks() create small on-call-stack ptr-array for
> > kfree_bulk().  I would only optimize the case that call kfree()
> > directly.  In the while(list) loop I would defer calling
> > __rcu_reclaim() for __is_kfree_rcu_offset(head->func), and instead add
> > them to the ptr-array (and flush if the array is full in loop, and
> > kfree_bulk flush after loop).
> > 
> > The real advantage of kfree_bulk() comes from amortizing the per kfree
> > (behind-the-scenes) sync cost.  There is an additional benefit, because
> > objects comes from RCU and will hit a slower path in SLUB.   The SLUB
> > allocator is very fast for objects that gets recycled quickly (short
> > lifetime), non-locked (cpu-local) double-cmpxchg.  But slower for
> > longer-lived/more-outstanding objects, as this hits a slower code-path,
> > fully locked (cross-cpu) double-cmpxchg.  
> 
> Something like this ...  (compile tested only)
> 
> Considerably less code; Rao, what do you think?

I am sorry, but I am not at all a fan of this approach.

If we are going to make this sort of change, we should do so in a way
that allows the slab code to actually do the optimizations that might
make this sort of thing worthwhile.  After all, if the main goal was small
code size, the best approach is to drop kfree_bulk() and get on with life
in the usual fashion.

I would prefer to believe that something like kfree_bulk() can help,
and if that is the case, we should give it a chance to do things like
group kfree_rcu() requests by destination slab and so forth, allowing
batching optimizations that might provide more significant increases
in performance.  Furthermore, having this in slab opens the door to
slab taking emergency action when memory is low.

But for the patch below, NAK.

Thanx, Paul

> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> index 59c471de342a..5ac4ed077233 100644
> --- a/kernel/rcu/rcu.h
> +++ b/kernel/rcu/rcu.h
> @@ -174,20 +174,19 @@ static inline void debug_rcu_head_unqueue(struct 
> rcu_head *head)
>  }
>  #endif   /* #else !CONFIG_DEBUG_OBJECTS_RCU_HEAD */
> 
> -void kfree(const void *);
> -
>  /*
>   * Reclaim the specified callback, either by invoking it (non-lazy case)
>   * or freeing it directly (lazy case).  Return true if lazy, false otherwise.
>   */
> -static inline bool __rcu_reclaim(const char *rn, struct rcu_head *head)
> +static inline bool __rcu_reclaim(const char *rn, struct rcu_head *head, void 
> **kfree,
> + unsigned int *idx)
>  {
>   unsigned long offset = (unsigned long)head->func;
> 
>   rcu_lock_acquire(&rcu_callback_map);
>   if (__is_kfree_rcu_offset(offset)) {
>   RCU_TRACE(trace_rcu_invoke_kfree_callback(rn, head, offset);)
> - kfree((void *)head - offset);
> + kfree[*idx++] = (void *)head - offset;
>   rcu_lock_release(&rcu_callback_map);
>   return true;
>   } else {
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index f9c0ca2ccf0c..7e13979b4697 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2725,6 +2725,8 @@ static void rcu_do_batch(struct rcu_state *rsp, struct 
> rcu_data *rdp)
>   struct rcu_head *rhp;
>   struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl);
>   long bl, count;
> + void *to_free[16];
> + unsigned int to_free_idx = 0;
> 
>   /* If no callbacks are ready, just return. */
>   if (!rcu_segcblist_ready_cbs(&rdp->cblist)) {
> @@ -2755,8 +2757,10 @@ static void rcu_do_batch(struct rcu_state *rsp, struct 
> rcu_data *rdp)
>   rhp = rcu_cblist_dequeue(&rcl);
>   for (; rhp; rhp = rcu_cblist_dequeue(&rcl)) {
>   debug_rcu_head_unqueue(rhp);
> - if (__rcu_reclaim(rsp->name, rhp))
> + if (__rcu_reclaim(rsp->name, rhp, to_free, &to_free_idx))
>   rcu_cblist_dequeued_lazy();
> + if (to_free_idx == 16)
> + kfree_bulk(16, to_free);
>   /*
>* Stop only if limit reached and CPU has something to do.
>* Note: The rcl structure counts down from zero.
> @@ -2766,6 +2770,8 @@ static void rcu_do_batch(struct rcu_state *rsp, struct 
> rcu_data *rdp)
>(!is_idle_task(current) && !rcu_is_callbacks_kthread())))
>   break;
>   }
> + if (to_free_idx)
> + kfree_bulk(to_free_idx, to_free);
> 
>   local_irq_save(flags);
>   count = -rcl.len;
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index db85ca3975f1..4127be06759b 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -2189,6 +2189,8 @@ static int rcu_nocb_kthread(void *arg)
>   struct rcu_head 

Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Rao Shoaib



On 12/19/2017 02:12 PM, Matthew Wilcox wrote:

On Tue, Dec 19, 2017 at 09:41:58PM +0100, Jesper Dangaard Brouer wrote:

If I had to implement this: I would choose to do the optimization in
__rcu_process_callbacks() create small on-call-stack ptr-array for
kfree_bulk().  I would only optimize the case that call kfree()
directly.  In the while(list) loop I would defer calling
__rcu_reclaim() for __is_kfree_rcu_offset(head->func), and instead add
them to the ptr-array (and flush if the array is full in loop, and
kfree_bulk flush after loop).

The real advantage of kfree_bulk() comes from amortizing the per kfree
(behind-the-scenes) sync cost.  There is an additional benefit, because
objects comes from RCU and will hit a slower path in SLUB.   The SLUB
allocator is very fast for objects that gets recycled quickly (short
lifetime), non-locked (cpu-local) double-cmpxchg.  But slower for
longer-lived/more-outstanding objects, as this hits a slower code-path,
fully locked (cross-cpu) double-cmpxchg.

Something like this ...  (compile tested only)

Considerably less code; Rao, what do you think?

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 59c471de342a..5ac4ed077233 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -174,20 +174,19 @@ static inline void debug_rcu_head_unqueue(struct rcu_head 
*head)
  }
  #endif/* #else !CONFIG_DEBUG_OBJECTS_RCU_HEAD */
  
-void kfree(const void *);

-
  /*
   * Reclaim the specified callback, either by invoking it (non-lazy case)
   * or freeing it directly (lazy case).  Return true if lazy, false otherwise.
   */
-static inline bool __rcu_reclaim(const char *rn, struct rcu_head *head)
+static inline bool __rcu_reclaim(const char *rn, struct rcu_head *head, void 
**kfree,
+   unsigned int *idx)
  {
unsigned long offset = (unsigned long)head->func;
  
  	rcu_lock_acquire(&rcu_callback_map);

if (__is_kfree_rcu_offset(offset)) {
RCU_TRACE(trace_rcu_invoke_kfree_callback(rn, head, offset);)
-   kfree((void *)head - offset);
+   kfree[*idx++] = (void *)head - offset;
rcu_lock_release(&rcu_callback_map);
return true;
} else {
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f9c0ca2ccf0c..7e13979b4697 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2725,6 +2725,8 @@ static void rcu_do_batch(struct rcu_state *rsp, struct 
rcu_data *rdp)
struct rcu_head *rhp;
struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl);
long bl, count;
+   void *to_free[16];
+   unsigned int to_free_idx = 0;
  
  	/* If no callbacks are ready, just return. */

if (!rcu_segcblist_ready_cbs(&rdp->cblist)) {
@@ -2755,8 +2757,10 @@ static void rcu_do_batch(struct rcu_state *rsp, struct 
rcu_data *rdp)
rhp = rcu_cblist_dequeue(&rcl);
for (; rhp; rhp = rcu_cblist_dequeue(&rcl)) {
debug_rcu_head_unqueue(rhp);
-   if (__rcu_reclaim(rsp->name, rhp))
+   if (__rcu_reclaim(rsp->name, rhp, to_free, &to_free_idx))
rcu_cblist_dequeued_lazy();
+   if (to_free_idx == 16)
+   kfree_bulk(16, to_free);
/*
 * Stop only if limit reached and CPU has something to do.
 * Note: The rcl structure counts down from zero.
@@ -2766,6 +2770,8 @@ static void rcu_do_batch(struct rcu_state *rsp, struct 
rcu_data *rdp)
 (!is_idle_task(current) && !rcu_is_callbacks_kthread())))
break;
}
+   if (to_free_idx)
+   kfree_bulk(to_free_idx, to_free);
  
  	local_irq_save(flags);

count = -rcl.len;
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index db85ca3975f1..4127be06759b 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2189,6 +2189,8 @@ static int rcu_nocb_kthread(void *arg)
struct rcu_head *next;
struct rcu_head **tail;
struct rcu_data *rdp = arg;
+   void *to_free[16];
+   unsigned int to_free_idx = 0;
  
  	/* Each pass through this loop invokes one batch of callbacks */

for (;;) {
@@ -2226,13 +2228,18 @@ static int rcu_nocb_kthread(void *arg)
}
debug_rcu_head_unqueue(list);
local_bh_disable();
-   if (__rcu_reclaim(rdp->rsp->name, list))
+   if (__rcu_reclaim(rdp->rsp->name, list, to_free,
+   &to_free_idx))
cl++;
c++;
+   if (to_free_idx == 16)
+   kfree_bulk(16, to_free);
local_bh_enable();
cond_resched_rcu_qs();
list = next;
}
+   if (to_free_idx)
+   

Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Matthew Wilcox
On Tue, Dec 19, 2017 at 09:41:58PM +0100, Jesper Dangaard Brouer wrote:
> If I had to implement this: I would choose to do the optimization in
> __rcu_process_callbacks() create small on-call-stack ptr-array for
> kfree_bulk().  I would only optimize the case that call kfree()
> directly.  In the while(list) loop I would defer calling
> __rcu_reclaim() for __is_kfree_rcu_offset(head->func), and instead add
> them to the ptr-array (and flush if the array is full in loop, and
> kfree_bulk flush after loop).
> 
> The real advantage of kfree_bulk() comes from amortizing the per kfree
> (behind-the-scenes) sync cost.  There is an additional benefit, because
> objects comes from RCU and will hit a slower path in SLUB.   The SLUB
> allocator is very fast for objects that gets recycled quickly (short
> lifetime), non-locked (cpu-local) double-cmpxchg.  But slower for
> longer-lived/more-outstanding objects, as this hits a slower code-path,
> fully locked (cross-cpu) double-cmpxchg.  

Something like this ...  (compile tested only)

Considerably less code; Rao, what do you think?

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 59c471de342a..5ac4ed077233 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -174,20 +174,19 @@ static inline void debug_rcu_head_unqueue(struct rcu_head 
*head)
 }
 #endif /* #else !CONFIG_DEBUG_OBJECTS_RCU_HEAD */
 
-void kfree(const void *);
-
 /*
  * Reclaim the specified callback, either by invoking it (non-lazy case)
  * or freeing it directly (lazy case).  Return true if lazy, false otherwise.
  */
-static inline bool __rcu_reclaim(const char *rn, struct rcu_head *head)
+static inline bool __rcu_reclaim(const char *rn, struct rcu_head *head, void 
**kfree,
+   unsigned int *idx)
 {
unsigned long offset = (unsigned long)head->func;
 
rcu_lock_acquire(&rcu_callback_map);
if (__is_kfree_rcu_offset(offset)) {
RCU_TRACE(trace_rcu_invoke_kfree_callback(rn, head, offset);)
-   kfree((void *)head - offset);
+   kfree[*idx++] = (void *)head - offset;
rcu_lock_release(&rcu_callback_map);
return true;
} else {
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f9c0ca2ccf0c..7e13979b4697 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2725,6 +2725,8 @@ static void rcu_do_batch(struct rcu_state *rsp, struct 
rcu_data *rdp)
struct rcu_head *rhp;
struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl);
long bl, count;
+   void *to_free[16];
+   unsigned int to_free_idx = 0;
 
/* If no callbacks are ready, just return. */
if (!rcu_segcblist_ready_cbs(&rdp->cblist)) {
@@ -2755,8 +2757,10 @@ static void rcu_do_batch(struct rcu_state *rsp, struct 
rcu_data *rdp)
rhp = rcu_cblist_dequeue(&rcl);
for (; rhp; rhp = rcu_cblist_dequeue(&rcl)) {
debug_rcu_head_unqueue(rhp);
-   if (__rcu_reclaim(rsp->name, rhp))
+   if (__rcu_reclaim(rsp->name, rhp, to_free, &to_free_idx))
rcu_cblist_dequeued_lazy();
+   if (to_free_idx == 16)
+   kfree_bulk(16, to_free);
/*
 * Stop only if limit reached and CPU has something to do.
 * Note: The rcl structure counts down from zero.
@@ -2766,6 +2770,8 @@ static void rcu_do_batch(struct rcu_state *rsp, struct 
rcu_data *rdp)
 (!is_idle_task(current) && !rcu_is_callbacks_kthread())))
break;
}
+   if (to_free_idx)
+   kfree_bulk(to_free_idx, to_free);
 
local_irq_save(flags);
count = -rcl.len;
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index db85ca3975f1..4127be06759b 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2189,6 +2189,8 @@ static int rcu_nocb_kthread(void *arg)
struct rcu_head *next;
struct rcu_head **tail;
struct rcu_data *rdp = arg;
+   void *to_free[16];
+   unsigned int to_free_idx = 0;
 
/* Each pass through this loop invokes one batch of callbacks */
for (;;) {
@@ -2226,13 +2228,18 @@ static int rcu_nocb_kthread(void *arg)
}
debug_rcu_head_unqueue(list);
local_bh_disable();
-   if (__rcu_reclaim(rdp->rsp->name, list))
+   if (__rcu_reclaim(rdp->rsp->name, list, to_free,
+   &to_free_idx))
cl++;
c++;
+   if (to_free_idx == 16)
+   kfree_bulk(16, to_free);
local_bh_enable();
cond_resched_rcu_qs();
list = next;
}
+   if (to_free_idx)
+   kfree_bulk(to_free_idx, to_free);

Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Rao Shoaib



On 12/19/2017 12:41 PM, Jesper Dangaard Brouer wrote:

On Tue, 19 Dec 2017 09:52:27 -0800 rao.sho...@oracle.com wrote:


+/* Main RCU function that is called to free RCU structures */
+static void
+__rcu_bulk_free(struct rcu_head *head, rcu_callback_t func, int cpu, bool lazy)
+{
+   unsigned long offset;
+   void *ptr;
+   struct rcu_bulk_free *rbf;
+   struct rcu_bulk_free_container *rbfc = NULL;
+
+   rbf = this_cpu_ptr(_rbf);
+
+   if (unlikely(!rbf->rbf_init)) {
+   spin_lock_init(&rbf->rbf_lock);
+   rbf->rbf_cpu = smp_processor_id();
+   rbf->rbf_init = true;
+   }
+
+   /* hold lock to protect against other cpu's */
+   spin_lock_bh(&rbf->rbf_lock);

I'm not sure this will be faster.  Having to take a cross CPU lock here
(+ BH-disable) could cause scaling issues.   Hopefully this lock will
not be used intensively by other CPUs, right?


The current cost of __call_rcu() is a local_irq_save/restore (which is
quite expensive, but doesn't cause cross CPU chatter).

Later in __rcu_process_callbacks() we have a local_irq_save/restore for
the entire list, plus a per object cost doing local_bh_disable/enable.
And for each object we call __rcu_reclaim(), which in some cases
directly call kfree().


As Paul has pointed out, the lock is a per-cpu lock; the only reason for
another CPU to access this lock is if the RCU callbacks run on a
different CPU, and there is nothing the code can do to avoid that, but
that should be rare anyway.





If I had to implement this: I would choose to do the optimization in
__rcu_process_callbacks() create small on-call-stack ptr-array for
kfree_bulk().  I would only optimize the case that call kfree()
directly.  In the while(list) loop I would defer calling
__rcu_reclaim() for __is_kfree_rcu_offset(head->func), and instead add
them to the ptr-array (and flush if the array is full in loop, and
kfree_bulk flush after loop).
This is exactly what the current code is doing. It accumulates only the 
calls made to

__kfree_rcu(head, offset) ==> kfree_call_rcu() ==> __bulk_free_rcu

__kfree_rcu has a check to make sure that an offset is being passed.
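
For reference, this is roughly how the existing macros encode and
recognize that offset (abridged from include/linux/rcupdate.h of that
era; the exact text may differ slightly):

#define __is_kfree_rcu_offset(offset)	((offset) < 4096)

#define __kfree_rcu(head, offset) \
	do { \
		BUILD_BUG_ON(!__is_kfree_rcu_offset(offset)); \
		kfree_call_rcu(head, (rcu_callback_t)(unsigned long)(offset)); \
	} while (0)

/*
 * kfree_rcu(ptr, field) passes the offset of 'field' inside *ptr, so
 * any "callback" value below 4096 is treated as an offset rather than
 * as a real function pointer.
 */
#define kfree_rcu(ptr, rcu_head) \
	__kfree_rcu(&((ptr)->rcu_head), offsetof(typeof(*(ptr)), rcu_head))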

When a function pointer is passed the caller has to call 
call_rcu/call_rcu_sched


Accumulating early avoids the individual cost of calling __call_rcu

Perhaps I do not understand your point.

Shoaib


The real advantage of kfree_bulk() comes from amortizing the per kfree
(behind-the-scenes) sync cost.  There is an additional benefit, because
objects comes from RCU and will hit a slower path in SLUB.   The SLUB
allocator is very fast for objects that gets recycled quickly (short
lifetime), non-locked (cpu-local) double-cmpxchg.  But slower for
longer-lived/more-outstanding objects, as this hits a slower code-path,
fully locked (cross-cpu) double-cmpxchg.


+
+   rbfc = rbf->rbf_container;
+
+   if (rbfc == NULL) {
+   if (rbf->rbf_cached_container == NULL) {
+   rbf->rbf_container =
+   kmalloc(sizeof(struct rcu_bulk_free_container),
+   GFP_ATOMIC);
+   rbf->rbf_container->rbfc_rbf = rbf;
+   } else {
+   rbf->rbf_container = rbf->rbf_cached_container;
+   rbf->rbf_container->rbfc_rbf = rbf;
+   cmpxchg(&rbf->rbf_cached_container,
+   rbf->rbf_cached_container, NULL);
+   }
+
+   if (unlikely(rbf->rbf_container == NULL)) {
+
+   /* Memory allocation failed maintain a list */
+
+   head->func = (void *)func;
+   head->next = rbf->rbf_list_head;
+   rbf->rbf_list_head = head;
+   rbf->rbf_list_size++;
+   if (rbf->rbf_list_size == RCU_MAX_ACCUMULATE_SIZE)
+   __rcu_bulk_schedule_list(rbf);
+
+   goto done;
+   }
+
+   rbfc = rbf->rbf_container;
+   rbfc->rbfc_entries = 0;
+
+   if (rbf->rbf_list_head != NULL)
+   __rcu_bulk_schedule_list(rbf);
+   }
+
+   offset = (unsigned long)func;
+   ptr = (void *)head - offset;
+
+   rbfc->rbfc_data[rbfc->rbfc_entries++] = ptr;
+   if (rbfc->rbfc_entries == RCU_MAX_ACCUMULATE_SIZE) {
+
+   WRITE_ONCE(rbf->rbf_container, NULL);
+   spin_unlock_bh(&rbf->rbf_lock);
+   call_rcu(&rbfc->rbfc_rcu, __rcu_bulk_free_impl);
+   return;
+   }
+
+done:
+   if (!rbf->rbf_monitor) {
+
+   call_rcu(&rbf->rbf_rcu, __rcu_bulk_free_monitor);
+   rbf->rbf_monitor = true;
+   }
+
+   spin_unlock_bh(&rbf->rbf_lock);
+}






Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Paul E. McKenney
On Tue, Dec 19, 2017 at 09:41:58PM +0100, Jesper Dangaard Brouer wrote:
> 
> On Tue, 19 Dec 2017 09:52:27 -0800 rao.sho...@oracle.com wrote:
> 
> > +/* Main RCU function that is called to free RCU structures */
> > +static void
> > +__rcu_bulk_free(struct rcu_head *head, rcu_callback_t func, int cpu, bool 
> > lazy)
> > +{
> > +   unsigned long offset;
> > +   void *ptr;
> > +   struct rcu_bulk_free *rbf;
> > +   struct rcu_bulk_free_container *rbfc = NULL;
> > +
> > +   rbf = this_cpu_ptr(_rbf);
> > +
> > +   if (unlikely(!rbf->rbf_init)) {
> > +   spin_lock_init(&rbf->rbf_lock);
> > +   rbf->rbf_cpu = smp_processor_id();
> > +   rbf->rbf_init = true;
> > +   }
> > +
> > +   /* hold lock to protect against other cpu's */
> > +   spin_lock_bh(&rbf->rbf_lock);
> 
> I'm not sure this will be faster.  Having to take a cross CPU lock here
> (+ BH-disable) could cause scaling issues.   Hopefully this lock will
> not be used intensively by other CPUs, right?
> 
> 
> The current cost of __call_rcu() is a local_irq_save/restore (which is
> quite expensive, but doesn't cause cross CPU chatter).
> 
> Later in __rcu_process_callbacks() we have a local_irq_save/restore for
> the entire list, plus a per object cost doing local_bh_disable/enable.
> And for each object we call __rcu_reclaim(), which in some cases
> directly call kfree().

Isn't this lock in a per-CPU object?  It -might- go cross-CPU in response
to CPU-hotplug operations, but that should be rare.

Thanx, Paul

> If I had to implement this: I would choose to do the optimization in
> __rcu_process_callbacks() create small on-call-stack ptr-array for
> kfree_bulk().  I would only optimize the case that call kfree()
> directly.  In the while(list) loop I would defer calling
> __rcu_reclaim() for __is_kfree_rcu_offset(head->func), and instead add
> them to the ptr-array (and flush if the array is full in loop, and
> kfree_bulk flush after loop).
> 
> The real advantage of kfree_bulk() comes from amortizing the per kfree
> (behind-the-scenes) sync cost.  There is an additional benefit, because
> objects comes from RCU and will hit a slower path in SLUB.   The SLUB
> allocator is very fast for objects that gets recycled quickly (short
> lifetime), non-locked (cpu-local) double-cmpxchg.  But slower for
> longer-lived/more-outstanding objects, as this hits a slower code-path,
> fully locked (cross-cpu) double-cmpxchg.  
> 
> > +
> > +   rbfc = rbf->rbf_container;
> > +
> > +   if (rbfc == NULL) {
> > +   if (rbf->rbf_cached_container == NULL) {
> > +   rbf->rbf_container =
> > +   kmalloc(sizeof(struct rcu_bulk_free_container),
> > +   GFP_ATOMIC);
> > +   rbf->rbf_container->rbfc_rbf = rbf;
> > +   } else {
> > +   rbf->rbf_container = rbf->rbf_cached_container;
> > +   rbf->rbf_container->rbfc_rbf = rbf;
> > +   cmpxchg(&rbf->rbf_cached_container,
> > +   rbf->rbf_cached_container, NULL);
> > +   }
> > +
> > +   if (unlikely(rbf->rbf_container == NULL)) {
> > +
> > +   /* Memory allocation failed maintain a list */
> > +
> > +   head->func = (void *)func;
> > +   head->next = rbf->rbf_list_head;
> > +   rbf->rbf_list_head = head;
> > +   rbf->rbf_list_size++;
> > +   if (rbf->rbf_list_size == RCU_MAX_ACCUMULATE_SIZE)
> > +   __rcu_bulk_schedule_list(rbf);
> > +
> > +   goto done;
> > +   }
> > +
> > +   rbfc = rbf->rbf_container;
> > +   rbfc->rbfc_entries = 0;
> > +
> > +   if (rbf->rbf_list_head != NULL)
> > +   __rcu_bulk_schedule_list(rbf);
> > +   }
> > +
> > +   offset = (unsigned long)func;
> > +   ptr = (void *)head - offset;
> > +
> > +   rbfc->rbfc_data[rbfc->rbfc_entries++] = ptr;
> > +   if (rbfc->rbfc_entries == RCU_MAX_ACCUMULATE_SIZE) {
> > +
> > +   WRITE_ONCE(rbf->rbf_container, NULL);
> > +   spin_unlock_bh(&rbf->rbf_lock);
> > +   call_rcu(&rbfc->rbfc_rcu, __rcu_bulk_free_impl);
> > +   return;
> > +   }
> > +
> > +done:
> > +   if (!rbf->rbf_monitor) {
> > +
> > +   call_rcu(&rbf->rbf_rcu, __rcu_bulk_free_monitor);
> > +   rbf->rbf_monitor = true;
> > +   }
> > +
> > +   spin_unlock_bh(&rbf->rbf_lock);
> > +}
> 
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 






Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Jesper Dangaard Brouer

On Tue, 19 Dec 2017 09:52:27 -0800 rao.sho...@oracle.com wrote:

> +/* Main RCU function that is called to free RCU structures */
> +static void
> +__rcu_bulk_free(struct rcu_head *head, rcu_callback_t func, int cpu, bool 
> lazy)
> +{
> + unsigned long offset;
> + void *ptr;
> + struct rcu_bulk_free *rbf;
> + struct rcu_bulk_free_container *rbfc = NULL;
> +
> + rbf = this_cpu_ptr(_rbf);
> +
> + if (unlikely(!rbf->rbf_init)) {
> + spin_lock_init(&rbf->rbf_lock);
> + rbf->rbf_cpu = smp_processor_id();
> + rbf->rbf_init = true;
> + }
> +
> + /* hold lock to protect against other cpu's */
> + spin_lock_bh(&rbf->rbf_lock);

I'm not sure this will be faster.  Having to take a cross CPU lock here
(+ BH-disable) could cause scaling issues.   Hopefully this lock will
not be used intensively by other CPUs, right?


The current cost of __call_rcu() is a local_irq_save/restore (which is
quite expensive, but doesn't cause cross CPU chatter).

Later in __rcu_process_callbacks() we have a local_irq_save/restore for
the entire list, plus a per object cost doing local_bh_disable/enable.
And for each object we call __rcu_reclaim(), which in some cases
directly call kfree().


If I had to implement this: I would choose to do the optimization in
__rcu_process_callbacks() create small on-call-stack ptr-array for
kfree_bulk().  I would only optimize the case that call kfree()
directly.  In the while(list) loop I would defer calling
__rcu_reclaim() for __is_kfree_rcu_offset(head->func), and instead add
them to the ptr-array (and flush if the array is full in loop, and
kfree_bulk flush after loop).
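
A minimal sketch of that approach (hypothetical code, not the posted patch;
it assumes the existing __is_kfree_rcu_offset() helper and the kfree_bulk()
API, and stands in for the real callback-invocation loop):

    /* Hypothetical sketch: collect kfree-offset callbacks from the done
     * list into a small on-stack array and flush them via kfree_bulk().
     */
    #define KFREE_BULK_BATCH 16

    static void process_callbacks_sketch(struct rcu_head *list)
    {
        void *ptrs[KFREE_BULK_BATCH];
        size_t nr = 0;

        while (list) {
            struct rcu_head *head = list;
            unsigned long offset = (unsigned long)head->func;

            list = list->next;
            if (__is_kfree_rcu_offset(offset)) {
                /* Defer the kfree(): remember the object pointer instead. */
                ptrs[nr++] = (void *)head - offset;
                if (nr == KFREE_BULK_BATCH) {
                    kfree_bulk(nr, ptrs);   /* flush when the array fills */
                    nr = 0;
                }
            } else {
                head->func(head);           /* ordinary callback path */
            }
        }
        if (nr)
            kfree_bulk(nr, ptrs);           /* flush the remainder */
    }

Everything freed this way has already waited for the grace period that
produced the list, so the batching itself does not change RCU semantics.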

The real advantage of kfree_bulk() comes from amortizing the per kfree
(behind-the-scenes) sync cost.  There is an additional benefit, because
objects comes from RCU and will hit a slower path in SLUB.   The SLUB
allocator is very fast for objects that gets recycled quickly (short
lifetime), non-locked (cpu-local) double-cmpxchg.  But slower for
longer-lived/more-outstanding objects, as this hits a slower code-path,
fully locked (cross-cpu) double-cmpxchg.  

> +
> + rbfc = rbf->rbf_container;
> +
> + if (rbfc == NULL) {
> + if (rbf->rbf_cached_container == NULL) {
> + rbf->rbf_container =
> + kmalloc(sizeof(struct rcu_bulk_free_container),
> + GFP_ATOMIC);
> + rbf->rbf_container->rbfc_rbf = rbf;
> + } else {
> + rbf->rbf_container = rbf->rbf_cached_container;
> + rbf->rbf_container->rbfc_rbf = rbf;
> + cmpxchg(&rbf->rbf_cached_container,
> + rbf->rbf_cached_container, NULL);
> + }
> +
> + if (unlikely(rbf->rbf_container == NULL)) {
> +
> + /* Memory allocation failed maintain a list */
> +
> + head->func = (void *)func;
> + head->next = rbf->rbf_list_head;
> + rbf->rbf_list_head = head;
> + rbf->rbf_list_size++;
> + if (rbf->rbf_list_size == RCU_MAX_ACCUMULATE_SIZE)
> + __rcu_bulk_schedule_list(rbf);
> +
> + goto done;
> + }
> +
> + rbfc = rbf->rbf_container;
> + rbfc->rbfc_entries = 0;
> +
> + if (rbf->rbf_list_head != NULL)
> + __rcu_bulk_schedule_list(rbf);
> + }
> +
> + offset = (unsigned long)func;
> + ptr = (void *)head - offset;
> +
> + rbfc->rbfc_data[rbfc->rbfc_entries++] = ptr;
> + if (rbfc->rbfc_entries == RCU_MAX_ACCUMULATE_SIZE) {
> +
> + WRITE_ONCE(rbf->rbf_container, NULL);
> + spin_unlock_bh(&rbf->rbf_lock);
> + call_rcu(&rbfc->rbfc_rcu, __rcu_bulk_free_impl);
> + return;
> + }
> +
> +done:
> + if (!rbf->rbf_monitor) {
> +
> + call_rcu(&rbf->rbf_rcu, __rcu_bulk_free_monitor);
> + rbf->rbf_monitor = true;
> + }
> +
> + spin_unlock_bh(&rbf->rbf_lock);
> +}


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer




Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Paul E. McKenney
On Tue, Dec 19, 2017 at 11:56:30AM -0800, Rao Shoaib wrote:
> On 12/19/2017 11:30 AM, Matthew Wilcox wrote:
> >On Tue, Dec 19, 2017 at 09:52:27AM -0800, rao.sho...@oracle.com wrote:

[ . . . ]

> >I've been doing a lot of thinking about this because I really want a
> >way to kfree_rcu() an object without embedding a struct rcu_head in it.
> >But I see no way to do that today; even if we have an external memory
> >allocation to point to the object to be freed, we have to keep track of
> >the grace periods.
> I am not sure I understand. If you had external memory you can
> easily do that.
> I am exactly doing that, the only reason the RCU structure is needed
> is to get the pointer to the object being freed.

This can be done as long as you are willing to either:

1.  Occasionally have kfree_rcu() wait for a grace period.

2.  Occasionally have kfree_rcu() allocate memory.

3.  Keep the rcu_head, but use it only when you would otherwise
have to accept the above two penalties.  (The point of this
is that tracking lists of memory waiting for a grace period
using dense arrays improves cache locality.)
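
As an illustration of option 3, a hypothetical data-structure shape (names
invented here; the posted patch has a broadly similar layout):

    struct kfree_rcu_batch {
        void *ptrs[64];                 /* dense array: cache-friendly to walk */
        int nr;                         /* entries currently queued */
        struct rcu_head *fallback;      /* per-object rcu_head list, used only
                                         * when no array slot can be obtained */
        struct rcu_head rcu;            /* one rcu_head per batch of objects */
    };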

There might be others, and if you come up with one, please don't keep it
a secret.  The C++ standards committee insisted on an interface using
option #2 above.  (There is also an option to use their equivalent of
an rcu_head.)

Thanx, Paul






Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Rao Shoaib



On 12/19/2017 11:33 AM, Christopher Lameter wrote:

On Tue, 19 Dec 2017, rao.sho...@oracle.com wrote:


This patch updates kfree_rcu to use new bulk memory free functions as they
are more efficient. It also moves kfree_call_rcu() out of rcu related code to
mm/slab_common.c

It would be great to have separate patches so that we can review it
properly:

1. Move the code into slab_common.c
2. The actual code changes to the kfree rcu mechanism
3. The whitespace changes

I can certainly break down the patch and submit smaller patches as you 
have suggested.


BTW -- This is my first ever patch to Linux, so I am still learning the 
etiquette.


Shoaib




Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Rao Shoaib



On 12/19/2017 11:30 AM, Matthew Wilcox wrote:

On Tue, Dec 19, 2017 at 09:52:27AM -0800, rao.sho...@oracle.com wrote:

@@ -129,6 +130,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t 
flags, size_t nr,
  
  	for (i = 0; i < nr; i++) {

void *x = p[i] = kmem_cache_alloc(s, flags);
+
if (!x) {
__kmem_cache_free_bulk(s, i, p);
return 0;

Don't mix whitespace changes with significant patches.

OK.



+/* Main RCU function that is called to free RCU structures */
+static void
+__rcu_bulk_free(struct rcu_head *head, rcu_callback_t func, int cpu, bool lazy)
+{
+   unsigned long offset;
+   void *ptr;
+   struct rcu_bulk_free *rbf;
+   struct rcu_bulk_free_container *rbfc = NULL;
+
+   rbf = this_cpu_ptr(_rbf);
+
+   if (unlikely(!rbf->rbf_init)) {
+   spin_lock_init(&rbf->rbf_lock);
+   rbf->rbf_cpu = smp_processor_id();
+   rbf->rbf_init = true;
+   }
+
+   /* hold lock to protect against other cpu's */
+   spin_lock_bh(&rbf->rbf_lock);

Are you sure we can't call kfree_rcu() from interrupt context?
I thought about it, but the interrupts are off due to acquiring the
lock. No?



+   rbfc = rbf->rbf_container;
+   rbfc->rbfc_entries = 0;
+
+   if (rbf->rbf_list_head != NULL)
+   __rcu_bulk_schedule_list(rbf);

You've broken RCU.  Consider this scenario:

Thread 1                Thread 2                Thread 3
kfree_rcu(a)
                        schedule()
                                                schedule()
                        gets pointer to b
kfree_rcu(b)
                                                processes rcu callbacks
                        uses b

Thread 3 will free a and also free b, so thread 2 is going to use freed
memory and go splat.  You can't batch up memory to be freed without
taking into account the grace periods.
The code does not change the grace period at all; in fact, it adds to
the grace period. The frees are accumulated in an array, and when a
certain limit or time is reached the accumulated frees are submitted to
RCU for freeing. So the grace period is maintained starting from the
time of the last free.


In case the memory allocation fails the code uses a list that is also 
submitted to RCU for freeing.


It might make sense for RCU to batch up all the memory it's going to free
in a single grace period, and hand it all off to slub at once, but that's
not what you've done here.
I am kind of doing that, though not per grace period but on a per-CPU
basis.



I've been doing a lot of thinking about this because I really want a
way to kfree_rcu() an object without embedding a struct rcu_head in it.
But I see no way to do that today; even if we have an external memory
allocation to point to the object to be freed, we have to keep track of
the grace periods.
I am not sure I understand. If you had external memory you can easily do 
that.
I am exactly doing that, the only reason the RCU structure is needed is 
to get the pointer to the object being freed.


Shoaib








Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Rao Shoaib



On 12/19/2017 11:12 AM, Matthew Wilcox wrote:

On Tue, Dec 19, 2017 at 09:52:27AM -0800, rao.sho...@oracle.com wrote:

This patch updates kfree_rcu to use new bulk memory free functions as they
are more efficient. It also moves kfree_call_rcu() out of rcu related code to
mm/slab_common.c

Signed-off-by: Rao Shoaib 
---
  include/linux/mm.h |   5 ++
  kernel/rcu/tree.c  |  14 
  kernel/sysctl.c|  40 +++
  mm/slab.h  |  23 +++
  mm/slab_common.c   | 198 -
  5 files changed, 264 insertions(+), 16 deletions(-)

You've added an awful lot of code.  Do you have any performance measurements
that shows this to be a win?
I did some micro benchmarking when I was developing the code and did see 
performance gains -- see attached.


I tried several networking benchmarks but was not able to get any
improvement. The reason is that these benchmarks do not exercise the
code we are improving. So I looked at the kernel source for users of
kfree_rcu(). It turns out that the directory deletion code calls
kfree_rcu() to free the data structure when an entry is deleted. Based
on that I created two benchmarks.


1) make_dirs -- This benchmark creates a multi-level directory structure
and then deletes it. It's the delete part where we see the performance
gain of about 8.3%. The creation time remains the same.


This benchmark was derived from fdtree benchmark at 
https://computing.llnl.gov/?set=code=sio_downloads ==> 
https://github.com/llnl/fdtree


2) tsock -- I also noticed that a socket has an entry in a directory and 
when the socket is closed the directory entry is deleted. So I wrote a 
simple benchmark that goes in a loop a million times and opens and 
closes 10 sockets per iteration. This shows an improvement of 7.6%
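
A hypothetical reconstruction of that loop (the real tsock source is in the
attached tarball; this sketch only shows the shape of the test):

    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        for (long i = 0; i < 1000000; i++) {
            int fds[10];

            /* open 10 sockets ... */
            for (int j = 0; j < 10; j++)
                fds[j] = socket(AF_INET, SOCK_STREAM, 0);

            /* ... and close them, deleting their directory entries */
            for (int j = 0; j < 10; j++)
                if (fds[j] >= 0)
                    close(fds[j]);
        }
        return 0;
    }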


I have attached the benchmarks and results. The "Unchanged" results are
for the stock kernel; the "Changed" results are for the modified kernel.


Shoaib


make_dirs.tar
Description: Unix tar archive


tsock.tar
Description: Unix tar archive




Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Christopher Lameter
On Tue, 19 Dec 2017, rao.sho...@oracle.com wrote:

> This patch updates kfree_rcu to use new bulk memory free functions as they
> are more efficient. It also moves kfree_call_rcu() out of rcu related code to
> mm/slab_common.c

It would be great to have separate patches so that we can review it
properly:

1. Move the code into slab_common.c
2. The actual code changes to the kfree rcu mechanism
3. The whitespace changes




Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Jesper Dangaard Brouer

On Tue, 19 Dec 2017 09:52:27 -0800 rao.sho...@oracle.com wrote:

> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index c8cb367..06fd12c 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define CREATE_TRACE_POINTS
>  #include 
> @@ -129,6 +130,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t 
> flags, size_t nr,
>  
>   for (i = 0; i < nr; i++) {
>   void *x = p[i] = kmem_cache_alloc(s, flags);
> +
>   if (!x) {
>   __kmem_cache_free_bulk(s, i, p);
>   return 0;
> @@ -353,6 +355,7 @@ unsigned long calculate_alignment(slab_flags_t flags,
>*/
>   if (flags & SLAB_HWCACHE_ALIGN) {
>   unsigned long ralign = cache_line_size();
> +
>   while (size <= ralign / 2)
>   ralign /= 2;
>   align = max(align, ralign);
> @@ -444,9 +447,8 @@ kmem_cache_create(const char *name, size_t size, size_t 
> align,
>   mutex_lock(&slab_mutex);
>  
>   err = kmem_cache_sanity_check(name, size);
> - if (err) {
> + if (err)
>   goto out_unlock;
> - }
>  
>   /* Refuse requests with allocator specific flags */
>   if (flags & ~SLAB_FLAGS_PERMITTED) {
> @@ -1131,6 +1133,7 @@ EXPORT_SYMBOL(kmalloc_order);
>  void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
>  {
>   void *ret = kmalloc_order(size, flags, order);
> +
>   trace_kmalloc(_RET_IP_, ret, size, PAGE_SIZE << order, flags);
>   return ret;
>  }

Looks like you are mixing in cleanups (which should be avoided, and
instead moved to another patch).

> @@ -1483,6 +1486,197 @@ void kzfree(const void *p)
[...]
> +
> +/* processes list of rcu structures
> + * used when conatiner can not be allocated
> + */

Spelling.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer




Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Matthew Wilcox
On Tue, Dec 19, 2017 at 09:52:27AM -0800, rao.sho...@oracle.com wrote:
> @@ -129,6 +130,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t 
> flags, size_t nr,
>  
>   for (i = 0; i < nr; i++) {
>   void *x = p[i] = kmem_cache_alloc(s, flags);
> +
>   if (!x) {
>   __kmem_cache_free_bulk(s, i, p);
>   return 0;

Don't mix whitespace changes with significant patches.

> +/* Main RCU function that is called to free RCU structures */
> +static void
> +__rcu_bulk_free(struct rcu_head *head, rcu_callback_t func, int cpu, bool 
> lazy)
> +{
> + unsigned long offset;
> + void *ptr;
> + struct rcu_bulk_free *rbf;
> + struct rcu_bulk_free_container *rbfc = NULL;
> +
> + rbf = this_cpu_ptr(_rbf);
> +
> + if (unlikely(!rbf->rbf_init)) {
> + spin_lock_init(&rbf->rbf_lock);
> + rbf->rbf_cpu = smp_processor_id();
> + rbf->rbf_init = true;
> + }
> +
> + /* hold lock to protect against other cpu's */
> + spin_lock_bh(&rbf->rbf_lock);

Are you sure we can't call kfree_rcu() from interrupt context?

> + rbfc = rbf->rbf_container;
> + rbfc->rbfc_entries = 0;
> +
> + if (rbf->rbf_list_head != NULL)
> + __rcu_bulk_schedule_list(rbf);

You've broken RCU.  Consider this scenario:

Thread 1                Thread 2                Thread 3
kfree_rcu(a)
                        schedule()
                                                schedule()
                        gets pointer to b
kfree_rcu(b)
                                                processes rcu callbacks
                        uses b

Thread 3 will free a and also free b, so thread 2 is going to use freed
memory and go splat.  You can't batch up memory to be freed without
taking into account the grace periods.

It might make sense for RCU to batch up all the memory it's going to free
in a single grace period, and hand it all off to slub at once, but that's
not what you've done here.


I've been doing a lot of thinking about this because I really want a
way to kfree_rcu() an object without embedding a struct rcu_head in it.
But I see no way to do that today; even if we have an external memory
allocation to point to the object to be freed, we have to keep track of
the grace periods.




Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Matthew Wilcox
On Tue, Dec 19, 2017 at 09:52:27AM -0800, rao.sho...@oracle.com wrote:
> This patch updates kfree_rcu to use new bulk memory free functions as they
> are more efficient. It also moves kfree_call_rcu() out of rcu related code to
> mm/slab_common.c
> 
> Signed-off-by: Rao Shoaib 
> ---
>  include/linux/mm.h |   5 ++
>  kernel/rcu/tree.c  |  14 
>  kernel/sysctl.c|  40 +++
>  mm/slab.h  |  23 +++
>  mm/slab_common.c   | 198 
> -
>  5 files changed, 264 insertions(+), 16 deletions(-)

You've added an awful lot of code.  Do you have any performance measurements
that shows this to be a win?

