Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
On 8/26/2016 9:47 AM, Dmitry Shmidt wrote:
> On Fri, Aug 26, 2016 at 5:51 AM, Tejun Heo wrote:
>> Hello, John.
>>
>> On Thu, Aug 25, 2016 at 07:14:07PM -0700, John Stultz wrote:
>>> Hey! Good news. This patch along with Peter's locking changes pushes
>>> the latencies down to an apparently acceptable level!
>>
>> Ah, that's good to hear. Please feel free to ping me if you guys
>> wanna talk about cgroup usage in android.
>
> Thanks a lot for all your help resolving this issue!

Thank you for the help.

>> Thanks!
>>
>> --
>> tejun

--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
On Fri, Aug 26, 2016 at 5:51 AM, Tejun Heo wrote:
> Hello, John.
>
> On Thu, Aug 25, 2016 at 07:14:07PM -0700, John Stultz wrote:
>> Hey! Good news. This patch along with Peter's locking changes pushes
>> the latencies down to an apparently acceptable level!
>
> Ah, that's good to hear. Please feel free to ping me if you guys
> wanna talk about cgroup usage in android.

Thanks a lot for all your help resolving this issue!

> Thanks!
>
> --
> tejun
Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
Hello, John.

On Thu, Aug 25, 2016 at 07:14:07PM -0700, John Stultz wrote:
> Hey! Good news. This patch along with Peter's locking changes pushes
> the latencies down to an apparently acceptable level!

Ah, that's good to hear. Please feel free to ping me if you guys
wanna talk about cgroup usage in android.

Thanks!

--
tejun
Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
On Wed, Aug 24, 2016 at 2:30 PM, Tejun Heo wrote:
> Hello, John.
>
> On Wed, Aug 24, 2016 at 02:16:52PM -0700, John Stultz wrote:
>> Hey Peter, Tejun, Oleg,
>> So while your tweaks for the percpu-rwsem have greatly helped the
>> regression folks were seeing (many thanks, by the way), as noted
>> above, the performance regression with the global lock compared to
>> earlier kernels is still ~3x slower (though again, much better than
>> the 80x slower that was seen earlier).
>>
>> So I was wondering if patches to go back to the per signal_struct
>> locking would still be considered? Or is the global lock approach the
>> only way forward?
>
> We can't simply revert but we can make the lock per signal_struct
> again. It's just that it'd be quite a bit more complex (but, again,
> if we need it...) and for cases where migrations aren't as frequent
> percpu-rwsem would be at least a bit lower overhead. Can you please
> test with the following patch applied just in case?
>
>   https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/commit/?h=for-4.8-fixes&id=568ac888215c7fb2fabe8ea739b00ec3c1f5d440

Hey! Good news. This patch along with Peter's locking changes pushes
the latencies down to an apparently acceptable level!

Many thanks for the pointer!

thanks
-john
Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
On Wed, Aug 24, 2016 at 2:30 PM, Tejun Heo wrote:
> On Wed, Aug 24, 2016 at 02:16:52PM -0700, John Stultz wrote:
>>
>> So I was wondering if patches to go back to the per signal_struct
>> locking would still be considered? Or is the global lock approach the
>> only way forward?
>
> We can't simply revert but we can make the lock per signal_struct
> again. It's just that it'd be quite a bit more complex (but, again,
> if we need it...) and for cases where migrations aren't as frequent
> percpu-rwsem would be at least a bit lower overhead. Can you please
> test with the following patch applied just in case?
>
>   https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/commit/?h=for-4.8-fixes&id=568ac888215c7fb2fabe8ea739b00ec3c1f5d440

That does provide a reasonable improvement in my testing! But I'll
pass it along to folks who are doing more in-depth performance
analysis to make sure.

If that doesn't work, I'll talk w/ Dmitry about submitting his patch
reworking things back to the per signal_struct locking.

>> At a higher level, I'm worried that Android's use of cgroups as a
>> priority enforcement mechanism is at odds with developers focusing on
>> it as a container enforcement mechanism, as in the latter it's not
>> common for tasks to change between cgroups, but with the former
>> priority adjustments are quite common.
>
> It has been at odds as long as android existed. cgroup used to have
> synchronous synchronize_rcu() in the migration path which android
> kernel simply deleted (didn't break android's use case), re-labeling
> every page's memcg ownership on foreground and background switches
> (does it still do that? it's simply unworkable) and so on.

Yea. Android folks can correct me, but I think the memcg
foreground/background bits have been dropped. They still use an apps
memcg group for low-memory notification in their
userland-low-memory-killer-daemon (however, my understanding is that
it has yet to sufficiently replace the in-kernel low-memory-killer).

> I find it difficult to believe that android's requirements can be
> satisfied only through moving processes around. cgroup usage model is
> affected by container use cases but isn't limited to it. If android
> is willing to change, I'd be more than happy to work with you and
> solve obstacles. If not, we'll surely try not to break it any more
> than it has always been broken.

I think folks are open to ideas, but going from idea-from-the-list to
something reliable enough to ship is always a bit fraught, so it often
takes some time and effort to really validate if those ideas are
workable.

I have on my list to re-submit the can_attach logic Android uses to
provide permission checks on processes migrating other tasks between
cgroups, so at least we can try to reduce the separate domains of
focus and get folks sharing the same code bases.

thanks
-john
Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
Hello, John.

On Wed, Aug 24, 2016 at 02:16:52PM -0700, John Stultz wrote:
> Hey Peter, Tejun, Oleg,
> So while your tweaks for the percpu-rwsem have greatly helped the
> regression folks were seeing (many thanks, by the way), as noted
> above, the performance regression with the global lock compared to
> earlier kernels is still ~3x slower (though again, much better than
> the 80x slower that was seen earlier).
>
> So I was wondering if patches to go back to the per signal_struct
> locking would still be considered? Or is the global lock approach the
> only way forward?

We can't simply revert but we can make the lock per signal_struct
again. It's just that it'd be quite a bit more complex (but, again,
if we need it...) and for cases where migrations aren't as frequent
percpu-rwsem would be at least a bit lower overhead. Can you please
test with the following patch applied just in case?

  https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/commit/?h=for-4.8-fixes&id=568ac888215c7fb2fabe8ea739b00ec3c1f5d440

> At a higher level, I'm worried that Android's use of cgroups as a
> priority enforcement mechanism is at odds with developers focusing on
> it as a container enforcement mechanism, as in the latter it's not
> common for tasks to change between cgroups, but with the former
> priority adjustments are quite common.

It has been at odds as long as android existed. cgroup used to have a
synchronous synchronize_rcu() in the migration path, which the android
kernel simply deleted (didn't break android's use case), re-labeling
every page's memcg ownership on foreground and background switches
(does it still do that? it's simply unworkable) and so on.

I find it difficult to believe that android's requirements can be
satisfied only through moving processes around. The cgroup usage model
is affected by container use cases but isn't limited to them. If
android is willing to change, I'd be more than happy to work with you
and solve obstacles. If not, we'll surely try not to break it any more
than it has always been broken.

Thanks.

--
tejun
Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
On Fri, Aug 12, 2016 at 6:44 PM, Om Dhyade wrote:
> Update from my tests:
> Use-case: Android application launches.
>
> I tested the patches on android N build, I see max latency ~7ms.
> In my tests, the wait is due to: copy_process(fork.c) blocks all threads in
> __cgroup_procs_write, including threads which are not part of the forking
> process's thread-group.
>
> Dmitry had provided a hack patch which reverts to per-process rw-sem, which
> had max latency of ~2ms.
>
> The android user-space binder library does 2 cgroup write operations per
> transaction; apart from the copy_process(fork.c) wait, I see pre-emption in
> __cgroup_procs_write causing waits.

Hey Peter, Tejun, Oleg,

So while your tweaks for the percpu-rwsem have greatly helped the
regression folks were seeing (many thanks, by the way), as noted
above, the performance regression with the global lock compared to
earlier kernels is still ~3x slower (though again, much better than
the 80x slower that was seen earlier).

So I was wondering if patches to go back to the per signal_struct
locking would still be considered? Or is the global lock approach the
only way forward?

At a higher level, I'm worried that Android's use of cgroups as a
priority enforcement mechanism is at odds with developers focusing on
it as a container enforcement mechanism, as in the latter it's not
common for tasks to change between cgroups, but with the former
priority adjustments are quite common.

thanks
-john
Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
Thank you Dmitry for sharing the patches.

Update from my tests:
Use-case: Android application launches.

I tested the patches on android N build, I see max latency ~7ms.
In my tests, the wait is due to: copy_process(fork.c) blocks all
threads in __cgroup_procs_write, including threads which are not part
of the forking process's thread-group.

Dmitry had provided a hack patch which reverts to per-process rw-sem,
which had max latency of ~2ms.

The android user-space binder library does 2 cgroup write operations
per transaction; apart from the copy_process(fork.c) wait, I see
pre-emption in __cgroup_procs_write causing waits.

Thanks.
Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
On Tue, Aug 09, 2016 at 04:47:38PM -0700, John Stultz wrote:
> On Tue, Aug 9, 2016 at 2:51 AM, Peter Zijlstra wrote:
>>
>> Currently the percpu-rwsem switches to (global) atomic ops while a
>> writer is waiting; which could be quite a while and slows down
>> releasing the readers.
>>
>> This patch cures this problem by ordering the reader-state vs
>> reader-count (see the comments in __percpu_down_read() and
>> percpu_down_write()). This changes a global atomic op into a full
>> memory barrier, which doesn't have the global cacheline contention.
>>
>> This also enables using the percpu-rwsem with rcu_sync disabled in
>> order to bias the implementation differently, reducing the writer
>> latency by adding some cost to readers.
>
> So this by itself doesn't help us much, but including the following
> from Oleg does help quite a bit:

Correct, Oleg was going to send his rcu_sync rework on top of this.
But since it's holiday season, things might be a tad delayed.
Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
On 08/10, Peter Zijlstra wrote:
>
> On Tue, Aug 09, 2016 at 04:47:38PM -0700, John Stultz wrote:
>> On Tue, Aug 9, 2016 at 2:51 AM, Peter Zijlstra wrote:
>>>
>>> Currently the percpu-rwsem switches to (global) atomic ops while a
>>> writer is waiting; which could be quite a while and slows down
>>> releasing the readers.
>>>
>>> This patch cures this problem by ordering the reader-state vs
>>> reader-count (see the comments in __percpu_down_read() and
>>> percpu_down_write()). This changes a global atomic op into a full
>>> memory barrier, which doesn't have the global cacheline contention.
>>>
>>> This also enables using the percpu-rwsem with rcu_sync disabled in
>>> order to bias the implementation differently, reducing the writer
>>> latency by adding some cost to readers.
>>
>> So this by itself doesn't help us much, but including the following
>> from Oleg does help quite a bit:
>
> Correct, Oleg was going to send his rcu_sync rework on top of this.
> But since it's holiday season, things might be a tad delayed.

Yeees. The patch is ready and even seems to work, but I still can't
(re)write the comments. Will try to finish tomorrow.

But. We need something simple and backportable, so I am going to send
the patch below first. As you can see this is your "sabotage" change,
just the new helper was renamed + s/!GP_IDLE/GP_PASSED/ fix.

And the only reason I can't send it today is that somehow I can't
write the changelog ;) So I would be really happy if you send this
change instead of me, I am going to keep "From: peterz" anyway.

Oleg.
diff --git a/include/linux/rcu_sync.h b/include/linux/rcu_sync.h
index a63a33e..ece7ed9 100644
--- a/include/linux/rcu_sync.h
+++ b/include/linux/rcu_sync.h
@@ -59,6 +59,7 @@ static inline bool rcu_sync_is_idle(struct rcu_sync *rsp)
 }
 
 extern void rcu_sync_init(struct rcu_sync *, enum rcu_sync_type);
+extern void rcu_sync_enter_start(struct rcu_sync *);
 extern void rcu_sync_enter(struct rcu_sync *);
 extern void rcu_sync_exit(struct rcu_sync *);
 extern void rcu_sync_dtor(struct rcu_sync *);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 75c0ff0..614f9cd 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5609,6 +5609,8 @@ int __init cgroup_init(void)
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));
 
+	rcu_sync_enter_start(&cgroup_threadgroup_rwsem.rss);
+
 	get_user_ns(init_cgroup_ns.user_ns);
 
 	mutex_lock(&cgroup_mutex);
diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c
index be922c9..c9b7bc8 100644
--- a/kernel/rcu/sync.c
+++ b/kernel/rcu/sync.c
@@ -83,6 +83,18 @@ void rcu_sync_init(struct rcu_sync *rsp, enum rcu_sync_type type)
 }
 
 /**
+ * Must be called after rcu_sync_init() and before first use.
+ *
+ * Ensures rcu_sync_is_idle() returns false and rcu_sync_{enter,exit}() pairs
+ * turn into NO-OPs.
+ */
+void rcu_sync_enter_start(struct rcu_sync *rsp)
+{
+	rsp->gp_count++;
+	rsp->gp_state = GP_PASSED;
+}
+
+/**
  * rcu_sync_enter() - Force readers onto slowpath
  * @rsp: Pointer to rcu_sync structure to use for synchronization
  *
Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
On Tue, Aug 9, 2016 at 2:51 AM, Peter Zijlstra wrote:
>
> Currently the percpu-rwsem switches to (global) atomic ops while a
> writer is waiting; which could be quite a while and slows down
> releasing the readers.
>
> This patch cures this problem by ordering the reader-state vs
> reader-count (see the comments in __percpu_down_read() and
> percpu_down_write()). This changes a global atomic op into a full
> memory barrier, which doesn't have the global cacheline contention.
>
> This also enables using the percpu-rwsem with rcu_sync disabled in
> order to bias the implementation differently, reducing the writer
> latency by adding some cost to readers.

So this by itself doesn't help us much, but including the following
from Oleg does help quite a bit:

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index db27804..9e9200b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5394,6 +5394,8 @@ int __init cgroup_init(void)
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));
 
+	rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
+
 	mutex_lock(&cgroup_mutex);
 
 	/* Add init_css_set to the hash table */

thanks
-john
[PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
Currently the percpu-rwsem switches to (global) atomic ops while a
writer is waiting; which could be quite a while and slows down
releasing the readers.

This patch cures this problem by ordering the reader-state vs
reader-count (see the comments in __percpu_down_read() and
percpu_down_write()). This changes a global atomic op into a full
memory barrier, which doesn't have the global cacheline contention.

This also enables using the percpu-rwsem with rcu_sync disabled in
order to bias the implementation differently, reducing the writer
latency by adding some cost to readers.

Cc: Paul McKenney
Reviewed-by: Oleg Nesterov
Signed-off-by: Peter Zijlstra (Intel)
---
 include/linux/percpu-rwsem.h  |  84 +--
 kernel/locking/percpu-rwsem.c | 228 --
 2 files changed, 206 insertions(+), 106 deletions(-)

--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -10,30 +10,96 @@
 struct percpu_rw_semaphore {
 	struct rcu_sync		rss;
-	unsigned int __percpu	*fast_read_ctr;
+	unsigned int __percpu	*read_count;
 	struct rw_semaphore	rw_sem;
-	atomic_t		slow_read_ctr;
-	wait_queue_head_t	write_waitq;
+	wait_queue_head_t	writer;
+	int			readers_block;
 };
 
-extern void percpu_down_read(struct percpu_rw_semaphore *);
-extern int percpu_down_read_trylock(struct percpu_rw_semaphore *);
-extern void percpu_up_read(struct percpu_rw_semaphore *);
+extern int __percpu_down_read(struct percpu_rw_semaphore *, int);
+extern void __percpu_up_read(struct percpu_rw_semaphore *);
+
+static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
+{
+	might_sleep();
+
+	rwsem_acquire_read(&sem->rw_sem.dep_map, 0, 0, _RET_IP_);
+
+	preempt_disable();
+	/*
+	 * We are in an RCU-sched read-side critical section, so the writer
+	 * cannot both change sem->state from readers_fast and start checking
+	 * counters while we are here. So if we see !sem->state, we know that
+	 * the writer won't be checking until we're past the preempt_enable()
+	 * and that once the synchronize_sched() is done, the writer will see
+	 * anything we did within this RCU-sched read-side critical section.
+	 */
+	__this_cpu_inc(*sem->read_count);
+	if (unlikely(!rcu_sync_is_idle(&sem->rss)))
+		__percpu_down_read(sem, false); /* Unconditional memory barrier */
+	preempt_enable();
+	/*
+	 * The barrier() from preempt_enable() prevents the compiler from
+	 * bleeding the critical section out.
+	 */
+}
+
+static inline int percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
+{
+	int ret = 1;
+
+	preempt_disable();
+	/*
+	 * Same as in percpu_down_read().
+	 */
+	__this_cpu_inc(*sem->read_count);
+	if (unlikely(!rcu_sync_is_idle(&sem->rss)))
+		ret = __percpu_down_read(sem, true); /* Unconditional memory barrier */
+	preempt_enable();
+	/*
+	 * The barrier() from preempt_enable() prevents the compiler from
+	 * bleeding the critical section out.
+	 */
+
+	if (ret)
+		rwsem_acquire_read(&sem->rw_sem.dep_map, 0, 1, _RET_IP_);
+
+	return ret;
+}
+
+static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
+{
+	/*
+	 * The barrier() in preempt_disable() prevents the compiler from
+	 * bleeding the critical section out.
+	 */
+	preempt_disable();
+	/*
+	 * Same as in percpu_down_read().
+	 */
+	if (likely(rcu_sync_is_idle(&sem->rss)))
+		__this_cpu_dec(*sem->read_count);
+	else
+		__percpu_up_read(sem); /* Unconditional memory barrier */
+	preempt_enable();
+
+	rwsem_release(&sem->rw_sem.dep_map, 1, _RET_IP_);
+}
 
 extern void percpu_down_write(struct percpu_rw_semaphore *);
 extern void percpu_up_write(struct percpu_rw_semaphore *);
 
 extern int __percpu_init_rwsem(struct percpu_rw_semaphore *,
 				const char *, struct lock_class_key *);
+
 extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
 
-#define percpu_init_rwsem(brw)					\
+#define percpu_init_rwsem(sem)					\
 ({								\
 	static struct lock_class_key rwsem_key;		\
-	__percpu_init_rwsem(brw, #brw, &rwsem_key);		\
+	__percpu_init_rwsem(sem, #sem, &rwsem_key);		\
 })
-
 #define percpu_rwsem_is_held(sem) lockdep_is_held(&(sem)->rw_sem)
 
 static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -8,152 +8,186 @@
 #include
 #include
 
-int __percpu_init_rwsem(struct percpu_rw_semaphore *brw,
+int __percpu_init_rwsem(struct percpu_rw_se