Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
On 8/26/2016 9:47 AM, Dmitry Shmidt wrote:
> On Fri, Aug 26, 2016 at 5:51 AM, Tejun Heo wrote:
>> Hello, John.
>>
>> On Thu, Aug 25, 2016 at 07:14:07PM -0700, John Stultz wrote:
>>> Hey! Good news. This patch along with Peter's locking changes pushes
>>> the latencies down to an apparently acceptable level!
>>
>> Ah, that's good to hear. Please feel free to ping me if you guys
>> wanna talk about cgroup usage in android.
>
> Thanks a lot for all your help resolving this issue!

Thank you for the help.

>> Thanks!
>>
>> --
>> tejun

--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
On Fri, Aug 26, 2016 at 5:51 AM, Tejun Heo wrote:
> Hello, John.
>
> On Thu, Aug 25, 2016 at 07:14:07PM -0700, John Stultz wrote:
>> Hey! Good news. This patch along with Peter's locking changes pushes
>> the latencies down to an apparently acceptable level!
>
> Ah, that's good to hear. Please feel free to ping me if you guys
> wanna talk about cgroup usage in android.

Thanks a lot for all your help resolving this issue!

> Thanks!
>
> --
> tejun
Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
Hello, John.

On Thu, Aug 25, 2016 at 07:14:07PM -0700, John Stultz wrote:
> Hey! Good news. This patch along with Peter's locking changes pushes
> the latencies down to an apparently acceptable level!

Ah, that's good to hear. Please feel free to ping me if you guys
wanna talk about cgroup usage in android.

Thanks!

--
tejun
Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
On Wed, Aug 24, 2016 at 2:30 PM, Tejun Heo wrote:
> Hello, John.
>
> On Wed, Aug 24, 2016 at 02:16:52PM -0700, John Stultz wrote:
>> Hey Peter, Tejun, Oleg,
>> So while your tweaks for the percpu-rwsem have greatly helped the
>> regression folks were seeing (many thanks, by the way), as noted
>> above, the performance regression with the global lock compared to
>> earlier kernels is still ~3x slower (though again, much better than
>> the 80x slower that was seen earlier).
>>
>> So I was wondering if patches to go back to the per signal_struct
>> locking would still be considered? Or is the global lock approach the
>> only way forward?
>
> We can't simply revert but we can make the lock per signal_struct
> again. It's just that it'd be quite a bit more complex (but, again,
> if we need it...) and for cases where migrations aren't as frequent
> percpu-rwsem would be at least a bit lower overhead. Can you please
> test with the following patch applied just in case?
>
>   https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/commit/?h=for-4.8-fixes&id=568ac888215c7fb2fabe8ea739b00ec3c1f5d440

Hey! Good news. This patch along with Peter's locking changes pushes
the latencies down to an apparently acceptable level!

Many thanks for the pointer!

thanks
-john
Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
On Wed, Aug 24, 2016 at 2:30 PM, Tejun Heo wrote:
> On Wed, Aug 24, 2016 at 02:16:52PM -0700, John Stultz wrote:
>>
>> So I was wondering if patches to go back to the per signal_struct
>> locking would still be considered? Or is the global lock approach the
>> only way forward?
>
> We can't simply revert but we can make the lock per signal_struct
> again. It's just that it'd be quite a bit more complex (but, again,
> if we need it...) and for cases where migrations aren't as frequent
> percpu-rwsem would be at least a bit lower overhead. Can you please
> test with the following patch applied just in case?
>
>   https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/commit/?h=for-4.8-fixes&id=568ac888215c7fb2fabe8ea739b00ec3c1f5d440

That does provide a reasonable improvement in my testing! But I'll
pass it along to folks who are doing more in-depth performance
analysis to make sure.

If that doesn't work, I'll talk w/ Dmitry about submitting his patch
reworking things back to the per signal_struct locking.

>> At a higher level, I'm worried that Android's use of cgroups as a
>> priority enforcement mechanism is at odds with developers focusing on
>> it as a container enforcement mechanism, as in the latter it's not
>> common for tasks to change between cgroups, but with the former
>> priority adjustments are quite common.
>
> It has been at odds as long as android existed. cgroup used to have
> synchronous synchronize_rcu() in the migration path which android
> kernel simply deleted (didn't break android's use case), re-labeling
> every page's memcg ownership on foreground and background switches
> (does it still do that? it's simply unworkable) and so on.

Yea. Android folks can correct me, but I think the memcg
foreground/background bits have been dropped. They still use an apps
memcg group for low-memory notification in their
userland-low-memory-killer-daemon (however, my understanding is that
it has yet to sufficiently replace the in-kernel low-memory-killer).

> I find it difficult to believe that android's requirements can be
> satisfied only through moving processes around. cgroup usage model is
> affected by container use cases but isn't limited to it. If android
> is willing to change, I'd be more than happy to work with you and
> solve obstacles. If not, we'll surely try not to break it any more
> than it has always been broken.

I think folks are open to ideas, but going from idea-from-the-list to
something reliable enough to ship is always a bit fraught, so it often
takes some time and effort to really validate if those ideas are
workable.

I have on my list to re-submit the can_attach logic Android uses to
provide permission checks on processes migrating other tasks between
cgroups, so at least we can try to reduce the separate domains of
focus and get folks sharing the same code bases.

thanks
-john
Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
Hello, John.

On Wed, Aug 24, 2016 at 02:16:52PM -0700, John Stultz wrote:
> Hey Peter, Tejun, Oleg,
> So while your tweaks for the percpu-rwsem have greatly helped the
> regression folks were seeing (many thanks, by the way), as noted
> above, the performance regression with the global lock compared to
> earlier kernels is still ~3x slower (though again, much better than
> the 80x slower that was seen earlier).
>
> So I was wondering if patches to go back to the per signal_struct
> locking would still be considered? Or is the global lock approach the
> only way forward?

We can't simply revert but we can make the lock per signal_struct
again. It's just that it'd be quite a bit more complex (but, again,
if we need it...) and for cases where migrations aren't as frequent
percpu-rwsem would be at least a bit lower overhead. Can you please
test with the following patch applied just in case?

  https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/commit/?h=for-4.8-fixes&id=568ac888215c7fb2fabe8ea739b00ec3c1f5d440

> At a higher level, I'm worried that Android's use of cgroups as a
> priority enforcement mechanism is at odds with developers focusing on
> it as a container enforcement mechanism, as in the latter it's not
> common for tasks to change between cgroups, but with the former
> priority adjustments are quite common.

It has been at odds as long as android existed. cgroup used to have a
synchronous synchronize_rcu() in the migration path, which the android
kernel simply deleted (didn't break android's use case), re-labeling
every page's memcg ownership on foreground and background switches
(does it still do that? it's simply unworkable) and so on.

I find it difficult to believe that android's requirements can be
satisfied only through moving processes around. The cgroup usage model
is affected by container use cases but isn't limited to them. If
android is willing to change, I'd be more than happy to work with you
and solve obstacles. If not, we'll surely try not to break it any more
than it has always been broken.

Thanks.

--
tejun
Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
On Fri, Aug 12, 2016 at 6:44 PM, Om Dhyade wrote:
> Update from my tests:
> Use-case: Android application launches.
>
> I tested the patches on android N build, I see max latency ~7ms.
> In my tests, the wait is due to: copy_process(fork.c) blocks all threads in
> __cgroup_procs_write, including threads which are not part of the forking
> process's thread-group.
>
> Dmitry had provided a hack patch which reverts to per-process rw-sem, which
> had max latency of ~2ms.
>
> The android user-space binder library does 2 cgroup write operations per
> transaction; apart from the copy_process(fork.c) wait, I see pre-emption in
> __cgroup_procs_write causing waits.

Hey Peter, Tejun, Oleg,

So while your tweaks for the percpu-rwsem have greatly helped the
regression folks were seeing (many thanks, by the way), as noted
above, the performance regression with the global lock compared to
earlier kernels is still ~3x slower (though again, much better than
the 80x slower that was seen earlier).

So I was wondering if patches to go back to the per signal_struct
locking would still be considered? Or is the global lock approach the
only way forward?

At a higher level, I'm worried that Android's use of cgroups as a
priority enforcement mechanism is at odds with developers focusing on
it as a container enforcement mechanism, as in the latter it's not
common for tasks to change between cgroups, but with the former
priority adjustments are quite common.

thanks
-john
Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
Thank you Dmitry for sharing the patches.

Update from my tests:
Use-case: Android application launches.

I tested the patches on android N build, I see max latency ~7ms.
In my tests, the wait is due to: copy_process(fork.c) blocks all
threads in __cgroup_procs_write, including threads which are not part
of the forking process's thread-group.

Dmitry had provided a hack patch which reverts to per-process rw-sem,
which had max latency of ~2ms.

The android user-space binder library does 2 cgroup write operations
per transaction; apart from the copy_process(fork.c) wait, I see
pre-emption in __cgroup_procs_write causing waits.

Thanks.
Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
On Tue, Aug 09, 2016 at 04:47:38PM -0700, John Stultz wrote:
> On Tue, Aug 9, 2016 at 2:51 AM, Peter Zijlstra wrote:
>>
>> Currently the percpu-rwsem switches to (global) atomic ops while a
>> writer is waiting; which could be quite a while and slows down
>> releasing the readers.
>>
>> This patch cures this problem by ordering the reader-state vs
>> reader-count (see the comments in __percpu_down_read() and
>> percpu_down_write()). This changes a global atomic op into a full
>> memory barrier, which doesn't have the global cacheline contention.
>>
>> This also enables using the percpu-rwsem with rcu_sync disabled in
>> order to bias the implementation differently, reducing the writer
>> latency by adding some cost to readers.
>
> So this by itself doesn't help us much, but including the following
> from Oleg does help quite a bit:

Correct, Oleg was going to send his rcu_sync rework on top of this.
But since it's holiday season, things might be a tad delayed.
Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
On 08/10, Peter Zijlstra wrote:
>
> On Tue, Aug 09, 2016 at 04:47:38PM -0700, John Stultz wrote:
>> On Tue, Aug 9, 2016 at 2:51 AM, Peter Zijlstra wrote:
>>>
>>> Currently the percpu-rwsem switches to (global) atomic ops while a
>>> writer is waiting; which could be quite a while and slows down
>>> releasing the readers.
>>>
>>> This patch cures this problem by ordering the reader-state vs
>>> reader-count (see the comments in __percpu_down_read() and
>>> percpu_down_write()). This changes a global atomic op into a full
>>> memory barrier, which doesn't have the global cacheline contention.
>>>
>>> This also enables using the percpu-rwsem with rcu_sync disabled in
>>> order to bias the implementation differently, reducing the writer
>>> latency by adding some cost to readers.
>>
>> So this by itself doesn't help us much, but including the following
>> from Oleg does help quite a bit:
>
> Correct, Oleg was going to send his rcu_sync rework on top of this.
> But since it's holiday season, things might be a tad delayed.

Yeees. The patch is ready and even seems to work, but I still can't
(re)write the comments. Will try to finish tomorrow.

But. We need something simple and backportable, so I am going to send
the patch below first. As you can see this is your "sabotage" change,
just the new helper was renamed + s/!GP_IDLE/GP_PASSED/ fix.

And the only reason I can't send it today is that somehow I can't
write the changelog ;) So I would be really happy if you send this
change instead of me, I am going to keep "From: peterz" anyway.

Oleg.
diff --git a/include/linux/rcu_sync.h b/include/linux/rcu_sync.h
index a63a33e..ece7ed9 100644
--- a/include/linux/rcu_sync.h
+++ b/include/linux/rcu_sync.h
@@ -59,6 +59,7 @@ static inline bool rcu_sync_is_idle(struct rcu_sync *rsp)
 }
 
 extern void rcu_sync_init(struct rcu_sync *, enum rcu_sync_type);
+extern void rcu_sync_enter_start(struct rcu_sync *);
 extern void rcu_sync_enter(struct rcu_sync *);
 extern void rcu_sync_exit(struct rcu_sync *);
 extern void rcu_sync_dtor(struct rcu_sync *);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 75c0ff0..614f9cd 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5609,6 +5609,8 @@ int __init cgroup_init(void)
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));
 
+	rcu_sync_enter_start(&cgroup_threadgroup_rwsem.rss);
+
 	get_user_ns(init_cgroup_ns.user_ns);
 
 	mutex_lock(&cgroup_mutex);
diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c
index be922c9..c9b7bc8 100644
--- a/kernel/rcu/sync.c
+++ b/kernel/rcu/sync.c
@@ -83,6 +83,18 @@ void rcu_sync_init(struct rcu_sync *rsp, enum rcu_sync_type type)
 }
 
 /**
+ * Must be called after rcu_sync_init() and before first use.
+ *
+ * Ensures rcu_sync_is_idle() returns false and rcu_sync_{enter,exit}() pairs
+ * turn into NO-OPs.
+ */
+void rcu_sync_enter_start(struct rcu_sync *rsp)
+{
+	rsp->gp_count++;
+	rsp->gp_state = GP_PASSED;
+}
+
+/**
  * rcu_sync_enter() - Force readers onto slowpath
  * @rsp: Pointer to rcu_sync structure to use for synchronization
  *
Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
On Tue, Aug 9, 2016 at 2:51 AM, Peter Zijlstra wrote:
>
> Currently the percpu-rwsem switches to (global) atomic ops while a
> writer is waiting; which could be quite a while and slows down
> releasing the readers.
>
> This patch cures this problem by ordering the reader-state vs
> reader-count (see the comments in __percpu_down_read() and
> percpu_down_write()). This changes a global atomic op into a full
> memory barrier, which doesn't have the global cacheline contention.
>
> This also enables using the percpu-rwsem with rcu_sync disabled in
> order to bias the implementation differently, reducing the writer
> latency by adding some cost to readers.

So this by itself doesn't help us much, but including the following
from Oleg does help quite a bit:

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index db27804..9e9200b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5394,6 +5394,8 @@ int __init cgroup_init(void)
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));
 
+	rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
+
 	mutex_lock(&cgroup_mutex);
 
 	/* Add init_css_set to the hash table */

thanks
-john
[PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact
Currently the percpu-rwsem switches to (global) atomic ops while a
writer is waiting; which could be quite a while and slows down
releasing the readers.

This patch cures this problem by ordering the reader-state vs
reader-count (see the comments in __percpu_down_read() and
percpu_down_write()). This changes a global atomic op into a full
memory barrier, which doesn't have the global cacheline contention.

This also enables using the percpu-rwsem with rcu_sync disabled in
order to bias the implementation differently, reducing the writer
latency by adding some cost to readers.

Cc: Paul McKenney
Reviewed-by: Oleg Nesterov
Signed-off-by: Peter Zijlstra (Intel)
---
 include/linux/percpu-rwsem.h  |  84 +--
 kernel/locking/percpu-rwsem.c | 228 --
 2 files changed, 206 insertions(+), 106 deletions(-)

--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -10,30 +10,96 @@
 struct percpu_rw_semaphore {
 	struct rcu_sync		rss;
-	unsigned int __percpu	*fast_read_ctr;
+	unsigned int __percpu	*read_count;
 	struct rw_semaphore	rw_sem;
-	atomic_t		slow_read_ctr;
-	wait_queue_head_t	write_waitq;
+	wait_queue_head_t	writer;
+	int			readers_block;
 };
 
-extern void percpu_down_read(struct percpu_rw_semaphore *);
-extern int percpu_down_read_trylock(struct percpu_rw_semaphore *);
-extern void percpu_up_read(struct percpu_rw_semaphore *);
+extern int __percpu_down_read(struct percpu_rw_semaphore *, int);
+extern void __percpu_up_read(struct percpu_rw_semaphore *);
+
+static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
+{
+	might_sleep();
+
+	rwsem_acquire_read(&sem->rw_sem.dep_map, 0, 0, _RET_IP_);
+
+	preempt_disable();
+	/*
+	 * We are in an RCU-sched read-side critical section, so the writer
+	 * cannot both change sem->state from readers_fast and start checking
+	 * counters while we are here. So if we see !sem->state, we know that
+	 * the writer won't be checking until we're past the preempt_enable()
+	 * and that once the synchronize_sched() is done, the writer will see
+	 * anything we did within this RCU-sched read-side critical section.
+	 */
+	__this_cpu_inc(*sem->read_count);
+	if (unlikely(!rcu_sync_is_idle(&sem->rss)))
+		__percpu_down_read(sem, false); /* Unconditional memory barrier */
+	preempt_enable();
+	/*
+	 * The barrier() from preempt_enable() prevents the compiler from
+	 * bleeding the critical section out.
+	 */
+}
+
+static inline int percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
+{
+	int ret = 1;
+
+	preempt_disable();
+	/*
+	 * Same as in percpu_down_read().
+	 */
+	__this_cpu_inc(*sem->read_count);
+	if (unlikely(!rcu_sync_is_idle(&sem->rss)))
+		ret = __percpu_down_read(sem, true); /* Unconditional memory barrier */
+	preempt_enable();
+	/*
+	 * The barrier() from preempt_enable() prevents the compiler from
+	 * bleeding the critical section out.
+	 */
+
+	if (ret)
+		rwsem_acquire_read(&sem->rw_sem.dep_map, 0, 1, _RET_IP_);
+
+	return ret;
+}
+
+static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
+{
+	/*
+	 * The barrier() in preempt_disable() prevents the compiler from
+	 * bleeding the critical section out.
+	 */
+	preempt_disable();
+	/*
+	 * Same as in percpu_down_read().
+	 */
+	if (likely(rcu_sync_is_idle(&sem->rss)))
+		__this_cpu_dec(*sem->read_count);
+	else
+		__percpu_up_read(sem); /* Unconditional memory barrier */
+	preempt_enable();
+
+	rwsem_release(&sem->rw_sem.dep_map, 1, _RET_IP_);
+}
 
 extern void percpu_down_write(struct percpu_rw_semaphore *);
 extern void percpu_up_write(struct percpu_rw_semaphore *);
 
 extern int __percpu_init_rwsem(struct percpu_rw_semaphore *,
 				const char *, struct lock_class_key *);
+
 extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
 
-#define percpu_init_rwsem(brw)					\
+#define percpu_init_rwsem(sem)					\
 ({								\
 	static struct lock_class_key rwsem_key;		\
-	__percpu_init_rwsem(brw, #brw, &rwsem_key);		\
+	__percpu_init_rwsem(sem, #sem, &rwsem_key);		\
 })
-
 #define percpu_rwsem_is_held(sem) lockdep_is_held(&(sem)->rw_sem)
 
 static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -8,152 +8,186 @@
 #include
 #include
 
-int __percpu_init_rwsem(struct percpu_rw_semaphore *brw,
+int __percpu_init_rwsem(struct percpu_rw_se