Re: [PATCHSET v2] slab: make memcg slab destruction scalable

2017-01-18 Thread Tejun Heo
Hello,

On Wed, Jan 18, 2017 at 04:54:48PM +0900, Joonsoo Kim wrote:
> That problem is caused by the slow release path and the resulting
> contention on slab_mutex. With an ordered workqueue, kworkers are no
> longer created in large numbers, but many work items that create new
> caches for memcgs can still stay pending for a long time because of
> the slow release path.

How many work items are pending and how many workers are servicing
them shouldn't affect the actual completion time much when most of
them are serialized by a mutex.  Anyway, this patchset moves all the
slow parts out of slab_mutex, so none of this is a problem anymore.

> Your patchset replaces the optimization for the release path, so it's
> better to check that the creation work doesn't stay pending for a
> long time under the above workload.

Yeap, it seems to work fine.

Thanks.

-- 
tejun


Re: [PATCHSET v2] slab: make memcg slab destruction scalable

2017-01-17 Thread Joonsoo Kim
On Tue, Jan 17, 2017 at 08:49:13AM -0800, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jan 17, 2017 at 09:12:57AM +0900, Joonsoo Kim wrote:
> > Could you confirm that your series solves the problem reported by
> > Doug? It would be great if the result is mentioned in the patch
> > description.
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=172991
> 
> So, that's an issue in the creation path which is already resolved by
> switching to an ordered workqueue (it'd probably be better to use a
> per-cpu wq w/ @max_active == 1, though).  This patchset is about the
> release path.  slab_mutex contention would definitely go down with
> this, but I don't think there's more of a connection than that.

That problem is caused by the slow release path and the resulting
contention on slab_mutex. With an ordered workqueue, kworkers are no
longer created in large numbers, but many work items that create new
caches for memcgs can still stay pending for a long time because of
the slow release path.

Your patchset replaces the optimization for the release path, so it's
better to check that the creation work doesn't stay pending for a
long time under the above workload.

Thanks.


Re: [PATCHSET v2] slab: make memcg slab destruction scalable

2017-01-17 Thread Tejun Heo
Hello,

On Tue, Jan 17, 2017 at 09:12:57AM +0900, Joonsoo Kim wrote:
> Could you confirm that your series solves the problem reported by
> Doug? It would be great if the result is mentioned in the patch
> description.
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=172991

So, that's an issue in the creation path which is already resolved by
switching to an ordered workqueue (it'd probably be better to use a
per-cpu wq w/ @max_active == 1, though).  This patchset is about the
release path.  slab_mutex contention would definitely go down with
this, but I don't think there's more of a connection than that.
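
For concreteness, a minimal sketch of the two workqueue setups being
compared follows; the wq and symbol names are illustrative, not the
actual ones in mm/.

#include <linux/init.h>
#include <linux/workqueue.h>

/* illustrative name, not the real symbol */
static struct workqueue_struct *memcg_cache_wq;

static int __init memcg_cache_wq_init(void)
{
        /*
         * Ordered workqueue: unbound, at most one work item in flight
         * system-wide, executed strictly in queueing order.  This is
         * what caps the number of kworkers in the creation path.
         */
        memcg_cache_wq = alloc_ordered_workqueue("memcg_cache", 0);

        /*
         * The alternative mentioned above: a per-cpu workqueue with
         * @max_active == 1, i.e. at most one item in flight per CPU
         * rather than one globally:
         *
         *      memcg_cache_wq = alloc_workqueue("memcg_cache", 0, 1);
         */

        return memcg_cache_wq ? 0 : -ENOMEM;
}
subsys_initcall(memcg_cache_wq_init);

The ordered variant serializes everything behind a single in-flight
item; the per-cpu variant keeps the same cap per CPU, so cache
creation for different memcgs isn't all stuck behind one slow item.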

Thanks.

-- 
tejun


Re: [PATCHSET v2] slab: make memcg slab destruction scalable

2017-01-16 Thread Joonsoo Kim
On Sat, Jan 14, 2017 at 01:48:26PM -0500, Tejun Heo wrote:
> This is v2.  Changes from the last version[L] are
> 
> * 0002-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch was
>   incorrect and dropped.
> 
> * 0006-slab-don-t-put-memcg-caches-on-slab_caches-list.patch
>   incorrectly converted places which needed to walk all caches.
>   Replaced with 0005-slab-implement-slab_root_caches-list.patch, which
>   adds a root-only list instead of converting the slab_caches list to
>   list only root caches.
> 
> * Misc fixes.
> 
> With kmem cgroup support enabled, kmem_caches can be created and
> destroyed frequently and a great number of near empty kmem_caches can
> accumulate if there are a lot of transient cgroups and the system is
> not under memory pressure.  When memory reclaim starts under such
> conditions, it can lead to consecutive deactivation and destruction of
> many kmem_caches, easily hundreds of thousands on moderately large
> systems, exposing scalability issues in the current slab management
> code.
> 
> I've seen machines which end up with hundreds of thousands of caches and
> many millions of kernfs_nodes.  The current code is O(N^2) on the
> total number of caches and has synchronous rcu_barrier() and
> synchronize_sched() in cgroup offline / release path which is executed
> while holding cgroup_mutex.  Combined, this leads to very expensive
> and slow cache destruction operations which can easily keep running
> for half a day.
> 
> This also messes up /proc/slabinfo along with other cache iterating
> operations.  seq_file operates on 4k chunks and on each 4k boundary
> tries to seek to the last position in the list.  With a huge number of
> caches on the list, this becomes very slow and very prone to the list
> content changing underneath it leading to a lot of missing and/or
> duplicate entries.
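
The iterator shape behind this is roughly the following (paraphrased,
not a verbatim copy of mm/slab_common.c).  seq_list_start() walks the
list linearly from the head on every ->start() call, i.e. once per 4k
chunk, so a full read costs O(N^2) list steps:

static void *slab_start(struct seq_file *m, loff_t *pos)
{
        mutex_lock(&slab_mutex);
        /* linear walk from the list head to *pos, redone per 4k chunk */
        return seq_list_start(&slab_caches, *pos);
}

static void *slab_next(struct seq_file *m, void *p, loff_t *pos)
{
        return seq_list_next(p, &slab_caches, pos);
}

static void slab_stop(struct seq_file *m, void *p)
{
        mutex_unlock(&slab_mutex);
}

slab_mutex is dropped between chunks, so the list can change between
re-seeks, which is where the missing/duplicate entries come from.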
> 
> This patchset addresses the scalability problem.
> 
> * Add root and per-memcg lists.  Update each user to use the
>   appropriate list.
> 
> * Replace rcu_barrier() and synchronize_rcu() with call_rcu() and
>   call_rcu_sched().
> 
> * For dying empty slub caches, remove the sysfs files after
>   deactivation so that we don't end up with millions of sysfs files
>   without any useful information on them.
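
For illustration, an untested sketch of the general shape of the
second bullet above (the synchronous-wait -> call_rcu() direction);
the wrapper struct and finish_release() helper are made up here, and
the series itself wires the callbacks up differently inside mm/:

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct cache_release_work {
        struct rcu_head rcu;
        struct kmem_cache *s;
};

static void cache_release_rcufn(struct rcu_head *head)
{
        struct cache_release_work *w =
                container_of(head, struct cache_release_work, rcu);

        /*
         * A grace period has elapsed; do the slow part of the release
         * here, outside cgroup_mutex and slab_mutex.
         */
        finish_release(w->s);           /* hypothetical helper */
        kfree(w);
}

static void cache_queue_release(struct kmem_cache *s)
{
        struct cache_release_work *w = kmalloc(sizeof(*w), GFP_KERNEL);

        if (!w)
                return;                 /* error handling elided */
        w->s = s;
        /*
         * call_rcu() returns immediately and runs the callback after a
         * grace period, instead of making the caller (cgroup offline,
         * under cgroup_mutex) block for a grace period per cache the
         * way synchronize_sched()/rcu_barrier() do.
         */
        call_rcu(&w->rcu, cache_release_rcufn);
}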

Could you confirm that your series solves the problem reported by
Doug? It would be great if the result is mentioned in the patch
description.

https://bugzilla.kernel.org/show_bug.cgi?id=172991

Thanks.