On 29.11.2019 16:00, Pavel Tikhomirov wrote:
> We have a problem that shrinker_rwsem can be held for a long time for
> read, while many processes are trying to manage shrinkers and thus take
> the lock for write - all such processes hang like:
>
> crash> bt ffffa0117d226000
> PID: 45299  TASK: ffffa0117d226000  CPU: 16  COMMAND: "vzctl"
>  #0 [ffff9ffe8b793b28] __schedule at ffffffffaf799a12
>  #1 [ffff9ffe8b793bb0] schedule at ffffffffaf799ed9
>  #2 [ffff9ffe8b793bc0] rwsem_down_write_failed at ffffffffaf79b815
>  #3 [ffff9ffe8b793c58] call_rwsem_down_write_failed at ffffffffaf3ae807
>  #4 [ffff9ffe8b793ca0] down_write at ffffffffaf7991ed
>  #5 [ffff9ffe8b793cb8] register_shrinker at ffffffffaf1ddc8d
>  #6 [ffff9ffe8b793cd8] sget_userns at ffffffffaf264329
>  #7 [ffff9ffe8b793d30] sget at ffffffffaf2643fd
>  #8 [ffff9ffe8b793d70] mount_bdev at ffffffffaf26451a
>  #9 [ffff9ffe8b793de0] ext4_mount at ffffffffc07d9f94 [ext4]
> #10 [ffff9ffe8b793e10] mount_fs at ffffffffaf2652de
> #11 [ffff9ffe8b793e58] vfs_kern_mount at ffffffffaf283a27
> #12 [ffff9ffe8b793e90] do_mount at ffffffffaf2863e9
> #13 [ffff9ffe8b793f18] sys_mount at ffffffffaf287263
> #14 [ffff9ffe8b793f50] system_call_fastpath at ffffffffaf7a6edb
>     RIP: 00007fb4d1d3978a  RSP: 00007fff835a1f60  RFLAGS: 00010206
>     RAX: 00000000000000a5  RBX: 00007fff835a43f0  RCX: 00007fb4d2ff2820
>     RDX: 00007fb4d205d4b7  RSI: 0000558e62ddf520  RDI: 00007fff835a4290
>     RBP: 00007fb4d205d4b7   R8: 00007fff835a2ba0   R9: 00007fb4d1c8820d
>     R10: 0000000000000000  R11: 0000000000000206  R12: 00007fff835a2ba0
>     R13: 00007fff835a4290  R14: 0000000000000000  R15: 0000000000000000
>     ORIG_RAX: 00000000000000a5  CS: 0033  SS: 002b
>
> This means that mounting anything new is currently impossible as
> register_shrinker can't take the lock; similar thing for umounting.
>
> The holder of shrinker_rwsem is hanging in do_shrink_slab:
>
> crash> bt ffffa02ebab7c000
> PID: 977061  TASK: ffffa02ebab7c000  CPU: 30  COMMAND: "crond"
>  #0 [ffffa02a7b2b74e8] __schedule at ffffffffaf799a12
>  #1 [ffffa02a7b2b7570] schedule at ffffffffaf799ed9
>  #2 [ffffa02a7b2b7580] rpc_wait_bit_killable at ffffffffc02d12f1 [sunrpc]
>  #3 [ffffa02a7b2b7598] __wait_on_bit at ffffffffaf797b37
>  #4 [ffffa02a7b2b75d8] out_of_line_wait_on_bit at ffffffffaf797ca1
>  #5 [ffffa02a7b2b7650] __rpc_wait_for_completion_task at ffffffffc02d12dd [sunrpc]
>  #6 [ffffa02a7b2b7660] _nfs4_proc_delegreturn at ffffffffc0bcabfa [nfsv4]
>  #7 [ffffa02a7b2b7710] nfs4_proc_delegreturn at ffffffffc0bd0b31 [nfsv4]
>  #8 [ffffa02a7b2b7788] nfs_do_return_delegation at ffffffffc0be4699 [nfsv4]
>  #9 [ffffa02a7b2b77a8] nfs_inode_return_delegation_noreclaim at ffffffffc0be53c7 [nfsv4]
> #10 [ffffa02a7b2b77c0] nfs4_evict_inode at ffffffffc0be3a19 [nfsv4]
> #11 [ffffa02a7b2b77d8] evict at ffffffffaf27e7b4
> #12 [ffffa02a7b2b7800] dispose_list at ffffffffaf27e8be
> #13 [ffffa02a7b2b7828] prune_icache_sb at ffffffffaf27f9e8
> #14 [ffffa02a7b2b7880] super_cache_scan at ffffffffaf264ae7
> #15 [ffffa02a7b2b78c0] do_shrink_slab at ffffffffaf1dd443
> #16 [ffffa02a7b2b7938] shrink_slab at ffffffffaf1ddef4
> #17 [ffffa02a7b2b79b8] shrink_zone at ffffffffaf1e1451
> #18 [ffffa02a7b2b7a60] do_try_to_free_pages at ffffffffaf1e2790
> #19 [ffffa02a7b2b7b08] try_to_free_mem_cgroup_pages at ffffffffaf1e2e3e
> #20 [ffffa02a7b2b7ba8] try_charge at ffffffffc0a33e8c [kpatch_cumulative_89_2_r1]
> #21 [ffffa02a7b2b7c70] memcg_charge_kmem at ffffffffaf251611
> #22 [ffffa02a7b2b7c90] __memcg_kmem_newpage_charge at ffffffffaf2519c4
> #23 [ffffa02a7b2b7cb8] __alloc_pages_nodemask at ffffffffaf1d4696
> #24 [ffffa02a7b2b7d70] alloc_pages_current at ffffffffaf226fc8
> #25 [ffffa02a7b2b7db8] pte_alloc_one at ffffffffaf078d17
> #26 [ffffa02a7b2b7dd0] __pte_alloc at ffffffffaf1fe683
> #27 [ffffa02a7b2b7e08] handle_mm_fault at ffffffffaf204e59
> #28 [ffffa02a7b2b7eb0] __do_page_fault at ffffffffaf7a16e3
> #29 [ffffa02a7b2b7f20] do_page_fault at ffffffffaf7a1a05
> #30 [ffffa02a7b2b7f50] page_fault at ffffffffaf79d768
>     RIP: 00007f1e54913fa0  RSP: 00007ffe50a21898  RFLAGS: 00010206
>     RAX: 00007f1e50bce1f0  RBX: 000000037ffff1a0  RCX: 00007f1e549226a7
>     RDX: 0000000000000002  RSI: 0000557241e48a20  RDI: 0000557241e489e0
>     RBP: 00007ffe50a21a00   R8: 00007f1e50bce000   R9: 0000000070000021
>     R10: 0000000000000031  R11: 0000000000000246  R12: 0000557241e489e0
>     R13: 00007ffe50a21ae0  R14: 0000000000000003  R15: 000000006ffffeff
>     ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
>
> It takes shrinker_rwsem in shrink_slab while traversing shrinker_list.
> It tries to shrink something on nfs (hard), but the nfs server is
> already dead at that moment and the rpc will never succeed. Generally,
> any shrinker can take significant time in do_shrink_slab, so it's a bad
> idea to hold the list lock here.
>
> We have a similar problem in shrink_slab_memcg, except that there we
> are traversing shrinker_map+shrinker_idr.
>
> The idea of the patch is to inc a refcount on the chosen shrinker so it
> won't disappear, and release shrinker_rwsem while we are in
> do_shrink_slab; after processing we reacquire shrinker_rwsem, dec
> the refcount and continue the traversal.
>
> We also need a wait_queue so that unregister_shrinker can wait for the
> refcnt to become zero. Only after that can we safely remove the
> shrinker from the list and idr, and free the shrinker.
>
> I've also managed to reproduce the nfs hang in do_shrink_slab with the
> patch applied; all other mount/unmount and all other CT start/stop pass
> fine without any hang. See more info on the reproduction in jira.
>
> https://jira.sw.ru/browse/PSBM-99181
>
> Signed-off-by: Pavel Tikhomirov <ptikhomi...@virtuozzo.com>
> ---
>  include/linux/memcontrol.h |  4 ++
>  include/linux/shrinker.h   |  6 +++
>  mm/memcontrol.c            | 15 +++++++
>  mm/vmscan.c                | 86 +++++++++++++++++++++++++++++++++++++-
>  4 files changed, 110 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index c70279b2616b..c1e50faf00e5 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -704,6 +704,8 @@ static __always_inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
>  extern int memcg_expand_shrinker_maps(int new_id);
>  extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
>  				   int nid, int shrinker_id);
> +extern void memcg_clear_shrinker_bit(struct mem_cgroup *memcg,
> +				     int nid, int shrinker_id);
>
>  extern struct memcg_shrinker_map *memcg_nid_shrinker_map(struct mem_cgroup *memcg, int nid);
>  #else
> @@ -772,6 +774,8 @@ static inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
>  }
>  static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
>  					  int nid, int shrinker_id) { }
> +static inline void memcg_clear_shrinker_bit(struct mem_cgroup *memcg,
> +					    int nid, int shrinker_id) { }
>  #endif /* CONFIG_MEMCG_KMEM */
>  #endif /* _LINUX_MEMCONTROL_H */
>
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index f6938dc6c068..2eea35772794 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -1,6 +1,9 @@
>  #ifndef _LINUX_SHRINKER_H
>  #define _LINUX_SHRINKER_H
>
> +#include <linux/refcount.h>
> +#include <linux/wait.h>
> +
>  /*
>   * This struct is used to pass information from page reclaim to the shrinkers.
>   * We consolidate the values for easier extention later.
> @@ -68,6 +71,9 @@ struct shrinker {
>  	struct list_head list;
>  	/* objs pending delete, per node */
>  	atomic_long_t *nr_deferred;
> +
> +	refcount_t refcnt;
> +	wait_queue_head_t wq;
>  };
>  #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2f988271d111..077361eb399b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -846,6 +846,21 @@ void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
>  	}
>  }
>
> +void memcg_clear_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
> +{
> +	struct memcg_shrinker_map *map;
> +
> +	/*
> +	 * The map of a refcounted memcg can only be freed in
> +	 * memcg_free_shrinker_map_rcu, so we can safely protect
> +	 * the map with rcu_read_lock.
> +	 */
> +	rcu_read_lock();
> +	map = rcu_dereference(memcg->info.nodeinfo[nid]->shrinker_map);
> +	clear_bit(shrinker_id, map->map);
> +	rcu_read_unlock();
> +}
> +
>  struct memcg_shrinker_map *memcg_nid_shrinker_map(struct mem_cgroup *memcg, int nid)
>  {
>  	return rcu_dereference_protected(memcg->info.nodeinfo[nid]->shrinker_map,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b2e465cd87f4..c91cd633a0cf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -303,6 +303,13 @@ int register_shrinker(struct shrinker *shrinker)
>  	if (!shrinker->nr_deferred)
>  		return -ENOMEM;
>
> +	/*
> +	 * The shrinker is not yet visible through shrinker_idr or
> +	 * shrinker_list, so no locks are required for initialization.
> +	 */
> +	refcount_set(&shrinker->refcnt, 1);
> +	init_waitqueue_head(&shrinker->wq);
> +
>  	if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
>  		if (register_memcg_shrinker(shrinker))
>  			goto free_deferred;
> @@ -328,6 +335,13 @@ void unregister_shrinker(struct shrinker *shrinker)
>  	if (!shrinker->nr_deferred)
>  		return;
>
> +	/*
> +	 * If refcnt is not zero we need to wait for this shrinker to
> +	 * finish all of its do_shrink_slab() calls.
> +	 */
> +	if (!refcount_dec_and_test(&shrinker->refcnt))
> +		wait_event(shrinker->wq, (refcount_read(&shrinker->refcnt) == 0));
> +
>  	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>  		unregister_memcg_shrinker(shrinker);
>
> @@ -435,6 +449,32 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	return freed;
>  }
>
> +struct shrinker *get_shrinker(struct shrinker *shrinker)
> +{
> +	/*
> +	 * Pairs with the refcnt dec in unregister_shrinker(); if refcnt is
> +	 * zero, this shrinker is already in the middle of
> +	 * unregister_shrinker() and we can't use it.
> +	 */
> +	if (!refcount_inc_not_zero(&shrinker->refcnt))
> +		shrinker = NULL;
> +	return shrinker;
> +}
> +
> +void put_shrinker(struct shrinker *shrinker)
> +{
> +	/*
> +	 * If refcnt becomes zero, we already have an unregister_shrinker()
> +	 * waiting for us to finish.
> +	 */
> +	if (!refcount_dec_and_test(&shrinker->refcnt)) {
> +		/* Pairs with smp_mb in wait_event()->prepare_to_wait() */
> +		smp_mb();
> +		if (waitqueue_active(&shrinker->wq))
> +			wake_up(&shrinker->wq);
This races with unregister_shrinker(). Right after you have decremented
the counter, unregister_shrinker() may go forward, and then its caller
destructs the shrinker. So wake_up(&shrinker->wq) will operate on freed
memory.

> +	}
> +}
> +
>  #ifdef CONFIG_MEMCG_KMEM
>  static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  				       struct mem_cgroup *memcg, int priority)
> @@ -468,9 +508,23 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  			continue;
>  		}
>
> +		/*
> +		 * Take a refcnt on the shrinker so that it can't be freed or
> +		 * removed from shrinker_idr (and shrinker_list). This way we
> +		 * can safely release shrinker_rwsem.
> +		 *
> +		 * We need to release shrinker_rwsem here as do_shrink_slab can
> +		 * take too much time to finish (e.g. on nfs), and holding the
> +		 * global shrinker_rwsem can block registering and
> +		 * unregistering of shrinkers.
> +		 */
> +		if (!get_shrinker(shrinker))
> +			continue;
> +		up_read(&shrinker_rwsem);
> +
>  		ret = do_shrink_slab(&sc, shrinker, priority);
>  		if (ret == SHRINK_EMPTY) {
> -			clear_bit(i, map->map);
> +			memcg_clear_shrinker_bit(memcg, nid, i);
>  			/*
>  			 * After the shrinker reported that it had no objects to
>  			 * free, but before we cleared the corresponding bit in
> @@ -495,6 +549,20 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  		}
>  		freed += ret;
>
> +		/*
> +		 * Re-acquire shrinker_rwsem, put the refcount and reload
> +		 * shrinker_map, as it can be modified in
> +		 * memcg_expand_one_shrinker_map if new shrinkers were
> +		 * registered in the meanwhile.
> +		 */
> +		if (!down_read_trylock(&shrinker_rwsem)) {
> +			freed = freed ? : 1;
> +			put_shrinker(shrinker);
> +			return freed;
> +		}
> +		put_shrinker(shrinker);
> +		map = memcg_nid_shrinker_map(memcg, nid);
> +
>  		if (rwsem_is_contended(&shrinker_rwsem)) {
>  			freed = freed ? : 1;
>  			break;
> @@ -577,10 +645,26 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  		if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
>  			sc.nid = 0;
>
> +		/* See comment in shrink_slab_memcg. */
> +		if (!get_shrinker(shrinker))
> +			continue;
> +		up_read(&shrinker_rwsem);
> +
>  		ret = do_shrink_slab(&sc, shrinker, priority);
>  		if (ret == SHRINK_EMPTY)
>  			ret = 0;
>  		freed += ret;
> +
> +		/*
> +		 * We can safely continue the traversal of the shrinker_list as
> +		 * our shrinker is still on the list due to the refcount.
> +		 */
> +		if (!down_read_trylock(&shrinker_rwsem)) {
> +			freed = freed ? : 1;
> +			put_shrinker(shrinker);
> +			goto out;
> +		}
> +		put_shrinker(shrinker);
>  	}
>
>  	up_read(&shrinker_rwsem);

_______________________________________________
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel