On Tue, 18 Jul 2017, Dave Chinner wrote:
> > Thanks for looking into this, Dave!
> >
> > The number of GFP_NOFS allocations that build up the deferred counts can
> > be unbounded, however, so this can become excessive, and the oom killer
> > will not kill any processes in this context. Although the motivation to
> > do additional reclaim because of past GFP_NOFS reclaim attempts is
> > worthwhile, I think it should be limited because currently it only
> > increases until something is able to start draining these excess counts.
>
> Usually kswapd is kicked in by this point and starts doing work. Why
> isn't kswapd doing the shrinker work in the background?
>

It is, but it often gets preempted itself while in lru scanning or
shrink_slab(), most often in super_cache_count() itself. The issue is
that it gets preempted by networking packets sent in irq context, which
end up eating GFP_ATOMIC memory. One of the key traits of these events
is that per-zone free memory is far below the min watermarks, so there
is insufficient memory not only for GFP_NOFS but also for GFP_ATOMIC.
Kswapd will only shrink slab in proportion to the lru scanned if it is
not lucky enough to grab the excess nr_deferred, and meanwhile other
threads keep increasing it. The workloads vary, and I can't show a
specific example of GFP_NOFS allocations in flight because we have made
changes to prevent this: specifically, we now ignore the nr_deferred
counts for SHRINKER_MEMCG_AWARE shrinkers, since they are largely
erroneous. The same windup can also occur when we fail the trylock on
the superblock itself.
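For anyone not staring at this code every day, here is a userspace toy
model of the windup arithmetic (modelled on do_shrink_slab(), but
heavily simplified: seeks fixed at 2, no batching, no concurrency, and
all constants invented, so treat it as a sketch of the mechanism rather
than the kernel's exact logic):

/*
 * Toy model of the nr_deferred windup in do_shrink_slab().
 * Not kernel code: single-threaded, constants invented, batching
 * elided.
 */
#include <stdio.h>

static long nr_deferred;	/* stands in for shrinker->nr_deferred[nid] */

/* One shrinker invocation; can_scan == 0 models a GFP_NOFS caller
 * whose ->scan_objects() would return SHRINK_STOP immediately. */
static long shrink(long nr_scanned, long nr_eligible, long freeable,
		   int can_scan)
{
	/* take ownership of the deferred work (atomic_long_xchg upstream) */
	long nr = nr_deferred;
	nr_deferred = 0;

	/* pressure proportional to the lru scanning that got us here */
	long delta = (4 * nr_scanned / 2) * freeable / (nr_eligible + 1);
	long total_scan = nr + delta;

	/* per-invocation clamp, as upstream applies */
	if (total_scan > freeable * 2)
		total_scan = freeable * 2;

	long scanned = can_scan ? total_scan : 0;

	/* the unscanned remainder is pushed back for the next caller */
	nr_deferred += total_scan - scanned;
	return scanned;
}

int main(void)
{
	int i;

	/* a storm of GFP_NOFS callers, none of which can do the work */
	for (i = 0; i < 1000; i++)
		shrink(512, 4096, 100000, 0);
	printf("nr_deferred after NOFS storm: %ld\n", nr_deferred);

	/* the first caller that can scan inherits the entire backlog */
	printf("that caller then scans: %ld objects\n",
	       shrink(512, 4096, 100000, 1));
	return 0;
}

The point being that every caller that cannot scan transfers its share
of the work, up to twice the cache size, onto whichever caller can, and
on our machines kswapd is too often not that caller.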
> Ugh. The per-node lru list count was designed to run unlocked and so
> avoid this sort of (known) scalability problem.
>
> Ah, see the difference between list_lru_count_node() and
> list_lru_count_one(). list_lru_count_one() should only take locks
> for memcg lookups if it is trying to shrink a memcg. That needs to
> be fixed before anything else and, if possible, the memcg lookup be
> made lockless....
>

We've done that as part of this fix, actually, by avoiding resizing of
these list_lrus when the number of memcg cache ids increases. We just
preallocate the maximum amount, MEMCG_CACHES_MAX_SIZE, so that reads
can be lockless, since the lock there is only needed to prevent
concurrent remapping.

> Yup, the memcg shrinking was shoe-horned into the per-node LRU
> infrastructure, and the high level accounting is completely unaware
> of the fact that memcgs have their own private LRUs. We left the
> windup in place because slab caches are shared, and it's possible
> that memory can't be freed because pages have objects from different
> memcgs pinning them. Hence we need to bleed at least some of that
> "we can't make progress" count back into the global "deferred
> reclaim" pool to get other contexts to do some reclaim.
>

Right. For now we've patched our kernel to avoid looking at the
nr_deferred count for SHRINKER_MEMCG_AWARE shrinkers, but that's
obviously a short-term solution, and I'm not sure we can spare the tax
of per-memcg per-node deferred counts. It seems some other metadata
would be needed to indicate excessive windup for slab shrinking that
cannot actually do any scanning in super_cache_scan().

Vladimir, do you have a suggestion, or is someone else working on this?
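To put a rough number on the tax I mentioned, here is the back of the
envelope I did, assuming the deferred counts would be preallocated to
the maximum memcg id per node per shrinker, the same way we now
preallocate the list_lru arrays (all three constants below are
illustrative guesses on my part, not values from any particular
config):

/* Back-of-envelope cost of per-memcg per-node nr_deferred counters.
 * Every constant here is an illustrative assumption. */
#include <stdio.h>

int main(void)
{
	long memcg_ids = 64 * 1024;	/* preallocated max memcg cache ids */
	long nodes = 2;			/* nr_node_ids on a two-socket box */
	long shrinkers = 100;		/* memcg-aware shrinkers, ~mounted sbs */
	long bytes = memcg_ids * nodes * shrinkers * (long)sizeof(long);

	printf("deferred counters alone: %ld MiB\n", bytes >> 20);
	return 0;
}

That prints 100 MiB before a single object has been shrunk, which is a
steep price for counters we already know are largely erroneous.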