On 02/14/2014 01:20 AM, Johannes Weiner wrote:
> On Thu, Feb 13, 2014 at 09:33:32PM +0400, Vladimir Davydov wrote:
>> On 02/13/2014 02:01 AM, Johannes Weiner wrote:
>>> On Wed, Feb 12, 2014 at 10:05:43PM +0400, Vladimir Davydov wrote:
>>>> On 02/12/2014 12:19 AM, Johannes Weiner wrote:
>>>>> On Tue, Feb 11, 2014 at 07:15:26PM +0400, Vladimir Davydov wrote:
>>>>>> Hi Michal, Johannes, David,
>>>>>>
>>>>>> Could you please take a look at this if you have time? Without your
>>>>>> review, it'll never get committed.
>>>>>
>>>>> There is simply no review bandwidth for new features as long as we
>>>>> are fixing fundamental bugs in memcg.
>>>>>
>>>>>> On 02/05/2014 10:39 PM, Vladimir Davydov wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> This is the 15th iteration of Glauber Costa's patch-set
>>>>>>> implementing slab shrinking on memcg pressure. The main idea is
>>>>>>> to make the list_lru structure used by most FS shrinkers
>>>>>>> per-memcg. When adding or removing an element from a list_lru,
>>>>>>> we use the page information to figure out which memcg it belongs
>>>>>>> to and relay it to the appropriate list. This allows scanning
>>>>>>> kmem objects accounted to different memcgs independently.
>>>>>>>
>>>>>>> Please note that this patch-set implements slab shrinking only
>>>>>>> when we hit the user memory limit, so kmem allocations will
>>>>>>> still fail if we are below the user memory limit but close to
>>>>>>> the kmem limit. I am going to fix this in a separate patch-set,
>>>>>>> but currently it is only worthwhile setting the kmem limit to be
>>>>>>> greater than the user mem limit just to enable per-memcg slab
>>>>>>> accounting and reclaim.
>>>>>>>
>>>>>>> The patch-set is based on top of v3.14-rc1-mmots-2014-02-04-16-48
>>>>>>> (there are some vmscan cleanups that I need committed there) and
>>>>>>> is organized as follows:
>>>>>>>  - patches 1-4 introduce some minor changes to memcg needed for
>>>>>>>    this set;
>>>>>>>  - patches 5-7 prepare fs for per-memcg list_lru;
>>>>>>>  - patch 8 implements the kmemcg reclaim core;
>>>>>>>  - patch 9 makes list_lru per-memcg and patch 10 marks the sb
>>>>>>>    shrinker memcg-aware;
>>>>>>>  - patch 11 is trivial - it issues shrinkers on memcg
>>>>>>>    destruction;
>>>>>>>  - patches 12 and 13 introduce shrinking of dead kmem caches to
>>>>>>>    facilitate memcg destruction.
>>>>>
>>>>> In the context of the ongoing discussions about charge reparenting
>>>>> I was curious how you deal with charges becoming unreclaimable
>>>>> after a memcg has been offlined.
>>>>>
>>>>> Patch #11 drops all charged objects at offlining by just invoking
>>>>> shrink_slab() in a loop until "only a few" (10) objects are
>>>>> remaining. How long is this going to take? And why is it okay to
>>>>> destroy these caches when somebody else might still be using them?
>>>>
>>>> IMHO, on container destruction we have to drop as many objects
>>>> accounted to this container as we can, because otherwise any
>>>> container will be able to get access to any number of unaccounted
>>>> objects by fetching them and then rebooting.
>>>
>>> They're accounted to and subject to the limit of the parent. I don't
>>> see how this is different from page cache.
>>>
>>>>> That still leaves you with the free objects that slab caches
>>>>> retain for allocation efficiency, so now you put all dead memcgs
>>>>> in the system on a global list, and on a vmpressure event on
>>>>> root_mem_cgroup you walk the global list and drain the freelists
>>>>> of all remaining caches.
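For reference, the list_lru routing described in the cover letter above
boils down to something like the sketch below. This is only an
illustration of the idea, not the actual patch: mem_cgroup_from_kmem()
and lru_node_of() are made-up helper names.

/*
 * Sketch only, not the actual patch: route a list_lru insertion to the
 * list of the memcg that owns the object's page.
 */
struct list_lru_node {
	spinlock_t		lock;
	struct list_head	list;
	long			nr_items;
};

bool memcg_list_lru_add(struct list_lru *lru, struct list_head *item)
{
	struct page *page = virt_to_head_page(item);
	struct mem_cgroup *memcg = mem_cgroup_from_kmem(page);	/* hypothetical */
	struct list_lru_node *nlru = lru_node_of(lru, memcg);	/* hypothetical */
	bool added = false;

	spin_lock(&nlru->lock);
	if (list_empty(item)) {			/* not on any list yet */
		list_add_tail(item, &nlru->list);
		nlru->nr_items++;
		added = true;
	}
	spin_unlock(&nlru->lock);
	return added;
}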
>>>>> This is a lot of complexity and scalability problems for less
>>>>> than desirable behavior.
>>>>>
>>>>> Please think about how we can properly reparent kmemcg charges
>>>>> during memcg teardown. That would simplify your code immensely and
>>>>> help clean up this unholy mess of css pinning.
>>>>>
>>>>> Slab caches are already collected in the memcg and on destruction
>>>>> could be reassigned to the parent. Kmemcg uncharge from slab
>>>>> freeing would have to be updated to use the memcg from the cache,
>>>>> not from the individual page, but I don't see why this wouldn't
>>>>> work right now.
>>>>
>>>> I don't think I understand what you mean by reassigning slab caches
>>>> to the parent.
>>>>
>>>> If you mean moving all pages (slabs) from the cache of the memcg
>>>> being destroyed to the corresponding root cache (or the parent
>>>> memcg's cache) and then destroying the memcg's cache, I don't think
>>>> this is feasible, because slub's free fast path is lockless, so
>>>> AFAIU we can't remove a partial slab from a cache w/o risking a
>>>> race with kmem_cache_free.
>>>>
>>>> If you mean clearing all pointers from the memcg's cache to the
>>>> memcg (changing them to the parent or root memcg), then AFAIU this
>>>> won't solve the problem with "dangling" caches - we will still have
>>>> to shrink them on vmpressure. So although this would allow us to
>>>> put the memcg reference held by kmem caches on the memcg's death,
>>>> it wouldn't simplify the code at all; in fact, it would even make
>>>> it more complicated, because we would have to handle various corner
>>>> cases like reparenting vs list_lru_{add,remove}.
>>>
>>> I think we have different concepts of what's complicated. There is
>>> an existing model of what to do with left-over cache memory when a
>>> cgroup is destroyed, which is reparenting. The rough steps will be
>>> the same, the object lifetime will be the same, the css refcounting
>>> will be the same, the user-visible behavior will be the same. Any
>>> complexity from charge vs. reparent races will be contained to a few
>>> lines of code.
>>>
>>> Weird refcounting tricks during offline, trashing kmem caches
>>> instead of moving them to the parent like other memory, a global
>>> list of dead memcgs and sudden freelist thrashing on a vmpressure
>>> event - that's what adds complexity and what makes this code
>>> unpredictable, fragile, and insanely hard to work with. It's not
>>> acceptable.
>>>
>>> By reparenting I meant reassigning the memcg cache parameter pointer
>>> from the slab cache such that it points to the parent. This should
>>> be an atomic operation. All css lookups already require RCU (I think
>>> slab does not follow this yet because we hold that css reference,
>>> but it should be changed). So switch the cache param pointer, insert
>>> an RCU grace period to wait for all the ongoing charges and
>>> uncharges until nobody sees the memcg anymore, then safely reparent
>>> all the remaining memcg objects to the parent. Maybe individually,
>>> maybe we can just splice the lists to the parent's list_lru lists.
>>
>> But what should we do with objects that do not reside on any
>> list_lru? How can we reparent them?
>
> If there are no actual list_lru objects, we only have to make sure
> that any allocations are properly uncharged against the parent when
> they get freed later.
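To make sure I understand the pointer switch you describe above, it
would be something like this sketch (assuming the backpointer gets an
__rcu annotation, which it doesn't have today):

/*
 * Sketch only, not settled code: repoint a dying memcg's cache at the
 * parent, then wait an RCU grace period so that charge/uncharge paths
 * that might still see the old memcg finish before we touch the
 * remaining objects.
 */
static void kmem_cache_reparent(struct kmem_cache *s,
				struct mem_cgroup *parent)
{
	rcu_assign_pointer(s->memcg_params->memcg, parent);
	synchronize_rcu();
	/* nobody can see the dying memcg through this cache anymore */
}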
> If the slab freeing path uncharged against the per-memcg cache's
> backpointer (s->memcg_params->memcg) instead of the per-page memcg
> association, then we could reparent whole caches with a single
> pointer update, without touching each individual slab page. The
> kmemcg interface for slab would have to be reworked to not use
> lookup_page_cgroup() but have slab pass s->memcg_params->memcg.
>
> Once that is in place, css_free() can walk memcg->memcg_slab_caches
> and move all the items to the parent's memcg_slab_caches, and while
> doing that change the memcg pointer of each item,
> memcg_params->memcg, to point to the parent. The cache is now owned
> by the parent and will stay alive until the last page is freed.
>
> There won't be any new allocations in these caches because there are
> no tasks in the group anymore, so there are no races from that side,
> and it's the perfect time to shrink the freelists.
>
> There will be racing frees of outstanding allocations, but we can
> deal with that. Frees use page->slab_cache to look up the proper
> per-memcg cache, which may or may not have been reparented at this
> time. If it has not, the cache's memcg backpointer
> (s->memcg_params->memcg) will point to the dying child (RCU
> protected), and the free will uncharge both the child's res_counter
> and the parent's. If the backpointer IS already pointing to the
> parent, the free will uncharge the parent's res_counter without the
> child's, but we don't care; it's dying anyway.
>
> If there are no list_lru-tracked objects, we are done at this point.
> The css can be freed, the freelists have been purged, and any pages
> still in the cache will get uncharged properly from the parent.
>
> If there are list_lru objects, they have to be moved to the parent's
> list_lru so that they can be reclaimed properly on memory pressure.
>
> Does this make sense?
Definitely. Thank you very much for such a detailed explanation! I
guess I'll try to implement this.
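Just to check my understanding, the css_free() walk you describe would
look roughly like this; the mutex name and the exact list handling are
my guesses, not the real code:

/*
 * Rough, untested sketch of the css_free() walk described above: move
 * every per-memcg cache to the parent's list and switch its
 * backpointer. memcg_slab_mutex is an assumed lock name.
 */
static void memcg_reparent_slab_caches(struct mem_cgroup *memcg,
				       struct mem_cgroup *parent)
{
	struct memcg_cache_params *params, *tmp;

	mutex_lock(&memcg_slab_mutex);
	list_for_each_entry_safe(params, tmp, &memcg->memcg_slab_caches,
				 list) {
		params->memcg = parent;
		list_move(&params->list, &parent->memcg_slab_caches);
	}
	mutex_unlock(&memcg_slab_mutex);
	synchronize_rcu();	/* racing frees may still see the child */
}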