Re: [PATCH -mm 2/2] memcg: fix css reference leak and endless loop in mem_cgroup_iter
On Thu, 23 Jan 2014, Michal Hocko wrote: > On Thu 23-01-14 02:42:58, Hugh Dickins wrote: > > > > > > Actually both patches are needed. If we had only 2/2 then we wouldn't > > > endless loop inside mem_cgroup_iter but we could still return root to > > > caller all the time because mem_cgroup_iter_load would return NULL on > > > css_tryget failure on the cached root. Or am I missing something that > > > would prevent that? > > > > In theory I agree with you; and if you're missing something, I can't see > > it either. But in practice, all my earlier testing of 3.12 and 3.13 was > > just with 2/2, and I've tried without your 1/2 since: whereas I have hung > > on 3.12 and 3.13 a convincing number of times without 2/2, I have never > > hung on them with 2/2 without 1/2. In practice 1/2 appears essential > > for 3.10 and 3.11, but irrelevant for 3.12 and 3.13. > > > > That could be easy to explain if there were a difference at the calling > > end, shrink_zone(), between those releases: but I don't see that. Odd. > > Either we're both missing something, or my testing is even less reliable > > than I'd thought. But since I certainly don't dispute 1/2, it is merely > > academic. Though still bothersome. > > I would assume that it is (sc->nr_reclaimed >= sc->nr_to_reclaim) that > helps us to back off. SWAP_CLUSTER_MAX shouldn't be that hard to get to > before css_offline racing part reparents all the memory. But wouldn't explain why I could see it on 3.10,11 but not on 3.12,13. Perhaps the 2/2 problem is a lot easier to hit than the 1/2 problem, and I mistakenly expected to see the 1/2 problem in the timescale that I saw the 2/2 problem; but I don't really think either is the case. > > Anyway, I would feel safer if this was pushed fixed although you haven't > reporoduced it. Absolutely. > > Before Andrew sends these all off to Linus, I should admit that there's > > probably a refinement still to come to the CSS_ONLINE one. I'm ashamed > > to admit that I overlooked a much earlier comment from Greg Thelen, who > > suggested that a memory barrier might be required. > > I was thinking about mem barrier while reviewing your patch but then I > convinced myself that we should be safe also without using one when > checking CSS_ONLINE. > We have basically two situations. > - online_css when we can miss it being set which is OK because > we would miss a new empty group. > - offline_css when we could still see the flag being set but > then css_tryget would be already failing. > > So while all this is subtle and relies on cgroup core heavily I think we > should be safe wrt. memory barriers. > > Or did you mean something else here? Something else. My CSS_OFFLINE patch claims to prevent the iterator from returning an uninitialized struct mem_cgroup: if that is to be relied upon, then it needs to make sure that the initialization of the mem_cgroup is visible to the caller before the CSS_OFFLINE flag. kernel/cgroup.c online_css() does nowadays have an smp_wmb() buried in its rcu_assign_pointer(); but it's not in the right place to make this particular guarantee. And an smp_rmb() needed somewhere too, if it doesn't already come for free somehow. Hugh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm 2/2] memcg: fix css reference leak and endless loop in mem_cgroup_iter
On Thu 23-01-14 02:42:58, Hugh Dickins wrote: > On Wed, 22 Jan 2014, Michal Hocko wrote: > > On Tue 21-01-14 13:18:42, Hugh Dickins wrote: > > [...] > > > We do have a confusing situation. The hang goes back to 3.10 but takes > > > two different forms, because of intervening changes: in 3.10 and 3.11 > > > mem_cgroup_iter repeatedly returns root memcg to its caller, in 3.12 and > > > 3.13 mem_cgroup_iter repeatedly gets NULL memcg from mem_cgroup_iter_next > > > and cannot return to its caller. > > > > > > Patch 1/2 is what's needed to fix 3.10 and 3.11 (and applies correctly > > > to 3.11, but will have to be rediffed for 3.10 because of rearrangement > > > in between). > > > > I will backport it when it reaches stable queue. > > Thank you. > > Testing, at home and at work, has confirmed your two patches are good. Great. Thanks a lot! > Greg's particular test on 3.11 gave convincing results, and I was helped > by Linus's tree of last night (df32e43a54d0) happening to be quite quick > to reproduce the issue on my laptop. > > > > > > Patch 2/2 is what's needed to fix 3.12 and 3.13 (but applies > > > correctly to neither of them because it's diffed on top of my CSS_ONLINE > > > fix). Patch 1/2 is correct but unnecessary in 3.12 and 3.13: I'm unclear > > > whether Michal is claiming that it would also fix the hang in 3.12 and > > > 3.13 if we didn't have 2/2: I doubt that, and haven't tested that. > > > > Actually both patches are needed. If we had only 2/2 then we wouldn't > > endless loop inside mem_cgroup_iter but we could still return root to > > caller all the time because mem_cgroup_iter_load would return NULL on > > css_tryget failure on the cached root. Or am I missing something that > > would prevent that? > > In theory I agree with you; and if you're missing something, I can't see > it either. But in practice, all my earlier testing of 3.12 and 3.13 was > just with 2/2, and I've tried without your 1/2 since: whereas I have hung > on 3.12 and 3.13 a convincing number of times without 2/2, I have never > hung on them with 2/2 without 1/2. In practice 1/2 appears essential > for 3.10 and 3.11, but irrelevant for 3.12 and 3.13. > > That could be easy to explain if there were a difference at the calling > end, shrink_zone(), between those releases: but I don't see that. Odd. > Either we're both missing something, or my testing is even less reliable > than I'd thought. But since I certainly don't dispute 1/2, it is merely > academic. Though still bothersome. I would assume that it is (sc->nr_reclaimed >= sc->nr_to_reclaim) that helps us to back off. SWAP_CLUSTER_MAX shouldn't be that hard to get to before css_offline racing part reparents all the memory. Anyway, I would feel safer if this was pushed fixed although you haven't reporoduced it. > > > Given how Michal has diffed this patch on top of my CSS_ONLINE one > > > (mm-memcg-iteration-skip-memcgs-not-yet-fully-initialized.patch), > > > it would be helpful if you could mark that one also for stable 3.12+, > > > to save us from having to rediff this one for stable. We don't have > > > a concrete example of a problem it solves in the vanilla kernel, but > > > it makes more sense to include it than to exclude it. > > > > Yes, I think it makes sense to queue it for 3.12+ as well because it is > > non intrusive and potential issues would be really subtle. > > Before Andrew sends these all off to Linus, I should admit that there's > probably a refinement still to come to the CSS_ONLINE one. I'm ashamed > to admit that I overlooked a much earlier comment from Greg Thelen, who > suggested that a memory barrier might be required. I was thinking about mem barrier while reviewing your patch but then I convinced myself that we should be safe also without using one when checking CSS_ONLINE. We have basically two situations. - online_css when we can miss it being set which is OK because we would miss a new empty group. - offline_css when we could still see the flag being set but then css_tryget would be already failing. So while all this is subtle and relies on cgroup core heavily I think we should be safe wrt. memory barriers. Or did you mean something else here? > I think he's right, and I'd have liked to say exactly what and where > before answering you now; but barriers are tricky elusive things, I've > not yet decided, and better report back to you now on the testing > result. > > Hugh Thanks! -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm 2/2] memcg: fix css reference leak and endless loop in mem_cgroup_iter
On Wed, 22 Jan 2014, Michal Hocko wrote: > On Tue 21-01-14 13:18:42, Hugh Dickins wrote: > [...] > > We do have a confusing situation. The hang goes back to 3.10 but takes > > two different forms, because of intervening changes: in 3.10 and 3.11 > > mem_cgroup_iter repeatedly returns root memcg to its caller, in 3.12 and > > 3.13 mem_cgroup_iter repeatedly gets NULL memcg from mem_cgroup_iter_next > > and cannot return to its caller. > > > > Patch 1/2 is what's needed to fix 3.10 and 3.11 (and applies correctly > > to 3.11, but will have to be rediffed for 3.10 because of rearrangement > > in between). > > I will backport it when it reaches stable queue. Thank you. Testing, at home and at work, has confirmed your two patches are good. Greg's particular test on 3.11 gave convincing results, and I was helped by Linus's tree of last night (df32e43a54d0) happening to be quite quick to reproduce the issue on my laptop. > > > Patch 2/2 is what's needed to fix 3.12 and 3.13 (but applies > > correctly to neither of them because it's diffed on top of my CSS_ONLINE > > fix). Patch 1/2 is correct but unnecessary in 3.12 and 3.13: I'm unclear > > whether Michal is claiming that it would also fix the hang in 3.12 and > > 3.13 if we didn't have 2/2: I doubt that, and haven't tested that. > > Actually both patches are needed. If we had only 2/2 then we wouldn't > endless loop inside mem_cgroup_iter but we could still return root to > caller all the time because mem_cgroup_iter_load would return NULL on > css_tryget failure on the cached root. Or am I missing something that > would prevent that? In theory I agree with you; and if you're missing something, I can't see it either. But in practice, all my earlier testing of 3.12 and 3.13 was just with 2/2, and I've tried without your 1/2 since: whereas I have hung on 3.12 and 3.13 a convincing number of times without 2/2, I have never hung on them with 2/2 without 1/2. In practice 1/2 appears essential for 3.10 and 3.11, but irrelevant for 3.12 and 3.13. That could be easy to explain if there were a difference at the calling end, shrink_zone(), between those releases: but I don't see that. Odd. Either we're both missing something, or my testing is even less reliable than I'd thought. But since I certainly don't dispute 1/2, it is merely academic. Though still bothersome. > > > Given how Michal has diffed this patch on top of my CSS_ONLINE one > > (mm-memcg-iteration-skip-memcgs-not-yet-fully-initialized.patch), > > it would be helpful if you could mark that one also for stable 3.12+, > > to save us from having to rediff this one for stable. We don't have > > a concrete example of a problem it solves in the vanilla kernel, but > > it makes more sense to include it than to exclude it. > > Yes, I think it makes sense to queue it for 3.12+ as well because it is > non intrusive and potential issues would be really subtle. Before Andrew sends these all off to Linus, I should admit that there's probably a refinement still to come to the CSS_ONLINE one. I'm ashamed to admit that I overlooked a much earlier comment from Greg Thelen, who suggested that a memory barrier might be required. I think he's right, and I'd have liked to say exactly what and where before answering you now; but barriers are tricky elusive things, I've not yet decided, and better report back to you now on the testing result. Hugh > > > (You would be right to point out that the CSS_ONLINE one fixes > > something that goes back to 3.10: I'm saying 3.12+ because I'm not > > motivated to rediff it for 3.10 and 3.11 when there's nothing to > > go on top; but that's not a very good reason to lie - overrule me.) > > > > Hugh > > -- > Michal Hocko > SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm 2/2] memcg: fix css reference leak and endless loop in mem_cgroup_iter
On Tue 21-01-14 13:18:42, Hugh Dickins wrote: [...] > We do have a confusing situation. The hang goes back to 3.10 but takes > two different forms, because of intervening changes: in 3.10 and 3.11 > mem_cgroup_iter repeatedly returns root memcg to its caller, in 3.12 and > 3.13 mem_cgroup_iter repeatedly gets NULL memcg from mem_cgroup_iter_next > and cannot return to its caller. > > Patch 1/2 is what's needed to fix 3.10 and 3.11 (and applies correctly > to 3.11, but will have to be rediffed for 3.10 because of rearrangement > in between). I will backport it when it reaches stable queue. > Patch 2/2 is what's needed to fix 3.12 and 3.13 (but applies > correctly to neither of them because it's diffed on top of my CSS_ONLINE > fix). Patch 1/2 is correct but unnecessary in 3.12 and 3.13: I'm unclear > whether Michal is claiming that it would also fix the hang in 3.12 and > 3.13 if we didn't have 2/2: I doubt that, and haven't tested that. Actually both patches are needed. If we had only 2/2 then we wouldn't endless loop inside mem_cgroup_iter but we could still return root to caller all the time because mem_cgroup_iter_load would return NULL on css_tryget failure on the cached root. Or am I missing something that would prevent that? > Given how Michal has diffed this patch on top of my CSS_ONLINE one > (mm-memcg-iteration-skip-memcgs-not-yet-fully-initialized.patch), > it would be helpful if you could mark that one also for stable 3.12+, > to save us from having to rediff this one for stable. We don't have > a concrete example of a problem it solves in the vanilla kernel, but > it makes more sense to include it than to exclude it. Yes, I think it makes sense to queue it for 3.12+ as well because it is non intrusive and potential issues would be really subtle. > (You would be right to point out that the CSS_ONLINE one fixes > something that goes back to 3.10: I'm saying 3.12+ because I'm not > motivated to rediff it for 3.10 and 3.11 when there's nothing to > go on top; but that's not a very good reason to lie - overrule me.) > > Hugh -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm 2/2] memcg: fix css reference leak and endless loop in mem_cgroup_iter
On Tue 21-01-14 11:42:19, Andrew Morton wrote: > On Tue, 21 Jan 2014 11:45:43 +0100 Michal Hocko wrote: > > > 19f39402864e (memcg: simplify mem_cgroup_iter) has reorganized > > mem_cgroup_iter code in order to simplify it. A part of that change was > > dropping an optimization which didn't call css_tryget on the root of > > the walked tree. The patch however didn't change the css_put part in > > mem_cgroup_iter which excludes root. > > This wasn't an issue at the time because __mem_cgroup_iter_next bailed > > out for root early without taking a reference as cgroup iterators > > (css_next_descendant_pre) didn't visit root themselves. > > > > Nevertheless cgroup iterators have been reworked to visit root by > > bd8815a6d802 (cgroup: make css_for_each_descendant() and friends include > > the origin css in the iteration) when the root bypass have been dropped > > in __mem_cgroup_iter_next. This means that css_put is not called for > > root and so css along with mem_cgroup and other cgroup internal object > > tied by css lifetime are never freed. > > > > Fix the issue by reintroducing root check in __mem_cgroup_iter_next > > and do not take css reference for it. > > > > This reference counting magic protects us also from another issue, an > > endless loop reported by Hugh Dickins when reclaim races with root > > removal and css_tryget called by iterator internally would fail. There > > would be no other nodes to visit so __mem_cgroup_iter_next would return > > NULL and mem_cgroup_iter would interpret it as "start looping from root > > again" and so mem_cgroup_iter would loop forever internally. > > I grabbed these two patches but I will sit on them for a week or so, > pending review-n-test. Yes, there is no rush and this needs a proper review. > > Cc: sta...@vger.kernel.org # mem_leak part 3.12+ > > What does this mean? Dohh. I had both patches in one but then I decided to split it. This is just left over. Should be 3.12+. Sorry about the confusion. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm 2/2] memcg: fix css reference leak and endless loop in mem_cgroup_iter
On Tue, 21 Jan 2014, Andrew Morton wrote: > On Tue, 21 Jan 2014 11:45:43 +0100 Michal Hocko wrote: > > > 19f39402864e (memcg: simplify mem_cgroup_iter) has reorganized > > mem_cgroup_iter code in order to simplify it. A part of that change was > > dropping an optimization which didn't call css_tryget on the root of > > the walked tree. The patch however didn't change the css_put part in > > mem_cgroup_iter which excludes root. > > This wasn't an issue at the time because __mem_cgroup_iter_next bailed > > out for root early without taking a reference as cgroup iterators > > (css_next_descendant_pre) didn't visit root themselves. > > > > Nevertheless cgroup iterators have been reworked to visit root by > > bd8815a6d802 (cgroup: make css_for_each_descendant() and friends include > > the origin css in the iteration) when the root bypass have been dropped > > in __mem_cgroup_iter_next. This means that css_put is not called for > > root and so css along with mem_cgroup and other cgroup internal object > > tied by css lifetime are never freed. > > > > Fix the issue by reintroducing root check in __mem_cgroup_iter_next > > and do not take css reference for it. > > > > This reference counting magic protects us also from another issue, an > > endless loop reported by Hugh Dickins when reclaim races with root > > removal and css_tryget called by iterator internally would fail. There > > would be no other nodes to visit so __mem_cgroup_iter_next would return > > NULL and mem_cgroup_iter would interpret it as "start looping from root > > again" and so mem_cgroup_iter would loop forever internally. > > I grabbed these two patches but I will sit on them for a week or so, > pending review-n-test. Thank you, yes, I'm about to give them more testing. > > > Cc: sta...@vger.kernel.org # mem_leak part 3.12+ > > What does this mean? It's certainly a confusing comment. I suggest just deleting the "mem_leak part ": Michal isn't referring to any two parts of the patch itself, but to parts of his commit comment; but it's still unclear what he's claiming. We do have a confusing situation. The hang goes back to 3.10 but takes two different forms, because of intervening changes: in 3.10 and 3.11 mem_cgroup_iter repeatedly returns root memcg to its caller, in 3.12 and 3.13 mem_cgroup_iter repeatedly gets NULL memcg from mem_cgroup_iter_next and cannot return to its caller. Patch 1/2 is what's needed to fix 3.10 and 3.11 (and applies correctly to 3.11, but will have to be rediffed for 3.10 because of rearrangement in between). Patch 2/2 is what's needed to fix 3.12 and 3.13 (but applies correctly to neither of them because it's diffed on top of my CSS_ONLINE fix). Patch 1/2 is correct but unnecessary in 3.12 and 3.13: I'm unclear whether Michal is claiming that it would also fix the hang in 3.12 and 3.13 if we didn't have 2/2: I doubt that, and haven't tested that. Given how Michal has diffed this patch on top of my CSS_ONLINE one (mm-memcg-iteration-skip-memcgs-not-yet-fully-initialized.patch), it would be helpful if you could mark that one also for stable 3.12+, to save us from having to rediff this one for stable. We don't have a concrete example of a problem it solves in the vanilla kernel, but it makes more sense to include it than to exclude it. (You would be right to point out that the CSS_ONLINE one fixes something that goes back to 3.10: I'm saying 3.12+ because I'm not motivated to rediff it for 3.10 and 3.11 when there's nothing to go on top; but that's not a very good reason to lie - overrule me.) Hugh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm 2/2] memcg: fix css reference leak and endless loop in mem_cgroup_iter
On Tue, 21 Jan 2014 11:45:43 +0100 Michal Hocko wrote: > 19f39402864e (memcg: simplify mem_cgroup_iter) has reorganized > mem_cgroup_iter code in order to simplify it. A part of that change was > dropping an optimization which didn't call css_tryget on the root of > the walked tree. The patch however didn't change the css_put part in > mem_cgroup_iter which excludes root. > This wasn't an issue at the time because __mem_cgroup_iter_next bailed > out for root early without taking a reference as cgroup iterators > (css_next_descendant_pre) didn't visit root themselves. > > Nevertheless cgroup iterators have been reworked to visit root by > bd8815a6d802 (cgroup: make css_for_each_descendant() and friends include > the origin css in the iteration) when the root bypass have been dropped > in __mem_cgroup_iter_next. This means that css_put is not called for > root and so css along with mem_cgroup and other cgroup internal object > tied by css lifetime are never freed. > > Fix the issue by reintroducing root check in __mem_cgroup_iter_next > and do not take css reference for it. > > This reference counting magic protects us also from another issue, an > endless loop reported by Hugh Dickins when reclaim races with root > removal and css_tryget called by iterator internally would fail. There > would be no other nodes to visit so __mem_cgroup_iter_next would return > NULL and mem_cgroup_iter would interpret it as "start looping from root > again" and so mem_cgroup_iter would loop forever internally. I grabbed these two patches but I will sit on them for a week or so, pending review-n-test. > Cc: sta...@vger.kernel.org # mem_leak part 3.12+ What does this mean? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm 2/2] memcg: fix css reference leak and endless loop in mem_cgroup_iter
19f39402864e (memcg: simplify mem_cgroup_iter) has reorganized mem_cgroup_iter code in order to simplify it. A part of that change was dropping an optimization which didn't call css_tryget on the root of the walked tree. The patch however didn't change the css_put part in mem_cgroup_iter which excludes root. This wasn't an issue at the time because __mem_cgroup_iter_next bailed out for root early without taking a reference as cgroup iterators (css_next_descendant_pre) didn't visit root themselves. Nevertheless cgroup iterators have been reworked to visit root by bd8815a6d802 (cgroup: make css_for_each_descendant() and friends include the origin css in the iteration) when the root bypass have been dropped in __mem_cgroup_iter_next. This means that css_put is not called for root and so css along with mem_cgroup and other cgroup internal object tied by css lifetime are never freed. Fix the issue by reintroducing root check in __mem_cgroup_iter_next and do not take css reference for it. This reference counting magic protects us also from another issue, an endless loop reported by Hugh Dickins when reclaim races with root removal and css_tryget called by iterator internally would fail. There would be no other nodes to visit so __mem_cgroup_iter_next would return NULL and mem_cgroup_iter would interpret it as "start looping from root again" and so mem_cgroup_iter would loop forever internally. Cc: sta...@vger.kernel.org # mem_leak part 3.12+ Reported-by: Hugh Dickins Signed-off-by: Michal Hocko --- mm/memcontrol.c | 18 +- 1 file changed, 13 insertions(+), 5 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 45786dc129dc..55bb1f8c6907 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1076,14 +1076,22 @@ skip_node: * skipped and we should continue the tree walk. * last_visited css is safe to use because it is * protected by css_get and the tree walk is rcu safe. +* +* We do not take a reference on the root of the tree walk +* because we might race with the root removal when it would +* be the only node in the iterated hierarchy and mem_cgroup_iter +* would end up in an endless loop because it expects that at +* least one valid node will be returned. Root cannot disappear +* because caller of the iterator should hold it already so +* skipping css reference should be safe. */ if (next_css) { - if ((next_css->flags & CSS_ONLINE) && css_tryget(next_css)) + if ((next_css->flags & CSS_ONLINE) && + (next_css == &root->css || css_tryget(next_css))) return mem_cgroup_from_css(next_css); - else { - prev_css = next_css; - goto skip_node; - } + + prev_css = next_css; + goto skip_node; } return NULL; -- 1.8.5.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/