Re: Memory hotplug softlock issue
On Wed, 21 Nov 2018, Michal Hocko wrote:

> On Mon 19-11-18 21:44:41, Hugh Dickins wrote: [...]

> > [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
> >
> > We have all assumed that it is essential to hold a page reference while waiting on a page lock: partly to guarantee that there is still a struct page when MEMORY_HOTREMOVE is configured, but also to protect against reuse of the struct page going to someone who then holds the page locked indefinitely, when the waiter can reasonably expect timely unlocking.

> I would add the following for the "problem statement". Feel free to reuse per your preference:
>
> "An elevated reference count, however, stands in the way of migration and forces it to fail with a bad timing. This is especially a problem for memory offlining which retries forever (or until the operation is terminated from userspace) because a heavy refault workload can trigger essentially an endless loop of migration failures. Therefore __migration_entry_wait is essentially harmful for the event it is waiting for."

Okay, I do have a lot written from way back when I prepared the now-abandoned migration_waitqueue patch internally, but I'll factor in what you say above when I get there - in particular, you highlight the memory offlining aspect, as in this mailthread: which is very helpful, because it's outside my experience so I won't have mentioned it - thanks. I just know that there's some important linkage to do, to the August 2017 WQ_FLAG_BOOKMARK discussion: so it's a research and editing job I have to work myself up to at the right moment.

> > But in fact, so long as wait_on_page_bit_common() does the put_page(), and is careful not to rely on struct page contents thereafter, there is no need to hold a reference to the page while waiting on it. That does mean that this case cannot go back through the loop: but that's fine for the page migration case, and even if used more widely, is limited by the "Stop walking if it's locked" optimization in wake_page_function().

> I would appreciate it if this were more explicit that the elevated-ref-count problem still exists, but is reduced to a tiny time window compared to the whole time the waiter is blocked. So a great improvement.

Fair enough, I'll do so. (But that's a bit like when we say we've attached something and then forget to do so: please check that I've been honest when I do post.)

> > Add interface put_and_wait_on_page_locked() to do this, using negative value of the lock arg to wait_on_page_bit_common() to implement it. No interruptible or killable variant needed yet, but they might follow: I have a vague notion that reporting -EINTR should take precedence over return from wait_on_page_bit_common() without knowing the page state, so arrange it accordingly - but that may be nothing but pedantic.
> >
> > shrink_page_list()'s __ClearPageLocked(): that was a surprise!

> and I can imagine a bad one. Do we really have to be so clever here? The unlock_page went away in the name of performance (a978d6f521063) and I would argue that this is a slow path where this is just not worth it.

Do we really have to be so clever here? That's a good question: now we have PG_waiters, we probably do not need to bother with this cleverness, and it would save me from having to expand on that comment as I was asked. I'll try going back to a simple unlock_page() there: and can always restore the __ClearPageLocked if a reviewer demands, or 0-day notices a regression.

> > this survived a lot of testing before that showed up. It does raise the question: should is_page_cache_freeable() and __remove_mapping() now treat a PG_waiters page as if an extra reference were held? Perhaps, but I don't think it matters much, since shrink_page_list() already had to win its trylock_page(), so waiters are not very common there: I noticed no difference when trying the bigger change, and it's surely not needed while put_and_wait_on_page_locked() is only for page migration.
> >
> > Signed-off-by: Hugh Dickins

> The patch looks good to me - quite ugly but it doesn't make the existing code much worse.
>
> With the problem Vlastimil described fixed, feel free to add
> Acked-by: Michal Hocko

Thanks!

> And thanks for a prompt patch. This is something I've been chasing for quite some time. __migration_entry_wait came to my radar only recently because this is an extremely volatile area.

You are very gracious to describe a patch promised six months ago as "prompt". But it does help me a lot to have it fixing a real problem for someone (thank you Baoquan) - well, it fixed a real problem for us internally too, but very nice to gather more backing for it like this.

Hugh
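For reference, the simpler alternative Hugh agrees to try would look roughly like the sketch below. This is an illustration only: the surrounding shrink_page_list() context is heavily abbreviated, and the exact placement is an assumption based on the commit a978d6f521063 mentioned above.

	/*
	 * Sketch: once __remove_mapping() has frozen the refcount and
	 * taken the page out of its mapping, drop PG_locked with a full
	 * unlock_page() instead of the non-atomic __ClearPageLocked()
	 * fast path, so that any PG_waiters (including
	 * put_and_wait_on_page_locked() callers that no longer hold a
	 * reference) are woken promptly.
	 */
	if (!mapping || !__remove_mapping(mapping, page, true))
		goto keep_locked;

	/*
	 * a978d6f521063 replaced unlock_page() here with
	 * __ClearPageLocked() as an optimization; restoring
	 * unlock_page() trades a little performance on this slow
	 * path for correct waiter wakeup.
	 */
	unlock_page(page);	/* was: __ClearPageLocked(page); */
	goto free_it;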
Re: Memory hotplug softlock issue
On Mon 19-11-18 21:44:41, Hugh Dickins wrote: [...]

> [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
>
> We have all assumed that it is essential to hold a page reference while waiting on a page lock: partly to guarantee that there is still a struct page when MEMORY_HOTREMOVE is configured, but also to protect against reuse of the struct page going to someone who then holds the page locked indefinitely, when the waiter can reasonably expect timely unlocking.

I would add the following for the "problem statement". Feel free to reuse per your preference:

"An elevated reference count, however, stands in the way of migration and forces it to fail with a bad timing. This is especially a problem for memory offlining which retries forever (or until the operation is terminated from userspace) because a heavy refault workload can trigger essentially an endless loop of migration failures. Therefore __migration_entry_wait is essentially harmful for the event it is waiting for."

> But in fact, so long as wait_on_page_bit_common() does the put_page(), and is careful not to rely on struct page contents thereafter, there is no need to hold a reference to the page while waiting on it. That does mean that this case cannot go back through the loop: but that's fine for the page migration case, and even if used more widely, is limited by the "Stop walking if it's locked" optimization in wake_page_function().

I would appreciate it if this were more explicit that the elevated-ref-count problem still exists, but is reduced to a tiny time window compared to the whole time the waiter is blocked. So a great improvement.

> Add interface put_and_wait_on_page_locked() to do this, using negative value of the lock arg to wait_on_page_bit_common() to implement it. No interruptible or killable variant needed yet, but they might follow: I have a vague notion that reporting -EINTR should take precedence over return from wait_on_page_bit_common() without knowing the page state, so arrange it accordingly - but that may be nothing but pedantic.
>
> shrink_page_list()'s __ClearPageLocked(): that was a surprise!

and I can imagine a bad one. Do we really have to be so clever here? The unlock_page went away in the name of performance (a978d6f521063) and I would argue that this is a slow path where this is just not worth it.

> this survived a lot of testing before that showed up. It does raise the question: should is_page_cache_freeable() and __remove_mapping() now treat a PG_waiters page as if an extra reference were held? Perhaps, but I don't think it matters much, since shrink_page_list() already had to win its trylock_page(), so waiters are not very common there: I noticed no difference when trying the bigger change, and it's surely not needed while put_and_wait_on_page_locked() is only for page migration.
>
> Signed-off-by: Hugh Dickins

The patch looks good to me - quite ugly but it doesn't make the existing code much worse.

With the problem Vlastimil described fixed, feel free to add
Acked-by: Michal Hocko

And thanks for a prompt patch. This is something I've been chasing for quite some time. __migration_entry_wait came to my radar only recently because this is an extremely volatile area.

--
Michal Hocko
SUSE Labs
Re: Memory hotplug softlock issue
On Tue, 20 Nov 2018, Hugh Dickins wrote:

> On Tue, 20 Nov 2018, Vlastimil Babka wrote:

> > > 	finish_wait(q, wait);
> >
> > ... the code continues by:
> >
> > 	if (thrashing) {
> > 		if (!PageSwapBacked(page))
> >
> > So maybe we should not set 'thrashing' true when lock < 0?

> Very good catch, thank you Vlastimil: as you might have guessed, the patch from a pre-PSI kernel applied cleanly, and I just hadn't reviewed the surrounding context properly before sending out.
>
> I cannot say immediately what the right answer is; I'll have to do some research first: maybe not enter the block that sets thrashing true when lock < 0, as you suggest, or maybe force lock < 0 to 0 and put_page() afterwards, or...

... I still won't adjust the patch tonight, but the answer is obvious now I look closer: as you show in your extract above, the only thing it does with "page" at the end is to ask if it was SwapBacked, so we just need to set one more bool at the beginning to check at the end (or I could make "thrashing" a -1, 0, 1 int like "lock": but my guess is that that would not be to other people's taste: acceptable for the arg, but stretching your patience for the local variable).

By the way, I do have a further patch to wait_on_page_bit_common(), which I could send at the same time, if it sounds right to you (but it's a no-op in the put_and_wait page migration case). That __add_wait_queue_entry_tail() is right for the first time into the loop, but I maintain that it should use __add_wait_queue() for fairness thereafter, to avoid repeatedly sending older waiters back to the back of the queue. I don't have hard numbers for it, but it's one of several patches, each of which helped to reduce our wait_on_page_bit lockups in some (perhaps unrealistic) stress tests.

Hugh
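For concreteness, the one-more-bool fix Hugh describes would look roughly like the sketch below, set against a PSI-era wait_on_page_bit_common(). The shape follows his description, but treat the details (in particular the name of the extra flag) as illustrative assumptions, not the final patch.

	/*
	 * Sketch: remember "was this a swap-backed page?" in a local
	 * before the wait loop, because with lock < 0 the page reference
	 * has been dropped and page->flags must not be touched after
	 * the loop.
	 */
	unsigned long pflags;
	bool thrashing = false;
	bool delayacct = false;	/* illustrative name for the extra bool */

	if (bit_nr == PG_locked && !PageUptodate(page) && PageWorkingset(page)) {
		if (!PageSwapBacked(page)) {
			delayacct_thrashing_start();
			delayacct = true;
		}
		psi_memstall_enter(&pflags);
		thrashing = true;
	}

	/* ... the wait loop, which may break out with the reference dropped ... */

	finish_wait(q, wait);

	if (thrashing) {
		if (delayacct)	/* instead of !PageSwapBacked(page) */
			delayacct_thrashing_end();
		psi_memstall_leave(&pflags);
	}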
Re: Memory hotplug softlock issue
On Tue, 20 Nov 2018, Baoquan He wrote:

> On 11/20/18 at 02:38pm, Vlastimil Babka wrote:

> > On 11/20/18 6:44 AM, Hugh Dickins wrote:

> > > [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
> > >
> > > We have all assumed that it is essential to hold a page reference while waiting on a page lock: partly to guarantee that there is still a struct page when MEMORY_HOTREMOVE is configured, but also to protect against reuse of the struct page going to someone who then holds the page locked indefinitely, when the waiter can reasonably expect timely unlocking.
> > >
> > > But in fact, so long as wait_on_page_bit_common() does the put_page(), and is careful not to rely on struct page contents thereafter, there is no need to hold a reference to the page while waiting on it. That does

> > So there's still a moment where refcount is elevated, but hopefully short enough, right? Let's see if it survives Baoquan's stress testing.

> Yes, I applied Hugh's patch 8 hours ago, then our QE Ping operated on that machine; after many times of hot removing/adding, the endless looping during migration is not seen any more. The test result for Hugh's patch is positive. I even suggested Ping increase the memory pressure to "stress -m 250", and it still succeeded to offline and remove.
>
> So I think this patch works to solve the issue. Thanks a lot for your help, all of you.

Very good to hear, thanks a lot for your quick feedback.

> Hugh, will you post a formal patch in a separate thread?

Yes, I promise that I shall do so in the next few days, but not today: some other things have to take priority. And Vlastimil has raised an excellent point about the interaction with PSI "thrashing": I need to read up and decide which way to go on that (and add Johannes to the Cc when I post).

I think I shall probably post it directly to Linus (lists and other people Cc'ed of course): not because I think it should be rushed in too quickly, nor to sidestep Andrew, but because Linus was very closely involved in both the PG_waiters and WQ_FLAG_BOOKMARK discussions: it is an area of special interest to him.

Hugh
Re: Memory hotplug softlock issue
On Tue, 20 Nov 2018, Vlastimil Babka wrote:

> On 11/20/18 6:44 AM, Hugh Dickins wrote:

> > [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
> >
> > We have all assumed that it is essential to hold a page reference while waiting on a page lock: partly to guarantee that there is still a struct page when MEMORY_HOTREMOVE is configured, but also to protect against reuse of the struct page going to someone who then holds the page locked indefinitely, when the waiter can reasonably expect timely unlocking.
> >
> > But in fact, so long as wait_on_page_bit_common() does the put_page(), and is careful not to rely on struct page contents thereafter, there is no need to hold a reference to the page while waiting on it. That does

> So there's still a moment where refcount is elevated, but hopefully short enough, right?

Correct: and given page migration's 10 passes, it would have to be very unlucky to hit one of those transiently elevated refcounts every time: so I don't think it's a grave drawback at all - certainly much less grave than how it's done at present.

I admit that doing a get_page_unless_zero() immediately before the put_and_wait_on_page_locked() looks rather silly, but I think we do have to hold a reference in order to set PG_waiters. Then for other future uses (e.g. in find_get_entry() or lock_page_or_retry()), the reference to be dropped has been taken earlier anyway.

> Let's see if it survives Baoquan's stress testing.

> > mean that this case cannot go back through the loop: but that's fine for the page migration case, and even if used more widely, is limited by the "Stop walking if it's locked" optimization in wake_page_function().
> >
> > Add interface put_and_wait_on_page_locked() to do this, using negative value of the lock arg to wait_on_page_bit_common() to implement it. No interruptible or killable variant needed yet, but they might follow: I have a vague notion that reporting -EINTR should take precedence over return from wait_on_page_bit_common() without knowing the page state, so arrange it accordingly - but that may be nothing but pedantic.
> >
> > shrink_page_list()'s __ClearPageLocked(): that was a surprise! this survived a lot of testing before that showed up. It does raise the question: should is_page_cache_freeable() and __remove_mapping() now treat a PG_waiters page as if an extra reference were held? Perhaps, but I don't think it matters much, since shrink_page_list() already had to win its trylock_page(), so waiters are not very common there: I noticed no difference when trying the bigger change, and it's surely not needed while put_and_wait_on_page_locked() is only for page migration.
> >
> > Signed-off-by: Hugh Dickins
> > ---

> ...

> > @@ -1100,6 +,17 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
> > 			ret = -EINTR;
> > 			break;
> > 		}
> > +
> > +		if (lock < 0) {
> > +			/*
> > +			 * We can no longer safely access page->flags:

> Hmm...

> > +			 * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
> > +			 * there is a risk of waiting forever on a page reused
> > +			 * for something that keeps it locked indefinitely.
> > +			 * But best check for -EINTR above before breaking.
> > +			 */
> > +			break;
> > +		}
> > 	}
> >
> > 	finish_wait(q, wait);

> ... the code continues by:
>
> 	if (thrashing) {
> 		if (!PageSwapBacked(page))
>
> So maybe we should not set 'thrashing' true when lock < 0?

Very good catch, thank you Vlastimil: as you might have guessed, the patch from a pre-PSI kernel applied cleanly, and I just hadn't reviewed the surrounding context properly before sending out.

I cannot say immediately what the right answer is; I'll have to do some research first: maybe not enter the block that sets thrashing true when lock < 0, as you suggest, or maybe force lock < 0 to 0 and put_page() afterwards, or...

Hugh
Re: Memory hotplug softlock issue
On 11/20/18 at 03:05pm, Michal Hocko wrote:

> > Yes, I applied Hugh's patch 8 hours ago, then our QE Ping operated on that machine; after many times of hot removing/adding, the endless looping during migration is not seen any more. The test result for Hugh's patch is positive. I even suggested Ping increase the memory pressure to "stress -m 250", and it still succeeded to offline and remove.
> >
> > So I think this patch works to solve the issue. Thanks a lot for your help, all of you.

> This is great news! Thanks for your swift feedback. I will go and try to review Hugh's patch soon.

Yeah. Thanks a lot for your help on debugging and narrowing down the cause of the issue, Michal, really appreciated!

> > Meanwhile we found that sometimes onlining may not add back all memory blocks on one memory board; hot removing/adding them will then cause a kernel panic. I will investigate further and collect information, to see if it's a kernel issue or a udev issue.

> It would be great to get a report in a new email thread.

Sure, will do after reproducing and arranging the information.

Thanks
Baoquan
Re: Memory hotplug softlock issue
On Tue 20-11-18 21:58:03, Baoquan He wrote:

> Hi,
>
> On 11/20/18 at 02:38pm, Vlastimil Babka wrote:

> > On 11/20/18 6:44 AM, Hugh Dickins wrote:

> > > [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
> > >
> > > We have all assumed that it is essential to hold a page reference while waiting on a page lock: partly to guarantee that there is still a struct page when MEMORY_HOTREMOVE is configured, but also to protect against reuse of the struct page going to someone who then holds the page locked indefinitely, when the waiter can reasonably expect timely unlocking.
> > >
> > > But in fact, so long as wait_on_page_bit_common() does the put_page(), and is careful not to rely on struct page contents thereafter, there is no need to hold a reference to the page while waiting on it. That does

> > So there's still a moment where refcount is elevated, but hopefully short enough, right? Let's see if it survives Baoquan's stress testing.

> Yes, I applied Hugh's patch 8 hours ago, then our QE Ping operated on that machine; after many times of hot removing/adding, the endless looping during migration is not seen any more. The test result for Hugh's patch is positive. I even suggested Ping increase the memory pressure to "stress -m 250", and it still succeeded to offline and remove.
>
> So I think this patch works to solve the issue. Thanks a lot for your help, all of you.

This is great news! Thanks for your swift feedback. I will go and try to review Hugh's patch soon.

> Hugh, will you post a formal patch in a separate thread?
>
> Meanwhile we found that sometimes onlining may not add back all memory blocks on one memory board; hot removing/adding them will then cause a kernel panic. I will investigate further and collect information, to see if it's a kernel issue or a udev issue.

It would be great to get a report in a new email thread.

> Thanks
> Baoquan

> > > mean that this case cannot go back through the loop: but that's fine for the page migration case, and even if used more widely, is limited by the "Stop walking if it's locked" optimization in wake_page_function().
> > >
> > > Add interface put_and_wait_on_page_locked() to do this, using negative value of the lock arg to wait_on_page_bit_common() to implement it. No interruptible or killable variant needed yet, but they might follow: I have a vague notion that reporting -EINTR should take precedence over return from wait_on_page_bit_common() without knowing the page state, so arrange it accordingly - but that may be nothing but pedantic.
> > >
> > > shrink_page_list()'s __ClearPageLocked(): that was a surprise! this survived a lot of testing before that showed up. It does raise the question: should is_page_cache_freeable() and __remove_mapping() now treat a PG_waiters page as if an extra reference were held? Perhaps, but I don't think it matters much, since shrink_page_list() already had to win its trylock_page(), so waiters are not very common there: I noticed no difference when trying the bigger change, and it's surely not needed while put_and_wait_on_page_locked() is only for page migration.
> > >
> > > Signed-off-by: Hugh Dickins
> > > ---
> >
> > ...

> > > @@ -1100,6 +,17 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
> > > 			ret = -EINTR;
> > > 			break;
> > > 		}
> > > +
> > > +		if (lock < 0) {
> > > +			/*
> > > +			 * We can no longer safely access page->flags:

> > Hmm...

> > > +			 * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
> > > +			 * there is a risk of waiting forever on a page reused
> > > +			 * for something that keeps it locked indefinitely.
> > > +			 * But best check for -EINTR above before breaking.
> > > +			 */
> > > +			break;
> > > +		}
> > > 	}
> > >
> > > 	finish_wait(q, wait);

> > ... the code continues by:
> >
> > 	if (thrashing) {
> > 		if (!PageSwapBacked(page))
> >
> > So maybe we should not set 'thrashing' true when lock < 0?
> >
> > Thanks!
> > Vlastimil

--
Michal Hocko
SUSE Labs
Re: Memory hotplug softlock issue
Hi,

On 11/20/18 at 02:38pm, Vlastimil Babka wrote:

> On 11/20/18 6:44 AM, Hugh Dickins wrote:

> > [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
> >
> > We have all assumed that it is essential to hold a page reference while waiting on a page lock: partly to guarantee that there is still a struct page when MEMORY_HOTREMOVE is configured, but also to protect against reuse of the struct page going to someone who then holds the page locked indefinitely, when the waiter can reasonably expect timely unlocking.
> >
> > But in fact, so long as wait_on_page_bit_common() does the put_page(), and is careful not to rely on struct page contents thereafter, there is no need to hold a reference to the page while waiting on it. That does

> So there's still a moment where refcount is elevated, but hopefully short enough, right? Let's see if it survives Baoquan's stress testing.

Yes, I applied Hugh's patch 8 hours ago, then our QE Ping operated on that machine; after many times of hot removing/adding, the endless looping during migration is not seen any more. The test result for Hugh's patch is positive. I even suggested Ping increase the memory pressure to "stress -m 250", and it still succeeded to offline and remove.

So I think this patch works to solve the issue. Thanks a lot for your help, all of you.

Hugh, will you post a formal patch in a separate thread?

Meanwhile we found that sometimes onlining may not add back all memory blocks on one memory board; hot removing/adding them will then cause a kernel panic. I will investigate further and collect information, to see if it's a kernel issue or a udev issue.

Thanks
Baoquan

> > mean that this case cannot go back through the loop: but that's fine for the page migration case, and even if used more widely, is limited by the "Stop walking if it's locked" optimization in wake_page_function().
> >
> > Add interface put_and_wait_on_page_locked() to do this, using negative value of the lock arg to wait_on_page_bit_common() to implement it. No interruptible or killable variant needed yet, but they might follow: I have a vague notion that reporting -EINTR should take precedence over return from wait_on_page_bit_common() without knowing the page state, so arrange it accordingly - but that may be nothing but pedantic.
> >
> > shrink_page_list()'s __ClearPageLocked(): that was a surprise! this survived a lot of testing before that showed up. It does raise the question: should is_page_cache_freeable() and __remove_mapping() now treat a PG_waiters page as if an extra reference were held? Perhaps, but I don't think it matters much, since shrink_page_list() already had to win its trylock_page(), so waiters are not very common there: I noticed no difference when trying the bigger change, and it's surely not needed while put_and_wait_on_page_locked() is only for page migration.
> >
> > Signed-off-by: Hugh Dickins
> > ---
>
> ...

> > @@ -1100,6 +,17 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
> > 			ret = -EINTR;
> > 			break;
> > 		}
> > +
> > +		if (lock < 0) {
> > +			/*
> > +			 * We can no longer safely access page->flags:

> Hmm...

> > +			 * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
> > +			 * there is a risk of waiting forever on a page reused
> > +			 * for something that keeps it locked indefinitely.
> > +			 * But best check for -EINTR above before breaking.
> > +			 */
> > +			break;
> > +		}
> > 	}
> >
> > 	finish_wait(q, wait);

> ... the code continues by:
>
> 	if (thrashing) {
> 		if (!PageSwapBacked(page))
>
> So maybe we should not set 'thrashing' true when lock < 0?
>
> Thanks!
> Vlastimil
Re: Memory hotplug softlock issue
On 11/20/18 6:44 AM, Hugh Dickins wrote:

> [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
>
> We have all assumed that it is essential to hold a page reference while waiting on a page lock: partly to guarantee that there is still a struct page when MEMORY_HOTREMOVE is configured, but also to protect against reuse of the struct page going to someone who then holds the page locked indefinitely, when the waiter can reasonably expect timely unlocking.
>
> But in fact, so long as wait_on_page_bit_common() does the put_page(), and is careful not to rely on struct page contents thereafter, there is no need to hold a reference to the page while waiting on it. That does

So there's still a moment where refcount is elevated, but hopefully short enough, right? Let's see if it survives Baoquan's stress testing.

> mean that this case cannot go back through the loop: but that's fine for the page migration case, and even if used more widely, is limited by the "Stop walking if it's locked" optimization in wake_page_function().
>
> Add interface put_and_wait_on_page_locked() to do this, using negative value of the lock arg to wait_on_page_bit_common() to implement it. No interruptible or killable variant needed yet, but they might follow: I have a vague notion that reporting -EINTR should take precedence over return from wait_on_page_bit_common() without knowing the page state, so arrange it accordingly - but that may be nothing but pedantic.
>
> shrink_page_list()'s __ClearPageLocked(): that was a surprise! this survived a lot of testing before that showed up. It does raise the question: should is_page_cache_freeable() and __remove_mapping() now treat a PG_waiters page as if an extra reference were held? Perhaps, but I don't think it matters much, since shrink_page_list() already had to win its trylock_page(), so waiters are not very common there: I noticed no difference when trying the bigger change, and it's surely not needed while put_and_wait_on_page_locked() is only for page migration.
>
> Signed-off-by: Hugh Dickins
> ---

...

> @@ -1100,6 +,17 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
> 			ret = -EINTR;
> 			break;
> 		}
> +
> +		if (lock < 0) {
> +			/*
> +			 * We can no longer safely access page->flags:

Hmm...

> +			 * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
> +			 * there is a risk of waiting forever on a page reused
> +			 * for something that keeps it locked indefinitely.
> +			 * But best check for -EINTR above before breaking.
> +			 */
> +			break;
> +		}
> 	}
>
> 	finish_wait(q, wait);

... the code continues by:

	if (thrashing) {
		if (!PageSwapBacked(page))

So maybe we should not set 'thrashing' true when lock < 0?

Thanks!
Vlastimil
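For context, the tail of wait_on_page_bit_common() that the question refers to looks roughly like this in a PSI-enabled kernel (a sketch with the hazard annotated; the surrounding code is abbreviated):

	finish_wait(q, wait);

	if (thrashing) {
		/*
		 * Hazard: with lock < 0 the loop above may have been
		 * left after put_page(), so the page may already be
		 * freed or reused here, and PageSwapBacked(page) must
		 * not be called.
		 */
		if (!PageSwapBacked(page))
			delayacct_thrashing_end();
		psi_memstall_leave(&pflags);
	}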
Re: Memory hotplug softlock issue
On Tue, 20 Nov 2018, Baoquan He wrote:

> On 11/19/18 at 09:59pm, Michal Hocko wrote:

> > On Mon 19-11-18 12:34:09, Hugh Dickins wrote:

> > > I'm glad that I delayed: what I had then (migration_waitqueue instead of using page_waitqueue) was not wrong, but what I've been using the last couple of months is rather better (and can be put to use to solve similar problems in collapsing pages on huge tmpfs, but we don't need to get into that at this time): put_and_wait_on_page_locked().
> > >
> > > What I have not yet done is verify it on latest kernel, and research the interested Cc list (Linus and Tim Chen come immediately to mind), and write the commit comment. I have some testing to do on the latest kernel today, so I'll throw put_and_wait_on_page_locked() in too, and post tomorrow I hope.

> > Cool, it seems that Baoquan has a reliable test case to trigger the pathological case.

> Yes. I will test Hugh's patch.

Thanks: I've completed some of the retesting now, so it would probably help us all if I post the patch in this thread, even without completing its description and links and Cc list yet - there isn't even a problem description below; I still have to paste that in from the unposted patch that I made six months ago. Here is today's...

[PATCH] mm: put_and_wait_on_page_locked() while page is migrated

We have all assumed that it is essential to hold a page reference while waiting on a page lock: partly to guarantee that there is still a struct page when MEMORY_HOTREMOVE is configured, but also to protect against reuse of the struct page going to someone who then holds the page locked indefinitely, when the waiter can reasonably expect timely unlocking.

But in fact, so long as wait_on_page_bit_common() does the put_page(), and is careful not to rely on struct page contents thereafter, there is no need to hold a reference to the page while waiting on it. That does mean that this case cannot go back through the loop: but that's fine for the page migration case, and even if used more widely, is limited by the "Stop walking if it's locked" optimization in wake_page_function().

Add interface put_and_wait_on_page_locked() to do this, using negative value of the lock arg to wait_on_page_bit_common() to implement it. No interruptible or killable variant needed yet, but they might follow: I have a vague notion that reporting -EINTR should take precedence over return from wait_on_page_bit_common() without knowing the page state, so arrange it accordingly - but that may be nothing but pedantic.

shrink_page_list()'s __ClearPageLocked(): that was a surprise! this survived a lot of testing before that showed up. It does raise the question: should is_page_cache_freeable() and __remove_mapping() now treat a PG_waiters page as if an extra reference were held? Perhaps, but I don't think it matters much, since shrink_page_list() already had to win its trylock_page(), so waiters are not very common there: I noticed no difference when trying the bigger change, and it's surely not needed while put_and_wait_on_page_locked() is only for page migration.
Signed-off-by: Hugh Dickins
---
 include/linux/pagemap.h |  2 ++
 mm/filemap.c            | 53 -
 mm/huge_memory.c        |  6 ++---
 mm/migrate.c            | 12 --
 mm/vmscan.c             | 11 +
 5 files changed, 57 insertions(+), 27 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 226f96f0dee0..e2d7039af6a3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -537,6 +537,8 @@ static inline int wait_on_page_locked_killable(struct page *page)
 	return wait_on_page_bit_killable(compound_head(page), PG_locked);
 }
 
+extern void put_and_wait_on_page_locked(struct page *page);
+
 /*
  * Wait for a page to complete writeback
  */
diff --git a/mm/filemap.c b/mm/filemap.c
index 81adec8ee02c..ef82119032d8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -981,7 +981,14 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
 	if (wait_page->bit_nr != key->bit_nr)
 		return 0;
 
-	/* Stop walking if it's locked */
+	/*
+	 * Stop walking if it's locked.
+	 * Is this safe if put_and_wait_on_page_locked() is in use?
+	 * Yes: the waker must hold a reference to this page, and if PG_locked
+	 * has now already been set by another task, that task must also hold
+	 * a reference to the *same usage* of this page; so there is no need
+	 * to walk on to wake even the put_and_wait_on_page_locked() callers.
+	 */
 	if (test_bit(key->bit_nr, &key->page->flags))
 		return -1;
 
@@ -1050,13 +1057,14 @@ static void wake_up_page(struct page *page, int bit)
 }
 
 static inline int wait_on_page_bit_common(wait_queue_head_t *q,
-	struct page *page, int bit_nr, int state, bool lock)
+
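The mm/migrate.c side of the patch is cut off above; going by the diffstat and the commit description, the call site would look roughly like this sketch (a reconstruction for illustration, not the verbatim hunk):

	/*
	 * Sketch of __migration_entry_wait() using the new helper: the
	 * reference from get_page_unless_zero() is still needed to set
	 * PG_waiters safely, but put_and_wait_on_page_locked() drops it
	 * before actually sleeping, so migration's refcount check can
	 * then pass.
	 */
	void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
					spinlock_t *ptl)
	{
		/* ... migration entry checks elided ... */
		page = migration_entry_to_page(entry);
		if (!get_page_unless_zero(page))
			goto out;
		pte_unmap_unlock(ptep, ptl);
		put_and_wait_on_page_locked(page);	/* drops our reference */
		return;
	out:
		pte_unmap_unlock(ptep, ptl);
	}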
Re: Memory hotplug softlock issue
On 11/19/18 at 09:59pm, Michal Hocko wrote:

> On Mon 19-11-18 12:34:09, Hugh Dickins wrote:

> > I'm glad that I delayed: what I had then (migration_waitqueue instead of using page_waitqueue) was not wrong, but what I've been using the last couple of months is rather better (and can be put to use to solve similar problems in collapsing pages on huge tmpfs, but we don't need to get into that at this time): put_and_wait_on_page_locked().
> >
> > What I have not yet done is verify it on latest kernel, and research the interested Cc list (Linus and Tim Chen come immediately to mind), and write the commit comment. I have some testing to do on the latest kernel today, so I'll throw put_and_wait_on_page_locked() in too, and post tomorrow I hope.

> Cool, it seems that Baoquan has a reliable test case to trigger the pathological case.

Yes. I will test Hugh's patch.
Re: Memory hotplug softlock issue
On Mon 19-11-18 12:34:09, Hugh Dickins wrote:

> On Mon, 19 Nov 2018, Michal Hocko wrote:

> > On Mon 19-11-18 15:10:16, Michal Hocko wrote: [...]
> >
> > > In other words. Why cannot we do the following?
> >
> > Baoquan, this is certainly not the right fix but I would be really curious whether it makes the problem go away.

> > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > index f7e4bfdc13b7..7ccab29bcf9a 100644
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -324,19 +324,9 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> > > 		goto out;
> > >
> > > 	page = migration_entry_to_page(entry);
> > > -
> > > -	/*
> > > -	 * Once page cache replacement of page migration started, page_count
> > > -	 * *must* be zero. And, we don't want to call wait_on_page_locked()
> > > -	 * against a page without get_page().
> > > -	 * So, we use get_page_unless_zero(), here. Even failed, page fault
> > > -	 * will occur again.
> > > -	 */
> > > -	if (!get_page_unless_zero(page))
> > > -		goto out;
> > > 	pte_unmap_unlock(ptep, ptl);
> > > -	wait_on_page_locked(page);
> > > -	put_page(page);
> > > +	lock_page(page);
> > > +	unlock_page(page);
> > > 	return;
> > > out:
> > > 	pte_unmap_unlock(ptep, ptl);

> Thanks for Cc'ing me. I did mention precisely this issue two or three times at LSF/MM this year, and claimed then that I would post the fix.

I've had a recollection of some issue, but just couldn't remember what it was exactly. Tried to make Vlastimil remember but failed there as well ;)

> I'm glad that I delayed: what I had then (migration_waitqueue instead of using page_waitqueue) was not wrong, but what I've been using the last couple of months is rather better (and can be put to use to solve similar problems in collapsing pages on huge tmpfs, but we don't need to get into that at this time): put_and_wait_on_page_locked().
>
> What I have not yet done is verify it on latest kernel, and research the interested Cc list (Linus and Tim Chen come immediately to mind), and write the commit comment. I have some testing to do on the latest kernel today, so I'll throw put_and_wait_on_page_locked() in too, and post tomorrow I hope.

Cool, it seems that Baoquan has a reliable test case to trigger the pathological case.

--
Michal Hocko
SUSE Labs
Re: Memory hotplug softlock issue
On Mon, 19 Nov 2018, Michal Hocko wrote:

> On Mon 19-11-18 15:10:16, Michal Hocko wrote: [...]
>
> > In other words. Why cannot we do the following?
>
> Baoquan, this is certainly not the right fix but I would be really curious whether it makes the problem go away.

> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index f7e4bfdc13b7..7ccab29bcf9a 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -324,19 +324,9 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> > 		goto out;
> >
> > 	page = migration_entry_to_page(entry);
> > -
> > -	/*
> > -	 * Once page cache replacement of page migration started, page_count
> > -	 * *must* be zero. And, we don't want to call wait_on_page_locked()
> > -	 * against a page without get_page().
> > -	 * So, we use get_page_unless_zero(), here. Even failed, page fault
> > -	 * will occur again.
> > -	 */
> > -	if (!get_page_unless_zero(page))
> > -		goto out;
> > 	pte_unmap_unlock(ptep, ptl);
> > -	wait_on_page_locked(page);
> > -	put_page(page);
> > +	lock_page(page);
> > +	unlock_page(page);
> > 	return;
> > out:
> > 	pte_unmap_unlock(ptep, ptl);

Thanks for Cc'ing me. I did mention precisely this issue two or three times at LSF/MM this year, and claimed then that I would post the fix.

I'm glad that I delayed: what I had then (migration_waitqueue instead of using page_waitqueue) was not wrong, but what I've been using the last couple of months is rather better (and can be put to use to solve similar problems in collapsing pages on huge tmpfs, but we don't need to get into that at this time): put_and_wait_on_page_locked().

What I have not yet done is verify it on latest kernel, and research the interested Cc list (Linus and Tim Chen come immediately to mind), and write the commit comment. I have some testing to do on the latest kernel today, so I'll throw put_and_wait_on_page_locked() in too, and post tomorrow I hope.

Hugh
Re: Memory hotplug softlock issue
On Mon 19-11-18 15:10:16, Michal Hocko wrote: [...]

> In other words. Why cannot we do the following?

Baoquan, this is certainly not the right fix but I would be really curious whether it makes the problem go away.

> diff --git a/mm/migrate.c b/mm/migrate.c
> index f7e4bfdc13b7..7ccab29bcf9a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -324,19 +324,9 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> 		goto out;
>
> 	page = migration_entry_to_page(entry);
> -
> -	/*
> -	 * Once page cache replacement of page migration started, page_count
> -	 * *must* be zero. And, we don't want to call wait_on_page_locked()
> -	 * against a page without get_page().
> -	 * So, we use get_page_unless_zero(), here. Even failed, page fault
> -	 * will occur again.
> -	 */
> -	if (!get_page_unless_zero(page))
> -		goto out;
> 	pte_unmap_unlock(ptep, ptl);
> -	wait_on_page_locked(page);
> -	put_page(page);
> +	lock_page(page);
> +	unlock_page(page);
> 	return;
> out:
> 	pte_unmap_unlock(ptep, ptl);

--
Michal Hocko
SUSE Labs
Re: Memory hotplug softlock issue
On Mon 19-11-18 17:48:35, Vlastimil Babka wrote:

> On 11/19/18 5:46 PM, Vlastimil Babka wrote:

> > On 11/19/18 5:46 PM, Michal Hocko wrote:

> > > On Mon 19-11-18 17:36:21, Vlastimil Babka wrote:

> > > > So what protects us from locking a page whose refcount dropped to zero and is being freed? The checks in the freeing path won't be happy about a stray lock.

> > > Nothing really prevents that. But does it matter. The worst that might happen is that we lock a freed or reused page. Who would complain?

> > free_pages_check() for example
> >
> > PAGE_FLAGS_CHECK_AT_FREE includes PG_locked

Right you are.

> And besides... what about the last page being offlined and then the whole struct page's part of vmemmap destroyed as the node goes away?

Yeah, that is quite unlikely though because there is quite a large time window between the two events. I am not entirely sure we are safe right now TBH. Any access to the struct page after the put_page is unsafe theoretically. Then we have to come up with something more clever I am afraid.

--
Michal Hocko
SUSE Labs
Re: Memory hotplug softlock issue
On 11/19/18 5:46 PM, Vlastimil Babka wrote:

> On 11/19/18 5:46 PM, Michal Hocko wrote:

> > On Mon 19-11-18 17:36:21, Vlastimil Babka wrote:

> > > So what protects us from locking a page whose refcount dropped to zero and is being freed? The checks in the freeing path won't be happy about a stray lock.

> > Nothing really prevents that. But does it matter. The worst that might happen is that we lock a freed or reused page. Who would complain?

> free_pages_check() for example
>
> PAGE_FLAGS_CHECK_AT_FREE includes PG_locked

And besides... what about the last page being offlined and then the whole struct page's part of vmemmap destroyed as the node goes away?
Re: Memory hotplug softlock issue
On 11/19/18 5:46 PM, Michal Hocko wrote:

> On Mon 19-11-18 17:36:21, Vlastimil Babka wrote:

> > So what protects us from locking a page whose refcount dropped to zero and is being freed? The checks in the freeing path won't be happy about a stray lock.

> Nothing really prevents that. But does it matter. The worst that might happen is that we lock a freed or reused page. Who would complain?

free_pages_check() for example

PAGE_FLAGS_CHECK_AT_FREE includes PG_locked
Re: Memory hotplug softlock issue
On Mon 19-11-18 17:36:21, Vlastimil Babka wrote:

> On 11/19/18 3:10 PM, Michal Hocko wrote:

> > On Mon 19-11-18 13:51:21, Michal Hocko wrote:

> > > On Mon 19-11-18 13:40:33, Michal Hocko wrote:

> > > > How are we supposed to converge when the swapin code waits for the migration to finish with the reference count elevated?

> Indeed this looks wrong. How come we only found this out now? I guess the race window where refcounts matter is only a part of the whole migration, where we update the mapping (migrate_page_move_mapping()). That's before copying contents, flags etc.

I guess we simply never found out because most migration callers simply fail after a few attempts. The notable exception is memory offline, which retries until it succeeds or the caller terminates the process with a fatal signal.

> > > Just to clarify. This is not only about swapin obviously. Any caller of __migration_entry_wait is affected the same way AFAICS.

> > In other words. Why cannot we do the following?

> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index f7e4bfdc13b7..7ccab29bcf9a 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -324,19 +324,9 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> > 		goto out;
> >
> > 	page = migration_entry_to_page(entry);
> > -
> > -	/*
> > -	 * Once page cache replacement of page migration started, page_count
> > -	 * *must* be zero. And, we don't want to call wait_on_page_locked()
> > -	 * against a page without get_page().
> > -	 * So, we use get_page_unless_zero(), here. Even failed, page fault
> > -	 * will occur again.
> > -	 */
> > -	if (!get_page_unless_zero(page))
> > -		goto out;
> > 	pte_unmap_unlock(ptep, ptl);
> > -	wait_on_page_locked(page);
> > -	put_page(page);
> > +	lock_page(page);
> > +	unlock_page(page);

> So what protects us from locking a page whose refcount dropped to zero and is being freed? The checks in the freeing path won't be happy about a stray lock.

Nothing really prevents that. But does it matter. The worst that might happen is that we lock a freed or reused page. Who would complain?

--
Michal Hocko
SUSE Labs
Re: Memory hotplug softlock issue
On 11/19/18 3:10 PM, Michal Hocko wrote:

> On Mon 19-11-18 13:51:21, Michal Hocko wrote:

> > On Mon 19-11-18 13:40:33, Michal Hocko wrote:

> > > How are we supposed to converge when the swapin code waits for the migration to finish with the reference count elevated?

Indeed this looks wrong. How come we only found this out now? I guess the race window where refcounts matter is only a part of the whole migration, where we update the mapping (migrate_page_move_mapping()). That's before copying contents, flags etc.

> > Just to clarify. This is not only about swapin obviously. Any caller of __migration_entry_wait is affected the same way AFAICS.

> In other words. Why cannot we do the following?

> diff --git a/mm/migrate.c b/mm/migrate.c
> index f7e4bfdc13b7..7ccab29bcf9a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -324,19 +324,9 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> 		goto out;
>
> 	page = migration_entry_to_page(entry);
> -
> -	/*
> -	 * Once page cache replacement of page migration started, page_count
> -	 * *must* be zero. And, we don't want to call wait_on_page_locked()
> -	 * against a page without get_page().
> -	 * So, we use get_page_unless_zero(), here. Even failed, page fault
> -	 * will occur again.
> -	 */
> -	if (!get_page_unless_zero(page))
> -		goto out;
> 	pte_unmap_unlock(ptep, ptl);
> -	wait_on_page_locked(page);
> -	put_page(page);
> +	lock_page(page);
> +	unlock_page(page);

So what protects us from locking a page whose refcount dropped to zero and is being freed? The checks in the freeing path won't be happy about a stray lock.

I suspect it's not that simple to fix this. Perhaps migration code could set some flag/bit in the page during the part where it stabilizes refcounts, and __migration_entry_wait() would just spin until the bit is cleared, and only then proceed with the current get_page+wait? Or we could maybe wait on the pte itself and not the page?

> 	return;
> out:
> 	pte_unmap_unlock(ptep, ptl);
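To make the flag/bit idea concrete, a rough sketch follows. PG_migration_pin and migration_pin_wait() are invented names for illustration only, not real kernel interfaces, and a real version would still have to answer the page-lifetime question raised above.

	/*
	 * HYPOTHETICAL sketch of "spin until a migration bit clears":
	 * the migration core would set the (invented) PG_migration_pin
	 * bit around the refcount-sensitive window in
	 * migrate_page_move_mapping(), and the waiter would spin on it
	 * instead of taking a reference that defeats the refcount check.
	 */
	static void migration_pin_wait(struct page *page)
	{
		while (test_bit(PG_migration_pin, &page->flags))
			cpu_relax();	/* refcounts are stable again */
		/* then proceed with the current get_page_unless_zero() + wait */
	}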
Re: Memory hotplug softlock issue
On Mon 19-11-18 13:51:21, Michal Hocko wrote:

> On Mon 19-11-18 13:40:33, Michal Hocko wrote:

> > On Mon 19-11-18 18:52:02, Baoquan He wrote: [...]
> >
> > There are a few stacks directly in the offline path but those should be OK. The real culprit seems to be the swap-in code

> > [ +1.734416] CPU: 255 PID: 5558 Comm: stress Tainted: G L 4.20.0-rc2+ #7
> > [ +0.007927] Hardware name: 9008/IT91SMUB, BIOS BLXSV512 03/22/2018
> > [ +0.006297] Call Trace:
> > [ +0.002537] dump_stack+0x46/0x60
> > [ +0.003386] __migration_entry_wait.cold.65+0x5/0x14
> > [ +0.005043] do_swap_page+0x84e/0x960
> > [ +0.003727] ? arch_tlb_finish_mmu+0x29/0xc0
> > [ +0.006412] __handle_mm_fault+0x933/0x1330
> > [ +0.004265] handle_mm_fault+0xc4/0x250
> > [ +0.003915] __do_page_fault+0x2b7/0x510
> > [ +0.003990] do_page_fault+0x2c/0x110
> > [ +0.003729] ? page_fault+0x8/0x30
> > [ +0.003462] page_fault+0x1e/0x30

> > There are many traces to this path. We are
> >
> > 	/*
> > 	 * Once page cache replacement of page migration started, page_count
> > 	 * *must* be zero. And, we don't want to call wait_on_page_locked()
> > 	 * against a page without get_page().
> > 	 * So, we use get_page_unless_zero(), here. Even failed, page fault
> > 	 * will occur again.
> > 	 */
> > 	if (!get_page_unless_zero(page))
> > 		goto out;
> > 	pte_unmap_unlock(ptep, ptl);
> > 	wait_on_page_locked(page);
> >
> > taking a reference to the page under the migration. I have to think about this much more but I suspect this is just calling for a problem.
> >
> > Cc migration experts. For your background information. We are seeing memory offline not being able to converge because a few heavily used pages fail to migrate away - e.g. http://lkml.kernel.org/r/20181116012433.GU2653@MiWiFi-R3L-srv A debugging patch to dump stack for these pages http://lkml.kernel.org/r/20181116091409.gd14...@dhcp22.suse.cz shows that references are taken from the swap-in code (above). How are we supposed to converge when the swapin code waits for the migration to finish with the reference count elevated?

> Just to clarify. This is not only about swapin obviously. Any caller of __migration_entry_wait is affected the same way AFAICS.

In other words. Why cannot we do the following?

diff --git a/mm/migrate.c b/mm/migrate.c
index f7e4bfdc13b7..7ccab29bcf9a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -324,19 +324,9 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 		goto out;
 
 	page = migration_entry_to_page(entry);
-
-	/*
-	 * Once page cache replacement of page migration started, page_count
-	 * *must* be zero. And, we don't want to call wait_on_page_locked()
-	 * against a page without get_page().
-	 * So, we use get_page_unless_zero(), here. Even failed, page fault
-	 * will occur again.
-	 */
-	if (!get_page_unless_zero(page))
-		goto out;
 	pte_unmap_unlock(ptep, ptl);
-	wait_on_page_locked(page);
-	put_page(page);
+	lock_page(page);
+	unlock_page(page);
 	return;
 out:
 	pte_unmap_unlock(ptep, ptl);

--
Michal Hocko
SUSE Labs
Re: Memory hotplug softlock issue
On Mon 19-11-18 13:40:33, Michal Hocko wrote:

> On Mon 19-11-18 18:52:02, Baoquan He wrote: [...]
>
> There are a few stacks directly in the offline path but those should be OK. The real culprit seems to be the swap-in code

> [ +1.734416] CPU: 255 PID: 5558 Comm: stress Tainted: G L 4.20.0-rc2+ #7
> [ +0.007927] Hardware name: 9008/IT91SMUB, BIOS BLXSV512 03/22/2018
> [ +0.006297] Call Trace:
> [ +0.002537] dump_stack+0x46/0x60
> [ +0.003386] __migration_entry_wait.cold.65+0x5/0x14
> [ +0.005043] do_swap_page+0x84e/0x960
> [ +0.003727] ? arch_tlb_finish_mmu+0x29/0xc0
> [ +0.006412] __handle_mm_fault+0x933/0x1330
> [ +0.004265] handle_mm_fault+0xc4/0x250
> [ +0.003915] __do_page_fault+0x2b7/0x510
> [ +0.003990] do_page_fault+0x2c/0x110
> [ +0.003729] ? page_fault+0x8/0x30
> [ +0.003462] page_fault+0x1e/0x30

> There are many traces to this path. We are
>
> 	/*
> 	 * Once page cache replacement of page migration started, page_count
> 	 * *must* be zero. And, we don't want to call wait_on_page_locked()
> 	 * against a page without get_page().
> 	 * So, we use get_page_unless_zero(), here. Even failed, page fault
> 	 * will occur again.
> 	 */
> 	if (!get_page_unless_zero(page))
> 		goto out;
> 	pte_unmap_unlock(ptep, ptl);
> 	wait_on_page_locked(page);
>
> taking a reference to the page under the migration. I have to think about this much more but I suspect this is just calling for a problem.
>
> Cc migration experts. For your background information. We are seeing memory offline not being able to converge because a few heavily used pages fail to migrate away - e.g. http://lkml.kernel.org/r/20181116012433.GU2653@MiWiFi-R3L-srv A debugging patch to dump stack for these pages http://lkml.kernel.org/r/20181116091409.gd14...@dhcp22.suse.cz shows that references are taken from the swap-in code (above). How are we supposed to converge when the swapin code waits for the migration to finish with the reference count elevated?

Just to clarify. This is not only about swapin obviously. Any caller of __migration_entry_wait is affected the same way AFAICS.

--
Michal Hocko
SUSE Labs
Re: Memory hotplug softlock issue
On Mon 19-11-18 18:52:02, Baoquan He wrote: [...]

There are a few stacks directly in the offline path but those should be OK. The real culprit seems to be the swap-in code

[ +1.734416] CPU: 255 PID: 5558 Comm: stress Tainted: G L 4.20.0-rc2+ #7
[ +0.007927] Hardware name: 9008/IT91SMUB, BIOS BLXSV512 03/22/2018
[ +0.006297] Call Trace:
[ +0.002537] dump_stack+0x46/0x60
[ +0.003386] __migration_entry_wait.cold.65+0x5/0x14
[ +0.005043] do_swap_page+0x84e/0x960
[ +0.003727] ? arch_tlb_finish_mmu+0x29/0xc0
[ +0.006412] __handle_mm_fault+0x933/0x1330
[ +0.004265] handle_mm_fault+0xc4/0x250
[ +0.003915] __do_page_fault+0x2b7/0x510
[ +0.003990] do_page_fault+0x2c/0x110
[ +0.003729] ? page_fault+0x8/0x30
[ +0.003462] page_fault+0x1e/0x30

There are many traces to this path. We are

	/*
	 * Once page cache replacement of page migration started, page_count
	 * *must* be zero. And, we don't want to call wait_on_page_locked()
	 * against a page without get_page().
	 * So, we use get_page_unless_zero(), here. Even failed, page fault
	 * will occur again.
	 */
	if (!get_page_unless_zero(page))
		goto out;
	pte_unmap_unlock(ptep, ptl);
	wait_on_page_locked(page);

taking a reference to the page under the migration. I have to think about this much more but I suspect this is just calling for a problem.

Cc migration experts. For your background information. We are seeing memory offline not being able to converge because a few heavily used pages fail to migrate away - e.g. http://lkml.kernel.org/r/20181116012433.GU2653@MiWiFi-R3L-srv

A debugging patch to dump stack for these pages http://lkml.kernel.org/r/20181116091409.gd14...@dhcp22.suse.cz shows that references are taken from the swap-in code (above). How are we supposed to converge when the swapin code waits for the migration to finish with the reference count elevated?

--
Michal Hocko
SUSE Labs
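For background, here is roughly why that elevated reference count makes migration fail: a simplified sketch of the refcount check in migrate_page_move_mapping() (the exact accounting and locking vary across kernel versions, so treat the details as approximate):

	/*
	 * Migration can only freeze the refcount when it exactly matches
	 * what the mapping accounts for; any extra reference, such as
	 * one taken by a waiter in __migration_entry_wait(), makes the
	 * check fail and the migration pass return -EAGAIN.
	 */
	int expected_count = 1 + extra_count + page_has_private(page);

	xa_lock_irq(&mapping->i_pages);
	if (page_count(page) != expected_count ||
	    !page_ref_freeze(page, expected_count)) {
		xa_unlock_irq(&mapping->i_pages);
		return -EAGAIN;	/* retried for up to 10 passes by migrate_pages() */
	}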
Re: Memory hotplug softlock issue
On 11/16/18 at 10:14am, Michal Hocko wrote:

> Could you try to apply this debugging patch on top please? It will dump a stack trace for each reference count elevation for one page that fails to migrate after multiple passes.

Thanks, applied and fixed two code issues. The dmesg has been sent to you privately; please check. The dmesg buffer overflowed; if you need the earlier messages, I will retest.

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index b64ebf253381..f76e2c498f31 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -72,7 +72,7 @@ static inline int page_count(struct page *page)
 	return atomic_read(&compound_head(page)->_refcount);
 }
 
-struct page *page_to_track;
+extern struct page *page_to_track;
 static inline void set_page_count(struct page *page, int v)
 {
 	atomic_set(&page->_refcount, v);
diff --git a/mm/migrate.c b/mm/migrate.c
index 9b2e395a3d68..42c7499c43b9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1339,6 +1339,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 }
 
 struct page *page_to_track;
+EXPORT_SYMBOL_GPL(page_to_track);
 
 /*
  * migrate_pages - migrate the pages specified in a list, to the free pages

> diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> index 14d14beb1f7f..b64ebf253381 100644
> --- a/include/linux/page_ref.h
> +++ b/include/linux/page_ref.h
> @@ -72,9 +72,12 @@ static inline int page_count(struct page *page)
> 	return atomic_read(&compound_head(page)->_refcount);
> }
> 
> +struct page *page_to_track;
> static inline void set_page_count(struct page *page, int v)
> {
> 	atomic_set(&page->_refcount, v);
> +	if (page == page_to_track)
> +		dump_stack();
> 	if (page_ref_tracepoint_active(__tracepoint_page_ref_set))
> 		__page_ref_set(page, v);
> }
> @@ -91,6 +94,8 @@ static inline void init_page_count(struct page *page)
> static inline void page_ref_add(struct page *page, int nr)
> {
> 	atomic_add(nr, &page->_refcount);
> +	if (page == page_to_track)
> +		dump_stack();
> 	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
> 		__page_ref_mod(page, nr);
> }
> @@ -105,6 +110,8 @@ static inline void page_ref_sub(struct page *page, int nr)
> static inline void page_ref_inc(struct page *page)
> {
> 	atomic_inc(&page->_refcount);
> +	if (page == page_to_track)
> +		dump_stack();
> 	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
> 		__page_ref_mod(page, 1);
> }
> @@ -129,6 +136,8 @@ static inline int page_ref_inc_return(struct page *page)
> {
> 	int ret = atomic_inc_return(&page->_refcount);
> 
> +	if (page == page_to_track)
> +		dump_stack();
> 	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_and_return))
> 		__page_ref_mod_and_return(page, 1, ret);
> 	return ret;
> @@ -156,6 +165,8 @@ static inline int page_ref_add_unless(struct page *page, int nr, int u)
> {
> 	int ret = atomic_add_unless(&page->_refcount, nr, u);
> 
> +	if (page == page_to_track)
> +		dump_stack();
> 	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_unless))
> 		__page_ref_mod_unless(page, nr, ret);
> 	return ret;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f7e4bfdc13b7..9b2e395a3d68 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1338,6 +1338,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> 	return rc;
> }
> 
> +struct page *page_to_track;
> +
> /*
>  * migrate_pages - migrate the pages specified in a list, to the free pages
>  * supplied as the target for the page migration
> @@ -1375,6 +1377,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> 	if (!swapwrite)
> 		current->flags |= PF_SWAPWRITE;
> 
> +	page_to_track = NULL;
> 	for (pass = 0; pass < 10 && retry; pass++) {
> 		retry = 0;
> 
> @@ -1417,6 +1420,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> 			goto out;
> 		case -EAGAIN:
> 			retry++;
> +			if (pass > 1 && !page_to_track)
> +				page_to_track = page;
> 			break;
> 		case MIGRATEPAGE_SUCCESS:
> 			nr_succeeded++;

--
Michal Hocko
SUSE Labs
Re: Memory hotplug softlock issue
On Fri 16-11-18 09:24:33, Baoquan He wrote: > On 11/15/18 at 03:32pm, Michal Hocko wrote: > > On Thu 15-11-18 21:38:40, Baoquan He wrote: > > > On 11/15/18 at 02:19pm, Michal Hocko wrote: > > > > On Thu 15-11-18 21:12:11, Baoquan He wrote: > > > > > On 11/15/18 at 09:30am, Michal Hocko wrote: > > > > [...] > > > > > > It would be also good to find out whether this is fs specific. E.g. > > > > > > does > > > > > > it make any difference if you use a different one for your stress > > > > > > testing? > > > > > > > > > > Created a ramdisk and put stress bin there, then run stress -m 200, > > > > > now > > > > > seems it's stuck in libc-2.28.so migrating. And it's still xfs. So > > > > > now xfs > > > > > is a big suspect. At bottom I paste numactl printing, you can see > > > > > that it's > > > > > the last 4G. > > > > > > > > > > Seems it's trying to migrate libc-2.28.so, but stress program keeps > > > > > trying to > > > > > access and activate it. > > > > > > > > Is this still with faultaround disabled? I have seen exactly same > > > > pattern in the bug I am working on. It was ext4 though. > > > > > > After a long time struggling, the last 2nd block where libc-2.28.so is > > > located is reclaimed, now it comes to the last memory block, still > > > stress program itself. swap migration entry has been made and trying to > > > unmap, now it's looping there. > > > > > > [ +0.004445] migrating pfn 190ff2bb0 failed > > > [ +0.13] page:ea643fcaec00 count:203 mapcount:201 > > > mapping:888dfb268f48 index:0x0 > > > [ +0.012809] shmem_aops > > > [ +0.11] name:"stress" > > > [ +0.002550] flags: > > > 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked) > > > [ +0.010715] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 > > > 888dfb268f48 > > > [ +0.007828] raw: 00cb00c8 > > > 888e72e92000 > > > [ +0.007810] page->mem_cgroup:888e72e92000 > > [...] > > > [ +0.004455] migrating pfn 190ff2bb0 failed > > > [ +0.18] page:ea643fcaec00 count:203 mapcount:201 > > > mapping:888dfb268f48 index:0x0 > > > [ +0.014392] shmem_aops > > > [ +0.10] name:"stress" > > > [ +0.002565] flags: > > > 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked) > > > [ +0.010675] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 > > > 888dfb268f48 > > > [ +0.007819] raw: 00cb00c8 > > > 888e72e92000 > > > [ +0.007808] page->mem_cgroup:888e72e92000 > > > > OK, so this is tmpfs backed code of your stree test. This just tells us > > that this is not fs specific. Reference count is 2 more than the map > > count which is the expected state. So the reference count must have been > > elevated at the time when the migration was attempted. Shmem supports > > fault around so this might be still possible (assuming it is enabled). > > If not we really need to dig deeper. I will think of a debugging patch. > > Disabled faultaround and reboot, test again, it's looping forever in the > last block again, on node2, stress progam itself again. The weird is > refcount seems to have been crazy, a random number now. There must be > something going wrong. Could you try to apply this debugging patch on top please? It will dump stack trace for each reference count elevation for one page that fails to migrate after multiple passes. 
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 14d14beb1f7f..b64ebf253381 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -72,9 +72,12 @@ static inline int page_count(struct page *page)
 	return atomic_read(&compound_head(page)->_refcount);
 }
 
+struct page *page_to_track;
 static inline void set_page_count(struct page *page, int v)
 {
 	atomic_set(&page->_refcount, v);
+	if (page == page_to_track)
+		dump_stack();
 	if (page_ref_tracepoint_active(__tracepoint_page_ref_set))
 		__page_ref_set(page, v);
 }
@@ -91,6 +94,8 @@ static inline void init_page_count(struct page *page)
 static inline void page_ref_add(struct page *page, int nr)
 {
 	atomic_add(nr, &page->_refcount);
+	if (page == page_to_track)
+		dump_stack();
 	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
 		__page_ref_mod(page, nr);
 }
@@ -105,6 +110,8 @@ static inline void page_ref_sub(struct page *page, int nr)
 static inline void page_ref_inc(struct page *page)
 {
 	atomic_inc(&page->_refcount);
+	if (page == page_to_track)
+		dump_stack();
 	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
 		__page_ref_mod(page, 1);
 }
@@ -129,6 +136,8 @@ static inline int page_ref_inc_return(struct page *page)
 {
 	int ret = atomic_inc_return(&page->_refcount);
 
+	if (page == page_to_track)
+		dump_stack();
 	if (page_ref_tracepoint_activ
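The trick the two patches implement is generic: pick one suspect object and log a backtrace on every reference-count change. A minimal userspace analogue of the same debugging pattern (purely illustrative; it is not the kernel patch, and obj, obj_to_track and obj_ref_inc are invented names) could look like this:

/* Build with: gcc -g -rdynamic track.c -o track */
#include <execinfo.h>
#include <stdatomic.h>
#include <stdio.h>

struct obj {
	atomic_int refcount;
};

static struct obj *obj_to_track;	/* the one suspect object */

static void dump_stack_user(void)
{
	void *frames[32];
	int n = backtrace(frames, 32);

	backtrace_symbols_fd(frames, n, 2);	/* symbolized trace to stderr */
}

static void obj_ref_inc(struct obj *o)
{
	atomic_fetch_add(&o->refcount, 1);
	if (o == obj_to_track)
		dump_stack_user();		/* who is bumping the count? */
}

int main(void)
{
	struct obj o = { .refcount = 1 };

	obj_to_track = &o;	/* start tracking, like migrate_pages() above */
	obj_ref_inc(&o);	/* this elevation now logs a backtrace */
	return 0;
}

With the two kernel patches combined, the first page that keeps returning -EAGAIN after pass 1 becomes the tracked page, and every subsequent refcount change on it shows up in dmesg together with the call chain responsible.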
Re: Memory hotplug softlock issue
On 11/15/18 at 03:32pm, Michal Hocko wrote:
> On Thu 15-11-18 21:38:40, Baoquan He wrote:
> > On 11/15/18 at 02:19pm, Michal Hocko wrote:
> > > On Thu 15-11-18 21:12:11, Baoquan He wrote:
> > > > On 11/15/18 at 09:30am, Michal Hocko wrote:
> > > [...]
> > > > > It would be also good to find out whether this is fs specific. E.g. does
> > > > > it make any difference if you use a different one for your stress
> > > > > testing?
> > > >
> > > > Created a ramdisk and put stress bin there, then run stress -m 200, now
> > > > seems it's stuck in libc-2.28.so migrating. And it's still xfs. So now xfs
> > > > is a big suspect. At bottom I paste numactl printing, you can see that it's
> > > > the last 4G.
> > > >
> > > > Seems it's trying to migrate libc-2.28.so, but stress program keeps trying to
> > > > access and activate it.
> > >
> > > Is this still with faultaround disabled? I have seen exactly same
> > > pattern in the bug I am working on. It was ext4 though.
> >
> > After a long time struggling, the last 2nd block where libc-2.28.so is
> > located is reclaimed, now it comes to the last memory block, still
> > stress program itself. swap migration entry has been made and trying to
> > unmap, now it's looping there.
> >
> > [ +0.004445] migrating pfn 190ff2bb0 failed
> > [ +0.13] page:ea643fcaec00 count:203 mapcount:201 mapping:888dfb268f48 index:0x0
> > [ +0.012809] shmem_aops
> > [ +0.11] name:"stress"
> > [ +0.002550] flags: 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked)
> > [ +0.010715] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 888dfb268f48
> > [ +0.007828] raw: 00cb00c8 888e72e92000
> > [ +0.007810] page->mem_cgroup:888e72e92000
> [...]
> > [ +0.004455] migrating pfn 190ff2bb0 failed
> > [ +0.18] page:ea643fcaec00 count:203 mapcount:201 mapping:888dfb268f48 index:0x0
> > [ +0.014392] shmem_aops
> > [ +0.10] name:"stress"
> > [ +0.002565] flags: 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked)
> > [ +0.010675] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 888dfb268f48
> > [ +0.007819] raw: 00cb00c8 888e72e92000
> > [ +0.007808] page->mem_cgroup:888e72e92000
>
> OK, so this is tmpfs-backed code of your stress test. This just tells us
> that this is not fs specific. Reference count is 2 more than the map
> count which is the expected state. So the reference count must have been
> elevated at the time when the migration was attempted. Shmem supports
> fault around so this might be still possible (assuming it is enabled).
> If not we really need to dig deeper. I will think of a debugging patch.

Disabled faultaround, rebooted and tested again: it's looping forever in the last block again, on node2, on the stress program itself again. The weird thing is that the refcount seems to have gone crazy, a random number now. There must be something going wrong.
[ +0.058624] migrating pfn 80fd6fbe failed
[ +0.03] page:ea203f5bef80 count:336 mapcount:201 mapping:888e1c9357d8 index:0x2
[ +0.014122] shmem_aops
[ +0.00] name:"stress"
[ +0.002467] flags: 0x9fc008000e(referenced|uptodate|dirty|swapbacked)
[ +0.009511] raw: 009fc008000e c90e3d80 c90e3d80 888e1c9357d8
[ +0.007743] raw: 0002 00cb00c8 888e2233d000
[ +0.007740] page->mem_cgroup:888e2233d000
[ +0.038916] migrating pfn 80fd6fbe failed
[ +0.03] page:ea203f5bef80 count:349 mapcount:201 mapping:888e1c9357d8 index:0x2
[ +0.012453] shmem_aops
[ +0.01] name:"stress"
[ +0.002641] flags: 0x9fc008000e(referenced|uptodate|dirty|swapbacked)
[ +0.009501] raw: 009fc008000e c90e3d80 c90e3d80 888e1c9357d8
[ +0.007746] raw: 0002 00cb00c8 888e2233d000
[ +0.007740] page->mem_cgroup:888e2233d000
[ +0.061226] migrating pfn 80fd6fbe failed
[ +0.04] page:ea203f5bef80 count:276 mapcount:201 mapping:888e1c9357d8 index:0x2
[ +0.014129] shmem_aops
[ +0.02] name:"stress"
[ +0.003246] flags: 0x9fc008008e(waiters|referenced|uptodate|dirty|swapbacked)
[ +0.010183] raw: 009fc008008e c90e3d80 c90e3d80 888e1c9357d8
[ +0.007742] raw: 0002 00cb00c8 888e2233d000
[ +0.007733] page->mem_cgroup:888e2233d000
[ +0.037305] migrating pfn 80fd6fbe failed
[ +0.03] page:ea203f5bef80 count:304 mapcount:201 mapping:888e1c9357d8 index:0x2
[ +0.012449] shmem_aops
[ +0.02] name:"stress"
[ +0.002469] flags: 0x9fc008000e(referenced|uptodate|dirty|swapbacked)
[ +0.009495] raw: 009fc008000e c90e3d80 c90e3d80 888e1c9357d8
[ +0.007743] raw: 0002 00cb
Re: Memory hotplug softlock issue
On 11/15/18 at 03:32pm, Michal Hocko wrote: > On Thu 15-11-18 21:38:40, Baoquan He wrote: > > On 11/15/18 at 02:19pm, Michal Hocko wrote: > > > On Thu 15-11-18 21:12:11, Baoquan He wrote: > > > > On 11/15/18 at 09:30am, Michal Hocko wrote: > > > [...] > > > > > It would be also good to find out whether this is fs specific. E.g. > > > > > does > > > > > it make any difference if you use a different one for your stress > > > > > testing? > > > > > > > > Created a ramdisk and put stress bin there, then run stress -m 200, now > > > > seems it's stuck in libc-2.28.so migrating. And it's still xfs. So now > > > > xfs > > > > is a big suspect. At bottom I paste numactl printing, you can see that > > > > it's > > > > the last 4G. > > > > > > > > Seems it's trying to migrate libc-2.28.so, but stress program keeps > > > > trying to > > > > access and activate it. > > > > > > Is this still with faultaround disabled? I have seen exactly same > > > pattern in the bug I am working on. It was ext4 though. > > > > After a long time struggling, the last 2nd block where libc-2.28.so is > > located is reclaimed, now it comes to the last memory block, still > > stress program itself. swap migration entry has been made and trying to > > unmap, now it's looping there. > > > > [ +0.004445] migrating pfn 190ff2bb0 failed > > [ +0.13] page:ea643fcaec00 count:203 mapcount:201 > > mapping:888dfb268f48 index:0x0 > > [ +0.012809] shmem_aops > > [ +0.11] name:"stress" > > [ +0.002550] flags: > > 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked) > > [ +0.010715] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 > > 888dfb268f48 > > [ +0.007828] raw: 00cb00c8 > > 888e72e92000 > > [ +0.007810] page->mem_cgroup:888e72e92000 > [...] > > [ +0.004455] migrating pfn 190ff2bb0 failed > > [ +0.18] page:ea643fcaec00 count:203 mapcount:201 > > mapping:888dfb268f48 index:0x0 > > [ +0.014392] shmem_aops > > [ +0.10] name:"stress" > > [ +0.002565] flags: > > 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked) > > [ +0.010675] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 > > 888dfb268f48 > > [ +0.007819] raw: 00cb00c8 > > 888e72e92000 > > [ +0.007808] page->mem_cgroup:888e72e92000 > > OK, so this is tmpfs backed code of your stree test. This just tells us > that this is not fs specific. Reference count is 2 more than the map > count which is the expected state. So the reference count must have been > elevated at the time when the migration was attempted. Shmem supports > fault around so this might be still possible (assuming it is enabled). > If not we really need to dig deeper. I will think of a debugging patch. Yes, faultaround is enabled. Will disable it and test again. Will report test result.
Re: Memory hotplug softlock issue
On Thu 15-11-18 21:38:40, Baoquan He wrote:
> On 11/15/18 at 02:19pm, Michal Hocko wrote:
> > On Thu 15-11-18 21:12:11, Baoquan He wrote:
> > > On 11/15/18 at 09:30am, Michal Hocko wrote:
> > [...]
> > > > It would be also good to find out whether this is fs specific. E.g. does
> > > > it make any difference if you use a different one for your stress
> > > > testing?
> > >
> > > Created a ramdisk and put stress bin there, then run stress -m 200, now
> > > seems it's stuck in libc-2.28.so migrating. And it's still xfs. So now xfs
> > > is a big suspect. At bottom I paste numactl printing, you can see that it's
> > > the last 4G.
> > >
> > > Seems it's trying to migrate libc-2.28.so, but stress program keeps trying to
> > > access and activate it.
> >
> > Is this still with faultaround disabled? I have seen exactly same
> > pattern in the bug I am working on. It was ext4 though.
>
> After a long time struggling, the last 2nd block where libc-2.28.so is
> located is reclaimed, now it comes to the last memory block, still
> stress program itself. swap migration entry has been made and trying to
> unmap, now it's looping there.
>
> [ +0.004445] migrating pfn 190ff2bb0 failed
> [ +0.13] page:ea643fcaec00 count:203 mapcount:201 mapping:888dfb268f48 index:0x0
> [ +0.012809] shmem_aops
> [ +0.11] name:"stress"
> [ +0.002550] flags: 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked)
> [ +0.010715] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 888dfb268f48
> [ +0.007828] raw: 00cb00c8 888e72e92000
> [ +0.007810] page->mem_cgroup:888e72e92000
[...]
> [ +0.004455] migrating pfn 190ff2bb0 failed
> [ +0.18] page:ea643fcaec00 count:203 mapcount:201 mapping:888dfb268f48 index:0x0
> [ +0.014392] shmem_aops
> [ +0.10] name:"stress"
> [ +0.002565] flags: 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked)
> [ +0.010675] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 888dfb268f48
> [ +0.007819] raw: 00cb00c8 888e72e92000
> [ +0.007808] page->mem_cgroup:888e72e92000

OK, so this is tmpfs-backed code of your stress test. This just tells us that the problem is not fs specific. The reference count is 2 more than the map count, which is the expected state. So the reference count must have been elevated at the time when the migration was attempted. Shmem supports fault-around, so this might still be possible (assuming it is enabled). If not, we really need to dig deeper. I will think of a debugging patch.
--
Michal Hocko
SUSE Labs
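To make that arithmetic concrete: for a mapped page-cache page that migration has isolated, the references accounted for are one per mapping (the mapcount), one for the page cache, and one held by the migration code itself, hence "2 more than the map count". A toy check (a sketch of the bookkeeping only, with invented names; this is not the kernel's actual helper):

#include <stdbool.h>
#include <stdio.h>

/* A page whose only references are the accounted ones is safe to migrate. */
static bool refs_quiescent(int count, int mapcount)
{
	return count == mapcount + 2;	/* mapcount + page cache + migration */
}

int main(void)
{
	/* count:203 mapcount:201 from the dump above: the expected state */
	printf("203 vs 201 -> %s\n", refs_quiescent(203, 201) ? "quiescent" : "elevated");
	/* count:349 mapcount:201 from the later dumps: third parties hold refs */
	printf("349 vs 201 -> %s\n", refs_quiescent(349, 201) ? "quiescent" : "elevated");
	return 0;
}

By this measure the earlier 203/201 dumps show a page that is merely caught at a bad moment, while the 336/349/276/304 counts in the later dumps show dozens of transient references in flight at once.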
Re: Memory hotplug softlock issue
On Thu 15-11-18 21:23:42, Baoquan He wrote:
> On 11/15/18 at 02:19pm, Michal Hocko wrote:
> > On Thu 15-11-18 21:12:11, Baoquan He wrote:
> > > On 11/15/18 at 09:30am, Michal Hocko wrote:
> > [...]
> > > > It would be also good to find out whether this is fs specific. E.g. does
> > > > it make any difference if you use a different one for your stress
> > > > testing?
> > >
> > > Created a ramdisk and put stress bin there, then run stress -m 200, now
> > > seems it's stuck in libc-2.28.so migrating. And it's still xfs. So now xfs
> > > is a big suspect. At bottom I paste numactl printing, you can see that it's
> > > the last 4G.
> > >
> > > Seems it's trying to migrate libc-2.28.so, but stress program keeps trying to
> > > access and activate it.
> >
> > Is this still with faultaround disabled? I have seen exactly same
> > pattern in the bug I am working on. It was ext4 though.
>
> No, forgot disabling faultaround after reboot. Do we need to disable it and
> retest?

No, faultaround is checked at the time of the fault. The reason why I am suspecting this path is that it can elevate the reference count before taking the lock. The normal page fault path should lock the page first. And we hold the lock while trying to migrate that page.
--
Michal Hocko
SUSE Labs
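The suspected ordering can be shown with a reduced userspace model (an assumed model with invented names, not the actual kernel paths): one thread plays migration, taking the lock and then requiring the refcount to be quiescent, while the other plays faultaround, bumping the reference without ever taking the lock:

/* Build with: gcc -pthread race.c -o race */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int refcount = 2;		/* mapcount + page cache, say */
static atomic_bool stop;

/* Faultaround-style path: takes a reference with no lock held. */
static void *ref_bumper(void *arg)
{
	(void)arg;
	while (!atomic_load(&stop)) {
		atomic_fetch_add(&refcount, 1);	/* ref first ... */
		atomic_fetch_sub(&refcount, 1);	/* ... dropped again later */
	}
	return NULL;
}

int main(void)
{
	pthread_t t;
	int attempts, failed = 0;

	pthread_create(&t, NULL, ref_bumper, NULL);
	for (attempts = 0; attempts < 1000000; attempts++) {
		pthread_mutex_lock(&page_lock);	/* "migration" owns the lock */
		if (atomic_load(&refcount) != 2)
			failed++;		/* would be -EAGAIN */
		pthread_mutex_unlock(&page_lock);
	}
	atomic_store(&stop, 1);
	pthread_join(t, NULL);
	printf("%d of %d attempts saw an elevated refcount\n", failed, attempts);
	return 0;
}

Holding the lock buys the "migration" thread nothing here, because the bumper never needs it; that is why the lock-first fault path is harmless to migration while a ref-first path is not.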
Re: Memory hotplug softlock issue
On 11/15/18 at 02:19pm, Michal Hocko wrote: > On Thu 15-11-18 21:12:11, Baoquan He wrote: > > On 11/15/18 at 09:30am, Michal Hocko wrote: > [...] > > > It would be also good to find out whether this is fs specific. E.g. does > > > it make any difference if you use a different one for your stress > > > testing? > > > > Created a ramdisk and put stress bin there, then run stress -m 200, now > > seems it's stuck in libc-2.28.so migrating. And it's still xfs. So now xfs > > is a big suspect. At bottom I paste numactl printing, you can see that it's > > the last 4G. > > > > Seems it's trying to migrate libc-2.28.so, but stress program keeps trying > > to > > access and activate it. > > Is this still with faultaround disabled? I have seen exactly same > pattern in the bug I am working on. It was ext4 though. After a long time struggling, the last 2nd block where libc-2.28.so is located is reclaimed, now it comes to the last memory block, still stress program itself. swap migration entry has been made and trying to unmap, now it's looping there. [ +0.004445] migrating pfn 190ff2bb0 failed [ +0.13] page:ea643fcaec00 count:203 mapcount:201 mapping:888dfb268f48 index:0x0 [ +0.012809] shmem_aops [ +0.11] name:"stress" [ +0.002550] flags: 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked) [ +0.010715] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 888dfb268f48 [ +0.007828] raw: 00cb00c8 888e72e92000 [ +0.007810] page->mem_cgroup:888e72e92000 [ +0.004466] migrating pfn 190ff2bb1 failed [ +0.13] page:ea643fcaec40 count:203 mapcount:201 mapping:888dfb268f48 index:0x2 [ +0.014321] shmem_aops [ +0.24] name:"stress" [ +0.002535] flags: 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked) [ +0.010680] raw: 01dfc008004e c90e3d80 ea643fcaec08 888dfb268f48 [ +0.007863] raw: 0002 00cb00c8 888e72e92000 [ +0.007828] page->mem_cgroup:888e72e92000 [ +1.357302] migrating pfn 190ff1c53 failed [ +0.11] page:ea643fc714c0 count:202 mapcount:201 mapping:888e5e395109 index:0x28 [ +0.012756] anon [ +0.24] flags: 0x1dfc008004c(uptodate|dirty|workingset|swapbacked) [ +0.008961] raw: 01dfc008004c ea643fcaec08 c90e3d80 888e5e395109 [ +0.007847] raw: 0028 00ca00c8 888e72e92000 [ +0.007844] page->mem_cgroup:888e72e92000 [ +0.004455] migrating pfn 190ff2bb0 failed [ +0.18] page:ea643fcaec00 count:203 mapcount:201 mapping:888dfb268f48 index:0x0 [ +0.014392] shmem_aops [ +0.10] name:"stress" [ +0.002565] flags: 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked) [ +0.010675] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 888dfb268f48 [ +0.007819] raw: 00cb00c8 888e72e92000 [ +0.007808] page->mem_cgroup:888e72e92000 [ +0.004431] migrating pfn 190ff2bb1 failed [ +0.19] page:ea643fcaec40 count:203 mapcount:201 mapping:888dfb268f48 index:0x2 [ +0.012688] shmem_aops [ +0.12] name:"stress" [ +0.002525] flags: 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked) [ +0.012385] raw: 01dfc008004e c90e3d80 ea643fcaec08 888dfb268f48 [ +0.007840] raw: 0002 00cb00c8 888e72e92000 [ +0.007832] page->mem_cgroup:888e72e92000
Re: Memory hotplug softlock issue
On 11/15/18 at 02:19pm, Michal Hocko wrote: > On Thu 15-11-18 21:12:11, Baoquan He wrote: > > On 11/15/18 at 09:30am, Michal Hocko wrote: > [...] > > > It would be also good to find out whether this is fs specific. E.g. does > > > it make any difference if you use a different one for your stress > > > testing? > > > > Created a ramdisk and put stress bin there, then run stress -m 200, now > > seems it's stuck in libc-2.28.so migrating. And it's still xfs. So now xfs > > is a big suspect. At bottom I paste numactl printing, you can see that it's > > the last 4G. > > > > Seems it's trying to migrate libc-2.28.so, but stress program keeps trying > > to > > access and activate it. > > Is this still with faultaround disabled? I have seen exactly same > pattern in the bug I am working on. It was ext4 though. No, forgot disabling faultround after reboot. Do we need to disable it and retest? > > > [ 5055.461652] migrating pfn 190f4fb3e failed > > [ 5055.461671] page:ea643d3ecf80 count:257 mapcount:251 > > mapping:888e7a6ac528 index:0x85 > > [ 5055.474734] xfs_address_space_operations [xfs] > > [ 5055.474742] name:"libc-2.28.so" > > [ 5055.481070] flags: 0x1dfc026(referenced|uptodate|active) > > [ 5055.490329] raw: 01dfc026 c90e3d80 c90e3d80 > > 888e7a6ac528 > > [ 5055.498080] raw: 0085 00fc00f9 > > 88810a8f2000 > > [ 5055.505823] page->mem_cgroup:88810a8f2000 > > [ 5056.335970] migrating pfn 190f4fb3e failed > > [ 5056.335990] page:ea643d3ecf80 count:255 mapcount:250 > > mapping:888e7a6ac528 index:0x85 > > [ 5056.348994] xfs_address_space_operations [xfs] > > [ 5056.348998] name:"libc-2.28.so" > > [ 5056.353555] flags: 0x1dfc026(referenced|uptodate|active) > > [ 5056.364680] raw: 01dfc026 c90e3d80 c90e3d80 > > 888e7a6ac528 > > [ 5056.372428] raw: 0085 00fc00f9 > > 88810a8f2000 > > [ 5056.380172] page->mem_cgroup:88810a8f2000 > > [ 5057.332806] migrating pfn 190f4fb3e failed > > [ 5057.332821] page:ea643d3ecf80 count:261 mapcount:250 > > mapping:888e7a6ac528 index:0x85 > > [ 5057.345889] xfs_address_space_operations [xfs] > > [ 5057.345900] name:"libc-2.28.so" > > [ 5057.350451] flags: 0x1dfc026(referenced|uptodate|active) > > [ 5057.359707] raw: 01dfc026 c90e3d80 c90e3d80 > > 888e7a6ac528 > > [ 5057.369285] raw: 0085 00fc00f9 > > 88810a8f2000 > > [ 5057.377030] page->mem_cgroup:88810a8f2000 > > [ 5058.285457] migrating pfn 190f4fb3e failed > > [ 5058.285489] page:ea643d3ecf80 count:257 mapcount:250 > > mapping:888e7a6ac528 index:0x85 > > [ 5058.298544] xfs_address_space_operations [xfs] > > [ 5058.298556] name:"libc-2.28.so" > > [ 5058.303092] flags: 0x1dfc026(referenced|uptodate|active) > > [ 5058.314358] raw: 01dfc026 c90e3d80 c90e3d80 > > 888e7a6ac528 > > [ 5058.322109] raw: 0085 00fc00f9 > > 88810a8f2000 > > [ 5058.329848] page->mem_cgroup:88810a8f2000 > -- > Michal Hocko > SUSE Labs
Re: Memory hotplug softlock issue
On Thu 15-11-18 21:12:11, Baoquan He wrote: > On 11/15/18 at 09:30am, Michal Hocko wrote: [...] > > It would be also good to find out whether this is fs specific. E.g. does > > it make any difference if you use a different one for your stress > > testing? > > Created a ramdisk and put stress bin there, then run stress -m 200, now > seems it's stuck in libc-2.28.so migrating. And it's still xfs. So now xfs > is a big suspect. At bottom I paste numactl printing, you can see that it's > the last 4G. > > Seems it's trying to migrate libc-2.28.so, but stress program keeps trying to > access and activate it. Is this still with faultaround disabled? I have seen exactly same pattern in the bug I am working on. It was ext4 though. > [ 5055.461652] migrating pfn 190f4fb3e failed > [ 5055.461671] page:ea643d3ecf80 count:257 mapcount:251 > mapping:888e7a6ac528 index:0x85 > [ 5055.474734] xfs_address_space_operations [xfs] > [ 5055.474742] name:"libc-2.28.so" > [ 5055.481070] flags: 0x1dfc026(referenced|uptodate|active) > [ 5055.490329] raw: 01dfc026 c90e3d80 c90e3d80 > 888e7a6ac528 > [ 5055.498080] raw: 0085 00fc00f9 > 88810a8f2000 > [ 5055.505823] page->mem_cgroup:88810a8f2000 > [ 5056.335970] migrating pfn 190f4fb3e failed > [ 5056.335990] page:ea643d3ecf80 count:255 mapcount:250 > mapping:888e7a6ac528 index:0x85 > [ 5056.348994] xfs_address_space_operations [xfs] > [ 5056.348998] name:"libc-2.28.so" > [ 5056.353555] flags: 0x1dfc026(referenced|uptodate|active) > [ 5056.364680] raw: 01dfc026 c90e3d80 c90e3d80 > 888e7a6ac528 > [ 5056.372428] raw: 0085 00fc00f9 > 88810a8f2000 > [ 5056.380172] page->mem_cgroup:88810a8f2000 > [ 5057.332806] migrating pfn 190f4fb3e failed > [ 5057.332821] page:ea643d3ecf80 count:261 mapcount:250 > mapping:888e7a6ac528 index:0x85 > [ 5057.345889] xfs_address_space_operations [xfs] > [ 5057.345900] name:"libc-2.28.so" > [ 5057.350451] flags: 0x1dfc026(referenced|uptodate|active) > [ 5057.359707] raw: 01dfc026 c90e3d80 c90e3d80 > 888e7a6ac528 > [ 5057.369285] raw: 0085 00fc00f9 > 88810a8f2000 > [ 5057.377030] page->mem_cgroup:88810a8f2000 > [ 5058.285457] migrating pfn 190f4fb3e failed > [ 5058.285489] page:ea643d3ecf80 count:257 mapcount:250 > mapping:888e7a6ac528 index:0x85 > [ 5058.298544] xfs_address_space_operations [xfs] > [ 5058.298556] name:"libc-2.28.so" > [ 5058.303092] flags: 0x1dfc026(referenced|uptodate|active) > [ 5058.314358] raw: 01dfc026 c90e3d80 c90e3d80 > 888e7a6ac528 > [ 5058.322109] raw: 0085 00fc00f9 > 88810a8f2000 > [ 5058.329848] page->mem_cgroup:88810a8f2000 -- Michal Hocko SUSE Labs
Re: Memory hotplug softlock issue
On 11/15/18 at 09:30am, Michal Hocko wrote: > On Thu 15-11-18 15:53:56, Baoquan He wrote: > > On 11/15/18 at 08:30am, Michal Hocko wrote: > > > On Thu 15-11-18 13:10:34, Baoquan He wrote: > > > > On 11/14/18 at 04:00pm, Michal Hocko wrote: > > > > > On Wed 14-11-18 22:52:50, Baoquan He wrote: > > > > > > On 11/14/18 at 10:01am, Michal Hocko wrote: > > > > > > > I have seen an issue when the migration cannot make a forward > > > > > > > progress > > > > > > > because of a glibc page with a reference count bumping up and > > > > > > > down. Most > > > > > > > probable explanation is the faultaround code. I am working on > > > > > > > this and > > > > > > > will post a patch soon. In any case the migration should converge > > > > > > > and if > > > > > > > it doesn't do then there is a bug lurking somewhere. > > > > > > > > > > > > > > Failing on ENOMEM is a questionable thing. I haven't seen that > > > > > > > happening > > > > > > > wildly but if it is a case then I wouldn't be opposed. > > > > > > > > > > > > Applied your debugging patches, it helps a lot to printing message. > > > > > > > > > > > > Below is the dmesg log about the migrating failure. It can't pass > > > > > > migrate_pages() and loop forever. > > > > > > > > > > > > [ +0.083841] migrating pfn 10fff7d0 failed > > > > > > [ +0.05] page:ea043ffdf400 count:208 mapcount:201 > > > > > > mapping:888dff4bdda8 index:0x2 > > > > > > [ +0.012689] xfs_address_space_operations [xfs] > > > > > > [ +0.30] name:"stress" > > > > > > [ +0.004556] flags: 0x5fc004(uptodate) > > > > > > [ +0.007339] raw: 005fc004 c90e3d80 > > > > > > c90e3d80 888dff4bdda8 > > > > > > [ +0.009488] raw: 0002 > > > > > > 00cb00c8 888e7353d000 > > > > > > [ +0.007726] page->mem_cgroup:888e7353d000 > > > > > > [ +0.084538] migrating pfn 10fff7d0 failed > > > > > > [ +0.06] page:ea043ffdf400 count:210 mapcount:201 > > > > > > mapping:888dff4bdda8 index:0x2 > > > > > > [ +0.012798] xfs_address_space_operations [xfs] > > > > > > [ +0.34] name:"stress" > > > > > > [ +0.004524] flags: 0x5fc004(uptodate) > > > > > > [ +0.007068] raw: 005fc004 c90e3d80 > > > > > > c90e3d80 888dff4bdda8 > > > > > > [ +0.009359] raw: 0002 > > > > > > 00cb00c8 888e7353d000 > > > > > > [ +0.007728] page->mem_cgroup:888e7353d000 > > > > > > > > > > I wouldn't be surprised if this was a similar/same issue I've been > > > > > chasing recently. Could you try to disable faultaround to see if that > > > > > helps. It seems that it helped in my particular case but I am still > > > > > waiting for the final good-to-go to post the patch as I do not own the > > > > > workload which triggered that issue. > > > > > > > > Tried, still stuck in last block sometime. Usually after several times > > > > of hotplug/unplug. If stop stress program, the last block will be > > > > offlined immediately. > > > > > > Is the pattern still the same? I mean failing over few pages with > > > reference count jumping up and down between attempts? > > > > ->count jumping up and down, mapcount stays the same value. > > > > > > > > > [root@ ~]# cat /sys/kernel/debug/fault_around_bytes > > > > 4096 > > > > > > Can you make it 0? > > > > I executed 'echo 0 > fault_around_bytes', value less than one page size > > will round up to one page. > > OK, I have missed that. So then there must be a different source of the > page count volatility. Is it always the same file? > > I think we can rule out memory reclaim because that depends on the page > lock. Is the stress test hitting on memory compaction? 
> In other words, are
>     grep compact /proc/vmstat
> counters changing during the offline test heavily? I am asking because I
> do not see compaction pfn walkers skipping over MIGRATE_ISOLATE
> pageblocks. But I might be missing something easily.
>
> It would be also good to find out whether this is fs specific. E.g. does
> it make any difference if you use a different one for your stress
> testing?

Created a ramdisk and put the stress bin there, then ran stress -m 200; now it seems stuck migrating libc-2.28.so. And it's still xfs, so xfs is now a big suspect. At the bottom I paste the numactl printout; you can see that it's the last 4G.

It seems to be trying to migrate libc-2.28.so, but the stress program keeps trying to access and activate it.

[ 5055.461652] migrating pfn 190f4fb3e failed
[ 5055.461671] page:ea643d3ecf80 count:257 mapcount:251 mapping:888e7a6ac528 index:0x85
[ 5055.474734] xfs_address_space_operations [xfs]
[ 5055.474742] name:"libc-2.28.so"
[ 5055.481070] flags: 0x1dfc026(referenced|uptodate|active)
[ 5055.490329] raw: 01dfc026 c90e3d80 c90e3d80 888e7a6ac528
[ 5055.498080] raw: 0085 00fc00f9 88810a8f2000
[ 5055.505823]
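One way to do that sampling continuously (a trivial helper written for this purpose, not something from the thread) is to reread /proc/vmstat once a second and print only the compaction counters:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char line[128];

	for (;;) {	/* run alongside the offline test; stop with Ctrl-C */
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f) {
			perror("/proc/vmstat");
			return 1;
		}
		while (fgets(line, sizeof(line), f))
			if (strstr(line, "compact"))
				fputs(line, stdout);	/* compact_* counters */
		fclose(f);
		puts("---");
		sleep(1);
	}
}

If the compact_* counters stay flat across samples while the refcount keeps jumping, compaction can be ruled out as the source of the transient references.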
Re: Memory hotplug softlock issue
On 15.11.18 10:52, Baoquan He wrote:
> On 11/15/18 at 10:42am, David Hildenbrand wrote:
>> I am wondering why it is always the last memory block of that device
>> (and even that node). Coincidence?
>
> I remember one or two times it was the last 6G or 4G which stalled there;
> the size of a memory block is 2G. But most of the time it is the last memory
> block. And from the debug printing added by Michal's patch, it is the
> stress program itself which owns the migrating page and loops forever
> there.

Alright, seems like you found a very good reproducer for an existing problem :)

--
Thanks,

David / dhildenb
Re: Memory hotplug softlock issue
On 11/15/18 at 10:42am, David Hildenbrand wrote:
> I am wondering why it is always the last memory block of that device
> (and even that node). Coincidence?

I remember one or two times it was the last 6G or 4G which stalled there; the size of a memory block is 2G. But most of the time it is the last memory block. And from the debug printing added by Michal's patch, it is the stress program itself which owns the migrating page and loops forever there.
Re: Memory hotplug softlock issue
On 15.11.18 09:30, Michal Hocko wrote: > On Thu 15-11-18 15:53:56, Baoquan He wrote: >> On 11/15/18 at 08:30am, Michal Hocko wrote: >>> On Thu 15-11-18 13:10:34, Baoquan He wrote: On 11/14/18 at 04:00pm, Michal Hocko wrote: > On Wed 14-11-18 22:52:50, Baoquan He wrote: >> On 11/14/18 at 10:01am, Michal Hocko wrote: >>> I have seen an issue when the migration cannot make a forward progress >>> because of a glibc page with a reference count bumping up and down. Most >>> probable explanation is the faultaround code. I am working on this and >>> will post a patch soon. In any case the migration should converge and if >>> it doesn't do then there is a bug lurking somewhere. >>> >>> Failing on ENOMEM is a questionable thing. I haven't seen that happening >>> wildly but if it is a case then I wouldn't be opposed. >> >> Applied your debugging patches, it helps a lot to printing message. >> >> Below is the dmesg log about the migrating failure. It can't pass >> migrate_pages() and loop forever. >> >> [ +0.083841] migrating pfn 10fff7d0 failed >> [ +0.05] page:ea043ffdf400 count:208 mapcount:201 >> mapping:888dff4bdda8 index:0x2 >> [ +0.012689] xfs_address_space_operations [xfs] >> [ +0.30] name:"stress" >> [ +0.004556] flags: 0x5fc004(uptodate) >> [ +0.007339] raw: 005fc004 c90e3d80 c90e3d80 >> 888dff4bdda8 >> [ +0.009488] raw: 0002 00cb00c8 >> 888e7353d000 >> [ +0.007726] page->mem_cgroup:888e7353d000 >> [ +0.084538] migrating pfn 10fff7d0 failed >> [ +0.06] page:ea043ffdf400 count:210 mapcount:201 >> mapping:888dff4bdda8 index:0x2 >> [ +0.012798] xfs_address_space_operations [xfs] >> [ +0.34] name:"stress" >> [ +0.004524] flags: 0x5fc004(uptodate) >> [ +0.007068] raw: 005fc004 c90e3d80 c90e3d80 >> 888dff4bdda8 >> [ +0.009359] raw: 0002 00cb00c8 >> 888e7353d000 >> [ +0.007728] page->mem_cgroup:888e7353d000 > > I wouldn't be surprised if this was a similar/same issue I've been > chasing recently. Could you try to disable faultaround to see if that > helps. It seems that it helped in my particular case but I am still > waiting for the final good-to-go to post the patch as I do not own the > workload which triggered that issue. Tried, still stuck in last block sometime. Usually after several times of hotplug/unplug. If stop stress program, the last block will be offlined immediately. >>> >>> Is the pattern still the same? I mean failing over few pages with >>> reference count jumping up and down between attempts? >> >> ->count jumping up and down, mapcount stays the same value. >> >>> [root@ ~]# cat /sys/kernel/debug/fault_around_bytes 4096 >>> >>> Can you make it 0? >> >> I executed 'echo 0 > fault_around_bytes', value less than one page size >> will round up to one page. > > OK, I have missed that. So then there must be a different source of the > page count volatility. Is it always the same file? > > I think we can rule out memory reclaim because that depends on the page > lock. Is the stress test hitting on memory compaction? In other words, > are > grep compact /proc/vmstat > counters changing during the offline test heavily? I am asking because I > do not see compaction pfn walkers skipping over MIGRATE_ISOLATE > pageblocks. But I might be missing something easily. > > It would be also good to find out whether this is fs specific. E.g. does > it make any difference if you use a different one for your stress > testing? > I am wondering why it is always the last memory block of that device (and even that node). Coincidence? -- Thanks, David / dhildenb
Re: Memory hotplug softlock issue
On Thu 15-11-18 15:53:56, Baoquan He wrote: > On 11/15/18 at 08:30am, Michal Hocko wrote: > > On Thu 15-11-18 13:10:34, Baoquan He wrote: > > > On 11/14/18 at 04:00pm, Michal Hocko wrote: > > > > On Wed 14-11-18 22:52:50, Baoquan He wrote: > > > > > On 11/14/18 at 10:01am, Michal Hocko wrote: > > > > > > I have seen an issue when the migration cannot make a forward > > > > > > progress > > > > > > because of a glibc page with a reference count bumping up and down. > > > > > > Most > > > > > > probable explanation is the faultaround code. I am working on this > > > > > > and > > > > > > will post a patch soon. In any case the migration should converge > > > > > > and if > > > > > > it doesn't do then there is a bug lurking somewhere. > > > > > > > > > > > > Failing on ENOMEM is a questionable thing. I haven't seen that > > > > > > happening > > > > > > wildly but if it is a case then I wouldn't be opposed. > > > > > > > > > > Applied your debugging patches, it helps a lot to printing message. > > > > > > > > > > Below is the dmesg log about the migrating failure. It can't pass > > > > > migrate_pages() and loop forever. > > > > > > > > > > [ +0.083841] migrating pfn 10fff7d0 failed > > > > > [ +0.05] page:ea043ffdf400 count:208 mapcount:201 > > > > > mapping:888dff4bdda8 index:0x2 > > > > > [ +0.012689] xfs_address_space_operations [xfs] > > > > > [ +0.30] name:"stress" > > > > > [ +0.004556] flags: 0x5fc004(uptodate) > > > > > [ +0.007339] raw: 005fc004 c90e3d80 c90e3d80 > > > > > 888dff4bdda8 > > > > > [ +0.009488] raw: 0002 00cb00c8 > > > > > 888e7353d000 > > > > > [ +0.007726] page->mem_cgroup:888e7353d000 > > > > > [ +0.084538] migrating pfn 10fff7d0 failed > > > > > [ +0.06] page:ea043ffdf400 count:210 mapcount:201 > > > > > mapping:888dff4bdda8 index:0x2 > > > > > [ +0.012798] xfs_address_space_operations [xfs] > > > > > [ +0.34] name:"stress" > > > > > [ +0.004524] flags: 0x5fc004(uptodate) > > > > > [ +0.007068] raw: 005fc004 c90e3d80 c90e3d80 > > > > > 888dff4bdda8 > > > > > [ +0.009359] raw: 0002 00cb00c8 > > > > > 888e7353d000 > > > > > [ +0.007728] page->mem_cgroup:888e7353d000 > > > > > > > > I wouldn't be surprised if this was a similar/same issue I've been > > > > chasing recently. Could you try to disable faultaround to see if that > > > > helps. It seems that it helped in my particular case but I am still > > > > waiting for the final good-to-go to post the patch as I do not own the > > > > workload which triggered that issue. > > > > > > Tried, still stuck in last block sometime. Usually after several times > > > of hotplug/unplug. If stop stress program, the last block will be > > > offlined immediately. > > > > Is the pattern still the same? I mean failing over few pages with > > reference count jumping up and down between attempts? > > ->count jumping up and down, mapcount stays the same value. > > > > > > [root@ ~]# cat /sys/kernel/debug/fault_around_bytes > > > 4096 > > > > Can you make it 0? > > I executed 'echo 0 > fault_around_bytes', value less than one page size > will round up to one page. OK, I have missed that. So then there must be a different source of the page count volatility. Is it always the same file? I think we can rule out memory reclaim because that depends on the page lock. Is the stress test hitting on memory compaction? In other words, are grep compact /proc/vmstat counters changing during the offline test heavily? I am asking because I do not see compaction pfn walkers skipping over MIGRATE_ISOLATE pageblocks. 
But I might be missing something easily. It would be also good to find out whether this is fs specific. E.g. does it make any difference if you use a different one for your stress testing? -- Michal Hocko SUSE Labs
Re: Memory hotplug softlock issue
On 11/15/18 at 08:30am, Michal Hocko wrote:
> On Thu 15-11-18 13:10:34, Baoquan He wrote:
> > On 11/14/18 at 04:00pm, Michal Hocko wrote:
> > > On Wed 14-11-18 22:52:50, Baoquan He wrote:
> > > > On 11/14/18 at 10:01am, Michal Hocko wrote:
> > > > > I have seen an issue when the migration cannot make a forward progress
> > > > > because of a glibc page with a reference count bumping up and down. Most
> > > > > probable explanation is the faultaround code. I am working on this and
> > > > > will post a patch soon. In any case the migration should converge and if
> > > > > it doesn't do then there is a bug lurking somewhere.
> > > > >
> > > > > Failing on ENOMEM is a questionable thing. I haven't seen that happening
> > > > > wildly but if it is a case then I wouldn't be opposed.
> > > >
> > > > Applied your debugging patches, it helps a lot to printing message.
> > > >
> > > > Below is the dmesg log about the migrating failure. It can't pass
> > > > migrate_pages() and loop forever.
> > > >
> > > > [ +0.083841] migrating pfn 10fff7d0 failed
> > > > [ +0.05] page:ea043ffdf400 count:208 mapcount:201 mapping:888dff4bdda8 index:0x2
> > > > [ +0.012689] xfs_address_space_operations [xfs]
> > > > [ +0.30] name:"stress"
> > > > [ +0.004556] flags: 0x5fc004(uptodate)
> > > > [ +0.007339] raw: 005fc004 c90e3d80 c90e3d80 888dff4bdda8
> > > > [ +0.009488] raw: 0002 00cb00c8 888e7353d000
> > > > [ +0.007726] page->mem_cgroup:888e7353d000
> > > > [ +0.084538] migrating pfn 10fff7d0 failed
> > > > [ +0.06] page:ea043ffdf400 count:210 mapcount:201 mapping:888dff4bdda8 index:0x2
> > > > [ +0.012798] xfs_address_space_operations [xfs]
> > > > [ +0.34] name:"stress"
> > > > [ +0.004524] flags: 0x5fc004(uptodate)
> > > > [ +0.007068] raw: 005fc004 c90e3d80 c90e3d80 888dff4bdda8
> > > > [ +0.009359] raw: 0002 00cb00c8 888e7353d000
> > > > [ +0.007728] page->mem_cgroup:888e7353d000
> > >
> > > I wouldn't be surprised if this was a similar/same issue I've been
> > > chasing recently. Could you try to disable faultaround to see if that
> > > helps. It seems that it helped in my particular case but I am still
> > > waiting for the final good-to-go to post the patch as I do not own the
> > > workload which triggered that issue.
> >
> > Tried, still stuck in last block sometime. Usually after several times
> > of hotplug/unplug. If stop stress program, the last block will be
> > offlined immediately.
>
> Is the pattern still the same? I mean failing over few pages with
> reference count jumping up and down between attempts?

->count jumps up and down; mapcount stays at the same value.

> > [root@ ~]# cat /sys/kernel/debug/fault_around_bytes
> > 4096
>
> Can you make it 0?

I executed 'echo 0 > fault_around_bytes'; a value less than one page size is rounded up to one page:

/*
 * fault_around_bytes must be rounded down to the nearest page order as it's
 * what do_fault_around() expects to see.
 */
static int fault_around_bytes_set(void *data, u64 val)
{
	if (val / PAGE_SIZE > PTRS_PER_PTE)
		return -EINVAL;
	if (val > PAGE_SIZE)
		fault_around_bytes = rounddown_pow_of_two(val);
	else
		fault_around_bytes = PAGE_SIZE; /* rounddown_pow_of_two(0) is undefined */

	return 0;
}
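So any value at or below PAGE_SIZE reads back as 4096, which is what Baoquan saw. A standalone re-implementation of the clamp (illustrative only; PAGE_SIZE hardcoded to 4096 and the helpers reimplemented here) shows the effect:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096ULL

/* round down to a power of two, like the kernel helper of the same name */
static uint64_t rounddown_pow_of_two(uint64_t v)
{
	while (v & (v - 1))
		v &= v - 1;	/* clear the lowest set bit until one remains */
	return v;
}

static uint64_t clamp_fault_around_bytes(uint64_t val)
{
	if (val > PAGE_SIZE)
		return rounddown_pow_of_two(val);
	return PAGE_SIZE;	/* rounddown_pow_of_two(0) is undefined */
}

int main(void)
{
	printf("0     -> %llu\n", (unsigned long long)clamp_fault_around_bytes(0));
	printf("4096  -> %llu\n", (unsigned long long)clamp_fault_around_bytes(4096));
	printf("65537 -> %llu\n", (unsigned long long)clamp_fault_around_bytes(65537));
	return 0;
}

The first two cases both print 4096: writing 0 cannot actually switch fault-around off below one page, so "echo 0" still leaves single-page fault-around in place.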
Re: Memory hotplug softlock issue
On Thu 15-11-18 13:10:34, Baoquan He wrote: > On 11/14/18 at 04:00pm, Michal Hocko wrote: > > On Wed 14-11-18 22:52:50, Baoquan He wrote: > > > On 11/14/18 at 10:01am, Michal Hocko wrote: > > > > I have seen an issue when the migration cannot make a forward progress > > > > because of a glibc page with a reference count bumping up and down. Most > > > > probable explanation is the faultaround code. I am working on this and > > > > will post a patch soon. In any case the migration should converge and if > > > > it doesn't do then there is a bug lurking somewhere. > > > > > > > > Failing on ENOMEM is a questionable thing. I haven't seen that happening > > > > wildly but if it is a case then I wouldn't be opposed. > > > > > > Applied your debugging patches, it helps a lot to printing message. > > > > > > Below is the dmesg log about the migrating failure. It can't pass > > > migrate_pages() and loop forever. > > > > > > [ +0.083841] migrating pfn 10fff7d0 failed > > > [ +0.05] page:ea043ffdf400 count:208 mapcount:201 > > > mapping:888dff4bdda8 index:0x2 > > > [ +0.012689] xfs_address_space_operations [xfs] > > > [ +0.30] name:"stress" > > > [ +0.004556] flags: 0x5fc004(uptodate) > > > [ +0.007339] raw: 005fc004 c90e3d80 c90e3d80 > > > 888dff4bdda8 > > > [ +0.009488] raw: 0002 00cb00c8 > > > 888e7353d000 > > > [ +0.007726] page->mem_cgroup:888e7353d000 > > > [ +0.084538] migrating pfn 10fff7d0 failed > > > [ +0.06] page:ea043ffdf400 count:210 mapcount:201 > > > mapping:888dff4bdda8 index:0x2 > > > [ +0.012798] xfs_address_space_operations [xfs] > > > [ +0.34] name:"stress" > > > [ +0.004524] flags: 0x5fc004(uptodate) > > > [ +0.007068] raw: 005fc004 c90e3d80 c90e3d80 > > > 888dff4bdda8 > > > [ +0.009359] raw: 0002 00cb00c8 > > > 888e7353d000 > > > [ +0.007728] page->mem_cgroup:888e7353d000 > > > > I wouldn't be surprised if this was a similar/same issue I've been > > chasing recently. Could you try to disable faultaround to see if that > > helps. It seems that it helped in my particular case but I am still > > waiting for the final good-to-go to post the patch as I do not own the > > workload which triggered that issue. > > Tried, still stuck in last block sometime. Usually after several times > of hotplug/unplug. If stop stress program, the last block will be > offlined immediately. Is the pattern still the same? I mean failing over few pages with reference count jumping up and down between attempts? > [root@ ~]# cat /sys/kernel/debug/fault_around_bytes > 4096 Can you make it 0? -- Michal Hocko SUSE Labs
Re: Memory hotplug softlock issue
On 11/14/18 at 04:00pm, Michal Hocko wrote: > On Wed 14-11-18 22:52:50, Baoquan He wrote: > > On 11/14/18 at 10:01am, Michal Hocko wrote: > > > I have seen an issue when the migration cannot make a forward progress > > > because of a glibc page with a reference count bumping up and down. Most > > > probable explanation is the faultaround code. I am working on this and > > > will post a patch soon. In any case the migration should converge and if > > > it doesn't do then there is a bug lurking somewhere. > > > > > > Failing on ENOMEM is a questionable thing. I haven't seen that happening > > > wildly but if it is a case then I wouldn't be opposed. > > > > Applied your debugging patches, it helps a lot to printing message. > > > > Below is the dmesg log about the migrating failure. It can't pass > > migrate_pages() and loop forever. > > > > [ +0.083841] migrating pfn 10fff7d0 failed > > [ +0.05] page:ea043ffdf400 count:208 mapcount:201 > > mapping:888dff4bdda8 index:0x2 > > [ +0.012689] xfs_address_space_operations [xfs] > > [ +0.30] name:"stress" > > [ +0.004556] flags: 0x5fc004(uptodate) > > [ +0.007339] raw: 005fc004 c90e3d80 c90e3d80 > > 888dff4bdda8 > > [ +0.009488] raw: 0002 00cb00c8 > > 888e7353d000 > > [ +0.007726] page->mem_cgroup:888e7353d000 > > [ +0.084538] migrating pfn 10fff7d0 failed > > [ +0.06] page:ea043ffdf400 count:210 mapcount:201 > > mapping:888dff4bdda8 index:0x2 > > [ +0.012798] xfs_address_space_operations [xfs] > > [ +0.34] name:"stress" > > [ +0.004524] flags: 0x5fc004(uptodate) > > [ +0.007068] raw: 005fc004 c90e3d80 c90e3d80 > > 888dff4bdda8 > > [ +0.009359] raw: 0002 00cb00c8 > > 888e7353d000 > > [ +0.007728] page->mem_cgroup:888e7353d000 > > I wouldn't be surprised if this was a similar/same issue I've been > chasing recently. Could you try to disable faultaround to see if that > helps. It seems that it helped in my particular case but I am still > waiting for the final good-to-go to post the patch as I do not own the > workload which triggered that issue. Tried, still stuck in last block sometime. Usually after several times of hotplug/unplug. If stop stress program, the last block will be offlined immediately. [root@ ~]# cat /sys/kernel/debug/fault_around_bytes 4096
Re: Memory hotplug softlock issue
On Wed 14-11-18 22:52:50, Baoquan He wrote: > On 11/14/18 at 10:01am, Michal Hocko wrote: > > I have seen an issue when the migration cannot make a forward progress > > because of a glibc page with a reference count bumping up and down. Most > > probable explanation is the faultaround code. I am working on this and > > will post a patch soon. In any case the migration should converge and if > > it doesn't do then there is a bug lurking somewhere. > > > > Failing on ENOMEM is a questionable thing. I haven't seen that happening > > wildly but if it is a case then I wouldn't be opposed. > > Applied your debugging patches, it helps a lot to printing message. > > Below is the dmesg log about the migrating failure. It can't pass > migrate_pages() and loop forever. > > [ +0.083841] migrating pfn 10fff7d0 failed > [ +0.05] page:ea043ffdf400 count:208 mapcount:201 > mapping:888dff4bdda8 index:0x2 > [ +0.012689] xfs_address_space_operations [xfs] > [ +0.30] name:"stress" > [ +0.004556] flags: 0x5fc004(uptodate) > [ +0.007339] raw: 005fc004 c90e3d80 c90e3d80 > 888dff4bdda8 > [ +0.009488] raw: 0002 00cb00c8 > 888e7353d000 > [ +0.007726] page->mem_cgroup:888e7353d000 > [ +0.084538] migrating pfn 10fff7d0 failed > [ +0.06] page:ea043ffdf400 count:210 mapcount:201 > mapping:888dff4bdda8 index:0x2 > [ +0.012798] xfs_address_space_operations [xfs] > [ +0.34] name:"stress" > [ +0.004524] flags: 0x5fc004(uptodate) > [ +0.007068] raw: 005fc004 c90e3d80 c90e3d80 > 888dff4bdda8 > [ +0.009359] raw: 0002 00cb00c8 > 888e7353d000 > [ +0.007728] page->mem_cgroup:888e7353d000 I wouldn't be surprised if this was a similar/same issue I've been chasing recently. Could you try to disable faultaround to see if that helps. It seems that it helped in my particular case but I am still waiting for the final good-to-go to post the patch as I do not own the workload which triggered that issue. -- Michal Hocko SUSE Labs
Re: Memory hotplug softlock issue
On 11/14/18 at 10:01am, Michal Hocko wrote:
> I have seen an issue when the migration cannot make a forward progress
> because of a glibc page with a reference count bumping up and down. Most
> probable explanation is the faultaround code. I am working on this and
> will post a patch soon. In any case the migration should converge and if
> it doesn't do then there is a bug lurking somewhere.
>
> Failing on ENOMEM is a questionable thing. I haven't seen that happening
> wildly but if it is a case then I wouldn't be opposed.

Applied your debugging patches; they help a lot by printing messages.

Below is the dmesg log about the migrating failure. It can't get past migrate_pages() and loops forever.

[ +0.083841] migrating pfn 10fff7d0 failed
[ +0.05] page:ea043ffdf400 count:208 mapcount:201 mapping:888dff4bdda8 index:0x2
[ +0.012689] xfs_address_space_operations [xfs]
[ +0.30] name:"stress"
[ +0.004556] flags: 0x5fc004(uptodate)
[ +0.007339] raw: 005fc004 c90e3d80 c90e3d80 888dff4bdda8
[ +0.009488] raw: 0002 00cb00c8 888e7353d000
[ +0.007726] page->mem_cgroup:888e7353d000
[ +0.084538] migrating pfn 10fff7d0 failed
[ +0.06] page:ea043ffdf400 count:210 mapcount:201 mapping:888dff4bdda8 index:0x2
[ +0.012798] xfs_address_space_operations [xfs]
[ +0.34] name:"stress"
[ +0.004524] flags: 0x5fc004(uptodate)
[ +0.007068] raw: 005fc004 c90e3d80 c90e3d80 888dff4bdda8
[ +0.009359] raw: 0002 00cb00c8 888e7353d000
[ +0.007728] page->mem_cgroup:888e7353d000

This is numactl -H; the last memory block of node1 can't be migrated.

[root@~]# numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
node 0 size: 58793 MB
node 0 free: 50374 MB
node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
node 1 size: 2048 MB
node 1 free: 0 MB
node 2 cpus: 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197
node 2 size: 65536 MB
node 2 free: 60102 MB
node 3 cpus: 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
node 3 size: 65536 MB
node 3 free: 61237 MB
node 4 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233
node 4 size: 65536 MB
node 4 free: 63057 MB
node 5 cpus: 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251
node 5 size: 65536 MB
node 5 free: 62507 MB
node 6 cpus: 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
node 6 size: 65536 MB
node 6 free: 62688 MB
node 7 cpus: 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287
node 7 size: 65536 MB
node 7 free: 61775 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  21  31  21  41  41  51  51
  1:  21  10  21  31  41  41  51  51
  2:  31  21  10  21  51  51  41  41
  3:  21  31  21  10  51  51  41  41
  4:  41  41  51  51  10  21  31  21
  5:  41  41  51  51  21  10  21  31
  6:  51  51  41  41  31  21  10  21
  7:  51  51  41  41  21  31  21  10

>> You mentioned memory pressure, if our host is under memory pressure we
>> can easily trigger running into an endless loop there, because we
>> basically ignore -ENOMEM e.g. when we cannot get a page to migrate some
>> memory to be offlined. I assume this is the case here.
>> do_migrate_range() could be the bad boy if it keeps failing forever and
>> we keep retrying.
>
> My hotplug debugging patches [1] should help to tell us.
>
> [1] http://lkml.kernel.org/r/20181107101830.17405-1-mho...@kernel.org
> --
> Michal Hocko
> SUSE Labs
Re: Memory hotplug softlock issue
On Wed 14-11-18 10:48:09, David Hildenbrand wrote: > On 14.11.18 10:41, Michal Hocko wrote: > > On Wed 14-11-18 10:25:57, David Hildenbrand wrote: > >> On 14.11.18 10:00, Baoquan He wrote: > >>> Hi David, > >>> > >>> On 11/14/18 at 09:18am, David Hildenbrand wrote: > Code seems to be waiting for the mem_hotplug_lock in read. > We hold mem_hotplug_lock in write whenever we online/offline/add/remove > memory. There are two ways to trigger offlining of memory: > > 1. Offlining via "cat offline > /sys/devices/system/memory/memory0/state" > > This always properly took the mem_hotplug_lock. Nothing changed > > 2. Offlining via "cat 0 > /sys/devices/system/memory/memory0/online" > > This didn't take the mem_hotplug_lock and I fixed that for this release. > > So if you were testing with 1., you should have seen the same error > before this release (unless there is something else now broken in this > release). > >>> > >>> Thanks a lot for looking into this. > >>> > >>> I triggered sysrq+t to check threads' state. You can see that we use > >>> firmware to trigger ACPI event to go to acpi_bus_offline(), it truly > >>> didn't take mem_hotplug_lock() and has taken it with your fix in > >>> commit 381eab4a6ee mm/memory_hotplug: fix online/offline_pages called > >>> w.o. mem_hotplug_lock > >>> > >>> [ +0.007062] Workqueue: kacpi_hotplug acpi_hotplug_work_fn > >>> [ +0.005398] Call Trace: > >>> [ +0.002476] ? page_vma_mapped_walk+0x307/0x710 > >>> [ +0.004538] ? page_remove_rmap+0xa2/0x340 > >>> [ +0.004104] ? ptep_clear_flush+0x54/0x60 > >>> [ +0.004027] ? enqueue_entity+0x11c/0x620 > >>> [ +0.005904] ? schedule+0x28/0x80 > >>> [ +0.003336] ? rmap_walk_file+0xf9/0x270 > >>> [ +0.003940] ? try_to_unmap+0x9c/0xf0 > >>> [ +0.003695] ? migrate_pages+0x2b0/0xb90 > >>> [ +0.003959] ? try_offline_node+0x160/0x160 > >>> [ +0.004214] ? __offline_pages+0x6ce/0x8e0 > >>> [ +0.004134] ? memory_subsys_offline+0x40/0x60 > >>> [ +0.004474] ? device_offline+0x81/0xb0 > >>> [ +0.003867] ? acpi_bus_offline+0xdb/0x140 > >>> [ +0.004117] ? acpi_device_hotplug+0x21c/0x460 > >>> [ +0.004458] ? acpi_hotplug_work_fn+0x1a/0x30 > >>> [ +0.004372] ? process_one_work+0x1a1/0x3a0 > >>> [ +0.004195] ? worker_thread+0x30/0x380 > >>> [ +0.003851] ? drain_workqueue+0x120/0x120 > >>> [ +0.004117] ? kthread+0x112/0x130 > >>> [ +0.003411] ? kthread_park+0x80/0x80 > >>> [ +0.005325] ? ret_from_fork+0x35/0x40 > >>> > >> > >> Yes, this is indeed another code path that was fixed (and I didn't > >> actually realize it ;) ). Thanks for the callchain. Before my fix > >> hotplug still would have never succeeded (offline_pages would have > >> silently looped forever) as far as I can tell. > > > > I haven't studied your patch yet so I am not really sure why you have > > added the lock into this path. The memory hotplug locking is certainly > > far from great but I believe we should really rething the scope of the > > lock. There shouldn't be any fundamental reason to use the global lock > > for the full offlining. So rather than moving the lock from one place to > > another we need a range locking I believe. > See the patches for details, the lock was removed on this path by > mistake not by design. OK, so I guess we should plug that hole first I guess. > Replacing the lock by some range lock can now be done. The tricky part > will be get_online_mems(), we'll have to indicate a range somehow. For > online/offline/add/remove, we have the range. I would argue that get_online_mems() needs some rethinking. 
Callers shouldn't really care that a node went offline. If they care about the specific pfn range of the node to not go away then the range locking should work just fine for them. -- Michal Hocko SUSE Labs
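Nothing like this exists in the tree at this point, so purely as a sketch of the interface Michal is describing (all names are hypothetical, and this is a single-holder simplification with no read side), a pfn-range lock might look like:

#include <pthread.h>
#include <stdbool.h>

struct pfn_range_lock {
	pthread_mutex_t lock;
	pthread_cond_t  released;
	unsigned long   start, end;	/* held range [start, end) */
	bool            held;
};

static struct pfn_range_lock hotplug_rl = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
	.released = PTHREAD_COND_INITIALIZER,
};

static bool ranges_overlap(unsigned long s1, unsigned long e1,
			   unsigned long s2, unsigned long e2)
{
	return s1 < e2 && s2 < e1;
}

/* Block until no held range overlaps [start, end), then claim it. */
static void pfn_range_lock(struct pfn_range_lock *rl,
			   unsigned long start, unsigned long end)
{
	pthread_mutex_lock(&rl->lock);
	while (rl->held && ranges_overlap(rl->start, rl->end, start, end))
		pthread_cond_wait(&rl->released, &rl->lock);
	rl->start = start;
	rl->end = end;
	rl->held = true;
	pthread_mutex_unlock(&rl->lock);
}

static void pfn_range_unlock(struct pfn_range_lock *rl)
{
	pthread_mutex_lock(&rl->lock);
	rl->held = false;
	pthread_cond_broadcast(&rl->released);
	pthread_mutex_unlock(&rl->lock);
}

int main(void)
{
	pfn_range_lock(&hotplug_rl, 0x100000, 0x110000);	/* offline this range */
	/* ... migrate pages in [0x100000, 0x110000) ... */
	pfn_range_unlock(&hotplug_rl);
	return 0;
}

A real implementation would have to track many concurrent holders (an interval tree, say) and give get_online_mems() a read side, which is exactly the part called tricky elsewhere in this thread.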
Re: Memory hotplug softlock issue
[Cc Vladimir]

On Wed 14-11-18 15:09:09, Baoquan He wrote:
> Hi,
>
> Tested memory hotplug on a bare metal system, hot removing always
> trigger a lock. Usually need hot plug/unplug several times, then the hot
> removing will hang there at the last block. Surely with memory pressure
> added by executing "stress -m 200".
>
> Will attach the log partly. Any idea or suggestion, appreciated.
>
[...]
> [ +0.007169] Not tainted 4.20.0-rc2+ #4
> [ +0.004630] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ +0.008001] kworker/181:1 D0 1187 2 0x8000
> [ +0.005711] Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func
> [ +0.006467] Call Trace:
> [ +0.002591] ? __schedule+0x24e/0x880
> [ +0.004995] schedule+0x28/0x80
> [ +0.003380] rwsem_down_read_failed+0x103/0x190
> [ +0.006528] call_rwsem_down_read_failed+0x14/0x30
> [ +0.004937] __percpu_down_read+0x4f/0x80
> [ +0.004204] get_online_mems+0x2d/0x30
> [ +0.003871] memcg_create_kmem_cache+0x1b/0x120
> [ +0.004740] memcg_kmem_cache_create_func+0x1b/0x60
> [ +0.004986] process_one_work+0x1a1/0x3a0
> [ +0.004255] worker_thread+0x30/0x380
> [ +0.003764] ? drain_workqueue+0x120/0x120
> [ +0.004238] kthread+0x112/0x130
> [ +0.003320] ? kthread_park+0x80/0x80
> [ +0.003796] ret_from_fork+0x35/0x40

For quick context: we hold the exclusive mem hotplug lock throughout the whole offlining, and that can take quite some time. So I am wondering whether we absolutely have to take the shared lock in this path (introduced by 03afc0e25f7f ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}")). Is there any way to relax this requirement? E.g. nodes stay around even when they are completely offline. Does that help?
--
Michal Hocko
SUSE Labs
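The blocked pattern in the trace is easy to model in userspace (an assumed toy model, not the actual kernel code paths): a writer holds the lock across a long operation while a worker blocks taking the same lock for reading:

/* Build with: gcc -pthread hang.c -o hang */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t mem_hotplug_lock = PTHREAD_RWLOCK_INITIALIZER;

static void *memcg_worker(void *arg)
{
	(void)arg;
	pthread_rwlock_rdlock(&mem_hotplug_lock);	/* get_online_mems() */
	puts("worker: got the read lock");
	pthread_rwlock_unlock(&mem_hotplug_lock);	/* put_online_mems() */
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_rwlock_wrlock(&mem_hotplug_lock);	/* offlining, exclusive */
	pthread_create(&t, NULL, memcg_worker, NULL);
	sleep(5);	/* stand-in for the long (possibly endless) offline loop */
	puts("offline: done, releasing the write lock");
	pthread_rwlock_unlock(&mem_hotplug_lock);
	pthread_join(t, NULL);
	return 0;
}

If the writer section never terminates, as with an offline loop that retries forever, the worker above hangs for good; that is the kworker stuck in get_online_mems() in the trace.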
Re: Memory hotplug softlock issue
On 14.11.18 10:41, Michal Hocko wrote: > On Wed 14-11-18 10:25:57, David Hildenbrand wrote: >> On 14.11.18 10:00, Baoquan He wrote: >>> Hi David, >>> >>> On 11/14/18 at 09:18am, David Hildenbrand wrote: Code seems to be waiting for the mem_hotplug_lock in read. We hold mem_hotplug_lock in write whenever we online/offline/add/remove memory. There are two ways to trigger offlining of memory: 1. Offlining via "cat offline > /sys/devices/system/memory/memory0/state" This always properly took the mem_hotplug_lock. Nothing changed 2. Offlining via "cat 0 > /sys/devices/system/memory/memory0/online" This didn't take the mem_hotplug_lock and I fixed that for this release. So if you were testing with 1., you should have seen the same error before this release (unless there is something else now broken in this release). >>> >>> Thanks a lot for looking into this. >>> >>> I triggered sysrq+t to check threads' state. You can see that we use >>> firmware to trigger ACPI event to go to acpi_bus_offline(), it truly >>> didn't take mem_hotplug_lock() and has taken it with your fix in >>> commit 381eab4a6ee mm/memory_hotplug: fix online/offline_pages called w.o. >>> mem_hotplug_lock >>> >>> [ +0.007062] Workqueue: kacpi_hotplug acpi_hotplug_work_fn >>> [ +0.005398] Call Trace: >>> [ +0.002476] ? page_vma_mapped_walk+0x307/0x710 >>> [ +0.004538] ? page_remove_rmap+0xa2/0x340 >>> [ +0.004104] ? ptep_clear_flush+0x54/0x60 >>> [ +0.004027] ? enqueue_entity+0x11c/0x620 >>> [ +0.005904] ? schedule+0x28/0x80 >>> [ +0.003336] ? rmap_walk_file+0xf9/0x270 >>> [ +0.003940] ? try_to_unmap+0x9c/0xf0 >>> [ +0.003695] ? migrate_pages+0x2b0/0xb90 >>> [ +0.003959] ? try_offline_node+0x160/0x160 >>> [ +0.004214] ? __offline_pages+0x6ce/0x8e0 >>> [ +0.004134] ? memory_subsys_offline+0x40/0x60 >>> [ +0.004474] ? device_offline+0x81/0xb0 >>> [ +0.003867] ? acpi_bus_offline+0xdb/0x140 >>> [ +0.004117] ? acpi_device_hotplug+0x21c/0x460 >>> [ +0.004458] ? acpi_hotplug_work_fn+0x1a/0x30 >>> [ +0.004372] ? process_one_work+0x1a1/0x3a0 >>> [ +0.004195] ? worker_thread+0x30/0x380 >>> [ +0.003851] ? drain_workqueue+0x120/0x120 >>> [ +0.004117] ? kthread+0x112/0x130 >>> [ +0.003411] ? kthread_park+0x80/0x80 >>> [ +0.005325] ? ret_from_fork+0x35/0x40 >>> >> >> Yes, this is indeed another code path that was fixed (and I didn't >> actually realize it ;) ). Thanks for the callchain. Before my fix >> hotplug still would have never succeeded (offline_pages would have >> silently looped forever) as far as I can tell. > > I haven't studied your patch yet so I am not really sure why you have > added the lock into this path. The memory hotplug locking is certainly > far from great but I believe we should really rething the scope of the > lock. There shouldn't be any fundamental reason to use the global lock > for the full offlining. So rather than moving the lock from one place to > another we need a range locking I believe. See the patches for details, the lock was removed on this path by mistake not by design. Replacing the lock by some range lock can now be done. The tricky part will be get_online_mems(), we'll have to indicate a range somehow. For online/offline/add/remove, we have the range. -- Thanks, David / dhildenb
Re: Memory hotplug softlock issue
On Wed 14-11-18 10:25:57, David Hildenbrand wrote: > On 14.11.18 10:00, Baoquan He wrote: > > Hi David, > > > > On 11/14/18 at 09:18am, David Hildenbrand wrote: > >> Code seems to be waiting for the mem_hotplug_lock in read. > >> We hold mem_hotplug_lock in write whenever we online/offline/add/remove > >> memory. There are two ways to trigger offlining of memory: > >> > >> 1. Offlining via "cat offline > /sys/devices/system/memory/memory0/state" > >> > >> This always properly took the mem_hotplug_lock. Nothing changed > >> > >> 2. Offlining via "cat 0 > /sys/devices/system/memory/memory0/online" > >> > >> This didn't take the mem_hotplug_lock and I fixed that for this release. > >> > >> So if you were testing with 1., you should have seen the same error > >> before this release (unless there is something else now broken in this > >> release). > > > > Thanks a lot for looking into this. > > > > I triggered sysrq+t to check threads' state. You can see that we use > > firmware to trigger ACPI event to go to acpi_bus_offline(), it truly > > didn't take mem_hotplug_lock() and has taken it with your fix in > > commit 381eab4a6ee mm/memory_hotplug: fix online/offline_pages called w.o. > > mem_hotplug_lock > > > > [ +0.007062] Workqueue: kacpi_hotplug acpi_hotplug_work_fn > > [ +0.005398] Call Trace: > > [ +0.002476] ? page_vma_mapped_walk+0x307/0x710 > > [ +0.004538] ? page_remove_rmap+0xa2/0x340 > > [ +0.004104] ? ptep_clear_flush+0x54/0x60 > > [ +0.004027] ? enqueue_entity+0x11c/0x620 > > [ +0.005904] ? schedule+0x28/0x80 > > [ +0.003336] ? rmap_walk_file+0xf9/0x270 > > [ +0.003940] ? try_to_unmap+0x9c/0xf0 > > [ +0.003695] ? migrate_pages+0x2b0/0xb90 > > [ +0.003959] ? try_offline_node+0x160/0x160 > > [ +0.004214] ? __offline_pages+0x6ce/0x8e0 > > [ +0.004134] ? memory_subsys_offline+0x40/0x60 > > [ +0.004474] ? device_offline+0x81/0xb0 > > [ +0.003867] ? acpi_bus_offline+0xdb/0x140 > > [ +0.004117] ? acpi_device_hotplug+0x21c/0x460 > > [ +0.004458] ? acpi_hotplug_work_fn+0x1a/0x30 > > [ +0.004372] ? process_one_work+0x1a1/0x3a0 > > [ +0.004195] ? worker_thread+0x30/0x380 > > [ +0.003851] ? drain_workqueue+0x120/0x120 > > [ +0.004117] ? kthread+0x112/0x130 > > [ +0.003411] ? kthread_park+0x80/0x80 > > [ +0.005325] ? ret_from_fork+0x35/0x40 > > > > Yes, this is indeed another code path that was fixed (and I didn't > actually realize it ;) ). Thanks for the callchain. Before my fix > hotplug still would have never succeeded (offline_pages would have > silently looped forever) as far as I can tell. I haven't studied your patch yet so I am not really sure why you have added the lock into this path. The memory hotplug locking is certainly far from great but I believe we should really rething the scope of the lock. There shouldn't be any fundamental reason to use the global lock for the full offlining. So rather than moving the lock from one place to another we need a range locking I believe. -- Michal Hocko SUSE Labs
Re: Memory hotplug softlock issue
On Wed 14-11-18 10:22:31, David Hildenbrand wrote:
> >>
> >> The real question is, however, why offlining of the last block doesn't
> >> succeed. In __offline_pages() we basically have an endless loop (while
> >> holding the mem_hotplug_lock in write). Now I consider this piece of
> >> code very problematic (we should automatically fail after X
> >> attempts/after X seconds, and we should not ignore -ENOMEM), and we've
> >> had other BUGs where we would run into an endless loop here (e.g.
> >> related to hugepages, I guess).
> >
> > We used to have a number of retries previously and it was too fragile.
> > If you need a timeout then you can easily do that from userspace. Just do
> > timeout $TIME echo 0 > $MEM_PATH/online
>
> I agree that a number of retries is not a good measure.
>
> But as far as I can see this happens from the kernel via an ACPI event.
> E.g. failing to offline a block after X seconds would still make sense.
> (If something takes 120 seconds to offline 128MB/2G there is something
> very bad going on; we could set the default limit to e.g. 30 seconds.)
> However ...

I disagree. This is pulling policy into the kernel and that just
generates problems. What might look like a reasonable timeout to some
workloads might be wrong for others.

> > I have seen an issue where the migration cannot make forward progress
> > because of a glibc page with a reference count bumping up and down. The
> > most probable explanation is the faultaround code. I am working on this
> > and will post a patch soon. In any case the migration should converge,
> > and if it doesn't then there is a bug lurking somewhere.
>
> ... I also agree that this should converge. And if we detect a serious
> issue that we can't handle/where we can't converge (e.g. -ENOMEM) we
> should abort.

As I've said, ENOMEM can be considered a hard failure. We do not trigger
the OOM killer when allocating the migration target, so we only rely on
somebody else making forward progress for us, and that is suboptimal. Yet
I haven't seen this happening in hotplug scenarios so far. Doing hot
remove while memory is really under pressure is a bad idea in the first
place most of the time. It is quite likely that somebody else just
triggers the OOM killer and the offlining part will eventually make
forward progress.

> >
> > Failing on ENOMEM is a questionable thing. I haven't seen that happening
> > in the wild, but if it is the case then I wouldn't be opposed.
> >
> >> You mentioned memory pressure: if our host is under memory pressure we
> >> can easily trigger running into an endless loop there, because we
> >> basically ignore -ENOMEM, e.g. when we cannot get a page to migrate
> >> some memory to be offlined. I assume this is the case here.
> >> do_migrate_range() could be the bad boy if it keeps failing forever and
> >> we keep retrying.
>
> I've seen quite a few issues while playing with virtio-mem, but didn't
> have the time to look into the details. Still on my long list of things
> to look into.

Memory hotplug is really far away from being optimal and robust. This has
always been the case. Issues used to be worked around by retry limits
etc. If we ever want to make it more robust we have to bite the bullet
and actually chase all the issues that might be basically anywhere and
fix them. This is just the nature of the beast that memory hotplug is.
--
Michal Hocko
SUSE Labs
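The point about not triggering the OOM killer comes from the GFP flags used when allocating migration targets. A hedged sketch of the idea, modelled loosely on the hotplug-era target allocation; alloc_offline_target() is an invented name for illustration, not an upstream function:

```c
#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Illustrative sketch, not upstream code: allocating a target page for
 * a page being migrated off an offlining block.  __GFP_RETRY_MAYFAIL
 * retries hard but fails instead of invoking the OOM killer, so under
 * genuine memory pressure the offline loop can only succeed if someone
 * else frees memory (or the OOM killer fires for another allocation).
 */
static struct page *alloc_offline_target(int preferred_nid)
{
	gfp_t gfp = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL;

	/* NULL here means the caller sees an allocation failure, not OOM. */
	return __alloc_pages_node(preferred_nid, gfp, 0);
}
```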
Re: Memory hotplug softlock issue
>>> Failing on ENOMEM is a questionable thing. I haven't seen that happening
>>> in the wild, but if it is the case then I wouldn't be opposed.
>>>
>>>> You mentioned memory pressure: if our host is under memory pressure we
>>>> can easily trigger running into an endless loop there, because we
>>>> basically ignore -ENOMEM, e.g. when we cannot get a page to migrate
>>>> some memory to be offlined. I assume this is the case here.
>>>> do_migrate_range() could be the bad boy if it keeps failing forever and
>>>> we keep retrying.
>>
>> I've seen quite a few issues while playing with virtio-mem, but didn't
>> have the time to look into the details. Still on my long list of things
>> to look into.
>
> Memory hotplug is really far away from being optimal and robust. This has
> always been the case. Issues used to be worked around by retry limits
> etc. If we ever want to make it more robust we have to bite the bullet
> and actually chase all the issues that might be basically anywhere and
> fix them. This is just the nature of the beast that memory hotplug is.
>

Yes, I agree: no more workarounds.

--
Thanks,

David / dhildenb
Re: Memory hotplug softlock issue
On 14.11.18 10:00, Baoquan He wrote:
> Hi David,
>
> On 11/14/18 at 09:18am, David Hildenbrand wrote:
>> Code seems to be waiting for the mem_hotplug_lock in read.
>> We hold mem_hotplug_lock in write whenever we online/offline/add/remove
>> memory. There are two ways to trigger offlining of memory:
>>
>> 1. Offlining via "echo offline > /sys/devices/system/memory/memory0/state"
>>
>>    This always properly took the mem_hotplug_lock. Nothing changed.
>>
>> 2. Offlining via "echo 0 > /sys/devices/system/memory/memory0/online"
>>
>>    This didn't take the mem_hotplug_lock, and I fixed that for this release.
>>
>> So if you were testing with 1., you should have seen the same error
>> before this release (unless there is something else now broken in this
>> release).
>
> Thanks a lot for looking into this.
>
> I triggered sysrq+t to check the threads' state. You can see that we use
> firmware to trigger an ACPI event to go to acpi_bus_offline(); it truly
> didn't take mem_hotplug_lock before, and takes it now with your fix in
> commit 381eab4a6ee ("mm/memory_hotplug: fix online/offline_pages called
> w.o. mem_hotplug_lock"):
>
> [ +0.007062] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> [ +0.005398] Call Trace:
> [ +0.002476] ? page_vma_mapped_walk+0x307/0x710
> [ +0.004538] ? page_remove_rmap+0xa2/0x340
> [ +0.004104] ? ptep_clear_flush+0x54/0x60
> [ +0.004027] ? enqueue_entity+0x11c/0x620
> [ +0.005904] ? schedule+0x28/0x80
> [ +0.003336] ? rmap_walk_file+0xf9/0x270
> [ +0.003940] ? try_to_unmap+0x9c/0xf0
> [ +0.003695] ? migrate_pages+0x2b0/0xb90
> [ +0.003959] ? try_offline_node+0x160/0x160
> [ +0.004214] ? __offline_pages+0x6ce/0x8e0
> [ +0.004134] ? memory_subsys_offline+0x40/0x60
> [ +0.004474] ? device_offline+0x81/0xb0
> [ +0.003867] ? acpi_bus_offline+0xdb/0x140
> [ +0.004117] ? acpi_device_hotplug+0x21c/0x460
> [ +0.004458] ? acpi_hotplug_work_fn+0x1a/0x30
> [ +0.004372] ? process_one_work+0x1a1/0x3a0
> [ +0.004195] ? worker_thread+0x30/0x380
> [ +0.003851] ? drain_workqueue+0x120/0x120
> [ +0.004117] ? kthread+0x112/0x130
> [ +0.003411] ? kthread_park+0x80/0x80
> [ +0.005325] ? ret_from_fork+0x35/0x40
>

Yes, this is indeed another code path that was fixed (and I didn't
actually realize it ;) ). Thanks for the callchain. Before my fix,
hotplug still would never have succeeded (offline_pages would have
silently looped forever) as far as I can tell.

>
>>
>>
>> The real question is, however, why offlining of the last block doesn't
>> succeed. In __offline_pages() we basically have an endless loop (while
>> holding the mem_hotplug_lock in write). Now I consider this piece of
>> code very problematic (we should automatically fail after X
>> attempts/after X seconds, and we should not ignore -ENOMEM), and we've
>> had other BUGs where we would run into an endless loop here (e.g.
>> related to hugepages, I guess).
>
> Hmm, even though memory hotplug stalled there, there is still plenty of
> memory. E.g. this system has 8 nodes and each node has 64 GB of memory,
> 512 GB in all. Now I run "stress -m 200" to trigger 200 processes that
> malloc and then free 256 MB continuously, eating 50 GB in all. In
> theory, there is still plenty of memory to migrate to.

Maybe a NUMA issue? But I am just making wild guesses. Maybe it is not
-ENOMEM but just some other migration condition that is not properly
handled (see Michal's reply).

>
>> You mentioned memory pressure: if our host is under memory pressure we
>> can easily trigger running into an endless loop there, because we
>> basically ignore -ENOMEM, e.g. when we cannot get a page to migrate
>> some memory to be offlined. I assume this is the case here.
>> do_migrate_range() could be the bad boy if it keeps failing forever and
>> we keep retrying.
>
> Not sure what other people think about this. If memory removal fails
> while there is still plenty of free memory left, I worry customers will
> complain.

Indeed, we have to look into this.

>
> Yeah, it stopped at do_migrate_range() when trying to migrate the last
> memory block. And each time it's the last memory block which can't be
> offlined and hangs.

It would be interesting to see which error message we keep getting.

>
> If any message or information is needed, I can provide it.
>
> Thanks
> Baoquan
>

--
Thanks,

David / dhildenb
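One way to see which error keeps repeating would be a throwaway debug patch around the migrate_pages() call in do_migrate_range(). Something like the following sketch; this is illustrative only, not a submitted patch and not one of Michal's separately posted debugging patches:

```c
/*
 * Illustrative debug hack around the migrate_pages() call in
 * do_migrate_range() (mm/memory_hotplug.c), to show which pages keep
 * refusing to migrate.
 */
ret = migrate_pages(&source, new_node_page, NULL, 0,
		    MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
if (ret) {
	struct page *page;

	pr_warn("migrate_pages() failed with %d\n", ret);
	/* Dump the stragglers before putting them back. */
	list_for_each_entry(page, &source, lru)
		dump_page(page, "offline migration failure");
	putback_movable_pages(&source);
}
```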
Re: Memory hotplug softlock issue
>>
>> The real question is, however, why offlining of the last block doesn't
>> succeed. In __offline_pages() we basically have an endless loop (while
>> holding the mem_hotplug_lock in write). Now I consider this piece of
>> code very problematic (we should automatically fail after X
>> attempts/after X seconds, and we should not ignore -ENOMEM), and we've
>> had other BUGs where we would run into an endless loop here (e.g.
>> related to hugepages, I guess).
>
> We used to have a number of retries previously and it was too fragile.
> If you need a timeout then you can easily do that from userspace. Just do
> timeout $TIME echo 0 > $MEM_PATH/online

I agree that a number of retries is not a good measure.

But as far as I can see this happens from the kernel via an ACPI event.
E.g. failing to offline a block after X seconds would still make sense.
(If something takes 120 seconds to offline 128MB/2G there is something
very bad going on; we could set the default limit to e.g. 30 seconds.)
However ...

> I have seen an issue where the migration cannot make forward progress
> because of a glibc page with a reference count bumping up and down. The
> most probable explanation is the faultaround code. I am working on this
> and will post a patch soon. In any case the migration should converge,
> and if it doesn't then there is a bug lurking somewhere.

... I also agree that this should converge. And if we detect a serious
issue that we can't handle/where we can't converge (e.g. -ENOMEM) we
should abort.

> Failing on ENOMEM is a questionable thing. I haven't seen that happening
> in the wild, but if it is the case then I wouldn't be opposed.
>
>> You mentioned memory pressure: if our host is under memory pressure we
>> can easily trigger running into an endless loop there, because we
>> basically ignore -ENOMEM, e.g. when we cannot get a page to migrate
>> some memory to be offlined. I assume this is the case here.
>> do_migrate_range() could be the bad boy if it keeps failing forever and
>> we keep retrying.

I've seen quite a few issues while playing with virtio-mem, but didn't
have the time to look into the details. Still on my long list of things
to look into.

> My hotplug debugging patches [1] should help to tell us.
>
> [1] http://lkml.kernel.org/r/20181107101830.17405-1-mho...@kernel.org
>

--
Thanks,

David / dhildenb
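The refcount sensitivity behind the glibc-page issue comes from migration's reference check: a page can only be moved if its refcount is exactly what the kernel expects at the instant of the check. Roughly, as in this simplified paraphrase of the check in mm/migrate.c, with locking and the actual mapping switch elided (the function name is invented for the sketch):

```c
#include <linux/migrate.h>
#include <linux/mm.h>

/*
 * Simplified paraphrase of migration's refcount check (mm/migrate.c);
 * locking and the mapping update are elided.
 */
static int move_mapping_sketch(struct page *page, int expected_count)
{
	/*
	 * If anybody holds an extra reference right now, e.g. the
	 * faultaround code briefly grabbing and dropping the page, the
	 * move fails and the caller retries.  A reference that keeps
	 * bouncing up and down can therefore make migration fail
	 * indefinitely, which is exactly the convergence problem
	 * discussed above.
	 */
	if (page_count(page) != expected_count)
		return -EAGAIN;

	/* ... atomically repoint the mapping at the new page ... */
	return MIGRATEPAGE_SUCCESS;
}
```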
Re: Memory hotplug softlock issue
On Wed 14-11-18 09:18:09, David Hildenbrand wrote:
> On 14.11.18 08:09, Baoquan He wrote:
> > Hi,
> >
> > Tested memory hotplug on a bare metal system; hot removing always
> > triggers a lockup. Usually it needs several hot plug/unplug cycles,
> > then the hot removing hangs at the last block. This is with memory
> > pressure added by executing "stress -m 200".
> >
> > I will attach part of the log. Any idea or suggestion, appreciated.
> >
> > Thanks
> > Baoquan
> >
>
> Code seems to be waiting for the mem_hotplug_lock in read.
> We hold mem_hotplug_lock in write whenever we online/offline/add/remove
> memory. There are two ways to trigger offlining of memory:
>
> 1. Offlining via "echo offline > /sys/devices/system/memory/memory0/state"
>
>    This always properly took the mem_hotplug_lock. Nothing changed.
>
> 2. Offlining via "echo 0 > /sys/devices/system/memory/memory0/online"
>
>    This didn't take the mem_hotplug_lock, and I fixed that for this release.

This discrepancy should go.

> So if you were testing with 1., you should have seen the same error
> before this release (unless there is something else now broken in this
> release).
>
>
> The real question is, however, why offlining of the last block doesn't
> succeed. In __offline_pages() we basically have an endless loop (while
> holding the mem_hotplug_lock in write). Now I consider this piece of
> code very problematic (we should automatically fail after X
> attempts/after X seconds, and we should not ignore -ENOMEM), and we've
> had other BUGs where we would run into an endless loop here (e.g.
> related to hugepages, I guess).

We used to have a number of retries previously and it was too fragile.
If you need a timeout then you can easily do that from userspace. Just do
timeout $TIME echo 0 > $MEM_PATH/online

I have seen an issue where the migration cannot make forward progress
because of a glibc page with a reference count bumping up and down. The
most probable explanation is the faultaround code. I am working on this
and will post a patch soon. In any case the migration should converge,
and if it doesn't then there is a bug lurking somewhere.

Failing on ENOMEM is a questionable thing. I haven't seen that happening
in the wild, but if it is the case then I wouldn't be opposed.

> You mentioned memory pressure: if our host is under memory pressure we
> can easily trigger running into an endless loop there, because we
> basically ignore -ENOMEM, e.g. when we cannot get a page to migrate
> some memory to be offlined. I assume this is the case here.
> do_migrate_range() could be the bad boy if it keeps failing forever and
> we keep retrying.

My hotplug debugging patches [1] should help to tell us.

[1] http://lkml.kernel.org/r/20181107101830.17405-1-mho...@kernel.org
--
Michal Hocko
SUSE Labs
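For reference, the retry structure under discussion looks roughly like this. It is a condensed paraphrase of the __offline_pages() loop of that era, not the verbatim source; isolation, draining, and accounting are elided, and the function name is invented for the sketch:

```c
/*
 * Condensed paraphrase of the __offline_pages() retry loop; isolation,
 * draining, and accounting are elided.
 */
static int offline_loop_sketch(unsigned long start_pfn, unsigned long end_pfn)
{
	unsigned long pfn;

repeat:
	/* A pending signal is the only way out: "timeout" is userspace's job. */
	if (signal_pending(current))
		return -EINTR;
	cond_resched();

	pfn = scan_movable_pages(start_pfn, end_pfn);
	if (pfn) {
		/*
		 * Any failure, including -ENOMEM, leads straight back to
		 * another attempt: there is no retry limit and no timeout.
		 */
		do_migrate_range(pfn, end_pfn);
		goto repeat;
	}
	return 0;
}
```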
Re: Memory hotplug softlock issue
Hi David,

On 11/14/18 at 09:18am, David Hildenbrand wrote:
> Code seems to be waiting for the mem_hotplug_lock in read.
> We hold mem_hotplug_lock in write whenever we online/offline/add/remove
> memory. There are two ways to trigger offlining of memory:
>
> 1. Offlining via "echo offline > /sys/devices/system/memory/memory0/state"
>
>    This always properly took the mem_hotplug_lock. Nothing changed.
>
> 2. Offlining via "echo 0 > /sys/devices/system/memory/memory0/online"
>
>    This didn't take the mem_hotplug_lock, and I fixed that for this release.
>
> So if you were testing with 1., you should have seen the same error
> before this release (unless there is something else now broken in this
> release).

Thanks a lot for looking into this.

I triggered sysrq+t to check the threads' state. You can see that we use
firmware to trigger an ACPI event to go to acpi_bus_offline(); it truly
didn't take mem_hotplug_lock before, and takes it now with your fix in
commit 381eab4a6ee ("mm/memory_hotplug: fix online/offline_pages called
w.o. mem_hotplug_lock"):

[ +0.007062] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[ +0.005398] Call Trace:
[ +0.002476] ? page_vma_mapped_walk+0x307/0x710
[ +0.004538] ? page_remove_rmap+0xa2/0x340
[ +0.004104] ? ptep_clear_flush+0x54/0x60
[ +0.004027] ? enqueue_entity+0x11c/0x620
[ +0.005904] ? schedule+0x28/0x80
[ +0.003336] ? rmap_walk_file+0xf9/0x270
[ +0.003940] ? try_to_unmap+0x9c/0xf0
[ +0.003695] ? migrate_pages+0x2b0/0xb90
[ +0.003959] ? try_offline_node+0x160/0x160
[ +0.004214] ? __offline_pages+0x6ce/0x8e0
[ +0.004134] ? memory_subsys_offline+0x40/0x60
[ +0.004474] ? device_offline+0x81/0xb0
[ +0.003867] ? acpi_bus_offline+0xdb/0x140
[ +0.004117] ? acpi_device_hotplug+0x21c/0x460
[ +0.004458] ? acpi_hotplug_work_fn+0x1a/0x30
[ +0.004372] ? process_one_work+0x1a1/0x3a0
[ +0.004195] ? worker_thread+0x30/0x380
[ +0.003851] ? drain_workqueue+0x120/0x120
[ +0.004117] ? kthread+0x112/0x130
[ +0.003411] ? kthread_park+0x80/0x80
[ +0.005325] ? ret_from_fork+0x35/0x40

>
> The real question is, however, why offlining of the last block doesn't
> succeed. In __offline_pages() we basically have an endless loop (while
> holding the mem_hotplug_lock in write). Now I consider this piece of
> code very problematic (we should automatically fail after X
> attempts/after X seconds, and we should not ignore -ENOMEM), and we've
> had other BUGs where we would run into an endless loop here (e.g.
> related to hugepages, I guess).

Hmm, even though memory hotplug stalled there, there is still plenty of
memory. E.g. this system has 8 nodes and each node has 64 GB of memory,
512 GB in all. Now I run "stress -m 200" to trigger 200 processes that
malloc and then free 256 MB continuously, eating 50 GB in all. In
theory, there is still plenty of memory to migrate to.

>
> You mentioned memory pressure: if our host is under memory pressure we
> can easily trigger running into an endless loop there, because we
> basically ignore -ENOMEM, e.g. when we cannot get a page to migrate
> some memory to be offlined. I assume this is the case here.
> do_migrate_range() could be the bad boy if it keeps failing forever and
> we keep retrying.

Not sure what other people think about this. If memory removal fails
while there is still plenty of free memory left, I worry customers will
complain.

Yeah, it stopped at do_migrate_range() when trying to migrate the last
memory block. And each time it's the last memory block which can't be
offlined and hangs.

If any message or information is needed, I can provide it.

Thanks
Baoquan
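Conceptually, the fix Baoquan refers to moves the locking into the core offline path, so that every caller, including the ACPI-driven device_offline() chain in the trace above, holds the lock. A simplified sketch of the idea, not the exact diff of commit 381eab4a6ee:

```c
/*
 * Simplified sketch of the idea behind commit 381eab4a6ee: take the
 * lock in the core, so both the sysfs path and the ACPI-driven
 * device_offline() path are covered.
 */
int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
{
	int ret;

	mem_hotplug_begin();	/* mem_hotplug_lock in write */
	ret = __offline_pages(start_pfn, start_pfn + nr_pages);
	mem_hotplug_done();
	return ret;
}
```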
Re: Memory hotplug softlock issue
On 14.11.18 08:09, Baoquan He wrote:
> Hi,
>
> Tested memory hotplug on a bare metal system; hot removing always
> triggers a lockup. Usually it needs several hot plug/unplug cycles,
> then the hot removing hangs at the last block. This is with memory
> pressure added by executing "stress -m 200".
>
> I will attach part of the log. Any idea or suggestion, appreciated.
>
> Thanks
> Baoquan
>

Code seems to be waiting for the mem_hotplug_lock in read.
We hold mem_hotplug_lock in write whenever we online/offline/add/remove
memory. There are two ways to trigger offlining of memory:

1. Offlining via "echo offline > /sys/devices/system/memory/memory0/state"

   This always properly took the mem_hotplug_lock. Nothing changed.

2. Offlining via "echo 0 > /sys/devices/system/memory/memory0/online"

   This didn't take the mem_hotplug_lock, and I fixed that for this release.

So if you were testing with 1., you should have seen the same error
before this release (unless there is something else now broken in this
release).

The real question is, however, why offlining of the last block doesn't
succeed. In __offline_pages() we basically have an endless loop (while
holding the mem_hotplug_lock in write). Now I consider this piece of
code very problematic (we should automatically fail after X
attempts/after X seconds, and we should not ignore -ENOMEM), and we've
had other BUGs where we would run into an endless loop here (e.g.
related to hugepages, I guess).

You mentioned memory pressure: if our host is under memory pressure we
can easily trigger running into an endless loop there, because we
basically ignore -ENOMEM, e.g. when we cannot get a page to migrate
some memory to be offlined. I assume this is the case here.
do_migrate_range() could be the bad boy if it keeps failing forever and
we keep retrying.

--
Thanks,

David / dhildenb
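For orientation, the two sysfs entry points David distinguishes live in different places: the "state" attribute is memory-specific, while "online" is the generic device attribute that routes through device_online()/device_offline(). A heavily simplified paraphrase follows; handler names and bodies are paraphrased to my understanding, with parsing, error handling, and locking elided:

```c
#include <linux/device.h>

/* Path 1: "echo offline > .../memoryN/state" (drivers/base/memory.c). */
static ssize_t state_store(struct device *dev, struct device_attribute *attr,
			   const char *buf, size_t count)
{
	/* parses "online"/"offline" and changes the memory block's state */
	return count;
}

/* Path 2: "echo 0 > .../memoryN/online" (generic, drivers/base/core.c). */
static ssize_t online_store(struct device *dev, struct device_attribute *attr,
			    const char *buf, size_t count)
{
	/* parses the bool and calls device_online()/device_offline() */
	return count;
}
```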