Re: Memory hotplug softlock issue

2018-11-21 Thread Hugh Dickins
On Wed, 21 Nov 2018, Michal Hocko wrote:
> On Mon 19-11-18 21:44:41, Hugh Dickins wrote:
> [...]
> > [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
> > 
> > We have all assumed that it is essential to hold a page reference while
> > waiting on a page lock: partly to guarantee that there is still a struct
> > page when MEMORY_HOTREMOVE is configured, but also to protect against
> > reuse of the struct page going to someone who then holds the page locked
> > indefinitely, when the waiter can reasonably expect timely unlocking.
> 
> I would add the following for the "problem statement". Feel free to
> reuse per your preference:
> "
> An elevated reference count, however, stands in the way of migration and
> forces it to fail with bad timing. This is especially a problem for
> memory offlining, which retries forever (or until the operation is
> terminated from userspace), because a heavy refault workload can trigger
> essentially an endless loop of migration failures. Therefore
> __migration_entry_wait is essentially harmful to the very event it is
> waiting for.
> "

Okay, I do have a lot written from way back when I prepared the
now-abandoned migration_waitqueue patch internally, but I'll factor in
what you say above when I get there - in particular, you highlight the
memory offlining aspect, as in this mail thread: which is very helpful,
because it's outside my experience, so I wouldn't have mentioned it - thanks.

I just know that there's some important linkage to do, to the August 2017
WQ_FLAG_BOOKMARK discussion: so it's a research and editing job I have to
work myself up to at the right moment.

> 
> > But in fact, so long as wait_on_page_bit_common() does the put_page(),
> > and is careful not to rely on struct page contents thereafter, there is
> > no need to hold a reference to the page while waiting on it.  That does
> > mean that this case cannot go back through the loop: but that's fine for
> > the page migration case, and even if used more widely, is limited by the
> > "Stop walking if it's locked" optimization in wake_page_function().
> 
> I would appreciate it if this were more explicit that the
> elevated-ref-count problem still exists, but is reduced to a tiny time
> window compared to the whole time the waiter is blocked. So a great
> improvement.

Fair enough, I'll do so. (But that's a bit like when we say we've attached
something and then forget to do so: please check that I've been honest
when I do post.)

> 
> > Add interface put_and_wait_on_page_locked() to do this, using negative
> > value of the lock arg to wait_on_page_bit_common() to implement it.
> > No interruptible or killable variant needed yet, but they might follow:
> > I have a vague notion that reporting -EINTR should take precedence over
> > return from wait_on_page_bit_common() without knowing the page state,
> > so arrange it accordingly - but that may be nothing but pedantic.
> > 
> > shrink_page_list()'s __ClearPageLocked(): that was a surprise!
> 
> and I can imagine a bad one. Do we really have to be so clever here?
> The unlock_page went away in the name of performance (a978d6f521063)
> and I would argue that this is a slow path where this is just not worth
> it.

Do we really have to be so clever here? That's a good question: now we
have PG_waiters, we probably do not need to bother with this cleverness,
and it would save me from having to expand on that comment as I was asked.
I'll try going back to a simple unlock_page() there, and can always restore
the __ClearPageLocked() if a reviewer demands, or 0-day notices a regression.
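
(For reference, a minimal sketch of the simpler alternative, assuming the
shrink_page_list() free path that a978d6f521063 introduced - the exact
surrounding code is not quoted in this thread:)

	/*
	 * a978d6f521063 used __ClearPageLocked(page) here to skip the
	 * unlock slow path, on the premise that a page with no references
	 * can have no waiters.  put_and_wait_on_page_locked() breaks that
	 * premise, so fall back to a full unlock, which checks PG_waiters
	 * and wakes anyone parked on the page:
	 */
	unlock_page(page);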

> 
> > this
> > survived a lot of testing before that showed up.  It does raise the
> > question: should is_page_cache_freeable() and __remove_mapping() now
> > treat a PG_waiters page as if an extra reference were held?  Perhaps,
> > but I don't think it matters much, since shrink_page_list() already
> > had to win its trylock_page(), so waiters are not very common there: I
> > noticed no difference when trying the bigger change, and it's surely not
> > needed while put_and_wait_on_page_locked() is only for page migration.
> > 
> > Signed-off-by: Hugh Dickins 
> 
> The patch looks good to me - quite ugly but it doesn't make the existing
> code much worse.
> 
> With the problem Vlastimil described fixed, feel free to add
> Acked-by: Michal Hocko 

Thanks!

> 
> And thanks for a prompt patch. This is something I've been chasing for
> quite some time. __migration_entry_wait came to my radar only recently
> because this is an extremely volatile area.

You are very gracious to describe a patch promised six months ago as
"prompt".  But it does help me a lot to have it fixing a real problem
for someone (thank you Baoquan) - well, it fixed a real problem for us
internally too, but very nice to gather more backing for it like this.

Hugh


Re: Memory hotplug softlock issue

2018-11-21 Thread Michal Hocko
On Mon 19-11-18 21:44:41, Hugh Dickins wrote:
[...]
> [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
> 
> We have all assumed that it is essential to hold a page reference while
> waiting on a page lock: partly to guarantee that there is still a struct
> page when MEMORY_HOTREMOVE is configured, but also to protect against
> reuse of the struct page going to someone who then holds the page locked
> indefinitely, when the waiter can reasonably expect timely unlocking.

I would add the following for the "problem statement". Feel free to
reuse per your preference:
"
An elevated reference count, however, stands in the way of migration and
forces it to fail with bad timing. This is especially a problem for
memory offlining, which retries forever (or until the operation is
terminated from userspace), because a heavy refault workload can trigger
essentially an endless loop of migration failures. Therefore
__migration_entry_wait is essentially harmful to the very event it is
waiting for.
"

> But in fact, so long as wait_on_page_bit_common() does the put_page(),
> and is careful not to rely on struct page contents thereafter, there is
> no need to hold a reference to the page while waiting on it.  That does
> mean that this case cannot go back through the loop: but that's fine for
> the page migration case, and even if used more widely, is limited by the
> "Stop walking if it's locked" optimization in wake_page_function().

I would appreciate it if this were more explicit that the
elevated-ref-count problem still exists, but is reduced to a tiny time
window compared to the whole time the waiter is blocked. So a great
improvement.

> Add interface put_and_wait_on_page_locked() to do this, using negative
> value of the lock arg to wait_on_page_bit_common() to implement it.
> No interruptible or killable variant needed yet, but they might follow:
> I have a vague notion that reporting -EINTR should take precedence over
> return from wait_on_page_bit_common() without knowing the page state,
> so arrange it accordingly - but that may be nothing but pedantic.
> 
> shrink_page_list()'s __ClearPageLocked(): that was a surprise!

and I can imagine a bad one. Do we really have to be so clever here?
The unlock_page went away in the name of performance (a978d6f521063)
and I would argue that this is a slow path where this is just not worth
it.

> this
> survived a lot of testing before that showed up.  It does raise the
> question: should is_page_cache_freeable() and __remove_mapping() now
> treat a PG_waiters page as if an extra reference were held?  Perhaps,
> but I don't think it matters much, since shrink_page_list() already
> had to win its trylock_page(), so waiters are not very common there: I
> noticed no difference when trying the bigger change, and it's surely not
> needed while put_and_wait_on_page_locked() is only for page migration.
> 
> Signed-off-by: Hugh Dickins 

The patch looks good to me - quite ugly but it doesn't make the existing
code much worse.

With the problem Vlastimil described fixed, feel free to add
Acked-by: Michal Hocko 

And thanks for a prompt patch. This is something I've been chasing for
quite some time. __migration_entry_wait came to my radar only recently
because this is an extremely volatile area.
-- 
Michal Hocko
SUSE Labs


Re: Memory hotplug softlock issue

2018-11-20 Thread Hugh Dickins
On Tue, 20 Nov 2018, Hugh Dickins wrote:
> On Tue, 20 Nov 2018, Vlastimil Babka wrote:
> > >  
> > >   finish_wait(q, wait);
> > 
> > ... the code continues by:
> > 
> > if (thrashing) {
> > if (!PageSwapBacked(page))
> > 
> > So maybe we should not set 'thrashing' true when lock < 0?
> 
> Very good catch, thank you Vlastimil: as you might have guessed, the
> patch from a pre-PSI kernel applied cleanly, and I just hadn't reviewed
> the surrounding context properly before sending out.
> 
> I cannot say immediately what the right answer is; I'll have to do some
> research first: maybe not enter the block that sets thrashing true when
> lock < 0, as you suggest, or maybe force lock < 0 to 0 and put_page()
> afterwards, or... 

... I still won't adjust the patch tonight, but the answer is obvious
now I look closer: as you show in your extract above, the only thing
it does with "page" at the end is to ask if it was SwapBacked, so we
just need to set one more bool at the beginning to check at the end
(or I could make "thrashing" a -1, 0, 1 int like "lock": but my guess
is that that would not be to other people's taste: acceptable for the
arg, but stretching your patience for the local variable).
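
For concreteness, a sketch of that one-more-bool approach against the
PSI-era wait_on_page_bit_common() quoted above (names illustrative, not
final):

	bool thrashing = false;
	bool delayacct = false;	/* new: remember !PageSwapBacked up front */

	if (bit_nr == PG_locked && !PageUptodate(page) && PageWorkingset(page)) {
		if (!PageSwapBacked(page)) {
			delayacct_thrashing_start();
			delayacct = true;
		}
		psi_memstall_enter(&pflags);
		thrashing = true;
	}
	...
	finish_wait(q, wait);

	if (thrashing) {
		/* decided without touching page->flags: safe after put_page() */
		if (delayacct)
			delayacct_thrashing_end();
		psi_memstall_leave(&pflags);
	}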

By the way, I do have a further patch to wait_on_page_bit_common(),
which I could send at the same time, if it sounds right to you
(but it's a no-op in the put_and_wait page migration case).  That
__add_wait_queue_entry_tail() is right for the first time into
the loop, but I maintain that it should use __add_wait_queue()
for fairness thereafter, to avoid repeatedly sending older waiters
back to the back of the queue.  I don't have hard numbers for it,
but it's one of several patches, each of which helped to reduce our
wait_on_page_bit lockups in some (perhaps unrealistic) stress tests.
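
A sketch of that fairness tweak, again against the wait loop of that era
(the first_time flag is illustrative):

	for (;;) {
		spin_lock_irq(&q->lock);

		if (likely(list_empty(&wait->entry))) {
			if (first_time) {
				/* a new waiter queues behind existing ones */
				__add_wait_queue_entry_tail(q, wait);
				first_time = false;
			} else {
				/*
				 * Woken but lost the race for the bit:
				 * requeue at the head, rather than send an
				 * older waiter to the back of the queue again.
				 */
				__add_wait_queue(q, wait);
			}
			SetPageWaiters(page);
		}
		...
	}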

Hugh


Re: Memory hotplug softlock issue

2018-11-20 Thread Hugh Dickins
On Tue, 20 Nov 2018, Baoquan He wrote:
> On 11/20/18 at 02:38pm, Vlastimil Babka wrote:
> > On 11/20/18 6:44 AM, Hugh Dickins wrote:
> > > [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
> > > 
> > > We have all assumed that it is essential to hold a page reference while
> > > waiting on a page lock: partly to guarantee that there is still a struct
> > > page when MEMORY_HOTREMOVE is configured, but also to protect against
> > > reuse of the struct page going to someone who then holds the page locked
> > > indefinitely, when the waiter can reasonably expect timely unlocking.
> > > 
> > > But in fact, so long as wait_on_page_bit_common() does the put_page(),
> > > and is careful not to rely on struct page contents thereafter, there is
> > > no need to hold a reference to the page while waiting on it.  That does
> > 
> > So there's still a moment where refcount is elevated, but hopefully
> > short enough, right? Let's see if it survives Baoquan's stress testing.
> 
> Yes, I applied Hugh's patch 8 hours ago, then our QE Ping operated on
> that machine: after many rounds of hot removing/adding, the endless
> looping during migrating is not seen any more. The test result for
> Hugh's patch is positive. I even suggested Ping increase the memory
> pressure to "stress -m 250", and it still succeeded in offlining and
> removing.
> 
> So I think this patch solves the issue. Thanks a lot for your
> help, all of you. 

Very good to hear, thanks a lot for your quick feedback.

> 
> Hugh, will you post a formal patch in a separate thread?

Yes, I promise that I shall do so in the next few days, but not today:
some other things have to take priority.

And Vlastimil has raised an excellent point about the interaction with
PSI "thrashing": I need to read up and decide which way to go on that
(and add Johannes to the Cc when I post).

I think I shall probably post it directly to Linus (lists and other
people Cc'ed of course): not because I think it should be rushed in
too quickly, nor to sidestep Andrew, but because Linus was very closely
involved in both the PG_waiters and WQ_FLAG_BOOKMARK discussions:
it is an area of special interest to him.

Hugh


Re: Memory hotplug softlock issue

2018-11-20 Thread Hugh Dickins
On Tue, 20 Nov 2018, Vlastimil Babka wrote:
> On 11/20/18 6:44 AM, Hugh Dickins wrote:
> > [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
> > 
> > We have all assumed that it is essential to hold a page reference while
> > waiting on a page lock: partly to guarantee that there is still a struct
> > page when MEMORY_HOTREMOVE is configured, but also to protect against
> > reuse of the struct page going to someone who then holds the page locked
> > indefinitely, when the waiter can reasonably expect timely unlocking.
> > 
> > But in fact, so long as wait_on_page_bit_common() does the put_page(),
> > and is careful not to rely on struct page contents thereafter, there is
> > no need to hold a reference to the page while waiting on it.  That does
> 
> So there's still a moment where refcount is elevated, but hopefully
> short enough, right?

Correct: and given page migration's 10 passes, it would have to be very
unlucky to hit one of those transiently elevated refcounts every time:
so I don't think it's a grave drawback at all - certainly much less
grave than how it's done at present.

I admit that doing a get_page_unless_zero() immediately before the
put_and_wait_on_page_locked() looks rather silly, but I think we do
have to hold a reference in order to set PG_waiters. Then for other
future uses (e.g. in find_get_entry() or lock_page_or_retry()),
the reference to be dropped has been taken earlier anyway.
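
So the migration-side caller ends up looking something like this (a
sketch of the mm/migrate.c side, which the thread does not quote in
full):

	page = migration_entry_to_page(entry);
	/*
	 * The reference is taken only so that PG_waiters can be set and
	 * the wait entry safely queued: put_and_wait_on_page_locked()
	 * drops it again before actually going to sleep on the lock.
	 */
	if (!get_page_unless_zero(page))
		goto out;
	pte_unmap_unlock(ptep, ptl);
	put_and_wait_on_page_locked(page);
	return;
out:
	pte_unmap_unlock(ptep, ptl);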

> Let's see if it survives Baoquan's stress testing.
> 
> > mean that this case cannot go back through the loop: but that's fine for
> > the page migration case, and even if used more widely, is limited by the
> > "Stop walking if it's locked" optimization in wake_page_function().
> > 
> > Add interface put_and_wait_on_page_locked() to do this, using negative
> > value of the lock arg to wait_on_page_bit_common() to implement it.
> > No interruptible or killable variant needed yet, but they might follow:
> > I have a vague notion that reporting -EINTR should take precedence over
> > return from wait_on_page_bit_common() without knowing the page state,
> > so arrange it accordingly - but that may be nothing but pedantic.
> > 
> > shrink_page_list()'s __ClearPageLocked(): that was a surprise! this
> > survived a lot of testing before that showed up.  It does raise the
> > question: should is_page_cache_freeable() and __remove_mapping() now
> > treat a PG_waiters page as if an extra reference were held?  Perhaps,
> > but I don't think it matters much, since shrink_page_list() already
> > had to win its trylock_page(), so waiters are not very common there: I
> > noticed no difference when trying the bigger change, and it's surely not
> > needed while put_and_wait_on_page_locked() is only for page migration.
> > 
> > Signed-off-by: Hugh Dickins 
> > ---
> 
> ...
> 
> > @@ -1100,6 +,17 @@ static inline int 
> > wait_on_page_bit_common(wait_queue_head_t *q,
> > ret = -EINTR;
> > break;
> > }
> > +
> > +   if (lock < 0) {
> > +   /*
> > +* We can no longer safely access page->flags:
> 
> Hmm...
> 
> > +* even if CONFIG_MEMORY_HOTREMOVE is not enabled,
> > +* there is a risk of waiting forever on a page reused
> > +* for something that keeps it locked indefinitely.
> > +* But best check for -EINTR above before breaking.
> > +*/
> > +   break;
> > +   }
> > }
> >  
> > finish_wait(q, wait);
> 
> ... the code continues by:
> 
> if (thrashing) {
> if (!PageSwapBacked(page))
> 
> So maybe we should not set 'thrashing' true when lock < 0?

Very good catch, thank you Vlastimil: as you might have guessed, the
patch from a pre-PSI kernel applied cleanly, and I just hadn't reviewed
the surrounding context properly before sending out.

I cannot say immediately what the right answer is; I'll have to do some
research first: maybe not enter the block that sets thrashing true when
lock < 0, as you suggest, or maybe force lock < 0 to 0 and put_page()
afterwards, or... 

Hugh


Re: Memory hotplug softlock issue

2018-11-20 Thread Baoquan He
On 11/20/18 at 03:05pm, Michal Hocko wrote:
> > Yes, I applied Hugh's patch 8 hours ago, then our QE Ping operated on
> > that machine: after many rounds of hot removing/adding, the endless
> > looping during migrating is not seen any more. The test result for
> > Hugh's patch is positive. I even suggested Ping increase the memory
> > pressure to "stress -m 250", and it still succeeded in offlining and
> > removing.
> > 
> > So I think this patch solves the issue. Thanks a lot for your
> > help, all of you. 
> 
> This is great news! Thanks for your swift feedback. I will go and try
> to review Hugh's patch soon.

Yeah. Thanks a lot for your help with debugging and narrowing down
the cause of the issue, Michal, really appreciated!

> > Meanwhile we found that sometimes onlining pages may not add back all
> > memory blocks on one memory board; hot removing/adding them afterwards
> > will then cause a kernel panic. I will investigate further and collect
> > information, to see whether it's a kernel issue or a udev issue.
> 
> It would be great to get a report in a new email thread.

Sure, will do after reproducing and arranging the information.

Thanks
Baoquan


Re: Memory hotplug softlock issue

2018-11-20 Thread Michal Hocko
On Tue 20-11-18 21:58:03, Baoquan He wrote:
> Hi,
> 
> On 11/20/18 at 02:38pm, Vlastimil Babka wrote:
> > On 11/20/18 6:44 AM, Hugh Dickins wrote:
> > > [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
> > > 
> > > We have all assumed that it is essential to hold a page reference while
> > > waiting on a page lock: partly to guarantee that there is still a struct
> > > page when MEMORY_HOTREMOVE is configured, but also to protect against
> > > reuse of the struct page going to someone who then holds the page locked
> > > indefinitely, when the waiter can reasonably expect timely unlocking.
> > > 
> > > But in fact, so long as wait_on_page_bit_common() does the put_page(),
> > > and is careful not to rely on struct page contents thereafter, there is
> > > no need to hold a reference to the page while waiting on it.  That does
> > 
> > So there's still a moment where refcount is elevated, but hopefully
> > short enough, right? Let's see if it survives Baoquan's stress testing.
> 
> Yes, I applied Hugh's patch 8 hours ago, then our QE Ping operated on
> that machine: after many rounds of hot removing/adding, the endless
> looping during migrating is not seen any more. The test result for
> Hugh's patch is positive. I even suggested Ping increase the memory
> pressure to "stress -m 250", and it still succeeded in offlining and
> removing.
> 
> So I think this patch solves the issue. Thanks a lot for your
> help, all of you. 

This is great news! Thanks for your swift feedback. I will go and try
to review Hugh's patch soon.

> Hugh, will you post a formal patch in a separate thread?
> 
> Meanwhile we found that sometimes onlining pages may not add back all
> memory blocks on one memory board; hot removing/adding them afterwards
> will then cause a kernel panic. I will investigate further and collect
> information, to see whether it's a kernel issue or a udev issue.

It would be great to get a report in a new email thread.
> 
> Thanks
> Baoquan
> 
> > 
> > > mean that this case cannot go back through the loop: but that's fine for
> > > the page migration case, and even if used more widely, is limited by the
> > > "Stop walking if it's locked" optimization in wake_page_function().
> > > 
> > > Add interface put_and_wait_on_page_locked() to do this, using negative
> > > value of the lock arg to wait_on_page_bit_common() to implement it.
> > > No interruptible or killable variant needed yet, but they might follow:
> > > I have a vague notion that reporting -EINTR should take precedence over
> > > return from wait_on_page_bit_common() without knowing the page state,
> > > so arrange it accordingly - but that may be nothing but pedantic.
> > > 
> > > shrink_page_list()'s __ClearPageLocked(): that was a surprise! this
> > > survived a lot of testing before that showed up.  It does raise the
> > > question: should is_page_cache_freeable() and __remove_mapping() now
> > > treat a PG_waiters page as if an extra reference were held?  Perhaps,
> > > but I don't think it matters much, since shrink_page_list() already
> > > had to win its trylock_page(), so waiters are not very common there: I
> > > noticed no difference when trying the bigger change, and it's surely not
> > > needed while put_and_wait_on_page_locked() is only for page migration.
> > > 
> > > Signed-off-by: Hugh Dickins 
> > > ---
> > 
> > ...
> > 
> > > @@ -1100,6 +,17 @@ static inline int 
> > > wait_on_page_bit_common(wait_queue_head_t *q,
> > >   ret = -EINTR;
> > >   break;
> > >   }
> > > +
> > > + if (lock < 0) {
> > > + /*
> > > +  * We can no longer safely access page->flags:
> > 
> > Hmm...
> > 
> > > +  * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
> > > +  * there is a risk of waiting forever on a page reused
> > > +  * for something that keeps it locked indefinitely.
> > > +  * But best check for -EINTR above before breaking.
> > > +  */
> > > + break;
> > > + }
> > >   }
> > >  
> > >   finish_wait(q, wait);
> > 
> > ... the code continues by:
> > 
> > if (thrashing) {
> > if (!PageSwapBacked(page))
> > 
> > So maybe we should not set 'thrashing' true when lock < 0?
> > 
> > Thanks!
> > Vlastimil

-- 
Michal Hocko
SUSE Labs


Re: Memory hotplug softlock issue

2018-11-20 Thread Baoquan He
Hi,

On 11/20/18 at 02:38pm, Vlastimil Babka wrote:
> On 11/20/18 6:44 AM, Hugh Dickins wrote:
> > [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
> > 
> > We have all assumed that it is essential to hold a page reference while
> > waiting on a page lock: partly to guarantee that there is still a struct
> > page when MEMORY_HOTREMOVE is configured, but also to protect against
> > reuse of the struct page going to someone who then holds the page locked
> > indefinitely, when the waiter can reasonably expect timely unlocking.
> > 
> > But in fact, so long as wait_on_page_bit_common() does the put_page(),
> > and is careful not to rely on struct page contents thereafter, there is
> > no need to hold a reference to the page while waiting on it.  That does
> 
> So there's still a moment where refcount is elevated, but hopefully
> short enough, right? Let's see if it survives Baoquan's stress testing.

Yes, I applied Hugh's patch 8 hours ago, then our QE Ping operated on
that machine: after many rounds of hot removing/adding, the endless
looping during migrating is not seen any more. The test result for
Hugh's patch is positive. I even suggested Ping increase the memory
pressure to "stress -m 250", and it still succeeded in offlining and
removing.

So I think this patch solves the issue. Thanks a lot for your
help, all of you. 

Hugh, will you post a formal patch in a separate thread?

Meanwhile we found that sometimes onlining pages may not add back all
memory blocks on one memory board; hot removing/adding them afterwards
will then cause a kernel panic. I will investigate further and collect
information, to see whether it's a kernel issue or a udev issue.

Thanks
Baoquan

> 
> > mean that this case cannot go back through the loop: but that's fine for
> > the page migration case, and even if used more widely, is limited by the
> > "Stop walking if it's locked" optimization in wake_page_function().
> > 
> > Add interface put_and_wait_on_page_locked() to do this, using negative
> > value of the lock arg to wait_on_page_bit_common() to implement it.
> > No interruptible or killable variant needed yet, but they might follow:
> > I have a vague notion that reporting -EINTR should take precedence over
> > return from wait_on_page_bit_common() without knowing the page state,
> > so arrange it accordingly - but that may be nothing but pedantic.
> > 
> > shrink_page_list()'s __ClearPageLocked(): that was a surprise! this
> > survived a lot of testing before that showed up.  It does raise the
> > question: should is_page_cache_freeable() and __remove_mapping() now
> > treat a PG_waiters page as if an extra reference were held?  Perhaps,
> > but I don't think it matters much, since shrink_page_list() already
> > had to win its trylock_page(), so waiters are not very common there: I
> > noticed no difference when trying the bigger change, and it's surely not
> > needed while put_and_wait_on_page_locked() is only for page migration.
> > 
> > Signed-off-by: Hugh Dickins 
> > ---
> 
> ...
> 
> > @@ -1100,6 +,17 @@ static inline int 
> > wait_on_page_bit_common(wait_queue_head_t *q,
> > ret = -EINTR;
> > break;
> > }
> > +
> > +   if (lock < 0) {
> > +   /*
> > +* We can no longer safely access page->flags:
> 
> Hmm...
> 
> > +* even if CONFIG_MEMORY_HOTREMOVE is not enabled,
> > +* there is a risk of waiting forever on a page reused
> > +* for something that keeps it locked indefinitely.
> > +* But best check for -EINTR above before breaking.
> > +*/
> > +   break;
> > +   }
> > }
> >  
> > finish_wait(q, wait);
> 
> ... the code continues by:
> 
> if (thrashing) {
> if (!PageSwapBacked(page))
> 
> So maybe we should not set 'thrashing' true when lock < 0?
> 
> Thanks!
> Vlastimil


Re: Memory hotplug softlock issue

2018-11-20 Thread Vlastimil Babka
On 11/20/18 6:44 AM, Hugh Dickins wrote:
> [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
> 
> We have all assumed that it is essential to hold a page reference while
> waiting on a page lock: partly to guarantee that there is still a struct
> page when MEMORY_HOTREMOVE is configured, but also to protect against
> reuse of the struct page going to someone who then holds the page locked
> indefinitely, when the waiter can reasonably expect timely unlocking.
> 
> But in fact, so long as wait_on_page_bit_common() does the put_page(),
> and is careful not to rely on struct page contents thereafter, there is
> no need to hold a reference to the page while waiting on it.  That does

So there's still a moment where refcount is elevated, but hopefully
short enough, right? Let's see if it survives Baoquan's stress testing.

> mean that this case cannot go back through the loop: but that's fine for
> the page migration case, and even if used more widely, is limited by the
> "Stop walking if it's locked" optimization in wake_page_function().
> 
> Add interface put_and_wait_on_page_locked() to do this, using negative
> value of the lock arg to wait_on_page_bit_common() to implement it.
> No interruptible or killable variant needed yet, but they might follow:
> I have a vague notion that reporting -EINTR should take precedence over
> return from wait_on_page_bit_common() without knowing the page state,
> so arrange it accordingly - but that may be nothing but pedantic.
> 
> shrink_page_list()'s __ClearPageLocked(): that was a surprise! this
> survived a lot of testing before that showed up.  It does raise the
> question: should is_page_cache_freeable() and __remove_mapping() now
> treat a PG_waiters page as if an extra reference were held?  Perhaps,
> but I don't think it matters much, since shrink_page_list() already
> had to win its trylock_page(), so waiters are not very common there: I
> noticed no difference when trying the bigger change, and it's surely not
> needed while put_and_wait_on_page_locked() is only for page migration.
> 
> Signed-off-by: Hugh Dickins 
> ---

...

> @@ -1100,6 +,17 @@ static inline int 
> wait_on_page_bit_common(wait_queue_head_t *q,
>   ret = -EINTR;
>   break;
>   }
> +
> + if (lock < 0) {
> + /*
> +  * We can no longer safely access page->flags:

Hmm...

> +  * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
> +  * there is a risk of waiting forever on a page reused
> +  * for something that keeps it locked indefinitely.
> +  * But best check for -EINTR above before breaking.
> +  */
> + break;
> + }
>   }
>  
>   finish_wait(q, wait);

... the code continues by:

if (thrashing) {
if (!PageSwapBacked(page))

So maybe we should not set 'thrashing' true when lock < 0?

Thanks!
Vlastimil


Re: Memory hotplug softlock issue

2018-11-19 Thread Hugh Dickins
On Tue, 20 Nov 2018, Baoquan He wrote:
> On 11/19/18 at 09:59pm, Michal Hocko wrote:
> > On Mon 19-11-18 12:34:09, Hugh Dickins wrote:
> > > I'm glad that I delayed: what I had then (migration_waitqueue instead
> > > of using page_waitqueue) was not wrong, but what I've been using the
> > > last couple of months is rather better (and can be put to use to solve
> > > similar problems in collapsing pages on huge tmpfs, but we don't need
> > > to get into that at this time): put_and_wait_on_page_locked().
> > > 
> > > What I have not yet done is verify it on latest kernel, and research
> > > the interested Cc list (Linus and Tim Chen come immediately to mind),
> > > and write the commit comment. I have some testing to do on the latest
> > > kernel today, so I'll throw put_and_wait_on_page_locked() in too,
> > > and post tomorrow I hope.
> > 
> > Cool, it seems that Baoquan has a reliable test case to trigger the
> > pathological case.
> 
> Yes. I will test Hugh's patch.

Thanks: I've completed some of the retesting now, so it would probably
help us all if I post the patch in this thread, even without
completing its description and links and Cc list yet - there isn't
even a problem description below; I still have to paste that in from
the unposted patch that I made six months ago.  Here is today's...


[PATCH] mm: put_and_wait_on_page_locked() while page is migrated

We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.

But in fact, so long as wait_on_page_bit_common() does the put_page(),
and is careful not to rely on struct page contents thereafter, there is
no need to hold a reference to the page while waiting on it.  That does
mean that this case cannot go back through the loop: but that's fine for
the page migration case, and even if used more widely, is limited by the
"Stop walking if it's locked" optimization in wake_page_function().

Add interface put_and_wait_on_page_locked() to do this, using negative
value of the lock arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow:
I have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state,
so arrange it accordingly - but that may be nothing but pedantic.

shrink_page_list()'s __ClearPageLocked(): that was a surprise! this
survived a lot of testing before that showed up.  It does raise the
question: should is_page_cache_freeable() and __remove_mapping() now
treat a PG_waiters page as if an extra reference were held?  Perhaps,
but I don't think it matters much, since shrink_page_list() already
had to win its trylock_page(), so waiters are not very common there: I
noticed no difference when trying the bigger change, and it's surely not
needed while put_and_wait_on_page_locked() is only for page migration.

Signed-off-by: Hugh Dickins 
---
 include/linux/pagemap.h |  2 ++
 mm/filemap.c            | 53 -
 mm/huge_memory.c        |  6 ++---
 mm/migrate.c            | 12 --
 mm/vmscan.c             | 11 +
 5 files changed, 57 insertions(+), 27 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 226f96f0dee0..e2d7039af6a3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -537,6 +537,8 @@ static inline int wait_on_page_locked_killable(struct page 
*page)
return wait_on_page_bit_killable(compound_head(page), PG_locked);
 }
 
+extern void put_and_wait_on_page_locked(struct page *page);
+
 /* 
  * Wait for a page to complete writeback
  */
diff --git a/mm/filemap.c b/mm/filemap.c
index 81adec8ee02c..ef82119032d8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -981,7 +981,14 @@ static int wake_page_function(wait_queue_entry_t *wait, 
unsigned mode, int sync,
if (wait_page->bit_nr != key->bit_nr)
return 0;
 
-   /* Stop walking if it's locked */
+   /*
+* Stop walking if it's locked.
+* Is this safe if put_and_wait_on_page_locked() is in use?
+* Yes: the waker must hold a reference to this page, and if PG_locked
+* has now already been set by another task, that task must also hold
+* a reference to the *same usage* of this page; so there is no need
+* to walk on to wake even the put_and_wait_on_page_locked() callers.
+*/
if (test_bit(key->bit_nr, &key->page->flags))
return -1;
 
@@ -1050,13 +1057,14 @@ static void wake_up_page(struct page *page, int bit)
 }
 
 static inline int wait_on_page_bit_common(wait_queue_head_t *q,
-   struct page *page, int bit_nr, int state, bool lock)
+ 
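
(The archived diff is truncated here; per the commit message above, the
new interface amounts to something like this sketch, with a negative
"lock" value meaning "drop the reference before sleeping":)

	/*
	 * Sketch: "lock" becomes an int, and lock < 0 tells
	 * wait_on_page_bit_common() to put_page() before sleeping, and
	 * never to touch page->flags or go back through the loop after.
	 */
	void put_and_wait_on_page_locked(struct page *page)
	{
		wait_queue_head_t *q;

		page = compound_head(page);
		q = page_waitqueue(page);
		wait_on_page_bit_common(q, page, PG_locked,
					TASK_UNINTERRUPTIBLE, -1);
	}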

Re: Memory hotplug softlock issue

2018-11-19 Thread Baoquan He
On 11/19/18 at 09:59pm, Michal Hocko wrote:
> On Mon 19-11-18 12:34:09, Hugh Dickins wrote:
> > I'm glad that I delayed: what I had then (migration_waitqueue instead
> > of using page_waitqueue) was not wrong, but what I've been using the
> > last couple of months is rather better (and can be put to use to solve
> > similar problems in collapsing pages on huge tmpfs, but we don't need
> > to get into that at this time): put_and_wait_on_page_locked().
> > 
> > What I have not yet done is verify it on latest kernel, and research
> > the interested Cc list (Linus and Tim Chen come immediately to mind),
> > and write the commit comment. I have some testing to do on the latest
> > kernel today, so I'll throw put_and_wait_on_page_locked() in too,
> > and post tomorrow I hope.
> 
> Cool, it seems that Baoquan has a reliable test case to trigger the
> pathological case.

Yes. I will test Hugh's patch.



Re: Memory hotplug softlock issue

2018-11-19 Thread Michal Hocko
On Mon 19-11-18 12:34:09, Hugh Dickins wrote:
> On Mon, 19 Nov 2018, Michal Hocko wrote:
> > On Mon 19-11-18 15:10:16, Michal Hocko wrote:
> > [...]
> > > In other words. Why cannot we do the following?
> > 
> > Baoquan, this is certainly not the right fix but I would be really
> > curious whether it makes the problem go away.
> > 
> > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > index f7e4bfdc13b7..7ccab29bcf9a 100644
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -324,19 +324,9 @@ void __migration_entry_wait(struct mm_struct *mm, 
> > > pte_t *ptep,
> > >   goto out;
> > >  
> > >   page = migration_entry_to_page(entry);
> > > -
> > > - /*
> > > -  * Once page cache replacement of page migration started, page_count
> > > -  * *must* be zero. And, we don't want to call wait_on_page_locked()
> > > -  * against a page without get_page().
> > > -  * So, we use get_page_unless_zero(), here. Even failed, page fault
> > > -  * will occur again.
> > > -  */
> > > - if (!get_page_unless_zero(page))
> > > - goto out;
> > >   pte_unmap_unlock(ptep, ptl);
> > > - wait_on_page_locked(page);
> > > - put_page(page);
> > > + lock_page(page);
> > > + unlock_page(page);
> > >   return;
> > >  out:
> > >   pte_unmap_unlock(ptep, ptl);
> 
> Thanks for Cc'ing me. I did mention precisely this issue two or three
> times at LSF/MM this year, and claimed then that I would post the fix.

I've had a recollection of some issue, but just couldn't remember what
it was exactly. I tried to get Vlastimil to remember, but failed there
as well ;)

> I'm glad that I delayed: what I had then (migration_waitqueue instead
> of using page_waitqueue) was not wrong, but what I've been using the
> last couple of months is rather better (and can be put to use to solve
> similar problems in collapsing pages on huge tmpfs, but we don't need
> to get into that at this time): put_and_wait_on_page_locked().
> 
> What I have not yet done is verify it on latest kernel, and research
> the interested Cc list (Linus and Tim Chen come immediately to mind),
> and write the commit comment. I have some testing to do on the latest
> kernel today, so I'll throw put_and_wait_on_page_locked() in too,
> and post tomorrow I hope.

Cool, it seems that Baoquan has a reliable test case to trigger the
pathological case.
-- 
Michal Hocko
SUSE Labs


Re: Memory hotplug softlock issue

2018-11-19 Thread Hugh Dickins
On Mon, 19 Nov 2018, Michal Hocko wrote:
> On Mon 19-11-18 15:10:16, Michal Hocko wrote:
> [...]
> > In other words. Why cannot we do the following?
> 
> Baoquan, this is certainly not the right fix but I would be really
> curious whether it makes the problem go away.
> 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index f7e4bfdc13b7..7ccab29bcf9a 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -324,19 +324,9 @@ void __migration_entry_wait(struct mm_struct *mm, 
> > pte_t *ptep,
> > goto out;
> >  
> > page = migration_entry_to_page(entry);
> > -
> > -   /*
> > -* Once page cache replacement of page migration started, page_count
> > -* *must* be zero. And, we don't want to call wait_on_page_locked()
> > -* against a page without get_page().
> > -* So, we use get_page_unless_zero(), here. Even failed, page fault
> > -* will occur again.
> > -*/
> > -   if (!get_page_unless_zero(page))
> > -   goto out;
> > pte_unmap_unlock(ptep, ptl);
> > -   wait_on_page_locked(page);
> > -   put_page(page);
> > +   lock_page(page);
> > +   unlock_page(page);
> > return;
> >  out:
> > pte_unmap_unlock(ptep, ptl);

Thanks for Cc'ing me. I did mention precisely this issue two or three
times at LSF/MM this year, and claimed then that I would post the fix.

I'm glad that I delayed: what I had then (migration_waitqueue instead
of using page_waitqueue) was not wrong, but what I've been using the
last couple of months is rather better (and can be put to use to solve
similar problems in collapsing pages on huge tmpfs, but we don't need
to get into that at this time): put_and_wait_on_page_locked().

What I have not yet done is verify it on latest kernel, and research
the interested Cc list (Linus and Tim Chen come immediately to mind),
and write the commit comment. I have some testing to do on the latest
kernel today, so I'll throw put_and_wait_on_page_locked() in too,
and post tomorrow I hope.

Hugh


Re: Memory hotplug softlock issue

2018-11-19 Thread Michal Hocko
On Mon 19-11-18 15:10:16, Michal Hocko wrote:
[...]
> In other words. Why cannot we do the following?

Baoquan, this is certainly not the right fix but I would be really
curious whether it makes the problem go away.

> diff --git a/mm/migrate.c b/mm/migrate.c
> index f7e4bfdc13b7..7ccab29bcf9a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -324,19 +324,9 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t 
> *ptep,
>   goto out;
>  
>   page = migration_entry_to_page(entry);
> -
> - /*
> -  * Once page cache replacement of page migration started, page_count
> -  * *must* be zero. And, we don't want to call wait_on_page_locked()
> -  * against a page without get_page().
> -  * So, we use get_page_unless_zero(), here. Even failed, page fault
> -  * will occur again.
> -  */
> - if (!get_page_unless_zero(page))
> - goto out;
>   pte_unmap_unlock(ptep, ptl);
> - wait_on_page_locked(page);
> - put_page(page);
> + lock_page(page);
> + unlock_page(page);
>   return;
>  out:
>   pte_unmap_unlock(ptep, ptl);

-- 
Michal Hocko
SUSE Labs


Re: Memory hotplug softlock issue

2018-11-19 Thread Michal Hocko
On Mon 19-11-18 17:48:35, Vlastimil Babka wrote:
> On 11/19/18 5:46 PM, Vlastimil Babka wrote:
> > On 11/19/18 5:46 PM, Michal Hocko wrote:
> >> On Mon 19-11-18 17:36:21, Vlastimil Babka wrote:
> >>>
> >>> So what protects us from locking a page whose refcount dropped to zero
> >>> and is being freed? The checks in the freeing path won't be happy about a
> >>> stray lock.
> >>
> >> Nothing really prevents that. But does it matter? The worst that might
> >> happen is that we lock a freed or reused page. Who would complain?
> > 
> > free_pages_check() for example
> > 
> > PAGE_FLAGS_CHECK_AT_FREE includes PG_locked

Right you are.

> And besides... what about the last page being offlined and then the
> whole struct page's part of vmemmap destroyed as the node goes away?

Yeah, that is quite unlikely though, because there is quite a large
time window between the two events. I am not entirely sure we are safe
right now, TBH. Any access to the struct page after the put_page is
theoretically unsafe.

Then we have to come up with something more clever, I am afraid.

-- 
Michal Hocko
SUSE Labs


Re: Memory hotplug softlock issue

2018-11-19 Thread Vlastimil Babka
On 11/19/18 5:46 PM, Vlastimil Babka wrote:
> On 11/19/18 5:46 PM, Michal Hocko wrote:
>> On Mon 19-11-18 17:36:21, Vlastimil Babka wrote:
>>>
>>> So what protects us from locking a page whose refcount dropped to zero
>>> and is being freed? The checks in the freeing path won't be happy about a
>>> stray lock.
>>
>> Nothing really prevents that. But does it matter? The worst that might
>> happen is that we lock a freed or reused page. Who would complain?
> 
> free_pages_check() for example
> 
> PAGE_FLAGS_CHECK_AT_FREE includes PG_locked

And besides... what about the last page being offlined and then the
whole struct page's part of vmemmap destroyed as the node goes away?


Re: Memory hotplug softlock issue

2018-11-19 Thread Vlastimil Babka
On 11/19/18 5:46 PM, Michal Hocko wrote:
> On Mon 19-11-18 17:36:21, Vlastimil Babka wrote:
>>
>> So what protects us from locking a page whose refcount dropped to zero
>> and is being freed? The checks in the freeing path won't be happy about a
>> stray lock.
> 
> Nothing really prevents that. But does it matter? The worst that might
> happen is that we lock a freed or reused page. Who would complain?

free_pages_check() for example

PAGE_FLAGS_CHECK_AT_FREE includes PG_locked



Re: Memory hotplug softlock issue

2018-11-19 Thread Michal Hocko
On Mon 19-11-18 17:36:21, Vlastimil Babka wrote:
> On 11/19/18 3:10 PM, Michal Hocko wrote:
> > On Mon 19-11-18 13:51:21, Michal Hocko wrote:
> >> On Mon 19-11-18 13:40:33, Michal Hocko wrote:
> >>> How are
> >>> we supposed to converge when the swapin code waits for the migration to
> >>> finish with the reference count elevated?
> 
> Indeed this looks wrong. How come we only found this out now? I guess
> the race window where refcounts matter is only a part of the whole
> migration, where we update the mapping (migrate_page_move_mapping()).
> That's before copying contents, flags etc.

I guess we simply never found out because most migration callers simply
fail after a few attempts. The notable exception is memory offline, which
retries until it succeeds or the caller terminates the process with a
fatal signal.

> >> Just to clarify. This is not only about swapin obviously. Any caller of
> >> __migration_entry_wait is affected the same way AFAICS.
> > 
> > In other words. Why cannot we do the following?
> > 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index f7e4bfdc13b7..7ccab29bcf9a 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -324,19 +324,9 @@ void __migration_entry_wait(struct mm_struct *mm, 
> > pte_t *ptep,
> > goto out;
> >  
> > page = migration_entry_to_page(entry);
> > -
> > -   /*
> > -* Once page cache replacement of page migration started, page_count
> > -* *must* be zero. And, we don't want to call wait_on_page_locked()
> > -* against a page without get_page().
> > -* So, we use get_page_unless_zero(), here. Even failed, page fault
> > -* will occur again.
> > -*/
> > -   if (!get_page_unless_zero(page))
> > -   goto out;
> > pte_unmap_unlock(ptep, ptl);
> > -   wait_on_page_locked(page);
> > -   put_page(page);
> > +   lock_page(page);
> > +   unlock_page(page);
> 
> So what protects us from locking a page whose refcount dropped to zero
> and is being freed? The checks in the freeing path won't be happy about a
> stray lock.

Nothing really prevents that. But does it matter? The worst that might
happen is that we lock a freed or reused page. Who would complain?

-- 
Michal Hocko
SUSE Labs


Re: Memory hotplug softlock issue

2018-11-19 Thread Vlastimil Babka
On 11/19/18 3:10 PM, Michal Hocko wrote:
> On Mon 19-11-18 13:51:21, Michal Hocko wrote:
>> On Mon 19-11-18 13:40:33, Michal Hocko wrote:
>>> How are
>>> we supposed to converge when the swapin code waits for the migration to
>>> finish with the reference count elevated?

Indeed this looks wrong. How come we only found this out now? I guess
the race window where refcounts matter is only a part of the whole
migration, where we update the mapping (migrate_page_move_mapping()).
That's before copying contents, flags etc.
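
That refcount-sensitive step is where migrate_page_move_mapping()
freezes the page's reference count - roughly, as a simplified sketch of
that era's logic (expected_count being the references migration itself
accounts for):

	/*
	 * Migration may only replace the mapping entry while it owns all
	 * of the page's references, so it atomically freezes _refcount
	 * at the expected value.  Any transient extra reference, such as
	 * the get_page_unless_zero() in __migration_entry_wait(), makes
	 * the freeze fail and the migration attempt return -EAGAIN.
	 */
	if (!page_ref_freeze(page, expected_count))
		return -EAGAIN;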

>> Just to clarify. This is not only about swapin obviously. Any caller of
>> __migration_entry_wait is affected the same way AFAICS.
> 
> In other words. Why cannot we do the following?
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f7e4bfdc13b7..7ccab29bcf9a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -324,19 +324,9 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t 
> *ptep,
>   goto out;
>  
>   page = migration_entry_to_page(entry);
> -
> - /*
> -  * Once page cache replacement of page migration started, page_count
> -  * *must* be zero. And, we don't want to call wait_on_page_locked()
> -  * against a page without get_page().
> -  * So, we use get_page_unless_zero(), here. Even failed, page fault
> -  * will occur again.
> -  */
> - if (!get_page_unless_zero(page))
> - goto out;
>   pte_unmap_unlock(ptep, ptl);
> - wait_on_page_locked(page);
> - put_page(page);
> + lock_page(page);
> + unlock_page(page);

So what protects us from locking a page whose refcount dropped to zero
and is being freed? The checks in the freeing path won't be happy about a
stray lock.

I suspect it's not that simple to fix this. Perhaps migration code could
set some flag/bit in the page during the part where it stabilizes
refcounts, and __migration_entry_wait() would just spin until the bit is
cleared, and only then proceed with the current get_page+wait? Or we
could maybe wait on the pte itself and not the page?

>   return;
>  out:
>   pte_unmap_unlock(ptep, ptl);
> 



Re: Memory hotplug softlock issue

2018-11-19 Thread Michal Hocko
On Mon 19-11-18 13:51:21, Michal Hocko wrote:
> On Mon 19-11-18 13:40:33, Michal Hocko wrote:
> > On Mon 19-11-18 18:52:02, Baoquan He wrote:
> > [...]
> > 
> > There are few stacks directly in the offline path but those should be
> > OK.
> > The real culprit seems to be the swap in code
> > 
> > > [  +1.734416] CPU: 255 PID: 5558 Comm: stress Tainted: G L
> > > 4.20.0-rc2+ #7
> > > [  +0.007927] Hardware name:  9008/IT91SMUB, BIOS BLXSV512 03/22/2018
> > > [  +0.006297] Call Trace:
> > > [  +0.002537]  dump_stack+0x46/0x60
> > > [  +0.003386]  __migration_entry_wait.cold.65+0x5/0x14
> > > [  +0.005043]  do_swap_page+0x84e/0x960
> > > [  +0.003727]  ? arch_tlb_finish_mmu+0x29/0xc0
> > > [  +0.006412]  __handle_mm_fault+0x933/0x1330
> > > [  +0.004265]  handle_mm_fault+0xc4/0x250
> > > [  +0.003915]  __do_page_fault+0x2b7/0x510
> > > [  +0.003990]  do_page_fault+0x2c/0x110
> > > [  +0.003729]  ? page_fault+0x8/0x30
> > > [  +0.003462]  page_fault+0x1e/0x30
> > 
> > There are many traces to this path. We are 
> > /*
> >  * Once page cache replacement of page migration started, page_count
> >  * *must* be zero. And, we don't want to call wait_on_page_locked()
> >  * against a page without get_page().
> >  * So, we use get_page_unless_zero(), here. Even failed, page fault
> >  * will occur again.
> >  */
> > if (!get_page_unless_zero(page))
> > goto out;
> > pte_unmap_unlock(ptep, ptl);
> > wait_on_page_locked(page);
> > 
> > taking a reference to the page under migration. I have to think
> > about this much more, but I suspect this is just asking for a problem.
> > 
> > Cc migration experts. For your background information: we are seeing
> > memory offline not being able to converge because a few heavily used pages
> > fail to migrate away - e.g. 
> > http://lkml.kernel.org/r/20181116012433.GU2653@MiWiFi-R3L-srv
> > A debugging patch to dump stack for these pages 
> > http://lkml.kernel.org/r/20181116091409.gd14...@dhcp22.suse.cz
> > shows that references are taken from the swap-in code (above). How are
> > we supposed to converge when the swap-in code waits for the migration to
> > finish with the reference count elevated?
> 
> Just to clarify. This is not only about swapin obviously. Any caller of
> __migration_entry_wait is affected the same way AFAICS.

In other words. Why cannot we do the following?

diff --git a/mm/migrate.c b/mm/migrate.c
index f7e4bfdc13b7..7ccab29bcf9a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -324,19 +324,9 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t 
*ptep,
goto out;
 
page = migration_entry_to_page(entry);
-
-   /*
-* Once page cache replacement of page migration started, page_count
-* *must* be zero. And, we don't want to call wait_on_page_locked()
-* against a page without get_page().
-* So, we use get_page_unless_zero(), here. Even failed, page fault
-* will occur again.
-*/
-   if (!get_page_unless_zero(page))
-   goto out;
pte_unmap_unlock(ptep, ptl);
-   wait_on_page_locked(page);
-   put_page(page);
+   lock_page(page);
+   unlock_page(page);
return;
 out:
pte_unmap_unlock(ptep, ptl);
-- 
Michal Hocko
SUSE Labs


Re: Memory hotplug softlock issue

2018-11-19 Thread Michal Hocko
On Mon 19-11-18 13:40:33, Michal Hocko wrote:
> On Mon 19-11-18 18:52:02, Baoquan He wrote:
> [...]
> 
> There are few stacks directly in the offline path but those should be
> OK.
> The real culprit seems to be the swap in code
> 
> > [  +1.734416] CPU: 255 PID: 5558 Comm: stress Tainted: G L
> > 4.20.0-rc2+ #7
> > [  +0.007927] Hardware name:  9008/IT91SMUB, BIOS BLXSV512 03/22/2018
> > [  +0.006297] Call Trace:
> > [  +0.002537]  dump_stack+0x46/0x60
> > [  +0.003386]  __migration_entry_wait.cold.65+0x5/0x14
> > [  +0.005043]  do_swap_page+0x84e/0x960
> > [  +0.003727]  ? arch_tlb_finish_mmu+0x29/0xc0
> > [  +0.006412]  __handle_mm_fault+0x933/0x1330
> > [  +0.004265]  handle_mm_fault+0xc4/0x250
> > [  +0.003915]  __do_page_fault+0x2b7/0x510
> > [  +0.003990]  do_page_fault+0x2c/0x110
> > [  +0.003729]  ? page_fault+0x8/0x30
> > [  +0.003462]  page_fault+0x1e/0x30
> 
> There are many traces to this path. We are 
>   /*
>* Once page cache replacement of page migration started, page_count
>* *must* be zero. And, we don't want to call wait_on_page_locked()
>* against a page without get_page().
>* So, we use get_page_unless_zero(), here. Even failed, page fault
>* will occur again.
>*/
>   if (!get_page_unless_zero(page))
>   goto out;
>   pte_unmap_unlock(ptep, ptl);
>   wait_on_page_locked(page);
> 
> taking a reference to the page under migration. I have to think
> about this much more, but I suspect this is just asking for a problem.
> 
> Cc migration experts. For your background information: we are seeing
> memory offline not being able to converge because a few heavily used pages
> fail to migrate away - e.g. 
> http://lkml.kernel.org/r/20181116012433.GU2653@MiWiFi-R3L-srv
> A debugging patch to dump stack for these pages 
> http://lkml.kernel.org/r/20181116091409.gd14...@dhcp22.suse.cz
> shows that references are taken from the swap-in code (above). How are
> we supposed to converge when the swap-in code waits for the migration to
> finish with the reference count elevated?

Just to clarify. This is not only about swapin obviously. Any caller of
__migration_entry_wait is affected the same way AFAICS.
-- 
Michal Hocko
SUSE Labs


Re: Memory hotplug softlock issue

2018-11-19 Thread Michal Hocko
On Mon 19-11-18 18:52:02, Baoquan He wrote:
[...]

There are few stacks directly in the offline path but those should be
OK.
The real culprit seems to be the swap in code

> [  +1.734416] CPU: 255 PID: 5558 Comm: stress Tainted: G L
> 4.20.0-rc2+ #7
> [  +0.007927] Hardware name:  9008/IT91SMUB, BIOS BLXSV512 03/22/2018
> [  +0.006297] Call Trace:
> [  +0.002537]  dump_stack+0x46/0x60
> [  +0.003386]  __migration_entry_wait.cold.65+0x5/0x14
> [  +0.005043]  do_swap_page+0x84e/0x960
> [  +0.003727]  ? arch_tlb_finish_mmu+0x29/0xc0
> [  +0.006412]  __handle_mm_fault+0x933/0x1330
> [  +0.004265]  handle_mm_fault+0xc4/0x250
> [  +0.003915]  __do_page_fault+0x2b7/0x510
> [  +0.003990]  do_page_fault+0x2c/0x110
> [  +0.003729]  ? page_fault+0x8/0x30
> [  +0.003462]  page_fault+0x1e/0x30

There are many traces to this path. We are 
/*
 * Once page cache replacement of page migration started, page_count
 * *must* be zero. And, we don't want to call wait_on_page_locked()
 * against a page without get_page().
 * So, we use get_page_unless_zero(), here. Even failed, page fault
 * will occur again.
 */
if (!get_page_unless_zero(page))
goto out;
pte_unmap_unlock(ptep, ptl);
wait_on_page_locked(page);

taking a reference to the page under migration. I have to think
about this much more, but I suspect this is just asking for a problem.

Cc migration experts. For your background information: we are seeing
memory offline not being able to converge because a few heavily used pages
fail to migrate away - e.g. 
http://lkml.kernel.org/r/20181116012433.GU2653@MiWiFi-R3L-srv
A debugging patch to dump stack for these pages 
http://lkml.kernel.org/r/20181116091409.gd14...@dhcp22.suse.cz
shows that references are taken from the swap-in code (above). How are
we supposed to converge when the swap-in code waits for the migration to
finish with the reference count elevated?
-- 
Michal Hocko
SUSE Labs


Re: Memory hotplug softlock issue

2018-11-16 Thread Baoquan He
On 11/16/18 at 10:14am, Michal Hocko wrote:
> Could you try to apply this debugging patch on top please? It will dump
> stack trace for each reference count elevation for one page that fails
> to migrate after multiple passes.

Thanks, applied it and fixed two code issues. The dmesg has been sent to
you privately; please check. The dmesg buffer overflowed, so if you need
the earlier messages, I will retest.

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index b64ebf253381..f76e2c498f31 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -72,7 +72,7 @@ static inline int page_count(struct page *page)
return atomic_read(&compound_head(page)->_refcount);
 }
 
-struct page *page_to_track;
+extern struct page *page_to_track;
 static inline void set_page_count(struct page *page, int v)
 {
atomic_set(&page->_refcount, v);
diff --git a/mm/migrate.c b/mm/migrate.c
index 9b2e395a3d68..42c7499c43b9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1339,6 +1339,7 @@ static int unmap_and_move_huge_page(new_page_t 
get_new_page,
 }
 
 struct page *page_to_track;
+EXPORT_SYMBOL_GPL(page_to_track);
 
 /*
  * migrate_pages - migrate the pages specified in a list, to the free pages

> 
> diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> index 14d14beb1f7f..b64ebf253381 100644
> --- a/include/linux/page_ref.h
> +++ b/include/linux/page_ref.h
> @@ -72,9 +72,12 @@ static inline int page_count(struct page *page)
>   return atomic_read(&compound_head(page)->_refcount);
>  }
>  
> +struct page *page_to_track;
>  static inline void set_page_count(struct page *page, int v)
>  {
>   atomic_set(&page->_refcount, v);
> + if (page == page_to_track)
> + dump_stack();
>   if (page_ref_tracepoint_active(__tracepoint_page_ref_set))
>   __page_ref_set(page, v);
>  }
> @@ -91,6 +94,8 @@ static inline void init_page_count(struct page *page)
>  static inline void page_ref_add(struct page *page, int nr)
>  {
>   atomic_add(nr, &page->_refcount);
> + if (page == page_to_track)
> + dump_stack();
>   if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
>   __page_ref_mod(page, nr);
>  }
> @@ -105,6 +110,8 @@ static inline void page_ref_sub(struct page *page, int nr)
>  static inline void page_ref_inc(struct page *page)
>  {
>   atomic_inc(&page->_refcount);
> + if (page == page_to_track)
> + dump_stack();
>   if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
>   __page_ref_mod(page, 1);
>  }
> @@ -129,6 +136,8 @@ static inline int page_ref_inc_return(struct page *page)
>  {
>   int ret = atomic_inc_return(&page->_refcount);
>  
> + if (page == page_to_track)
> + dump_stack();
>   if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_and_return))
>   __page_ref_mod_and_return(page, 1, ret);
>   return ret;
> @@ -156,6 +165,8 @@ static inline int page_ref_add_unless(struct page *page, int nr, int u)
>  {
>   int ret = atomic_add_unless(&page->_refcount, nr, u);
>  
> + if (page == page_to_track)
> + dump_stack();
>   if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_unless))
>   __page_ref_mod_unless(page, nr, ret);
>   return ret;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f7e4bfdc13b7..9b2e395a3d68 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1338,6 +1338,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>   return rc;
>  }
>  
> +struct page *page_to_track;
> +
>  /*
>   * migrate_pages - migrate the pages specified in a list, to the free pages
>   *  supplied as the target for the page migration
> @@ -1375,6 +1377,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>   if (!swapwrite)
>   current->flags |= PF_SWAPWRITE;
>  
> + page_to_track = NULL;
>   for(pass = 0; pass < 10 && retry; pass++) {
>   retry = 0;
>  
> @@ -1417,6 +1420,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>   goto out;
>   case -EAGAIN:
>   retry++;
> + if (pass > 1 && !page_to_track)
> + page_to_track = page;
>   break;
>   case MIGRATEPAGE_SUCCESS:
>   nr_succeeded++;
> -- 
> Michal Hocko
> SUSE Labs


Re: Memory hotplug softlock issue

2018-11-16 Thread Michal Hocko
On Fri 16-11-18 09:24:33, Baoquan He wrote:
> On 11/15/18 at 03:32pm, Michal Hocko wrote:
> > On Thu 15-11-18 21:38:40, Baoquan He wrote:
> > > On 11/15/18 at 02:19pm, Michal Hocko wrote:
> > > > On Thu 15-11-18 21:12:11, Baoquan He wrote:
> > > > > On 11/15/18 at 09:30am, Michal Hocko wrote:
> > > > [...]
> > > > > > It would be also good to find out whether this is fs specific. E.g. does
> > > > > > it make any difference if you use a different one for your stress
> > > > > > testing?
> > > > > 
> > > > > Created a ramdisk and put the stress binary there, then ran stress -m 200;
> > > > > now it seems to be stuck migrating libc-2.28.so. And it's still xfs, so xfs
> > > > > is now a big suspect. At the bottom I paste the numactl output; you can see
> > > > > that it's the last 4G.
> > > > >
> > > > > It seems to be trying to migrate libc-2.28.so, but the stress program keeps
> > > > > trying to access and activate it.
> > > > 
> > > > Is this still with faultaround disabled? I have seen exactly the same
> > > > pattern in the bug I am working on. It was ext4 though.
> > > 
> > > After a long time of struggling, the second-to-last block, where libc-2.28.so
> > > is located, was reclaimed; now it has come to the last memory block - again
> > > the stress program itself. The swap migration entry has been made and it is
> > > trying to unmap; now it's looping there.
> > > 
> > > [  +0.004445] migrating pfn 190ff2bb0 failed 
> > > [  +0.13] page:ea643fcaec00 count:203 mapcount:201 
> > > mapping:888dfb268f48 index:0x0
> > > [  +0.012809] shmem_aops 
> > > [  +0.11] name:"stress" 
> > > [  +0.002550] flags: 
> > > 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked)
> > > [  +0.010715] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 
> > > 888dfb268f48
> > > [  +0.007828] raw:   00cb00c8 
> > > 888e72e92000
> > > [  +0.007810] page->mem_cgroup:888e72e92000
> > [...]
> > > [  +0.004455] migrating pfn 190ff2bb0 failed 
> > > [  +0.18] page:ea643fcaec00 count:203 mapcount:201 
> > > mapping:888dfb268f48 index:0x0
> > > [  +0.014392] shmem_aops 
> > > [  +0.10] name:"stress" 
> > > [  +0.002565] flags: 
> > > 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked)
> > > [  +0.010675] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 
> > > 888dfb268f48
> > > [  +0.007819] raw:   00cb00c8 
> > > 888e72e92000
> > > [  +0.007808] page->mem_cgroup:888e72e92000
> > 
> > OK, so this is the tmpfs-backed code of your stress test. This just tells us
> > that this is not fs specific. The reference count is 2 more than the map
> > count, which is the expected state. So the reference count must have been
> > elevated at the time when the migration was attempted. Shmem supports
> > fault around so this might still be possible (assuming it is enabled).
> > If not we really need to dig deeper. I will think of a debugging patch.
> 
> Disabled faultaround and rebooted, then tested again; it's looping forever
> in the last block again, on node2, on the stress program itself again. The
> weird thing is that the refcount now seems to have gone crazy - it's a
> random number. There must be something going wrong.

Could you try to apply this debugging patch on top, please? It will dump a
stack trace for each reference count elevation of one page that fails to
migrate after multiple passes.

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 14d14beb1f7f..b64ebf253381 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -72,9 +72,12 @@ static inline int page_count(struct page *page)
return atomic_read(&compound_head(page)->_refcount);
 }
 
+struct page *page_to_track;
 static inline void set_page_count(struct page *page, int v)
 {
atomic_set(&page->_refcount, v);
+   if (page == page_to_track)
+   dump_stack();
if (page_ref_tracepoint_active(__tracepoint_page_ref_set))
__page_ref_set(page, v);
 }
@@ -91,6 +94,8 @@ static inline void init_page_count(struct page *page)
 static inline void page_ref_add(struct page *page, int nr)
 {
atomic_add(nr, &page->_refcount);
+   if (page == page_to_track)
+   dump_stack();
if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
__page_ref_mod(page, nr);
 }
@@ -105,6 +110,8 @@ static inline void page_ref_sub(struct page *page, int nr)
 static inline void page_ref_inc(struct page *page)
 {
atomic_inc(&page->_refcount);
+   if (page == page_to_track)
+   dump_stack();
if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
__page_ref_mod(page, 1);
 }
@@ -129,6 +136,8 @@ static inline int page_ref_inc_return(struct page *page)
 {
int ret = atomic_inc_return(&page->_refcount);
 
+   if (page == page_to_track)
+   dump_stack();
	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_and_return))
		__page_ref_mod_and_return(page, 1, ret);
	return ret;
@@ -156,6 +165,8 @@ static inline int page_ref_add_unless(struct page *page, int nr, int u)
 {
	int ret = atomic_add_unless(&page->_refcount, nr, u);
 
+	if (page == page_to_track)
+		dump_stack();
	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_unless))
		__page_ref_mod_unless(page, nr, ret);
	return ret;
diff --git a/mm/migrate.c b/mm/migrate.c
index f7e4bfdc13b7..9b2e395a3d68 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1338,6 +1338,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	return rc;
 }
 
+struct page *page_to_track;
+
 /*
  * migrate_pages - migrate the pages specified in a list, to the free pages
  *	supplied as the target for the page migration
@@ -1375,6 +1377,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 	if (!swapwrite)
 		current->flags |= PF_SWAPWRITE;
 
+	page_to_track = NULL;
	for(pass = 0; pass < 10 && retry; pass++) {
		retry = 0;
 
@@ -1417,6 +1420,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
			goto out;
		case -EAGAIN:
			retry++;
+			if (pass > 1 && !page_to_track)
+				page_to_track = page;
			break;
		case MIGRATEPAGE_SUCCESS:
			nr_succeeded++;
-- 
Michal Hocko
SUSE Labs

Re: Memory hotplug softlock issue

2018-11-15 Thread Baoquan He
On 11/15/18 at 03:32pm, Michal Hocko wrote:
> On Thu 15-11-18 21:38:40, Baoquan He wrote:
> > On 11/15/18 at 02:19pm, Michal Hocko wrote:
> > > On Thu 15-11-18 21:12:11, Baoquan He wrote:
> > > > On 11/15/18 at 09:30am, Michal Hocko wrote:
> > > [...]
> > > > > It would be also good to find out whether this is fs specific. E.g. does
> > > > > it make any difference if you use a different one for your stress
> > > > > testing?
> > > > 
> > > > Created a ramdisk and put the stress binary there, then ran stress -m 200;
> > > > now it seems to be stuck migrating libc-2.28.so. And it's still xfs, so xfs
> > > > is now a big suspect. At the bottom I paste the numactl output; you can see
> > > > that it's the last 4G.
> > > >
> > > > It seems to be trying to migrate libc-2.28.so, but the stress program keeps
> > > > trying to access and activate it.
> > > 
> > > Is this still with faultaround disabled? I have seen exactly the same
> > > pattern in the bug I am working on. It was ext4 though.
> > 
> > After a long time of struggling, the second-to-last block, where libc-2.28.so
> > is located, was reclaimed; now it has come to the last memory block - again
> > the stress program itself. The swap migration entry has been made and it is
> > trying to unmap; now it's looping there.
> > 
> > [  +0.004445] migrating pfn 190ff2bb0 failed 
> > [  +0.13] page:ea643fcaec00 count:203 mapcount:201 
> > mapping:888dfb268f48 index:0x0
> > [  +0.012809] shmem_aops 
> > [  +0.11] name:"stress" 
> > [  +0.002550] flags: 
> > 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked)
> > [  +0.010715] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 
> > 888dfb268f48
> > [  +0.007828] raw:   00cb00c8 
> > 888e72e92000
> > [  +0.007810] page->mem_cgroup:888e72e92000
> [...]
> > [  +0.004455] migrating pfn 190ff2bb0 failed 
> > [  +0.18] page:ea643fcaec00 count:203 mapcount:201 
> > mapping:888dfb268f48 index:0x0
> > [  +0.014392] shmem_aops 
> > [  +0.10] name:"stress" 
> > [  +0.002565] flags: 
> > 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked)
> > [  +0.010675] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 
> > 888dfb268f48
> > [  +0.007819] raw:   00cb00c8 
> > 888e72e92000
> > [  +0.007808] page->mem_cgroup:888e72e92000
> 
> OK, so this is the tmpfs-backed code of your stress test. This just tells us
> that this is not fs specific. The reference count is 2 more than the map
> count, which is the expected state. So the reference count must have been
> elevated at the time when the migration was attempted. Shmem supports
> fault around so this might still be possible (assuming it is enabled).
> If not we really need to dig deeper. I will think of a debugging patch.

Disabled faultaround and rebooted, then tested again; it's looping forever
in the last block again, on node2, on the stress program itself again. The
weird thing is that the refcount now seems to have gone crazy - it's a
random number. There must be something going wrong.

[  +0.058624] migrating pfn 80fd6fbe failed 
[  +0.03] page:ea203f5bef80 count:336 mapcount:201 
mapping:888e1c9357d8 index:0x2
[  +0.014122] shmem_aops 
[  +0.00] name:"stress" 
[  +0.002467] flags: 0x9fc008000e(referenced|uptodate|dirty|swapbacked)
[  +0.009511] raw: 009fc008000e c90e3d80 c90e3d80 
888e1c9357d8
[  +0.007743] raw: 0002  00cb00c8 
888e2233d000
[  +0.007740] page->mem_cgroup:888e2233d000
[  +0.038916] migrating pfn 80fd6fbe failed 
[  +0.03] page:ea203f5bef80 count:349 mapcount:201 
mapping:888e1c9357d8 index:0x2
[  +0.012453] shmem_aops 
[  +0.01] name:"stress" 
[  +0.002641] flags: 0x9fc008000e(referenced|uptodate|dirty|swapbacked)
[  +0.009501] raw: 009fc008000e c90e3d80 c90e3d80 
888e1c9357d8
[  +0.007746] raw: 0002  00cb00c8 
888e2233d000
[  +0.007740] page->mem_cgroup:888e2233d000
[  +0.061226] migrating pfn 80fd6fbe failed 
[  +0.04] page:ea203f5bef80 count:276 mapcount:201 
mapping:888e1c9357d8 index:0x2
[  +0.014129] shmem_aops 
[  +0.02] name:"stress" 
[  +0.003246] flags: 
0x9fc008008e(waiters|referenced|uptodate|dirty|swapbacked)
[  +0.010183] raw: 009fc008008e c90e3d80 c90e3d80 
888e1c9357d8
[  +0.007742] raw: 0002  00cb00c8 
888e2233d000
[  +0.007733] page->mem_cgroup:888e2233d000
[  +0.037305] migrating pfn 80fd6fbe failed 
[  +0.03] page:ea203f5bef80 count:304 mapcount:201 
mapping:888e1c9357d8 index:0x2
[  +0.012449] shmem_aops 
[  +0.02] name:"stress" 
[  +0.002469] flags: 0x9fc008000e(referenced|uptodate|dirty|swapbacked)
[  +0.009495] raw: 009fc008000e c90e3d80 c90e3d80 
888e1c9357d8
[  +0.007743] raw: 0002  00cb

Re: Memory hotplug softlock issue

2018-11-15 Thread Baoquan He
On 11/15/18 at 03:32pm, Michal Hocko wrote:
> On Thu 15-11-18 21:38:40, Baoquan He wrote:
> > On 11/15/18 at 02:19pm, Michal Hocko wrote:
> > > On Thu 15-11-18 21:12:11, Baoquan He wrote:
> > > > On 11/15/18 at 09:30am, Michal Hocko wrote:
> > > [...]
> > > > > It would be also good to find out whether this is fs specific. E.g. does
> > > > > it make any difference if you use a different one for your stress
> > > > > testing?
> > > > 
> > > > Created a ramdisk and put the stress binary there, then ran stress -m 200;
> > > > now it seems to be stuck migrating libc-2.28.so. And it's still xfs, so xfs
> > > > is now a big suspect. At the bottom I paste the numactl output; you can see
> > > > that it's the last 4G.
> > > >
> > > > It seems to be trying to migrate libc-2.28.so, but the stress program keeps
> > > > trying to access and activate it.
> > > 
> > > Is this still with faultaround disabled? I have seen exactly the same
> > > pattern in the bug I am working on. It was ext4 though.
> > 
> > After a long time of struggling, the second-to-last block, where libc-2.28.so
> > is located, was reclaimed; now it has come to the last memory block - again
> > the stress program itself. The swap migration entry has been made and it is
> > trying to unmap; now it's looping there.
> > 
> > [  +0.004445] migrating pfn 190ff2bb0 failed 
> > [  +0.13] page:ea643fcaec00 count:203 mapcount:201 
> > mapping:888dfb268f48 index:0x0
> > [  +0.012809] shmem_aops 
> > [  +0.11] name:"stress" 
> > [  +0.002550] flags: 
> > 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked)
> > [  +0.010715] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 
> > 888dfb268f48
> > [  +0.007828] raw:   00cb00c8 
> > 888e72e92000
> > [  +0.007810] page->mem_cgroup:888e72e92000
> [...]
> > [  +0.004455] migrating pfn 190ff2bb0 failed 
> > [  +0.18] page:ea643fcaec00 count:203 mapcount:201 
> > mapping:888dfb268f48 index:0x0
> > [  +0.014392] shmem_aops 
> > [  +0.10] name:"stress" 
> > [  +0.002565] flags: 
> > 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked)
> > [  +0.010675] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 
> > 888dfb268f48
> > [  +0.007819] raw:   00cb00c8 
> > 888e72e92000
> > [  +0.007808] page->mem_cgroup:888e72e92000
> 
> OK, so this is the tmpfs-backed code of your stress test. This just tells us
> that this is not fs specific. The reference count is 2 more than the map
> count, which is the expected state. So the reference count must have been
> elevated at the time when the migration was attempted. Shmem supports
> fault around so this might still be possible (assuming it is enabled).
> If not we really need to dig deeper. I will think of a debugging patch.

Yes, faultaround is enabled. Will disable it and test again, and will report
the test result.


Re: Memory hotplug softlock issue

2018-11-15 Thread Michal Hocko
On Thu 15-11-18 21:38:40, Baoquan He wrote:
> On 11/15/18 at 02:19pm, Michal Hocko wrote:
> > On Thu 15-11-18 21:12:11, Baoquan He wrote:
> > > On 11/15/18 at 09:30am, Michal Hocko wrote:
> > [...]
> > > > It would be also good to find out whether this is fs specific. E.g. does
> > > > it make any difference if you use a different one for your stress
> > > > testing?
> > > 
> > > Created a ramdisk and put the stress binary there, then ran stress -m 200;
> > > now it seems to be stuck migrating libc-2.28.so. And it's still xfs, so xfs
> > > is now a big suspect. At the bottom I paste the numactl output; you can see
> > > that it's the last 4G.
> > >
> > > It seems to be trying to migrate libc-2.28.so, but the stress program keeps
> > > trying to access and activate it.
> > 
> > Is this still with faultaround disabled? I have seen exactly the same
> > pattern in the bug I am working on. It was ext4 though.
> 
> After a long time of struggling, the second-to-last block, where libc-2.28.so
> is located, was reclaimed; now it has come to the last memory block - again
> the stress program itself. The swap migration entry has been made and it is
> trying to unmap; now it's looping there.
> 
> [  +0.004445] migrating pfn 190ff2bb0 failed 
> [  +0.13] page:ea643fcaec00 count:203 mapcount:201 
> mapping:888dfb268f48 index:0x0
> [  +0.012809] shmem_aops 
> [  +0.11] name:"stress" 
> [  +0.002550] flags: 
> 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked)
> [  +0.010715] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 
> 888dfb268f48
> [  +0.007828] raw:   00cb00c8 
> 888e72e92000
> [  +0.007810] page->mem_cgroup:888e72e92000
[...]
> [  +0.004455] migrating pfn 190ff2bb0 failed 
> [  +0.18] page:ea643fcaec00 count:203 mapcount:201 
> mapping:888dfb268f48 index:0x0
> [  +0.014392] shmem_aops 
> [  +0.10] name:"stress" 
> [  +0.002565] flags: 
> 0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked)
> [  +0.010675] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 
> 888dfb268f48
> [  +0.007819] raw:   00cb00c8 
> 888e72e92000
> [  +0.007808] page->mem_cgroup:888e72e92000

OK, so this is the tmpfs-backed code of your stress test. This just tells us
that this is not fs specific. The reference count is 2 more than the map
count, which is the expected state. So the reference count must have been
elevated at the time when the migration was attempted. Shmem supports
fault around so this might still be possible (assuming it is enabled).
If not we really need to dig deeper. I will think of a debugging patch.
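
To spell out the arithmetic, here is a hedged sketch (not actual mm code;
page_mapcount() and page_count() are the real helpers, the rest is pure
illustration). For the dump above: 201 pte mappings + 1 page cache
reference + 1 isolation reference taken by migration = the count of 203.

#include <linux/mm.h>

/* Illustrative only: what the expected "quiescent" refcount looks like. */
static bool refcount_looks_expected(struct page *page)
{
	int expected = page_mapcount(page)	/* one per mapping pte */
		     + 1			/* the page cache's reference */
		     + 1;			/* migration's isolation reference */

	return page_count(page) == expected;
}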
-- 
Michal Hocko
SUSE Labs


Re: Memory hotplug softlock issue

2018-11-15 Thread Michal Hocko
On Thu 15-11-18 21:23:42, Baoquan He wrote:
> On 11/15/18 at 02:19pm, Michal Hocko wrote:
> > On Thu 15-11-18 21:12:11, Baoquan He wrote:
> > > On 11/15/18 at 09:30am, Michal Hocko wrote:
> > [...]
> > > > It would be also good to find out whether this is fs specific. E.g. does
> > > > it make any difference if you use a different one for your stress
> > > > testing?
> > > 
> > > Created a ramdisk and put the stress binary there, then ran stress -m 200;
> > > now it seems to be stuck migrating libc-2.28.so. And it's still xfs, so xfs
> > > is now a big suspect. At the bottom I paste the numactl output; you can see
> > > that it's the last 4G.
> > >
> > > It seems to be trying to migrate libc-2.28.so, but the stress program keeps
> > > trying to access and activate it.
> > 
> > Is this still with faultaround disabled? I have seen exactly the same
> > pattern in the bug I am working on. It was ext4 though.
> 
> No, I forgot to disable faultaround after the reboot. Do we need to disable
> it and retest?

No, faultaround is checked at the time of the fault. The reason why I
am suspecting this path is that it can elevate the reference count
before taking the lock. The normal page fault path locks the page
first, and we hold the lock while trying to migrate that page.
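
Roughly, as a hedged sketch (loosely modeled on the faultaround path in
filemap_map_pages(); not the actual code):

#include <linux/mm.h>
#include <linux/pagemap.h>

static void faultaround_sketch(struct page *page)
{
	if (!page_cache_get_speculative(page))	/* refcount bumped, no lock held */
		return;
	if (!trylock_page(page)) {	/* migration owns the page lock ... */
		put_page(page);		/* ... so drop the reference and move on */
		return;
	}
	/* ... map the page into the faulting vma ... */
	unlock_page(page);
	put_page(page);
}

Each failed trylock is one bump-and-drop of ->_refcount, which is exactly
the up-and-down pattern in the dumps.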
-- 
Michal Hocko
SUSE Labs


Re: Memory hotplug softlock issue

2018-11-15 Thread Baoquan He
On 11/15/18 at 02:19pm, Michal Hocko wrote:
> On Thu 15-11-18 21:12:11, Baoquan He wrote:
> > On 11/15/18 at 09:30am, Michal Hocko wrote:
> [...]
> > > It would be also good to find out whether this is fs specific. E.g. does
> > > it make any difference if you use a different one for your stress
> > > testing?
> > 
> > Created a ramdisk and put the stress binary there, then ran stress -m 200;
> > now it seems to be stuck migrating libc-2.28.so. And it's still xfs, so xfs
> > is now a big suspect. At the bottom I paste the numactl output; you can see
> > that it's the last 4G.
> >
> > It seems to be trying to migrate libc-2.28.so, but the stress program keeps
> > trying to access and activate it.
> 
> Is this still with faultaround disabled? I have seen exactly the same
> pattern in the bug I am working on. It was ext4 though.

After a long time of struggling, the second-to-last block, where libc-2.28.so
is located, was reclaimed; now it has come to the last memory block - again
the stress program itself. The swap migration entry has been made and it is
trying to unmap; now it's looping there.

[  +0.004445] migrating pfn 190ff2bb0 failed 
[  +0.13] page:ea643fcaec00 count:203 mapcount:201 
mapping:888dfb268f48 index:0x0
[  +0.012809] shmem_aops 
[  +0.11] name:"stress" 
[  +0.002550] flags: 
0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked)
[  +0.010715] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 
888dfb268f48
[  +0.007828] raw:   00cb00c8 
888e72e92000
[  +0.007810] page->mem_cgroup:888e72e92000
[  +0.004466] migrating pfn 190ff2bb1 failed 
[  +0.13] page:ea643fcaec40 count:203 mapcount:201 
mapping:888dfb268f48 index:0x2
[  +0.014321] shmem_aops 
[  +0.24] name:"stress" 
[  +0.002535] flags: 
0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked)
[  +0.010680] raw: 01dfc008004e c90e3d80 ea643fcaec08 
888dfb268f48
[  +0.007863] raw: 0002  00cb00c8 
888e72e92000
[  +0.007828] page->mem_cgroup:888e72e92000
[  +1.357302] migrating pfn 190ff1c53 failed 
[  +0.11] page:ea643fc714c0 count:202 mapcount:201 
mapping:888e5e395109 index:0x28
[  +0.012756] anon 
[  +0.24] flags: 0x1dfc008004c(uptodate|dirty|workingset|swapbacked)
[  +0.008961] raw: 01dfc008004c ea643fcaec08 c90e3d80 
888e5e395109
[  +0.007847] raw: 0028  00ca00c8 
888e72e92000
[  +0.007844] page->mem_cgroup:888e72e92000
[  +0.004455] migrating pfn 190ff2bb0 failed 
[  +0.18] page:ea643fcaec00 count:203 mapcount:201 
mapping:888dfb268f48 index:0x0
[  +0.014392] shmem_aops 
[  +0.10] name:"stress" 
[  +0.002565] flags: 
0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked)
[  +0.010675] raw: 01dfc008004e ea643fcaec48 ea643fc714c8 
888dfb268f48
[  +0.007819] raw:   00cb00c8 
888e72e92000
[  +0.007808] page->mem_cgroup:888e72e92000
[  +0.004431] migrating pfn 190ff2bb1 failed 
[  +0.19] page:ea643fcaec40 count:203 mapcount:201 
mapping:888dfb268f48 index:0x2
[  +0.012688] shmem_aops 
[  +0.12] name:"stress" 
[  +0.002525] flags: 
0x1dfc008004e(referenced|uptodate|dirty|workingset|swapbacked)
[  +0.012385] raw: 01dfc008004e c90e3d80 ea643fcaec08 
888dfb268f48
[  +0.007840] raw: 0002  00cb00c8 
888e72e92000
[  +0.007832] page->mem_cgroup:888e72e92000



Re: Memory hotplug softlock issue

2018-11-15 Thread Baoquan He
On 11/15/18 at 02:19pm, Michal Hocko wrote:
> On Thu 15-11-18 21:12:11, Baoquan He wrote:
> > On 11/15/18 at 09:30am, Michal Hocko wrote:
> [...]
> > > It would be also good to find out whether this is fs specific. E.g. does
> > > it make any difference if you use a different one for your stress
> > > testing?
> > 
> > Created a ramdisk and put the stress binary there, then ran stress -m 200;
> > now it seems to be stuck migrating libc-2.28.so. And it's still xfs, so xfs
> > is now a big suspect. At the bottom I paste the numactl output; you can see
> > that it's the last 4G.
> >
> > It seems to be trying to migrate libc-2.28.so, but the stress program keeps
> > trying to access and activate it.
> 
> Is this still with faultaround disabled? I have seen exactly the same
> pattern in the bug I am working on. It was ext4 though.

No, I forgot to disable faultaround after the reboot. Do we need to disable
it and retest?

> 
> > [ 5055.461652] migrating pfn 190f4fb3e failed 
> > [ 5055.461671] page:ea643d3ecf80 count:257 mapcount:251 
> > mapping:888e7a6ac528 index:0x85
> > [ 5055.474734] xfs_address_space_operations [xfs] 
> > [ 5055.474742] name:"libc-2.28.so" 
> > [ 5055.481070] flags: 0x1dfc026(referenced|uptodate|active)
> > [ 5055.490329] raw: 01dfc026 c90e3d80 c90e3d80 
> > 888e7a6ac528
> > [ 5055.498080] raw: 0085  00fc00f9 
> > 88810a8f2000
> > [ 5055.505823] page->mem_cgroup:88810a8f2000
> > [ 5056.335970] migrating pfn 190f4fb3e failed 
> > [ 5056.335990] page:ea643d3ecf80 count:255 mapcount:250 
> > mapping:888e7a6ac528 index:0x85
> > [ 5056.348994] xfs_address_space_operations [xfs] 
> > [ 5056.348998] name:"libc-2.28.so" 
> > [ 5056.353555] flags: 0x1dfc026(referenced|uptodate|active)
> > [ 5056.364680] raw: 01dfc026 c90e3d80 c90e3d80 
> > 888e7a6ac528
> > [ 5056.372428] raw: 0085  00fc00f9 
> > 88810a8f2000
> > [ 5056.380172] page->mem_cgroup:88810a8f2000
> > [ 5057.332806] migrating pfn 190f4fb3e failed 
> > [ 5057.332821] page:ea643d3ecf80 count:261 mapcount:250 
> > mapping:888e7a6ac528 index:0x85
> > [ 5057.345889] xfs_address_space_operations [xfs] 
> > [ 5057.345900] name:"libc-2.28.so" 
> > [ 5057.350451] flags: 0x1dfc026(referenced|uptodate|active)
> > [ 5057.359707] raw: 01dfc026 c90e3d80 c90e3d80 
> > 888e7a6ac528
> > [ 5057.369285] raw: 0085  00fc00f9 
> > 88810a8f2000
> > [ 5057.377030] page->mem_cgroup:88810a8f2000
> > [ 5058.285457] migrating pfn 190f4fb3e failed 
> > [ 5058.285489] page:ea643d3ecf80 count:257 mapcount:250 
> > mapping:888e7a6ac528 index:0x85
> > [ 5058.298544] xfs_address_space_operations [xfs] 
> > [ 5058.298556] name:"libc-2.28.so" 
> > [ 5058.303092] flags: 0x1dfc026(referenced|uptodate|active)
> > [ 5058.314358] raw: 01dfc026 c90e3d80 c90e3d80 
> > 888e7a6ac528
> > [ 5058.322109] raw: 0085  00fc00f9 
> > 88810a8f2000
> > [ 5058.329848] page->mem_cgroup:88810a8f2000
> -- 
> Michal Hocko
> SUSE Labs


Re: Memory hotplug softlock issue

2018-11-15 Thread Michal Hocko
On Thu 15-11-18 21:12:11, Baoquan He wrote:
> On 11/15/18 at 09:30am, Michal Hocko wrote:
[...]
> > It would be also good to find out whether this is fs specific. E.g. does
> > it make any difference if you use a different one for your stress
> > testing?
> 
> Created a ramdisk and put the stress binary there, then ran stress -m 200;
> now it seems to be stuck migrating libc-2.28.so. And it's still xfs, so xfs
> is now a big suspect. At the bottom I paste the numactl output; you can see
> that it's the last 4G.
>
> It seems to be trying to migrate libc-2.28.so, but the stress program keeps
> trying to access and activate it.

Is this still with faultaround disabled? I have seen exactly the same
pattern in the bug I am working on. It was ext4 though.

> [ 5055.461652] migrating pfn 190f4fb3e failed 
> [ 5055.461671] page:ea643d3ecf80 count:257 mapcount:251 
> mapping:888e7a6ac528 index:0x85
> [ 5055.474734] xfs_address_space_operations [xfs] 
> [ 5055.474742] name:"libc-2.28.so" 
> [ 5055.481070] flags: 0x1dfc026(referenced|uptodate|active)
> [ 5055.490329] raw: 01dfc026 c90e3d80 c90e3d80 
> 888e7a6ac528
> [ 5055.498080] raw: 0085  00fc00f9 
> 88810a8f2000
> [ 5055.505823] page->mem_cgroup:88810a8f2000
> [ 5056.335970] migrating pfn 190f4fb3e failed 
> [ 5056.335990] page:ea643d3ecf80 count:255 mapcount:250 
> mapping:888e7a6ac528 index:0x85
> [ 5056.348994] xfs_address_space_operations [xfs] 
> [ 5056.348998] name:"libc-2.28.so" 
> [ 5056.353555] flags: 0x1dfc026(referenced|uptodate|active)
> [ 5056.364680] raw: 01dfc026 c90e3d80 c90e3d80 
> 888e7a6ac528
> [ 5056.372428] raw: 0085  00fc00f9 
> 88810a8f2000
> [ 5056.380172] page->mem_cgroup:88810a8f2000
> [ 5057.332806] migrating pfn 190f4fb3e failed 
> [ 5057.332821] page:ea643d3ecf80 count:261 mapcount:250 
> mapping:888e7a6ac528 index:0x85
> [ 5057.345889] xfs_address_space_operations [xfs] 
> [ 5057.345900] name:"libc-2.28.so" 
> [ 5057.350451] flags: 0x1dfc026(referenced|uptodate|active)
> [ 5057.359707] raw: 01dfc026 c90e3d80 c90e3d80 
> 888e7a6ac528
> [ 5057.369285] raw: 0085  00fc00f9 
> 88810a8f2000
> [ 5057.377030] page->mem_cgroup:88810a8f2000
> [ 5058.285457] migrating pfn 190f4fb3e failed 
> [ 5058.285489] page:ea643d3ecf80 count:257 mapcount:250 
> mapping:888e7a6ac528 index:0x85
> [ 5058.298544] xfs_address_space_operations [xfs] 
> [ 5058.298556] name:"libc-2.28.so" 
> [ 5058.303092] flags: 0x1dfc026(referenced|uptodate|active)
> [ 5058.314358] raw: 01dfc026 c90e3d80 c90e3d80 
> 888e7a6ac528
> [ 5058.322109] raw: 0085  00fc00f9 
> 88810a8f2000
> [ 5058.329848] page->mem_cgroup:88810a8f2000
-- 
Michal Hocko
SUSE Labs


Re: Memory hotplug softlock issue

2018-11-15 Thread Baoquan He
On 11/15/18 at 09:30am, Michal Hocko wrote:
> On Thu 15-11-18 15:53:56, Baoquan He wrote:
> > On 11/15/18 at 08:30am, Michal Hocko wrote:
> > > On Thu 15-11-18 13:10:34, Baoquan He wrote:
> > > > On 11/14/18 at 04:00pm, Michal Hocko wrote:
> > > > > On Wed 14-11-18 22:52:50, Baoquan He wrote:
> > > > > > On 11/14/18 at 10:01am, Michal Hocko wrote:
> > > > > > > I have seen an issue where the migration cannot make forward progress
> > > > > > > because of a glibc page with a reference count bumping up and down. The
> > > > > > > most probable explanation is the faultaround code. I am working on this
> > > > > > > and will post a patch soon. In any case the migration should converge,
> > > > > > > and if it doesn't then there is a bug lurking somewhere.
> > > > > > >
> > > > > > > Failing on ENOMEM is a questionable thing. I haven't seen that happening
> > > > > > > wildly, but if it is the case then I wouldn't be opposed.
> > > > > > 
> > > > > > Applied your debugging patches; they help a lot by printing messages.
> > > > > >
> > > > > > Below is the dmesg log about the migration failure. It can't get past
> > > > > > migrate_pages() and loops forever.
> > > > > > 
> > > > > > [  +0.083841] migrating pfn 10fff7d0 failed 
> > > > > > [  +0.05] page:ea043ffdf400 count:208 mapcount:201 
> > > > > > mapping:888dff4bdda8 index:0x2
> > > > > > [  +0.012689] xfs_address_space_operations [xfs] 
> > > > > > [  +0.30] name:"stress" 
> > > > > > [  +0.004556] flags: 0x5fc004(uptodate)
> > > > > > [  +0.007339] raw: 005fc004 c90e3d80 
> > > > > > c90e3d80 888dff4bdda8
> > > > > > [  +0.009488] raw: 0002  
> > > > > > 00cb00c8 888e7353d000
> > > > > > [  +0.007726] page->mem_cgroup:888e7353d000
> > > > > > [  +0.084538] migrating pfn 10fff7d0 failed 
> > > > > > [  +0.06] page:ea043ffdf400 count:210 mapcount:201 
> > > > > > mapping:888dff4bdda8 index:0x2
> > > > > > [  +0.012798] xfs_address_space_operations [xfs] 
> > > > > > [  +0.34] name:"stress" 
> > > > > > [  +0.004524] flags: 0x5fc004(uptodate)
> > > > > > [  +0.007068] raw: 005fc004 c90e3d80 
> > > > > > c90e3d80 888dff4bdda8
> > > > > > [  +0.009359] raw: 0002  
> > > > > > 00cb00c8 888e7353d000
> > > > > > [  +0.007728] page->mem_cgroup:888e7353d000
> > > > > 
> > > > > I wouldn't be surprised if this was a similar/same issue I've been
> > > > > chasing recently. Could you try to disable faultaround to see if that
> > > > > helps. It seems that it helped in my particular case but I am still
> > > > > waiting for the final good-to-go to post the patch as I do not own the
> > > > > workload which triggered that issue.
> > > > 
> > > > Tried; it still gets stuck in the last block sometimes, usually after
> > > > several rounds of hotplug/unplug. If I stop the stress program, the last
> > > > block is offlined immediately.
> > > 
> > > Is the pattern still the same? I mean failing over a few pages with the
> > > reference count jumping up and down between attempts?
> > 
> > ->count is jumping up and down; mapcount stays at the same value.
> > 
> > > 
> > > > [root@ ~]# cat /sys/kernel/debug/fault_around_bytes 
> > > > 4096
> > > 
> > > Can you make it 0?
> > 
> > I executed 'echo 0 > fault_around_bytes'; a value less than one page size
> > will be rounded up to one page.
> 
> OK, I have missed that. So then there must be a different source of the
> page count volatility. Is it always the same file?
> 
> I think we can rule out memory reclaim because that depends on the page
> lock. Is the stress test hitting memory compaction? In other words, are
> grep compact /proc/vmstat
> counters changing heavily during the offline test? I am asking because I
> do not see compaction pfn walkers skipping over MIGRATE_ISOLATE
> pageblocks. But I might easily be missing something.
> 
> It would be also good to find out whether this is fs specific. E.g. does
> it make any difference if you use a different one for your stress
> testing?

Created a ramdisk and put the stress binary there, then ran stress -m 200;
now it seems to be stuck migrating libc-2.28.so. And it's still xfs, so xfs
is now a big suspect. At the bottom I paste the numactl output; you can see
that it's the last 4G.

It seems to be trying to migrate libc-2.28.so, but the stress program keeps
trying to access and activate it.

[ 5055.461652] migrating pfn 190f4fb3e failed 
[ 5055.461671] page:ea643d3ecf80 count:257 mapcount:251 
mapping:888e7a6ac528 index:0x85
[ 5055.474734] xfs_address_space_operations [xfs] 
[ 5055.474742] name:"libc-2.28.so" 
[ 5055.481070] flags: 0x1dfc026(referenced|uptodate|active)
[ 5055.490329] raw: 01dfc026 c90e3d80 c90e3d80 
888e7a6ac528
[ 5055.498080] raw: 0085  00fc00f9 
88810a8f2000
[ 5055.505823] page->mem_cgroup:88810a8f2000

Re: Memory hotplug softlock issue

2018-11-15 Thread David Hildenbrand
On 15.11.18 10:52, Baoquan He wrote:
> On 11/15/18 at 10:42am, David Hildenbrand wrote:
>> I am wondering why it is always the last memory block of that device
>> (and even that node). Coincidence?
> 
> I remember one or two times it was the last 6G or 4G which stalled there;
> the size of a memory block is 2G. But most of the time it's the last memory
> block. And from the debug printing added by Michal's patch, it's the
> stress program itself which owns the migrating page and loops forever
> there.
> 

Alright, seems like you found a very good reproducer for an existing
problem :)

-- 

Thanks,

David / dhildenb


Re: Memory hotplug softlock issue

2018-11-15 Thread Baoquan He
On 11/15/18 at 10:42am, David Hildenbrand wrote:
> I am wondering why it is always the last memory block of that device
> (and even that node). Coincidence?

I remember one or two times it was the last 6G or 4G which stalled there;
the size of a memory block is 2G. But most of the time it's the last memory
block. And from the debug printing added by Michal's patch, it's the
stress program itself which owns the migrating page and loops forever
there.


Re: Memory hotplug softlock issue

2018-11-15 Thread David Hildenbrand
On 15.11.18 09:30, Michal Hocko wrote:
> On Thu 15-11-18 15:53:56, Baoquan He wrote:
>> On 11/15/18 at 08:30am, Michal Hocko wrote:
>>> On Thu 15-11-18 13:10:34, Baoquan He wrote:
 On 11/14/18 at 04:00pm, Michal Hocko wrote:
> On Wed 14-11-18 22:52:50, Baoquan He wrote:
>> On 11/14/18 at 10:01am, Michal Hocko wrote:
>>> I have seen an issue where the migration cannot make forward progress
>>> because of a glibc page with a reference count bumping up and down. The
>>> most probable explanation is the faultaround code. I am working on this
>>> and will post a patch soon. In any case the migration should converge,
>>> and if it doesn't then there is a bug lurking somewhere.
>>>
>>> Failing on ENOMEM is a questionable thing. I haven't seen that happening
>>> wildly, but if it is the case then I wouldn't be opposed.
>>
>> Applied your debugging patches; they help a lot by printing messages.
>>
>> Below is the dmesg log about the migration failure. It can't get past
>> migrate_pages() and loops forever.
>>
>> [  +0.083841] migrating pfn 10fff7d0 failed 
>> [  +0.05] page:ea043ffdf400 count:208 mapcount:201 
>> mapping:888dff4bdda8 index:0x2
>> [  +0.012689] xfs_address_space_operations [xfs] 
>> [  +0.30] name:"stress" 
>> [  +0.004556] flags: 0x5fc004(uptodate)
>> [  +0.007339] raw: 005fc004 c90e3d80 c90e3d80 
>> 888dff4bdda8
>> [  +0.009488] raw: 0002  00cb00c8 
>> 888e7353d000
>> [  +0.007726] page->mem_cgroup:888e7353d000
>> [  +0.084538] migrating pfn 10fff7d0 failed 
>> [  +0.06] page:ea043ffdf400 count:210 mapcount:201 
>> mapping:888dff4bdda8 index:0x2
>> [  +0.012798] xfs_address_space_operations [xfs] 
>> [  +0.34] name:"stress" 
>> [  +0.004524] flags: 0x5fc004(uptodate)
>> [  +0.007068] raw: 005fc004 c90e3d80 c90e3d80 
>> 888dff4bdda8
>> [  +0.009359] raw: 0002  00cb00c8 
>> 888e7353d000
>> [  +0.007728] page->mem_cgroup:888e7353d000
>
> I wouldn't be surprised if this was a similar/same issue I've been
> chasing recently. Could you try to disable faultaround to see if that
> helps. It seems that it helped in my particular case but I am still
> waiting for the final good-to-go to post the patch as I do not own the
> workload which triggered that issue.

 Tried; it still gets stuck in the last block sometimes, usually after
 several rounds of hotplug/unplug. If I stop the stress program, the last
 block is offlined immediately.
>>>
>>> Is the pattern still the same? I mean failing over a few pages with the
>>> reference count jumping up and down between attempts?
>>
>> ->count is jumping up and down; mapcount stays at the same value.
>>
>>>
 [root@ ~]# cat /sys/kernel/debug/fault_around_bytes 
 4096
>>>
>>> Can you make it 0?
>>
>> I executed 'echo 0 > fault_around_bytes'; a value less than one page size
>> will be rounded up to one page.
> 
> OK, I have missed that. So then there must be a different source of the
> page count volatility. Is it always the same file?
> 
> I think we can rule out memory reclaim because that depends on the page
> lock. Is the stress test hitting memory compaction? In other words, are
> grep compact /proc/vmstat
> counters changing heavily during the offline test? I am asking because I
> do not see compaction pfn walkers skipping over MIGRATE_ISOLATE
> pageblocks. But I might easily be missing something.
> 
> It would be also good to find out whether this is fs specific. E.g. does
> it make any difference if you use a different one for your stress
> testing?
> 

I am wondering why it is always the last memory block of that device
(and even that node). Coincidence?

-- 

Thanks,

David / dhildenb


Re: Memory hotplug softlock issue

2018-11-15 Thread Michal Hocko
On Thu 15-11-18 15:53:56, Baoquan He wrote:
> On 11/15/18 at 08:30am, Michal Hocko wrote:
> > On Thu 15-11-18 13:10:34, Baoquan He wrote:
> > > On 11/14/18 at 04:00pm, Michal Hocko wrote:
> > > > On Wed 14-11-18 22:52:50, Baoquan He wrote:
> > > > > On 11/14/18 at 10:01am, Michal Hocko wrote:
> > > > > > I have seen an issue where the migration cannot make forward progress
> > > > > > because of a glibc page with a reference count bumping up and down. The
> > > > > > most probable explanation is the faultaround code. I am working on this
> > > > > > and will post a patch soon. In any case the migration should converge,
> > > > > > and if it doesn't then there is a bug lurking somewhere.
> > > > > >
> > > > > > Failing on ENOMEM is a questionable thing. I haven't seen that happening
> > > > > > wildly, but if it is the case then I wouldn't be opposed.
> > > > > 
> > > > > Applied your debugging patches; they help a lot by printing messages.
> > > > >
> > > > > Below is the dmesg log about the migration failure. It can't get past
> > > > > migrate_pages() and loops forever.
> > > > > 
> > > > > [  +0.083841] migrating pfn 10fff7d0 failed 
> > > > > [  +0.05] page:ea043ffdf400 count:208 mapcount:201 
> > > > > mapping:888dff4bdda8 index:0x2
> > > > > [  +0.012689] xfs_address_space_operations [xfs] 
> > > > > [  +0.30] name:"stress" 
> > > > > [  +0.004556] flags: 0x5fc004(uptodate)
> > > > > [  +0.007339] raw: 005fc004 c90e3d80 c90e3d80 
> > > > > 888dff4bdda8
> > > > > [  +0.009488] raw: 0002  00cb00c8 
> > > > > 888e7353d000
> > > > > [  +0.007726] page->mem_cgroup:888e7353d000
> > > > > [  +0.084538] migrating pfn 10fff7d0 failed 
> > > > > [  +0.06] page:ea043ffdf400 count:210 mapcount:201 
> > > > > mapping:888dff4bdda8 index:0x2
> > > > > [  +0.012798] xfs_address_space_operations [xfs] 
> > > > > [  +0.34] name:"stress" 
> > > > > [  +0.004524] flags: 0x5fc004(uptodate)
> > > > > [  +0.007068] raw: 005fc004 c90e3d80 c90e3d80 
> > > > > 888dff4bdda8
> > > > > [  +0.009359] raw: 0002  00cb00c8 
> > > > > 888e7353d000
> > > > > [  +0.007728] page->mem_cgroup:888e7353d000
> > > > 
> > > > I wouldn't be surprised if this was a similar/same issue I've been
> > > > chasing recently. Could you try to disable faultaround to see if that
> > > > helps. It seems that it helped in my particular case but I am still
> > > > waiting for the final good-to-go to post the patch as I do not own the
> > > > workload which triggered that issue.
> > > 
> > > Tried; it still gets stuck in the last block sometimes, usually after
> > > several rounds of hotplug/unplug. If I stop the stress program, the last
> > > block is offlined immediately.
> > 
> > Is the pattern still the same? I mean failing over a few pages with the
> > reference count jumping up and down between attempts?
> 
> ->count is jumping up and down; mapcount stays at the same value.
> 
> > 
> > > [root@ ~]# cat /sys/kernel/debug/fault_around_bytes 
> > > 4096
> > 
> > Can you make it 0?
> 
> I executed 'echo 0 > fault_around_bytes'; a value less than one page size
> will be rounded up to one page.

OK, I have missed that. So then there must be a different source of the
page count volatility. Is it always the same file?

I think we can rule out memory reclaim because that depends on the page
lock. Is the stress test hitting memory compaction? In other words, are
grep compact /proc/vmstat
counters changing heavily during the offline test? I am asking because I
do not see compaction pfn walkers skipping over MIGRATE_ISOLATE
pageblocks. But I might easily be missing something.

It would be also good to find out whether this is fs specific. E.g. does
it make any difference if you use a different one for your stress
testing?
-- 
Michal Hocko
SUSE Labs


Re: Memory hotplug softlock issue

2018-11-14 Thread Baoquan He
On 11/15/18 at 08:30am, Michal Hocko wrote:
> On Thu 15-11-18 13:10:34, Baoquan He wrote:
> > On 11/14/18 at 04:00pm, Michal Hocko wrote:
> > > On Wed 14-11-18 22:52:50, Baoquan He wrote:
> > > > On 11/14/18 at 10:01am, Michal Hocko wrote:
> > > > > I have seen an issue where the migration cannot make forward progress
> > > > > because of a glibc page with a reference count bumping up and down. The
> > > > > most probable explanation is the faultaround code. I am working on this
> > > > > and will post a patch soon. In any case the migration should converge,
> > > > > and if it doesn't then there is a bug lurking somewhere.
> > > > >
> > > > > Failing on ENOMEM is a questionable thing. I haven't seen that happening
> > > > > wildly, but if it is the case then I wouldn't be opposed.
> > > > 
> > > > Applied your debugging patches; they help a lot by printing messages.
> > > >
> > > > Below is the dmesg log about the migration failure. It can't get past
> > > > migrate_pages() and loops forever.
> > > > 
> > > > [  +0.083841] migrating pfn 10fff7d0 failed 
> > > > [  +0.05] page:ea043ffdf400 count:208 mapcount:201 
> > > > mapping:888dff4bdda8 index:0x2
> > > > [  +0.012689] xfs_address_space_operations [xfs] 
> > > > [  +0.30] name:"stress" 
> > > > [  +0.004556] flags: 0x5fc004(uptodate)
> > > > [  +0.007339] raw: 005fc004 c90e3d80 c90e3d80 
> > > > 888dff4bdda8
> > > > [  +0.009488] raw: 0002  00cb00c8 
> > > > 888e7353d000
> > > > [  +0.007726] page->mem_cgroup:888e7353d000
> > > > [  +0.084538] migrating pfn 10fff7d0 failed 
> > > > [  +0.06] page:ea043ffdf400 count:210 mapcount:201 
> > > > mapping:888dff4bdda8 index:0x2
> > > > [  +0.012798] xfs_address_space_operations [xfs] 
> > > > [  +0.34] name:"stress" 
> > > > [  +0.004524] flags: 0x5fc004(uptodate)
> > > > [  +0.007068] raw: 005fc004 c90e3d80 c90e3d80 
> > > > 888dff4bdda8
> > > > [  +0.009359] raw: 0002  00cb00c8 
> > > > 888e7353d000
> > > > [  +0.007728] page->mem_cgroup:888e7353d000
> > > 
> > > I wouldn't be surprised if this was a similar/same issue I've been
> > > chasing recently. Could you try to disable faultaround to see if that
> > > helps. It seems that it helped in my particular case but I am still
> > > waiting for the final good-to-go to post the patch as I do not own the
> > > workload which triggered that issue.
> > 
> > Tried; it still gets stuck in the last block sometimes, usually after
> > several rounds of hotplug/unplug. If I stop the stress program, the last
> > block is offlined immediately.
> 
> Is the pattern still the same? I mean failing over a few pages with the
> reference count jumping up and down between attempts?

->count is jumping up and down; mapcount stays at the same value.

> 
> > [root@ ~]# cat /sys/kernel/debug/fault_around_bytes 
> > 4096
> 
> Can you make it 0?

I executed 'echo 0 > fault_around_bytes'; a value less than one page size
will be rounded up to one page.

/*
 * fault_around_bytes must be rounded down to the nearest page order as it's
 * what do_fault_around() expects to see.
 */
static int fault_around_bytes_set(void *data, u64 val)
{
	if (val / PAGE_SIZE > PTRS_PER_PTE)
		return -EINVAL;
	if (val > PAGE_SIZE)
		fault_around_bytes = rounddown_pow_of_two(val);
	else
		fault_around_bytes = PAGE_SIZE; /* rounddown_pow_of_two(0) is undefined */
	return 0;
}
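
To make the rounding concrete, here is a small userspace model of the
function above (illustrative only; PAGE_SIZE and PTRS_PER_PTE are hardcoded
for a typical x86-64 configuration):

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE	4096UL
#define PTRS_PER_PTE	512UL

/* userspace stand-in for the kernel's rounddown_pow_of_two() */
static unsigned long rounddown_pow_of_two(unsigned long v)
{
	unsigned long r = 1;

	while (r <= v / 2)
		r <<= 1;
	return r;
}

static int fault_around_bytes_set_model(uint64_t val, unsigned long *out)
{
	if (val / PAGE_SIZE > PTRS_PER_PTE)
		return -1;			/* -EINVAL in the kernel */
	if (val > PAGE_SIZE)
		*out = rounddown_pow_of_two(val);
	else
		*out = PAGE_SIZE;		/* writing 0 lands here */
	return 0;
}

int main(void)
{
	unsigned long v;

	fault_around_bytes_set_model(0, &v);
	printf("echo 0     -> %lu\n", v);	/* 4096, as observed above */
	fault_around_bytes_set_model(65536, &v);
	printf("echo 65536 -> %lu\n", v);	/* 65536, already a power of two */
	return 0;
}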



Re: Memory hotplug softlock issue

2018-11-14 Thread Michal Hocko
On Thu 15-11-18 13:10:34, Baoquan He wrote:
> On 11/14/18 at 04:00pm, Michal Hocko wrote:
> > On Wed 14-11-18 22:52:50, Baoquan He wrote:
> > > On 11/14/18 at 10:01am, Michal Hocko wrote:
> > > > I have seen an issue where the migration cannot make forward progress
> > > > because of a glibc page with a reference count bumping up and down. The
> > > > most probable explanation is the faultaround code. I am working on this
> > > > and will post a patch soon. In any case the migration should converge,
> > > > and if it doesn't then there is a bug lurking somewhere.
> > > >
> > > > Failing on ENOMEM is a questionable thing. I haven't seen that happening
> > > > wildly, but if it is the case then I wouldn't be opposed.
> > > 
> > > Applied your debugging patches; they help a lot by printing messages.
> > >
> > > Below is the dmesg log about the migration failure. It can't get past
> > > migrate_pages() and loops forever.
> > > 
> > > [  +0.083841] migrating pfn 10fff7d0 failed 
> > > [  +0.05] page:ea043ffdf400 count:208 mapcount:201 
> > > mapping:888dff4bdda8 index:0x2
> > > [  +0.012689] xfs_address_space_operations [xfs] 
> > > [  +0.30] name:"stress" 
> > > [  +0.004556] flags: 0x5fc004(uptodate)
> > > [  +0.007339] raw: 005fc004 c90e3d80 c90e3d80 
> > > 888dff4bdda8
> > > [  +0.009488] raw: 0002  00cb00c8 
> > > 888e7353d000
> > > [  +0.007726] page->mem_cgroup:888e7353d000
> > > [  +0.084538] migrating pfn 10fff7d0 failed 
> > > [  +0.06] page:ea043ffdf400 count:210 mapcount:201 
> > > mapping:888dff4bdda8 index:0x2
> > > [  +0.012798] xfs_address_space_operations [xfs] 
> > > [  +0.34] name:"stress" 
> > > [  +0.004524] flags: 0x5fc004(uptodate)
> > > [  +0.007068] raw: 005fc004 c90e3d80 c90e3d80 
> > > 888dff4bdda8
> > > [  +0.009359] raw: 0002  00cb00c8 
> > > 888e7353d000
> > > [  +0.007728] page->mem_cgroup:888e7353d000
> > 
> > I wouldn't be surprised if this was a similar/same issue I've been
> > chasing recently. Could you try to disable faultaround to see if that
> > helps. It seems that it helped in my particular case but I am still
> > waiting for the final good-to-go to post the patch as I do not own the
> > workload which triggered that issue.
> 
> Tried; it still gets stuck in the last block sometimes, usually after
> several rounds of hotplug/unplug. If I stop the stress program, the last
> block is offlined immediately.

Is the pattern still the same? I mean failing over a few pages with the
reference count jumping up and down between attempts?

> [root@ ~]# cat /sys/kernel/debug/fault_around_bytes 
> 4096

Can you make it 0?

-- 
Michal Hocko
SUSE Labs


Re: Memory hotplug softlock issue

2018-11-14 Thread Baoquan He
On 11/14/18 at 04:00pm, Michal Hocko wrote:
> On Wed 14-11-18 22:52:50, Baoquan He wrote:
> > On 11/14/18 at 10:01am, Michal Hocko wrote:
> > > I have seen an issue where the migration cannot make forward progress
> > > because of a glibc page with a reference count bumping up and down. The
> > > most probable explanation is the faultaround code. I am working on this
> > > and will post a patch soon. In any case the migration should converge,
> > > and if it doesn't then there is a bug lurking somewhere.
> > >
> > > Failing on ENOMEM is a questionable thing. I haven't seen that happening
> > > wildly, but if it is the case then I wouldn't be opposed.
> > 
> > Applied your debugging patches; they help a lot by printing messages.
> >
> > Below is the dmesg log about the migration failure. It can't get past
> > migrate_pages() and loops forever.
> > 
> > [  +0.083841] migrating pfn 10fff7d0 failed 
> > [  +0.05] page:ea043ffdf400 count:208 mapcount:201 
> > mapping:888dff4bdda8 index:0x2
> > [  +0.012689] xfs_address_space_operations [xfs] 
> > [  +0.30] name:"stress" 
> > [  +0.004556] flags: 0x5fc004(uptodate)
> > [  +0.007339] raw: 005fc004 c90e3d80 c90e3d80 
> > 888dff4bdda8
> > [  +0.009488] raw: 0002  00cb00c8 
> > 888e7353d000
> > [  +0.007726] page->mem_cgroup:888e7353d000
> > [  +0.084538] migrating pfn 10fff7d0 failed 
> > [  +0.06] page:ea043ffdf400 count:210 mapcount:201 
> > mapping:888dff4bdda8 index:0x2
> > [  +0.012798] xfs_address_space_operations [xfs] 
> > [  +0.34] name:"stress" 
> > [  +0.004524] flags: 0x5fc004(uptodate)
> > [  +0.007068] raw: 005fc004 c90e3d80 c90e3d80 
> > 888dff4bdda8
> > [  +0.009359] raw: 0002  00cb00c8 
> > 888e7353d000
> > [  +0.007728] page->mem_cgroup:888e7353d000
> 
> I wouldn't be surprised if this was a similar/same issue I've been
> chasing recently. Could you try to disable faultaround to see if that
> helps. It seems that it helped in my particular case but I am still
> waiting for the final good-to-go to post the patch as I do not own the
> workload which triggered that issue.

Tried; it still gets stuck in the last block sometimes, usually after
several rounds of hotplug/unplug. If I stop the stress program, the last
block is offlined immediately.

[root@ ~]# cat /sys/kernel/debug/fault_around_bytes 
4096


Re: Memory hotplug softlock issue

2018-11-14 Thread Michal Hocko
On Wed 14-11-18 22:52:50, Baoquan He wrote:
> On 11/14/18 at 10:01am, Michal Hocko wrote:
> > I have seen an issue where the migration cannot make forward progress
> > because of a glibc page with a reference count bumping up and down. The
> > most probable explanation is the faultaround code. I am working on this
> > and will post a patch soon. In any case the migration should converge,
> > and if it doesn't then there is a bug lurking somewhere.
> >
> > Failing on ENOMEM is a questionable thing. I haven't seen that happening
> > wildly, but if it is the case then I wouldn't be opposed.
> 
> Applied your debugging patches; they help a lot by printing messages.
>
> Below is the dmesg log about the migration failure. It can't get past
> migrate_pages() and loops forever.
> 
> [  +0.083841] migrating pfn 10fff7d0 failed 
> [  +0.05] page:ea043ffdf400 count:208 mapcount:201 
> mapping:888dff4bdda8 index:0x2
> [  +0.012689] xfs_address_space_operations [xfs] 
> [  +0.30] name:"stress" 
> [  +0.004556] flags: 0x5fc004(uptodate)
> [  +0.007339] raw: 005fc004 c90e3d80 c90e3d80 
> 888dff4bdda8
> [  +0.009488] raw: 0002  00cb00c8 
> 888e7353d000
> [  +0.007726] page->mem_cgroup:888e7353d000
> [  +0.084538] migrating pfn 10fff7d0 failed 
> [  +0.06] page:ea043ffdf400 count:210 mapcount:201 
> mapping:888dff4bdda8 index:0x2
> [  +0.012798] xfs_address_space_operations [xfs] 
> [  +0.34] name:"stress" 
> [  +0.004524] flags: 0x5fc004(uptodate)
> [  +0.007068] raw: 005fc004 c90e3d80 c90e3d80 
> 888dff4bdda8
> [  +0.009359] raw: 0002  00cb00c8 
> 888e7353d000
> [  +0.007728] page->mem_cgroup:888e7353d000

I wouldn't be surprised if this was a similar/same issue I've been
chasing recently. Could you try to disable faultaround to see if that
helps. It seems that it helped in my particular case but I am still
waiting for the final good-to-go to post the patch as I do not own the
workload which triggered that issue.
-- 
Michal Hocko
SUSE Labs


Re: Memory hotplug softlock issue

2018-11-14 Thread Baoquan He
On 11/14/18 at 10:01am, Michal Hocko wrote:
> I have seen an issue where the migration cannot make forward progress
> because of a glibc page with a reference count bumping up and down. The
> most probable explanation is the faultaround code. I am working on this
> and will post a patch soon. In any case the migration should converge,
> and if it doesn't then there is a bug lurking somewhere.
>
> Failing on ENOMEM is a questionable thing. I haven't seen that happening
> wildly, but if it is the case then I wouldn't be opposed.

Applied your debugging patches; they help a lot by printing messages.

Below is the dmesg log about the migration failure. It can't get past
migrate_pages() and loops forever.

[  +0.083841] migrating pfn 10fff7d0 failed 
[  +0.05] page:ea043ffdf400 count:208 mapcount:201 
mapping:888dff4bdda8 index:0x2
[  +0.012689] xfs_address_space_operations [xfs] 
[  +0.30] name:"stress" 
[  +0.004556] flags: 0x5fc004(uptodate)
[  +0.007339] raw: 005fc004 c90e3d80 c90e3d80 
888dff4bdda8
[  +0.009488] raw: 0002  00cb00c8 
888e7353d000
[  +0.007726] page->mem_cgroup:888e7353d000
[  +0.084538] migrating pfn 10fff7d0 failed 
[  +0.06] page:ea043ffdf400 count:210 mapcount:201 
mapping:888dff4bdda8 index:0x2
[  +0.012798] xfs_address_space_operations [xfs] 
[  +0.34] name:"stress" 
[  +0.004524] flags: 0x5fc004(uptodate)
[  +0.007068] raw: 005fc004 c90e3d80 c90e3d80 
888dff4bdda8
[  +0.009359] raw: 0002  00cb00c8 
888e7353d000
[  +0.007728] page->mem_cgroup:888e7353d000

This is numactl -H; the last memory block of node 1 can't be migrated.

[root@~]# numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 144 145 146 147 148 
149 150 151 152 153 154 155 156 157 158 159 160 161
node 0 size: 58793 MB
node 0 free: 50374 MB
node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 162 163 164 
165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
node 1 size: 2048 MB
node 1 free: 0 MB
node 2 cpus: 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 180 181 182 
183 184 185 186 187 188 189 190 191 192 193 194 195 196 197
node 2 size: 65536 MB
node 2 free: 60102 MB
node 3 cpus: 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 198 199 200 
201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
node 3 size: 65536 MB
node 3 free: 61237 MB
node 4 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 216 217 218 
219 220 221 222 223 224 225 226 227 228 229 230 231 232 233
node 4 size: 65536 MB
node 4 free: 63057 MB
node 5 cpus: 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 234 
235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251
node 5 size: 65536 MB
node 5 free: 62507 MB
node 6 cpus: 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 
124 125 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
node 6 size: 65536 MB
node 6 free: 62688 MB
node 7 cpus: 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 
142 143 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287
node 7 size: 65536 MB
node 7 free: 61775 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  21  31  21  41  41  51  51 
  1:  21  10  21  31  41  41  51  51 
  2:  31  21  10  21  51  51  41  41 
  3:  21  31  21  10  51  51  41  41 
  4:  41  41  51  51  10  21  31  21 
  5:  41  41  51  51  21  10  21  31 
  6:  51  51  41  41  31  21  10  21 
  7:  51  51  41  41  21  31  21  10

> 
> > You mentioned memory pressure: if our host is under memory pressure we
> > can easily run into an endless loop there, because we basically ignore
> > -ENOMEM, e.g. when we cannot get a page to migrate some memory to be
> > offlined. I assume this is the case here. do_migrate_range() could be
> > the bad boy if it keeps failing forever and we keep retrying.
> 
> My hotplug debugging patches [1] should help to tell us.
> 
> [1] http://lkml.kernel.org/r/20181107101830.17405-1-mho...@kernel.org
> -- 
> Michal Hocko
> SUSE Labs


Re: Memory hotplug softlock issue

2018-11-14 Thread Michal Hocko
On Wed 14-11-18 10:48:09, David Hildenbrand wrote:
> On 14.11.18 10:41, Michal Hocko wrote:
> > On Wed 14-11-18 10:25:57, David Hildenbrand wrote:
> >> On 14.11.18 10:00, Baoquan He wrote:
> >>> Hi David,
> >>>
> >>> On 11/14/18 at 09:18am, David Hildenbrand wrote:
>  Code seems to be waiting for the mem_hotplug_lock in read.
>  We hold mem_hotplug_lock in write whenever we online/offline/add/remove
>  memory. There are two ways to trigger offlining of memory:
> 
>  1. Offlining via "echo offline > /sys/devices/system/memory/memory0/state"
> 
>  This always properly took the mem_hotplug_lock. Nothing changed.
> 
>  2. Offlining via "echo 0 > /sys/devices/system/memory/memory0/online"
> 
>  This didn't take the mem_hotplug_lock and I fixed that for this release.
> 
>  So if you were testing with 1., you should have seen the same error
>  before this release (unless there is something else now broken in this
>  release).
> >>>
> >>> Thanks a lot for looking into this.
> >>>
> > > I triggered sysrq+t to check the threads' state. You can see that we use
> > > firmware to trigger an ACPI event that goes to acpi_bus_offline(); it truly
> > > didn't take mem_hotplug_lock(), and it does take it with your fix in
> > > commit 381eab4a6ee ("mm/memory_hotplug: fix online/offline_pages called
> > > w.o. mem_hotplug_lock").
> >>>
> >>> [  +0.007062] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> >>> [  +0.005398] Call Trace:
> >>> [  +0.002476]  ? page_vma_mapped_walk+0x307/0x710
> >>> [  +0.004538]  ? page_remove_rmap+0xa2/0x340
> >>> [  +0.004104]  ? ptep_clear_flush+0x54/0x60
> >>> [  +0.004027]  ? enqueue_entity+0x11c/0x620
> >>> [  +0.005904]  ? schedule+0x28/0x80
> >>> [  +0.003336]  ? rmap_walk_file+0xf9/0x270
> >>> [  +0.003940]  ? try_to_unmap+0x9c/0xf0
> >>> [  +0.003695]  ? migrate_pages+0x2b0/0xb90
> >>> [  +0.003959]  ? try_offline_node+0x160/0x160
> >>> [  +0.004214]  ? __offline_pages+0x6ce/0x8e0
> >>> [  +0.004134]  ? memory_subsys_offline+0x40/0x60
> >>> [  +0.004474]  ? device_offline+0x81/0xb0
> >>> [  +0.003867]  ? acpi_bus_offline+0xdb/0x140
> >>> [  +0.004117]  ? acpi_device_hotplug+0x21c/0x460
> >>> [  +0.004458]  ? acpi_hotplug_work_fn+0x1a/0x30
> >>> [  +0.004372]  ? process_one_work+0x1a1/0x3a0
> >>> [  +0.004195]  ? worker_thread+0x30/0x380
> >>> [  +0.003851]  ? drain_workqueue+0x120/0x120
> >>> [  +0.004117]  ? kthread+0x112/0x130
> >>> [  +0.003411]  ? kthread_park+0x80/0x80
> >>> [  +0.005325]  ? ret_from_fork+0x35/0x40
> >>>
> >>
> >> Yes, this is indeed another code path that was fixed (and I didn't
> >> actually realize it ;) ). Thanks for the callchain. Before my fix,
> >> hotplug would still never have succeeded (offline_pages would have
> >> silently looped forever) as far as I can tell.
> > 
> > I haven't studied your patch yet so I am not really sure why you have
> > added the lock into this path. The memory hotplug locking is certainly
> > far from great but I believe we should really rethink the scope of the
> > lock. There shouldn't be any fundamental reason to use the global lock
> > for the full offlining. So rather than moving the lock from one place to
> > another we need range locking, I believe.
> See the patches for details, the lock was removed on this path by
> mistake not by design.

OK, so I guess we should plug that hole first.

> Replacing the lock by some range lock can now be done. The tricky part
> will be get_online_mems(), we'll have to indicate a range somehow. For
> online/offline/add/remove, we have the range.

I would argue that get_online_mems() needs some rethinking. Callers
shouldn't really care that a node went offline. If they care that a
specific pfn range of the node does not go away, then range locking
should work just fine for them.
-- 
Michal Hocko
SUSE Labs
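
[Editor's illustration of the range-locking idea floated above: a purely
hypothetical C interface, invented here only to make the proposal concrete.
Nothing like this exists in the kernel at the time of this thread; the names
mem_range_lock, get_mem_range() and put_mem_range() are made up.]

/* Opaque range lock; could e.g. be backed by an interval tree of
 * currently pinned pfn ranges. */
struct mem_range_lock;

/* Reader side, replacing get_online_mems()/put_online_mems():
 * pin only [start_pfn, start_pfn + nr_pages) instead of everything. */
void get_mem_range(unsigned long start_pfn, unsigned long nr_pages);
void put_mem_range(unsigned long start_pfn, unsigned long nr_pages);

/* Offlining would take the same range exclusively, so two disjoint
 * ranges could be offlined and used concurrently -- something the
 * single global rwsem cannot express. */
int offline_mem_range(unsigned long start_pfn, unsigned long nr_pages);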


Re: Memory hotplug softlock issue

2018-11-14 Thread Michal Hocko
[Cc Vladimir]

On Wed 14-11-18 15:09:09, Baoquan He wrote:
> Hi,
> 
> Tested memory hotplug on a bare metal system; hot removing always
> triggers a lockup. Usually it takes several hot plug/unplug cycles, then
> the hot removing hangs at the last block. This is with memory pressure
> added by executing "stress -m 200".
> 
> I will attach part of the log. Any idea or suggestion is appreciated.
> 
[...]
> [  +0.007169]   Not tainted 4.20.0-rc2+ #4
> [  +0.004630] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [  +0.008001] kworker/181:1   D0  1187  2 0x8000
> [  +0.005711] Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func
> [  +0.006467] Call Trace:
> [  +0.002591]  ? __schedule+0x24e/0x880
> [  +0.004995]  schedule+0x28/0x80
> [  +0.003380]  rwsem_down_read_failed+0x103/0x190
> [  +0.006528]  call_rwsem_down_read_failed+0x14/0x30
> [  +0.004937]  __percpu_down_read+0x4f/0x80
> [  +0.004204]  get_online_mems+0x2d/0x30
> [  +0.003871]  memcg_create_kmem_cache+0x1b/0x120
> [  +0.004740]  memcg_kmem_cache_create_func+0x1b/0x60
> [  +0.004986]  process_one_work+0x1a1/0x3a0
> [  +0.004255]  worker_thread+0x30/0x380
> [  +0.003764]  ? drain_workqueue+0x120/0x120
> [  +0.004238]  kthread+0x112/0x130
> [  +0.003320]  ? kthread_park+0x80/0x80
> [  +0.003796]  ret_from_fork+0x35/0x40

For quick context: we do hold the exclusive mem hotplug lock
throughout the whole offlining and that can take quite some time.
So I am wondering whether we absolutely have to take the shared lock
in this path (introduced by 03afc0e25f7f ("slab: get_online_mems for
kmem_cache_{create,destroy,shrink}")). Is there any way to relax this
requirement? E.g. nodes stay around even when they are completely
offline. Does that help?
-- 
Michal Hocko
SUSE Labs
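
[Editor's illustration: a minimal userspace model -- not kernel code -- of
the contention in the trace above, with a plain pthread rwlock standing in
for the kernel's percpu mem_hotplug_lock. Thread names and the sleep()
standing in for the potentially endless offline retry loop are made up.]

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t mem_hotplug_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Plays __offline_pages(): write lock held across the whole attempt. */
static void *offliner(void *arg)
{
    (void)arg;
    pthread_rwlock_wrlock(&mem_hotplug_lock);   /* mem_hotplug_begin() */
    puts("offline: write lock held, retrying migration...");
    sleep(10);                                  /* the (endless) retry loop */
    pthread_rwlock_unlock(&mem_hotplug_lock);   /* mem_hotplug_done() */
    return NULL;
}

/* Plays memcg_create_kmem_cache(): blocks in get_online_mems(). */
static void *worker(void *arg)
{
    (void)arg;
    sleep(1);
    puts("worker: get_online_mems()...");
    pthread_rwlock_rdlock(&mem_hotplug_lock);   /* stuck until offline ends */
    puts("worker: got the read lock at last");
    pthread_rwlock_unlock(&mem_hotplug_lock);   /* put_online_mems() */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, offliner, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}

Built with cc -pthread, the worker's second message only appears once the
offliner's ten seconds are up -- a small-scale version of the hung kworker
in the trace above.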


Re: Memory hotplug softlock issue

2018-11-14 Thread David Hildenbrand
On 14.11.18 10:41, Michal Hocko wrote:
> On Wed 14-11-18 10:25:57, David Hildenbrand wrote:
>> On 14.11.18 10:00, Baoquan He wrote:
>>> Hi David,
>>>
>>> On 11/14/18 at 09:18am, David Hildenbrand wrote:
>>>> Code seems to be waiting for the mem_hotplug_lock in read.
>>>> We hold mem_hotplug_lock in write whenever we online/offline/add/remove
>>>> memory. There are two ways to trigger offlining of memory:
>>>>
>>>> 1. Offlining via "echo offline > /sys/devices/system/memory/memory0/state"
>>>>
>>>> This always properly took the mem_hotplug_lock. Nothing changed.
>>>>
>>>> 2. Offlining via "echo 0 > /sys/devices/system/memory/memory0/online"
>>>>
>>>> This didn't take the mem_hotplug_lock and I fixed that for this release.
>>>>
>>>> So if you were testing with 1., you should have seen the same error
>>>> before this release (unless there is something else now broken in this
>>>> release).
>>>
>>> Thanks a lot for looking into this.
>>>
>>> I triggered sysrq+t to check the threads' state. You can see that we use
>>> firmware to trigger an ACPI event that goes to acpi_bus_offline(); it truly
>>> didn't take mem_hotplug_lock() before, and takes it now with your fix in
>>> commit 381eab4a6ee ("mm/memory_hotplug: fix online/offline_pages called
>>> w.o. mem_hotplug_lock").
>>>
>>> [  +0.007062] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
>>> [  +0.005398] Call Trace:
>>> [  +0.002476]  ? page_vma_mapped_walk+0x307/0x710
>>> [  +0.004538]  ? page_remove_rmap+0xa2/0x340
>>> [  +0.004104]  ? ptep_clear_flush+0x54/0x60
>>> [  +0.004027]  ? enqueue_entity+0x11c/0x620
>>> [  +0.005904]  ? schedule+0x28/0x80
>>> [  +0.003336]  ? rmap_walk_file+0xf9/0x270
>>> [  +0.003940]  ? try_to_unmap+0x9c/0xf0
>>> [  +0.003695]  ? migrate_pages+0x2b0/0xb90
>>> [  +0.003959]  ? try_offline_node+0x160/0x160
>>> [  +0.004214]  ? __offline_pages+0x6ce/0x8e0
>>> [  +0.004134]  ? memory_subsys_offline+0x40/0x60
>>> [  +0.004474]  ? device_offline+0x81/0xb0
>>> [  +0.003867]  ? acpi_bus_offline+0xdb/0x140
>>> [  +0.004117]  ? acpi_device_hotplug+0x21c/0x460
>>> [  +0.004458]  ? acpi_hotplug_work_fn+0x1a/0x30
>>> [  +0.004372]  ? process_one_work+0x1a1/0x3a0
>>> [  +0.004195]  ? worker_thread+0x30/0x380
>>> [  +0.003851]  ? drain_workqueue+0x120/0x120
>>> [  +0.004117]  ? kthread+0x112/0x130
>>> [  +0.003411]  ? kthread_park+0x80/0x80
>>> [  +0.005325]  ? ret_from_fork+0x35/0x40
>>>
>>
>> Yes, this is indeed another code path that was fixed (and I didn't
>> actually realize it ;) ). Thanks for the callchain. Before my fix,
>> hotplug would still never have succeeded (offline_pages would have
>> silently looped forever) as far as I can tell.
> 
> I haven't studied your patch yet so I am not really sure why you have
> added the lock into this path. The memory hotplug locking is certainly
> far from great, but I believe we should really rethink the scope of the
> lock. There shouldn't be any fundamental reason to use the global lock
> for the full offlining. So rather than moving the lock from one place to
> another, we need range locking, I believe.
See the patches for details, the lock was removed on this path by
mistake not by design.

Replacing the lock by some range lock can now be done. The tricky part
will be get_online_mems(), we'll have to indicate a range somehow. For
online/offline/add/remove, we have the range.

-- 

Thanks,

David / dhildenb


Re: Memory hotplug softlock issue

2018-11-14 Thread Michal Hocko
On Wed 14-11-18 10:25:57, David Hildenbrand wrote:
> On 14.11.18 10:00, Baoquan He wrote:
> > Hi David,
> > 
> > On 11/14/18 at 09:18am, David Hildenbrand wrote:
> >> Code seems to be waiting for the mem_hotplug_lock in read.
> >> We hold mem_hotplug_lock in write whenever we online/offline/add/remove
> >> memory. There are two ways to trigger offlining of memory:
> >>
> >> 1. Offlining via "echo offline > /sys/devices/system/memory/memory0/state"
> >>
> >> This always properly took the mem_hotplug_lock. Nothing changed.
> >>
> >> 2. Offlining via "echo 0 > /sys/devices/system/memory/memory0/online"
> >>
> >> This didn't take the mem_hotplug_lock and I fixed that for this release.
> >>
> >> So if you were testing with 1., you should have seen the same error
> >> before this release (unless there is something else now broken in this
> >> release).
> > 
> > Thanks a lot for looking into this.
> > 
> > I triggered sysrq+t to check the threads' state. You can see that we use
> > firmware to trigger an ACPI event that goes to acpi_bus_offline(); it truly
> > didn't take mem_hotplug_lock() before, and takes it now with your fix in
> > commit 381eab4a6ee ("mm/memory_hotplug: fix online/offline_pages called
> > w.o. mem_hotplug_lock").
> > 
> > [  +0.007062] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> > [  +0.005398] Call Trace:
> > [  +0.002476]  ? page_vma_mapped_walk+0x307/0x710
> > [  +0.004538]  ? page_remove_rmap+0xa2/0x340
> > [  +0.004104]  ? ptep_clear_flush+0x54/0x60
> > [  +0.004027]  ? enqueue_entity+0x11c/0x620
> > [  +0.005904]  ? schedule+0x28/0x80
> > [  +0.003336]  ? rmap_walk_file+0xf9/0x270
> > [  +0.003940]  ? try_to_unmap+0x9c/0xf0
> > [  +0.003695]  ? migrate_pages+0x2b0/0xb90
> > [  +0.003959]  ? try_offline_node+0x160/0x160
> > [  +0.004214]  ? __offline_pages+0x6ce/0x8e0
> > [  +0.004134]  ? memory_subsys_offline+0x40/0x60
> > [  +0.004474]  ? device_offline+0x81/0xb0
> > [  +0.003867]  ? acpi_bus_offline+0xdb/0x140
> > [  +0.004117]  ? acpi_device_hotplug+0x21c/0x460
> > [  +0.004458]  ? acpi_hotplug_work_fn+0x1a/0x30
> > [  +0.004372]  ? process_one_work+0x1a1/0x3a0
> > [  +0.004195]  ? worker_thread+0x30/0x380
> > [  +0.003851]  ? drain_workqueue+0x120/0x120
> > [  +0.004117]  ? kthread+0x112/0x130
> > [  +0.003411]  ? kthread_park+0x80/0x80
> > [  +0.005325]  ? ret_from_fork+0x35/0x40
> > 
> 
> Yes, this is indeed another code path that was fixed (and I didn't
> actually realize it ;) ). Thanks for the callchain. Before my fix,
> hotplug would still never have succeeded (offline_pages would have
> silently looped forever) as far as I can tell.

I haven't studied your patch yet so I am not really sure why you have
added the lock into this path. The memory hotplug locking is certainly
far from great, but I believe we should really rethink the scope of the
lock. There shouldn't be any fundamental reason to use the global lock
for the full offlining. So rather than moving the lock from one place to
another, we need range locking, I believe.

-- 
Michal Hocko
SUSE Labs


Re: Memory hotplug softlock issue

2018-11-14 Thread Michal Hocko
On Wed 14-11-18 10:22:31, David Hildenbrand wrote:
> >>
> >> The real question is, however, why offlining of the last block doesn't
> >> succeed. In __offline_pages() we basically have an endless loop (while
> >> holding the mem_hotplug_lock in write). Now I consider this piece of
> >> code very problematic (we should automatically fail after X
> >> attempts/after X seconds, we should not ignore -ENOMEM), and we've had
> >> other BUGs whereby we would run into an endless loop here (e.g. related
> >> to hugepages I guess).
> > 
> > We used to have a number of retries previously and it was too fragile. If
> > you need a timeout then you can easily do that from userspace. Just do
> > timeout $TIME echo 0 > $MEM_PATH/online
> 
> I agree that number of retries is not a good measure.
> 
> But as far as I can see this happens from the kernel via an ACPI event.
> E.g. failing to offline a block after X seconds would still make sense.
> (if something takes 120 seconds to offline 128MB/2G there is something
> very bad going on, we could set the default limit to e.g. 30 seconds),
> however ...

I disagree. This is pulling policy into the kernel and that just
generates problems. What might look like a reasonable timeout to some
workloads might be wrong for others.

> > I have seen an issue where the migration cannot make forward progress
> > because of a glibc page with a reference count bumping up and down. The
> > most probable explanation is the faultaround code. I am working on this and
> > will post a patch soon. In any case the migration should converge, and if
> > it doesn't, then there is a bug lurking somewhere.
> 
> ... I also agree that this should converge. And if we detect a serious
> issue that we can't handle/where we can't converge (e.g. -ENOMEM) we
> should abort.

As I've said, ENOMEM can be considered a hard failure. We do not trigger
the OOM killer when allocating a migration target, so we only rely on
somebody else making forward progress for us, and that is suboptimal. Yet
I haven't seen this happening in hotplug scenarios so far. Doing
hot-remove while memory is really under pressure is a bad idea in the
first place most of the time. It is quite likely that somebody else just
triggers the OOM killer and the offlining part will eventually make
forward progress.
> 
> > 
> > Failing on ENOMEM is a questionable thing. I haven't seen that happening
> > in the wild, but if that is the case then I wouldn't be opposed.
> > 
> >> You mentioned memory pressure, if our host is under memory pressure we
> >> can easily trigger running into an endless loop there, because we
> >> basically ignore -ENOMEM e.g. when we cannot get a page to migrate some
> >> memory to be offlined. I assume this is the case here.
> >> do_migrate_range() could be the bad boy if it keeps failing forever and
> >> we keep retrying.
> 
> I've seen quite a few issues while playing with virtio-mem, but didn't
> have the time to look into the details. Still on my long list of things
> to look into.

Memory hotplug is really far from being optimal and robust. This
has always been the case. Issues used to be worked around by retry limits
etc. If we ever want to make it more robust we have to bite the bullet and
actually chase all the issues that might be basically anywhere and fix
them. That is just the nature of the beast that memory hotplug is.
-- 
Michal Hocko
SUSE Labs


Re: Memory hotplug softlock issue

2018-11-14 Thread David Hildenbrand
>>> Failing on ENOMEM is a questionable thing. I haven't seen that happening
>>> in the wild, but if that is the case then I wouldn't be opposed.
>>>
>>>> You mentioned memory pressure, if our host is under memory pressure we
>>>> can easily trigger running into an endless loop there, because we
>>>> basically ignore -ENOMEM e.g. when we cannot get a page to migrate some
>>>> memory to be offlined. I assume this is the case here.
>>>> do_migrate_range() could be the bad boy if it keeps failing forever and
>>>> we keep retrying.
>>
>> I've seen quite a few issues while playing with virtio-mem, but didn't
>> have the time to look into the details. Still on my long list of things
>> to look into.
> 
> Memory hotplug is really far from being optimal and robust. This
> has always been the case. Issues used to be worked around by retry limits
> etc. If we ever want to make it more robust we have to bite the bullet and
> actually chase all the issues that might be basically anywhere and fix
> them. That is just the nature of the beast that memory hotplug is.
> 

Yes I agree, no more workarounds.

-- 

Thanks,

David / dhildenb


Re: Memory hotplug softlock issue

2018-11-14 Thread David Hildenbrand
On 14.11.18 10:00, Baoquan He wrote:
> Hi David,
> 
> On 11/14/18 at 09:18am, David Hildenbrand wrote:
>> Code seems to be waiting for the mem_hotplug_lock in read.
>> We hold mem_hotplug_lock in write whenever we online/offline/add/remove
>> memory. There are two ways to trigger offlining of memory:
>>
>> 1. Offlining via "echo offline > /sys/devices/system/memory/memory0/state"
>>
>> This always properly took the mem_hotplug_lock. Nothing changed.
>>
>> 2. Offlining via "echo 0 > /sys/devices/system/memory/memory0/online"
>>
>> This didn't take the mem_hotplug_lock and I fixed that for this release.
>>
>> So if you were testing with 1., you should have seen the same error
>> before this release (unless there is something else now broken in this
>> release).
> 
> Thanks a lot for looking into this.
> 
> I triggered sysrq+t to check the threads' state. You can see that we use
> firmware to trigger an ACPI event that goes to acpi_bus_offline(); it truly
> didn't take mem_hotplug_lock() before, and takes it now with your fix in
> commit 381eab4a6ee ("mm/memory_hotplug: fix online/offline_pages called
> w.o. mem_hotplug_lock").
> 
> [  +0.007062] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> [  +0.005398] Call Trace:
> [  +0.002476]  ? page_vma_mapped_walk+0x307/0x710
> [  +0.004538]  ? page_remove_rmap+0xa2/0x340
> [  +0.004104]  ? ptep_clear_flush+0x54/0x60
> [  +0.004027]  ? enqueue_entity+0x11c/0x620
> [  +0.005904]  ? schedule+0x28/0x80
> [  +0.003336]  ? rmap_walk_file+0xf9/0x270
> [  +0.003940]  ? try_to_unmap+0x9c/0xf0
> [  +0.003695]  ? migrate_pages+0x2b0/0xb90
> [  +0.003959]  ? try_offline_node+0x160/0x160
> [  +0.004214]  ? __offline_pages+0x6ce/0x8e0
> [  +0.004134]  ? memory_subsys_offline+0x40/0x60
> [  +0.004474]  ? device_offline+0x81/0xb0
> [  +0.003867]  ? acpi_bus_offline+0xdb/0x140
> [  +0.004117]  ? acpi_device_hotplug+0x21c/0x460
> [  +0.004458]  ? acpi_hotplug_work_fn+0x1a/0x30
> [  +0.004372]  ? process_one_work+0x1a1/0x3a0
> [  +0.004195]  ? worker_thread+0x30/0x380
> [  +0.003851]  ? drain_workqueue+0x120/0x120
> [  +0.004117]  ? kthread+0x112/0x130
> [  +0.003411]  ? kthread_park+0x80/0x80
> [  +0.005325]  ? ret_from_fork+0x35/0x40
> 

Yes, this is indeed another code path that was fixed (and I didn't
actually realize it ;) ). Thanks for the callchain. Before my fix,
hotplug would still never have succeeded (offline_pages would have
silently looped forever) as far as I can tell.

> 
>>
>>
>> The real question is, however, why offlining of the last block doesn't
>> succeed. In __offline_pages() we basically have an endless loop (while
>> holding the mem_hotplug_lock in write). Now I consider this piece of
>> code very problematic (we should automatically fail after X
>> attempts/after X seconds, we should not ignore -ENOMEM), and we've had
>> other BUGs whereby we would run into an endless loop here (e.g. related
>> to hugepages I guess).
> 
> Hmm, even though memory hotplug stalled there, there is still plenty of
> memory. E.g. this system has 8 nodes and each node has 64 GB of memory,
> 512 GB in all. Now I run "stress -m 200" to trigger 200 processes that
> malloc and then free 256 MB continuously, eating 50 GB in all. In theory,
> there is still plenty of memory to migrate to.

Maybe a NUMA issue? But I am just making wild guesses. Maybe it is not
-ENOMEM but just some other migration condition that is not properly
handled (see Michal's reply).

> 
>>
>> You mentioned memory pressure, if our host is under memory pressure we
>> can easily trigger running into an endless loop there, because we
>> basically ignore -ENOMEM e.g. when we cannot get a page to migrate some
>> memory to be offlined. I assume this is the case here.
>> do_migrate_range() could be the bad boy if it keeps failing forever and
>> we keep retrying.
> 
> Not sure what other people think about this. If memory removal fails while
> there is still plenty of free memory left, I worry customers will complain.

Indeed, we have to look into this.

> 
> Yeah, it stopped at do_migrate_range() when trying to migrate the last
> memory block. And each time it's the last memory block which can't be
> offlined and hangs.

It would be interesting to see which error message we keep getting.

> 
> If any message or information is needed, I can provide it.
> 
> Thanks
> Baoquan
> 


-- 

Thanks,

David / dhildenb


Re: Memory hotplug softlock issue

2018-11-14 Thread David Hildenbrand
>>
>> The real question is, however, why offlining of the last block doesn't
>> succeed. In __offline_pages() we basically have an endless loop (while
>> holding the mem_hotplug_lock in write). Now I consider this piece of
>> code very problematic (we should automatically fail after X
>> attempts/after X seconds, we should not ignore -ENOMEM), and we've had
>> other BUGs whereby we would run into an endless loop here (e.g. related
>> to hugepages I guess).
> 
> We used to have a number of retries previously and it was too fragile. If
> you need a timeout then you can easily do that from userspace. Just do
> timeout $TIME echo 0 > $MEM_PATH/online

I agree that number of retries is not a good measure.

But as far as I can see this happens from the kernel via an ACPI event.
E.g. failing to offline a block after X seconds would still make sense.
(if something takes 120 seconds to offline 128MB/2G there is something
very bad going on, we could set the default limit to e.g. 30 seconds),
however ...

> 
> I have seen an issue where the migration cannot make forward progress
> because of a glibc page with a reference count bumping up and down. The
> most probable explanation is the faultaround code. I am working on this and
> will post a patch soon. In any case the migration should converge, and if
> it doesn't, then there is a bug lurking somewhere.

... I also agree that this should converge. And if we detect a serious
issue that we can't handle/where we can't converge (e.g. -ENOMEM) we
should abort.

> 
> Failing on ENOMEM is a questionable thing. I haven't seen that happening
> in the wild, but if that is the case then I wouldn't be opposed.
> 
>> You mentioned memory pressure, if our host is under memory pressure we
>> can easily trigger running into an endless loop there, because we
>> basically ignore -ENOMEM e.g. when we cannot get a page to migrate some
>> memory to be offlined. I assume this is the case here.
>> do_migrate_range() could be the bad boy if it keeps failing forever and
>> we keep retrying.

I've seen quite a few issues while playing with virtio-mem, but didn't
have the time to look into the details. Still on my long list of things
to look into.

> 
> My hotplug debugging patches [1] should help to tell us.
> 
> [1] http://lkml.kernel.org/r/20181107101830.17405-1-mho...@kernel.org
> 


-- 

Thanks,

David / dhildenb


Re: Memory hotplug softlock issue

2018-11-14 Thread Michal Hocko
On Wed 14-11-18 09:18:09, David Hildenbrand wrote:
> On 14.11.18 08:09, Baoquan He wrote:
> > Hi,
> > 
> > Tested memory hotplug on a bare metal system; hot removing always
> > triggers a lockup. Usually it takes several hot plug/unplug cycles, then
> > the hot removing hangs at the last block. This is with memory pressure
> > added by executing "stress -m 200".
> >
> > I will attach part of the log. Any idea or suggestion is appreciated.
> > 
> > Thanks
> > Baoquan
> > 
> 
> Code seems to be waiting for the mem_hotplug_lock in read.
> We hold mem_hotplug_lock in write whenever we online/offline/add/remove
> memory. There are two ways to trigger offlining of memory:
> 
> 1. Offlining via "echo offline > /sys/devices/system/memory/memory0/state"
>
> This always properly took the mem_hotplug_lock. Nothing changed.
>
> 2. Offlining via "echo 0 > /sys/devices/system/memory/memory0/online"
> 
> This didn't take the mem_hotplug_lock and I fixed that for this release.

This discrepancy should go.

> So if you were testing with 1., you should have seen the same error
> before this release (unless there is something else now broken in this
> release).
> 
> 
> The real question is, however, why offlining of the last block doesn't
> succeed. In __offline_pages() we basically have an endless loop (while
> holding the mem_hotplug_lock in write). Now I consider this piece of
> code very problematic (we should automatically fail after X
> attempts/after X seconds, we should not ignore -ENOMEM), and we've had
> other BUGs whereby we would run into an endless loop here (e.g. related
> to hugepages I guess).

We used to have a number of retries previously and it was too fragile. If
you need a timeout then you can easily do that from userspace. Just do
timeout $TIME echo 0 > $MEM_PATH/online

I have seen an issue where the migration cannot make forward progress
because of a glibc page with a reference count bumping up and down. The
most probable explanation is the faultaround code. I am working on this and
will post a patch soon. In any case the migration should converge, and if
it doesn't, then there is a bug lurking somewhere.

Failing on ENOMEM is a questionable thing. I haven't seen that happening
in the wild, but if that is the case then I wouldn't be opposed.

> You mentioned memory pressure, if our host is under memory pressure we
> can easily trigger running into an endless loop there, because we
> basically ignore -ENOMEM e.g. when we cannot get a page to migrate some
> memory to be offlined. I assume this is the case here.
> do_migrate_range() could be the bad boy if it keeps failing forever and
> we keep retrying.

My hotplug debugging patches [1] should help to tell us.

[1] http://lkml.kernel.org/r/20181107101830.17405-1-mho...@kernel.org
-- 
Michal Hocko
SUSE Labs
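
[Editor's illustration: the userspace-timeout suggestion above, spelled out
in C rather than shell. The sysfs path and the 30-second limit are arbitrary
choices; as far as I can tell the kernel's offline loop checks for pending
signals, which is also what lets the shell `timeout` variant abort a stuck
attempt.]

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Empty handler: its only job is to interrupt the blocking write(). */
static void on_alarm(int sig) { (void)sig; }

int main(void)
{
    struct sigaction sa = { .sa_handler = on_alarm };  /* no SA_RESTART */
    sigaction(SIGALRM, &sa, NULL);

    int fd = open("/sys/devices/system/memory/memory0/online", O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    alarm(30);                     /* give the offline attempt 30 seconds */
    if (write(fd, "0", 1) != 1)    /* blocks while the kernel retries */
        perror("offline failed or timed out");
    alarm(0);
    close(fd);
    return 0;
}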


Re: Memory hotplug softlock issue

2018-11-14 Thread Baoquan He
Hi David,

On 11/14/18 at 09:18am, David Hildenbrand wrote:
> Code seems to be waiting for the mem_hotplug_lock in read.
> We hold mem_hotplug_lock in write whenever we online/offline/add/remove
> memory. There are two ways to trigger offlining of memory:
> 
> 1. Offlining via "echo offline > /sys/devices/system/memory/memory0/state"
>
> This always properly took the mem_hotplug_lock. Nothing changed.
>
> 2. Offlining via "echo 0 > /sys/devices/system/memory/memory0/online"
> 
> This didn't take the mem_hotplug_lock and I fixed that for this release.
> 
> So if you were testing with 1., you should have seen the same error
> before this release (unless there is something else now broken in this
> release).

Thanks a lot for looking into this.

I triggered sysrq+t to check the threads' state. You can see that we use
firmware to trigger an ACPI event that goes to acpi_bus_offline(); it truly
didn't take mem_hotplug_lock() before, and takes it now with your fix in
commit 381eab4a6ee ("mm/memory_hotplug: fix online/offline_pages called
w.o. mem_hotplug_lock").

[  +0.007062] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[  +0.005398] Call Trace:
[  +0.002476]  ? page_vma_mapped_walk+0x307/0x710
[  +0.004538]  ? page_remove_rmap+0xa2/0x340
[  +0.004104]  ? ptep_clear_flush+0x54/0x60
[  +0.004027]  ? enqueue_entity+0x11c/0x620
[  +0.005904]  ? schedule+0x28/0x80
[  +0.003336]  ? rmap_walk_file+0xf9/0x270
[  +0.003940]  ? try_to_unmap+0x9c/0xf0
[  +0.003695]  ? migrate_pages+0x2b0/0xb90
[  +0.003959]  ? try_offline_node+0x160/0x160
[  +0.004214]  ? __offline_pages+0x6ce/0x8e0
[  +0.004134]  ? memory_subsys_offline+0x40/0x60
[  +0.004474]  ? device_offline+0x81/0xb0
[  +0.003867]  ? acpi_bus_offline+0xdb/0x140
[  +0.004117]  ? acpi_device_hotplug+0x21c/0x460
[  +0.004458]  ? acpi_hotplug_work_fn+0x1a/0x30
[  +0.004372]  ? process_one_work+0x1a1/0x3a0
[  +0.004195]  ? worker_thread+0x30/0x380
[  +0.003851]  ? drain_workqueue+0x120/0x120
[  +0.004117]  ? kthread+0x112/0x130
[  +0.003411]  ? kthread_park+0x80/0x80
[  +0.005325]  ? ret_from_fork+0x35/0x40


> 
> 
> The real question is, however, why offlining of the last block doesn't
> succeed. In __offline_pages() we basically have an endless loop (while
> holding the mem_hotplug_lock in write). Now I consider this piece of
> code very problematic (we should automatically fail after X
> attempts/after X seconds, we should not ignore -ENOMEM), and we've had
> other BUGs whereby we would run into an endless loop here (e.g. related
> to hugepages I guess).

Hmm, even though memory hotplug stalled there, there is still plenty of
memory. E.g. this system has 8 nodes and each node has 64 GB of memory,
512 GB in all. Now I run "stress -m 200" to trigger 200 processes that
malloc and then free 256 MB continuously, eating 50 GB in all. In theory,
there is still plenty of memory to migrate to.

> 
> You mentioned memory pressure, if our host is under memory pressure we
> can easily trigger running into an endless loop there, because we
> basically ignore -ENOMEM e.g. when we cannot get a page to migrate some
> memory to be offlined. I assume this is the case here.
> do_migrate_range() could be the bad boy if it keeps failing forever and
> we keep retrying.

Not sure what other people think about this. If memory removal fails while
there is still plenty of free memory left, I worry customers will complain.

Yeah, it stopped at do_migrate_range() when trying to migrate the last
memory block. And each time it's the last memory block which can't be
offlined and hangs.

If any message or information is needed, I can provide it.

Thanks
Baoquan
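
[Editor's note: the sysrq+t dump used above does not require a keyboard;
with root and kernel.sysrq enabled, writing 't' to /proc/sysrq-trigger
requests the same task-state dump, which lands in dmesg. A minimal C
version:]

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/proc/sysrq-trigger", O_WRONLY);
    if (fd < 0) { perror("open /proc/sysrq-trigger"); return 1; }
    if (write(fd, "t", 1) != 1)   /* 't' = dump all task states */
        perror("write");
    close(fd);
    return 0;
}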


Re: Memory hotplug softlock issue

2018-11-14 Thread David Hildenbrand
On 14.11.18 08:09, Baoquan He wrote:
> Hi,
> 
> Tested memory hotplug on a bare metal system; hot removing always
> triggers a lockup. Usually it takes several hot plug/unplug cycles, then
> the hot removing hangs at the last block. This is with memory pressure
> added by executing "stress -m 200".
> 
> I will attach part of the log. Any idea or suggestion is appreciated.
> 
> Thanks
> Baoquan
> 

Code seems to be waiting for the mem_hotplug_lock in read.
We hold mem_hotplug_lock in write whenever we online/offline/add/remove
memory. There are two ways to trigger offlining of memory:

1. Offlining via "echo offline > /sys/devices/system/memory/memory0/state"

This always properly took the mem_hotplug_lock. Nothing changed.

2. Offlining via "echo 0 > /sys/devices/system/memory/memory0/online"

This didn't take the mem_hotplug_lock and I fixed that for this release.

So if you were testing with 1., you should have seen the same error
before this release (unless there is something else now broken in this
release).


The real question is, however, why offlining of the last block doesn't
succeed. In __offline_pages() we basically have an endless loop (while
holding the mem_hotplug_lock in write). Now I consider this piece of
code very problematic (we should automatically fail after X
attempts/after X seconds, we should not ignore -ENOMEM), and we've had
other BUGs whereby we would run into an endless loop here (e.g. related
to hugepages I guess).

You mentioned memory pressure, if our host is under memory pressure we
can easily trigger running into an endless loop there, because we
basically ignore -ENOMEM e.g. when we cannot get a page to migrate some
memory to be offlined. I assume this is the case here.
do_migrate_range() could be the bad boy if it keeps failing forever and
we keep retrying.

-- 

Thanks,

David / dhildenb
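
[Editor's illustration: a toy userspace model of the __offline_pages()
behaviour criticised in this thread -- a migration failure, -ENOMEM
included, only leads to another retry, so a single page that never migrates
means the loop never terminates. do_migrate_range_model() is a made-up
stand-in, and the demo caps the retries that the kernel does not.]

#include <errno.h>
#include <stdio.h>

/* Stand-in for do_migrate_range(): the last memory block never migrates. */
static int do_migrate_range_model(void)
{
    return -ENOMEM;              /* e.g. no migration target available */
}

int main(void)
{
    /* The kernel loops unconditionally; cap it here so the demo ends. */
    for (int attempt = 1; attempt <= 5; attempt++) {
        int ret = do_migrate_range_model();
        if (!ret)
            return 0;            /* offline succeeded */
        /* The error value is dropped on the floor and we just go again. */
        printf("attempt %d: migration failed (%d), retrying\n", attempt, ret);
    }
    puts("still failing; the real loop would keep going forever");
    return 0;
}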