Re: [RFC][PATCH v1 02/11] mm: soft-offline: add missing error check of set_hwpoison_free_buddy_page()
On 11/13/2018 05:46 AM, Naoya Horiguchi wrote:
> Hi Anshuman,
>
> On Fri, Nov 09, 2018 at 03:50:41PM +0530, Anshuman Khandual wrote:
>>
>> On 11/09/2018 12:17 PM, Naoya Horiguchi wrote:
>>> set_hwpoison_free_buddy_page() could fail, then the target page is
>>> finally not isolated, so it's better to report -EBUSY for userspace
>>> to know the failure and chance of retry.
>>>
>> IIUC set_hwpoison_free_buddy_page() could only fail if the page is not
>> free in the buddy. At least for soft_offline_huge_page() that won't be
>> the case, otherwise dissolve_free_huge_page() would have returned
>> nonzero -EBUSY. Is there any other reason set_hwpoison_free_buddy_page()
>> would not succeed?
> There is a race window between page freeing (after successful soft-offline
> -> page migration case) and the containment by set_hwpoison_free_buddy_page().
> Or a target page can be allocated just after get_any_page() decided that
> the target page is a free page.
> So set_hwpoison_free_buddy_page() would safely fail in such cases.

Makes sense. Thanks.
Re: [RFC][PATCH v1 02/11] mm: soft-offline: add missing error check of set_hwpoison_free_buddy_page()
Hi Anshuman,

On Fri, Nov 09, 2018 at 03:50:41PM +0530, Anshuman Khandual wrote:
>
>
> On 11/09/2018 12:17 PM, Naoya Horiguchi wrote:
> > set_hwpoison_free_buddy_page() could fail, then the target page is
> > finally not isolated, so it's better to report -EBUSY for userspace
> > to know the failure and chance of retry.
> >
>
> IIUC set_hwpoison_free_buddy_page() could only fail if the page is not
> free in the buddy. At least for soft_offline_huge_page() that won't be
> the case, otherwise dissolve_free_huge_page() would have returned
> nonzero -EBUSY. Is there any other reason set_hwpoison_free_buddy_page()
> would not succeed?

There is a race window between page freeing (after successful soft-offline
-> page migration case) and the containment by set_hwpoison_free_buddy_page().
Or a target page can be allocated just after get_any_page() decided that
the target page is a free page.
So set_hwpoison_free_buddy_page() would safely fail in such cases.

Thanks,
Naoya Horiguchi

>
> > And for consistency, this patch moves set_hwpoison_free_buddy_page()
> > in unmap_and_move() to __soft_offline_page().
>
> Yeah this check should be handled in soft offline functions not inside
> migrations they trigger.
>
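The race described in this reply can be modelled in a few lines of userspace C. This is only a toy sketch, not kernel code: struct toy_page, toy_set_hwpoison_free_buddy_page(), toy_contain_after_migration() and toy_num_poisoned are hypothetical names introduced for illustration, and the "free in buddy" state is reduced to a single flag. It assumes the containment step succeeds only while the page is still free and that the caller maps a failure to -EBUSY, as the patch description states.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page; tracks only the two bits the model needs. */
struct toy_page {
	bool free_in_buddy;	/* page currently sits on the free list */
	bool hwpoison;		/* page has been contained */
};

/* Toy counterpart of num_poisoned_pages. */
static int toy_num_poisoned;

/*
 * Toy counterpart of set_hwpoison_free_buddy_page(): it can only take a
 * page that is still free. If the page was allocated in the race window,
 * it fails and leaves the page untouched.
 */
static bool toy_set_hwpoison_free_buddy_page(struct toy_page *p)
{
	assert(p != NULL);
	if (!p->free_in_buddy)
		return false;
	p->free_in_buddy = false;
	p->hwpoison = true;
	return true;
}

/* Toy counterpart of the error handling the patch adds to the callers. */
static int toy_contain_after_migration(struct toy_page *p)
{
	if (toy_set_hwpoison_free_buddy_page(p)) {
		toy_num_poisoned++;
		return 0;
	}
	return -EBUSY;	/* page is not isolated; the caller may retry */
}
```

In this model a page that is still free is contained and counted, while a page that was reallocated before the containment step yields -EBUSY and is left unmarked.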
Re: [RFC][PATCH v1 02/11] mm: soft-offline: add missing error check of set_hwpoison_free_buddy_page()
On 11/09/2018 12:17 PM, Naoya Horiguchi wrote:
> set_hwpoison_free_buddy_page() could fail, then the target page is
> finally not isolated, so it's better to report -EBUSY for userspace
> to know the failure and chance of retry.
>

IIUC set_hwpoison_free_buddy_page() could only fail if the page is not
free in the buddy. At least for soft_offline_huge_page() that won't be
the case, otherwise dissolve_free_huge_page() would have returned
nonzero -EBUSY. Is there any other reason set_hwpoison_free_buddy_page()
would not succeed?

> And for consistency, this patch moves set_hwpoison_free_buddy_page()
> in unmap_and_move() to __soft_offline_page().

Yeah this check should be handled in soft offline functions not inside
migrations they trigger.
[RFC][PATCH v1 02/11] mm: soft-offline: add missing error check of set_hwpoison_free_buddy_page()
set_hwpoison_free_buddy_page() could fail, then the target page is
finally not isolated, so it's better to report -EBUSY for userspace
to know the failure and chance of retry.

And for consistency, this patch moves set_hwpoison_free_buddy_page()
in unmap_and_move() to __soft_offline_page().

Fixes: 6bc9b56433b7 ("mm: fix race on soft-offlining free huge pages")
Signed-off-by: Naoya Horiguchi
---
 mm/memory-failure.c | 15 ++++++++++++---
 mm/migrate.c        |  9 ---------
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git v4.19-mmotm-2018-10-30-16-08/mm/memory-failure.c v4.19-mmotm-2018-10-30-16-08_patched/mm/memory-failure.c
index 9f09bf3..11e283e 100644
--- v4.19-mmotm-2018-10-30-16-08/mm/memory-failure.c
+++ v4.19-mmotm-2018-10-30-16-08_patched/mm/memory-failure.c
@@ -1719,14 +1719,18 @@ static int soft_offline_huge_page(struct page *page, int flags)
 		/*
 		 * We set PG_hwpoison only when the migration source hugepage
 		 * was successfully dissolved, because otherwise hwpoisoned
-		 * hugepage remains on free hugepage list, then userspace will
-		 * find it as SIGBUS by allocation failure. That's not expected
-		 * in soft-offlining.
+		 * hugepage remains on free hugepage list. The allocator ignores
+		 * such a hwpoisoned page so it's never allocated, but it could
+		 * kill a process because of no-memory rather than hwpoison.
+		 * Soft-offline never impacts the userspace, so this is
+		 * undesired.
 		 */
 		ret = dissolve_free_huge_page(page);
 		if (!ret) {
 			if (set_hwpoison_free_buddy_page(page))
 				num_poisoned_pages_inc();
+			else
+				ret = -EBUSY;
 		}
 	}
 	return ret;
@@ -1804,6 +1808,11 @@ static int __soft_offline_page(struct page *page, int flags)
 					pfn, ret, page->flags, &page->flags);
 			if (ret > 0)
 				ret = -EIO;
+		} else {
+			if (set_hwpoison_free_buddy_page(page))
+				num_poisoned_pages_inc();
+			else
+				ret = -EBUSY;
 		}
 	} else {
 		pr_info("soft offline: %#lx: isolation failed: %d, page count %d, type %lx (%pGp)\n",
diff --git v4.19-mmotm-2018-10-30-16-08/mm/migrate.c v4.19-mmotm-2018-10-30-16-08_patched/mm/migrate.c
index f7e4bfd..1742372 100644
--- v4.19-mmotm-2018-10-30-16-08/mm/migrate.c
+++ v4.19-mmotm-2018-10-30-16-08_patched/mm/migrate.c
@@ -1199,15 +1199,6 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 	 */
 	if (rc == MIGRATEPAGE_SUCCESS) {
 		put_page(page);
-		if (reason == MR_MEMORY_FAILURE) {
-			/*
-			 * Set PG_HWPoison on just freed page
-			 * intentionally. Although it's rather weird,
-			 * it's how HWPoison flag works at the moment.
-			 */
-			if (set_hwpoison_free_buddy_page(page))
-				num_poisoned_pages_inc();
-		}
 	} else {
 		if (rc != -EAGAIN) {
 			if (likely(!__PageMovable(page))) {
-- 
2.7.0
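The changelog says -EBUSY is reported so that userspace knows about the failure and can retry. A minimal userspace sketch of such a retry loop follows. It is an assumption about how a caller might react, not part of the patch: offline_fn, soft_offline_with_retry() and the stub_busy_twice() stand-in are hypothetical names. In real use the callback would presumably wrap something like madvise(addr, len, MADV_SOFT_OFFLINE), which needs CAP_SYS_ADMIN and a kernel built with CONFIG_MEMORY_FAILURE. The stub replaces it here so the logic can be exercised without privileges.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical callback type: returns 0 on success, or -1 with errno set,
 * mirroring the madvise() calling convention.
 */
typedef int (*offline_fn)(void *addr, size_t len);

/*
 * Try to soft-offline a range, retrying only while the failure is EBUSY.
 * Returns 0 on success, -EBUSY if every attempt hit the race, or the
 * negated errno of the first non-retryable failure.
 */
static int soft_offline_with_retry(offline_fn fn, void *addr, size_t len,
				   int max_tries)
{
	assert(fn != NULL);
	for (int i = 0; i < max_tries; i++) {
		if (fn(addr, len) == 0)
			return 0;
		if (errno != EBUSY)
			return -errno;	/* not a transient failure */
	}
	return -EBUSY;
}

/* Stand-in for the real call: reports EBUSY twice, then succeeds. */
static int stub_calls;

static int stub_busy_twice(void *addr, size_t len)
{
	(void)addr;
	(void)len;
	if (++stub_calls <= 2) {
		errno = EBUSY;
		return -1;
	}
	return 0;
}
```

With the stub, a budget of five attempts succeeds on the third call, while a budget of two gives up with -EBUSY. Whether and how often to retry is a policy decision left to the caller.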