Hi,

When analyzing a problem reported by one of our customers, I stumbbled upon an 
issue
that origins from the fact that poisoned pages end up in the buddy allocator.

Let me break down the stepts that lie to the problem:

1) We soft-offline a page
2) Page gets flagged as HWPoison and is being sent to the buddy allocator.
   This is done through set_hwpoison_free_buddy_page().
3) Kcompactd wakes up in order to perform some compaction.
4) compact_zone() will call migrate_pages()
5) migrate_pages() will try to get a new page from compaction_alloc() to 
migrate to
6) if cc->freelist is empty, compaction_alloc() will call 
isolate_free_pagesblock()
7) isolate_free_pagesblock only checks for PageBuddy() to assume that a page is 
OK
   to be used to migrate to. Since HWPoisoned page are also PageBuddy, we add
   the page to the list. (same problem exists in fast_isolate_freepages()).

The outcome of that is that we end up happily handing poisoned pages in 
compaction_alloc,
so if we ever got a fault on that page through *_fault, we will return 
VM_FAULT_HWPOISON,
and the process will be killed.

I first though that I could get away with it by checking PageHWPoison in
{fast_isolate_freepages/isolate_free_pagesblock}, but not really.
It might be that the page we are checking is an order > 0 page, so the first 
page
might not be poisoned, but the one the follows might be, and we end up in the
same situation.

After some more thought, I really came to the conclusion that HWPoison pages 
should not
really be in the buddy allocator, as this is only asking for problems.
In this case it is only compaction code, but it could be happening somewhere 
else,
and one would expect that the pages you got from the buddy allocator are 
__ready__ to use.

I __think__ that we thought we were safe to put HWPoison pages in the buddy 
allocator as we
perform healthy checks when getting a page from there, so we skip poisoned pages

Of course, this is not the end of the story, now that someone got a page, if he 
frees it,
there is a high chance that this page ends up in a pcplist (I saw that).
Unless we are on CONFIG_VM_DEBUG, we do not check for the health of pages got 
from pcplist,
as we do when getting a page from the buddy allocator.

I checked [1], and it seems that [2] was going towards fixing this kind of 
issue.

I think it is about time to revamp the whole thing.

@Naoya: I could give it a try if you are busy.

[1] 
https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horigu...@ah.jp.nec.com/
[2] 
https://lore.kernel.org/linux-mm/1541746035-13408-9-git-send-email-n-horigu...@ah.jp.nec.com/

-- 
Oscar Salvador
SUSE L3

Reply via email to