On Tue, Feb 23, 2021 at 11:50:05AM +0100, Oscar Salvador wrote: > > CPU0: CPU1: > > set_compound_page_dtor(HUGETLB_PAGE_DTOR); > > memory_failure_hugetlb > > get_hwpoison_page > > __get_hwpoison_page > > get_page_unless_zero > > put_page_testzero() > > > > Maybe this can happen. But it is a very corner case. If we want to > > deal with this. We can put_page_testzero() first and then > > set_compound_page_dtor(HUGETLB_PAGE_DTOR). > > I have to check further, but it looks like this could actually happen. > Handling this with VM_BUG_ON is wrong, because memory_failure/soft_offline are > entitled to increase the refcount of the page. > > AFAICS, > > CPU0: CPU1: > > set_compound_page_dtor(HUGETLB_PAGE_DTOR); > memory_failure_hugetlb > get_hwpoison_page > __get_hwpoison_page > get_page_unless_zero > put_page_testzero() > identify_page_state > me_huge_page > > I think we can reach me_huge_page with either refcount = 1 or refcount =2, > depending whether put_page_testzero has been issued. > > For now, I would not re-enqueue the page if put_page_testzero == false. > I have to see how this can be handled gracefully.
I took a brief look. It is not really your patch fault. Hugetlb <-> memory-failure synchronization is a bit odd, it definitely needs improvment. The thing is, we can have different scenarios here. E.g: by the time we return from put_page_testzero, we might have refcount == 0 and PageHWPoison, or refcount == 1 PageHWPoison. The former will let a user get a page from the pool and get a sigbus when it faults in the page, and the latter will be even more odd as we will have a self-refcounted page in the free pool (and hwpoisoned). As I said, it is not this patchset fault. I just made me realize this problem. I have to think some more about this. -- Oscar Salvador SUSE L3