Hi Oscar, Naoya,

On Mon, Sep 14, 2020 at 12:15:54PM +0200, Oscar Salvador wrote:
> The important bit of this patchset is patch#1, which is a fix to take off
> HWPoison pages off a buddy freelist since it can lead us to having HWPoison
> pages back in the game without no one noticing it.
> So fix it (we did that already for soft_offline_page [1]).
>
> The other patches are clean-ups and not that important, so if anything,
> consider patch#1 for inclusion.
>
> [1] https://patchwork.kernel.org/cover/11704083/

I found something strange with your and Naoya's hwpoison rework. We have a
customer with a testcase that basically does:

    p1 = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    p2 = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    madvise(p1, size, MADV_MERGEABLE);
    madvise(p2, size, MADV_MERGEABLE);

    memset(p1, 'a', size);
    memset(p2, 'a', size);

    madvise(p1, size, MADV_SOFT_OFFLINE);

    madvise(p1, size, MADV_UNMERGEABLE);
    madvise(p2, size, MADV_UNMERGEABLE);

where size is about 200,000 pages. It works on an x86_64 box (with and
without the hwpoison rework). On ppc64 boxes (tested 3 different ones with
at least 250GB memory) it fails to take a page off the buddy list
(page_handle_poison()/take_page_off_buddy()), so madvise MADV_SOFT_OFFLINE
returns -EBUSY. Without the hwpoison rework the test passes.

Possibly related is that ppc64 takes a long time to run this test and,
according to perf, it spends most of the time clearing pages:

    17.15%  ksm_poison  [kernel.kallsyms]  [k] copypage_power7
    13.39%  ksm_poison  [kernel.kallsyms]  [k] clear_user_page
     8.70%  ksm_poison  libc-2.28.so       [.] __memset_power8
     8.63%  ksm_poison  [kernel.kallsyms]  [k] opal_return
     6.04%  ksm_poison  [kernel.kallsyms]  [k] __opal_call
     2.67%  ksm_poison  [kernel.kallsyms]  [k] opal_call
     1.52%  ksm_poison  [kernel.kallsyms]  [k] _raw_spin_lock
     1.45%  ksm_poison  [kernel.kallsyms]  [k] opal_flush_console
     1.43%  ksm_poison  [unknown]          [k] 0x0000000030005138
     1.43%  ksm_poison  [kernel.kallsyms]  [k] opal_console_write_buffer_space
     1.26%  ksm_poison  [kernel.kallsyms]  [k] hvc_console_print
    (...)

I've run these tests using mmotm and mmotm with this patchset on top.

Do you know what might be happening here?

--
Aristeu