On 2026/1/12 19:33, Miaohe Lin wrote:
> On 2026/1/12 17:40, David Hildenbrand (Red Hat) wrote:
>> On 1/12/26 10:19, Miaohe Lin wrote:
>>> On 2026/1/9 21:45, David Hildenbrand (Red Hat) wrote:
>>>> On 1/7/26 10:37, Miaohe Lin wrote:
>>>>> Introduce selftests to validate the functionality of memory failure.
>>>>> These tests help ensure that memory failure handling for anonymous
>>>>> pages, pagecaches pages works correctly, including proper SIGBUS
>>>>> delivery to user processes, page isolation, and recovery paths.
>>>>>
>>>>> Currently madvise syscall is used to inject memory failures. And only
>>>>> anonymous pages and pagecaches are tested. More test scenarios, e.g.
>>>>> hugetlb, shmem, thp, will be added. Also more memory failure injecting
>>>>> methods will be supported, e.g. APEI Error INJection, if required.
>>>>
>>>
>>> Thanks for test and report. :)
>>>
>>>> 0day reports that these tests fail:
>>>>
>>>> # # ------------------------
>>>> # # running ./memory-failure
>>>> # # ------------------------
>>>> # # TAP version 13
>>>> # # 1..6
>>>> # # # Starting 6 tests from 2 test cases.
>>>> # # # RUN memory_failure.madv_hard.anon ...
>>>> # # # OK memory_failure.madv_hard.anon
>>>> # # ok 1 memory_failure.madv_hard.anon
>>>> # # # RUN memory_failure.madv_hard.clean_pagecache ...
>>>> # # # memory-failure.c:166:clean_pagecache:Expected setjmp (1) == 0 (0)
>>>> # # # clean_pagecache: Test terminated by assertion
>>>> # # # FAIL memory_failure.madv_hard.clean_pagecache
>>>> # # not ok 2 memory_failure.madv_hard.clean_pagecache
>>>> # # # RUN memory_failure.madv_hard.dirty_pagecache ...
>>>> # # # memory-failure.c:207:dirty_pagecache:Expected
>>>> unpoison_memory(self->pfn) (-16) == 0 (0)
>>>> # # # dirty_pagecache: Test terminated by assertion
>>>> # # # FAIL memory_failure.madv_hard.dirty_pagecache
>>>> # # not ok 3 memory_failure.madv_hard.dirty_pagecache
>>>> # # # RUN memory_failure.madv_soft.anon ...
>>>> # # # OK memory_failure.madv_soft.anon
>>>> # # ok 4 memory_failure.madv_soft.anon
>>>> # # # RUN memory_failure.madv_soft.clean_pagecache ...
>>>> # # # memory-failure.c:282:clean_pagecache:Expected variant->inject(self,
>>>> addr) (-1) == 0 (0)
>>>> # # # clean_pagecache: Test terminated by assertion
>>>> # # # FAIL memory_failure.madv_soft.clean_pagecache
>>>> # # not ok 5 memory_failure.madv_soft.clean_pagecache
>>>> # # # RUN memory_failure.madv_soft.dirty_pagecache ...
>>>> # # # memory-failure.c:319:dirty_pagecache:Expected variant->inject(self,
>>>> addr) (-1) == 0 (0)
>>>> # # # dirty_pagecache: Test terminated by assertion
>>>> # # # FAIL memory_failure.madv_soft.dirty_pagecache
>>>> # # not ok 6 memory_failure.madv_soft.dirty_pagecache
>>>> # # # FAILED: 2 / 6 tests passed.
>>>> # # # Totals: pass:2 fail:4 xfail:0 xpass:0 skip:0 error:0
>>>> # # [FAIL]
>>>> # not ok 71 memory-failure # exit=1
>>>>
>>>>
>>>> Can the test maybe not deal with running in certain environments (config
>>>> options etc)?
>>>
>>> To run the test, I think there should be:
>>> 1.CONFIG_MEMORY_FAILURE and CONFIG_HWPOISON_INJECT should be enabled.
>>> 2.Root privilege is required.
>>> 3.For dirty/clean pagecache testcases, the test file
>>> "./clean-page-cache-test-file" and
>>> "./dirty-page-cache-test-file" are assumed to be created on non-memory
>>> file systems
>>> such as xfs, ext4, etc.
>>>
>>> Does your test environment break any of the above rules?
>>
>> It is 0day environment, so very likely yes. I suspect 1).

Hi David,

After taking a closer look, I think CONFIG_MEMORY_FAILURE and
CONFIG_HWPOISON_INJECT must already be enabled in the 0day environment;
otherwise testcase memory_failure.madv_hard.anon would fail as well.
memory_failure.madv_hard.anon injects a memory failure and expects to see
a SIGBUS signal.

>>
>>> Am I expected to add some code to
>>> guard against this?
>>
>> Yes, at least some.
>>
>> Checking for root privileges is not required. The tests are commonly run
>> from non-memory file systems, but, in theory, could be run from nfs etc.
>>
>> If you require special file systems, take a look at gup_longterm.o where we
>> test for some fileystsem types.

And I think testcases memory_failure.madv_hard.clean_pagecache and
memory_failure.madv_hard.dirty_pagecache fail because they are running on
a memory filesystem. The error pages are kept in the page cache in that
case, while memory_failure.madv_hard.clean_pagecache expects to see the
error page truncated.

But I have no idea why memory_failure.madv_soft.dirty_pagecache and
memory_failure.madv_soft.clean_pagecache return -1 (-EPERM?) when trying
to inject a memory error through the madvise syscall. It would be really
helpful if more information could be provided.

Thanks!
.
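
For the filesystem-type guard discussed above, a check could look roughly
like the sketch below. It is not taken from gup_longterm; the helper names
(fs_type_is_memory_backed, path_on_memory_fs) are made up for illustration,
and the list of magic numbers is only an assumption about which filesystems
should count as memory-backed for these tests.

```c
#include <stdbool.h>
#include <sys/vfs.h>
#include <linux/magic.h>

/*
 * Hypothetical helper: true if f_type belongs to a filesystem that keeps
 * its data only in memory. The set of magic numbers below is an assumption
 * and may need to be extended.
 */
static bool fs_type_is_memory_backed(long f_type)
{
	switch (f_type) {
	case TMPFS_MAGIC:
	case RAMFS_MAGIC:
	case HUGETLBFS_MAGIC:
		return true;
	default:
		return false;
	}
}

/*
 * Hypothetical helper: returns 1 if path lives on a memory-backed
 * filesystem, 0 if it does not, and -1 if statfs() failed (errno is set).
 */
static int path_on_memory_fs(const char *path)
{
	struct statfs fs;

	if (statfs(path, &fs))
		return -1;

	return fs_type_is_memory_backed(fs.f_type) ? 1 : 0;
}
```

A pagecache testcase could call path_on_memory_fs() on the directory that
holds the test file and skip itself when the result is nonzero, instead of
failing on the truncation check.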

