On Thu, Dec 11, 2025 at 05:37:26PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <[email protected]> Sent: Thursday, December 4, 2025 1:09 PM
<snip>

> I've been playing around with mmu notifiers and 2 Meg pages. At least in my
> experiment, there's a case where the .invalidate callback is invoked on a
> range *before* the 2 Meg page is split. The kernel code that does this is
> in zap_page_range_single_batched(). Early on this function calls
> mmu_notifier_invalidate_range_start(), which invokes the .invalidate
> callback on the initial range. Later on, unmap_single_vma() is called, which
> does the split and eventually makes a second .invalidate callback for the
> entire 2 Meg page.
>
> Details: My experiment is a user space program that does the following:
>
> 1. Allocates 16 Megs of memory on a 16 Meg boundary using
> posix_memalign(). So this is private anonymous memory. Transparent
> huge pages are enabled.
>
> 2. Writes to a byte in each 4K page so they are all populated.
> /proc/meminfo shows eight 2 Meg pages have been allocated.
>
> 3. Creates an mmu notifier for the allocated 16 Megs, using an ioctl
> hacked into the kernel for experimentation purposes.
>
> 4. Uses madvise() with the DONTNEED option to free 32 Kbytes on a 4K
> page boundary somewhere in the 16 Meg allocation. This results in an mmu
> notifier invalidate callback for that 32 Kbytes. Then there's a second
> invalidate callback covering the entire 2 Meg page that contains the
> 32 Kbyte range. Kernel stack traces for the two invalidate callbacks show
> them originating in zap_page_range_single_batched().
>
> 5. Sleeps for 60 seconds. During that time, khugepaged wakes up and does
> hpage_collapse_scan_pmd() -> collapse_huge_page(), which generates a third
> .invalidate callback for the 2 Meg page. I haven't investigated what this is
> all about.
>
> 6. Interestingly, if Step 4 above does a slightly different operation using
> mprotect() with PROT_READ instead of madvise(), the 2 Meg page is split first.
> The .invalidate callback for the full 2 Meg happens before the .invalidate
> callback for the specified range.
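
For anyone who wants to retrace the user space side of steps 1, 2 and 4
above, here is a minimal sketch. It is not Michael's actual test program.
The helper names, the MADV_HUGEPAGE hint and the hard-coded 4K page size are
assumptions made for illustration. Step 3 (the mmu notifier ioctl) is left
out because it relied on a kernel hack that is not part of this thread, so
this code alone will not show the callbacks.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

#define MiB         (1024UL * 1024UL)
#define REGION_SIZE (16 * MiB)      /* step 1: 16 Megs on a 16 Meg boundary */
#define SMALL_PAGE  4096UL          /* assumption: 4K base page size */
#define ZAP_SIZE    (32 * 1024UL)   /* step 4: 32 Kbytes */

/*
 * Steps 1 and 2: allocate private anonymous memory with posix_memalign()
 * and touch one byte in every 4K page so the whole region is populated.
 * Returns NULL on allocation failure. Release with free().
 */
static unsigned char *alloc_populated_region(void)
{
	void *p = NULL;

	if (posix_memalign(&p, REGION_SIZE, REGION_SIZE) != 0)
		return NULL;

	/*
	 * Assumption, not in the quoted description: ask for THP explicitly
	 * in case the system policy is "madvise". Best effort, so the return
	 * value is deliberately ignored.
	 */
	madvise(p, REGION_SIZE, MADV_HUGEPAGE);

	for (size_t off = 0; off < REGION_SIZE; off += SMALL_PAGE)
		((unsigned char *)p)[off] = 1;

	return p;
}

/*
 * Step 4: drop 32 Kbytes starting at a 4K-aligned offset inside the region.
 * Returns 0 on success, -1 with errno set on failure (same as madvise()).
 */
static int zap_subrange(unsigned char *base, size_t offset)
{
	return madvise(base + offset, ZAP_SIZE, MADV_DONTNEED);
}
```

A caller would do something like zap_subrange(region, 3 * MiB + 4 * SMALL_PAGE),
where the offset is arbitrary as long as it is 4K aligned and sits inside one
2 Meg page. For private anonymous memory the zapped pages read back as zero
afterwards, while the neighbouring 4K pages keep their contents.
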
>
> The root partition probably isn't doing madvise() with DONTNEED for memory
> allocated for guests. But regardless of what user space does or doesn't do,
> MSHV's invalidate callback path should be made safe for this case. Maybe
> that's just detecting it and returning an error (and maybe a WARN_ON) if
> user space doesn't need it to work.
>
> Michael
>

The issue is addressed by the "mshv: Align huge page stride with guest mapping"
patch.

Thanks a lot once again for your help in identifying it,
Stanislav
