MADV_COLLAPSE on file-backed mappings fails with -EINVAL when text pages are dirty. This affects scenarios such as package or container updates, or executing a binary immediately after writing it.
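
For illustration only, here is a minimal userspace sketch of how a caller might treat the failure as retryable. It assumes the kernel reports dirty/writeback text pages as EAGAIN, as the title of patch 1/2 suggests. The helper collapse_with_retry(), the advise_fn hook type and the fake_advise_dirty_once() stub are hypothetical names introduced for this example and are not part of the series. The fallback value for MADV_COLLAPSE is the asm-generic uapi value and is only used when the libc headers do not define it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* asm-generic uapi value; older libc headers lack it */
#endif

/* Hypothetical hook with the same signature as madvise(2), so the retry
 * policy can be exercised without a real file-backed mapping. */
typedef int (*advise_fn)(void *addr, size_t len, int advice);

/* Hypothetical helper: retry the collapse only while the failure is EAGAIN.
 * Returns 0 on success, or a negative errno value on failure. */
static int collapse_with_retry(advise_fn advise, void *addr, size_t len,
                               int max_tries)
{
    assert(max_tries > 0);

    for (int attempt = 0; attempt < max_tries; attempt++) {
        if (advise(addr, len, MADV_COLLAPSE) == 0)
            return 0;
        if (errno != EAGAIN)
            return -errno; /* not a retry signal, give up immediately */
    }
    return -EAGAIN; /* still failing with EAGAIN after max_tries attempts */
}

/* Stand-in for a kernel that sees dirty text pages on the first call only. */
static int fake_calls;

static int fake_advise_dirty_once(void *addr, size_t len, int advice)
{
    (void)addr;
    (void)len;
    (void)advice;

    if (fake_calls++ == 0) {
        errno = EAGAIN;
        return -1;
    }
    return 0;
}
```

In a real program the hook would simply be madvise itself, for example collapse_with_retry(madvise, text_start, text_len, 3), where text_start and text_len describe a hugepage-aligned range of the mapped .text section.
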
The issue is that collapse_file() triggers async writeback and returns SCAN_FAIL (which maps to -EINVAL), expecting khugepaged to revisit the range later. But MADV_COLLAPSE is synchronous, and userspace expects immediate success or a clear retry signal.

Reproduction:
 - Compile or copy a 2MB-aligned executable to an XFS/ext4 FS
 - Call MADV_COLLAPSE on the .text section
 - First call fails with -EINVAL (text pages dirty from the copy)
 - Second call succeeds (async writeback completed)

Issue Report:
https://lore.kernel.org/all/[email protected]

Hi Andrew,

This V5 series incorporates David's feedback for simplifying the retry logic.

To apply on mm-new, please drop:
 - [email protected]:
   [PATCH V4 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE
 - [email protected]:
   [PATCH V2 0/5] mm/khugepaged: cleanups and scan limit fix
   (merge conflicts with this series; V3 with review fixes posting soon)

Thank you :)

Changelog:

V5:
 - In patch 2/2, simplify dirty writeback retry logic (David)

V4:
 - https://lore.kernel.org/all/[email protected]
 - Rebase on mm-new
 - Fix spurious blank line (Lance)

V3:
 - https://lore.kernel.org/all/[email protected]
 - Reordered patches: enum definition comes first as the retry logic depends on it
 - Renamed SCAN_PAGE_NOT_CLEAN to SCAN_PAGE_DIRTY_OR_WRITEBACK (Dev, Lance, David)
 - Changed writeback logic: only trigger synchronous writeback and retry if the initial collapse attempt failed specifically due to dirty/writeback pages, rather than blindly flushing all file-backed VMAs (David)
 - Added proper file reference counting (get_file/fput) around the unlock window to prevent UAF (Lance)

V2:
 - https://lore.kernel.org/all/[email protected]
 - Move writeback to madvise_collapse() (better abstraction, proper mmap_lock handling, and VMA revalidation after I/O) (Lorenzo)
 - Rename SCAN_PAGE_DIRTY to SCAN_PAGE_NOT_CLEAN and extend its use to all dirty/writeback folio cases that previously returned incorrect results (Dev)

V1:
 - https://lore.kernel.org/all/[email protected]

Thanks,

Shivank Garg (2):
  mm/khugepaged: map dirty/writeback pages failures to EAGAIN
  mm/khugepaged: retry with sync writeback for MADV_COLLAPSE

 include/trace/events/huge_memory.h |  3 ++-
 mm/khugepaged.c                    | 23 ++++++++++++++++++++---
 2 files changed, 22 insertions(+), 4 deletions(-)

--
2.43.0
