Add hmm_range_fault_unlockable(), a new HMM entry point that allows the mmap read lock to be dropped during page faults. This follows the int *locked pattern from get_user_pages_remote() in mm/gup.c: callers pass an int *locked variable indicating they can handle the lock being dropped.
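
For reference, the caller-side shape of that convention in GUP looks roughly
like this (a condensed sketch; mm and addr are placeholders, signature as in
recent kernels):

	struct page *page;
	int locked = 1;
	long ret;

	mmap_read_lock(mm);
	ret = get_user_pages_remote(mm, addr, 1, FOLL_WRITE, &page, &locked);
	if (locked)
		mmap_read_unlock(mm);
	/* if !locked, GUP dropped the lock already; do not unlock again */
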
When locked is non-NULL, hmm_vma_fault() adds FAULT_FLAG_ALLOW_RETRY and
FAULT_FLAG_KILLABLE to the fault flags passed to handle_mm_fault(). If the
fault handler drops the mmap lock (returning VM_FAULT_RETRY or
VM_FAULT_COMPLETED), the function sets *locked = 0 and returns 0, signalling
the caller to restart its walk with a fresh notifier sequence. Fatal signals
are checked before returning, matching GUP behavior. The caller is
responsible for re-acquiring the lock and restarting from the beginning,
since previously collected PFNs may be stale after the lock was dropped.

The existing hmm_range_fault() is refactored into a thin wrapper that calls
hmm_range_fault_unlockable(range, NULL). Passing NULL means
FAULT_FLAG_ALLOW_RETRY is never set, preserving existing behavior for all
current callers with no functional change.
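
The expected caller loop, condensed from the hmm.rst example added below
(driver-side locking and the notifier retry check elided), is:

again:
	range.notifier_seq = mmu_interval_read_begin(&interval_sub);
	locked = 1;
	mmap_read_lock(mm);
	ret = hmm_range_fault_unlockable(&range, &locked);
	if (locked)
		mmap_read_unlock(mm);
	if (ret == -EBUSY)
		goto again;
	if (ret)
		return ret;
	if (!locked)	/* lock was dropped: collected PFNs may be stale */
		goto again;
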
Faulting hugetlb pages is not supported on the unlockable path: if a hugetlb
page requires faulting, -EFAULT is returned. This is because
walk_hugetlb_range() holds hugetlb_vma_lock_read across the callback and
unconditionally unlocks on return; if the mmap lock is dropped inside the
callback the VMA may be freed, making the walk framework's unlock a
use-after-free. Hugetlb pages already present in page tables are handled
normally.

Documentation/mm/hmm.rst is updated with a new section describing the
unlockable API, its usage pattern, and the hugetlb limitation.

Signed-off-by: Stanislav Kinsburskii <[email protected]>
---
 Documentation/mm/hmm.rst | 89 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/hmm.h      |  1 +
 mm/hmm.c                 | 91 +++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 172 insertions(+), 9 deletions(-)

diff --git a/Documentation/mm/hmm.rst b/Documentation/mm/hmm.rst
index 7d61b7a8b65b7..13874b4dfd5f4 100644
--- a/Documentation/mm/hmm.rst
+++ b/Documentation/mm/hmm.rst
@@ -208,6 +208,95 @@ invalidate() callback. That lock must be held before calling
 mmu_interval_read_retry() to avoid any race with a concurrent CPU page table
 update.
 
+Scalable lock-drop support (hmm_range_fault_unlockable)
+=======================================================
+
+Some page fault handlers (e.g., userfaultfd) require the mmap lock to be
+dropped during fault resolution. Drivers that need to support such mappings
+can use::
+
+  int hmm_range_fault_unlockable(struct hmm_range *range, int *locked);
+
+This follows the same ``int *locked`` pattern used by ``get_user_pages_remote()``
+in ``mm/gup.c``. The caller sets ``*locked = 1`` and holds the mmap read lock
+before calling. If the lock is dropped during the fault (VM_FAULT_RETRY or
+VM_FAULT_COMPLETED), the function returns 0 with ``*locked = 0``, signalling
+the caller to restart its walk with a fresh notifier sequence. The caller is
+responsible for re-acquiring the lock and restarting from the beginning, since
+previously collected PFNs may be stale.
+
+The usage pattern is::
+
+  int driver_populate_range_unlockable(...)
+  {
+      struct hmm_range range;
+      int locked;
+      ...
+
+      range.notifier = &interval_sub;
+      range.start = ...;
+      range.end = ...;
+      range.hmm_pfns = ...;
+
+      if (!mmget_not_zero(interval_sub->notifier.mm))
+          return -EFAULT;
+
+  again:
+      range.notifier_seq = mmu_interval_read_begin(&interval_sub);
+      locked = 1;
+      mmap_read_lock(mm);
+      ret = hmm_range_fault_unlockable(&range, &locked);
+      if (locked)
+          mmap_read_unlock(mm);
+      if (ret) {
+          if (ret == -EBUSY)
+              goto again;
+          return ret;
+      }
+      if (!locked)
+          goto again;
+
+      take_lock(driver->update);
+      if (mmu_interval_read_retry(&interval_sub, range.notifier_seq)) {
+          release_lock(driver->update);
+          goto again;
+      }
+
+      /* Use pfns array content to update device page table,
+       * under the update lock */
+
+      release_lock(driver->update);
+      return 0;
+  }
+
+Passing ``locked = NULL`` to ``hmm_range_fault_unlockable()`` is equivalent to
+calling ``hmm_range_fault()``; the lock will never be dropped.
+
+Note: hugetlb pages are not supported with the unlockable path. If a hugetlb
+page requires faulting during an ``hmm_range_fault_unlockable()`` call,
+``-EFAULT`` is returned. Hugetlb pages that are already present in page tables
+are handled normally.
+
+This limitation exists because ``walk_hugetlb_range()`` in the page walk
+framework holds ``hugetlb_vma_lock_read`` across the callback and unconditionally
+unlocks on return. If the mmap lock is dropped inside the callback (via
+VM_FAULT_RETRY), the VMA may be freed before the walk framework's unlock,
+resulting in a use-after-free. Possible approaches to lift this limitation in
+the future:
+
+1. Extend the walk framework to allow callbacks to signal that the hugetlb vma
+   lock was dropped (e.g., a flag in ``struct mm_walk`` that tells
+   ``walk_hugetlb_range()`` to skip the unlock).
+
+2. Bypass ``walk_page_range()`` for hugetlb pages in the unlockable path and
+   walk hugetlb page tables directly with custom lock management (similar to
+   how GUP handles hugetlb without the walk framework).
+
+3. Re-acquire the mmap lock before returning from the hugetlb callback (like
+   ``fixup_user_fault()``), ensuring the VMA remains valid for the walk
+   framework's unlock. This changes the "never re-take" contract and would
+   require callers to handle hugetlb differently.
+
 Leverage default_flags and pfn_flags_mask
 =========================================
 
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index db75ffc949a7a..46e581865c48a 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -123,6 +123,7 @@ struct hmm_range {
  * Please see Documentation/mm/hmm.rst for how to use the range API.
  */
 int hmm_range_fault(struct hmm_range *range);
+int hmm_range_fault_unlockable(struct hmm_range *range, int *locked);
 
 /*
  * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
diff --git a/mm/hmm.c b/mm/hmm.c
index 5955f2f0c83db..9bf2fa37f2efd 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -33,6 +33,7 @@
 struct hmm_vma_walk {
 	struct hmm_range	*range;
 	unsigned long		last;
+	int			*locked;
 };
 
 enum {
@@ -86,10 +87,28 @@ static int hmm_vma_fault(unsigned long addr, unsigned long end,
 		fault_flags |= FAULT_FLAG_WRITE;
 	}
 
-	for (; addr < end; addr += PAGE_SIZE)
-		if (handle_mm_fault(vma, addr, fault_flags, NULL) &
-		    VM_FAULT_ERROR)
+	if (hmm_vma_walk->locked)
+		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+
+	for (; addr < end; addr += PAGE_SIZE) {
+		vm_fault_t ret;
+
+		ret = handle_mm_fault(vma, addr, fault_flags, NULL);
+
+		if (ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)) {
+			/*
+			 * The mmap lock has been dropped by the fault handler.
+			 * Record the failing address and signal lock-drop to
+			 * the caller.
+			 */
+			*hmm_vma_walk->locked = 0;
+			hmm_vma_walk->last = addr;
+			return -EAGAIN;
+		}
+
+		if (ret & VM_FAULT_ERROR)
 			return -EFAULT;
+	}
 
 	return -EBUSY;
 }
@@ -566,6 +585,17 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	if (required_fault) {
 		int ret;
 
+		/*
+		 * Faulting hugetlb pages on the unlockable path is not
+		 * supported. The walk framework holds hugetlb_vma_lock_read
+		 * which must be dropped before handle_mm_fault, but if the
+		 * mmap lock is also dropped (VM_FAULT_RETRY), the vma may
+		 * be freed and the walk framework's unconditional unlock
+		 * becomes a use-after-free.
+		 */
+		if (hmm_vma_walk->locked)
+			return -EFAULT;
+
 		spin_unlock(ptl);
 		hugetlb_vma_unlock_read(vma);
 		/*
@@ -655,14 +685,49 @@ static const struct mm_walk_ops hmm_walk_ops = {
  *
  * This is similar to get_user_pages(), except that it can read the page tables
  * without mutating them (ie causing faults).
+ *
+ * The mmap lock must be held by the caller and will remain held on return.
+ * For a variant that allows the mmap lock to be dropped during faults (e.g.,
+ * for userfaultfd support), see hmm_range_fault_unlockable().
  */
 int hmm_range_fault(struct hmm_range *range)
 {
+	return hmm_range_fault_unlockable(range, NULL);
+}
+EXPORT_SYMBOL(hmm_range_fault);
+
+/**
+ * hmm_range_fault_unlockable - fault a range with mmap lock-drop support
+ * @range: argument structure
+ * @locked: pointer to lock state variable (input: 1; output: 0 if lock
+ *          was dropped)
+ *
+ * Similar to hmm_range_fault() but allows the mmap lock to be dropped during
+ * page faults. This enables support for userfaultfd-backed mappings and other
+ * cases where handle_mm_fault() may need to release the mmap lock.
+ *
+ * The caller must hold the mmap read lock and set *locked = 1 before calling.
+ * On return:
+ *  - *locked == 1: mmap lock is still held, return value has normal semantics
+ *  - *locked == 0: mmap lock was dropped. The caller must re-acquire the lock
+ *    and restart the operation. The return value is 0 in this case.
+ *
+ * This function never re-acquires the mmap lock on its own: once the fault
+ * handler drops the lock, the walk stops and the function returns with
+ * *locked == 0 so the caller can re-acquire the lock and restart. If a
+ * fatal signal is pending at that point, -EINTR is returned instead of 0.
+ *
+ * Returns 0 on success or a negative error code. See hmm_range_fault() for
+ * the full list of possible errors.
+ */
+int hmm_range_fault_unlockable(struct hmm_range *range, int *locked)
+{
+	struct mm_struct *mm = range->notifier->mm;
 	struct hmm_vma_walk hmm_vma_walk = {
 		.range = range,
 		.last = range->start,
+		.locked = locked,
 	};
-	struct mm_struct *mm = range->notifier->mm;
 	int ret;
 
 	mmap_assert_locked(mm);
@@ -674,16 +739,24 @@ int hmm_range_fault(struct hmm_range *range)
 			return -EBUSY;
 		ret = walk_page_range(mm, hmm_vma_walk.last, range->end,
 				      &hmm_walk_ops, &hmm_vma_walk);
+		if (ret == -EAGAIN) {
+			/*
+			 * The mmap lock was dropped during the fault
+			 * (e.g. userfaultfd). Signal the caller to restart
+			 * by returning with *locked = 0.
+			 */
+			if (fatal_signal_pending(current))
+				return -EINTR;
+			return 0;
+		}
 		/*
-		 * When -EBUSY is returned the loop restarts with
-		 * hmm_vma_walk.last set to an address that has not been stored
-		 * in pfns. All entries < last in the pfn array are set to their
-		 * output, and all >= are still at their input values.
+		 * -EBUSY: page table changed during the walk.
+		 * Restart from hmm_vma_walk.last.
 		 */
 	} while (ret == -EBUSY);
 	return ret;
 }
-EXPORT_SYMBOL(hmm_range_fault);
+EXPORT_SYMBOL(hmm_range_fault_unlockable);
 
 /**
  * hmm_dma_map_alloc - Allocate HMM map structure

