This fixes a critical mmap_sem lock leak that leads to deadlock when a
page fault with FAULT_FLAG_RETRY_NOWAIT encounters a fatal signal.

The Bug Scenario:
=================

When get_user_pages() is called with FOLL_NOWAIT flag (e.g., from
direct I/O or certain network operations), it sets both
FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT.

__do_page_fault()
 handle_mm_fault()
  __handle_mm_fault()
   handle_pte_fault()
    do_swap_page()
     lock_page_or_retry()
      __lock_page_or_retry()

The page fault path calls __lock_page_or_retry() which has special
handling for FAULT_FLAG_RETRY_NOWAIT (mm/filemap.c:1025-1026):

    if (flags & FAULT_FLAG_RETRY_NOWAIT)
        return 0;  /* Lock NOT released! */

This returns 0 (page not locked) but mmap_sem is STILL HELD, as
documented by the explicit "CAUTION" comment at line 1022.

The fault handler then sets VM_FAULT_RETRY and returns. When
__do_page_fault() sees VM_FAULT_RETRY, there are 3 cases possible:

   0. FAULT_FLAG_ALLOW_RETRY flag set, no fatal signal => goto retry and
      try to acquire mmap_sem => self deadlock in case someone tries to
      grab the mmap_sem for write

 1/2. FAULT_FLAG_ALLOW_RETRY flag set with a fatal signal pending, it
      incorrectly assumes the lock was already released (line 1318
      comment: "the mmap_sem has already been released") and returns to
      userspace or calls no_context() without releasing the lock.

      Result: Process continues with mmap_sem held, leading to deadlock
      on the next page fault or mmap_sem operation.

The Correct Pattern:
====================

The proper handling is documented in fs/userfaultfd.c:331-333:

    "If FAULT_FLAG_ALLOW_RETRY is set, the mmap_sem must be released
     before returning VM_FAULT_RETRY only if FAULT_FLAG_RETRY_NOWAIT
     is not set."

And implemented at lines 444-483:

    ret = VM_FAULT_RETRY;
    if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
        goto out;  /* Return WITHOUT releasing lock */

    /* ... */
    up_read(&mm->mmap_sem);  /* Release only in non-NOWAIT case */

The Fix:
========

Before going to retry, returning to userspace or calling no_context(),
explicitly check if FAULT_FLAG_RETRY_NOWAIT was set. If yes, release
mmap_sem that was left held by __lock_page_or_retry().

The release is needed in all 3 paths:
 - retry if no fatal signals: Re-acquire mmap_sem right after goto
 - Userspace: Cannot return with kernel lock held
 - no_context(): May return via fixup_exception() for kernel faults
   (e.g., copy_from_user), and cannot continue with lock held

This matches the semantic contract documented in userfaultfd.c.

Note:
=====
If we went to the retry, cleared FAULT_FLAG_ALLOW_RETRY but still have
FAULT_FLAG_RETRY_NOWAIT, VM_FAULT_RETRY could be returned by
__lock_page_or_retry again, but this time mmap_sem is always dropped.

And userfaultfd even BUGs in case
!FAULT_FLAG_ALLOW_RETRY && FAULT_FLAG_RETRY_NOWAIT.

Observed Symptoms:
==================

Process deadlocks with stack trace showing:
 - Owns mmap_sem
 - Page fault from userspace (no syscall context)
 - Blocked in rwsem_down_read_failed trying to re-acquire mmap_sem

https://virtuozzo.atlassian.net/browse/PSBM-161478

Signed-off-by: Konstantin Khorenko <[email protected]>
---
 arch/x86/mm/fault.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index ebbedfd975730..565b11c6b3e0b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1322,6 +1322,18 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
error_code,
        if (unlikely(fault & VM_FAULT_RETRY)) {
                /* Retry at most once */
                if (flags & FAULT_FLAG_ALLOW_RETRY) {
+                       /*
+                        * FAULT_FLAG_RETRY_NOWAIT means the lock was NOT
+                        * released by __lock_page_or_retry().
+                        * We must release it here before in all 3 cases:
+                        *  0. before going to retry if no fatal signal -
+                        *     it acquires the mmap_sem
+                        *  1. before returning to userspace
+                        *  2. before calling no_context()
+                        */
+                       if (flags & FAULT_FLAG_RETRY_NOWAIT)
+                               up_read(&mm->mmap_sem);
+
                        flags &= ~FAULT_FLAG_ALLOW_RETRY;
                        flags |= FAULT_FLAG_TRIED;
                        if (!fatal_signal_pending(tsk))
-- 
2.43.0

_______________________________________________
Devel mailing list
[email protected]
https://lists.openvz.org/mailman/listinfo/devel

Reply via email to