From: Michal Hocko <mho...@suse.com>

David has noticed that the oom killer might kill additional tasks while
the existing victim hasn't terminated yet, because the oom_reaper marks
the current victim MMF_OOM_SKIP too early, when mm->mm_users has dropped
down to 0. The race is as follows:

oom_reap_task                           do_exit
                                          exit_mm
  __oom_reap_task_mm
                                            mmput
                                              __mmput
    mmget_not_zero # fails
                                                exit_mmap # frees memory
  set_bit(MMF_OOM_SKIP)
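
For reference, mmget_not_zero is just a thin wrapper around the mm_users
refcount (see include/linux/sched/mm.h):

	static inline bool mmget_not_zero(struct mm_struct *mm)
	{
		/* fails as soon as mm_users has dropped to 0 */
		return atomic_inc_not_zero(&mm->mm_users);
	}

so it starts failing as soon as the victim enters __mmput, which can be
well before exit_mmap has actually released any memory.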

Currently we try to reduce the risk of this race by taking the oom_lock
and relying on out_of_memory sleeping while holding the lock, to give
the victim some time to exit (see the hunk removed from
__oom_reap_task_mm below). This is a quite suboptimal approach because
there is no guarantee the victim (especially a large one) will manage
to unmap its address space and free enough memory for the particular
oom domain which needs the memory (e.g. a specific NUMA node).

Fix this problem by allowing __oom_reap_task_mm and the __mmput path to
race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
to run in parallel with other unmappers (hence the mmap_sem taken for
read). The only tricky part is that we have to exclude the page table
tear down and all the operations which modify the address space in the
__mmput path. exit_mmap doesn't expect any other users, so it doesn't
use any locking. Nothing really forbids us from using the mmap_sem for
write there, though. In fact we already rely on this lock earlier in
the __mmput path to synchronize with ksm and khugepaged.

Take the exclusive mmap_sem when calling free_pgtables and destroying
VMAs to synchronize with __oom_reap_task_mm, which takes the lock for
read. All other operations can safely race with the parallel unmap. The
patch also resets mm->mmap to NULL before dropping the lock, so that a
reaper coming in afterwards sees an empty VMA list and has nothing left
to do.
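
To illustrate, the resulting synchronization (as implemented by the
patch below) lets the reaper run concurrently with unmap_vmas while
excluding it from the page table tear down:

exit_mmap                               __oom_reap_task_mm
  unmap_vmas # frees memory               down_read(&mm->mmap_sem)
  down_write(&mm->mmap_sem) # waits       unmap_page_range # frees memory
  free_pgtables                           up_read(&mm->mmap_sem)
  remove_vma, mm->mmap = NULL
  up_write(&mm->mmap_sem)

A reaper which only gets the read lock after exit_mmap has dropped the
write lock sees mm->mmap == NULL and has no VMAs left to walk.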

Reported-by: David Rientjes <rient...@google.com>
Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
Signed-off-by: Michal Hocko <mho...@suse.com>
---

Hi,
I am sending this as an RFC because I am not yet sure I haven't missed
something subtle here, but the approach should work in principle. I
have run it through some of my OOM stress tests to see if anything
blows up and it all went smoothly.

The issue has been brought up by David [1]. There were some attempts to
address it in the oom code proper [2][3] but the first one would cause
problems of its own [4] while the latter is just too hairy.

Thoughts, objections, alternatives?

[1] http://lkml.kernel.org/r/alpine.deb.2.10.1706141632100.93...@chino.kir.corp.google.com
[2] http://lkml.kernel.org/r/201706171417.jhg48401.joqlhmfsvoo...@i-love.sakura.ne.jp
[3] http://lkml.kernel.org/r/201706220053.v5m0rmou078...@www262.sakura.ne.jp
[4] http://lkml.kernel.org/r/201706210217.v5l2hazc081...@www262.sakura.ne.jp

 mm/mmap.c     |  7 +++++++
 mm/oom_kill.c | 40 ++--------------------------------------
 2 files changed, 9 insertions(+), 38 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 3bd5ecd20d4d..253808e716dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm)
        /* Use -1 here to ensure all VMAs in the mm are unmapped */
        unmap_vmas(&tlb, vma, 0, -1);
 
+       /*
+        * oom reaper might race with exit_mmap so make sure we won't free
+        * page tables or unmap VMAs under its feet
+        */
+       down_write(&mm->mmap_sem);
        free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
        tlb_finish_mmu(&tlb, 0, -1);
 
@@ -2974,7 +2979,9 @@ void exit_mmap(struct mm_struct *mm)
                        nr_accounted += vma_pages(vma);
                vma = remove_vma(vma);
        }
+       mm->mmap = NULL;
        vm_unacct_memory(nr_accounted);
+       up_write(&mm->mmap_sem);
 }
 
 /* Insert vm structure into process list sorted by address
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e2c925e7826..5dc0ff22d567 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -472,36 +472,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
        struct vm_area_struct *vma;
        bool ret = true;
 
-       /*
-        * We have to make sure to not race with the victim exit path
-        * and cause premature new oom victim selection:
-        * __oom_reap_task_mm           exit_mm
-        *   mmget_not_zero
-        *                                mmput
-        *                                  atomic_dec_and_test
-        *                                exit_oom_victim
-        *                              [...]
-        *                              out_of_memory
-        *                                select_bad_process
-        *                                  # no TIF_MEMDIE task selects new victim
-        *  unmap_page_range # frees some memory
-        */
-       mutex_lock(&oom_lock);
-
-       if (!down_read_trylock(&mm->mmap_sem)) {
-               ret = false;
-               goto unlock_oom;
-       }
-
-       /*
-        * increase mm_users only after we know we will reap something so
-        * that the mmput_async is called only when we have reaped something
-        * and delayed __mmput doesn't matter that much
-        */
-       if (!mmget_not_zero(mm)) {
-               up_read(&mm->mmap_sem);
-               goto unlock_oom;
-       }
+       if (!down_read_trylock(&mm->mmap_sem))
+               return false;
 
        /*
         * Tell all users of get_user/copy_from_user etc... that the content
@@ -538,14 +510,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
                        K(get_mm_counter(mm, MM_SHMEMPAGES)));
        up_read(&mm->mmap_sem);
 
-       /*
-        * Drop our reference but make sure the mmput slow path is called from a
-        * different context because we shouldn't risk we get stuck there and
-        * put the oom_reaper out of the way.
-        */
-       mmput_async(mm);
-unlock_oom:
-       mutex_unlock(&oom_lock);
        return ret;
 }
 
-- 
2.11.0
