> On Sun, 22 Apr 2018, Tetsuo Handa wrote:
> 
> > > I'm wondering why you do not see oom killing of many processes, as I
> > > do, when the victim is a very large process that takes a long time to
> > > free memory in exit_mmap(), because the oom reaper gives up trying to
> > > acquire mm->mmap_sem and just sets MMF_OOM_SKIP itself.
> > > 
> > 
> > We can call __oom_reap_task_mm() from exit_mmap() (or __mmput()) before
> > exit_mmap() takes mmap_sem for write. Then, at least the memory which
> > could have been reclaimed if exit_mmap() did not hold mmap_sem for write
> > is guaranteed to be reclaimed before MMF_OOM_SKIP is set.
> > 
> 
> I think that's an exceptionally good idea and will mitigate the concerns 
> of others.
> 
> It can be done without holding mm->mmap_sem in exit_mmap() and uses the 
> same criteria that the oom reaper uses to set MMF_OOM_SKIP itself, so we 
> don't get dozens of unnecessary oom kills.
> 
> What do you think about this?  It passes preliminary testing on powerpc 
> and I've enqueued it for much more intensive testing.  (I'm wishing there 
> was a better way to acknowledge your contribution to fixing this issue, 
> especially since you brought up the exact problem this is addressing in 
> previous emails.)
> 

I don't think this patch is safe, because exit_mmap() calls
mmu_notifier_invalidate_range_{start,end}(), which might block with
oom_lock held while oom_reap_task_mm() is waiting for the oom_lock that
exit_mmap() holds. exit_mmap() must not block while holding oom_lock, so
that oom_reap_task_mm() is guaranteed to be able to give up.
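
That is, roughly the following interleaving (a sketch of the scenario
described above, not actual code; the locking in exit_mmap() is as in
the patch being commented on):

  exit_mmap()                        oom_reap_task_mm()
    mutex_lock(&oom_lock);
    __oom_reap_task_mm(mm);
      mmu_notifier_invalidate_range_start();
        /* may block indefinitely,
           e.g. waiting for memory */
                                       mutex_lock(&oom_lock);
                                         /* blocks; can neither reap nor
                                            give up by setting
                                            MMF_OOM_SKIP, so the OOM
                                            killer cannot make progress */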

Here is a suggestion on top of your patch:

 mm/mmap.c     | 13 +++++--------
 mm/oom_kill.c | 51 ++++++++++++++++++++++++++-------------------------
 2 files changed, 31 insertions(+), 33 deletions(-)
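
The invariant this aims for is roughly the following (a sketch of the
intended protocol, not code lifted from the patch itself):

  /* Setter side (exit_mmap(), oom_reap_task()): */
  __oom_reap_task_mm(mm);            /* exit_mmap(): reap with oom_lock
                                        NOT held, since this may block
                                        in the mmu_notifier callbacks */
  mutex_lock(&oom_lock);
  set_bit(MMF_OOM_SKIP, &mm->flags); /* only ever set under oom_lock */
  mutex_unlock(&oom_lock);

  /* Checker side (oom_reap_task_mm(), out_of_memory()): */
  mutex_lock(&oom_lock);
  if (test_bit(MMF_OOM_SKIP, &mm->flags))
          goto unlock_oom;           /* serialized against the setters */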

diff --git a/mm/mmap.c b/mm/mmap.c
index 981eed4..7b31357 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3019,21 +3019,18 @@ void exit_mmap(struct mm_struct *mm)
                /*
                 * Manually reap the mm to free as much memory as possible.
                 * Then, as the oom reaper, set MMF_OOM_SKIP to disregard this
-                * mm from further consideration.  Taking mm->mmap_sem for write
-                * after setting MMF_OOM_SKIP will guarantee that the oom reaper
-                * will not run on this mm again after mmap_sem is dropped.
+                * mm from further consideration. Setting MMF_OOM_SKIP while
+                * holding oom_lock guarantees that the OOM reaper will not
+                * run on this mm again.
                 *
                 * This needs to be done before calling munlock_vma_pages_all(),
                 * which clears VM_LOCKED, otherwise the oom reaper cannot
                 * reliably test it.
                 */
-               mutex_lock(&oom_lock);
                __oom_reap_task_mm(mm);
-               mutex_unlock(&oom_lock);
-
+               mutex_lock(&oom_lock);
                set_bit(MMF_OOM_SKIP, &mm->flags);
-               down_write(&mm->mmap_sem);
-               up_write(&mm->mmap_sem);
+               mutex_unlock(&oom_lock);
        }
 
        if (mm->locked_vm) {
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8ba6cb8..9a29df8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -523,21 +523,15 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
        bool ret = true;
 
+       mutex_lock(&oom_lock);
+
        /*
-        * We have to make sure to not race with the victim exit path
-        * and cause premature new oom victim selection:
-        * oom_reap_task_mm             exit_mm
-        *   mmget_not_zero
-        *                                mmput
-        *                                  atomic_dec_and_test
-        *                                exit_oom_victim
-        *                              [...]
-        *                              out_of_memory
-        *                                select_bad_process
-        *                                  # no TIF_MEMDIE task selects new victim
-        *  unmap_page_range # frees some memory
+        * MMF_OOM_SKIP is set by exit_mmap() when the OOM reaper can't
+        * work on the mm anymore. The check for MMF_OOM_SKIP must run
+        * with oom_lock held.
         */
-       mutex_lock(&oom_lock);
+       if (test_bit(MMF_OOM_SKIP, &mm->flags))
+               goto unlock_oom;
 
        if (!down_read_trylock(&mm->mmap_sem)) {
                ret = false;
@@ -557,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
                goto unlock_oom;
        }
 
-       /*
-        * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
-        * work on the mm anymore. The check for MMF_OOM_SKIP must run
-        * under mmap_sem for reading because it serializes against the
-        * down_write();up_write() cycle in exit_mmap().
-        */
-       if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
-               up_read(&mm->mmap_sem);
-               trace_skip_task_reaping(tsk->pid);
-               goto unlock_oom;
-       }
-
        trace_start_task_reaping(tsk->pid);
 
        __oom_reap_task_mm(mm);
@@ -610,8 +592,27 @@ static void oom_reap_task(struct task_struct *tsk)
        /*
         * Hide this mm from OOM killer because it has been either reaped or
         * somebody can't call up_write(mmap_sem).
+        *
+        * We have to make sure to not cause premature new oom victim selection:
+        *
+        * __alloc_pages_may_oom()     oom_reap_task_mm()/exit_mmap()
+        *   mutex_trylock(&oom_lock)
+        *   get_page_from_freelist(ALLOC_WMARK_HIGH) # fails
+        *                               unmap_page_range() # frees some memory
+        *                               set_bit(MMF_OOM_SKIP)
+        *   out_of_memory()
+        *     select_bad_process()
+        *       test_bit(MMF_OOM_SKIP) # selects new oom victim
+        *   mutex_unlock(&oom_lock)
+        *
+        * Setting MMF_OOM_SKIP with oom_lock held guarantees that the
+        * last second allocation attempt by __alloc_pages_may_oom() is
+        * done before out_of_memory() selects the next OOM victim by
+        * finding MMF_OOM_SKIP.
         */
+       mutex_lock(&oom_lock);
        set_bit(MMF_OOM_SKIP, &mm->flags);
+       mutex_unlock(&oom_lock);
 
        /* Drop a reference taken by wake_oom_reaper */
        put_task_struct(tsk);
-- 
1.8.3.1
