Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap

2017-07-23 Thread Hugh Dickins
On Thu, 20 Jul 2017, Michal Hocko wrote:
> On Wed 19-07-17 18:18:27, Hugh Dickins wrote:
> > 
> > But I haven't looked at the oom_kill or oom_reaper end of it at all,
> > perhaps you have an overriding argument on the placement from that end.
> 
> Well, the main problem here is that the oom_reaper tries to
> MADV_DONTNEED the oom victim and then hide it from the oom killer (by
> setting MMF_OOM_SKIP) to guarantee forward progress. In order to do
> that it needs mmap_sem for read. Currently we try to avoid races with
> the exit path by checking mm->mm_users and that can lead to premature
> MMF_OOM_SKIP and that in turn to additional oom victim(s) selection
> while the current one is still tearing the address space down.
> 
> One way around that is to allow final unmap race with the oom_reaper
> tear down.
> 
> I hope this clarifies the motivation.

Thanks, yes, if you have a good reason of that kind, then I agree that
it's appropriate to leave the down_write(mmap_sem) until reaching the
free_pgtables() stage.

Hugh


Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap

2017-07-20 Thread Michal Hocko
On Wed 19-07-17 18:18:27, Hugh Dickins wrote:
> On Wed, 19 Jul 2017, Michal Hocko wrote:
> > On Thu 29-06-17 10:46:21, Michal Hocko wrote:
> > > Forgot to CC Hugh.
> > > 
> > > Hugh, Andrew, do you see this could cause any problem wrt.
> > > ksm/khugepaged exit path?
> > 
> > ping. I would really appreciate some help here. I would like to resend
> > the patch soon.
> 
> Sorry, Michal, I've been hiding from everyone.
> 
> No, I don't think your patch will cause any trouble for the ksm or
> khugepaged exit path; but we'll find out for sure when akpm puts it
> in mmotm - I doubt I'll get to trying it out in advance of that.
> 
> On the contrary, I think it will allow us to remove the peculiar
> "down_write(mmap_sem); up_write(mmap_sem);" from those exit paths:
> which were there to serialize, precisely because exit_mmap() did
> not otherwise take mmap_sem; but you're now changing it to do so.

I was actually suspecting this could be done but didn't get to study the
code enough to be sure; your words are surely encouraging...

> You could add a patch to remove those yourself, or any of us add
> that on afterwards.

I will add it to my todo list and let's see when I get there.
 
> But I don't entirely agree (or disagree) with your placement:
> see comment below.
[...]
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index 3bd5ecd20d4d..253808e716dc 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm)
> > > > /* Use -1 here to ensure all VMAs in the mm are unmapped */
> > > > unmap_vmas(&tlb, vma, 0, -1);
> > > >  
> > > > +   /*
> > > > +* oom reaper might race with exit_mmap so make sure we won't free
> > > > +* page tables or unmap VMAs under its feet
> > > > +*/
> > > > +   down_write(&mm->mmap_sem);
> 
> Hmm.  I'm conflicted about this.  From a design point of view, I would
> very much prefer you to take the mmap_sem higher up, maybe just before
> or after the mmu_notifier_release() or arch_exit_mmap() (depends on
> what those actually do): anyway before the unmap_vmas().

The thing is that I _want_ unmap_vmas to race with the oom reaper, so I
cannot take the write lock before unmap_vmas... If this whole area should
be covered by the write lock then I would need a handshake mechanism
between the oom reaper and the final unmap_vmas so that the oom reaper
won't set MMF_OOM_SKIP prematurely (see more on that below).
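
Just to illustrate what I mean by a handshake (a purely hypothetical sketch,
nothing I am proposing here; MMF_UNMAP_DONE is a made-up flag used only for
the illustration):

	/* exit_mmap(), if everything ran under the exclusive lock */
	down_write(&mm->mmap_sem);
	unmap_vmas(&tlb, vma, 0, -1);
	/* hypothetical: tell the reaper the memory is already gone */
	set_bit(MMF_UNMAP_DONE, &mm->flags);
	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
	up_write(&mm->mmap_sem);

	/* oom reaper side, before giving up on the victim */
	if (test_bit(MMF_UNMAP_DONE, &mm->flags) || __oom_reap_task_mm(tsk, mm))
		set_bit(MMF_OOM_SKIP, &mm->flags);

With the patch as posted none of this is needed because the reaper can simply
grab the lock for read and race with unmap_vmas directly.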

> Because the things which go on in exit_mmap() are things which we expect
> mmap_sem to be held across, and we get caught out when it is not: it's
> awkward and error-prone enough that MADV_DONTNEED and MADV_FREE (for
> very good reason) do things with only down_read(mmap_sem).  But there's
> a number of times (ksm exit being only one of them) when I've found it
> a nuisance that we had no proper way of serializing against exit_mmap().
> 
> I'm conflicted because, on the other hand, I'm staunchly against adding
> obstructions ("robust" futexes? gah!) into the exit patch, or widening
> the use of locks that are not strictly needed.  But wouldn't it be the
> case here, that most contenders on the mmap_sem must hold a reference
> to mm_users, and that prevents any possibility of racing exit_mmap();
> only ksm and khugepaged, and any others who already need such mmap_sem
> tricks to serialize against exit_mmap(), could offer any contention.
> 
> But I haven't looked at the oom_kill or oom_reaper end of it at all,
> perhaps you have an overriding argument on the placement from that end.

Well, the main problem here is that the oom_reaper tries to
MADV_DONTNEED the oom victim and then hide it from the oom killer (by
setting MMF_OOM_SKIP) to guarantee forward progress. In order to do
that it needs mmap_sem for read. Currently we try to avoid races with
the exit path by checking mm->mm_users and that can lead to premature
MMF_OOM_SKIP and that in turn to additional oom victim(s) selection
while the current one is still tearing the address space down.
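
To make the "hide it from the oom killer" part concrete, the victim selection
path checks the flag roughly like this (a simplified sketch of
oom_evaluate_task() reconstructed from memory, so treat the details as
illustrative rather than exact):

	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
		/*
		 * An existing victim blocks the selection of a new one until
		 * the reaper (or the exit path) sets MMF_OOM_SKIP on its mm.
		 */
		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
			goto next;	/* already reaped, ignore this task */
		goto abort;		/* still exiting, do not pick another victim */
	}

So setting MMF_OOM_SKIP too early directly translates into another task being
killed while the first victim is still releasing its memory.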

One way around that is to allow final unmap race with the oom_reaper
tear down.
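
For completeness, this is roughly what the reaping under mmap_sem for read
boils down to (heavily simplified from __oom_reap_task_mm(), reconstructed
from memory, so treat it as a sketch rather than the exact code):

	struct mmu_gather tlb;
	struct vm_area_struct *vma;

	if (!down_read_trylock(&mm->mmap_sem))
		return false;	/* the caller retries a few times before giving up */

	set_bit(MMF_UNSTABLE, &mm->flags);

	tlb_gather_mmu(&tlb, mm, 0, -1);
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		/* only private anonymous mappings are worth tearing down here */
		if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SHARED))
			continue;
		unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end, NULL);
	}
	tlb_finish_mmu(&tlb, 0, -1);
	up_read(&mm->mmap_sem);

This is essentially the same work MADV_DONTNEED does, which is why racing with
everything in exit_mmap() up to (but not including) free_pgtables() is safe.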

I hope this clarifies the motivation.

> Hugh
> 
> [Not strictly relevant here, but a related note: I was very surprised
> to discover, only quite recently, how handle_mm_fault() may be called
> without down_read(mmap_sem) - when core dumping.  That seems a
> misguided optimization to me, which would also be nice to correct;
> but again I might not appreciate the full picture.]

shrug
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap

2017-07-19 Thread Hugh Dickins
On Wed, 19 Jul 2017, Michal Hocko wrote:
> On Thu 29-06-17 10:46:21, Michal Hocko wrote:
> > Forgot to CC Hugh.
> > 
> > Hugh, Andrew, do you see this could cause any problem wrt.
> > ksm/khugepaged exit path?
> 
> ping. I would really appreciate some help here. I would like to resend
> the patch soon.

Sorry, Michal, I've been hiding from everyone.

No, I don't think your patch will cause any trouble for the ksm or
khugepaged exit path; but we'll find out for sure when akpm puts it
in mmotm - I doubt I'll get to trying it out in advance of that.

On the contrary, I think it will allow us to remove the peculiar
"down_write(mmap_sem); up_write(mmap_sem);" from those exit paths:
which were there to serialize, precisely because exit_mmap() did
not otherwise take mmap_sem; but you're now changing it to do so.

You could add a patch to remove those yourself, or any of us add
that on afterwards.
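
For reference, the pattern I mean is roughly this, e.g. in __ksm_exit()
(a simplified sketch, not the verbatim source; __khugepaged_exit() plays the
same trick):

	if (mm_slot && !easy_to_free) {
		/*
		 * ksmd may still be scanning this mm under mmap_sem; the
		 * empty write lock/unlock pair exists only to wait it out,
		 * because exit_mmap() itself takes no locks.
		 */
		down_write(&mm->mmap_sem);
		up_write(&mm->mmap_sem);
	}

Once exit_mmap() takes mmap_sem for write itself, that serialization becomes
redundant.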

But I don't entirely agree (or disagree) with your placement:
see comment below.

> 
> > On Mon 26-06-17 15:03:46, Michal Hocko wrote:
> > > From: Michal Hocko 
> > > 
> > > David has noticed that the oom killer might kill additional tasks while
> > > the existing victim hasn't terminated yet because the oom_reaper marks
> > > the current victim MMF_OOM_SKIP too early when mm->mm_users dropped down
> > > to 0. The race is as follows
> > > 
> > > oom_reap_task do_exit
> > > exit_mm
> > >   __oom_reap_task_mm
> > >   mmput
> > > __mmput
> > > mmget_not_zero # fails
> > >   exit_mmap # frees memory
> > >   set_bit(MMF_OOM_SKIP)
> > > 
> > > Currently we try to reduce the risk of this race by taking oom_lock
> > > and waiting for the out_of_memory sleep while holding the lock to give the
> > > victim some time to exit. This is quite a suboptimal approach because
> > > there is no guarantee the victim (especially a large one) will manage
> > > to unmap its address space and free enough memory to the particular oom
> > > domain which needs the memory (e.g. a specific NUMA node).
> > > 
> > > Fix this problem by allowing __oom_reap_task_mm and the __mmput path to
> > > race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
> > > to run in parallel with other unmappers (hence the mmap_sem for read).
> > > The only tricky part is that we have to exclude page table tear down and
> > > all operations which modify the address space in the __mmput path.
> > > exit_mmap doesn't expect any other users so it doesn't use any locking.
> > > Nothing really forbids us from using mmap_sem for write, though. In fact
> > > we are already relying on this lock earlier in the __mmput path to
> > > synchronize with ksm and khugepaged.
> > > 
> > > Take the exclusive mmap_sem when calling free_pgtables and destroying
> > > vmas to sync with __oom_reap_task_mm, which takes the lock for read. All
> > > other operations can safely race with the parallel unmap.
> > > 
> > > Reported-by: David Rientjes 
> > > Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
> > > Signed-off-by: Michal Hocko 
> > > ---
> > > 
> > > Hi,
> > > I am sending this as an RFC because I am not yet sure I haven't missed
> > > something subtle here but the approach should work in principle. I have
> > > run it through some of my OOM stress tests to see if anything blows up
> > > and it all went smoothly.
> > > 
> > > The issue has been brought up by David [1]. There were some attempts to
> > > address it in oom proper [2][3] but the first one would cause problems
> > > of its own [4] while the latter is just too hairy.
> > > 
> > > Thoughts, objections, alternatives?
> > > 
> > > [1] 
> > > http://lkml.kernel.org/r/alpine.deb.2.10.1706141632100.93...@chino.kir.corp.google.com
> > > [2] 
> > > http://lkml.kernel.org/r/201706171417.jhg48401.joqlhmfsvoo...@i-love.sakura.ne.jp
> > > [3] 
> > > http://lkml.kernel.org/r/201706220053.v5m0rmou078...@www262.sakura.ne.jp
> > > [4] 
> > > http://lkml.kernel.org/r/201706210217.v5l2hazc081...@www262.sakura.ne.jp
> > > 
> > >  mm/mmap.c |  7 +++
> > >  mm/oom_kill.c | 40 ++--
> > >  2 files changed, 9 insertions(+), 38 deletions(-)
> > > 
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 3bd5ecd20d4d..253808e716dc 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm)
> > >   /* Use -1 here to ensure all VMAs in the mm are unmapped */
> > >   unmap_vmas(&tlb, vma, 0, -1);
> > >  
> > > + /*
> > > +  * oom reaper might race with exit_mmap so make sure we won't free
> > > +  * page tables or unmap VMAs under its feet
> > > +  */
> > > + down_write(&mm->mmap_sem);

Hmm.  I'm conflicted about this.  From a design point of view, I would
very much prefer you to take the mmap_sem higher up, maybe just before
or after the mmu_notifier_release() o

Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap

2017-07-18 Thread Michal Hocko
On Thu 29-06-17 10:46:21, Michal Hocko wrote:
> Forgot to CC Hugh.
> 
> Hugh, Andrew, do you see this could cause any problem wrt.
> ksm/khugepaged exit path?

ping. I would really appreciate some help here. I would like to resend
the patch soon.

> On Mon 26-06-17 15:03:46, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > David has noticed that the oom killer might kill additional tasks while
> > the existing victim hasn't terminated yet because the oom_reaper marks
> > the current victim MMF_OOM_SKIP too early when mm->mm_users dropped down
> > to 0. The race is as follows
> > 
> > oom_reap_task   do_exit
> >   exit_mm
> >   __oom_reap_task_mm
> > mmput
> >   __mmput
> > mmget_not_zero # fails
> > exit_mmap # frees memory
> >   set_bit(MMF_OOM_SKIP)
> > 
> > Currently we try to reduce the risk of this race by taking oom_lock
> > and waiting for the out_of_memory sleep while holding the lock to give the
> > victim some time to exit. This is quite a suboptimal approach because
> > there is no guarantee the victim (especially a large one) will manage
> > to unmap its address space and free enough memory to the particular oom
> > domain which needs the memory (e.g. a specific NUMA node).
> > 
> > Fix this problem by allowing __oom_reap_task_mm and the __mmput path to
> > race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
> > to run in parallel with other unmappers (hence the mmap_sem for read).
> > The only tricky part is that we have to exclude page table tear down and
> > all operations which modify the address space in the __mmput path.
> > exit_mmap doesn't expect any other users so it doesn't use any locking.
> > Nothing really forbids us from using mmap_sem for write, though. In fact
> > we are already relying on this lock earlier in the __mmput path to
> > synchronize with ksm and khugepaged.
> > 
> > Take the exclusive mmap_sem when calling free_pgtables and destroying
> > vmas to sync with __oom_reap_task_mm, which takes the lock for read. All
> > other operations can safely race with the parallel unmap.
> > 
> > Reported-by: David Rientjes 
> > Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
> > Signed-off-by: Michal Hocko 
> > ---
> > 
> > Hi,
> > I am sending this as an RFC because I am not yet sure I haven't missed
> > something subtle here but the approach should work in principle. I have
> > run it through some of my OOM stress tests to see if anything blows up
> > and it all went smoothly.
> > 
> > The issue has been brought up by David [1]. There were some attempts to
> > address it in oom proper [2][3] but the first one would cause problems
> > of its own [4] while the latter is just too hairy.
> > 
> > Thoughts, objections, alternatives?
> > 
> > [1] 
> > http://lkml.kernel.org/r/alpine.deb.2.10.1706141632100.93...@chino.kir.corp.google.com
> > [2] 
> > http://lkml.kernel.org/r/201706171417.jhg48401.joqlhmfsvoo...@i-love.sakura.ne.jp
> > [3] http://lkml.kernel.org/r/201706220053.v5m0rmou078...@www262.sakura.ne.jp
> > [4] http://lkml.kernel.org/r/201706210217.v5l2hazc081...@www262.sakura.ne.jp
> > 
> >  mm/mmap.c |  7 +++
> >  mm/oom_kill.c | 40 ++--
> >  2 files changed, 9 insertions(+), 38 deletions(-)
> > 
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 3bd5ecd20d4d..253808e716dc 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm)
> > /* Use -1 here to ensure all VMAs in the mm are unmapped */
> > unmap_vmas(&tlb, vma, 0, -1);
> >  
> > +   /*
> > +* oom reaper might race with exit_mmap so make sure we won't free
> > +* page tables or unmap VMAs under its feet
> > +*/
> > +   down_write(&mm->mmap_sem);
> > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > tlb_finish_mmu(&tlb, 0, -1);
> >  
> > @@ -2974,7 +2979,9 @@ void exit_mmap(struct mm_struct *mm)
> > nr_accounted += vma_pages(vma);
> > vma = remove_vma(vma);
> > }
> > +   mm->mmap = NULL;
> > vm_unacct_memory(nr_accounted);
> > +   up_write(&mm->mmap_sem);
> >  }
> >  
> >  /* Insert vm structure into process list sorted by address
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 0e2c925e7826..5dc0ff22d567 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -472,36 +472,8 @@ static bool __oom_reap_task_mm(struct task_struct 
> > *tsk, struct mm_struct *mm)
> > struct vm_area_struct *vma;
> > bool ret = true;
> >  
> > -   /*
> > -* We have to make sure to not race with the victim exit path
> > -* and cause premature new oom victim selection:
> > -* __oom_reap_task_mm   exit_mm
> > -*   mmget_not_zero
> > -*mmput
> > -*

Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap

2017-07-12 Thread Michal Hocko
On Tue 11-07-17 13:40:04, David Rientjes wrote:
> On Tue, 11 Jul 2017, Michal Hocko wrote:
> 
> > This?
> > ---
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 5dc0ff22d567..e155d1d8064f 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -470,11 +470,14 @@ static bool __oom_reap_task_mm(struct task_struct 
> > *tsk, struct mm_struct *mm)
> >  {
> > struct mmu_gather tlb;
> > struct vm_area_struct *vma;
> > -   bool ret = true;
> >  
> > if (!down_read_trylock(&mm->mmap_sem))
> > return false;
> >  
> > +   /* There is nothing to reap so bail out without signs in the log */
> > +   if (!mm->mmap)
> > +   goto unlock;
> > +
> > /*
> >  * Tell all users of get_user/copy_from_user etc... that the content
> >  * is no longer stable. No barriers really needed because unmapping
> > @@ -508,9 +511,10 @@ static bool __oom_reap_task_mm(struct task_struct 
> > *tsk, struct mm_struct *mm)
> > K(get_mm_counter(mm, MM_ANONPAGES)),
> > K(get_mm_counter(mm, MM_FILEPAGES)),
> > K(get_mm_counter(mm, MM_SHMEMPAGES)));
> > +unlock:
> > up_read(&mm->mmap_sem);
> >  
> > -   return ret;
> > +   return true;
> >  }
> >  
> >  #define MAX_OOM_REAP_RETRIES 10
> 
> Yes, this folded in with the original RFC patch appears to work better 
> with light testing.

Yes, folding it into the original patch was the plan. I would really
appreciate some Tested-by here.

> However, I think MAX_OOM_REAP_RETRIES and/or the timeout of HZ/10 needs to 
> be increased as well to address the issue that Tetsuo pointed out.  The 
> oom reaper shouldn't be required to do any work unless it is resolving a 
> livelock, and that scenario should be relatively rare.  The oom killer 
> being a natural ultra slow path, I think it would be justifiable to wait 
> longer or retry more times than simply 1 second before declaring that 
> reaping is not possible.  It reduces the likelihood of additional oom 
> killing.

I believe that this is an independent issue and as such it should be
addressed separately along with some data backing up that decision. I am
not against improving the waiting logic. We would need some requeuing
when we cannot reap the victim because we cannot really wait too much
time on a single oom victim considering there might be many victims
queued (because of memcg ooms). This would obviously need some more code
and I am willing to implement that, but I would first like to see that this
is a real problem.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap

2017-07-11 Thread David Rientjes
On Tue, 11 Jul 2017, Michal Hocko wrote:

> This?
> ---
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5dc0ff22d567..e155d1d8064f 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -470,11 +470,14 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, 
> struct mm_struct *mm)
>  {
>   struct mmu_gather tlb;
>   struct vm_area_struct *vma;
> - bool ret = true;
>  
>   if (!down_read_trylock(&mm->mmap_sem))
>   return false;
>  
> + /* There is nothing to reap so bail out without signs in the log */
> + if (!mm->mmap)
> + goto unlock;
> +
>   /*
>* Tell all users of get_user/copy_from_user etc... that the content
>* is no longer stable. No barriers really needed because unmapping
> @@ -508,9 +511,10 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, 
> struct mm_struct *mm)
>   K(get_mm_counter(mm, MM_ANONPAGES)),
>   K(get_mm_counter(mm, MM_FILEPAGES)),
>   K(get_mm_counter(mm, MM_SHMEMPAGES)));
> +unlock:
>   up_read(&mm->mmap_sem);
>  
> - return ret;
> + return true;
>  }
>  
>  #define MAX_OOM_REAP_RETRIES 10

Yes, this folded in with the original RFC patch appears to work better 
with light testing.

However, I think MAX_OOM_REAP_RETRIES and/or the timeout of HZ/10 needs to 
be increased as well to address the issue that Tetsuo pointed out.  The 
oom reaper shouldn't be required to do any work unless it is resolving a 
livelock, and that scenario should be relatively rare.  The oom killer 
being a natural ultra slow path, I think it would be justifiable to wait 
longer or retry more times than simply 1 second before declaring that 
reaping is not possible.  It reduces the likelihood of additional oom 
killing.
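
For reference, the retry logic in question looks roughly like this (a
simplified sketch of oom_reap_task() reconstructed from memory, so the details
are illustrative; the "1 second" above is MAX_OOM_REAP_RETRIES attempts with a
HZ/10 sleep between them):

	static void oom_reap_task(struct task_struct *tsk)
	{
		int attempts = 0;
		struct mm_struct *mm = tsk->signal->oom_mm;

		/* ~10 tries with a HZ/10 sleep, i.e. about one second total */
		while (attempts++ < MAX_OOM_REAP_RETRIES &&
		       !__oom_reap_task_mm(tsk, mm))
			schedule_timeout_idle(HZ/10);

		/* give up: hide the mm so a new victim can be selected */
		set_bit(MMF_OOM_SKIP, &mm->flags);
	}

Bumping either the retry count or the sleep directly scales how long we are
willing to wait before declaring the victim unreapable.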


Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap

2017-07-10 Thread Michal Hocko
On Mon 10-07-17 16:55:22, David Rientjes wrote:
> On Mon, 26 Jun 2017, Michal Hocko wrote:
> 
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 3bd5ecd20d4d..253808e716dc 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm)
> > /* Use -1 here to ensure all VMAs in the mm are unmapped */
> > unmap_vmas(&tlb, vma, 0, -1);
> >  
> > +   /*
> > +* oom reaper might race with exit_mmap so make sure we won't free
> > +* page tables or unmap VMAs under its feet
> > +*/
> > +   down_write(&mm->mmap_sem);
> > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > tlb_finish_mmu(&tlb, 0, -1);
> >  
> > @@ -2974,7 +2979,9 @@ void exit_mmap(struct mm_struct *mm)
> > nr_accounted += vma_pages(vma);
> > vma = remove_vma(vma);
> > }
> > +   mm->mmap = NULL;
> > vm_unacct_memory(nr_accounted);
> > +   up_write(&mm->mmap_sem);
> >  }
> >  
> >  /* Insert vm structure into process list sorted by address
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 0e2c925e7826..5dc0ff22d567 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -472,36 +472,8 @@ static bool __oom_reap_task_mm(struct task_struct 
> > *tsk, struct mm_struct *mm)
> > struct vm_area_struct *vma;
> > bool ret = true;
> >  
> > -   /*
> > -* We have to make sure to not race with the victim exit path
> > -* and cause premature new oom victim selection:
> > -* __oom_reap_task_mm   exit_mm
> > -*   mmget_not_zero
> > -*mmput
> > -*  atomic_dec_and_test
> > -*exit_oom_victim
> > -*  [...]
> > -*  out_of_memory
> > -*select_bad_process
> > -*  # no TIF_MEMDIE task selects new 
> > victim
> > -*  unmap_page_range # frees some memory
> > -*/
> > -   mutex_lock(&oom_lock);
> > -
> > -   if (!down_read_trylock(&mm->mmap_sem)) {
> > -   ret = false;
> > -   goto unlock_oom;
> > -   }
> > -
> > -   /*
> > -* increase mm_users only after we know we will reap something so
> > -* that the mmput_async is called only when we have reaped something
> > -* and delayed __mmput doesn't matter that much
> > -*/
> > -   if (!mmget_not_zero(mm)) {
> > -   up_read(&mm->mmap_sem);
> > -   goto unlock_oom;
> > -   }
> > +   if (!down_read_trylock(&mm->mmap_sem))
> > +   return false;
> 
> I think this should return true if mm->mmap == NULL here.

This?
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5dc0ff22d567..e155d1d8064f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -470,11 +470,14 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
struct mmu_gather tlb;
struct vm_area_struct *vma;
-   bool ret = true;
 
if (!down_read_trylock(&mm->mmap_sem))
return false;
 
+   /* There is nothing to reap so bail out without signs in the log */
+   if (!mm->mmap)
+   goto unlock;
+
/*
 * Tell all users of get_user/copy_from_user etc... that the content
 * is no longer stable. No barriers really needed because unmapping
@@ -508,9 +511,10 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
K(get_mm_counter(mm, MM_ANONPAGES)),
K(get_mm_counter(mm, MM_FILEPAGES)),
K(get_mm_counter(mm, MM_SHMEMPAGES)));
+unlock:
up_read(&mm->mmap_sem);
 
-   return ret;
+   return true;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap

2017-07-10 Thread David Rientjes
On Tue, 27 Jun 2017, Tetsuo Handa wrote:

> I wonder why you prefer timeout based approach. Your patch will after all
> set MMF_OOM_SKIP if operations between down_write() and up_write() took
> more than one second. lock_anon_vma_root() from unlink_anon_vmas() from
> free_pgtables() for example calls down_write()/up_write(). unlink_file_vma()
>  from free_pgtables() for another example calls down_write()/up_write().
> This means that it might happen that exit_mmap() takes more than one second
> with mm->mmap_sem held for write, doesn't it?
> 

I certainly have no objection to increasing the timeout period or 
increasing MAX_OOM_REAP_RETRIES to be substantially higher.  All threads 
holding mm->mmap_sem should be oom killed and be able to access memory 
reserves to make forward progress if they fail to reclaim.  If we are 
truly blocked on mm->mmap_sem, waiting longer than one second to declare 
that seems justifiable to prevent the exact situation you describe.


Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap

2017-07-10 Thread David Rientjes
On Mon, 26 Jun 2017, Michal Hocko wrote:

> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3bd5ecd20d4d..253808e716dc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm)
>   /* Use -1 here to ensure all VMAs in the mm are unmapped */
>   unmap_vmas(&tlb, vma, 0, -1);
>  
> + /*
> +  * oom reaper might race with exit_mmap so make sure we won't free
> +  * page tables or unmap VMAs under its feet
> +  */
> + down_write(&mm->mmap_sem);
>   free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>   tlb_finish_mmu(&tlb, 0, -1);
>  
> @@ -2974,7 +2979,9 @@ void exit_mmap(struct mm_struct *mm)
>   nr_accounted += vma_pages(vma);
>   vma = remove_vma(vma);
>   }
> + mm->mmap = NULL;
>   vm_unacct_memory(nr_accounted);
> + up_write(&mm->mmap_sem);
>  }
>  
>  /* Insert vm structure into process list sorted by address
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 0e2c925e7826..5dc0ff22d567 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -472,36 +472,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, 
> struct mm_struct *mm)
>   struct vm_area_struct *vma;
>   bool ret = true;
>  
> - /*
> -  * We have to make sure to not race with the victim exit path
> -  * and cause premature new oom victim selection:
> -  * __oom_reap_task_mm   exit_mm
> -  *   mmget_not_zero
> -  *mmput
> -  *  atomic_dec_and_test
> -  *exit_oom_victim
> -  *  [...]
> -  *  out_of_memory
> -  *select_bad_process
> -  *  # no TIF_MEMDIE task selects new 
> victim
> -  *  unmap_page_range # frees some memory
> -  */
> - mutex_lock(&oom_lock);
> -
> - if (!down_read_trylock(&mm->mmap_sem)) {
> - ret = false;
> - goto unlock_oom;
> - }
> -
> - /*
> -  * increase mm_users only after we know we will reap something so
> -  * that the mmput_async is called only when we have reaped something
> -  * and delayed __mmput doesn't matter that much
> -  */
> - if (!mmget_not_zero(mm)) {
> - up_read(&mm->mmap_sem);
> - goto unlock_oom;
> - }
> + if (!down_read_trylock(&mm->mmap_sem))
> + return false;

I think this should return true if mm->mmap == NULL here.


Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap

2017-06-29 Thread Michal Hocko
Forgot to CC Hugh.

Hugh, Andrew, do you see this could cause any problem wrt.
ksm/khugepaged exit path?

On Mon 26-06-17 15:03:46, Michal Hocko wrote:
> From: Michal Hocko 
> 
> David has noticed that the oom killer might kill additional tasks while
> the existing victim hasn't terminated yet because the oom_reaper marks
> the current victim MMF_OOM_SKIP too early when mm->mm_users dropped down
> to 0. The race is as follows
> 
> oom_reap_task do_exit
> exit_mm
>   __oom_reap_task_mm
>   mmput
> __mmput
> mmget_not_zero # fails
>   exit_mmap # frees memory
>   set_bit(MMF_OOM_SKIP)
> 
> Currently we try to reduce the risk of this race by taking oom_lock
> and waiting for the out_of_memory sleep while holding the lock to give the
> victim some time to exit. This is quite a suboptimal approach because
> there is no guarantee the victim (especially a large one) will manage
> to unmap its address space and free enough memory to the particular oom
> domain which needs the memory (e.g. a specific NUMA node).
> 
> Fix this problem by allowing __oom_reap_task_mm and the __mmput path to
> race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
> to run in parallel with other unmappers (hence the mmap_sem for read).
> The only tricky part is that we have to exclude page table tear down and
> all operations which modify the address space in the __mmput path.
> exit_mmap doesn't expect any other users so it doesn't use any locking.
> Nothing really forbids us from using mmap_sem for write, though. In fact
> we are already relying on this lock earlier in the __mmput path to
> synchronize with ksm and khugepaged.
> 
> Take the exclusive mmap_sem when calling free_pgtables and destroying
> vmas to sync with __oom_reap_task_mm, which takes the lock for read. All
> other operations can safely race with the parallel unmap.
> 
> Reported-by: David Rientjes 
> Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
> Signed-off-by: Michal Hocko 
> ---
> 
> Hi,
> I am sending this as an RFC because I am not yet sure I haven't missed
> something subtle here but the approach should work in principle. I have
> run it through some of my OOM stress tests to see if anything blows up
> and it all went smoothly.
> 
> The issue has been brought up by David [1]. There were some attempts to
> address it in oom proper [2][3] but the first one would cause problems
> of its own [4] while the latter is just too hairy.
> 
> Thoughts, objections, alternatives?
> 
> [1] 
> http://lkml.kernel.org/r/alpine.deb.2.10.1706141632100.93...@chino.kir.corp.google.com
> [2] 
> http://lkml.kernel.org/r/201706171417.jhg48401.joqlhmfsvoo...@i-love.sakura.ne.jp
> [3] http://lkml.kernel.org/r/201706220053.v5m0rmou078...@www262.sakura.ne.jp
> [4] http://lkml.kernel.org/r/201706210217.v5l2hazc081...@www262.sakura.ne.jp
> 
>  mm/mmap.c |  7 +++
>  mm/oom_kill.c | 40 ++--
>  2 files changed, 9 insertions(+), 38 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3bd5ecd20d4d..253808e716dc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm)
>   /* Use -1 here to ensure all VMAs in the mm are unmapped */
>   unmap_vmas(&tlb, vma, 0, -1);
>  
> + /*
> +  * oom reaper might race with exit_mmap so make sure we won't free
> +  * page tables or unmap VMAs under its feet
> +  */
> + down_write(&mm->mmap_sem);
>   free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>   tlb_finish_mmu(&tlb, 0, -1);
>  
> @@ -2974,7 +2979,9 @@ void exit_mmap(struct mm_struct *mm)
>   nr_accounted += vma_pages(vma);
>   vma = remove_vma(vma);
>   }
> + mm->mmap = NULL;
>   vm_unacct_memory(nr_accounted);
> + up_write(&mm->mmap_sem);
>  }
>  
>  /* Insert vm structure into process list sorted by address
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 0e2c925e7826..5dc0ff22d567 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -472,36 +472,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, 
> struct mm_struct *mm)
>   struct vm_area_struct *vma;
>   bool ret = true;
>  
> - /*
> -  * We have to make sure to not race with the victim exit path
> -  * and cause premature new oom victim selection:
> -  * __oom_reap_task_mm   exit_mm
> -  *   mmget_not_zero
> -  *mmput
> -  *  atomic_dec_and_test
> -  *exit_oom_victim
> -  *  [...]
> -  *  out_of_memory
> -  *select_bad_process
> -  *  # no TIF_MEMDIE tas

Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap

2017-06-27 Thread Michal Hocko
On Tue 27-06-17 23:26:22, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 27-06-17 22:31:58, Tetsuo Handa wrote:
[...]
> > > shouldn't we try __oom_reap_task_mm() before calling these down_write()
> > > if mm is OOM victim's?
> > 
> > This is what we try. We simply try to get mmap_sem for read and do our
> > work as soon as possible with the proposed patch. This is already an
> > improvement, no?
> 
> > > We can ask the OOM reaper kernel thread to try to reap before the OOM killer
> > > releases the oom_lock mutex. But that is not guaranteed. It is possible that
> the OOM victim thread is executed until down_write() in __ksm_exit() or
> __khugepaged_exit() and then the OOM reaper kernel thread starts calling
> down_read_trylock().

I strongly suspect we are going off on a tangent here. While I see your
concern, and yes, the approach can probably be improved, can we focus on one
thing at a time? I would like to fix the original problem first and only
then go deeper down the rat hole of other subtle details. Do you have
any fundamental objection to the suggested approach or see any issues
with it?

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap

2017-06-27 Thread Tetsuo Handa
Michal Hocko wrote:
> On Tue 27-06-17 22:31:58, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 27-06-17 20:39:28, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > > I wonder why you prefer timeout based approach. Your patch will 
> > > > > > after all
> > > > > > set MMF_OOM_SKIP if operations between down_write() and up_write() 
> > > > > > took
> > > > > > more than one second.
> > > > > 
> > > > > if we reach down_write then we have unmapped the address space in
> > > > > exit_mmap and oom reaper cannot do much more.
> > > > 
> > > > So, by the time down_write() is called, the majority of memory is already
> > > > released, isn't it?
> > > 
> > > In most cases yes. To put it in other words, by the time exit_mmap
> > > takes down_write there is nothing more the oom reaper could reclaim.
> > > 
> > Then, aren't there two exceptions which your patch cannot guarantee;
> > down_write(&mm->mmap_sem) in __ksm_exit() and __khugepaged_exit() ?
> 
> yes it cannot. Those would be quite rare situations. Somebody holding
> the mmap_sem would have to block those for too long (and that too long
> might actually be forever if we are livelocked). We cannot rule
> that out of course and I would argue that it would be more appropriate
> to simply go after another task in those rare cases. There is not much
> we can really do. At some point the oom reaper has to give up and move
> on otherwise we are back to square one when OOM could deadlock...
> 
> Maybe we can actually get rid of this down_write but I would go that way
> only when it proves to be a real issue.
> 
> > Since for some reason exit_mmap() cannot be moved to before the
> > ksm_exit(mm)/khugepaged_exit(mm) calls,
> 
> 9ba692948008 ("ksm: fix oom deadlock") would tell you more about the
> ordering and the motivation.

I understand neither ksm nor khugepaged. But that commit was actually calling
ksm_exit() just before free_pgtables() in exit_mmap(). It is ba76149f47d8c939
("thp: khugepaged") which added the /* must run before exit_mmap */ comment.

> 
> > 
> > ksm_exit(mm);
> > khugepaged_exit(mm); /* must run before exit_mmap */
> > exit_mmap(mm);
> > 
> > shouldn't we try __oom_reap_task_mm() before calling these down_write()
> > if mm is OOM victim's?
> 
> This is what we try. We simply try to get mmap_sem for read and do our
> work as soon as possible with the proposed patch. This is already an
> improvement, no?

We can ask the OOM reaper kernel thread to try to reap before the OOM killer
releases the oom_lock mutex. But that is not guaranteed. It is possible that
the OOM victim thread is executed until down_write() in __ksm_exit() or
__khugepaged_exit() and then the OOM reaper kernel thread starts calling
down_read_trylock().


Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap

2017-06-27 Thread Michal Hocko
On Tue 27-06-17 22:31:58, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 27-06-17 20:39:28, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > > I wonder why you prefer timeout based approach. Your patch will after 
> > > > > all
> > > > > set MMF_OOM_SKIP if operations between down_write() and up_write() 
> > > > > took
> > > > > more than one second.
> > > > 
> > > > if we reach down_write then we have unmapped the address space in
> > > > exit_mmap and oom reaper cannot do much more.
> > > 
> > > So, by the time down_write() is called, the majority of memory is already
> > > released, isn't it?
> > 
> > In most cases yes. To put it in other words, by the time exit_mmap
> > takes down_write there is nothing more the oom reaper could reclaim.
> > 
> Then, aren't there two exceptions which your patch cannot guarantee;
> down_write(&mm->mmap_sem) in __ksm_exit() and __khugepaged_exit() ?

yes it cannot. Those would be quite rare situations. Somebody holding
the mmap_sem would have to block those for too long (and that too long
might actually be forever if we are livelocked). We cannot rule
that out of course and I would argue that it would be more appropriate
to simply go after another task in those rare cases. There is not much
we can really do. At some point the oom reaper has to give up and move
on otherwise we are back to square one when OOM could deadlock...

Maybe we can actually get rid of this down_write but I would go that way
only when it proves to be a real issue.

> Since for some reason exit_mmap() cannot be moved to before the
> ksm_exit(mm)/khugepaged_exit(mm) calls,

9ba692948008 ("ksm: fix oom deadlock") would tell you more about the
ordering and the motivation.

> 
>   ksm_exit(mm);
>   khugepaged_exit(mm); /* must run before exit_mmap */
>   exit_mmap(mm);
> 
> shouldn't we try __oom_reap_task_mm() before calling these down_write()
> if mm is OOM victim's?

This is what we try. We simply try to get mmap_sem for read and do our
work as soon as possible with the proposed patch. This is already an
improvement, no?

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap

2017-06-27 Thread Tetsuo Handa
Michal Hocko wrote:
> On Tue 27-06-17 20:39:28, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > I wonder why you prefer timeout based approach. Your patch will after 
> > > > all
> > > > set MMF_OOM_SKIP if operations between down_write() and up_write() took
> > > > more than one second.
> > > 
> > > if we reach down_write then we have unmapped the address space in
> > > exit_mmap and oom reaper cannot do much more.
> > 
> > So, by the time down_write() is called, the majority of memory is already
> > released, isn't it?
> 
> In most cases yes. To put it in other words, by the time exit_mmap
> takes down_write there is nothing more the oom reaper could reclaim.
> 
Then, aren't there two exceptions which your patch cannot guarantee;
down_write(&mm->mmap_sem) in __ksm_exit() and __khugepaged_exit() ?

Since for some reason exit_mmap() cannot be moved to before the
ksm_exit(mm)/khugepaged_exit(mm) calls,

ksm_exit(mm);
khugepaged_exit(mm); /* must run before exit_mmap */
exit_mmap(mm);

shouldn't we try __oom_reap_task_mm() before calling these down_write()
if mm is the OOM victim's?
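
Something along these lines, in __mmput() (a purely hypothetical sketch;
task_is_oom_victim_mm() is a made-up helper standing in for whatever check
would identify the victim's mm, and this ignores that __oom_reap_task_mm() is
currently static to mm/oom_kill.c):

	/* hypothetical: reap before the ksm/khugepaged exit paths can block */
	if (unlikely(task_is_oom_victim_mm(current, mm)))
		__oom_reap_task_mm(current, mm);

	ksm_exit(mm);
	khugepaged_exit(mm); /* must run before exit_mmap */
	exit_mmap(mm);

That way the reaper-equivalent work would be guaranteed to happen before any
down_write() in __ksm_exit()/__khugepaged_exit() can get stuck.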


Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap

2017-06-27 Thread Michal Hocko
On Tue 27-06-17 20:39:28, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > I wonder why you prefer timeout based approach. Your patch will after all
> > > set MMF_OOM_SKIP if operations between down_write() and up_write() took
> > > more than one second.
> > 
> > if we reach down_write then we have unmapped the address space in
> > exit_mmap and oom reaper cannot do much more.
> 
> So, by the time down_write() is called, the majority of memory is already
> released, isn't it?

In most cases yes. To put it in other words, by the time exit_mmap
takes down_write there is nothing more the oom reaper could reclaim.

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap

2017-06-27 Thread Tetsuo Handa
Michal Hocko wrote:
> > I wonder why you prefer timeout based approach. Your patch will after all
> > set MMF_OOM_SKIP if operations between down_write() and up_write() took
> > more than one second.
> 
> if we reach down_write then we have unmapped the address space in
> exit_mmap and oom reaper cannot do much more.

So, by the time down_write() is called, the majority of memory is already
released, isn't it?


Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap

2017-06-27 Thread Michal Hocko
On Tue 27-06-17 19:52:03, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > David has noticed that the oom killer might kill additional tasks while
> > the existing victim hasn't terminated yet because the oom_reaper marks
> > the current victim MMF_OOM_SKIP too early when mm->mm_users dropped down
> > to 0. The race is as follows
> > 
> > oom_reap_task   do_exit
> >   exit_mm
> >   __oom_reap_task_mm
> > mmput
> >   __mmput
> > mmget_not_zero # fails
> > exit_mmap # frees memory
> >   set_bit(MMF_OOM_SKIP)
> > 
> > Currently we try to reduce the risk of this race by taking oom_lock
> > and waiting for the out_of_memory sleep while holding the lock to give the
> > victim some time to exit. This is quite a suboptimal approach because
> > there is no guarantee the victim (especially a large one) will manage
> > to unmap its address space and free enough memory to the particular oom
> > domain which needs the memory (e.g. a specific NUMA node).
> > 
> > Fix this problem by allowing __oom_reap_task_mm and the __mmput path to
> > race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
> > to run in parallel with other unmappers (hence the mmap_sem for read).
> > The only tricky part is that we have to exclude page table tear down and
> > all operations which modify the address space in the __mmput path.
> > exit_mmap doesn't expect any other users so it doesn't use any locking.
> > Nothing really forbids us from using mmap_sem for write, though. In fact
> > we are already relying on this lock earlier in the __mmput path to
> > synchronize with ksm and khugepaged.
> > 
> > Take the exclusive mmap_sem when calling free_pgtables and destroying
> > vmas to sync with __oom_reap_task_mm, which takes the lock for read. All
> > other operations can safely race with the parallel unmap.
> > 
> > Reported-by: David Rientjes 
> > Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
> > Signed-off-by: Michal Hocko 
> > ---
> > 
> > Hi,
> > I am sending this as an RFC because I am not yet sure I haven't missed
> > something subtle here but the approach should work in principle. I have
> > run it through some of my OOM stress tests to see if anything blows up
> > and it all went smoothly.
> > 
> > The issue has been brought up by David [1]. There were some attempts to
> > address it in oom proper [2][3] but the first one would cause problems
> > of its own [4] while the latter is just too hairy.
> > 
> > Thoughts, objections, alternatives?
> 
> I wonder why you prefer timeout based approach. Your patch will after all
> set MMF_OOM_SKIP if operations between down_write() and up_write() took
> more than one second.

if we reach down_write then we have unmapped the address space in
exit_mmap and oom reaper cannot do much more.

> lock_anon_vma_root() from unlink_anon_vmas() from
> free_pgtables() for example calls down_write()/up_write(). unlink_file_vma()
>  from free_pgtables() for another example calls down_write()/up_write().
> This means that it might happen that exit_mmap() takes more than one second
> with mm->mmap_sem held for write, doesn't it?
> 
> The worst situation is that no memory is released by uprobe_clear_state(), 
> exit_aio(),
> ksm_exit(), khugepaged_exit() and operations before 
> down_write(&mm->mmap_sem), and then
> one second elapses before some memory is released after 
> down_write(&mm->mmap_sem).
> In that case, down_write()/up_write() in your patch helps nothing.
> 
> Less worst situation is that no memory is released by uprobe_clear_state(), 
> exit_aio(),
> ksm_exit(), khugepaged_exit() and operations before 
> down_write(&mm->mmap_sem), and then
> only some memory is released after down_write(&mm->mmap_sem) before one 
> second elapses.
> Someone might think that this is still premature.

This would basically mean that the oom victim had all its memory in
page tables and vma structures with basically nothing mapped. While this
is possible, it is something the oom reaper cannot really help with until
we start reclaiming page tables as well. I have had a plan for that but
never got to implement it so this is still on my todo list.

> More likely situation is that down_read_trylock(&mm->mmap_sem) in 
> __oom_reap_task_mm()
> succeeds before exit_mmap() calls down_write(&mm->mmap_sem) (especially true 
> if we remove
> mutex_lock(&oom_lock) from __oom_reap_task_mm()). In this case, your patch 
> merely gives
> uprobe_clear_state(), exit_aio(), ksm_exit(), khugepaged_exit() and 
> operations before
> down_write(&mm->mmap_sem) some time to release memory, for your patch will 
> after all set
> MMF_OOM_SKIP immediately after __oom_reap_task_mm() called 
> up_read(&mm->mmap_sem). If we
> assume that majority of memory is released by operations between
> down_write(&mm->mmap_se

Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap

2017-06-27 Thread Tetsuo Handa
Michal Hocko wrote:
> From: Michal Hocko 
> 
> David has noticed that the oom killer might kill additional tasks while
> the existing victim hasn't terminated yet because the oom_reaper marks
> the current victim MMF_OOM_SKIP too early when mm->mm_users dropped down
> to 0. The race is as follows
> 
> oom_reap_task do_exit
> exit_mm
>   __oom_reap_task_mm
>   mmput
> __mmput
> mmget_not_zero # fails
>   exit_mmap # frees memory
>   set_bit(MMF_OOM_SKIP)
> 
> Currently we try to reduce the risk of this race by taking oom_lock
> and waiting for the out_of_memory sleep while holding the lock to give the
> victim some time to exit. This is quite a suboptimal approach because
> there is no guarantee the victim (especially a large one) will manage
> to unmap its address space and free enough memory to the particular oom
> domain which needs the memory (e.g. a specific NUMA node).
> 
> Fix this problem by allowing __oom_reap_task_mm and the __mmput path to
> race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
> to run in parallel with other unmappers (hence the mmap_sem for read).
> The only tricky part is that we have to exclude page table tear down and
> all operations which modify the address space in the __mmput path.
> exit_mmap doesn't expect any other users so it doesn't use any locking.
> Nothing really forbids us from using mmap_sem for write, though. In fact
> we are already relying on this lock earlier in the __mmput path to
> synchronize with ksm and khugepaged.
> 
> Take the exclusive mmap_sem when calling free_pgtables and destroying
> vmas to sync with __oom_reap_task_mm, which takes the lock for read. All
> other operations can safely race with the parallel unmap.
> 
> Reported-by: David Rientjes 
> Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
> Signed-off-by: Michal Hocko 
> ---
> 
> Hi,
> I am sending this as an RFC because I am not yet sure I haven't missed
> something subtle here but the approach should work in principle. I have
> run it through some of my OOM stress tests to see if anything blows up
> and it all went smoothly.
> 
> The issue has been brought up by David [1]. There were some attempts to
> address it in oom proper [2][3] but the first one would cause problems
> of its own [4] while the latter is just too hairy.
> 
> Thoughts, objections, alternatives?

I wonder why you prefer a timeout based approach. Your patch will after all
set MMF_OOM_SKIP if operations between down_write() and up_write() took
more than one second. lock_anon_vma_root() from unlink_anon_vmas() from
free_pgtables() for example calls down_write()/up_write(). unlink_file_vma()
from free_pgtables() for another example calls down_write()/up_write().
This means that it might happen that exit_mmap() takes more than one second
with mm->mmap_sem held for write, doesn't it?

The worst situation is that no memory is released by uprobe_clear_state(),
exit_aio(), ksm_exit(), khugepaged_exit() and operations before
down_write(&mm->mmap_sem), and then one second elapses before some memory is
released after down_write(&mm->mmap_sem). In that case,
down_write()/up_write() in your patch helps nothing.

A less bad situation is that no memory is released by uprobe_clear_state(),
exit_aio(), ksm_exit(), khugepaged_exit() and operations before
down_write(&mm->mmap_sem), and then only some memory is released after
down_write(&mm->mmap_sem) before one second elapses. Someone might think
that this is still premature.

A more likely situation is that down_read_trylock(&mm->mmap_sem) in
__oom_reap_task_mm() succeeds before exit_mmap() calls
down_write(&mm->mmap_sem) (especially true if we remove
mutex_lock(&oom_lock) from __oom_reap_task_mm()). In this case, your patch
merely gives uprobe_clear_state(), exit_aio(), ksm_exit(), khugepaged_exit()
and operations before down_write(&mm->mmap_sem) some time to release memory,
for your patch will after all set MMF_OOM_SKIP immediately after
__oom_reap_task_mm() called up_read(&mm->mmap_sem). If we assume that the
majority of memory is released by operations between
down_write(&mm->mmap_sem)/up_write(&mm->mmap_sem) in exit_mm(), this is not a
preferable behavior.

My patch [3] cannot give uprobe_clear_state(), exit_aio(), ksm_exit(),
khugepaged_exit() and exit_mm() some time to release memory. But [3] can
guarantee that all memory which the OOM reaper can reclaim is reclaimed
before setting MMF_OOM_SKIP.

If we wait for another second after setting MMF_OOM_SKIP, we could give
operations between down_write(&mm->mmap_sem)/up_write(&mm->mmap_sem) in
exit_mm() (in your patch) or __mmput() (in my patch) some more chance to
reclaim memory before next OOM victim is selected.

> 
> [1] 
> http://lkml.kernel.org/r/alpine.deb.2.10.1706141632100.93...@chino.kir.corp.google.com
> [2