Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap
On Thu, 20 Jul 2017, Michal Hocko wrote: > On Wed 19-07-17 18:18:27, Hugh Dickins wrote: > > > > But I haven't looked at the oom_kill or oom_reaper end of it at all, > > perhaps you have an overriding argument on the placement from that end. > > Well, the main problem here is that the oom_reaper tries to > MADV_DONTNEED the oom victim and then hide it from the oom killer (by > setting MMF_OOM_SKIP) to guarantee forward progress. In order to do > that it needs mmap_sem for read. Currently we try to avoid races with > the exit path by checking mm->mm_users and that can lead to premature > MMF_OOM_SKIP and that in turn to additional oom victim(s) selection > while the current one is still tearing the address space down. > > One way around that is to allow the final unmap to race with the oom_reaper > tear down. > > I hope this clarifies the motivation Thanks, yes, if you have a good reason of that kind, then I agree that it's appropriate to leave the down_write(mmap_sem) until reaching the free_pgtables() stage. Hugh
Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap
On Wed 19-07-17 18:18:27, Hugh Dickins wrote: > On Wed, 19 Jul 2017, Michal Hocko wrote: > > On Thu 29-06-17 10:46:21, Michal Hocko wrote: > > > Forgot to CC Hugh. > > > > > > Hugh, Andrew, do you see this could cause any problem wrt. > > > ksm/khugepaged exit path? > > > > ping. I would really appreciate some help here. I would like to resend > > the patch soon. > > Sorry, Michal, I've been hiding from everyone. > > No, I don't think your patch will cause any trouble for the ksm or > khugepaged exit path; but we'll find out for sure when akpm puts it > in mmotm - I doubt I'll get to trying it out in advance of that. > > On the contrary, I think it will allow us to remove the peculiar > "down_write(mmap_sem); up_write(mmap_sem);" from those exit paths: > which were there to serialize, precisely because exit_mmap() did > not otherwise take mmap_sem; but you're now changing it to do so. I was actually suspecting this could be done but didn't get to study the code to be sure enough, your words are surely encouraging... > You could add a patch to remove those yourself, or any of us add > that on afterwards. I will add it on my todo list and let's see when I get there. > But I don't entirely agree (or disagree) with your placement: > see comment below. [...] > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > > > index 3bd5ecd20d4d..253808e716dc 100644 > > > > --- a/mm/mmap.c > > > > +++ b/mm/mmap.c > > > > @@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm) > > > > /* Use -1 here to ensure all VMAs in the mm are unmapped */ > > > > unmap_vmas(&tlb, vma, 0, -1); > > > > > > > > + /* > > > > +* oom reaper might race with exit_mmap so make sure we won't > > > > free > > > > +* page tables or unmap VMAs under its feet > > > > +*/ > > > > + down_write(&mm->mmap_sem); > > Hmm. I'm conflicted about this. 
From a design point of view, I would > very much prefer you to take the mmap_sem higher up, maybe just before > or after the mmu_notifier_release() or arch_exit_mmap() (depends on > what those actually do): anyway before the unmap_vmas(). The thing is that I _want_ unmap_vmas to race with the oom reaper so I cannot take the write lock before unmap_vmas... If this whole area should be covered by the write lock then I would need a handshake mechanism between the oom reaper and the final unmap_vmas to know that the oom reaper won't set MMF_OOM_SKIP prematurely (see more on that below). > Because the things which go on in exit_mmap() are things which we expect > mmap_sem to be held across, and we get caught out when it is not: it's > awkward and error-prone enough that MADV_DONTNEED and MADV_FREE (for > very good reason) do things with only down_read(mmap_sem). But there's > a number of times (ksm exit being only one of them) when I've found it > a nuisance that we had no proper way of serializing against exit_mmap(). > > I'm conflicted because, on the other hand, I'm staunchly against adding > obstructions ("robust" futexes? gah!) into the exit path, or widening > the use of locks that are not strictly needed. But wouldn't it be the > case here, that most contenders on the mmap_sem must hold a reference > to mm_users, and that prevents any possibility of racing exit_mmap(); > only ksm and khugepaged, and any others who already need such mmap_sem > tricks to serialize against exit_mmap(), could offer any contention. > > But I haven't looked at the oom_kill or oom_reaper end of it at all, > perhaps you have an overriding argument on the placement from that end. Well, the main problem here is that the oom_reaper tries to MADV_DONTNEED the oom victim and then hide it from the oom killer (by setting MMF_OOM_SKIP) to guarantee forward progress. In order to do that it needs mmap_sem for read. 
Currently we try to avoid races with the exit path by checking mm->mm_users and that can lead to premature MMF_OOM_SKIP and that in turn to additional oom victim(s) selection while the current one is still tearing the address space down. One way around that is to allow the final unmap to race with the oom_reaper tear down. I hope this clarifies the motivation > Hugh > > [Not strictly relevant here, but a related note: I was very surprised > to discover, only quite recently, how handle_mm_fault() may be called > without down_read(mmap_sem) - when core dumping. That seems a > misguided optimization to me, which would also be nice to correct; > but again I might not appreciate the full picture.] shrug -- Michal Hocko SUSE Labs
Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap
On Wed, 19 Jul 2017, Michal Hocko wrote: > On Thu 29-06-17 10:46:21, Michal Hocko wrote: > > Forgot to CC Hugh. > > > > Hugh, Andrew, do you see this could cause any problem wrt. > > ksm/khugepaged exit path? > > ping. I would really appreciate some help here. I would like to resend > the patch soon. Sorry, Michal, I've been hiding from everyone. No, I don't think your patch will cause any trouble for the ksm or khugepaged exit path; but we'll find out for sure when akpm puts it in mmotm - I doubt I'll get to trying it out in advance of that. On the contrary, I think it will allow us to remove the peculiar "down_write(mmap_sem); up_write(mmap_sem);" from those exit paths: which were there to serialize, precisely because exit_mmap() did not otherwise take mmap_sem; but you're now changing it to do so. You could add a patch to remove those yourself, or any of us add that on afterwards. But I don't entirely agree (or disagree) with your placement: see comment below. > > > On Mon 26-06-17 15:03:46, Michal Hocko wrote: > > > From: Michal Hocko > > > > > > David has noticed that the oom killer might kill additional tasks while > > > the existing victim hasn't terminated yet because the oom_reaper marks > > > the curent victim MMF_OOM_SKIP too early when mm->mm_users dropped down > > > to 0. The race is as follows > > > > > > oom_reap_task do_exit > > > exit_mm > > > __oom_reap_task_mm > > > mmput > > > __mmput > > > mmget_not_zero # fails > > > exit_mmap # frees memory > > > set_bit(MMF_OOM_SKIP) > > > > > > Currently we are try to reduce a risk of this race by taking oom_lock > > > and wait for out_of_memory sleep while holding the lock to give the > > > victim some time to exit. This is quite suboptimal approach because > > > there is no guarantee the victim (especially a large one) will manage > > > to unmap its address space and free enough memory to the particular oom > > > domain which needs a memory (e.g. a specific NUMA node). 
> > > > > > Fix this problem by allowing __oom_reap_task_mm and __mmput path to > > > race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed > > > to run in parallel with other unmappers (hence the mmap_sem for read). > > > The only tricky part is we have to exclude page tables tear down and all > > > operations which modify the address space in the __mmput path. exit_mmap > > > doesn't expect any other users so it doesn't use any locking. Nothing > > > really forbids us to use mmap_sem for write, though. In fact we are > > > already relying on this lock earlier in the __mmput path to synchronize > > > with ksm and khugepaged. > > > > > > Take the exclusive mmap_sem when calling free_pgtables and destroying > > > vmas to sync with __oom_reap_task_mm which take the lock for read. All > > > other operations can safely race with the parallel unmap. > > > > > > Reported-by: David Rientjes > > > Fixes: 26db62f179d1 ("oom: keep mm of the killed task available") > > > Signed-off-by: Michal Hocko > > > --- > > > > > > Hi, > > > I am sending this as an RFC because I am not yet sure I haven't missed > > > something subtle here but the appoach should work in principle. I have > > > run it through some of my OOM stress tests to see if anything blows up > > > and it all went smoothly. > > > > > > The issue has been brought up by David [1]. There were some attempts to > > > address it in oom proper [2][3] but the first one would cause problems > > > on their own [4] while the later is just too hairy. > > > > > > Thoughts, objections, alternatives? 
> > > > > > [1] > > > http://lkml.kernel.org/r/alpine.deb.2.10.1706141632100.93...@chino.kir.corp.google.com > > > [2] > > > http://lkml.kernel.org/r/201706171417.jhg48401.joqlhmfsvoo...@i-love.sakura.ne.jp > > > [3] > > > http://lkml.kernel.org/r/201706220053.v5m0rmou078...@www262.sakura.ne.jp > > > [4] > > > http://lkml.kernel.org/r/201706210217.v5l2hazc081...@www262.sakura.ne.jp > > > > > > mm/mmap.c | 7 +++ > > > mm/oom_kill.c | 40 ++-- > > > 2 files changed, 9 insertions(+), 38 deletions(-) > > > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > > index 3bd5ecd20d4d..253808e716dc 100644 > > > --- a/mm/mmap.c > > > +++ b/mm/mmap.c > > > @@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm) > > > /* Use -1 here to ensure all VMAs in the mm are unmapped */ > > > unmap_vmas(&tlb, vma, 0, -1); > > > > > > + /* > > > + * oom reaper might race with exit_mmap so make sure we won't free > > > + * page tables or unmap VMAs under its feet > > > + */ > > > + down_write(&mm->mmap_sem); Hmm. I'm conflicted about this. From a design point of view, I would very much prefer you to take the mmap_sem higher up, maybe just before or after the mmu_notifier_release() o
Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap
On Thu 29-06-17 10:46:21, Michal Hocko wrote: > Forgot to CC Hugh. > > Hugh, Andrew, do you see this could cause any problem wrt. > ksm/khugepaged exit path? ping. I would really appreciate some help here. I would like to resend the patch soon. > On Mon 26-06-17 15:03:46, Michal Hocko wrote: > > From: Michal Hocko > > > > David has noticed that the oom killer might kill additional tasks while > > the existing victim hasn't terminated yet because the oom_reaper marks > > the curent victim MMF_OOM_SKIP too early when mm->mm_users dropped down > > to 0. The race is as follows > > > > oom_reap_task do_exit > > exit_mm > > __oom_reap_task_mm > > mmput > > __mmput > > mmget_not_zero # fails > > exit_mmap # frees memory > > set_bit(MMF_OOM_SKIP) > > > > Currently we are try to reduce a risk of this race by taking oom_lock > > and wait for out_of_memory sleep while holding the lock to give the > > victim some time to exit. This is quite suboptimal approach because > > there is no guarantee the victim (especially a large one) will manage > > to unmap its address space and free enough memory to the particular oom > > domain which needs a memory (e.g. a specific NUMA node). > > > > Fix this problem by allowing __oom_reap_task_mm and __mmput path to > > race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed > > to run in parallel with other unmappers (hence the mmap_sem for read). > > The only tricky part is we have to exclude page tables tear down and all > > operations which modify the address space in the __mmput path. exit_mmap > > doesn't expect any other users so it doesn't use any locking. Nothing > > really forbids us to use mmap_sem for write, though. In fact we are > > already relying on this lock earlier in the __mmput path to synchronize > > with ksm and khugepaged. > > > > Take the exclusive mmap_sem when calling free_pgtables and destroying > > vmas to sync with __oom_reap_task_mm which take the lock for read. 
All > > other operations can safely race with the parallel unmap. > > > > Reported-by: David Rientjes > > Fixes: 26db62f179d1 ("oom: keep mm of the killed task available") > > Signed-off-by: Michal Hocko > > --- > > > > Hi, > > I am sending this as an RFC because I am not yet sure I haven't missed > > something subtle here but the appoach should work in principle. I have > > run it through some of my OOM stress tests to see if anything blows up > > and it all went smoothly. > > > > The issue has been brought up by David [1]. There were some attempts to > > address it in oom proper [2][3] but the first one would cause problems > > on their own [4] while the later is just too hairy. > > > > Thoughts, objections, alternatives? > > > > [1] > > http://lkml.kernel.org/r/alpine.deb.2.10.1706141632100.93...@chino.kir.corp.google.com > > [2] > > http://lkml.kernel.org/r/201706171417.jhg48401.joqlhmfsvoo...@i-love.sakura.ne.jp > > [3] http://lkml.kernel.org/r/201706220053.v5m0rmou078...@www262.sakura.ne.jp > > [4] http://lkml.kernel.org/r/201706210217.v5l2hazc081...@www262.sakura.ne.jp > > > > mm/mmap.c | 7 +++ > > mm/oom_kill.c | 40 ++-- > > 2 files changed, 9 insertions(+), 38 deletions(-) > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > index 3bd5ecd20d4d..253808e716dc 100644 > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm) > > /* Use -1 here to ensure all VMAs in the mm are unmapped */ > > unmap_vmas(&tlb, vma, 0, -1); > > > > + /* > > +* oom reaper might race with exit_mmap so make sure we won't free > > +* page tables or unmap VMAs under its feet > > +*/ > > + down_write(&mm->mmap_sem); > > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING); > > tlb_finish_mmu(&tlb, 0, -1); > > > > @@ -2974,7 +2979,9 @@ void exit_mmap(struct mm_struct *mm) > > nr_accounted += vma_pages(vma); > > vma = remove_vma(vma); > > } > > + mm->mmap = NULL; > > vm_unacct_memory(nr_accounted); > > + up_write(&mm->mmap_sem); > > 
} > > > > /* Insert vm structure into process list sorted by address > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > index 0e2c925e7826..5dc0ff22d567 100644 > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -472,36 +472,8 @@ static bool __oom_reap_task_mm(struct task_struct > > *tsk, struct mm_struct *mm) > > struct vm_area_struct *vma; > > bool ret = true; > > > > - /* > > -* We have to make sure to not race with the victim exit path > > -* and cause premature new oom victim selection: > > -* __oom_reap_task_mm exit_mm > > -* mmget_not_zero > > -*mmput > > -*
Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap
On Tue 11-07-17 13:40:04, David Rientjes wrote: > On Tue, 11 Jul 2017, Michal Hocko wrote: > > > This? > > --- > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > index 5dc0ff22d567..e155d1d8064f 100644 > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -470,11 +470,14 @@ static bool __oom_reap_task_mm(struct task_struct > > *tsk, struct mm_struct *mm) > > { > > struct mmu_gather tlb; > > struct vm_area_struct *vma; > > - bool ret = true; > > > > if (!down_read_trylock(&mm->mmap_sem)) > > return false; > > > > + /* There is nothing to reap so bail out without signs in the log */ > > + if (!mm->mmap) > > + goto unlock; > > + > > /* > > * Tell all users of get_user/copy_from_user etc... that the content > > * is no longer stable. No barriers really needed because unmapping > > @@ -508,9 +511,10 @@ static bool __oom_reap_task_mm(struct task_struct > > *tsk, struct mm_struct *mm) > > K(get_mm_counter(mm, MM_ANONPAGES)), > > K(get_mm_counter(mm, MM_FILEPAGES)), > > K(get_mm_counter(mm, MM_SHMEMPAGES))); > > +unlock: > > up_read(&mm->mmap_sem); > > > > - return ret; > > + return true; > > } > > > > #define MAX_OOM_REAP_RETRIES 10 > > Yes, this folded in with the original RFC patch appears to work better > with light testing. Yes folding it into the original patch was the plan. I would really appreciate some Tested-by here. > However, I think MAX_OOM_REAP_RETRIES and/or the timeout of HZ/10 needs to > be increased as well to address the issue that Tetsuo pointed out. The > oom reaper shouldn't be required to do any work unless it is resolving a > livelock, and that scenario should be relatively rare. The oom killer > being a natural ultra slow path, I think it would be justifiable to wait > longer or retry more times than simply 1 second before declaring that > reaping is not possible. It reduces the likelihood of additional oom > killing. 
I believe this is an independent issue and as such it should be addressed separately, along with some data backing up that decision. I am not against improving the waiting logic. We would need some requeuing for when we cannot reap the victim, because we cannot afford to wait too long on a single oom victim considering there might be many victims queued (because of memcg ooms). This would obviously need some more code and I am willing to implement it, but I would first like to see evidence that this is a real problem. Thanks! -- Michal Hocko SUSE Labs
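The requeuing Michal mentions could take roughly this shape: instead of burning the whole retry budget on one stuck victim, a failed reap attempt puts the victim back at the tail of a queue so other (e.g. memcg) victims get serviced in the meantime. Everything here is a hypothetical sketch -- the queue, the names, and the policy are illustrative, not existing kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_VICTIMS 8

/*
 * Hypothetical sketch of the requeuing idea discussed above: a fixed
 * ring buffer of pending oom victims.  A victim whose reap attempt
 * fails goes to the back of the queue rather than being retried
 * immediately, so other queued victims are not starved.
 */
struct victim_queue {
    void *victims[MAX_VICTIMS];
    size_t head, len;
};

static bool vq_pop(struct victim_queue *q, void **out)
{
    if (q->len == 0)
        return false;
    *out = q->victims[q->head];
    q->head = (q->head + 1) % MAX_VICTIMS;
    q->len--;
    return true;
}

static bool vq_push(struct victim_queue *q, void *v)
{
    if (q->len == MAX_VICTIMS)
        return false;
    q->victims[(q->head + q->len) % MAX_VICTIMS] = v;
    q->len++;
    return true;
}

/* One reaper pass: pop a victim, try to reap it, requeue on failure
 * (try_reap stands in for __oom_reap_task_mm()). */
static void reaper_pass(struct victim_queue *q, bool (*try_reap)(void *))
{
    void *v;
    if (!vq_pop(q, &v))
        return;
    if (!try_reap(v))
        vq_push(q, v);  /* go after another victim, come back later */
}

/* Example predicate: a victim is "reapable" once its flag is set. */
static bool reap_flag(void *v) { return *(bool *)v; }
```

A real implementation would additionally need a per-victim attempt counter so a permanently stuck victim is eventually hidden behind MMF_OOM_SKIP rather than circulating forever.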
Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap
On Tue, 11 Jul 2017, Michal Hocko wrote: > This? > --- > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 5dc0ff22d567..e155d1d8064f 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -470,11 +470,14 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, > struct mm_struct *mm) > { > struct mmu_gather tlb; > struct vm_area_struct *vma; > - bool ret = true; > > if (!down_read_trylock(&mm->mmap_sem)) > return false; > > + /* There is nothing to reap so bail out without signs in the log */ > + if (!mm->mmap) > + goto unlock; > + > /* >* Tell all users of get_user/copy_from_user etc... that the content >* is no longer stable. No barriers really needed because unmapping > @@ -508,9 +511,10 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, > struct mm_struct *mm) > K(get_mm_counter(mm, MM_ANONPAGES)), > K(get_mm_counter(mm, MM_FILEPAGES)), > K(get_mm_counter(mm, MM_SHMEMPAGES))); > +unlock: > up_read(&mm->mmap_sem); > > - return ret; > + return true; > } > > #define MAX_OOM_REAP_RETRIES 10 Yes, this folded in with the original RFC patch appears to work better with light testing. However, I think MAX_OOM_REAP_RETRIES and/or the timeout of HZ/10 needs to be increased as well to address the issue that Tetsuo pointed out. The oom reaper shouldn't be required to do any work unless it is resolving a livelock, and that scenario should be relatively rare. The oom killer being a natural ultra slow path, I think it would be justifiable to wait longer or retry more times than simply 1 second before declaring that reaping is not possible. It reduces the likelihood of additional oom killing.
Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap
On Mon 10-07-17 16:55:22, David Rientjes wrote: > On Mon, 26 Jun 2017, Michal Hocko wrote: > > > diff --git a/mm/mmap.c b/mm/mmap.c > > index 3bd5ecd20d4d..253808e716dc 100644 > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm) > > /* Use -1 here to ensure all VMAs in the mm are unmapped */ > > unmap_vmas(&tlb, vma, 0, -1); > > > > + /* > > +* oom reaper might race with exit_mmap so make sure we won't free > > +* page tables or unmap VMAs under its feet > > +*/ > > + down_write(&mm->mmap_sem); > > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING); > > tlb_finish_mmu(&tlb, 0, -1); > > > > @@ -2974,7 +2979,9 @@ void exit_mmap(struct mm_struct *mm) > > nr_accounted += vma_pages(vma); > > vma = remove_vma(vma); > > } > > + mm->mmap = NULL; > > vm_unacct_memory(nr_accounted); > > + up_write(&mm->mmap_sem); > > } > > > > /* Insert vm structure into process list sorted by address > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > index 0e2c925e7826..5dc0ff22d567 100644 > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -472,36 +472,8 @@ static bool __oom_reap_task_mm(struct task_struct > > *tsk, struct mm_struct *mm) > > struct vm_area_struct *vma; > > bool ret = true; > > > > - /* > > -* We have to make sure to not race with the victim exit path > > -* and cause premature new oom victim selection: > > -* __oom_reap_task_mm exit_mm > > -* mmget_not_zero > > -*mmput > > -* atomic_dec_and_test > > -*exit_oom_victim > > -* [...] 
> > -* out_of_memory > > -*select_bad_process > > -* # no TIF_MEMDIE task selects new > > victim > > -* unmap_page_range # frees some memory > > -*/ > > - mutex_lock(&oom_lock); > > - > > - if (!down_read_trylock(&mm->mmap_sem)) { > > - ret = false; > > - goto unlock_oom; > > - } > > - > > - /* > > -* increase mm_users only after we know we will reap something so > > -* that the mmput_async is called only when we have reaped something > > -* and delayed __mmput doesn't matter that much > > -*/ > > - if (!mmget_not_zero(mm)) { > > - up_read(&mm->mmap_sem); > > - goto unlock_oom; > > - } > > + if (!down_read_trylock(&mm->mmap_sem)) > > + return false; > > I think this should return true if mm->mmap == NULL here. This? --- diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 5dc0ff22d567..e155d1d8064f 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -470,11 +470,14 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) { struct mmu_gather tlb; struct vm_area_struct *vma; - bool ret = true; if (!down_read_trylock(&mm->mmap_sem)) return false; + /* There is nothing to reap so bail out without signs in the log */ + if (!mm->mmap) + goto unlock; + /* * Tell all users of get_user/copy_from_user etc... that the content * is no longer stable. No barriers really needed because unmapping @@ -508,9 +511,10 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) K(get_mm_counter(mm, MM_ANONPAGES)), K(get_mm_counter(mm, MM_FILEPAGES)), K(get_mm_counter(mm, MM_SHMEMPAGES))); +unlock: up_read(&mm->mmap_sem); - return ret; + return true; } #define MAX_OOM_REAP_RETRIES 10 -- Michal Hocko SUSE Labs
Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap
On Tue, 27 Jun 2017, Tetsuo Handa wrote: > I wonder why you prefer timeout based approach. Your patch will after all > set MMF_OOM_SKIP if operations between down_write() and up_write() took > more than one second. lock_anon_vma_root() from unlink_anon_vmas() from > free_pgtables() for example calls down_write()/up_write(). unlink_file_vma() > from free_pgtables() for another example calls down_write()/up_write(). > This means that it might happen that exit_mmap() takes more than one second > with mm->mmap_sem held for write, doesn't this? > I certainly have no objection to increasing the timeout period or increasing MAX_OOM_REAP_RETRIES to be substantially higher. All threads holding mm->mmap_sem should be oom killed and be able to access memory reserves to make forward progress if they fail to reclaim. If we are truly blocked on mm->mmap_sem, waiting longer than one second to declare that seems justifiable to prevent the exact situation you describe.
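For reference, the retry policy under discussion looks roughly like this: the reaper makes up to MAX_OOM_REAP_RETRIES (10) attempts with a short sleep between them (HZ/10 in the kernel, so about one second total), and the victim ends up hidden behind MMF_OOM_SKIP whether or not any attempt succeeded. A toy model of the loop, with the sleep omitted and callbacks standing in for __oom_reap_task_mm():

```c
#include <assert.h>
#include <stdbool.h>

/* Matches the constant in the patch under discussion. */
#define MAX_OOM_REAP_RETRIES 10

/*
 * Toy model of the oom_reap_task() retry loop: try_reap stands in for
 * __oom_reap_task_mm(), and the inter-attempt sleep is omitted.
 * Returns the number of attempts made; *reaped reports whether any
 * attempt succeeded -- if none did, the caller gives up on the victim.
 */
static int reap_with_retries(bool (*try_reap)(void *), void *mm, bool *reaped)
{
    int attempts = 0;

    *reaped = false;
    while (attempts < MAX_OOM_REAP_RETRIES && !*reaped) {
        attempts++;
        *reaped = try_reap(mm);
        /* a schedule_timeout-style sleep would go here */
    }
    return attempts;
}

/* Example callbacks for exercising the loop. */
static bool always_blocked(void *mm) { (void)mm; return false; }
static bool third_time_lucky(void *arg) { return ++*(int *)arg == 3; }
```

David's and Tetsuo's point maps directly onto the two knobs here: raising MAX_OOM_REAP_RETRIES or lengthening the sleep both extend the window before the reaper declares defeat.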
Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap
On Mon, 26 Jun 2017, Michal Hocko wrote: > diff --git a/mm/mmap.c b/mm/mmap.c > index 3bd5ecd20d4d..253808e716dc 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm) > /* Use -1 here to ensure all VMAs in the mm are unmapped */ > unmap_vmas(&tlb, vma, 0, -1); > > + /* > + * oom reaper might race with exit_mmap so make sure we won't free > + * page tables or unmap VMAs under its feet > + */ > + down_write(&mm->mmap_sem); > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING); > tlb_finish_mmu(&tlb, 0, -1); > > @@ -2974,7 +2979,9 @@ void exit_mmap(struct mm_struct *mm) > nr_accounted += vma_pages(vma); > vma = remove_vma(vma); > } > + mm->mmap = NULL; > vm_unacct_memory(nr_accounted); > + up_write(&mm->mmap_sem); > } > > /* Insert vm structure into process list sorted by address > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 0e2c925e7826..5dc0ff22d567 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -472,36 +472,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, > struct mm_struct *mm) > struct vm_area_struct *vma; > bool ret = true; > > - /* > - * We have to make sure to not race with the victim exit path > - * and cause premature new oom victim selection: > - * __oom_reap_task_mm exit_mm > - * mmget_not_zero > - *mmput > - * atomic_dec_and_test > - *exit_oom_victim > - * [...] 
> - * out_of_memory > - *select_bad_process > - * # no TIF_MEMDIE task selects new > victim > - * unmap_page_range # frees some memory > - */ > - mutex_lock(&oom_lock); > - > - if (!down_read_trylock(&mm->mmap_sem)) { > - ret = false; > - goto unlock_oom; > - } > - > - /* > - * increase mm_users only after we know we will reap something so > - * that the mmput_async is called only when we have reaped something > - * and delayed __mmput doesn't matter that much > - */ > - if (!mmget_not_zero(mm)) { > - up_read(&mm->mmap_sem); > - goto unlock_oom; > - } > + if (!down_read_trylock(&mm->mmap_sem)) > + return false; I think this should return true if mm->mmap == NULL here.
Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap
Forgot to CC Hugh. Hugh, Andrew, do you see this could cause any problem wrt. ksm/khugepaged exit path? On Mon 26-06-17 15:03:46, Michal Hocko wrote: > From: Michal Hocko > > David has noticed that the oom killer might kill additional tasks while > the existing victim hasn't terminated yet because the oom_reaper marks > the current victim MMF_OOM_SKIP too early when mm->mm_users dropped down > to 0. The race is as follows > > oom_reap_task do_exit > exit_mm > __oom_reap_task_mm > mmput > __mmput > mmget_not_zero # fails > exit_mmap # frees memory > set_bit(MMF_OOM_SKIP) > > Currently we try to reduce the risk of this race by taking oom_lock > and waiting for the out_of_memory sleep while holding the lock to give the > victim some time to exit. This is quite a suboptimal approach because > there is no guarantee the victim (especially a large one) will manage > to unmap its address space and free enough memory to the particular oom > domain which needs memory (e.g. a specific NUMA node). > > Fix this problem by allowing __oom_reap_task_mm and __mmput path to > race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed > to run in parallel with other unmappers (hence the mmap_sem for read). > The only tricky part is we have to exclude page tables tear down and all > operations which modify the address space in the __mmput path. exit_mmap > doesn't expect any other users so it doesn't use any locking. Nothing > really forbids us to use mmap_sem for write, though. In fact we are > already relying on this lock earlier in the __mmput path to synchronize > with ksm and khugepaged. > > Take the exclusive mmap_sem when calling free_pgtables and destroying > vmas to sync with __oom_reap_task_mm which takes the lock for read. All > other operations can safely race with the parallel unmap. 
> > Reported-by: David Rientjes > Fixes: 26db62f179d1 ("oom: keep mm of the killed task available") > Signed-off-by: Michal Hocko > --- > > Hi, > I am sending this as an RFC because I am not yet sure I haven't missed > something subtle here but the approach should work in principle. I have > run it through some of my OOM stress tests to see if anything blows up > and it all went smoothly. > > The issue has been brought up by David [1]. There were some attempts to > address it in oom proper [2][3] but the first one would cause problems > on their own [4] while the latter is just too hairy. > > Thoughts, objections, alternatives? > > [1] > http://lkml.kernel.org/r/alpine.deb.2.10.1706141632100.93...@chino.kir.corp.google.com > [2] > http://lkml.kernel.org/r/201706171417.jhg48401.joqlhmfsvoo...@i-love.sakura.ne.jp > [3] http://lkml.kernel.org/r/201706220053.v5m0rmou078...@www262.sakura.ne.jp > [4] http://lkml.kernel.org/r/201706210217.v5l2hazc081...@www262.sakura.ne.jp > > mm/mmap.c | 7 +++ > mm/oom_kill.c | 40 ++-- > 2 files changed, 9 insertions(+), 38 deletions(-) > > diff --git a/mm/mmap.c b/mm/mmap.c > index 3bd5ecd20d4d..253808e716dc 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm) > /* Use -1 here to ensure all VMAs in the mm are unmapped */ > unmap_vmas(&tlb, vma, 0, -1); > > + /* > + * oom reaper might race with exit_mmap so make sure we won't free > + * page tables or unmap VMAs under its feet > + */ > + down_write(&mm->mmap_sem); > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING); > tlb_finish_mmu(&tlb, 0, -1); > > @@ -2974,7 +2979,9 @@ void exit_mmap(struct mm_struct *mm) > nr_accounted += vma_pages(vma); > vma = remove_vma(vma); > } > + mm->mmap = NULL; > vm_unacct_memory(nr_accounted); > + up_write(&mm->mmap_sem); > } > > /* Insert vm structure into process list sorted by address > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 0e2c925e7826..5dc0ff22d567 100644 > --- 
a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -472,36 +472,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, > struct mm_struct *mm) > struct vm_area_struct *vma; > bool ret = true; > > - /* > - * We have to make sure to not race with the victim exit path > - * and cause premature new oom victim selection: > - * __oom_reap_task_mm exit_mm > - * mmget_not_zero > - *mmput > - * atomic_dec_and_test > - *exit_oom_victim > - * [...] > - * out_of_memory > - *select_bad_process > - * # no TIF_MEMDIE tas
Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap
On Tue 27-06-17 23:26:22, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Tue 27-06-17 22:31:58, Tetsuo Handa wrote: [...] > > > shouldn't we try __oom_reap_task_mm() before calling these down_write() > > > if mm is OOM victim's? > > > > This is what we try. We simply try to get mmap_sem for read and do our > > work as soon as possible with the proposed patch. This is already an > > improvement, no? > > We can ask the OOM reaper kernel thread try to reap before the OOM killer > releases oom_lock mutex. But that is not guaranteed. It is possible that > the OOM victim thread is executed until down_write() in __ksm_exit() or > __khugepaged_exit() and then the OOM reaper kernel thread starts calling > down_read_trylock(). I strongly suspect we are going off on a tangent here. While I see your concern, and yes, the approach can probably be improved, can we focus on one thing at a time? I would like to fix the original problem first and only then go deeper down the rat hole of other subtle details. Do you have any fundamental objection to the suggested approach or see any issues with it? -- Michal Hocko SUSE Labs
Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap
Michal Hocko wrote:
> On Tue 27-06-17 22:31:58, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 27-06-17 20:39:28, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > > I wonder why you prefer a timeout-based approach. Your patch will after all
> > > > > > set MMF_OOM_SKIP if operations between down_write() and up_write() took
> > > > > > more than one second.
> > > > >
> > > > > if we reach down_write then we have unmapped the address space in
> > > > > exit_mmap and the oom reaper cannot do much more.
> > > >
> > > > So, by the time down_write() is called, the majority of memory is already
> > > > released, isn't it?
> > >
> > > In most cases yes. To put it in other words, by the time exit_mmap
> > > takes down_write there is nothing more the oom reaper could reclaim.
> > >
> > Then, aren't there two exceptions which your patch cannot guarantee:
> > down_write(&mm->mmap_sem) in __ksm_exit() and __khugepaged_exit()?
>
> yes it cannot. Those would be quite rare situations. Somebody holding
> the mmap sem would have to block those for too long (that too long might
> be forever actually if we are livelocked). We cannot rule that out of
> course and I would argue that it would be more appropriate to simply go
> after another task in those rare cases. There is not much we can really
> do. At some point the oom reaper has to give up and move on, otherwise
> we are back to square one where OOM could deadlock...
>
> Maybe we can actually get rid of this down_write but I would go that way
> only when it proves to be a real issue.
>
> > Since for some reason exit_mmap() cannot be brought to before the
> > ksm_exit(mm)/khugepaged_exit(mm) calls,
>
> 9ba692948008 ("ksm: fix oom deadlock") would tell you more about the
> ordering and the motivation.

I don't understand ksm or khugepaged. But that commit was actually calling
ksm_exit() just before free_pgtables() in exit_mmap(). It is ba76149f47d8c939
("thp: khugepaged") which added the /* must run before exit_mmap */ comment.

> > 	ksm_exit(mm);
> > 	khugepaged_exit(mm); /* must run before exit_mmap */
> > 	exit_mmap(mm);
> >
> > shouldn't we try __oom_reap_task_mm() before calling these down_write()
> > if mm is the OOM victim's?
>
> This is what we try. We simply try to get mmap_sem for read and do our
> work as soon as possible with the proposed patch. This is already an
> improvement, no?

We can ask the OOM reaper kernel thread to try to reap before the OOM killer
releases the oom_lock mutex. But that is not guaranteed. It is possible that
the OOM victim thread runs until down_write() in __ksm_exit() or
__khugepaged_exit() and only then does the OOM reaper kernel thread call
down_read_trylock().
Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap
On Tue 27-06-17 22:31:58, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 27-06-17 20:39:28, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > > I wonder why you prefer a timeout-based approach. Your patch will after all
> > > > > set MMF_OOM_SKIP if operations between down_write() and up_write() took
> > > > > more than one second.
> > > >
> > > > if we reach down_write then we have unmapped the address space in
> > > > exit_mmap and the oom reaper cannot do much more.
> > >
> > > So, by the time down_write() is called, the majority of memory is already
> > > released, isn't it?
> >
> > In most cases yes. To put it in other words, by the time exit_mmap
> > takes down_write there is nothing more the oom reaper could reclaim.
> >
> Then, aren't there two exceptions which your patch cannot guarantee:
> down_write(&mm->mmap_sem) in __ksm_exit() and __khugepaged_exit()?

yes it cannot. Those would be quite rare situations. Somebody holding
the mmap sem would have to block those for too long (that too long might
be forever actually if we are livelocked). We cannot rule that out of
course and I would argue that it would be more appropriate to simply go
after another task in those rare cases. There is not much we can really
do. At some point the oom reaper has to give up and move on, otherwise
we are back to square one where OOM could deadlock...

Maybe we can actually get rid of this down_write but I would go that way
only when it proves to be a real issue.

> Since for some reason exit_mmap() cannot be brought to before the
> ksm_exit(mm)/khugepaged_exit(mm) calls,

9ba692948008 ("ksm: fix oom deadlock") would tell you more about the
ordering and the motivation.

> 	ksm_exit(mm);
> 	khugepaged_exit(mm); /* must run before exit_mmap */
> 	exit_mmap(mm);
>
> shouldn't we try __oom_reap_task_mm() before calling these down_write()
> if mm is the OOM victim's?

This is what we try. We simply try to get mmap_sem for read and do our
work as soon as possible with the proposed patch. This is already an
improvement, no?
--
Michal Hocko
SUSE Labs
Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap
Michal Hocko wrote:
> On Tue 27-06-17 20:39:28, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > I wonder why you prefer a timeout-based approach. Your patch will after all
> > > > set MMF_OOM_SKIP if operations between down_write() and up_write() took
> > > > more than one second.
> > >
> > > if we reach down_write then we have unmapped the address space in
> > > exit_mmap and the oom reaper cannot do much more.
> >
> > So, by the time down_write() is called, the majority of memory is already
> > released, isn't it?
>
> In most cases yes. To put it in other words, by the time exit_mmap
> takes down_write there is nothing more the oom reaper could reclaim.
>
Then, aren't there two exceptions which your patch cannot guarantee:
down_write(&mm->mmap_sem) in __ksm_exit() and __khugepaged_exit()?

Since for some reason exit_mmap() cannot be brought to before the
ksm_exit(mm)/khugepaged_exit(mm) calls,

	ksm_exit(mm);
	khugepaged_exit(mm); /* must run before exit_mmap */
	exit_mmap(mm);

shouldn't we try __oom_reap_task_mm() before calling these down_write()
if mm is the OOM victim's?
Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap
On Tue 27-06-17 20:39:28, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > I wonder why you prefer a timeout-based approach. Your patch will after all
> > > set MMF_OOM_SKIP if operations between down_write() and up_write() took
> > > more than one second.
> >
> > if we reach down_write then we have unmapped the address space in
> > exit_mmap and the oom reaper cannot do much more.
>
> So, by the time down_write() is called, the majority of memory is already
> released, isn't it?

In most cases yes. To put it in other words, by the time exit_mmap
takes down_write there is nothing more the oom reaper could reclaim.
--
Michal Hocko
SUSE Labs
Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap
Michal Hocko wrote:
> > I wonder why you prefer a timeout-based approach. Your patch will after all
> > set MMF_OOM_SKIP if operations between down_write() and up_write() took
> > more than one second.
>
> if we reach down_write then we have unmapped the address space in
> exit_mmap and the oom reaper cannot do much more.

So, by the time down_write() is called, the majority of memory is already
released, isn't it?
Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap
On Tue 27-06-17 19:52:03, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > From: Michal Hocko
> >
> > David has noticed that the oom killer might kill additional tasks while
> > the existing victim hasn't terminated yet because the oom_reaper marks
> > the current victim MMF_OOM_SKIP too early when mm->mm_users dropped down
> > to 0. The race is as follows
> >
> >   oom_reap_task                     do_exit
> >                                       exit_mm
> >     __oom_reap_task_mm
> >                                         mmput
> >                                           __mmput
> >     mmget_not_zero # fails
> >                                             exit_mmap # frees memory
> >     set_bit(MMF_OOM_SKIP)
> >
> > Currently we try to reduce the risk of this race by taking oom_lock
> > and sleeping in out_of_memory while holding the lock to give the
> > victim some time to exit. This is a quite suboptimal approach because
> > there is no guarantee the victim (especially a large one) will manage
> > to unmap its address space and free enough memory to the particular oom
> > domain which needs memory (e.g. a specific NUMA node).
> >
> > Fix this problem by allowing __oom_reap_task_mm and the __mmput path to
> > race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
> > to run in parallel with other unmappers (hence the mmap_sem for read).
> > The only tricky part is we have to exclude page table teardown and all
> > operations which modify the address space in the __mmput path. exit_mmap
> > doesn't expect any other users so it doesn't use any locking. Nothing
> > really forbids us to use mmap_sem for write, though. In fact we are
> > already relying on this lock earlier in the __mmput path to synchronize
> > with ksm and khugepaged.
> >
> > Take the exclusive mmap_sem when calling free_pgtables and destroying
> > vmas to sync with __oom_reap_task_mm which takes the lock for read. All
> > other operations can safely race with the parallel unmap.
> >
> > Reported-by: David Rientjes
> > Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
> > Signed-off-by: Michal Hocko
> > ---
> >
> > Hi,
> > I am sending this as an RFC because I am not yet sure I haven't missed
> > something subtle here but the approach should work in principle. I have
> > run it through some of my OOM stress tests to see if anything blows up
> > and it all went smoothly.
> >
> > The issue has been brought up by David [1]. There were some attempts to
> > address it in oom proper [2][3] but the first one would cause problems
> > on its own [4] while the latter is just too hairy.
> >
> > Thoughts, objections, alternatives?
>
> I wonder why you prefer a timeout-based approach. Your patch will after all
> set MMF_OOM_SKIP if operations between down_write() and up_write() took
> more than one second.

if we reach down_write then we have unmapped the address space in
exit_mmap and the oom reaper cannot do much more.

> lock_anon_vma_root() from unlink_anon_vmas() from free_pgtables() for
> example calls down_write()/up_write(). unlink_file_vma() from
> free_pgtables() for another example calls down_write()/up_write().
> This means that it might happen that exit_mmap() takes more than one
> second with mm->mmap_sem held for write, doesn't it?
>
> The worst situation is that no memory is released by uprobe_clear_state(),
> exit_aio(), ksm_exit(), khugepaged_exit() and operations before
> down_write(&mm->mmap_sem), and then one second elapses before some memory
> is released after down_write(&mm->mmap_sem). In that case,
> down_write()/up_write() in your patch helps nothing.
>
> A less severe situation is that no memory is released by
> uprobe_clear_state(), exit_aio(), ksm_exit(), khugepaged_exit() and
> operations before down_write(&mm->mmap_sem), and then only some memory is
> released after down_write(&mm->mmap_sem) before one second elapses.
> Someone might think that this is still premature.

This would basically mean that the oom victim had all its memory in page
tables and vma structures with basically nothing mapped. While this is
possible, this is something the oom reaper cannot really help with until we
start reclaiming page tables as well. I have had a plan for that but never
got to implement it so it is still on my todo list.

> A more likely situation is that down_read_trylock(&mm->mmap_sem) in
> __oom_reap_task_mm() succeeds before exit_mmap() calls
> down_write(&mm->mmap_sem) (especially true if we remove
> mutex_lock(&oom_lock) from __oom_reap_task_mm()). In this case, your patch
> merely gives uprobe_clear_state(), exit_aio(), ksm_exit(),
> khugepaged_exit() and operations before down_write(&mm->mmap_sem) some
> time to release memory, for your patch will after all set MMF_OOM_SKIP
> immediately after __oom_reap_task_mm() called up_read(&mm->mmap_sem). If
> we assume that the majority of memory is released by operations between
> down_write(&mm->mmap_sem)/up_write(&mm->mmap_sem) in exit_mm(), this is
> not a preferable behavior.
Re: [RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap
Michal Hocko wrote:
> From: Michal Hocko
>
> David has noticed that the oom killer might kill additional tasks while
> the existing victim hasn't terminated yet because the oom_reaper marks
> the current victim MMF_OOM_SKIP too early when mm->mm_users dropped down
> to 0. The race is as follows
>
>   oom_reap_task                     do_exit
>                                       exit_mm
>     __oom_reap_task_mm
>                                         mmput
>                                           __mmput
>     mmget_not_zero # fails
>                                             exit_mmap # frees memory
>     set_bit(MMF_OOM_SKIP)
>
> Currently we try to reduce the risk of this race by taking oom_lock
> and sleeping in out_of_memory while holding the lock to give the
> victim some time to exit. This is a quite suboptimal approach because
> there is no guarantee the victim (especially a large one) will manage
> to unmap its address space and free enough memory to the particular oom
> domain which needs memory (e.g. a specific NUMA node).
>
> Fix this problem by allowing __oom_reap_task_mm and the __mmput path to
> race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
> to run in parallel with other unmappers (hence the mmap_sem for read).
> The only tricky part is we have to exclude page table teardown and all
> operations which modify the address space in the __mmput path. exit_mmap
> doesn't expect any other users so it doesn't use any locking. Nothing
> really forbids us to use mmap_sem for write, though. In fact we are
> already relying on this lock earlier in the __mmput path to synchronize
> with ksm and khugepaged.
>
> Take the exclusive mmap_sem when calling free_pgtables and destroying
> vmas to sync with __oom_reap_task_mm which takes the lock for read. All
> other operations can safely race with the parallel unmap.
>
> Reported-by: David Rientjes
> Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
> Signed-off-by: Michal Hocko
> ---
>
> Hi,
> I am sending this as an RFC because I am not yet sure I haven't missed
> something subtle here but the approach should work in principle. I have
> run it through some of my OOM stress tests to see if anything blows up
> and it all went smoothly.
>
> The issue has been brought up by David [1]. There were some attempts to
> address it in oom proper [2][3] but the first one would cause problems
> on its own [4] while the latter is just too hairy.
>
> Thoughts, objections, alternatives?

I wonder why you prefer a timeout-based approach. Your patch will after all
set MMF_OOM_SKIP if operations between down_write() and up_write() took
more than one second.

lock_anon_vma_root() from unlink_anon_vmas() from free_pgtables() for
example calls down_write()/up_write(). unlink_file_vma() from
free_pgtables() for another example calls down_write()/up_write(). This
means that it might happen that exit_mmap() takes more than one second
with mm->mmap_sem held for write, doesn't it?

The worst situation is that no memory is released by uprobe_clear_state(),
exit_aio(), ksm_exit(), khugepaged_exit() and operations before
down_write(&mm->mmap_sem), and then one second elapses before some memory
is released after down_write(&mm->mmap_sem). In that case,
down_write()/up_write() in your patch helps nothing.

A less severe situation is that no memory is released by
uprobe_clear_state(), exit_aio(), ksm_exit(), khugepaged_exit() and
operations before down_write(&mm->mmap_sem), and then only some memory is
released after down_write(&mm->mmap_sem) before one second elapses.
Someone might think that this is still premature.

A more likely situation is that down_read_trylock(&mm->mmap_sem) in
__oom_reap_task_mm() succeeds before exit_mmap() calls
down_write(&mm->mmap_sem) (especially true if we remove
mutex_lock(&oom_lock) from __oom_reap_task_mm()). In this case, your patch
merely gives uprobe_clear_state(), exit_aio(), ksm_exit(),
khugepaged_exit() and operations before down_write(&mm->mmap_sem) some
time to release memory, for your patch will after all set MMF_OOM_SKIP
immediately after __oom_reap_task_mm() called up_read(&mm->mmap_sem). If
we assume that the majority of memory is released by operations between
down_write(&mm->mmap_sem)/up_write(&mm->mmap_sem) in exit_mm(), this is
not a preferable behavior.

My patch [3] cannot give uprobe_clear_state(), exit_aio(), ksm_exit(),
khugepaged_exit() and exit_mm() some time to release memory. But [3] can
guarantee that all memory which the OOM reaper can reclaim is reclaimed
before setting MMF_OOM_SKIP. If we wait for another second after setting
MMF_OOM_SKIP, we could give operations between
down_write(&mm->mmap_sem)/up_write(&mm->mmap_sem) in exit_mm() (in your
patch) or __mmput() (in my patch) some more chance to reclaim memory
before the next OOM victim is selected.

> [1] http://lkml.kernel.org/r/alpine.deb.2.10.1706141632100.93...@chino.kir.corp.google.com
> [2