On Fri, Dec 16, 2016 at 01:56:50PM +0100, Michal Hocko wrote:
> On Fri 16-12-16 15:35:55, Kirill A. Shutemov wrote:
> > On Fri, Dec 16, 2016 at 12:42:43PM +0100, Michal Hocko wrote:
> > > On Fri 16-12-16 13:44:38, Kirill A. Shutemov wrote:
> > > > On Fri, Dec 16, 2016 at 11:11:13AM +0100, Michal Hocko wrote:
> > > > > On Fri 16-12-16 10:43:52, Vegard Nossum wrote:
> > > > > [...]
> > > > > > I don't think it's a bug in the OOM reaper itself, but either of the
> > > > > > following two patches will fix the problem (without my understanding
> > > > > > how or why):
> > > > > >
> > > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > > > index ec9f11d4f094..37b14b2e2af4 100644
> > > > > > --- a/mm/oom_kill.c
> > > > > > +++ b/mm/oom_kill.c
> > > > > > @@ -485,7 +485,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> > > > > >  	 */
> > > > > >  	mutex_lock(&oom_lock);
> > > > > >
> > > > > > -	if (!down_read_trylock(&mm->mmap_sem)) {
> > > > > > +	if (!down_write_trylock(&mm->mmap_sem)) {
> > > > >
> > > > > __oom_reap_task_mm is basically the same thing as MADV_DONTNEED and that
> > > > > doesn't require the exclusive mmap_sem. So this looks correct to me.
> > > >
> > > > BTW, shouldn't we filter out all VM_SPECIAL VMAs there? Or VM_PFNMAP at
> > > > least.
> > > >
> > > > MADV_DONTNEED doesn't touch VM_PFNMAP, but I don't see anything matching
> > > > on __oom_reap_task_mm() side.
> > >
> > > I guess you are right and we should match the MADV_DONTNEED behavior
> > > here. Care to send a patch?
> >
> > Below. Testing required.
> >
> > > > Other difference is that you use unmap_page_range() which doesn't touch
> > > > mmu_notifiers. MADV_DONTNEED goes via zap_page_range(), which invalidates
> > > > the range. Not sure if it can make any difference here.
> > >
> > > Which mmu notifier would care about this? I am not really familiar with
> > > those users so I might miss something easily.
> >
> > No idea either.
> >
> > Is there any reason not to use zap_page_range here too?
>
> Yes, zap_page_range is much more heavy and performs operations which
> might lock AFAIR which I would really like to avoid.

What exactly can block there? I don't see anything with that potential.

> > Few more notes:
> >
> > I probably miss something, but why do we need details->ignore_dirty?
> >
> > It is only applied for non-anon pages, but since we filter out shared
> > mappings, how can we have pte_dirty() for !PageAnon()?
>
> Why couldn't we have dirty pages on the private file mappings? The
> underlying page might be still in the page cache, right?

The check is about dirty PTE, not dirty page.

> > check_swap_entries is also sloppy: the behavior doesn't match the comment:
> > details == NULL makes it check swap entries. I removed it and restored the
> > details->check_mapping test as we had before.
>
> the reason is unmap_mapping_range which didn't use to check swap entries
> so I wanted to have it opt in AFAIR.

details == NULL would give you it in both cases.

> > @@ -531,8 +519,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> >  		 * count elevated without a good reason.
> >  		 */
> >  		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> > -			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
> > -					 &details);
> > +			madvise_dontneed(vma, &vma, vma->vm_start, vma->vm_end);
>
> I would rather keep the unmap_page_range because it is the bare minimum
> we have to do. Currently we are doing
>
> 	if (is_vm_hugetlb_page(vma))
> 		continue;
>
> so I would rather do something like
> 	if (!can_vma_madv_dontneed(vma))
> 		continue;
> instead.

We can do that. But let's first understand why the code should differ from
madvise_dontneed(). It's not obvious to me.

-- 
 Kirill A. Shutemov