Re: next-20200515: Xorg killed due to "OOM"

2020-06-01 Thread Michal Hocko
On Sun 31-05-20 14:16:01, Pavel Machek wrote:
> On Thu 2020-05-28 14:07:50, Michal Hocko wrote:
> > On Thu 28-05-20 14:03:54, Pavel Machek wrote:
> > > On Thu 2020-05-28 11:05:17, Michal Hocko wrote:
> > > > On Tue 26-05-20 11:10:54, Pavel Machek wrote:
> > > > [...]
> > > > > [38617.276517] oom_reaper: reaped process 31769 (chromium), now anon-rss:0kB, file-rss:0kB, shmem-rss:7968kB
> > > > > [38617.277232] Xorg invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=0
> > > > > [38617.277247] CPU: 0 PID: 2978 Comm: Xorg Not tainted 5.7.0-rc5-next-20200515+ #117
> > > > > [38617.277256] Hardware name: LENOVO 17097HU/17097HU, BIOS 7BETD8WW (2.19 ) 03/31/2011
> > > > > [38617.277266] Call Trace:
> > > > > [38617.277286]  dump_stack+0x54/0x6e
> > > > > [38617.277300]  dump_header+0x45/0x321
> > > > > [38617.277313]  oom_kill_process.cold+0x9/0xe
> > > > > [38617.277324]  ? out_of_memory+0x167/0x420
> > > > > [38617.277336]  out_of_memory+0x1f2/0x420
> > > > > [38617.277348]  pagefault_out_of_memory+0x34/0x56
> > > > > [38617.277361]  mm_fault_error+0x4a/0x130
> > > > > [38617.277372]  do_page_fault+0x3ce/0x416
> > > > 
> > > > The reason the OOM killer has been invoked is that the page fault
> > > > handler has returned VM_FAULT_OOM. So this is not a result of the page
> > > > allocator struggling to allocate memory. It would be interesting to
> > > > check which code path has returned this. 
> > > 
> > > Should the core WARN_ON if that happens and there's enough memory, or
> > > something like that?
> > 
> > I wish it would simply go away. There really shouldn't be any reason for
> > VM_FAULT_OOM to exist. A real low-memory situation is already
> > handled in the page allocator.
> 
> Umm. Maybe the WARN_ON is a first step in that direction? So we can see
> which driver actually did that, and complain to its authors?

This is much harder to do than it seems. But maybe this doesn't really
need full coverage. Some of the code paths which return VM_FAULT_OOM
will simply not fail. But checking for vma->vm_ops->fault() failures
might be interesting. Does the following tell you more about the failure
you are seeing?

diff --git a/mm/memory.c b/mm/memory.c
index 9ab00dcb95d4..5ff023ab7b49 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3442,8 +3442,11 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 
 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
-			    VM_FAULT_DONE_COW)))
+			    VM_FAULT_DONE_COW))) {
+		if (unlikely(ret & VM_FAULT_OOM))
+			pr_warn("VM_FAULT_OOM returned from %ps\n", vma->vm_ops->fault);
 		return ret;
+	}
 
 	if (unlikely(PageHWPoison(vmf->page))) {
 		if (ret & VM_FAULT_LOCKED)
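
For readers who want to experiment with the idea outside the kernel, the check added by the patch can be modelled in plain userspace C. This is only a sketch under stated assumptions: `struct fault_ops`, `do_fault_model()`, `oom_warnings` and the three sample handlers are hypothetical names, the flag values are illustrative for this model rather than taken from any kernel header, and the handler name is a plain string because `%ps` symbol lookup is not available in userspace.

```c
#include <assert.h>
#include <stdio.h>

/* Illustrative flag values for this userspace model only. */
typedef unsigned int vm_fault_t;
#define VM_FAULT_OOM    0x0001u
#define VM_FAULT_SIGBUS 0x0002u
#define VM_FAULT_NOPAGE 0x0100u
#define VM_FAULT_LOCKED 0x0200u
#define VM_FAULT_RETRY  0x0400u
#define VM_FAULT_ERROR  (VM_FAULT_OOM | VM_FAULT_SIGBUS)

/* Hypothetical stand-in for the vm_ops->fault callback and its symbol name. */
struct fault_ops {
	const char *name;              /* replaces the kernel's %ps lookup */
	vm_fault_t (*fault)(void);
};

/* Number of warnings emitted so far, so callers can inspect the result. */
int oom_warnings;

/* Call the handler; if it bails out with VM_FAULT_OOM, name the handler. */
vm_fault_t do_fault_model(const struct fault_ops *ops)
{
	vm_fault_t ret;

	assert(ops != NULL && ops->fault != NULL);
	ret = ops->fault();
	if (ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)) {
		if (ret & VM_FAULT_OOM) {
			fprintf(stderr, "VM_FAULT_OOM returned from %s\n",
				ops->name);
			oom_warnings++;
		}
		return ret;
	}
	return ret;
}

/* Sample handlers: one succeeds, one reports OOM, one reports SIGBUS. */
vm_fault_t ok_fault(void)     { return VM_FAULT_LOCKED; }
vm_fault_t oom_fault(void)    { return VM_FAULT_OOM; }
vm_fault_t sigbus_fault(void) { return VM_FAULT_SIGBUS; }
```

Calling `do_fault_model()` with a `struct fault_ops` that wraps `oom_fault` prints one warning naming that handler; the other two handlers pass through silently, which is the same filtering the patch applies.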

-- 
Michal Hocko
SUSE Labs


Re: next-20200515: Xorg killed due to "OOM"

2020-05-31 Thread Pavel Machek
On Thu 2020-05-28 14:07:50, Michal Hocko wrote:
> On Thu 28-05-20 14:03:54, Pavel Machek wrote:
> > On Thu 2020-05-28 11:05:17, Michal Hocko wrote:
> > > On Tue 26-05-20 11:10:54, Pavel Machek wrote:
> > > [...]
> > > > [38617.276517] oom_reaper: reaped process 31769 (chromium), now anon-rss:0kB, file-rss:0kB, shmem-rss:7968kB
> > > > [38617.277232] Xorg invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=0
> > > > [38617.277247] CPU: 0 PID: 2978 Comm: Xorg Not tainted 5.7.0-rc5-next-20200515+ #117
> > > > [38617.277256] Hardware name: LENOVO 17097HU/17097HU, BIOS 7BETD8WW (2.19 ) 03/31/2011
> > > > [38617.277266] Call Trace:
> > > > [38617.277286]  dump_stack+0x54/0x6e
> > > > [38617.277300]  dump_header+0x45/0x321
> > > > [38617.277313]  oom_kill_process.cold+0x9/0xe
> > > > [38617.277324]  ? out_of_memory+0x167/0x420
> > > > [38617.277336]  out_of_memory+0x1f2/0x420
> > > > [38617.277348]  pagefault_out_of_memory+0x34/0x56
> > > > [38617.277361]  mm_fault_error+0x4a/0x130
> > > > [38617.277372]  do_page_fault+0x3ce/0x416
> > > 
> > > The reason the OOM killer has been invoked is that the page fault
> > > handler has returned VM_FAULT_OOM. So this is not a result of the page
> > > allocator struggling to allocate memory. It would be interesting to
> > > check which code path has returned this. 
> > 
> > Should the core WARN_ON if that happens and there's enough memory, or
> > something like that?
> 
> I wish it would simply go away. There really shouldn't be any reason for
> VM_FAULT_OOM to exist. A real low-memory situation is already
> handled in the page allocator.

Umm. Maybe the WARN_ON is a first step in that direction? So we can see
which driver actually did that, and complain to its authors?

Best regards,
Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: next-20200515: Xorg killed due to "OOM"

2020-05-28 Thread Michal Hocko
On Thu 28-05-20 14:03:54, Pavel Machek wrote:
> On Thu 2020-05-28 11:05:17, Michal Hocko wrote:
> > On Tue 26-05-20 11:10:54, Pavel Machek wrote:
> > [...]
> > > [38617.276517] oom_reaper: reaped process 31769 (chromium), now anon-rss:0kB, file-rss:0kB, shmem-rss:7968kB
> > > [38617.277232] Xorg invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=0
> > > [38617.277247] CPU: 0 PID: 2978 Comm: Xorg Not tainted 5.7.0-rc5-next-20200515+ #117
> > > [38617.277256] Hardware name: LENOVO 17097HU/17097HU, BIOS 7BETD8WW (2.19 ) 03/31/2011
> > > [38617.277266] Call Trace:
> > > [38617.277286]  dump_stack+0x54/0x6e
> > > [38617.277300]  dump_header+0x45/0x321
> > > [38617.277313]  oom_kill_process.cold+0x9/0xe
> > > [38617.277324]  ? out_of_memory+0x167/0x420
> > > [38617.277336]  out_of_memory+0x1f2/0x420
> > > [38617.277348]  pagefault_out_of_memory+0x34/0x56
> > > [38617.277361]  mm_fault_error+0x4a/0x130
> > > [38617.277372]  do_page_fault+0x3ce/0x416
> > 
> > The reason the OOM killer has been invoked is that the page fault
> > handler has returned VM_FAULT_OOM. So this is not a result of the page
> > allocator struggling to allocate memory. It would be interesting to
> > check which code path has returned this. 
> 
> Should the core WARN_ON if that happens and there's enough memory, or
> something like that?

I wish it would simply go away. There really shouldn't be any reason for
VM_FAULT_OOM to exist. A real low-memory situation is already
handled in the page allocator.

-- 
Michal Hocko
SUSE Labs


Re: next-20200515: Xorg killed due to "OOM"

2020-05-28 Thread Pavel Machek
On Thu 2020-05-28 11:05:17, Michal Hocko wrote:
> On Tue 26-05-20 11:10:54, Pavel Machek wrote:
> [...]
> > [38617.276517] oom_reaper: reaped process 31769 (chromium), now 
> > anon-rss:0kB, file-rss:0kB, shmem-rss:7968kB
> > [38617.277232] Xorg invoked oom-killer: gfp_mask=0x0(), order=0, 
> > oom_score_adj=0
> > [38617.277247] CPU: 0 PID: 2978 Comm: Xorg Not tainted 
> > 5.7.0-rc5-next-20200515+ #117
> > [38617.277256] Hardware name: LENOVO 17097HU/17097HU, BIOS 7BETD8WW (2.19 ) 
> > 03/31/2011
> > [38617.277266] Call Trace:
> > [38617.277286]  dump_stack+0x54/0x6e
> > [38617.277300]  dump_header+0x45/0x321
> > [38617.277313]  oom_kill_process.cold+0x9/0xe
> > [38617.277324]  ? out_of_memory+0x167/0x420
> > [38617.277336]  out_of_memory+0x1f2/0x420
> > [38617.277348]  pagefault_out_of_memory+0x34/0x56
> > [38617.277361]  mm_fault_error+0x4a/0x130
> > [38617.277372]  do_page_fault+0x3ce/0x416
> 
> The reason the OOM killer has been invoked is that the page fault
> handler has returned VM_FAULT_OOM. So this is not a result of the page
> allocator struggling to allocate memory. It would be interesting to
> check which code path has returned this. 

Should the core WARN_ON if that happens and there's enough memory, or
something like that?

I grepped, and there are not too many users of VM_FAULT_OOM. These
might be relevant:

drivers/gpu/drm/ttm/ttm_bo_vm.c: *   VM_FAULT_OOM on out-of-memory
drivers/gpu/drm/ttm/ttm_bo_vm.c:ret = VM_FAULT_OOM;
drivers/gpu/drm/ttm/ttm_bo_vm.c:ret = VM_FAULT_OOM;
drivers/gpu/drm/i915/gem/i915_gem_mman.c:   return VM_FAULT_OOM;
drivers/gpu/drm/vkms/vkms_gem.c:ret = VM_FAULT_OOM;
drivers/gpu/drm/vgem/vgem_drv.c:ret = VM_FAULT_OOM;

Best regards,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: next-20200515: Xorg killed due to "OOM"

2020-05-28 Thread Michal Hocko
On Tue 26-05-20 11:10:54, Pavel Machek wrote:
[...]
> [38617.276517] oom_reaper: reaped process 31769 (chromium), now anon-rss:0kB, file-rss:0kB, shmem-rss:7968kB
> [38617.277232] Xorg invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=0
> [38617.277247] CPU: 0 PID: 2978 Comm: Xorg Not tainted 5.7.0-rc5-next-20200515+ #117
> [38617.277256] Hardware name: LENOVO 17097HU/17097HU, BIOS 7BETD8WW (2.19 ) 03/31/2011
> [38617.277266] Call Trace:
> [38617.277286]  dump_stack+0x54/0x6e
> [38617.277300]  dump_header+0x45/0x321
> [38617.277313]  oom_kill_process.cold+0x9/0xe
> [38617.277324]  ? out_of_memory+0x167/0x420
> [38617.277336]  out_of_memory+0x1f2/0x420
> [38617.277348]  pagefault_out_of_memory+0x34/0x56
> [38617.277361]  mm_fault_error+0x4a/0x130
> [38617.277372]  do_page_fault+0x3ce/0x416

The reason the OOM killer has been invoked is that the page fault
handler has returned VM_FAULT_OOM. So this is not a result of the page
allocator struggling to allocate memory. It would be interesting to
check which code path has returned this. 
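
When scanning a longer dmesg capture for reports of this kind, a small userspace helper can flag the relevant headers. The following is a minimal sketch, not anything from the kernel tree: `oom_header_from_page_fault()` is a hypothetical name, and it rests on the assumption that an "invoked oom-killer" header with `gfp_mask=0x0` and `order=0` corresponds to the page-fault path, as the `pagefault_out_of_memory` frame in this trace suggests. It expects the whole header on a single line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Classify one log line.
 * Returns  1 if it is an OOM header with gfp_mask 0 and order 0
 *            (assumed here to mean the page-fault path),
 *          0 if it is an OOM header carrying an allocation context,
 *         -1 if it is not a parsable OOM header at all.
 */
int oom_header_from_page_fault(const char *line)
{
	const char *p;
	unsigned long gfp;
	int order;

	assert(line != NULL);
	p = strstr(line, "invoked oom-killer:");
	if (!p)
		return -1;
	p = strstr(p, "gfp_mask=");
	if (!p || sscanf(p, "gfp_mask=%lx", &gfp) != 1)
		return -1;
	p = strstr(p, "order=");
	if (!p || sscanf(p, "order=%d", &order) != 1)
		return -1;
	return gfp == 0 && order == 0;
}
```

Feeding it each line of a log and printing the lines for which it returns 1 gives a quick list of OOM kills that did not originate from a failing page allocation.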
-- 
Michal Hocko
SUSE Labs