Re: OOM detection regressions since 4.7
On Mon, 2016-08-29 at 10:28 -0700, Linus Torvalds wrote:
> On Mon, Aug 29, 2016 at 7:52 AM, Olaf Hering wrote:
> > Today I noticed the nfsserver was disabled, probably since months already.
> > Starting it gives an OOM, not sure if this is new with 4.7+.
>
> That's not an oom, that's just an allocation failure.
>
> And with order-4, that's actually pretty normal. Nobody should use
> order-4 (that's 16 contiguous pages; fragmentation can easily make
> that hard - *much* harder than the small order-1 or order-2 cases
> that we should largely be able to rely on).
>
> In fact, people who do multi-order allocations should always have a
> fallback, and use __GFP_NOWARN.
>
> > [93348.306406] Call Trace:
> > [93348.306490] [] __alloc_pages_slowpath+0x1af/0xa10
> > [93348.306501] [] __alloc_pages_nodemask+0x250/0x290
> > [93348.306511] [] cache_grow_begin+0x8d/0x540
> > [93348.306520] [] fallback_alloc+0x161/0x200
> > [93348.306530] [] __kmalloc+0x1d2/0x570
> > [93348.306589] [] nfsd_reply_cache_init+0xaa/0x110 [nfsd]
>
> Hmm. That's kmalloc itself falling back after already failing to grow
> the slab cache earlier (the earlier allocations *were* done with
> NOWARN afaik).
>
> It does look like nfsd starts out by allocating the hash table with
> one single fairly big allocation, and has no fallback position.
>
> I suspect the code expects to be started at boot time, when this just
> isn't an issue. The fact that you loaded the nfsd kernel module with
> memory already fragmented after heavy use is likely why nobody else
> has seen this.
> Adding the nfsd people to the cc, because just from a robustness
> standpoint I suspect it would be better if the code did something like
>
>  (a) shrink the hash table if the allocation fails (we've got some
> examples of that elsewhere)
>
> or
>
>  (b) fall back on a vmalloc allocation (that's certainly the simpler model)
>
> We do have a "kvfree()" helper function for the "free either a kmalloc
> or vmalloc allocation" but we don't actually have a good helper
> pattern for the allocation side. People just do it by hand, at least
> partly because we have so many different ways to allocate things -
> zeroing, non-zeroing, node-specific or not, atomic or not (atomic
> cannot fall back to vmalloc, obviously) etc etc.
>
> Bruce, Jeff, comments?
>
> Linus

Yeah, that makes total sense. Hmm... we _do_ already auto-size the hash
at init time, so shrinking it downward and retrying if the allocation
fails wouldn't be hard to do. Maybe I can just cut it in half and throw
a pr_warn to tell the admin in that case.

In any case... I'll take a look at how we can improve it. Thanks for
the heads-up!
--
Jeff Layton
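Jeff's halve-and-retry idea can be sketched outside the kernel. The Python model below is hypothetical (the function names are not nfsd's actual code) but shows the shape of the loop: keep cutting the requested bucket count in half until the allocator succeeds, and warn when the table ends up smaller than requested.

```python
def alloc_hash_table(want_buckets, try_alloc, min_buckets=64):
    """Allocate a hash table, halving the size on each failure.

    Hypothetical model of the shrink-and-retry fallback discussed above;
    try_alloc stands in for a kernel allocation that may fail under
    fragmentation, and the print() stands in for the pr_warn Jeff mentions.
    """
    size = want_buckets
    while size >= min_buckets:
        table = try_alloc(size)
        if table is not None:
            if size < want_buckets:
                print(f"hash table shrunk to {size} buckets")
            return table, size
        size //= 2
    return None, 0

# Simulate fragmented memory: contiguous allocations above 1024
# buckets fail, smaller ones succeed.
frag_alloc = lambda n: [None] * n if n <= 1024 else None
table, size = alloc_hash_table(4096, frag_alloc)
```

With the simulated allocator above, the 4096- and 2048-bucket attempts fail and the loop settles on a 1024-bucket table instead of giving up entirely.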
Re: OOM detection regressions since 4.7
On Mon, Aug 29, 2016 at 7:52 AM, Olaf Hering wrote:
> Today I noticed the nfsserver was disabled, probably since months already.
> Starting it gives an OOM, not sure if this is new with 4.7+.

That's not an oom, that's just an allocation failure.

And with order-4, that's actually pretty normal. Nobody should use
order-4 (that's 16 contiguous pages; fragmentation can easily make that
hard - *much* harder than the small order-1 or order-2 cases that we
should largely be able to rely on).

In fact, people who do multi-order allocations should always have a
fallback, and use __GFP_NOWARN.

> [93348.306406] Call Trace:
> [93348.306490] [] __alloc_pages_slowpath+0x1af/0xa10
> [93348.306501] [] __alloc_pages_nodemask+0x250/0x290
> [93348.306511] [] cache_grow_begin+0x8d/0x540
> [93348.306520] [] fallback_alloc+0x161/0x200
> [93348.306530] [] __kmalloc+0x1d2/0x570
> [93348.306589] [] nfsd_reply_cache_init+0xaa/0x110 [nfsd]

Hmm. That's kmalloc itself falling back after already failing to grow
the slab cache earlier (the earlier allocations *were* done with NOWARN
afaik).

It does look like nfsd starts out by allocating the hash table with one
single fairly big allocation, and has no fallback position.

I suspect the code expects to be started at boot time, when this just
isn't an issue. The fact that you loaded the nfsd kernel module with
memory already fragmented after heavy use is likely why nobody else has
seen this.

Adding the nfsd people to the cc, because just from a robustness
standpoint I suspect it would be better if the code did something like

 (a) shrink the hash table if the allocation fails (we've got some
examples of that elsewhere)

or

 (b) fall back on a vmalloc allocation (that's certainly the simpler model)

We do have a "kvfree()" helper function for the "free either a kmalloc
or vmalloc allocation" but we don't actually have a good helper pattern
for the allocation side. People just do it by hand, at least partly
because we have so many different ways to allocate things - zeroing,
non-zeroing, node-specific or not, atomic or not (atomic cannot fall
back to vmalloc, obviously) etc etc.

Bruce, Jeff, comments?

Linus
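The hand-rolled allocation-side pattern Linus describes (try a physically contiguous allocation first, quietly, then fall back to a virtually contiguous one) can be modeled abstractly. The Python sketch below uses stand-in allocator callables, since the real kernel helpers obviously cannot run here; the tagging is the bookkeeping that kvfree() hides on the free side.

```python
def kv_alloc(size, kmalloc, vmalloc):
    """Try the contiguous allocator first (the kernel would pass
    __GFP_NOWARN here so the failure is silent), then fall back to the
    virtually contiguous one. The tag records which allocator was used
    so the free side knows which to undo - hypothetical model only."""
    buf = kmalloc(size)
    if buf is not None:
        return ("kmalloc", buf)
    return ("vmalloc", vmalloc(size))

def kv_free(tagged, kfree, vfree):
    """Free either kind of allocation, mirroring kvfree()."""
    kind, buf = tagged
    (kfree if kind == "kmalloc" else vfree)(buf)

# Fragmented-memory stand-ins: contiguous allocations over 64 KiB fail,
# the virtual allocator always succeeds.
kmalloc = lambda n: bytearray(n) if n <= 64 * 1024 else None
vmalloc = lambda n: bytearray(n)

tag, buf = kv_alloc(128 * 1024, kmalloc, vmalloc)  # too big: falls back
```

A large request takes the fallback path while a small one stays contiguous; atomic contexts could not use this pattern at all, as Linus notes, since vmalloc may sleep.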
Re: OOM detection regressions since 4.7
On Mon, Aug 29, Michal Hocko wrote:
> On Mon 29-08-16 16:52:03, Olaf Hering wrote:
> > I ran rc3 for a few hours on Friday and Firefox was not killed.
> > Now rc3 is running for a day with the usual workload and Firefox is
> > still running.
> Is the patch
> (http://lkml.kernel.org/r/20160823074339.gb23...@dhcp22.suse.cz) applied?

Yes.

Tested-by: Olaf Hering

Olaf
Re: OOM detection regressions since 4.7
On Mon 29-08-16 16:52:03, Olaf Hering wrote:
> On Thu, Aug 25, Olaf Hering wrote:
> > On Thu, Aug 25, Michal Hocko wrote:
> > > Any luck with the testing of this patch?
>
> I ran rc3 for a few hours on Friday and Firefox was not killed.
> Now rc3 is running for a day with the usual workload and Firefox is
> still running.

Is the patch
(http://lkml.kernel.org/r/20160823074339.gb23...@dhcp22.suse.cz) applied?

> Today I noticed the nfsserver was disabled, probably since months already.
> Starting it gives an OOM, not sure if this is new with 4.7+.
> Full dmesg attached.
> [93348.306369] modprobe: page allocation failure: order:4,
> mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)

OK, so order-4 (COSTLY allocation) has failed because
[...]
> [93348.313778] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
> [93348.313803] Node 0 DMA32: 13633*4kB (UME) 8035*8kB (UME) 890*16kB (UME) 10*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 133372kB
> [93348.313822] Node 0 Normal: 14003*4kB (UME) 25*8kB (UME) 2*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 56244kB

the memory is too fragmented for such a large allocation. Failing
order-4 requests is not so severe because we do not invoke the oom
killer if they fail. Especially without GFP_REPEAT we do not even try
too hard. Recent oom detection changes shouldn't change this behavior.
--
Michal Hocko
SUSE Labs
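Michal's reading of the free-list dump can be checked mechanically. This Python sketch parses the quoted `Node 0 Normal` buddy-list line and shows that although roughly 56 MB is free in the zone, not a single free block of 64 KiB or larger exists - and an order-4 request needs 16 contiguous 4 KiB pages, i.e. exactly one 64 KiB block.

```python
import re

# The "Node 0 Normal" free-list line quoted above: count*blocksize pairs
# from the buddy allocator's per-order free lists.
line = ("Node 0 Normal: 14003*4kB (UME) 25*8kB (UME) 2*16kB (UM) 0*32kB "
        "0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 56244kB")

blocks = {int(kb): int(n) for n, kb in re.findall(r"(\d+)\*(\d+)kB", line)}

total_kb = sum(n * kb for kb, n in blocks.items())   # free memory overall
order4_kb = (1 << 4) * 4   # order-4 = 16 contiguous 4 KiB pages = 64 KiB
have_order4 = any(n > 0 for kb, n in blocks.items() if kb >= order4_kb)
```

The totals reproduce the kernel's own summary (56244 kB free), yet `have_order4` comes out false: almost everything free sits in order-0 and order-1 fragments, which is precisely why the allocation fails without the system being anywhere near out of memory.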
Re: OOM detection regressions since 4.7
On Mon, Aug 29, Olaf Hering wrote:
> Full dmesg attached.

Now..

[attachment: dmesg-4.8.0-rc3-3.bug994066-default.txt.gz (GNU Zip compressed data)]
Re: OOM detection regressions since 4.7
On Thu, Aug 25, Olaf Hering wrote:
> On Thu, Aug 25, Michal Hocko wrote:
> > Any luck with the testing of this patch?

I ran rc3 for a few hours on Friday and Firefox was not killed.
Now rc3 is running for a day with the usual workload and Firefox is
still running.

Today I noticed the nfsserver was disabled, probably since months already.
Starting it gives an OOM, not sure if this is new with 4.7+.
Full dmesg attached.

[0.00] Linux version 4.8.0-rc3-3.bug994066-default (geeko@buildhost) (gcc version 6.1.1 20160815 [gcc-6-branch revision 239479] (SUSE Linux) ) #1 SMP PREEMPT Mon Aug 22 14:52:18 UTC 2016 (c0d2ef5)
[64378.582489] tun: Universal TUN/TAP device driver, 1.6
[64378.582493] tun: (C) 1999-2004 Max Krasnyansky
[93347.645123] RPC: Registered named UNIX socket transport module.
[93347.645128] RPC: Registered udp transport module.
[93347.645130] RPC: Registered tcp transport module.
[93347.645132] RPC: Registered tcp NFSv4.1 backchannel transport module.
[93348.227828] Installing knfsd (copyright (C) 1996 o...@monad.swb.de).
[93348.306369] modprobe: page allocation failure: order:4, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
[93348.306379] CPU: 2 PID: 30467 Comm: modprobe Not tainted 4.8.0-rc3-3.bug994066-default #1
[93348.306382] Hardware name: Hewlett-Packard HP ProBook 6555b/1455, BIOS 68DTM Ver. F.21 06/14/2012
[93348.306386] 813a2952 0004 88003fb6ba30
[93348.306394] 81198a4b 026040cf 026040c1 88003fb6c000
[93348.306400] 0004 88003fb6baac 026040c0 0040
[93348.306406] Call Trace:
[93348.306437] [] dump_trace+0x5e/0x310
[93348.306449] [] show_stack_log_lvl+0x11b/0x1a0
[93348.306459] [] show_stack+0x21/0x40
[93348.306468] [] dump_stack+0x5c/0x7a
[93348.306478] [] warn_alloc_failed+0xdb/0x150
[93348.306490] [] __alloc_pages_slowpath+0x1af/0xa10
[93348.306501] [] __alloc_pages_nodemask+0x250/0x290
[93348.306511] [] cache_grow_begin+0x8d/0x540
[93348.306520] [] fallback_alloc+0x161/0x200
[93348.306530] [] __kmalloc+0x1d2/0x570
[93348.306589] [] nfsd_reply_cache_init+0xaa/0x110 [nfsd]
[93348.306649] [] init_nfsd+0x56/0xea0 [nfsd]
[93348.306664] [] do_one_initcall+0x4b/0x180
[93348.306674] [] do_init_module+0x5b/0x1fe
[93348.306684] [] load_module+0x1a75/0x1d00
[93348.306695] [] SYSC_finit_module+0xa4/0xe0
[93348.306705] [] entry_SYSCALL_64_fastpath+0x1e/0xa8
[93348.313626] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xa8
[93348.313629] Leftover inexact backtrace:
[93348.313691] Mem-Info:
[93348.313704] active_anon:467209 inactive_anon:125491 isolated_anon:0 active_file:264880 inactive_file:166389 isolated_file:0 unevictable:8 dirty:250 writeback:0 unstable:0 slab_reclaimable:796425 slab_unreclaimable:34803 mapped:54783 shmem:24119 pagetables:9083 bounce:0 free:51321 free_pcp:68 free_cma:0
[93348.313717] Node 0 active_anon:1868836kB inactive_anon:501964kB active_file:1059520kB inactive_file:665556kB unevictable:32kB isolated(anon):0kB isolated(file):0kB mapped:219132kB dirty:1000kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 749568kB anon_thp: 96476kB writeback_tmp:0kB unstable:0kB pages_scanned:24 all_unreclaimable? no
[93348.313719] Node 0 DMA free:15908kB min:136kB low:168kB high:200kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[93348.313729] lowmem_reserve[]: 0 2626 7621 7621 7621
[93348.313745] Node 0 DMA32 free:133192kB min:23244kB low:29052kB high:34860kB active_anon:642152kB inactive_anon:119848kB active_file:257900kB inactive_file:116560kB unevictable:0kB writepending:292kB present:2847412kB managed:2766832kB mlocked:0kB slab_reclaimable:1418576kB slab_unreclaimable:39004kB kernel_stack:256kB pagetables:1448kB bounce:0kB free_pcp:128kB local_pcp:0kB free_cma:0kB
[93348.313755] lowmem_reserve[]: 0 0 4994 4994 4994
[93348.313762] Node 0 Normal free:56184kB min:44200kB low:55248kB high:66296kB active_anon:1226576kB inactive_anon:382200kB active_file:801508kB inactive_file:548992kB unevictable:32kB writepending:536kB present:5242880kB managed:5114880kB mlocked:32kB slab_reclaimable:1767124kB slab_unreclaimable:100208kB kernel_stack:9104kB pagetables:34884kB bounce:0kB free_pcp:144kB local_pcp:0kB free_cma:0kB
[93348.313771] lowmem_reserve[]: 0 0 0 0 0
[93348.313778] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[93348.313803] Node 0 DMA32: 13633*4kB (UME) 8035*8kB (UME) 890*16kB (UME) 10*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 133372kB
[93348.313822] Node 0 Normal: 14003*4kB
Re: OOM detection regressions since 4.7
On Thursday 25 of August 2016, Michal Hocko wrote:
> On Tue 23-08-16 09:43:39, Michal Hocko wrote:
> > On Mon 22-08-16 15:05:17, Andrew Morton wrote:
> > > On Mon, 22 Aug 2016 15:42:28 +0200 Michal Hocko wrote:
> > > > Of course, if Linus/Andrew doesn't like to take those compaction
> > > > improvements this late then I will ask to merge the partial revert to
> > > > Linus tree as well and then there is not much to discuss.
> > >
> > > This sounds like the prudent option. Can we get 4.8 working
> > > well-enough, backport that into 4.7.x and worry about the fancier stuff
> > > for 4.9?
> >
> > OK, fair enough.
> >
> > I would really appreciate if the original reporters could retest with
> > this patch on top of the current Linus tree.
>
> Any luck with the testing of this patch?

Here my "rm -rf && cp -al" 10x in parallel test finished without OOM, so

Tested-by: Arkadiusz Miśkiewicz
--
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )
Re: OOM detection regressions since 4.7
On 25.08.2016 23:26, Michal Hocko wrote:
> On Thu 25-08-16 13:30:23, Ralf-Peter Rohbeck wrote:
> [...]
> > This worked for me for about 12 hours of my torture test. Logs are at
> > https://filebin.net/2rfah407nbhzs69e/OOM_4.8.0-rc2_p1.tar.bz2.
>
> Thanks! Can we add your
> Tested-by: Ralf-Peter Rohbeck
> to the patch?

Sure.

Ralf-Peter
Re: OOM detection regressions since 4.7
On Thu 25-08-16 13:30:23, Ralf-Peter Rohbeck wrote:
[...]
> This worked for me for about 12 hours of my torture test. Logs are at
> https://filebin.net/2rfah407nbhzs69e/OOM_4.8.0-rc2_p1.tar.bz2.

Thanks! Can we add your
Tested-by: Ralf-Peter Rohbeck
to the patch?
--
Michal Hocko
SUSE Labs
Re: OOM detection regressions since 4.7
On 23.08.2016 00:43, Michal Hocko wrote:

OK, fair enough.

I would really appreciate if the original reporters could retest with
this patch on top of the current Linus tree. The stable backport posted
earlier doesn't apply on the current master cleanly but the change is
essentially the same. The mmotm tree can then revert this patch before
Vlastimil's series is applied because that code is touching the
currently removed code.
---
From 90b6b282bede7966fb6c830a6d012d2239ac40e4 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Mon, 22 Aug 2016 10:52:06 +0200
Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation for high
 order request

There have been several reports about pre-mature OOM killer invocation
in 4.7 kernel when order-2 allocation requests (for the kernel stack)
invoked the OOM killer even during basic workloads (light IO or even
kernel compile on some filesystems). In all reported cases the memory
is fragmented and there are no order-2+ pages available. There is
usually a large amount of slab memory (usually dentries/inodes) and
further debugging has shown that there are way too many unmovable
blocks which are skipped during the compaction. Multiple reporters have
confirmed that the current linux-next which includes [1] and [2] helped
and OOMs are not reproducible anymore.

A simpler fix for the late rc and stable is to simply ignore the
compaction feedback and retry as long as there is a reclaim progress
and we are not getting OOM for order-0 pages. We already do that for
CONFIG_COMPACTION=n so let's reuse the same code when compaction is
enabled as well.
[1] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz
[2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz

Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
Signed-off-by: Michal Hocko
---
 mm/page_alloc.c | 51 ++-------------------------------------------------
 1 file changed, 2 insertions(+), 49 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3fbe73a6fe4b..7791a03f8deb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3137,54 +3137,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	return NULL;
 }
 
-static inline bool
-should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
-		     enum compact_result compact_result,
-		     enum compact_priority *compact_priority,
-		     int compaction_retries)
-{
-	int max_retries = MAX_COMPACT_RETRIES;
-
-	if (!order)
-		return false;
-
-	/*
-	 * compaction considers all the zone as desperately out of memory
-	 * so it doesn't really make much sense to retry except when the
-	 * failure could be caused by insufficient priority
-	 */
-	if (compaction_failed(compact_result)) {
-		if (*compact_priority > MIN_COMPACT_PRIORITY) {
-			(*compact_priority)--;
-			return true;
-		}
-		return false;
-	}
-
-	/*
-	 * make sure the compaction wasn't deferred or didn't bail out early
-	 * due to locks contention before we declare that we should give up.
-	 * But do not retry if the given zonelist is not suitable for
-	 * compaction.
-	 */
-	if (compaction_withdrawn(compact_result))
-		return compaction_zonelist_suitable(ac, order, alloc_flags);
-
-	/*
-	 * !costly requests are much more important than __GFP_REPEAT
-	 * costly ones because they are de facto nofail and invoke OOM
-	 * killer to move on while costly can fail and users are ready
-	 * to cope with that. 1/4 retries is rather arbitrary but we
-	 * would need much more detailed feedback from compaction to
-	 * make a better decision.
-	 */
-	if (order > PAGE_ALLOC_COSTLY_ORDER)
-		max_retries /= 4;
-	if (compaction_retries <= max_retries)
-		return true;
-
-	return false;
-}
 #else
 static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
@@ -3195,6 +3147,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	return NULL;
 }
 
+#endif /* CONFIG_COMPACTION */
+
 static inline bool
 should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
 		     enum compact_result compact_result,
@@ -3221,7 +3175,6 @@
Re: OOM detection regressions since 4.7
On Tue 23-08-16 09:43:39, Michal Hocko wrote:
> On Mon 22-08-16 15:05:17, Andrew Morton wrote:
> > On Mon, 22 Aug 2016 15:42:28 +0200 Michal Hocko wrote:
> >
> > > Of course, if Linus/Andrew doesn't like to take those compaction
> > > improvements this late then I will ask to merge the partial revert to
> > > the Linus tree as well and then there is not much to discuss.
> >
> > This sounds like the prudent option. Can we get 4.8 working
> > well enough, backport that into 4.7.x and worry about the fancier stuff
> > for 4.9?
>
> OK, fair enough.
>
> I would really appreciate it if the original reporters could retest with
> this patch on top of the current Linus tree.

Any luck with the testing of this patch?
--
Michal Hocko
SUSE Labs
Re: OOM detection regressions since 4.7
On Thu, Aug 25, Michal Hocko wrote:
> Any luck with the testing of this patch?

Not this week, sorry.

Olaf
Re: OOM detection regressions since 4.7
2016-08-24 16:04 GMT+09:00 Michal Hocko:
> On Wed 24-08-16 14:01:57, Joonsoo Kim wrote:
>> Looks like my mail client ate my reply so I resend.
>>
>> On Tue, Aug 23, 2016 at 09:33:18AM +0200, Michal Hocko wrote:
>> > On Tue 23-08-16 13:52:45, Joonsoo Kim wrote:
>> > [...]
>> > > Hello, Michal.
>> > >
>> > > I agree with a partial revert, but the revert should take a different
>> > > form. The change below tries to reuse the should_compact_retry()
>> > > version for !CONFIG_COMPACTION, but it turned out that it also causes
>> > > a regression in Markus' report [1].
>> >
>> > I would argue that CONFIG_COMPACTION=n behaves so arbitrarily for high
>> > order workloads that calling any change in that behavior a regression
>> > is a little bit exaggerated. Disabling compaction should have a very
>> > strong reason. I haven't heard any so far. I am even wondering whether
>> > there is a legitimate reason for that these days.
>> >
>> > > The theoretical reason for this regression is that it would stop
>> > > retrying even if there are enough lru pages. It only checks whether
>> > > the free pages exceed the min watermark or not for the retry decision.
>> > > To prevent a premature OOM killer, we need to keep the allocation loop
>> > > going when there are enough lru pages. So, the logic should be
>> > > something like this:
>> > >
>> > > should_compact_retry()
>> > > {
>> > >     for_each_zone_zonelist_nodemask {
>> > >         available = zone_reclaimable_pages(zone);
>> > >         available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
>> > >         if (__zone_watermark_ok(zone, *0*, min_wmark_pages(zone),
>> > >                 ac_classzone_idx(ac), alloc_flags, available))
>> > >             return true;
>> > >     }
>> > > }
>> > >
>> > > I suggested it before and the current situation looks like it is
>> > > indeed needed.
>> >
>> > this just opens the door for unbounded reclaim/thrashing because you
>> > can reclaim as much as you like and there is no guarantee of forward
>> > progress. The reason why the !COMPACTION should_compact_retry only
>> > checks for the min_wmark without the reclaimable bias is that this will
>> > guarantee a retry if we are failing due to the high order wmark check
>> > rather than a lack of memory. This condition is guaranteed to converge
>> > and the probability of unbounded reclaim is much reduced.
>>
>> In the case of a lack of memory with a lot of reclaimable lru pages, why
>> do we stop reclaim/compaction?
>>
>> With your partial reverting patch, the allocation logic would be as
>> follows.
>>
>> Assume the following situation:
>> o a lot of reclaimable lru pages
>> o no order-2 freepage
>> o not enough order-0 freepages for the min watermark
>> o an order-2 allocation
>>
>> 1. the order-2 allocation fails due to the min watermark
>> 2. go to reclaim/compaction
>> 3. reclaim some pages (maybe SWAP_CLUSTER_MAX (32) pages) but the min
>>    watermark still isn't met for order-0
>> 4. compaction is skipped due to not enough freepages
>> 5. should_reclaim_retry() returns false because the min watermark for an
>>    order-2 page isn't met
>> 6. should_compact_retry() returns false because the min watermark for an
>>    order-0 page isn't met
>> 7. the allocation fails without any retry and the OOM killer is invoked.
>
> If direct reclaim is not able to get us over the min wmark for order-0
> then we would be likely to hit the oom even for order-0 requests.

No, this is a situation where direct reclaim can get us over the min wmark
for order-0, but it needs a retry. IIUC, direct reclaim would not reclaim
enough memory at once. It tries to reclaim a small amount of lru pages and
breaks out to check the watermark.

>> Is it what you want?
>>
>> And, please elaborate more on how your logic guarantees convergence.
>> After the order-0 freepages exceed the min watermark, there is no way to
>> stop reclaim/thrashing. The number of freepages just increases
>> monotonically and the retry cannot be stopped until the order-2
>> allocation succeeds. Am I missing something?
>
> My statement was imprecise at best. You are right that there is no
> guarantee to fulfil an order-2 request. What I meant to say is that we
> should converge when we are getting out of memory (aka even order-0 would
> have a hard time succeeding). should_reclaim_retry does that by the
> back-off scaling of the reclaimable pages. should_compact_retry would
> have to do the same thing, which would effectively turn it into
> should_reclaim_retry.

So, I suggested changing should_reclaim_retry() for high order requests
before.

>> > > And, I still think that your OOM detection rework has some flaws.
>> > >
>> > > 1) It doesn't consider freeable objects that can be freed by
>> > > shrink_slab().
>> > > There are many subsystems that cache many objects and they will be
>> > > freed by the shrink_slab() interface. But you don't account for them
>> > > when making the OOM decision.
>> >
>> > I fully rely on the reclaim and compaction feedback. And that is the
>> > place where we should strive for improvements. So
Re: OOM detection regressions since 4.7
Looks like my mail client ate my reply so I resend.

On Tue, Aug 23, 2016 at 09:33:18AM +0200, Michal Hocko wrote:
> On Tue 23-08-16 13:52:45, Joonsoo Kim wrote:
> [...]
> > Hello, Michal.
> >
> > I agree with a partial revert, but the revert should take a different
> > form. The change below tries to reuse the should_compact_retry() version
> > for !CONFIG_COMPACTION, but it turned out that it also causes a
> > regression in Markus' report [1].
>
> I would argue that CONFIG_COMPACTION=n behaves so arbitrarily for high
> order workloads that calling any change in that behavior a regression is
> a little bit exaggerated. Disabling compaction should have a very strong
> reason. I haven't heard any so far. I am even wondering whether there is
> a legitimate reason for that these days.
>
> > The theoretical reason for this regression is that it would stop
> > retrying even if there are enough lru pages. It only checks whether the
> > free pages exceed the min watermark or not for the retry decision. To
> > prevent a premature OOM killer, we need to keep the allocation loop
> > going when there are enough lru pages. So, the logic should be
> > something like this:
> >
> > should_compact_retry()
> > {
> >     for_each_zone_zonelist_nodemask {
> >         available = zone_reclaimable_pages(zone);
> >         available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> >         if (__zone_watermark_ok(zone, *0*, min_wmark_pages(zone),
> >                 ac_classzone_idx(ac), alloc_flags, available))
> >             return true;
> >     }
> > }
> >
> > I suggested it before and the current situation looks like it is indeed
> > needed.
>
> this just opens the door for unbounded reclaim/thrashing because you can
> reclaim as much as you like and there is no guarantee of forward
> progress. The reason why the !COMPACTION should_compact_retry only checks
> for the min_wmark without the reclaimable bias is that this will
> guarantee a retry if we are failing due to the high order wmark check
> rather than a lack of memory. This condition is guaranteed to converge
> and the probability of unbounded reclaim is much reduced.

In the case of a lack of memory with a lot of reclaimable lru pages, why
do we stop reclaim/compaction?

With your partial reverting patch, the allocation logic would be as
follows.

Assume the following situation:
o a lot of reclaimable lru pages
o no order-2 freepage
o not enough order-0 freepages for the min watermark
o an order-2 allocation

1. the order-2 allocation fails due to the min watermark
2. go to reclaim/compaction
3. reclaim some pages (maybe SWAP_CLUSTER_MAX (32) pages) but the min
   watermark still isn't met for order-0
4. compaction is skipped due to not enough freepages
5. should_reclaim_retry() returns false because the min watermark for an
   order-2 page isn't met
6. should_compact_retry() returns false because the min watermark for an
   order-0 page isn't met
7. the allocation fails without any retry and the OOM killer is invoked.

Is it what you want?

And, please elaborate more on how your logic guarantees convergence. After
the order-0 freepages exceed the min watermark, there is no way to stop
reclaim/thrashing. The number of freepages just increases monotonically and
the retry cannot be stopped until the order-2 allocation succeeds. Am I
missing something?

> > And, I still think that your OOM detection rework has some flaws.
> >
> > 1) It doesn't consider freeable objects that can be freed by
> > shrink_slab().
> > There are many subsystems that cache many objects and they will be freed
> > by the shrink_slab() interface. But you don't account for them when
> > making the OOM decision.
>
> I fully rely on the reclaim and compaction feedback. And that is the
> place where we should strive for improvements. So if we are growing way
> too many slab objects we should take care of that in the slab reclaim,
> which is tightly coupled with the LRU reclaim, rather than up the layer
> in the page allocator.

No. The slab shrink logic being tightly coupled with the LRU reclaim
totally makes sense. What doesn't make sense is the way of using this
functionality and utilizing its feedback in your OOM detection rework.

For example, compaction will do its best with the current resources. But,
as I said before, compaction would be more powerful if the system had more
free memory. Your logic just guarantees to give it the minimum amount of
free memory to run, so I don't think its result is reliable for deciding
whether we are in OOM or not. And your logic doesn't consider how many
pages can be freed by slab shrinking. As I said before, there could exist
high order reclaimable pages, or we could make high order freepages by an
actual free. Most importantly, I think that it is fundamentally impossible
to anticipate whether we can make a high order freepage or not from a
snapshot of information about the number of freeable pages. So, your logic
relies on compaction, but there are many types of pages that cannot be
migrated by compaction but can be reclaimed. So, fully relying on the
compaction result for the OOM decision
Re: OOM detection regressions since 4.7
On Wed 24-08-16 14:01:57, Joonsoo Kim wrote:
> Looks like my mail client ate my reply so I resend.
>
> On Tue, Aug 23, 2016 at 09:33:18AM +0200, Michal Hocko wrote:
> > On Tue 23-08-16 13:52:45, Joonsoo Kim wrote:
> > [...]
> > > Hello, Michal.
> > >
> > > I agree with a partial revert, but the revert should take a different
> > > form. The change below tries to reuse the should_compact_retry()
> > > version for !CONFIG_COMPACTION, but it turned out that it also causes
> > > a regression in Markus' report [1].
> >
> > I would argue that CONFIG_COMPACTION=n behaves so arbitrarily for high
> > order workloads that calling any change in that behavior a regression
> > is a little bit exaggerated. Disabling compaction should have a very
> > strong reason. I haven't heard any so far. I am even wondering whether
> > there is a legitimate reason for that these days.
> >
> > > The theoretical reason for this regression is that it would stop
> > > retrying even if there are enough lru pages. It only checks whether
> > > the free pages exceed the min watermark or not for the retry decision.
> > > To prevent a premature OOM killer, we need to keep the allocation loop
> > > going when there are enough lru pages. So, the logic should be
> > > something like this:
> > >
> > > should_compact_retry()
> > > {
> > >     for_each_zone_zonelist_nodemask {
> > >         available = zone_reclaimable_pages(zone);
> > >         available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> > >         if (__zone_watermark_ok(zone, *0*, min_wmark_pages(zone),
> > >                 ac_classzone_idx(ac), alloc_flags, available))
> > >             return true;
> > >     }
> > > }
> > >
> > > I suggested it before and the current situation looks like it is
> > > indeed needed.
> >
> > this just opens the door for unbounded reclaim/thrashing because you
> > can reclaim as much as you like and there is no guarantee of forward
> > progress. The reason why the !COMPACTION should_compact_retry only
> > checks for the min_wmark without the reclaimable bias is that this will
> > guarantee a retry if we are failing due to the high order wmark check
> > rather than a lack of memory. This condition is guaranteed to converge
> > and the probability of unbounded reclaim is much reduced.
>
> In the case of a lack of memory with a lot of reclaimable lru pages, why
> do we stop reclaim/compaction?
>
> With your partial reverting patch, the allocation logic would be as
> follows.
>
> Assume the following situation:
> o a lot of reclaimable lru pages
> o no order-2 freepage
> o not enough order-0 freepages for the min watermark
> o an order-2 allocation
>
> 1. the order-2 allocation fails due to the min watermark
> 2. go to reclaim/compaction
> 3. reclaim some pages (maybe SWAP_CLUSTER_MAX (32) pages) but the min
>    watermark still isn't met for order-0
> 4. compaction is skipped due to not enough freepages
> 5. should_reclaim_retry() returns false because the min watermark for an
>    order-2 page isn't met
> 6. should_compact_retry() returns false because the min watermark for an
>    order-0 page isn't met
> 7. the allocation fails without any retry and the OOM killer is invoked.

If direct reclaim is not able to get us over the min wmark for order-0
then we would be likely to hit the oom even for order-0 requests.

> Is it what you want?
>
> And, please elaborate more on how your logic guarantees convergence.
> After the order-0 freepages exceed the min watermark, there is no way to
> stop reclaim/thrashing. The number of freepages just increases
> monotonically and the retry cannot be stopped until the order-2
> allocation succeeds. Am I missing something?

My statement was imprecise at best. You are right that there is no
guarantee to fulfil an order-2 request. What I meant to say is that we
should converge when we are getting out of memory (aka even order-0 would
have a hard time succeeding). should_reclaim_retry does that by the
back-off scaling of the reclaimable pages. should_compact_retry would have
to do the same thing, which would effectively turn it into
should_reclaim_retry.

> > > And, I still think that your OOM detection rework has some flaws.
> > >
> > > 1) It doesn't consider freeable objects that can be freed by
> > > shrink_slab().
> > > There are many subsystems that cache many objects and they will be
> > > freed by the shrink_slab() interface. But you don't account for them
> > > when making the OOM decision.
> >
> > I fully rely on the reclaim and compaction feedback. And that is the
> > place where we should strive for improvements. So if we are growing way
> > too many slab objects we should take care of that in the slab reclaim,
> > which is tightly coupled with the LRU reclaim, rather than up the layer
> > in the page allocator.
>
> No. The slab shrink logic being tightly coupled with the LRU reclaim
> totally makes sense.

Once the number of slab objects is much larger than the number of LRU
pages (which we have seen in some oom reports), the way they are coupled
just stops making sense, because the current
Re: OOM detection regressions since 4.7
On Wed 24-08-16 14:01:57, Joonsoo Kim wrote: > Looks like my mail client eat my reply so I resend. > > On Tue, Aug 23, 2016 at 09:33:18AM +0200, Michal Hocko wrote: > > On Tue 23-08-16 13:52:45, Joonsoo Kim wrote: > > [...] > > > Hello, Michal. > > > > > > I agree with partial revert but revert should be a different form. > > > Below change try to reuse should_compact_retry() version for > > > !CONFIG_COMPACTION but it turned out that it also causes regression in > > > Markus report [1]. > > > > I would argue that CONFIG_COMPACTION=n behaves so arbitrary for high > > order workloads that calling any change in that behavior a regression > > is little bit exaggerated. Disabling compaction should have a very > > strong reason. I haven't heard any so far. I am even wondering whether > > there is a legitimate reason for that these days. > > > > > Theoretical reason for this regression is that it would stop retry > > > even if there are enough lru pages. It only checks if freepage > > > excesses min watermark or not for retry decision. To prevent > > > pre-mature OOM killer, we need to keep allocation loop when there are > > > enough lru pages. So, logic should be something like that. > > > > > > should_compact_retry() > > > { > > > for_each_zone_zonelist_nodemask { > > > available = zone_reclaimable_pages(zone); > > > available += zone_page_state_snapshot(zone, > > > NR_FREE_PAGES); > > > if (__zone_watermark_ok(zone, *0*, min_wmark_pages(zone), > > > ac_classzone_idx(ac), alloc_flags, available)) > > > return true; > > > > > > } > > > } > > > > > > I suggested it before and current situation looks like it is indeed > > > needed. > > > > this just opens doors for an unbounded reclaim/threshing becacause > > you can reclaim as much as you like and there is no guarantee of a > > forward progress. 
The reason why the !COMPACTION should_compact_retry only > > checks for the min_wmark without the reclaimable bias is that this will > > guarantee a retry if we are failing due to the high order wmark check rather > > than a lack of memory. This condition is guaranteed to converge and the > > probability of the unbounded reclaim is much more reduced. > > In case of a lack of memory with a lot of reclaimable lru pages, why > do we stop reclaim/compaction? > > With your partial reverting patch, allocation logic would be like the > following. > > Assume the following situation: > o a lot of reclaimable lru pages > o no order-2 freepage > o not enough order-0 freepage for the min watermark > o order-2 allocation > > 1. order-2 allocation failed due to the min watermark > 2. go to reclaim/compaction > 3. reclaim some pages (maybe SWAP_CLUSTER_MAX (32) pages) but still > the min watermark isn't met for order-0 > 4. compaction is skipped due to not enough freepage > 5. should_reclaim_retry() returns false because the min watermark for > an order-2 page isn't met > 6. should_compact_retry() returns false because the min watermark for > an order-0 page isn't met > 7. allocation fails without any retry and OOM is invoked. If the direct reclaim is not able to get us over the min wmark for order-0 then we would be likely to hit the oom even for order-0 requests. > Is this what you want? > > And, please elaborate more on how your logic guarantees convergence. > After the order-0 freepage count exceeds the min watermark, there is no way to stop > reclaim/thrashing. The number of freepages just increases monotonically and > retry cannot be stopped until the order-2 allocation succeeds. Am I missing > something? My statement was imprecise at best. You are right that there is no guarantee to fulfill an order-2 request. What I meant to say is that we should converge when we are getting out of memory (aka even order-0 would have a hard time succeeding). should_reclaim_retry does that by the back-off scaling of the reclaimable pages.
should_compact_retry would have to do the same thing which would effectively turn it into should_reclaim_retry. > > > And, I still think that your OOM detection rework has some flaws. > > > > > > 1) It doesn't consider freeable objects that can be freed by > > > shrink_slab(). > > > There are many subsystems that cache many objects and they will be > > > freed by the shrink_slab() interface. But, you don't account them when > > > making the OOM decision. > > > > I fully rely on the reclaim and compaction feedback. And that is the > > place where we should strive for improvements. So if we are growing way > > too many slab objects we should take care about that in the slab reclaim > > which is tightly coupled with the LRU reclaim rather than up the layer > > in the page allocator. > > No. slab shrink logic which is tightly coupled with the LRU reclaim > totally makes sense. Once the number of slab objects is much larger than the number of LRU pages (which we have seen in some oom reports) then the way they are coupled just stops making sense because the current
Re: OOM detection regressions since 4.7
On Tue 23-08-16 15:08:05, Linus Torvalds wrote: > On Tue, Aug 23, 2016 at 3:33 AM, Michal Hocko wrote: > > > > I would argue that CONFIG_COMPACTION=n behaves so arbitrary for high > > order workloads that calling any change in that behavior a regression > > is little bit exaggerated. > > Well, the thread info allocations certainly haven't been big problems > before. So regressing those would seem to be a real regression. > > What happened? We've done the order-2 allocation for the stack since > May 2014, so that isn't new. Did we cut off retries for low orders? Yes, with the original implementation the number of reclaim retries was basically unbounded as long as we had reclaim progress. This has changed to be a bounded process. Without compaction this means that we would reclaim until an order-2 page was formed. > So I would not say that it's an exaggeration to say that order-2 > allocations failing is a regression. I would agree with you with COMPACTION enabled, but with compaction disabled, which should really be limited to !MMU configurations, I think there is not much we can do. Well, we could simply retry forever without invoking the OOM killer for higher order requests for this config option and rely on order-0 to hit the OOM. Do we want that though? I do not remember anybody with !MMU complaining. Markus had COMPACTION disabled accidentally. -- Michal Hocko SUSE Labs
Re: OOM detection regressions since 4.7
On Tue, Aug 23, 2016 at 3:33 AM, Michal Hocko wrote: > > I would argue that CONFIG_COMPACTION=n behaves so arbitrary for high > order workloads that calling any change in that behavior a regression > is little bit exaggerated. Well, the thread info allocations certainly haven't been big problems before. So regressing those would seem to be a real regression. What happened? We've done the order-2 allocation for the stack since May 2014, so that isn't new. Did we cut off retries for low orders? So I would not say that it's an exaggeration to say that order-2 allocations failing is a regression. Yes, yes, for 4.9 we may well end up using vmalloc for the kernel stack, but there are certainly other things that want low-order (non-hugepage) allocations. Like kmalloc(), which often ends up using small orders just to pack data more efficiently (allocating a single page can be hugely wasteful even if the individual allocations are smaller than that - so allocating a few pages and packing more allocations into it helps fight internal fragmentation) So this definitely needs to be fixed for 4.7 (and apparently there's a few patches still pending even for 4.8) Linus
Re: OOM detection regressions since 4.7
On Mon 22-08-16 15:05:17, Andrew Morton wrote: > On Mon, 22 Aug 2016 15:42:28 +0200 Michal Hocko wrote: > > > Of course, if Linus/Andrew doesn't like to take those compaction > > improvements this late then I will ask to merge the partial revert to > > Linus tree as well and then there is not much to discuss. > > This sounds like the prudent option. Can we get 4.8 working > well-enough, backport that into 4.7.x and worry about the fancier stuff > for 4.9? OK, fair enough. I would really appreciate if the original reporters could retest with this patch on top of the current Linus tree. The stable backport posted earlier doesn't apply on the current master cleanly but the change is essentially the same. mmotm tree can then revert this patch before Vlastimil's series is applied because that code is touching the currently removed code. --- >From 90b6b282bede7966fb6c830a6d012d2239ac40e4 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Mon, 22 Aug 2016 10:52:06 +0200 Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation for high order request There have been several reports about pre-mature OOM killer invocation in 4.7 kernel when order-2 allocation request (for the kernel stack) invoked OOM killer even during basic workloads (light IO or even kernel compile on some filesystems). In all reported cases the memory is fragmented and there are no order-2+ pages available. There is usually a large amount of slab memory (usually dentries/inodes) and further debugging has shown that there are way too many unmovable blocks which are skipped during the compaction. Multiple reporters have confirmed that the current linux-next which includes [1] and [2] helped and OOMs are not reproducible anymore. A simpler fix for the late rc and stable is to simply ignore the compaction feedback and retry as long as there is a reclaim progress and we are not getting OOM for order-0 pages.
We already do that for CONFING_COMPACTION=n so let's reuse the same code when compaction is enabled as well. [1] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz [2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection") Signed-off-by: Michal Hocko --- mm/page_alloc.c | 51 ++- 1 file changed, 2 insertions(+), 49 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3fbe73a6fe4b..7791a03f8deb 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3137,54 +3137,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, return NULL; } -static inline bool -should_compact_retry(struct alloc_context *ac, int order, int alloc_flags, -enum compact_result compact_result, -enum compact_priority *compact_priority, -int compaction_retries) -{ - int max_retries = MAX_COMPACT_RETRIES; - - if (!order) - return false; - - /* -* compaction considers all the zone as desperately out of memory -* so it doesn't really make much sense to retry except when the -* failure could be caused by insufficient priority -*/ - if (compaction_failed(compact_result)) { - if (*compact_priority > MIN_COMPACT_PRIORITY) { - (*compact_priority)--; - return true; - } - return false; - } - - /* -* make sure the compaction wasn't deferred or didn't bail out early -* due to locks contention before we declare that we should give up. -* But do not retry if the given zonelist is not suitable for -* compaction. -*/ - if (compaction_withdrawn(compact_result)) - return compaction_zonelist_suitable(ac, order, alloc_flags); - - /* -* !costly requests are much more important than __GFP_REPEAT -* costly ones because they are de facto nofail and invoke OOM -* killer to move on while costly can fail and users are ready -* to cope with that. 1/4 retries is rather arbitrary but we -* would need much more detailed feedback from compaction to -* make a better decision. 
-*/ - if (order > PAGE_ALLOC_COSTLY_ORDER) - max_retries /= 4; - if (compaction_retries <= max_retries) - return true; - - return false; -} #else static inline struct page * __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, @@ -3195,6 +3147,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, return NULL; } +#endif /* CONFIG_COMPACTION */ + static inline bool should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags, enum compact_result compact_result, @@ -3221,7 +3175,6 @@ should_compact_retry(struct alloc_context *ac,
Re: OOM detection regressions since 4.7
On Tue 23-08-16 09:40:14, Markus Trippelsdorf wrote: > On 2016.08.23 at 09:33 +0200, Michal Hocko wrote: > > On Tue 23-08-16 13:52:45, Joonsoo Kim wrote: > > [...] > > > Hello, Michal. > > > > > > I agree with partial revert but revert should be a different form. > > > Below change try to reuse should_compact_retry() version for > > > !CONFIG_COMPACTION but it turned out that it also causes regression in > > > Markus report [1]. > > > > I would argue that CONFIG_COMPACTION=n behaves so arbitrary for high > > order workloads that calling any change in that behavior a regression > > is little bit exaggerated. Disabling compaction should have a very > > strong reason. I haven't heard any so far. I am even wondering whether > > there is a legitimate reason for that these days. > > BTW, the current config description: > > CONFIG_COMPACTION: > Allows the compaction of memory for the allocation of huge pages. > > doesn't make it clear to the user that this is an essential feature. Yes I plan to send a clarification patch. -- Michal Hocko SUSE Labs
Re: OOM detection regressions since 4.7
On 2016.08.23 at 09:33 +0200, Michal Hocko wrote: > On Tue 23-08-16 13:52:45, Joonsoo Kim wrote: > [...] > > Hello, Michal. > > > > I agree with partial revert but revert should be a different form. > > Below change try to reuse should_compact_retry() version for > > !CONFIG_COMPACTION but it turned out that it also causes regression in > > Markus report [1]. > > I would argue that CONFIG_COMPACTION=n behaves so arbitrary for high > order workloads that calling any change in that behavior a regression > is little bit exaggerated. Disabling compaction should have a very > strong reason. I haven't heard any so far. I am even wondering whether > there is a legitimate reason for that these days. BTW, the current config description: CONFIG_COMPACTION: Allows the compaction of memory for the allocation of huge pages. doesn't make it clear to the user that this is an essential feature. -- Markus
Re: OOM detection regressions since 4.7
On Tue 23-08-16 13:52:45, Joonsoo Kim wrote: [...] > Hello, Michal. > > I agree with partial revert but revert should be a different form. > Below change tries to reuse the should_compact_retry() version for > !CONFIG_COMPACTION but it turned out that it also causes regression in > Markus report [1]. I would argue that CONFIG_COMPACTION=n behaves so arbitrary for high order workloads that calling any change in that behavior a regression is little bit exaggerated. Disabling compaction should have a very strong reason. I haven't heard any so far. I am even wondering whether there is a legitimate reason for that these days. > Theoretical reason for this regression is that it would stop retry > even if there are enough lru pages. It only checks if freepage > exceeds the min watermark or not for the retry decision. To prevent > pre-mature OOM killer, we need to keep the allocation loop going when there are > enough lru pages. So, logic should be something like that. > > should_compact_retry() > { > for_each_zone_zonelist_nodemask { > available = zone_reclaimable_pages(zone); > available += zone_page_state_snapshot(zone, NR_FREE_PAGES); > if (__zone_watermark_ok(zone, *0*, min_wmark_pages(zone), > ac_classzone_idx(ac), alloc_flags, available)) > return true; > > } > } > > I suggested it before and the current situation looks like it is indeed > needed. this just opens doors for an unbounded reclaim/thrashing because you can reclaim as much as you like and there is no guarantee of forward progress. The reason why the !COMPACTION should_compact_retry only checks for the min_wmark without the reclaimable bias is that this will guarantee a retry if we are failing due to the high order wmark check rather than a lack of memory. This condition is guaranteed to converge and the probability of the unbounded reclaim is much more reduced. > And, I still think that your OOM detection rework has some flaws. > > 1) It doesn't consider freeable objects that can be freed by shrink_slab().
> There are many subsystems that cache many objects and they will be > freed by the shrink_slab() interface. But, you don't account them when > making the OOM decision. I fully rely on the reclaim and compaction feedback. And that is the place where we should strive for improvements. So if we are growing way too many slab objects we should take care about that in the slab reclaim which is tightly coupled with the LRU reclaim rather than up the layer in the page allocator. > Think about the following situation where we are trying to find an order-2 > freepage and some subsystem has an order-2 freepage. It can be freed by > shrink_slab(). Your logic doesn't guarantee that shrink_slab() is > invoked to free this order-2 freepage in that subsystem. OOM would be > triggered when compaction fails even if there is an order-2 freeable > page. I think that if the decision is made before the whole lru list is > scanned and then shrink_slab() is invoked for all freeable objects, > it would cause pre-mature OOM. I do not see why we would need to scan through the whole LRU list when we are under a high order pressure. It is true, though, that slab shrinkers can and should be more sensitive to the requested order to help release higher order pages preferably. > It seems that you already know this issue [2]. > > 2) 'OOM detection rework' depends on compaction too much. Compaction > algorithm is racy and has some limitations. Its failure doesn't mean we > are in an OOM situation. As long as this is the only reliable source of higher order pages then we do not have any other choice in order to have deterministic behavior. > Even if Vlastimil's patchset and mine are > applied, it is still possible that the compaction scanner cannot find enough > freepages due to a race condition and return a pre-mature failure. To > reduce this race effect, I hope to give more chances to retry even if > full compaction has failed.
Then we can improve the compaction_failed() heuristic and not call it the end of the day after a single attempt to get a high order page after scanning the whole memory. But to me this all sounds like an internal implementation detail of the compaction, and the OOM detection in the page allocator should be as independent of it as possible - same as it is independent of the internal reclaim decisions. That was the whole point of my rework. To actually melt "do something as long as at least a single page is reclaimed" into an actual algorithm which can be measured and reasoned about. > We can remove this heuristic when we make sure that compaction is > stable enough. How do we know that, though, if we do not rely on it? Artificial tests do not exhibit those corner cases. I was bashing my testing systems to cause as much fragmentation as possible, yet I wasn't able to trigger the issues reported recently by real world workloads. Do not take me wrong, I understand your concerns but OOM
Re: OOM detection regressions since 4.7
On Mon, Aug 22, 2016 at 11:32:49AM +0200, Michal Hocko wrote: > Hi, > there have been multiple reports [1][2][3][4][5] about pre-mature OOM > killer invocations since 4.7 which contains oom detection rework. All of > them were for order-2 (kernel stack) allocation requests failing because > of a high fragmentation and compaction failing to make any forward > progress. While investigating this we have found out that the compaction > just gives up too early. Vlastimil has been working on compaction > improvement for quite some time and his series [6] is already sitting > in mmotm tree. This already helps a lot because it drops some heuristics > which are more aimed at lower latencies for high orders rather than > reliability. Joonsoo has then identified further problem with too many > blocks being marked as unmovable [7] and Vlastimil has prepared a patch > on top of his series [8] which is also in the mmotm tree now. > > That being said, the regression is real and should be fixed for 4.7 > stable users. [6][8] were reported to help and ooms are no longer > reproducible. I know we are quite late (rc3) in 4.8 but I would vote > for merging those patches and have them in 4.8. For 4.7 I would go > with a partial revert of the detection rework for high order requests > (see patch below). This patch is really trivial. If those compaction > improvements are just too large for 4.8 then we can use the same patch > as for 4.7 stable for now and revert it in 4.9 after compaction changes > are merged. > > Thoughts?
> > [1] http://lkml.kernel.org/r/20160731051121.GB307@x4 > [2] http://lkml.kernel.org/r/201608120901.41463.a.miskiew...@gmail.com > [3] http://lkml.kernel.org/r/20160801192620.gd31...@dhcp22.suse.cz > [4] https://lists.opensuse.org/opensuse-kernel/2016-08/msg00021.html > [5] https://bugzilla.opensuse.org/show_bug.cgi?id=994066 > [6] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz > [7] http://lkml.kernel.org/r/20160816031222.GC16913@js1304-P5Q-DELUXE > [8] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz > > --- > From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Mon, 22 Aug 2016 10:52:06 +0200 > Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation for high > order request > > There have been several reports about pre-mature OOM killer invocation > in 4.7 kernel when order-2 allocation request (for the kernel stack) > invoked OOM killer even during basic workloads (light IO or even kernel > compile on some filesystems). In all reported cases the memory is > fragmented and there are no order-2+ pages available. There is usually > a large amount of slab memory (usually dentries/inodes) and further > debugging has shown that there are way too many unmovable blocks which > are skipped during the compaction. Multiple reporters have confirmed that > the current linux-next which includes [1] and [2] helped and OOMs are > not reproducible anymore. A simpler fix for the stable is to simply > ignore the compaction feedback and retry as long as there is a reclaim > progress for high order requests which we used to do before. We already > do that for CONFIG_COMPACTION=n so let's reuse the same code when > compaction is enabled as well. Hello, Michal. I agree with a partial revert, but the revert should take a different form. The change below tries to reuse the should_compact_retry() version for !CONFIG_COMPACTION, but it turned out that it also causes a regression in Markus's report [1]. 
The theoretical reason for this regression is that it would stop retrying even if there are enough LRU pages: it only checks whether free pages exceed the min watermark when making the retry decision. To prevent a premature OOM kill, we need to keep the allocation loop going while there are enough LRU pages. So the logic should be something like this: should_compact_retry() { for_each_zone_zonelist_nodemask { available = zone_reclaimable_pages(zone); available += zone_page_state_snapshot(zone, NR_FREE_PAGES); if (__zone_watermark_ok(zone, *0*, min_wmark_pages(zone), ac_classzone_idx(ac), alloc_flags, available)) return true; } } I suggested this before, and the current situation suggests it is indeed needed. And I still think that your OOM detection rework has some flaws. 1) It doesn't consider freeable objects that can be freed by shrink_slab(). Many subsystems cache large numbers of objects, and those will be freed via the shrink_slab() interface, but you don't account for them when making the OOM decision. Think about the following situation: we are trying to find an order-2 freepage and some subsystem holds an order-2 freepage that could be freed by shrink_slab(). Your logic doesn't guarantee that shrink_slab() is invoked to free that order-2 freepage in that subsystem, so OOM would be triggered when compaction fails even though there is an order-2 freeable page. I think that
Re: OOM detection regressions since 4.7
On Mon, 22 Aug 2016 15:42:28 +0200 Michal Hocko wrote: > Of course, if Linus/Andrew doesn't like to take those compaction > improvements this late then I will ask to merge the partial revert to > Linus tree as well and then there is not much to discuss. This sounds like the prudent option. Can we get 4.8 working well enough, backport that into 4.7.x, and worry about the fancier stuff for 4.9?
Re: OOM detection regressions since 4.7
On Mon, Aug 22, 2016 at 03:42:28PM +0200, Michal Hocko wrote: > On Mon 22-08-16 09:31:14, Greg KH wrote: > > On Mon, Aug 22, 2016 at 12:54:41PM +0200, Michal Hocko wrote: > > > On Mon 22-08-16 06:05:28, Greg KH wrote: > > > > On Mon, Aug 22, 2016 at 11:37:07AM +0200, Michal Hocko wrote: > > > [...] > > > > > > From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 > > > > > > 2001 > > > > > > From: Michal Hocko> > > > > > Date: Mon, 22 Aug 2016 10:52:06 +0200 > > > > > > Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation > > > > > > for high > > > > > > order request > > > > > > > > > > > > There have been several reports about pre-mature OOM killer > > > > > > invocation > > > > > > in 4.7 kernel when order-2 allocation request (for the kernel stack) > > > > > > invoked OOM killer even during basic workloads (light IO or even > > > > > > kernel > > > > > > compile on some filesystems). In all reported cases the memory is > > > > > > fragmented and there are no order-2+ pages available. There is > > > > > > usually > > > > > > a large amount of slab memory (usually dentries/inodes) and further > > > > > > debugging has shown that there are way too many unmovable blocks > > > > > > which > > > > > > are skipped during the compaction. Multiple reporters have > > > > > > confirmed that > > > > > > the current linux-next which includes [1] and [2] helped and OOMs > > > > > > are > > > > > > not reproducible anymore. A simpler fix for the stable is to simply > > > > > > ignore the compaction feedback and retry as long as there is a > > > > > > reclaim > > > > > > progress for high order requests which we used to do before. We > > > > > > already > > > > > > do that for CONFING_COMPACTION=n so let's reuse the same code when > > > > > > compaction is enabled as well. 
> > > > > > > > > > > > [1] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz > > > > > > [2] > > > > > > http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz > > > > > > > > > > > > Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection") > > > > > > Signed-off-by: Michal Hocko > > > > > > --- > > > > > > mm/page_alloc.c | 50 > > > > > > ++ > > > > > > 1 file changed, 2 insertions(+), 48 deletions(-) > > > > > > > > So, if this goes into Linus's tree, can you let sta...@vger.kernel.org > > > > know about it so we can add it to the 4.7-stable tree? Otherwise > > > > there's not much I can do here now, right? > > > > > > My plan would be actually to not push this to Linus because we have a > > > proper fix for Linus tree. It is just that the fix is quite large and I > > > felt like the stable should get the most simple fix possible, which is > > > this partial revert. So, what I am trying to tell is to push a non-linus > > > patch to stable as it is simpler. > > > > I _REALLY_ hate taking any patches that are not in Linus's tree as 90% > > of the time (well, almost always), it ends up being wrong and hurting us > > in the end. > > I do not like it either but if there is a simple and straightforward > workaround for stable while the upstream can go with the _proper_ fix > from the longer POV then I think this is perfectly justified. Stable > should be always about the simplest fix for the problem IMHO. No, stable should always be "what is in Linus's tree to get it fixed." Again, almost every time we try to "just do this simple thing instead" in a stable tree, it ends up being broken somehow. We have the history to back this up, look at our archives. I'll gladly take 10+ patches to resolve something, _if_ it actually resolves something. 
But, if we argue about it for a month or so, then we don't have to worry about it as everyone will be using 4.8 :) > Of course, if Linus/Andrew doesn't like to take those compaction > improvements this late then I will ask to merge the partial revert to > Linus tree as well and then there is not much to discuss. Ok, let me know how it goes and we can see what to do. thanks. greg k-h
Re: OOM detection regressions since 4.7
On Mon 22-08-16 09:31:14, Greg KH wrote: > On Mon, Aug 22, 2016 at 12:54:41PM +0200, Michal Hocko wrote: > > On Mon 22-08-16 06:05:28, Greg KH wrote: > > > On Mon, Aug 22, 2016 at 11:37:07AM +0200, Michal Hocko wrote: > > [...] > > > > > From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 2001 > > > > > From: Michal Hocko> > > > > Date: Mon, 22 Aug 2016 10:52:06 +0200 > > > > > Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation > > > > > for high > > > > > order request > > > > > > > > > > There have been several reports about pre-mature OOM killer invocation > > > > > in 4.7 kernel when order-2 allocation request (for the kernel stack) > > > > > invoked OOM killer even during basic workloads (light IO or even > > > > > kernel > > > > > compile on some filesystems). In all reported cases the memory is > > > > > fragmented and there are no order-2+ pages available. There is usually > > > > > a large amount of slab memory (usually dentries/inodes) and further > > > > > debugging has shown that there are way too many unmovable blocks which > > > > > are skipped during the compaction. Multiple reporters have confirmed > > > > > that > > > > > the current linux-next which includes [1] and [2] helped and OOMs are > > > > > not reproducible anymore. A simpler fix for the stable is to simply > > > > > ignore the compaction feedback and retry as long as there is a reclaim > > > > > progress for high order requests which we used to do before. We > > > > > already > > > > > do that for CONFING_COMPACTION=n so let's reuse the same code when > > > > > compaction is enabled as well. 
> > > > > > > > > > [1] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz > > > > > [2] > > > > > http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz > > > > > > > > > > Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection") > > > > > Signed-off-by: Michal Hocko > > > > > --- > > > > > mm/page_alloc.c | 50 > > > > > ++ > > > > > 1 file changed, 2 insertions(+), 48 deletions(-) > > > So, if this goes into Linus's tree, can you let sta...@vger.kernel.org > > > know about it so we can add it to the 4.7-stable tree? Otherwise > > > there's not much I can do here now, right? > > My plan would be actually to not push this to Linus because we have a > > proper fix for Linus tree. It is just that the fix is quite large and I > > felt like the stable should get the most simple fix possible, which is > > this partial revert. So, what I am trying to tell is to push a non-linus > > patch to stable as it is simpler. > I _REALLY_ hate taking any patches that are not in Linus's tree as 90% > of the time (well, almost always), it ends up being wrong and hurting us > in the end. I do not like it either, but if there is a simple and straightforward workaround for stable while upstream can go with the _proper_ fix from the longer-term POV, then I think this is perfectly justified. Stable should always be about the simplest fix for the problem, IMHO. Of course, if Linus/Andrew doesn't like to take those compaction improvements this late then I will ask to merge the partial revert to Linus tree as well and then there is not much to discuss. > What exactly are the commits that are in Linus's tree that resolve this > issue? The initial email in this thread pointed to those patches. Please note that some of their dependencies (mostly code cleanups) are already merged and that backporting without them would make the backport harder and more risky. -- Michal Hocko SUSE Labs
Re: OOM detection regressions since 4.7
On Mon, Aug 22, 2016 at 12:54:41PM +0200, Michal Hocko wrote: > On Mon 22-08-16 06:05:28, Greg KH wrote: > > On Mon, Aug 22, 2016 at 11:37:07AM +0200, Michal Hocko wrote: > [...] > > > > From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 2001 > > > > From: Michal Hocko> > > > Date: Mon, 22 Aug 2016 10:52:06 +0200 > > > > Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation for > > > > high > > > > order request > > > > > > > > There have been several reports about pre-mature OOM killer invocation > > > > in 4.7 kernel when order-2 allocation request (for the kernel stack) > > > > invoked OOM killer even during basic workloads (light IO or even kernel > > > > compile on some filesystems). In all reported cases the memory is > > > > fragmented and there are no order-2+ pages available. There is usually > > > > a large amount of slab memory (usually dentries/inodes) and further > > > > debugging has shown that there are way too many unmovable blocks which > > > > are skipped during the compaction. Multiple reporters have confirmed > > > > that > > > > the current linux-next which includes [1] and [2] helped and OOMs are > > > > not reproducible anymore. A simpler fix for the stable is to simply > > > > ignore the compaction feedback and retry as long as there is a reclaim > > > > progress for high order requests which we used to do before. We already > > > > do that for CONFING_COMPACTION=n so let's reuse the same code when > > > > compaction is enabled as well. 
> > > > > > > > [1] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz > > > > [2] > > > > http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz > > > > > > > > Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection") > > > > Signed-off-by: Michal Hocko > > > > --- > > > > mm/page_alloc.c | 50 ++ > > > > 1 file changed, 2 insertions(+), 48 deletions(-) > > > > So, if this goes into Linus's tree, can you let sta...@vger.kernel.org > > know about it so we can add it to the 4.7-stable tree? Otherwise > > there's not much I can do here now, right? > > My plan would be actually to not push this to Linus because we have a > proper fix for Linus tree. It is just that the fix is quite large and I > felt like the stable should get the most simple fix possible, which is > this partial revert. So, what I am trying to tell is to push a non-linus > patch to stable as it is simpler. I _REALLY_ hate taking any patches that are not in Linus's tree as 90% of the time (well, almost always), it ends up being wrong and hurting us in the end. What exactly are the commits that are in Linus's tree that resolve this issue? thanks, greg k-h
Re: OOM detection regressions since 4.7
On 2016.08.22 at 13:13 +0200, Michal Hocko wrote: > On Mon 22-08-16 13:01:13, Markus Trippelsdorf wrote: > > On 2016.08.22 at 12:56 +0200, Michal Hocko wrote: > > > On Mon 22-08-16 12:16:14, Markus Trippelsdorf wrote: > > > > On 2016.08.22 at 11:32 +0200, Michal Hocko wrote: > > > > > [1] http://lkml.kernel.org/r/20160731051121.GB307@x4 > > > > > > > > For the report [1] above: > > > > > > > > markus@x4 linux % cat .config | grep CONFIG_COMPACTION > > > > # CONFIG_COMPACTION is not set > > > > > > Hmm, without compaction and a heavy fragmentation then I am afraid we > > > cannot really do much. What is the reason to disable compaction in the > > > first place? > > > > I don't recall. Must have been some issue in the past. I will re-enable > > the option. > > Well, without the compaction there is no source of high order pages at > all. You can only reclaim and hope that some of the reclaimed pages will > find its buddy on the list and form the higher order page. This can take > for ever. We used to have the lumpy reclaim and that could help but this > is long gone. > > I do not think we can really sanely optimize for high-order heavy loads > without COMPACTION sanely. At least not without reintroducing lumpy > reclaim or something similar. To be honest I am even not sure which > configurations should disable compaction - except for really highly > controlled !mmu or other one purpose systems. I now recall. It was an issue with CONFIG_TRANSPARENT_HUGEPAGE, so I disabled that option. This then de-selected CONFIG_COMPACTION... -- Markus
Re: OOM detection regressions since 4.7
On Mon 22-08-16 13:01:13, Markus Trippelsdorf wrote: > On 2016.08.22 at 12:56 +0200, Michal Hocko wrote: > > On Mon 22-08-16 12:16:14, Markus Trippelsdorf wrote: > > > On 2016.08.22 at 11:32 +0200, Michal Hocko wrote: > > > > [1] http://lkml.kernel.org/r/20160731051121.GB307@x4 > > > > > > For the report [1] above: > > > > > > markus@x4 linux % cat .config | grep CONFIG_COMPACTION > > > # CONFIG_COMPACTION is not set > > > > Hmm, without compaction and a heavy fragmentation then I am afraid we > > cannot really do much. What is the reason to disable compaction in the > > first place? > > I don't recall. Must have been some issue in the past. I will re-enable > the option. Well, without compaction there is no source of high-order pages at all. You can only reclaim and hope that some of the reclaimed pages will find their buddies on the free lists and form higher-order pages. This can take forever. We used to have lumpy reclaim, which could help, but that is long gone. I do not think we can sanely optimize for high-order-heavy loads without COMPACTION. At least not without reintroducing lumpy reclaim or something similar. To be honest, I am not even sure which configurations should disable compaction - except for really tightly controlled !mmu or other single-purpose systems. -- Michal Hocko SUSE Labs
Re: OOM detection regressions since 4.7
On 2016.08.22 at 12:56 +0200, Michal Hocko wrote: > On Mon 22-08-16 12:16:14, Markus Trippelsdorf wrote: > > On 2016.08.22 at 11:32 +0200, Michal Hocko wrote: > > > [1] http://lkml.kernel.org/r/20160731051121.GB307@x4 > > > > For the report [1] above: > > > > markus@x4 linux % cat .config | grep CONFIG_COMPACTION > > # CONFIG_COMPACTION is not set > > Hmm, without compaction and a heavy fragmentation then I am afraid we > cannot really do much. What is the reason to disable compaction in the > first place? I don't recall. Must have been some issue in the past. I will re-enable the option. -- Markus
Re: OOM detection regressions since 4.7
On Mon 22-08-16 12:16:14, Markus Trippelsdorf wrote: > On 2016.08.22 at 11:32 +0200, Michal Hocko wrote: > > there have been multiple reports [1][2][3][4][5] about pre-mature OOM > > killer invocations since 4.7 which contains oom detection rework. All of > > them were for order-2 (kernel stack) alloaction requests failing because > > of a high fragmentation and compaction failing to make any forward > > progress. While investigating this we have found out that the compaction > > just gives up too early. Vlastimil has been working on compaction > > improvement for quite some time and his series [6] is already sitting > > in mmotm tree. This already helps a lot because it drops some heuristics > > which are more aimed at lower latencies for high orders rather than > > reliability. Joonsoo has then identified further problem with too many > > blocks being marked as unmovable [7] and Vlastimil has prepared a patch > > on top of his series [8] which is also in the mmotm tree now. > > > > That being said, the regression is real and should be fixed for 4.7 > > stable users. [6][8] was reported to help and ooms are no longer > > reproducible. I know we are quite late (rc3) in 4.8 but I would vote > > for mergeing those patches and have them in 4.8. For 4.7 I would go > > with a partial revert of the detection rework for high order requests > > (see patch below). This patch is really trivial. If those compaction > > improvements are just too large for 4.8 then we can use the same patch > > as for 4.7 stable for now and revert it in 4.9 after compaction changes > > are merged. > > > > Thoughts? > > > > [1] http://lkml.kernel.org/r/20160731051121.GB307@x4 > > For the report [1] above: > > markus@x4 linux % cat .config | grep CONFIG_COMPACTION > # CONFIG_COMPACTION is not set Hmm, without compaction and a heavy fragmentation then I am afraid we cannot really do much. What is the reason to disable compaction in the first place? -- Michal Hocko SUSE Labs
Re: OOM detection regressions since 4.7
On Mon 22-08-16 06:05:28, Greg KH wrote: > On Mon, Aug 22, 2016 at 11:37:07AM +0200, Michal Hocko wrote: [...] > > > From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 2001 > > > From: Michal Hocko> > > Date: Mon, 22 Aug 2016 10:52:06 +0200 > > > Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation for > > > high > > > order request > > > > > > There have been several reports about pre-mature OOM killer invocation > > > in 4.7 kernel when order-2 allocation request (for the kernel stack) > > > invoked OOM killer even during basic workloads (light IO or even kernel > > > compile on some filesystems). In all reported cases the memory is > > > fragmented and there are no order-2+ pages available. There is usually > > > a large amount of slab memory (usually dentries/inodes) and further > > > debugging has shown that there are way too many unmovable blocks which > > > are skipped during the compaction. Multiple reporters have confirmed that > > > the current linux-next which includes [1] and [2] helped and OOMs are > > > not reproducible anymore. A simpler fix for the stable is to simply > > > ignore the compaction feedback and retry as long as there is a reclaim > > > progress for high order requests which we used to do before. We already > > > do that for CONFING_COMPACTION=n so let's reuse the same code when > > > compaction is enabled as well. > > > > > > [1] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz > > > [2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz > > > > > > Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection") > > > Signed-off-by: Michal Hocko > > > --- > > > mm/page_alloc.c | 50 ++ > > > 1 file changed, 2 insertions(+), 48 deletions(-) > > So, if this goes into Linus's tree, can you let sta...@vger.kernel.org > know about it so we can add it to the 4.7-stable tree? Otherwise > there's not much I can do here now, right? 
My plan is actually not to push this to Linus's tree, because we have a proper fix for his tree. It is just that the proper fix is quite large, and I felt stable should get the simplest fix possible, which is this partial revert. So what I am suggesting is to push a patch to stable that is not in Linus's tree, as it is simpler. -- Michal Hocko SUSE Labs
Re: OOM detection regressions since 4.7
On 2016.08.22 at 11:32 +0200, Michal Hocko wrote: > there have been multiple reports [1][2][3][4][5] about pre-mature OOM > killer invocations since 4.7 which contains oom detection rework. All of > them were for order-2 (kernel stack) alloaction requests failing because > of a high fragmentation and compaction failing to make any forward > progress. While investigating this we have found out that the compaction > just gives up too early. Vlastimil has been working on compaction > improvement for quite some time and his series [6] is already sitting > in mmotm tree. This already helps a lot because it drops some heuristics > which are more aimed at lower latencies for high orders rather than > reliability. Joonsoo has then identified further problem with too many > blocks being marked as unmovable [7] and Vlastimil has prepared a patch > on top of his series [8] which is also in the mmotm tree now. > > That being said, the regression is real and should be fixed for 4.7 > stable users. [6][8] was reported to help and ooms are no longer > reproducible. I know we are quite late (rc3) in 4.8 but I would vote > for mergeing those patches and have them in 4.8. For 4.7 I would go > with a partial revert of the detection rework for high order requests > (see patch below). This patch is really trivial. If those compaction > improvements are just too large for 4.8 then we can use the same patch > as for 4.7 stable for now and revert it in 4.9 after compaction changes > are merged. > > Thoughts? > > [1] http://lkml.kernel.org/r/20160731051121.GB307@x4 For the report [1] above: markus@x4 linux % cat .config | grep CONFIG_COMPACTION # CONFIG_COMPACTION is not set -- Markus
Re: OOM detection regressions since 4.7
On Mon, Aug 22, 2016 at 11:37:07AM +0200, Michal Hocko wrote: > [ups, fixing up Greg's email] > > On Mon 22-08-16 11:32:49, Michal Hocko wrote: > > Hi, > > there have been multiple reports [1][2][3][4][5] about pre-mature OOM > > killer invocations since 4.7 which contains oom detection rework. All of > > them were for order-2 (kernel stack) alloaction requests failing because > > of a high fragmentation and compaction failing to make any forward > > progress. While investigating this we have found out that the compaction > > just gives up too early. Vlastimil has been working on compaction > > improvement for quite some time and his series [6] is already sitting > > in mmotm tree. This already helps a lot because it drops some heuristics > > which are more aimed at lower latencies for high orders rather than > > reliability. Joonsoo has then identified further problem with too many > > blocks being marked as unmovable [7] and Vlastimil has prepared a patch > > on top of his series [8] which is also in the mmotm tree now. > > > > That being said, the regression is real and should be fixed for 4.7 > > stable users. [6][8] was reported to help and ooms are no longer > > reproducible. I know we are quite late (rc3) in 4.8 but I would vote > > for mergeing those patches and have them in 4.8. For 4.7 I would go > > with a partial revert of the detection rework for high order requests > > (see patch below). This patch is really trivial. If those compaction > > improvements are just too large for 4.8 then we can use the same patch > > as for 4.7 stable for now and revert it in 4.9 after compaction changes > > are merged. > > > > Thoughts? 
> > > > [1] http://lkml.kernel.org/r/20160731051121.GB307@x4 > > [2] http://lkml.kernel.org/r/201608120901.41463.a.miskiew...@gmail.com > > [3] http://lkml.kernel.org/r/20160801192620.gd31...@dhcp22.suse.cz > > [4] https://lists.opensuse.org/opensuse-kernel/2016-08/msg00021.html > > [5] https://bugzilla.opensuse.org/show_bug.cgi?id=994066 > > [6] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz > > [7] http://lkml.kernel.org/r/20160816031222.GC16913@js1304-P5Q-DELUXE > > [8] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz > > > > --- > > From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 2001 > > From: Michal Hocko> > Date: Mon, 22 Aug 2016 10:52:06 +0200 > > Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation for high > > order request > > > > There have been several reports about pre-mature OOM killer invocation > > in 4.7 kernel when order-2 allocation request (for the kernel stack) > > invoked OOM killer even during basic workloads (light IO or even kernel > > compile on some filesystems). In all reported cases the memory is > > fragmented and there are no order-2+ pages available. There is usually > > a large amount of slab memory (usually dentries/inodes) and further > > debugging has shown that there are way too many unmovable blocks which > > are skipped during the compaction. Multiple reporters have confirmed that > > the current linux-next which includes [1] and [2] helped and OOMs are > > not reproducible anymore. A simpler fix for the stable is to simply > > ignore the compaction feedback and retry as long as there is a reclaim > > progress for high order requests which we used to do before. We already > > do that for CONFING_COMPACTION=n so let's reuse the same code when > > compaction is enabled as well. 
> > > > [1] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz > > [2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz > > > > Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection") > > Signed-off-by: Michal Hocko > > --- > > mm/page_alloc.c | 50 ++ > > 1 file changed, 2 insertions(+), 48 deletions(-) So, if this goes into Linus's tree, can you let sta...@vger.kernel.org know about it so we can add it to the 4.7-stable tree? Otherwise there's not much I can do here now, right? thanks, greg k-h
Re: OOM detection regressions since 4.7
[ups, fixing up Greg's email] On Mon 22-08-16 11:32:49, Michal Hocko wrote: > Hi, > there have been multiple reports [1][2][3][4][5] about pre-mature OOM > killer invocations since 4.7 which contains oom detection rework. All of > them were for order-2 (kernel stack) alloaction requests failing because > of a high fragmentation and compaction failing to make any forward > progress. While investigating this we have found out that the compaction > just gives up too early. Vlastimil has been working on compaction > improvement for quite some time and his series [6] is already sitting > in mmotm tree. This already helps a lot because it drops some heuristics > which are more aimed at lower latencies for high orders rather than > reliability. Joonsoo has then identified further problem with too many > blocks being marked as unmovable [7] and Vlastimil has prepared a patch > on top of his series [8] which is also in the mmotm tree now. > > That being said, the regression is real and should be fixed for 4.7 > stable users. [6][8] was reported to help and ooms are no longer > reproducible. I know we are quite late (rc3) in 4.8 but I would vote > for mergeing those patches and have them in 4.8. For 4.7 I would go > with a partial revert of the detection rework for high order requests > (see patch below). This patch is really trivial. If those compaction > improvements are just too large for 4.8 then we can use the same patch > as for 4.7 stable for now and revert it in 4.9 after compaction changes > are merged. > > Thoughts? 
> > [1] http://lkml.kernel.org/r/20160731051121.GB307@x4 > [2] http://lkml.kernel.org/r/201608120901.41463.a.miskiew...@gmail.com > [3] http://lkml.kernel.org/r/20160801192620.gd31...@dhcp22.suse.cz > [4] https://lists.opensuse.org/opensuse-kernel/2016-08/msg00021.html > [5] https://bugzilla.opensuse.org/show_bug.cgi?id=994066 > [6] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz > [7] http://lkml.kernel.org/r/20160816031222.GC16913@js1304-P5Q-DELUXE > [8] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz > > --- > From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 2001 > From: Michal Hocko> Date: Mon, 22 Aug 2016 10:52:06 +0200 > Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation for high > order request > > There have been several reports about pre-mature OOM killer invocation > in 4.7 kernel when order-2 allocation request (for the kernel stack) > invoked OOM killer even during basic workloads (light IO or even kernel > compile on some filesystems). In all reported cases the memory is > fragmented and there are no order-2+ pages available. There is usually > a large amount of slab memory (usually dentries/inodes) and further > debugging has shown that there are way too many unmovable blocks which > are skipped during the compaction. Multiple reporters have confirmed that > the current linux-next which includes [1] and [2] helped and OOMs are > not reproducible anymore. A simpler fix for the stable is to simply > ignore the compaction feedback and retry as long as there is a reclaim > progress for high order requests which we used to do before. We already > do that for CONFING_COMPACTION=n so let's reuse the same code when > compaction is enabled as well. 
> > [1] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz > [2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz > > Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection") > Signed-off-by: Michal Hocko > --- > mm/page_alloc.c | 50 ++ > 1 file changed, 2 insertions(+), 48 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 8b3e1341b754..6e354199151b 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3254,53 +3254,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned > int order, > return NULL; > } > > -static inline bool > -should_compact_retry(struct alloc_context *ac, int order, int alloc_flags, > - enum compact_result compact_result, enum migrate_mode > *migrate_mode, > - int compaction_retries) > -{ > - int max_retries = MAX_COMPACT_RETRIES; > - > - if (!order) > - return false; > - > - /* > - * compaction considers all the zone as desperately out of memory > - * so it doesn't really make much sense to retry except when the > - * failure could be caused by weak migration mode. > - */ > - if (compaction_failed(compact_result)) { > - if (*migrate_mode == MIGRATE_ASYNC) { > - *migrate_mode = MIGRATE_SYNC_LIGHT; > - return true; > - } > - return false; > - } > - > - /* > - * make sure the compaction wasn't deferred or didn't bail out early > - * due to locks contention before we declare that we should give up. > - * But do not