Re: OOM detection regressions since 4.7

2016-08-29 Thread Jeff Layton
On Mon, 2016-08-29 at 10:28 -0700, Linus Torvalds wrote:
> > On Mon, Aug 29, 2016 at 7:52 AM, Olaf Hering  wrote:
> > 
> > 
> > Today I noticed the nfsserver was disabled, probably for months already.
> > Starting it gives an OOM; not sure if this is new with 4.7+.
> 
> That's not an oom, that's just an allocation failure.
> 
> And with order-4, that's actually pretty normal. Nobody should use
> order-4 (that's 16 contiguous pages, fragmentation can easily make
> that hard - *much* harder than the small order-1 or order-2 cases that
> we should largely be able to rely on).
> 
> In fact, people who do multi-order allocations should always have a
> fallback, and use __GFP_NOWARN.
> 
> > 
> > [93348.306406] Call Trace:
> > [93348.306490]  [] __alloc_pages_slowpath+0x1af/0xa10
> > [93348.306501]  [] __alloc_pages_nodemask+0x250/0x290
> > [93348.306511]  [] cache_grow_begin+0x8d/0x540
> > [93348.306520]  [] fallback_alloc+0x161/0x200
> > [93348.306530]  [] __kmalloc+0x1d2/0x570
> > [93348.306589]  [] nfsd_reply_cache_init+0xaa/0x110 [nfsd]
> 
> Hmm. That's kmalloc itself falling back after already failing to grow
> the slab cache earlier (the earlier allocations *were* done with
> NOWARN afaik).
> 
> It does look like nfsd starts out by allocating the hash table with one
> single fairly big allocation, and has no fallback position.
> 
> I suspect the code expects to be started at boot time, when this just
> isn't an issue. The fact that you loaded the nfsd kernel module with
> memory already fragmented after heavy use is likely why nobody else
> has seen this.
> 
> Adding the nfsd people to the cc, because just from a robustness
> standpoint I suspect it would be better if the code did something like
> 
>  (a) shrink the hash table if the allocation fails (we've got some
> examples of that elsewhere)
> 
> or
> 
>  (b) fall back on a vmalloc allocation (that's certainly the simpler model)
> 
> We do have a "kvfree()" helper function for the "free either a kmalloc
> or vmalloc allocation" but we don't actually have a good helper
> pattern for the allocation side. People just do it by hand, at least
> partly because we have so many different ways to allocate things -
> zeroing, non-zeroing, node-specific or not, atomic or not (atomic
> cannot fall back to vmalloc, obviously) etc etc.
> 
> Bruce, Jeff, comments?
> 
>  Linus

Yeah, that makes total sense.

Hmm...we _do_ already auto-size the hash at init time, so shrinking it
and retrying if the allocation fails wouldn't be hard to do. Maybe I
can just cut it in half and throw a pr_warn to tell the admin in that
case.

In any case...I'll take a look at how we can improve it.

Thanks for the heads-up!
-- 
Jeff Layton 


Re: OOM detection regressions since 4.7

2016-08-29 Thread Linus Torvalds
On Mon, Aug 29, 2016 at 7:52 AM, Olaf Hering  wrote:
>
> Today I noticed the nfsserver was disabled, probably for months already.
> Starting it gives an OOM; not sure if this is new with 4.7+.

That's not an oom, that's just an allocation failure.

And with order-4, that's actually pretty normal. Nobody should use
order-4 (that's 16 contiguous pages, fragmentation can easily make
that hard - *much* harder than the small order-1 or order-2 cases that
we should largely be able to rely on).

In fact, people who do multi-order allocations should always have a
fallback, and use __GFP_NOWARN.

> [93348.306406] Call Trace:
> [93348.306490]  [] __alloc_pages_slowpath+0x1af/0xa10
> [93348.306501]  [] __alloc_pages_nodemask+0x250/0x290
> [93348.306511]  [] cache_grow_begin+0x8d/0x540
> [93348.306520]  [] fallback_alloc+0x161/0x200
> [93348.306530]  [] __kmalloc+0x1d2/0x570
> [93348.306589]  [] nfsd_reply_cache_init+0xaa/0x110 [nfsd]

Hmm. That's kmalloc itself falling back after already failing to grow
the slab cache earlier (the earlier allocations *were* done with
NOWARN afaik).

It does look like nfsd starts out by allocating the hash table with one
single fairly big allocation, and has no fallback position.

I suspect the code expects to be started at boot time, when this just
isn't an issue. The fact that you loaded the nfsd kernel module with
memory already fragmented after heavy use is likely why nobody else
has seen this.

Adding the nfsd people to the cc, because just from a robustness
standpoint I suspect it would be better if the code did something like

 (a) shrink the hash table if the allocation fails (we've got some
examples of that elsewhere)

or

 (b) fall back on a vmalloc allocation (that's certainly the simpler model)

We do have a "kvfree()" helper function for the "free either a kmalloc
or vmalloc allocation" but we don't actually have a good helper
pattern for the allocation side. People just do it by hand, at least
partly because we have so many different ways to allocate things -
zeroing, non-zeroing, node-specific or not, atomic or not (atomic
cannot fall back to vmalloc, obviously) etc etc.

Bruce, Jeff, comments?

 Linus


Re: OOM detection regressions since 4.7

2016-08-29 Thread Olaf Hering
On Mon, Aug 29, Michal Hocko wrote:

> On Mon 29-08-16 16:52:03, Olaf Hering wrote:
> > I ran rc3 for a few hours on Friday and Firefox was not killed.
> > Now rc3 is running for a day with the usual workload and Firefox is
> > still running.
> Is the patch
> (http://lkml.kernel.org/r/20160823074339.gb23...@dhcp22.suse.cz) applied?

Yes.

Tested-by: Olaf Hering 

Olaf




Re: OOM detection regressions since 4.7

2016-08-29 Thread Michal Hocko
On Mon 29-08-16 16:52:03, Olaf Hering wrote:
> On Thu, Aug 25, Olaf Hering wrote:
> 
> > On Thu, Aug 25, Michal Hocko wrote:
> > 
> > > Any luck with the testing of this patch?
> 
> I ran rc3 for a few hours on Friday and Firefox was not killed.
> Now rc3 is running for a day with the usual workload and Firefox is
> still running.

Is the patch
(http://lkml.kernel.org/r/20160823074339.gb23...@dhcp22.suse.cz) applied?

> Today I noticed the nfsserver was disabled, probably for months already.
> Starting it gives an OOM; not sure if this is new with 4.7+.
> Full dmesg attached.
> [93348.306369] modprobe: page allocation failure: order:4, 
> mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)

OK, so the order-4 (COSTLY) allocation has failed because

[...]
> [93348.313778] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 
> 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 
> 15908kB
> [93348.313803] Node 0 DMA32: 13633*4kB (UME) 8035*8kB (UME) 890*16kB (UME) 
> 10*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 
> 133372kB
> [93348.313822] Node 0 Normal: 14003*4kB (UME) 25*8kB (UME) 2*16kB (UM) 0*32kB 
> 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 56244kB

the memory is too fragmented for such a large allocation. Failing
order-4 requests is not so severe because we do not invoke the oom
killer if they fail. Especially without __GFP_REPEAT we do not even try
too hard. Recent oom detection changes shouldn't change this behavior.

-- 
Michal Hocko
SUSE Labs


Re: OOM detection regressions since 4.7

2016-08-29 Thread Olaf Hering
On Mon, Aug 29, Olaf Hering wrote:

> Full dmesg attached.

Now..


dmesg-4.8.0-rc3-3.bug994066-default.txt.gz
Description: GNU Zip compressed data




Re: OOM detection regressions since 4.7

2016-08-29 Thread Olaf Hering
On Thu, Aug 25, Olaf Hering wrote:

> On Thu, Aug 25, Michal Hocko wrote:
> 
> > Any luck with the testing of this patch?

I ran rc3 for a few hours on Friday and Firefox was not killed.
Now rc3 is running for a day with the usual workload and Firefox is
still running.

Today I noticed the nfsserver was disabled, probably for months already.
Starting it gives an OOM; not sure if this is new with 4.7+.
Full dmesg attached.


[0.00] Linux version 4.8.0-rc3-3.bug994066-default (geeko@buildhost) 
(gcc version 6.1.1 20160815 [gcc-6-branch revision 239479] (SUSE Linux) ) #1 
SMP PREEMPT Mon Aug 22 14:52:18 UTC 2016 (c0d2ef5)

[64378.582489] tun: Universal TUN/TAP device driver, 1.6
[64378.582493] tun: (C) 1999-2004 Max Krasnyansky 
[93347.645123] RPC: Registered named UNIX socket transport module.
[93347.645128] RPC: Registered udp transport module.
[93347.645130] RPC: Registered tcp transport module.
[93347.645132] RPC: Registered tcp NFSv4.1 backchannel transport module.
[93348.227828] Installing knfsd (copyright (C) 1996 o...@monad.swb.de).
[93348.306369] modprobe: page allocation failure: order:4, 
mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
[93348.306379] CPU: 2 PID: 30467 Comm: modprobe Not tainted 
4.8.0-rc3-3.bug994066-default #1
[93348.306382] Hardware name: Hewlett-Packard HP ProBook 6555b/1455, BIOS 68DTM 
Ver. F.21 06/14/2012
[93348.306386]   813a2952 0004 
88003fb6ba30
[93348.306394]  81198a4b 026040cf 026040c1 
88003fb6c000
[93348.306400]  0004 88003fb6baac 026040c0 
0040
[93348.306406] Call Trace:
[93348.306437]  [] dump_trace+0x5e/0x310
[93348.306449]  [] show_stack_log_lvl+0x11b/0x1a0
[93348.306459]  [] show_stack+0x21/0x40
[93348.306468]  [] dump_stack+0x5c/0x7a
[93348.306478]  [] warn_alloc_failed+0xdb/0x150
[93348.306490]  [] __alloc_pages_slowpath+0x1af/0xa10
[93348.306501]  [] __alloc_pages_nodemask+0x250/0x290
[93348.306511]  [] cache_grow_begin+0x8d/0x540
[93348.306520]  [] fallback_alloc+0x161/0x200
[93348.306530]  [] __kmalloc+0x1d2/0x570
[93348.306589]  [] nfsd_reply_cache_init+0xaa/0x110 [nfsd]
[93348.306649]  [] init_nfsd+0x56/0xea0 [nfsd]
[93348.306664]  [] do_one_initcall+0x4b/0x180
[93348.306674]  [] do_init_module+0x5b/0x1fe
[93348.306684]  [] load_module+0x1a75/0x1d00
[93348.306695]  [] SYSC_finit_module+0xa4/0xe0
[93348.306705]  [] entry_SYSCALL_64_fastpath+0x1e/0xa8
[93348.313626] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xa8

[93348.313629] Leftover inexact backtrace:

[93348.313691] Mem-Info:
[93348.313704] active_anon:467209 inactive_anon:125491 isolated_anon:0
active_file:264880 inactive_file:166389 isolated_file:0
unevictable:8 dirty:250 writeback:0 unstable:0
slab_reclaimable:796425 slab_unreclaimable:34803
mapped:54783 shmem:24119 pagetables:9083 bounce:0
free:51321 free_pcp:68 free_cma:0
[93348.313717] Node 0 active_anon:1868836kB inactive_anon:501964kB 
active_file:1059520kB inactive_file:665556kB unevictable:32kB 
isolated(anon):0kB isolated(file):0kB mapped:219132kB dirty:1000kB 
writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 749568kB anon_thp: 
96476kB writeback_tmp:0kB unstable:0kB pages_scanned:24 all_unreclaimable? no
[93348.313719] Node 0 DMA free:15908kB min:136kB low:168kB high:200kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB 
slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB 
bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[93348.313729] lowmem_reserve[]: 0 2626 7621 7621 7621
[93348.313745] Node 0 DMA32 free:133192kB min:23244kB low:29052kB high:34860kB 
active_anon:642152kB inactive_anon:119848kB active_file:257900kB 
inactive_file:116560kB unevictable:0kB writepending:292kB present:2847412kB 
managed:2766832kB mlocked:0kB slab_reclaimable:1418576kB 
slab_unreclaimable:39004kB kernel_stack:256kB pagetables:1448kB bounce:0kB 
free_pcp:128kB local_pcp:0kB free_cma:0kB
[93348.313755] lowmem_reserve[]: 0 0 4994 4994 4994
[93348.313762] Node 0 Normal free:56184kB min:44200kB low:55248kB high:66296kB 
active_anon:1226576kB inactive_anon:382200kB active_file:801508kB 
inactive_file:548992kB unevictable:32kB writepending:536kB present:5242880kB 
managed:5114880kB mlocked:32kB slab_reclaimable:1767124kB 
slab_unreclaimable:100208kB kernel_stack:9104kB pagetables:34884kB bounce:0kB 
free_pcp:144kB local_pcp:0kB free_cma:0kB
[93348.313771] lowmem_reserve[]: 0 0 0 0 0
[93348.313778] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB 
(U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[93348.313803] Node 0 DMA32: 13633*4kB (UME) 8035*8kB (UME) 890*16kB (UME) 
10*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 133372kB
[93348.313822] Node 0 Normal: 14003*4kB (UME) 25*8kB (UME) 2*16kB (UM) 0*32kB 
0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 56244kB

Re: OOM detection regressions since 4.7

2016-08-27 Thread Arkadiusz Miskiewicz
On Thursday 25 of August 2016, Michal Hocko wrote:
> On Tue 23-08-16 09:43:39, Michal Hocko wrote:
> > On Mon 22-08-16 15:05:17, Andrew Morton wrote:
> > > On Mon, 22 Aug 2016 15:42:28 +0200 Michal Hocko  
wrote:
> > > > Of course, if Linus/Andrew doesn't like to take those compaction
> > > > improvements this late then I will ask to merge the partial revert to
> > > > Linus tree as well and then there is not much to discuss.
> > > 
> > > This sounds like the prudent option.  Can we get 4.8 working
> > > well-enough, backport that into 4.7.x and worry about the fancier stuff
> > > for 4.9?
> > 
> > OK, fair enough.
> > 
> > I would really appreciate if the original reporters could retest with
> > this patch on top of the current Linus tree.
> 
> Any luck with the testing of this patch?

Here my "rm -rf && cp -al" 10x in parallel test finished without OOM, so

Tested-by: Arkadiusz Miśkiewicz 

-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )


Re: OOM detection regressions since 4.7

2016-08-26 Thread Ralf-Peter Rohbeck

On 25.08.2016 23:26, Michal Hocko wrote:

On Thu 25-08-16 13:30:23, Ralf-Peter Rohbeck wrote:
[...]

This worked for me for about 12 hours of my torture test. Logs are at
https://filebin.net/2rfah407nbhzs69e/OOM_4.8.0-rc2_p1.tar.bz2 .

Thanks! Can we add your
Tested-by: Ralf-Peter Rohbeck 

to the patch?


Sure.


Ralf-Peter


--
The information contained in this transmission may be confidential. Any 
disclosure, copying, or further distribution of confidential information is not 
permitted unless such privilege is explicitly granted in writing by Quantum. 
Quantum reserves the right to have electronic communications, including email 
and attachments, sent across its networks filtered through anti virus and spam 
software programs and retain such messages in order to comply with applicable 
data security and retention requirements. Quantum is not responsible for the 
proper and complete transmission of the substance of this communication or for 
any delay in its receipt.


Re: OOM detection regressions since 4.7

2016-08-26 Thread Michal Hocko
On Thu 25-08-16 13:30:23, Ralf-Peter Rohbeck wrote:
[...]
> This worked for me for about 12 hours of my torture test. Logs are at
> https://filebin.net/2rfah407nbhzs69e/OOM_4.8.0-rc2_p1.tar.bz2.

Thanks! Can we add your
Tested-by: Ralf-Peter Rohbeck 

to the patch?
-- 
Michal Hocko
SUSE Labs


Re: OOM detection regressions since 4.7

2016-08-25 Thread Ralf-Peter Rohbeck

On 23.08.2016 00:43, Michal Hocko wrote:

OK, fair enough.
I would really appreciate if the original reporters could retest with
this patch on top of the current Linus tree. The stable backport posted
earlier doesn't apply on the current master cleanly but the change is
essentially same. mmotm tree then can revert this patch before Vlastimil
series is applied because that code is touching the currently removed
code.
---
 From 90b6b282bede7966fb6c830a6d012d2239ac40e4 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Mon, 22 Aug 2016 10:52:06 +0200
Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation for high
  order request

There have been several reports about pre-mature OOM killer invocation
in 4.7 kernel when order-2 allocation request (for the kernel stack)
invoked OOM killer even during basic workloads (light IO or even kernel
compile on some filesystems). In all reported cases the memory is
fragmented and there are no order-2+ pages available. There is usually
a large amount of slab memory (usually dentries/inodes) and further
debugging has shown that there are way too many unmovable blocks which
are skipped during the compaction. Multiple reporters have confirmed that
the current linux-next which includes [1] and [2] helped and OOMs are
not reproducible anymore.

A simpler fix for the late rc and stable is to simply ignore the
compaction feedback and retry as long as there is a reclaim progress
and we are not getting OOM for order-0 pages. We already do that for
CONFIG_COMPACTION=n so let's reuse the same code when compaction is
enabled as well.

[1] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz
[2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz

Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
Signed-off-by: Michal Hocko 
---
  mm/page_alloc.c | 51 ++-
  1 file changed, 2 insertions(+), 49 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3fbe73a6fe4b..7791a03f8deb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3137,54 +3137,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned 
int order,
return NULL;
  }
  
-static inline bool

-should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
-enum compact_result compact_result,
-enum compact_priority *compact_priority,
-int compaction_retries)
-{
-   int max_retries = MAX_COMPACT_RETRIES;
-
-   if (!order)
-   return false;
-
-   /*
-* compaction considers all the zone as desperately out of memory
-* so it doesn't really make much sense to retry except when the
-* failure could be caused by insufficient priority
-*/
-   if (compaction_failed(compact_result)) {
-   if (*compact_priority > MIN_COMPACT_PRIORITY) {
-   (*compact_priority)--;
-   return true;
-   }
-   return false;
-   }
-
-   /*
-* make sure the compaction wasn't deferred or didn't bail out early
-* due to locks contention before we declare that we should give up.
-* But do not retry if the given zonelist is not suitable for
-* compaction.
-*/
-   if (compaction_withdrawn(compact_result))
-   return compaction_zonelist_suitable(ac, order, alloc_flags);
-
-   /*
-* !costly requests are much more important than __GFP_REPEAT
-* costly ones because they are de facto nofail and invoke OOM
-* killer to move on while costly can fail and users are ready
-* to cope with that. 1/4 retries is rather arbitrary but we
-* would need much more detailed feedback from compaction to
-* make a better decision.
-*/
-   if (order > PAGE_ALLOC_COSTLY_ORDER)
-   max_retries /= 4;
-   if (compaction_retries <= max_retries)
-   return true;
-
-   return false;
-}
  #else
  static inline struct page *
  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
@@ -3195,6 +3147,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int 
order,
return NULL;
  }
  
+#endif /* CONFIG_COMPACTION */

+
  static inline bool
  should_compact_retry(struct alloc_context *ac, unsigned int order, int 
alloc_flags,
 enum compact_result compact_result,
@@ -3221,7 +3175,6 @@ 

Re: OOM detection regressions since 4.7

2016-08-25 Thread Ralf-Peter Rohbeck

On 23.08.2016 00:43, Michal Hocko wrote:

OK, fair enough.
I would really appreciate if the original reporters could retest with
this patch on top of the current Linus tree. The stable backport posted
earlier doesn't apply cleanly on the current master, but the change is
essentially the same. The mmotm tree can then revert this patch before
Vlastimil's series is applied, because that series touches the code
removed here.
---
 From 90b6b282bede7966fb6c830a6d012d2239ac40e4 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Mon, 22 Aug 2016 10:52:06 +0200
Subject: [PATCH] mm, oom: prevent premature OOM killer invocation for high
  order request

There have been several reports about premature OOM killer invocation
in the 4.7 kernel, where an order-2 allocation request (for the kernel stack)
invoked the OOM killer even during basic workloads (light IO or even a kernel
compile on some filesystems). In all reported cases the memory is
fragmented and there are no order-2+ pages available. There is usually
a large amount of slab memory (usually dentries/inodes) and further
debugging has shown that there are way too many unmovable blocks which
are skipped during the compaction. Multiple reporters have confirmed that
the current linux-next which includes [1] and [2] helped and OOMs are
not reproducible anymore.

A simpler fix for the late rc and stable is to simply ignore the
compaction feedback and retry as long as there is a reclaim progress
and we are not getting OOM for order-0 pages. We already do that for
CONFIG_COMPACTION=n, so let's reuse the same code when compaction is
enabled as well.

[1] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz
[2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz

Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
Signed-off-by: Michal Hocko 
---
  mm/page_alloc.c | 51 ++-
  1 file changed, 2 insertions(+), 49 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3fbe73a6fe4b..7791a03f8deb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3137,54 +3137,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
return NULL;
  }
  
-static inline bool
-should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
-enum compact_result compact_result,
-enum compact_priority *compact_priority,
-int compaction_retries)
-{
-   int max_retries = MAX_COMPACT_RETRIES;
-
-   if (!order)
-   return false;
-
-   /*
-* compaction considers all the zone as desperately out of memory
-* so it doesn't really make much sense to retry except when the
-* failure could be caused by insufficient priority
-*/
-   if (compaction_failed(compact_result)) {
-   if (*compact_priority > MIN_COMPACT_PRIORITY) {
-   (*compact_priority)--;
-   return true;
-   }
-   return false;
-   }
-
-   /*
-* make sure the compaction wasn't deferred or didn't bail out early
-* due to locks contention before we declare that we should give up.
-* But do not retry if the given zonelist is not suitable for
-* compaction.
-*/
-   if (compaction_withdrawn(compact_result))
-   return compaction_zonelist_suitable(ac, order, alloc_flags);
-
-   /*
-* !costly requests are much more important than __GFP_REPEAT
-* costly ones because they are de facto nofail and invoke OOM
-* killer to move on while costly can fail and users are ready
-* to cope with that. 1/4 retries is rather arbitrary but we
-* would need much more detailed feedback from compaction to
-* make a better decision.
-*/
-   if (order > PAGE_ALLOC_COSTLY_ORDER)
-   max_retries /= 4;
-   if (compaction_retries <= max_retries)
-   return true;
-
-   return false;
-}
  #else
  static inline struct page *
  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
@@ -3195,6 +3147,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
return NULL;
  }
  
+#endif /* CONFIG_COMPACTION */
+
  static inline bool
  should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
 enum compact_result compact_result,
@@ -3221,7 +3175,6 @@ should_compact_retry(struct 

Re: OOM detection regressions since 4.7

2016-08-25 Thread Michal Hocko
On Tue 23-08-16 09:43:39, Michal Hocko wrote:
> On Mon 22-08-16 15:05:17, Andrew Morton wrote:
> > On Mon, 22 Aug 2016 15:42:28 +0200 Michal Hocko  wrote:
> > 
> > > Of course, if Linus/Andrew doesn't like to take those compaction
> > > improvements this late then I will ask to merge the partial revert to
> > > Linus tree as well and then there is not much to discuss.
> > 
> > This sounds like the prudent option.  Can we get 4.8 working
> > well-enough, backport that into 4.7.x and worry about the fancier stuff
> > for 4.9?
> 
> OK, fair enough.
> 
> I would really appreciate if the original reporters could retest with
> this patch on top of the current Linus tree.

Any luck with the testing of this patch?
-- 
Michal Hocko
SUSE Labs



Re: OOM detection regressions since 4.7

2016-08-25 Thread Olaf Hering
On Thu, Aug 25, Michal Hocko wrote:

> Any luck with the testing of this patch?

Not this week, sorry.

Olaf





Re: OOM detection regressions since 4.7

2016-08-24 Thread Joonsoo Kim
2016-08-24 16:04 GMT+09:00 Michal Hocko :
> On Wed 24-08-16 14:01:57, Joonsoo Kim wrote:
>> Looks like my mail client eat my reply so I resend.
>>
>> On Tue, Aug 23, 2016 at 09:33:18AM +0200, Michal Hocko wrote:
>> > On Tue 23-08-16 13:52:45, Joonsoo Kim wrote:
>> > [...]
>> > > Hello, Michal.
>> > >
>> > > I agree with partial revert but revert should be a different form.
>> > > Below change try to reuse should_compact_retry() version for
>> > > !CONFIG_COMPACTION but it turned out that it also causes regression in
>> > > Markus report [1].
>> >
>> > I would argue that CONFIG_COMPACTION=n behaves so arbitrary for high
>> > order workloads that calling any change in that behavior a regression
>> > is little bit exaggerated. Disabling compaction should have a very
>> > strong reason. I haven't heard any so far. I am even wondering whether
>> > there is a legitimate reason for that these days.
>> >
>> > > Theoretical reason for this regression is that it would stop retry
>> > > even if there are enough lru pages. It only checks if freepage
>> > > excesses min watermark or not for retry decision. To prevent
>> > > pre-mature OOM killer, we need to keep allocation loop when there are
>> > > enough lru pages. So, logic should be something like that.
>> > >
>> > > should_compact_retry()
>> > > {
>> > > for_each_zone_zonelist_nodemask {
>> > > available = zone_reclaimable_pages(zone);
>> > > available += zone_page_state_snapshot(zone, 
>> > > NR_FREE_PAGES);
>> > > if (__zone_watermark_ok(zone, *0*, min_wmark_pages(zone),
>> > > ac_classzone_idx(ac), alloc_flags, available))
>> > > return true;
>> > >
>> > > }
>> > > }
>> > >
>> > > I suggested it before and current situation looks like it is indeed
>> > > needed.
>> >
>> > this just opens doors for an unbounded reclaim/thrashing because
>> > you can reclaim as much as you like and there is no guarantee of a
>> > forward progress. The reason why !COMPACTION should_compact_retry only
>> > checks for the min_wmark without the reclaimable bias is that this will
>> > guarantee a retry if we are failing due to high order wmark check rather
>> > than a lack of memory. This condition is guaranteed to converge and the
>> > probability of the unbounded reclaim is much more reduced.
>>
>> In case of a lack of memory with a lot of reclaimable lru pages, why
>> do we stop reclaim/compaction?
>>
>> With your partial reverting patch, allocation logic would be like as
>> following.
>>
>> Assume following situation:
>> o a lot of reclaimable lru pages
>> o no order-2 freepage
>> o not enough order-0 freepage for min watermark
>> o order-2 allocation
>>
>> 1. order-2 allocation failed due to min watermark
>> 2. go to reclaim/compaction
>> 3. reclaim some pages (maybe SWAP_CLUSTER_MAX (32) pages) but still
>> min watermark isn't met for order-0
>> 4. compaction is skipped due to not enough freepage
>> 5. should_reclaim_retry() returns false because min watermark for
>> order-2 page isn't met
>> 6. should_compact_retry() returns false because min watermark for
>> order-0 page isn't met
>> 7. allocation is failed without any retry and OOM is invoked.
>
> If the direct reclaim is not able to get us over min wmark for order-0
> then we would be likely to hit the oom even for order-0 requests.

No, in this situation direct reclaim can get us over the min wmark for order-0,
but it needs to retry. IIUC, direct reclaim would not reclaim enough memory
at once; it tries to reclaim a small batch of LRU pages and breaks out to check
the watermark.

>> Is it what you want?
>>
>> And, please elaborate more on how your logic guarantee to converge.
>> After order-0 freepage exceed min watermark, there is no way to stop
>> reclaim/thrashing. Number of freepage just increase monotonically and
>> retry cannot be stopped until order-2 allocation succeed. Am I missing
>> something?
>
> My statement was imprecise at best. You are right that there is no
> guarantee to fullfil order-2 request. What I meant to say is that we
> should converge when we are getting out of memory (aka even order-0
> would have hard time to succeed). should_reclaim_retry does that by
> the back off scaling of the reclaimable pages. should_compact_retry
> would have to do the same thing which would effectively turn it into
> should_reclaim_retry.

That is why I previously suggested changing should_reclaim_retry() for
high-order requests.

>> > > And, I still think that your OOM detection rework has some flaws.
>> > >
>> > > 1) It doesn't consider freeable objects that can be freed by 
>> > > shrink_slab().
>> > > There are many subsystems that cache many objects and they will be
>> > > freed by shrink_slab() interface. But, you don't account them when
>> > > making the OOM decision.
>> >
>> > I fully rely on the reclaim and compaction feedback. And that is the
>> > place where we should strive for improvements. So 


Re: OOM detection regressions since 4.7

2016-08-24 Thread Joonsoo Kim
Looks like my mail client ate my reply, so I am resending it.

On Tue, Aug 23, 2016 at 09:33:18AM +0200, Michal Hocko wrote:
> On Tue 23-08-16 13:52:45, Joonsoo Kim wrote:
> [...]
> > Hello, Michal.
> > 
> > I agree with partial revert but revert should be a different form.
> > Below change try to reuse should_compact_retry() version for
> > !CONFIG_COMPACTION but it turned out that it also causes regression in
> > Markus report [1].
> 
> I would argue that CONFIG_COMPACTION=n behaves so arbitrary for high
> order workloads that calling any change in that behavior a regression
> is little bit exaggerated. Disabling compaction should have a very
> strong reason. I haven't heard any so far. I am even wondering whether
> there is a legitimate reason for that these days.
> 
> > Theoretical reason for this regression is that it would stop retry
> > even if there are enough lru pages. It only checks if freepage
> > excesses min watermark or not for retry decision. To prevent
> > pre-mature OOM killer, we need to keep allocation loop when there are
> > enough lru pages. So, logic should be something like that.
> > 
> > should_compact_retry()
> > {
> > for_each_zone_zonelist_nodemask {
> > available = zone_reclaimable_pages(zone);
> > available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> > if (__zone_watermark_ok(zone, *0*, min_wmark_pages(zone),
> > ac_classzone_idx(ac), alloc_flags, available))
> > return true;
> > 
> > }
> > }
> > 
> > I suggested it before and current situation looks like it is indeed
> > needed.
> 
> this just opens doors for an unbounded reclaim/thrashing because
> you can reclaim as much as you like and there is no guarantee of a
> forward progress. The reason why !COMPACTION should_compact_retry only
> checks for the min_wmark without the reclaimable bias is that this will
> guarantee a retry if we are failing due to high order wmark check rather
> than a lack of memory. This condition is guaranteed to converge and the
> probability of the unbounded reclaim is much more reduced.

In the case of a lack of memory with a lot of reclaimable LRU pages, why
do we stop reclaim/compaction?

With your partial reverting patch, the allocation logic would work as
follows.

Assume the following situation:
o a lot of reclaimable lru pages
o no order-2 freepage
o not enough order-0 freepage for min watermark
o order-2 allocation

1. order-2 allocation failed due to min watermark
2. go to reclaim/compaction
3. reclaim some pages (maybe SWAP_CLUSTER_MAX (32) pages) but still
min watermark isn't met for order-0
4. compaction is skipped due to not enough freepage
5. should_reclaim_retry() returns false because min watermark for
order-2 page isn't met
6. should_compact_retry() returns false because min watermark for
order-0 page isn't met
7. allocation fails without any retry and the OOM killer is invoked.

Is that what you want?

And, please elaborate more on how your logic guarantees convergence.
After the order-0 free pages exceed the min watermark, there is no way to stop
reclaim/thrashing. The number of free pages just increases monotonically and
the retry cannot be stopped until the order-2 allocation succeeds. Am I missing
something?


> > And, I still think that your OOM detection rework has some flaws.
> >
> > 1) It doesn't consider freeable objects that can be freed by shrink_slab().
> > There are many subsystems that cache many objects and they will be
> > freed by shrink_slab() interface. But, you don't account them when
> > making the OOM decision.
> 
> I fully rely on the reclaim and compaction feedback. And that is the
> place where we should strive for improvements. So if we are growing way
> too many slab objects we should take care about that in the slab reclaim
> which is tightly coupled with the LRU reclaim rather than up the layer
> in the page allocator.

No. Slab shrink logic that is tightly coupled with LRU reclaim totally
makes sense. What doesn't make sense is the way your OOM detection
rework uses this functionality and its feedback.

For example, compaction will do its best with the current resources. But,
as I said before, compaction would be more powerful if the system had
more free memory. Your logic only guarantees the minimum amount of free
memory for it to run, so I don't think its result is reliable for
determining whether we are in OOM or not.

And, your logic doesn't consider how many pages can be freed by slab
shrinking. As I said before, there may be high-order reclaimable pages,
or we may be able to create high-order free pages by actually freeing objects.

Most importantly, I think that it is fundamentally impossible to
anticipate whether we can create a high-order free page from a snapshot
of the number of freeable pages. Your logic relies on compaction, but
there are many types of pages that cannot be migrated by compaction yet
can be reclaimed. So, fully relying on the compaction
result for the OOM decision 

Re: OOM detection regressions since 4.7

2016-08-24 Thread Michal Hocko
On Wed 24-08-16 14:01:57, Joonsoo Kim wrote:
> Looks like my mail client eat my reply so I resend.
> 
> On Tue, Aug 23, 2016 at 09:33:18AM +0200, Michal Hocko wrote:
> > On Tue 23-08-16 13:52:45, Joonsoo Kim wrote:
> > [...]
> > > Hello, Michal.
> > > 
> > > I agree with partial revert but revert should be a different form.
> > > Below change try to reuse should_compact_retry() version for
> > > !CONFIG_COMPACTION but it turned out that it also causes regression in
> > > Markus report [1].
> > 
> > I would argue that CONFIG_COMPACTION=n behaves so arbitrary for high
> > order workloads that calling any change in that behavior a regression
> > is little bit exaggerated. Disabling compaction should have a very
> > strong reason. I haven't heard any so far. I am even wondering whether
> > there is a legitimate reason for that these days.
> > 
> > > Theoretical reason for this regression is that it would stop retry
> > > even if there are enough lru pages. It only checks if freepage
> > > excesses min watermark or not for retry decision. To prevent
> > > pre-mature OOM killer, we need to keep allocation loop when there are
> > > enough lru pages. So, logic should be something like that.
> > > 
> > > should_compact_retry()
> > > {
> > > for_each_zone_zonelist_nodemask {
> > > available = zone_reclaimable_pages(zone);
> > > available += zone_page_state_snapshot(zone, 
> > > NR_FREE_PAGES);
> > > if (__zone_watermark_ok(zone, *0*, min_wmark_pages(zone),
> > > ac_classzone_idx(ac), alloc_flags, available))
> > > return true;
> > > 
> > > }
> > > }
> > > 
> > > I suggested it before and current situation looks like it is indeed
> > > needed.
> > 
> > this just opens doors for an unbounded reclaim/thrashing because
> > you can reclaim as much as you like and there is no guarantee of a
> > forward progress. The reason why !COMPACTION should_compact_retry only
> > checks for the min_wmark without the reclaimable bias is that this will
> > guarantee a retry if we are failing due to high order wmark check rather
> > than a lack of memory. This condition is guaranteed to converge and the
> > probability of the unbounded reclaim is much more reduced.
> 
> In case of a lack of memory with a lot of reclaimable lru pages, why 
> do we stop reclaim/compaction?
> 
> With your partial reverting patch, allocation logic would be like as
> following.
> 
> Assume following situation:
> o a lot of reclaimable lru pages
> o no order-2 freepage
> o not enough order-0 freepage for min watermark
> o order-2 allocation
> 
> 1. order-2 allocation failed due to min watermark
> 2. go to reclaim/compaction
> 3. reclaim some pages (maybe SWAP_CLUSTER_MAX (32) pages) but still
> min watermark isn't met for order-0
> 4. compaction is skipped due to not enough freepage
> 5. should_reclaim_retry() returns false because min watermark for
> order-2 page isn't met
> 6. should_compact_retry() returns false because min watermark for
> order-0 page isn't met
> 7. allocation is failed without any retry and OOM is invoked.

If direct reclaim is not able to get us over the min wmark for order-0,
then we would likely hit the OOM even for order-0 requests.

> Is it what you want?
> 
> And, please elaborate more on how your logic guarantee to converge.
> After order-0 freepage exceed min watermark, there is no way to stop
> reclaim/threshing. Number of freepage just increase monotonically and
> retry cannot be stopped until order-2 allocation succeed. Am I missing
> something?

My statement was imprecise at best. You are right that there is no
guarantee to fulfil an order-2 request. What I meant to say is that we
should converge when we are getting out of memory (i.e. even order-0
would have a hard time succeeding). should_reclaim_retry does that by
the back-off scaling of the reclaimable pages. should_compact_retry
would have to do the same thing, which would effectively turn it into
should_reclaim_retry.

> > > And, I still think that your OOM detection rework has some flaws.
> > >
> > > 1) It doesn't consider freeable objects that can be freed by 
> > > shrink_slab().
> > > There are many subsystems that cache many objects and they will be
> > > freed by shrink_slab() interface. But, you don't account them when
> > > making the OOM decision.
> > 
> > I fully rely on the reclaim and compaction feedback. And that is the
> > place where we should strive for improvements. So if we are growing way
> > too many slab objects we should take care about that in the slab reclaim
> > which is tightly coupled with the LRU reclaim rather than up the layer
> > in the page allocator.
> 
> No. slab shrink logic which is tightly coupled with the LRU reclaim
> totally makes sense.

Once the number of slab objects is much larger than the number of LRU
pages (which we have seen in some OOM reports), the way they are coupled
stops making sense because the current 

Re: OOM detection regressions since 4.7

2016-08-24 Thread Joonsoo Kim
Looks like my mail client eat my reply so I resend.

On Tue, Aug 23, 2016 at 09:33:18AM +0200, Michal Hocko wrote:
> On Tue 23-08-16 13:52:45, Joonsoo Kim wrote:
> [...]
> > Hello, Michal.
> > 
> > I agree with partial revert but revert should be a different form.
> > Below change try to reuse should_compact_retry() version for
> > !CONFIG_COMPACTION but it turned out that it also causes regression in
> > Markus report [1].
> 
> I would argue that CONFIG_COMPACTION=n behaves so arbitrary for high
> order workloads that calling any change in that behavior a regression
> is little bit exaggerated. Disabling compaction should have a very
> strong reason. I haven't heard any so far. I am even wondering whether
> there is a legitimate reason for that these days.
> 
> > Theoretical reason for this regression is that it would stop retry
> > even if there are enough lru pages. It only checks if freepage
> > excesses min watermark or not for retry decision. To prevent
> > pre-mature OOM killer, we need to keep allocation loop when there are
> > enough lru pages. So, logic should be something like that.
> > 
> > should_compact_retry()
> > {
> > for_each_zone_zonelist_nodemask {
> > available = zone_reclaimable_pages(zone);
> > available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> > if (__zone_watermark_ok(zone, *0*, min_wmark_pages(zone),
> > ac_classzone_idx(ac), alloc_flags, available))
> > return true;
> > 
> > }
> > }
> > 
> > I suggested it before and current situation looks like it is indeed
> > needed.
> 
> this just opens doors for an unbounded reclaim/threshing becacause
> you can reclaim as much as you like and there is no guarantee of a
> forward progress. The reason why !COMPACTION should_compact_retry only
> checks for the min_wmark without the reclaimable bias is that this will
> guarantee a retry if we are failing due to high order wmark check rather
> than a lack of memory. This condition is guaranteed to converge and the
> probability of the unbounded reclaim is much more reduced.

In case of a lack of memory with a lot of reclaimable lru pages, why 
do we stop reclaim/compaction?

With your partial reverting patch, allocation logic would be like as
following.

Assume following situation:
o a lot of reclaimable lru pages
o no order-2 freepage
o not enough order-0 freepage for min watermark
o order-2 allocation

1. order-2 allocation failed due to min watermark
2. go to reclaim/compaction
3. reclaim some pages (maybe SWAP_CLUSTER_MAX (32) pages) but still
min watermark isn't met for order-0
4. compaction is skipped due to not enough freepage
5. should_reclaim_retry() returns false because min watermark for
order-2 page isn't met
6. should_compact_retry() returns false because min watermark for
order-0 page isn't met
6. allocation is failed without any retry and OOM is invoked.

Is it what you want?

And, please elaborate more on how your logic guarantees convergence.
After order-0 freepages exceed the min watermark, there is no way to stop
reclaim/thrashing. The number of free pages just increases monotonically and
the retry cannot be stopped until the order-2 allocation succeeds. Am I missing
something?


> > And, I still think that your OOM detection rework has some flaws.
> >
> > 1) It doesn't consider freeable objects that can be freed by shrink_slab().
> > There are many subsystems that cache many objects, and they will be
> > freed via the shrink_slab() interface. But you don't account for them when
> > making the OOM decision.
> 
> I fully rely on the reclaim and compaction feedback. And that is the
> place where we should strive for improvements. So if we are growing way
> too many slab objects we should take care about that in the slab reclaim
> which is tightly coupled with the LRU reclaim rather than up the layer
> in the page allocator.

No. Slab shrink logic being tightly coupled with the LRU reclaim
totally makes sense. What doesn't make sense is the way your OOM
detection rework uses this functionality and interprets its
feedback.

For example, compaction will do its best with the current resources. But,
as I said before, compaction would be more powerful if the system had
more free memory. Your logic only guarantees it the minimum
amount of free memory to run, so I don't think its result is
reliable for determining whether we are in OOM or not.

And, your logic doesn't consider how many pages can be freed by slab
shrink. As I said before, there could be high-order reclaimable
pages, or we could create high-order freepages by actually freeing them.

Most importantly, I think that it is fundamentally impossible to
anticipate whether we can form a high-order freepage from a snapshot
of the number of freeable pages. Your logic relies on
compaction, but there are many types of pages that cannot be migrated
by compaction yet can be reclaimed. So, fully relying on the compaction
result for the OOM decision 

Re: OOM detection regressions since 4.7

2016-08-24 Thread Michal Hocko
On Wed 24-08-16 14:01:57, Joonsoo Kim wrote:
> Looks like my mail client ate my reply, so I am resending.
> 
> On Tue, Aug 23, 2016 at 09:33:18AM +0200, Michal Hocko wrote:
> > On Tue 23-08-16 13:52:45, Joonsoo Kim wrote:
> > [...]
> > > Hello, Michal.
> > > 
> > > I agree with a partial revert, but the revert should take a different form.
> > > The change below tries to reuse the should_compact_retry() version for
> > > !CONFIG_COMPACTION, but it turned out that it also causes a regression in
> > > Markus' report [1].
> > 
> > I would argue that CONFIG_COMPACTION=n behaves so arbitrarily for high
> > order workloads that calling any change in that behavior a regression
> > is a little bit exaggerated. Disabling compaction should have a very
> > strong reason. I haven't heard any so far. I am even wondering whether
> > there is a legitimate reason for that these days.
> > 
> > > The theoretical reason for this regression is that it would stop retrying
> > > even if there are enough lru pages. It only checks whether the free pages
> > > exceed the min watermark for the retry decision. To prevent a pre-mature
> > > OOM killer, we need to keep the allocation loop going when there are
> > > enough lru pages. So, the logic should be something like this.
> > > 
> > > should_compact_retry()
> > > {
> > > for_each_zone_zonelist_nodemask {
> > > available = zone_reclaimable_pages(zone);
> > > available += zone_page_state_snapshot(zone, 
> > > NR_FREE_PAGES);
> > > if (__zone_watermark_ok(zone, *0*, min_wmark_pages(zone),
> > > ac_classzone_idx(ac), alloc_flags, available))
> > > return true;
> > > 
> > > }
> > > }
> > > 
> > > I suggested it before and current situation looks like it is indeed
> > > needed.
> > 
> > this just opens the door to unbounded reclaim/thrashing because
> > you can reclaim as much as you like and there is no guarantee of
> > forward progress. The reason why the !COMPACTION should_compact_retry only
> > checks for the min_wmark without the reclaimable bias is that this will
> > guarantee a retry if we are failing due to the high-order wmark check rather
> > than a lack of memory. This condition is guaranteed to converge and the
> > probability of unbounded reclaim is much lower.
> 
> In case of a lack of memory with a lot of reclaimable lru pages, why 
> do we stop reclaim/compaction?
> 
> With your partial reverting patch, the allocation logic would be as
> follows.
> 
> Assume the following situation:
> o a lot of reclaimable lru pages
> o no order-2 freepage
> o not enough order-0 freepage for min watermark
> o order-2 allocation
> 
> 1. order-2 allocation failed due to min watermark
> 2. go to reclaim/compaction
> 3. reclaim some pages (maybe SWAP_CLUSTER_MAX (32) pages) but still
> min watermark isn't met for order-0
> 4. compaction is skipped due to not enough freepage
> 5. should_reclaim_retry() returns false because min watermark for
> order-2 page isn't met
> 6. should_compact_retry() returns false because min watermark for
> order-0 page isn't met
> 7. allocation fails without any retry and the OOM killer is invoked.

If the direct reclaim is not able to get us over min wmark for order-0
then we would be likely to hit the oom even for order-0 requests.

> Is it what you want?
> 
> And, please elaborate more on how your logic guarantees convergence.
> After order-0 freepages exceed the min watermark, there is no way to stop
> reclaim/thrashing. The number of free pages just increases monotonically and
> the retry cannot be stopped until the order-2 allocation succeeds. Am I missing
> something?

My statement was imprecise at best. You are right that there is no
guarantee to fulfil an order-2 request. What I meant to say is that we
should converge when we are getting out of memory (aka even order-0
would have a hard time succeeding). should_reclaim_retry does that by
the back-off scaling of the reclaimable pages. should_compact_retry
would have to do the same thing, which would effectively turn it into
should_reclaim_retry.

> > > And, I still think that your OOM detection rework has some flaws.
> > >
> > > 1) It doesn't consider freeable objects that can be freed by
> > > shrink_slab().
> > > There are many subsystems that cache many objects, and they will be
> > > freed via the shrink_slab() interface. But you don't account for them
> > > when making the OOM decision.
> > 
> > I fully rely on the reclaim and compaction feedback. And that is the
> > place where we should strive for improvements. So if we are growing way
> > too many slab objects we should take care about that in the slab reclaim
> > which is tightly coupled with the LRU reclaim rather than up the layer
> > in the page allocator.
> 
> No. slab shrink logic which is tightly coupled with the LRU reclaim
> totally makes sense.

Once the number of slab objects is much larger than the number of LRU pages
(what we have seen in some oom reports) then the way they are coupled just
stops making sense because the current 

Re: OOM detection regressions since 4.7

2016-08-24 Thread Michal Hocko
On Tue 23-08-16 15:08:05, Linus Torvalds wrote:
> On Tue, Aug 23, 2016 at 3:33 AM, Michal Hocko  wrote:
> >
> > I would argue that CONFIG_COMPACTION=n behaves so arbitrarily for high
> > order workloads that calling any change in that behavior a regression
> > is a little bit exaggerated.
> 
> Well, the thread info allocations certainly haven't been big problems
> before. So regressing those would seem to be a real regression.
> 
> What happened? We've done the order-2 allocation for the stack since
> May 2014, so that isn't new. Did we cut off retries for low orders?

Yes, with the original implementation the number of reclaim retries was
basically unbounded as long as there was reclaim progress. This has
changed to be a bounded process. Without compaction this means that
we would previously reclaim until an order-2 page was formed.

> So I would not say that it's an exaggeration to say that order-2
> allocations failing is a regression.

I would agree with you with COMPACTION enabled, but with compaction
disabled, which should really be limited to !MMU configurations, I think
there is not much we can do. Well, we could simply retry forever
without invoking the OOM killer for higher-order requests for this config
option and rely on order-0 to hit the OOM. Do we want that though?
I do not remember anybody with !MMU complaining. Markus had COMPACTION
disabled accidentally.
-- 
Michal Hocko
SUSE Labs


Re: OOM detection regressions since 4.7

2016-08-23 Thread Linus Torvalds
On Tue, Aug 23, 2016 at 3:33 AM, Michal Hocko  wrote:
>
> I would argue that CONFIG_COMPACTION=n behaves so arbitrarily for high
> order workloads that calling any change in that behavior a regression
> is a little bit exaggerated.

Well, the thread info allocations certainly haven't been big problems
before. So regressing those would seem to be a real regression.

What happened? We've done the order-2 allocation for the stack since
May 2014, so that isn't new. Did we cut off retries for low orders?

So I would not say that it's an exaggeration to say that order-2
allocations failing is a regression.

Yes, yes, for 4.9 we may well end up using vmalloc for the kernel
stack, but there are certainly other things that want low-order
(non-hugepage) allocations. Like kmalloc(), which often ends up using
small orders just to pack data more efficiently (allocating a single
page can be hugely wasteful even if the individual allocations are
smaller than that - so allocating a few pages and packing more
allocations into it helps fight internal fragmentation)

So this definitely needs to be fixed for 4.7 (and apparently there are a
few patches still pending even for 4.8).

 Linus


Re: OOM detection regressions since 4.7

2016-08-23 Thread Michal Hocko
On Mon 22-08-16 15:05:17, Andrew Morton wrote:
> On Mon, 22 Aug 2016 15:42:28 +0200 Michal Hocko  wrote:
> 
> > Of course, if Linus/Andrew doesn't like to take those compaction
> > improvements this late then I will ask to merge the partial revert to
> > Linus tree as well and then there is not much to discuss.
> 
> This sounds like the prudent option.  Can we get 4.8 working
> well-enough, backport that into 4.7.x and worry about the fancier stuff
> for 4.9?

OK, fair enough.

I would really appreciate it if the original reporters could retest with
this patch on top of the current Linus tree. The stable backport posted
earlier doesn't apply cleanly on the current master, but the change is
essentially the same. The mmotm tree can then revert this patch before
Vlastimil's series is applied, because that series touches the code
removed here.
---
From 90b6b282bede7966fb6c830a6d012d2239ac40e4 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Mon, 22 Aug 2016 10:52:06 +0200
Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation for high
 order request

There have been several reports about pre-mature OOM killer invocation
in the 4.7 kernel when an order-2 allocation request (for the kernel stack)
invoked the OOM killer even during basic workloads (light IO or even a
kernel compile on some filesystems). In all reported cases the memory is
fragmented and there are no order-2+ pages available. There is usually
a large amount of slab memory (usually dentries/inodes) and further
debugging has shown that there are way too many unmovable blocks which
are skipped during the compaction. Multiple reporters have confirmed that
the current linux-next which includes [1] and [2] helped and OOMs are
not reproducible anymore.

A simpler fix for the late rc and stable is to simply ignore the
compaction feedback and retry as long as there is a reclaim progress
and we are not getting OOM for order-0 pages. We already do that for
CONFIG_COMPACTION=n so let's reuse the same code when compaction is
enabled as well.

[1] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz
[2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz

Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
Signed-off-by: Michal Hocko 
---
 mm/page_alloc.c | 51 ++-
 1 file changed, 2 insertions(+), 49 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3fbe73a6fe4b..7791a03f8deb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3137,54 +3137,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned 
int order,
return NULL;
 }
 
-static inline bool
-should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
-enum compact_result compact_result,
-enum compact_priority *compact_priority,
-int compaction_retries)
-{
-   int max_retries = MAX_COMPACT_RETRIES;
-
-   if (!order)
-   return false;
-
-   /*
-* compaction considers all the zone as desperately out of memory
-* so it doesn't really make much sense to retry except when the
-* failure could be caused by insufficient priority
-*/
-   if (compaction_failed(compact_result)) {
-   if (*compact_priority > MIN_COMPACT_PRIORITY) {
-   (*compact_priority)--;
-   return true;
-   }
-   return false;
-   }
-
-   /*
-* make sure the compaction wasn't deferred or didn't bail out early
-* due to locks contention before we declare that we should give up.
-* But do not retry if the given zonelist is not suitable for
-* compaction.
-*/
-   if (compaction_withdrawn(compact_result))
-   return compaction_zonelist_suitable(ac, order, alloc_flags);
-
-   /*
-* !costly requests are much more important than __GFP_REPEAT
-* costly ones because they are de facto nofail and invoke OOM
-* killer to move on while costly can fail and users are ready
-* to cope with that. 1/4 retries is rather arbitrary but we
-* would need much more detailed feedback from compaction to
-* make a better decision.
-*/
-   if (order > PAGE_ALLOC_COSTLY_ORDER)
-   max_retries /= 4;
-   if (compaction_retries <= max_retries)
-   return true;
-
-   return false;
-}
 #else
 static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
@@ -3195,6 +3147,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int 
order,
return NULL;
 }
 
+#endif /* CONFIG_COMPACTION */
+
 static inline bool
 should_compact_retry(struct alloc_context *ac, unsigned int order, int 
alloc_flags,
 enum compact_result compact_result,
@@ -3221,7 +3175,6 @@ should_compact_retry(struct alloc_context *ac, 

Re: OOM detection regressions since 4.7

2016-08-23 Thread Michal Hocko
On Tue 23-08-16 09:40:14, Markus Trippelsdorf wrote:
> On 2016.08.23 at 09:33 +0200, Michal Hocko wrote:
> > On Tue 23-08-16 13:52:45, Joonsoo Kim wrote:
> > [...]
> > > Hello, Michal.
> > > 
> > > I agree with a partial revert, but the revert should take a different form.
> > > The change below tries to reuse the should_compact_retry() version for
> > > !CONFIG_COMPACTION, but it turned out that it also causes a regression in
> > > Markus' report [1].
> > 
> > I would argue that CONFIG_COMPACTION=n behaves so arbitrarily for high
> > order workloads that calling any change in that behavior a regression
> > is a little bit exaggerated. Disabling compaction should have a very
> > strong reason. I haven't heard any so far. I am even wondering whether
> > there is a legitimate reason for that these days.
> 
> BTW, the current config description:
> 
>   CONFIG_COMPACTION:
>   Allows the compaction of memory for the allocation of huge pages. 
> 
> doesn't make it clear to the user that this is an essential feature.

Yes I plan to send a clarification patch.
-- 
Michal Hocko
SUSE Labs


Re: OOM detection regressions since 4.7

2016-08-23 Thread Markus Trippelsdorf
On 2016.08.23 at 09:33 +0200, Michal Hocko wrote:
> On Tue 23-08-16 13:52:45, Joonsoo Kim wrote:
> [...]
> > Hello, Michal.
> > 
> > I agree with a partial revert, but the revert should take a different form.
> > The change below tries to reuse the should_compact_retry() version for
> > !CONFIG_COMPACTION, but it turned out that it also causes a regression in
> > Markus' report [1].
> 
> I would argue that CONFIG_COMPACTION=n behaves so arbitrarily for high
> order workloads that calling any change in that behavior a regression
> is a little bit exaggerated. Disabling compaction should have a very
> strong reason. I haven't heard any so far. I am even wondering whether
> there is a legitimate reason for that these days.

BTW, the current config description:

  CONFIG_COMPACTION:
  Allows the compaction of memory for the allocation of huge pages. 

doesn't make it clear to the user that this is an essential feature.

-- 
Markus


Re: OOM detection regressions since 4.7

2016-08-23 Thread Michal Hocko
On Tue 23-08-16 13:52:45, Joonsoo Kim wrote:
[...]
> Hello, Michal.
> 
> I agree with a partial revert, but the revert should take a different form.
> The change below tries to reuse the should_compact_retry() version for
> !CONFIG_COMPACTION, but it turned out that it also causes a regression in
> Markus' report [1].

I would argue that CONFIG_COMPACTION=n behaves so arbitrarily for high
order workloads that calling any change in that behavior a regression
is a little bit exaggerated. Disabling compaction should have a very
strong reason. I haven't heard any so far. I am even wondering whether
there is a legitimate reason for that these days.

> The theoretical reason for this regression is that it would stop retrying
> even if there are enough lru pages. It only checks whether the free pages
> exceed the min watermark for the retry decision. To prevent a pre-mature
> OOM killer, we need to keep the allocation loop going when there are
> enough lru pages. So, the logic should be something like this.
> 
> should_compact_retry()
> {
> for_each_zone_zonelist_nodemask {
> available = zone_reclaimable_pages(zone);
> available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> if (__zone_watermark_ok(zone, *0*, min_wmark_pages(zone),
> ac_classzone_idx(ac), alloc_flags, available))
> return true;
> 
> }
> }
> 
> I suggested it before and current situation looks like it is indeed
> needed.

this just opens the door to unbounded reclaim/thrashing because
you can reclaim as much as you like and there is no guarantee of
forward progress. The reason why the !COMPACTION should_compact_retry only
checks for the min_wmark without the reclaimable bias is that this will
guarantee a retry if we are failing due to the high-order wmark check rather
than a lack of memory. This condition is guaranteed to converge and the
probability of unbounded reclaim is much lower.

> And, I still think that your OOM detection rework has some flaws.
>
> 1) It doesn't consider freeable objects that can be freed by shrink_slab().
> There are many subsystems that cache many objects, and they will be
> freed via the shrink_slab() interface. But you don't account for them when
> making the OOM decision.

I fully rely on the reclaim and compaction feedback. And that is the
place where we should strive for improvements. So if we are growing way
too many slab objects we should take care about that in the slab reclaim
which is tightly coupled with the LRU reclaim rather than up the layer
in the page allocator.
 
> Think about the following situation: we are trying to find an order-2
> freepage and some subsystem has an order-2 freepage. It can be freed by
> shrink_slab(). Your logic doesn't guarantee that shrink_slab() is
> invoked to free this order-2 freepage in that subsystem. OOM would be
> triggered when compaction fails even though there is an order-2 freeable
> page. I think that if the decision is made before the whole lru list is
> scanned and before shrink_slab() is invoked for all freeable objects,
> it would cause a pre-mature OOM.

I do not see why we would need to scan through the whole LRU list when
we are under high-order pressure. It is true, though, that slab
shrinkers can and should be more sensitive to the requested order to
preferentially help release higher-order pages.

> It seems that you already know about this issue [2].
> 
> 2) 'OOM detection rework' depends on compaction too much. The compaction
> algorithm is racy and has some limitations. Its failure doesn't mean we
> are in an OOM situation.

As long as this is the only reliable source of higher order pages then
we do not have any other choice in order to have deterministic behavior.

> Even if Vlastimil's patchset and mine are
> applied, it is still possible that the compaction scanner cannot find enough
> freepages due to a race condition and returns a pre-mature failure. To
> reduce this race effect, I hope to give it more chances to retry even if
> full compaction fails.

Then we can improve the compaction_failed() heuristic and not call it a
day after a single attempt to get a high-order page after
scanning the whole memory. But to me this all sounds like an internal
implementation detail of the compaction, and the OOM detection in the
page allocator should be as independent of it as possible - same as
it is independent of the internal reclaim decisions. That was the whole
point of my rework: to actually melt "do something as long as at least a
single page is reclaimed" into an actual algorithm which can be measured
and reasoned about.

> We can remove this heuristic when we make sure that compaction is
> stable enough.

How do we know that, though, if we do not rely on it? Artificial tests
do not exhibit those corner cases. I was bashing my testing systems to
cause as much fragmentation as possible, yet I wasn't able to trigger the
issues reported recently by real world workloads. Don't get me wrong,
I understand your concerns but OOM 


Re: OOM detection regressions since 4.7

2016-08-22 Thread Joonsoo Kim
On Mon, Aug 22, 2016 at 11:32:49AM +0200, Michal Hocko wrote:
> Hi, 
> there have been multiple reports [1][2][3][4][5] about pre-mature OOM
> killer invocations since 4.7 which contains oom detection rework. All of
> them were for order-2 (kernel stack) allocation requests failing because
> of a high fragmentation and compaction failing to make any forward
> progress. While investigating this we have found out that the compaction
> just gives up too early. Vlastimil has been working on compaction
> improvement for quite some time and his series [6] is already sitting
> in mmotm tree. This already helps a lot because it drops some heuristics
> which are more aimed at lower latencies for high orders rather than
> reliability. Joonsoo has then identified a further problem with too many
> blocks being marked as unmovable [7] and Vlastimil has prepared a patch
> on top of his series [8] which is also in the mmotm tree now.
> 
> That being said, the regression is real and should be fixed for 4.7
> stable users. [6][8] were reported to help and OOMs are no longer
> reproducible. I know we are quite late (rc3) in 4.8 but I would vote
> for merging those patches and have them in 4.8. For 4.7 I would go
> with a partial revert of the detection rework for high order requests
> (see patch below). This patch is really trivial. If those compaction
> improvements are just too large for 4.8 then we can use the same patch
> as for 4.7 stable for now and revert it in 4.9 after compaction changes
> are merged.
> 
> Thoughts?
> 
> [1] http://lkml.kernel.org/r/20160731051121.GB307@x4
> [2] http://lkml.kernel.org/r/201608120901.41463.a.miskiew...@gmail.com
> [3] http://lkml.kernel.org/r/20160801192620.gd31...@dhcp22.suse.cz
> [4] https://lists.opensuse.org/opensuse-kernel/2016-08/msg00021.html
> [5] https://bugzilla.opensuse.org/show_bug.cgi?id=994066
> [6] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz
> [7] http://lkml.kernel.org/r/20160816031222.GC16913@js1304-P5Q-DELUXE
> [8] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz
> 
> ---
> From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Mon, 22 Aug 2016 10:52:06 +0200
> Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation for high
>  order request
> 
> There have been several reports about pre-mature OOM killer invocation
> in 4.7 kernel when order-2 allocation request (for the kernel stack)
> invoked OOM killer even during basic workloads (light IO or even kernel
> compile on some filesystems). In all reported cases the memory is
> fragmented and there are no order-2+ pages available. There is usually
> a large amount of slab memory (usually dentries/inodes) and further
> debugging has shown that there are way too many unmovable blocks which
> are skipped during the compaction. Multiple reporters have confirmed that
> the current linux-next which includes [1] and [2] helped and OOMs are
> not reproducible anymore. A simpler fix for the stable is to simply
> ignore the compaction feedback and retry as long as there is a reclaim
> progress for high order requests which we used to do before. We already
> do that for CONFIG_COMPACTION=n so let's reuse the same code when
> compaction is enabled as well.

Hello, Michal.

I agree with a partial revert, but the revert should take a different form.
The change below tries to reuse the should_compact_retry() version for
!CONFIG_COMPACTION, but it turned out that it also causes a regression in
Markus' report [1].

The theoretical reason for this regression is that it would stop retrying
even if there are enough LRU pages: it only checks whether free pages
exceed the min watermark when making the retry decision. To prevent a
premature OOM kill, we need to keep the allocation loop going while there
are enough LRU pages. So the logic should be something like this:

should_compact_retry()
{
	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
					ac->high_zoneidx, ac->nodemask) {
		available = zone_reclaimable_pages(zone);
		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
		/* note: order 0 here, not the requested order */
		if (__zone_watermark_ok(zone, 0, min_wmark_pages(zone),
					ac_classzone_idx(ac), alloc_flags,
					available))
			return true;
	}
	return false;
}

I suggested this before, and the current situation suggests it is indeed
needed.

And, I still think that your OOM detection rework has some flaws.

1) It doesn't consider freeable objects that can be freed by shrink_slab().
There are many subsystems that cache large numbers of objects which can
be freed through the shrink_slab() interface, but you don't account for
them when making the OOM decision.

Consider the following situation: we are trying to find an order-2
free page and some subsystem has one that can be freed by shrink_slab().
Your logic doesn't guarantee that shrink_slab() is invoked to free that
order-2 page in that subsystem. OOM would be triggered when compaction
fails even though there is an order-2 freeable page. I think that if the
decision is made before the whole LRU list is scanned and shrink_slab()
is invoked for all freeable objects, it would cause a premature OOM.


Re: OOM detection regressions since 4.7

2016-08-22 Thread Andrew Morton
On Mon, 22 Aug 2016 15:42:28 +0200 Michal Hocko  wrote:

> Of course, if Linus/Andrew doesn't like to take those compaction
> improvements this late then I will ask to merge the partial revert to
> Linus tree as well and then there is not much to discuss.

This sounds like the prudent option.  Can we get 4.8 working
well enough, backport that into 4.7.x, and worry about the fancier stuff
for 4.9?


Re: OOM detection regressions since 4.7

2016-08-22 Thread Greg KH
On Mon, Aug 22, 2016 at 03:42:28PM +0200, Michal Hocko wrote:
> On Mon 22-08-16 09:31:14, Greg KH wrote:
> > On Mon, Aug 22, 2016 at 12:54:41PM +0200, Michal Hocko wrote:
> > > On Mon 22-08-16 06:05:28, Greg KH wrote:
> > > > On Mon, Aug 22, 2016 at 11:37:07AM +0200, Michal Hocko wrote:
> > > [...]
> > > > > > From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 
> > > > > > 2001
> > > > > > From: Michal Hocko 
> > > > > > Date: Mon, 22 Aug 2016 10:52:06 +0200
> > > > > > Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation 
> > > > > > for high
> > > > > >  order request
> > > > > > 
> > > > > > There have been several reports about pre-mature OOM killer 
> > > > > > invocation
> > > > > > in 4.7 kernel when order-2 allocation request (for the kernel stack)
> > > > > > invoked OOM killer even during basic workloads (light IO or even 
> > > > > > kernel
> > > > > > compile on some filesystems). In all reported cases the memory is
> > > > > > fragmented and there are no order-2+ pages available. There is 
> > > > > > usually
> > > > > > a large amount of slab memory (usually dentries/inodes) and further
> > > > > > debugging has shown that there are way too many unmovable blocks 
> > > > > > which
> > > > > > are skipped during the compaction. Multiple reporters have 
> > > > > > confirmed that
> > > > > > the current linux-next which includes [1] and [2] helped and OOMs 
> > > > > > are
> > > > > > not reproducible anymore. A simpler fix for the stable is to simply
> > > > > > ignore the compaction feedback and retry as long as there is a 
> > > > > > reclaim
> > > > > > progress for high order requests which we used to do before. We 
> > > > > > already
> > > > > > do that for CONFIG_COMPACTION=n so let's reuse the same code when
> > > > > > compaction is enabled as well.
> > > > > > 
> > > > > > [1] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz
> > > > > > [2] 
> > > > > > http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz
> > > > > > 
> > > > > > Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
> > > > > > Signed-off-by: Michal Hocko 
> > > > > > ---
> > > > > >  mm/page_alloc.c | 50 
> > > > > > ++
> > > > > >  1 file changed, 2 insertions(+), 48 deletions(-)
> > > > 
> > > > So, if this goes into Linus's tree, can you let sta...@vger.kernel.org
> > > > know about it so we can add it to the 4.7-stable tree?  Otherwise
> > > > there's not much I can do here now, right?
> > > 
> > > My plan would be actually to not push this to Linus because we have a
> > > proper fix for Linus tree. It is just that the fix is quite large and I
> > > felt like the stable should get the most simple fix possible, which is
> > > this partial revert. So, what I am trying to tell is to push a non-linus
> > > patch to stable as it is simpler.
> > 
> > I _REALLY_ hate taking any patches that are not in Linus's tree as 90%
> > of the time (well, almost always), it ends up being wrong and hurting us
> > in the end.
> 
> I do not like it either but if there is a simple and straightforward
> workaround for stable while the upstream can go with the _proper_ fix
> from the longer POV then I think this is perfectly justified. Stable
> should be always about the simplest fix for the problem IMHO.

No, stable should always be "what is in Linus's tree to get it fixed."

Again, almost every time we try to "just do this simple thing instead"
in a stable tree, it ends up being broken somehow.  We have the history
to back this up, look at our archives.

I'll gladly take 10+ patches to resolve something, _if_ it actually
resolves something.

But, if we argue about it for a month or so, then we don't have to worry
about it as everyone will be using 4.8 :)

> Of course, if Linus/Andrew doesn't like to take those compaction
> improvements this late then I will ask to merge the partial revert to
> Linus tree as well and then there is not much to discuss.

Ok, let me know how it goes and we can see what to do.

thanks.

greg k-h


Re: OOM detection regressions since 4.7

2016-08-22 Thread Michal Hocko
On Mon 22-08-16 09:31:14, Greg KH wrote:
> On Mon, Aug 22, 2016 at 12:54:41PM +0200, Michal Hocko wrote:
> > On Mon 22-08-16 06:05:28, Greg KH wrote:
> > > On Mon, Aug 22, 2016 at 11:37:07AM +0200, Michal Hocko wrote:
> > [...]
> > > > > From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 2001
> > > > > From: Michal Hocko 
> > > > > Date: Mon, 22 Aug 2016 10:52:06 +0200
> > > > > Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation 
> > > > > for high
> > > > >  order request
> > > > > 
> > > > > There have been several reports about pre-mature OOM killer invocation
> > > > > in 4.7 kernel when order-2 allocation request (for the kernel stack)
> > > > > invoked OOM killer even during basic workloads (light IO or even 
> > > > > kernel
> > > > > compile on some filesystems). In all reported cases the memory is
> > > > > fragmented and there are no order-2+ pages available. There is usually
> > > > > a large amount of slab memory (usually dentries/inodes) and further
> > > > > debugging has shown that there are way too many unmovable blocks which
> > > > > are skipped during the compaction. Multiple reporters have confirmed 
> > > > > that
> > > > > the current linux-next which includes [1] and [2] helped and OOMs are
> > > > > not reproducible anymore. A simpler fix for the stable is to simply
> > > > > ignore the compaction feedback and retry as long as there is a reclaim
> > > > > progress for high order requests which we used to do before. We 
> > > > > already
> > > > > do that for CONFIG_COMPACTION=n so let's reuse the same code when
> > > > > compaction is enabled as well.
> > > > > 
> > > > > [1] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz
> > > > > [2] 
> > > > > http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz
> > > > > 
> > > > > Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
> > > > > Signed-off-by: Michal Hocko 
> > > > > ---
> > > > >  mm/page_alloc.c | 50 
> > > > > ++
> > > > >  1 file changed, 2 insertions(+), 48 deletions(-)
> > > 
> > > So, if this goes into Linus's tree, can you let sta...@vger.kernel.org
> > > know about it so we can add it to the 4.7-stable tree?  Otherwise
> > > there's not much I can do here now, right?
> > 
> > My plan would be actually to not push this to Linus because we have a
> > proper fix for Linus tree. It is just that the fix is quite large and I
> > felt like the stable should get the most simple fix possible, which is
> > this partial revert. So, what I am trying to tell is to push a non-linus
> > patch to stable as it is simpler.
> 
> I _REALLY_ hate taking any patches that are not in Linus's tree as 90%
> of the time (well, almost always), it ends up being wrong and hurting us
> in the end.

I do not like it either, but if there is a simple and straightforward
workaround for stable while upstream can go with the _proper_ fix
from the longer-term POV, then I think this is perfectly justified.
Stable should always be about the simplest fix for the problem, IMHO.

Of course, if Linus/Andrew doesn't like to take those compaction
improvements this late then I will ask to merge the partial revert to
Linus tree as well and then there is not much to discuss.

> What exactly are the commits that are in Linus's tree that resolve this
> issue?

The initial email in this thread pointed to those patches. Please
note that some of its dependencies (mostly code cleanups) are already
merged and that backporting without them would make the backport harder
and riskier.
-- 
Michal Hocko
SUSE Labs


Re: OOM detection regressions since 4.7

2016-08-22 Thread Greg KH
On Mon, Aug 22, 2016 at 12:54:41PM +0200, Michal Hocko wrote:
> On Mon 22-08-16 06:05:28, Greg KH wrote:
> > On Mon, Aug 22, 2016 at 11:37:07AM +0200, Michal Hocko wrote:
> [...]
> > > > From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 2001
> > > > From: Michal Hocko 
> > > > Date: Mon, 22 Aug 2016 10:52:06 +0200
> > > > Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation for 
> > > > high
> > > >  order request
> > > > 
> > > > There have been several reports about pre-mature OOM killer invocation
> > > > in 4.7 kernel when order-2 allocation request (for the kernel stack)
> > > > invoked OOM killer even during basic workloads (light IO or even kernel
> > > > compile on some filesystems). In all reported cases the memory is
> > > > fragmented and there are no order-2+ pages available. There is usually
> > > > a large amount of slab memory (usually dentries/inodes) and further
> > > > debugging has shown that there are way too many unmovable blocks which
> > > > are skipped during the compaction. Multiple reporters have confirmed 
> > > > that
> > > > the current linux-next which includes [1] and [2] helped and OOMs are
> > > > not reproducible anymore. A simpler fix for the stable is to simply
> > > > ignore the compaction feedback and retry as long as there is a reclaim
> > > > progress for high order requests which we used to do before. We already
> > > > do that for CONFIG_COMPACTION=n so let's reuse the same code when
> > > > compaction is enabled as well.
> > > > 
> > > > [1] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz
> > > > [2] 
> > > > http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz
> > > > 
> > > > Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
> > > > Signed-off-by: Michal Hocko 
> > > > ---
> > > >  mm/page_alloc.c | 50 ++
> > > >  1 file changed, 2 insertions(+), 48 deletions(-)
> > 
> > So, if this goes into Linus's tree, can you let sta...@vger.kernel.org
> > know about it so we can add it to the 4.7-stable tree?  Otherwise
> > there's not much I can do here now, right?
> 
> My plan would be actually to not push this to Linus because we have a
> proper fix for Linus tree. It is just that the fix is quite large and I
> felt like the stable should get the most simple fix possible, which is
> this partial revert. So, what I am trying to tell is to push a non-linus
> patch to stable as it is simpler.

I _REALLY_ hate taking any patches that are not in Linus's tree as 90%
of the time (well, almost always), it ends up being wrong and hurting us
in the end.

What exactly are the commits that are in Linus's tree that resolve this
issue?

thanks,

greg k-h


Re: OOM detection regressions since 4.7

2016-08-22 Thread Greg KH
On Mon, Aug 22, 2016 at 12:54:41PM +0200, Michal Hocko wrote:
> On Mon 22-08-16 06:05:28, Greg KH wrote:
> > On Mon, Aug 22, 2016 at 11:37:07AM +0200, Michal Hocko wrote:
> [...]
> > > > From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 2001
> > > > From: Michal Hocko 
> > > > Date: Mon, 22 Aug 2016 10:52:06 +0200
> > > > Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation for high
> > > >  order request
> > > > 
> > > > There have been several reports about pre-mature OOM killer invocation
> > > > in 4.7 kernel when order-2 allocation request (for the kernel stack)
> > > > invoked OOM killer even during basic workloads (light IO or even kernel
> > > > compile on some filesystems). In all reported cases the memory is
> > > > fragmented and there are no order-2+ pages available. There is usually
> > > > a large amount of slab memory (usually dentries/inodes) and further
> > > > debugging has shown that there are way too many unmovable blocks which
> > > > are skipped during the compaction. Multiple reporters have confirmed that
> > > > the current linux-next which includes [1] and [2] helped and OOMs are
> > > > not reproducible anymore. A simpler fix for the stable is to simply
> > > > ignore the compaction feedback and retry as long as there is a reclaim
> > > > progress for high order requests which we used to do before. We already
> > > > do that for CONFIG_COMPACTION=n so let's reuse the same code when
> > > > compaction is enabled as well.
> > > > 
> > > > [1] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz
> > > > [2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz
> > > > 
> > > > Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
> > > > Signed-off-by: Michal Hocko 
> > > > ---
> > > >  mm/page_alloc.c | 50 ++
> > > >  1 file changed, 2 insertions(+), 48 deletions(-)
> > 
> > So, if this goes into Linus's tree, can you let sta...@vger.kernel.org
> > know about it so we can add it to the 4.7-stable tree?  Otherwise
> > there's not much I can do here now, right?
> 
> My plan is actually not to push this to Linus because we have a
> proper fix for Linus's tree. It is just that the fix is quite large and I
> felt the stable tree should get the simplest fix possible, which is
> this partial revert. So what I am suggesting is to push a non-Linus
> patch to stable because it is simpler.

I _REALLY_ hate taking any patches that are not in Linus's tree as 90%
of the time (well, almost always), it ends up being wrong and hurting us
in the end.

What exactly are the commits that are in Linus's tree that resolve this
issue?

thanks,

greg k-h


Re: OOM detection regressions since 4.7

2016-08-22 Thread Markus Trippelsdorf
On 2016.08.22 at 13:13 +0200, Michal Hocko wrote:
> On Mon 22-08-16 13:01:13, Markus Trippelsdorf wrote:
> > On 2016.08.22 at 12:56 +0200, Michal Hocko wrote:
> > > On Mon 22-08-16 12:16:14, Markus Trippelsdorf wrote:
> > > > On 2016.08.22 at 11:32 +0200, Michal Hocko wrote:
> > > > > [1] http://lkml.kernel.org/r/20160731051121.GB307@x4
> > > > 
> > > > For the report [1] above:
> > > > 
> > > > markus@x4 linux % cat .config | grep CONFIG_COMPACTION
> > > > # CONFIG_COMPACTION is not set
> > > 
> > > Hmm, without compaction and with heavy fragmentation I am afraid we
> > > cannot really do much. What is the reason to disable compaction in the
> > > first place?
> > 
> > I don't recall. Must have been some issue in the past. I will re-enable
> > the option.
> 
> Well, without the compaction there is no source of high order pages at
> all. You can only reclaim and hope that some of the reclaimed pages will
> find their buddies on the free list and form a higher-order page. This can
> take forever. We used to have lumpy reclaim, which could help, but that
> is long gone.
> 
> I do not think we can sanely optimize for high-order-heavy loads
> without COMPACTION. At least not without reintroducing lumpy
> reclaim or something similar. To be honest I am not even sure which
> configurations should disable compaction - except for really tightly
> controlled !mmu or other single-purpose systems.

I now recall. It was an issue with CONFIG_TRANSPARENT_HUGEPAGE, so I
disabled that option. This then de-selected CONFIG_COMPACTION...

-- 
Markus


Re: OOM detection regressions since 4.7

2016-08-22 Thread Michal Hocko
On Mon 22-08-16 13:01:13, Markus Trippelsdorf wrote:
> On 2016.08.22 at 12:56 +0200, Michal Hocko wrote:
> > On Mon 22-08-16 12:16:14, Markus Trippelsdorf wrote:
> > > On 2016.08.22 at 11:32 +0200, Michal Hocko wrote:
> > > > [1] http://lkml.kernel.org/r/20160731051121.GB307@x4
> > > 
> > > For the report [1] above:
> > > 
> > > markus@x4 linux % cat .config | grep CONFIG_COMPACTION
> > > # CONFIG_COMPACTION is not set
> > 
> > Hmm, without compaction and with heavy fragmentation I am afraid we
> > cannot really do much. What is the reason to disable compaction in the
> > first place?
> 
> I don't recall. Must have been some issue in the past. I will re-enable
> the option.

Well, without the compaction there is no source of high order pages at
all. You can only reclaim and hope that some of the reclaimed pages will
find their buddies on the free list and form a higher-order page. This can
take forever. We used to have lumpy reclaim, which could help, but that
is long gone.

I do not think we can sanely optimize for high-order-heavy loads
without COMPACTION. At least not without reintroducing lumpy
reclaim or something similar. To be honest I am not even sure which
configurations should disable compaction - except for really tightly
controlled !mmu or other single-purpose systems.

-- 
Michal Hocko
SUSE Labs


Re: OOM detection regressions since 4.7

2016-08-22 Thread Markus Trippelsdorf
On 2016.08.22 at 12:56 +0200, Michal Hocko wrote:
> On Mon 22-08-16 12:16:14, Markus Trippelsdorf wrote:
> > On 2016.08.22 at 11:32 +0200, Michal Hocko wrote:
> > > [1] http://lkml.kernel.org/r/20160731051121.GB307@x4
> > 
> > For the report [1] above:
> > 
> > markus@x4 linux % cat .config | grep CONFIG_COMPACTION
> > # CONFIG_COMPACTION is not set
> 
> Hmm, without compaction and with heavy fragmentation I am afraid we
> cannot really do much. What is the reason to disable compaction in the
> first place?

I don't recall. Must have been some issue in the past. I will re-enable
the option.

-- 
Markus


Re: OOM detection regressions since 4.7

2016-08-22 Thread Michal Hocko
On Mon 22-08-16 12:16:14, Markus Trippelsdorf wrote:
> On 2016.08.22 at 11:32 +0200, Michal Hocko wrote:
> > there have been multiple reports [1][2][3][4][5] about pre-mature OOM
> > killer invocations since 4.7 which contains oom detection rework. All of
> > them were for order-2 (kernel stack) allocation requests failing because
> > of a high fragmentation and compaction failing to make any forward
> > progress. While investigating this we have found out that the compaction
> > just gives up too early. Vlastimil has been working on compaction
> > improvement for quite some time and his series [6] is already sitting
> > in mmotm tree. This already helps a lot because it drops some heuristics
> > which are more aimed at lower latencies for high orders rather than
> > reliability. Joonsoo has then identified further problem with too many
> > blocks being marked as unmovable [7] and Vlastimil has prepared a patch
> > on top of his series [8] which is also in the mmotm tree now.
> > 
> > That being said, the regression is real and should be fixed for 4.7
> > stable users. [6] and [8] were reported to help and OOMs are no longer
> > reproducible. I know we are quite late (rc3) in 4.8 but I would vote
> > for merging those patches and having them in 4.8. For 4.7 I would go
> > with a partial revert of the detection rework for high order requests
> > (see patch below). This patch is really trivial. If those compaction
> > improvements are just too large for 4.8 then we can use the same patch
> > as for 4.7 stable for now and revert it in 4.9 after compaction changes
> > are merged.
> > 
> > Thoughts?
> > 
> > [1] http://lkml.kernel.org/r/20160731051121.GB307@x4
> 
> For the report [1] above:
> 
> markus@x4 linux % cat .config | grep CONFIG_COMPACTION
> # CONFIG_COMPACTION is not set

Hmm, without compaction and with heavy fragmentation I am afraid we
cannot really do much. What is the reason to disable compaction in the
first place?
-- 
Michal Hocko
SUSE Labs


Re: OOM detection regressions since 4.7

2016-08-22 Thread Michal Hocko
On Mon 22-08-16 06:05:28, Greg KH wrote:
> On Mon, Aug 22, 2016 at 11:37:07AM +0200, Michal Hocko wrote:
[...]
> > > From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko 
> > > Date: Mon, 22 Aug 2016 10:52:06 +0200
> > > Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation for high
> > >  order request
> > > 
> > > There have been several reports about pre-mature OOM killer invocation
> > > in 4.7 kernel when order-2 allocation request (for the kernel stack)
> > > invoked OOM killer even during basic workloads (light IO or even kernel
> > > compile on some filesystems). In all reported cases the memory is
> > > fragmented and there are no order-2+ pages available. There is usually
> > > a large amount of slab memory (usually dentries/inodes) and further
> > > debugging has shown that there are way too many unmovable blocks which
> > > are skipped during the compaction. Multiple reporters have confirmed that
> > > the current linux-next which includes [1] and [2] helped and OOMs are
> > > not reproducible anymore. A simpler fix for the stable is to simply
> > > ignore the compaction feedback and retry as long as there is a reclaim
> > > progress for high order requests which we used to do before. We already
> > > do that for CONFIG_COMPACTION=n so let's reuse the same code when
> > > compaction is enabled as well.
> > > 
> > > [1] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz
> > > [2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz
> > > 
> > > Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
> > > Signed-off-by: Michal Hocko 
> > > ---
> > >  mm/page_alloc.c | 50 ++
> > >  1 file changed, 2 insertions(+), 48 deletions(-)
> 
> So, if this goes into Linus's tree, can you let sta...@vger.kernel.org
> know about it so we can add it to the 4.7-stable tree?  Otherwise
> there's not much I can do here now, right?

My plan is actually not to push this to Linus because we have a
proper fix for Linus's tree. It is just that the fix is quite large and I
felt the stable tree should get the simplest fix possible, which is
this partial revert. So what I am suggesting is to push a non-Linus
patch to stable because it is simpler.
-- 
Michal Hocko
SUSE Labs


Re: OOM detection regressions since 4.7

2016-08-22 Thread Markus Trippelsdorf
On 2016.08.22 at 11:32 +0200, Michal Hocko wrote:
> there have been multiple reports [1][2][3][4][5] about pre-mature OOM
> killer invocations since 4.7 which contains oom detection rework. All of
> them were for order-2 (kernel stack) allocation requests failing because
> of a high fragmentation and compaction failing to make any forward
> progress. While investigating this we have found out that the compaction
> just gives up too early. Vlastimil has been working on compaction
> improvement for quite some time and his series [6] is already sitting
> in mmotm tree. This already helps a lot because it drops some heuristics
> which are more aimed at lower latencies for high orders rather than
> reliability. Joonsoo has then identified further problem with too many
> blocks being marked as unmovable [7] and Vlastimil has prepared a patch
> on top of his series [8] which is also in the mmotm tree now.
> 
> That being said, the regression is real and should be fixed for 4.7
> stable users. [6] and [8] were reported to help and OOMs are no longer
> reproducible. I know we are quite late (rc3) in 4.8 but I would vote
> for merging those patches and having them in 4.8. For 4.7 I would go
> with a partial revert of the detection rework for high order requests
> (see patch below). This patch is really trivial. If those compaction
> improvements are just too large for 4.8 then we can use the same patch
> as for 4.7 stable for now and revert it in 4.9 after compaction changes
> are merged.
> 
> Thoughts?
> 
> [1] http://lkml.kernel.org/r/20160731051121.GB307@x4

For the report [1] above:

markus@x4 linux % cat .config | grep CONFIG_COMPACTION
# CONFIG_COMPACTION is not set

-- 
Markus


Re: OOM detection regressions since 4.7

2016-08-22 Thread Greg KH
On Mon, Aug 22, 2016 at 11:37:07AM +0200, Michal Hocko wrote:
> [oops, fixing up Greg's email]
> 
> On Mon 22-08-16 11:32:49, Michal Hocko wrote:
> > Hi, 
> > there have been multiple reports [1][2][3][4][5] about pre-mature OOM
> > killer invocations since 4.7 which contains oom detection rework. All of
> > them were for order-2 (kernel stack) allocation requests failing because
> > of a high fragmentation and compaction failing to make any forward
> > progress. While investigating this we have found out that the compaction
> > just gives up too early. Vlastimil has been working on compaction
> > improvement for quite some time and his series [6] is already sitting
> > in mmotm tree. This already helps a lot because it drops some heuristics
> > which are more aimed at lower latencies for high orders rather than
> > reliability. Joonsoo has then identified further problem with too many
> > blocks being marked as unmovable [7] and Vlastimil has prepared a patch
> > on top of his series [8] which is also in the mmotm tree now.
> > 
> > That being said, the regression is real and should be fixed for 4.7
> > stable users. [6] and [8] were reported to help and OOMs are no longer
> > reproducible. I know we are quite late (rc3) in 4.8 but I would vote
> > for merging those patches and having them in 4.8. For 4.7 I would go
> > with a partial revert of the detection rework for high order requests
> > (see patch below). This patch is really trivial. If those compaction
> > improvements are just too large for 4.8 then we can use the same patch
> > as for 4.7 stable for now and revert it in 4.9 after compaction changes
> > are merged.
> > 
> > Thoughts?
> > 
> > [1] http://lkml.kernel.org/r/20160731051121.GB307@x4
> > [2] http://lkml.kernel.org/r/201608120901.41463.a.miskiew...@gmail.com
> > [3] http://lkml.kernel.org/r/20160801192620.gd31...@dhcp22.suse.cz
> > [4] https://lists.opensuse.org/opensuse-kernel/2016-08/msg00021.html
> > [5] https://bugzilla.opensuse.org/show_bug.cgi?id=994066
> > [6] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz
> > [7] http://lkml.kernel.org/r/20160816031222.GC16913@js1304-P5Q-DELUXE
> > [8] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz
> > 
> > ---
> > From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko 
> > Date: Mon, 22 Aug 2016 10:52:06 +0200
> > Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation for high
> >  order request
> > 
> > There have been several reports about pre-mature OOM killer invocation
> > in 4.7 kernel when order-2 allocation request (for the kernel stack)
> > invoked OOM killer even during basic workloads (light IO or even kernel
> > compile on some filesystems). In all reported cases the memory is
> > fragmented and there are no order-2+ pages available. There is usually
> > a large amount of slab memory (usually dentries/inodes) and further
> > debugging has shown that there are way too many unmovable blocks which
> > are skipped during the compaction. Multiple reporters have confirmed that
> > the current linux-next which includes [1] and [2] helped and OOMs are
> > not reproducible anymore. A simpler fix for the stable is to simply
> > ignore the compaction feedback and retry as long as there is a reclaim
> > progress for high order requests which we used to do before. We already
> > do that for CONFIG_COMPACTION=n so let's reuse the same code when
> > compaction is enabled as well.
> > 
> > [1] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz
> > [2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz
> > 
> > Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
> > Signed-off-by: Michal Hocko 
> > ---
> >  mm/page_alloc.c | 50 ++
> >  1 file changed, 2 insertions(+), 48 deletions(-)

So, if this goes into Linus's tree, can you let sta...@vger.kernel.org
know about it so we can add it to the 4.7-stable tree?  Otherwise
there's not much I can do here now, right?

thanks,

greg k-h


Re: OOM detection regressions since 4.7

2016-08-22 Thread Michal Hocko
[oops, fixing up Greg's email]

On Mon 22-08-16 11:32:49, Michal Hocko wrote:
> Hi, 
> there have been multiple reports [1][2][3][4][5] about pre-mature OOM
> killer invocations since 4.7 which contains oom detection rework. All of
> them were for order-2 (kernel stack) allocation requests failing because
> of a high fragmentation and compaction failing to make any forward
> progress. While investigating this we have found out that the compaction
> just gives up too early. Vlastimil has been working on compaction
> improvement for quite some time and his series [6] is already sitting
> in mmotm tree. This already helps a lot because it drops some heuristics
> which are more aimed at lower latencies for high orders rather than
> reliability. Joonsoo has since identified a further problem with too many
> blocks being marked as unmovable [7] and Vlastimil has prepared a patch
> on top of his series [8] which is also in the mmotm tree now.
> 
> That being said, the regression is real and should be fixed for 4.7
> stable users. [6] and [8] were reported to help and OOMs are no longer
> reproducible. I know we are quite late (rc3) in 4.8 but I would vote
> for merging those patches and having them in 4.8. For 4.7 I would go
> with a partial revert of the detection rework for high order requests
> (see patch below). This patch is really trivial. If those compaction
> improvements are just too large for 4.8 then we can use the same patch
> as for 4.7 stable for now and revert it in 4.9 after compaction changes
> are merged.
> 
> Thoughts?
> 
> [1] http://lkml.kernel.org/r/20160731051121.GB307@x4
> [2] http://lkml.kernel.org/r/201608120901.41463.a.miskiew...@gmail.com
> [3] http://lkml.kernel.org/r/20160801192620.gd31...@dhcp22.suse.cz
> [4] https://lists.opensuse.org/opensuse-kernel/2016-08/msg00021.html
> [5] https://bugzilla.opensuse.org/show_bug.cgi?id=994066
> [6] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz
> [7] http://lkml.kernel.org/r/20160816031222.GC16913@js1304-P5Q-DELUXE
> [8] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz
> 
> ---
> From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Mon, 22 Aug 2016 10:52:06 +0200
> Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation for high
>  order request
> 
> There have been several reports about pre-mature OOM killer invocation
> in 4.7 kernel when order-2 allocation request (for the kernel stack)
> invoked OOM killer even during basic workloads (light IO or even kernel
> compile on some filesystems). In all reported cases the memory is
> fragmented and there are no order-2+ pages available. There is usually
> a large amount of slab memory (usually dentries/inodes) and further
> debugging has shown that there are way too many unmovable blocks which
> are skipped during the compaction. Multiple reporters have confirmed that
> the current linux-next which includes [1] and [2] helped and OOMs are
> not reproducible anymore. A simpler fix for the stable is to simply
> ignore the compaction feedback and retry as long as there is a reclaim
> progress for high order requests which we used to do before. We already
> do that for CONFIG_COMPACTION=n so let's reuse the same code when
> compaction is enabled as well.
> 
> [1] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz
> [2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz
> 
> Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
> Signed-off-by: Michal Hocko 
> ---
>  mm/page_alloc.c | 50 ++
>  1 file changed, 2 insertions(+), 48 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8b3e1341b754..6e354199151b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3254,53 +3254,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>   return NULL;
>  }
>  
> -static inline bool
> -should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
> -  enum compact_result compact_result, enum migrate_mode *migrate_mode,
> -  int compaction_retries)
> -{
> - int max_retries = MAX_COMPACT_RETRIES;
> -
> - if (!order)
> - return false;
> -
> - /*
> -  * compaction considers all the zone as desperately out of memory
> -  * so it doesn't really make much sense to retry except when the
> -  * failure could be caused by weak migration mode.
> -  */
> - if (compaction_failed(compact_result)) {
> - if (*migrate_mode == MIGRATE_ASYNC) {
> - *migrate_mode = MIGRATE_SYNC_LIGHT;
> - return true;
> - }
> - return false;
> - }
> -
> - /*
> -  * make sure the compaction wasn't deferred or didn't bail out early
> -  * due to locks contention before we declare that we should give up.
> -  * But do not 

OOM detection regressions since 4.7

2016-08-22 Thread Michal Hocko
Hi, 
there have been multiple reports [1][2][3][4][5] about pre-mature OOM
killer invocations since 4.7 which contains oom detection rework. All of
them were for order-2 (kernel stack) allocation requests failing because
of a high fragmentation and compaction failing to make any forward
progress. While investigating this we have found out that the compaction
just gives up too early. Vlastimil has been working on compaction
improvement for quite some time and his series [6] is already sitting
in mmotm tree. This already helps a lot because it drops some heuristics
which are more aimed at lower latencies for high orders rather than
reliability. Joonsoo has since identified a further problem with too many
blocks being marked as unmovable [7] and Vlastimil has prepared a patch
on top of his series [8] which is also in the mmotm tree now.

That being said, the regression is real and should be fixed for 4.7
stable users. [6] and [8] were reported to help and OOMs are no longer
reproducible. I know we are quite late (rc3) in 4.8 but I would vote
for merging those patches and having them in 4.8. For 4.7 I would go
with a partial revert of the detection rework for high order requests
(see patch below). This patch is really trivial. If those compaction
improvements are just too large for 4.8 then we can use the same patch
as for 4.7 stable for now and revert it in 4.9 after compaction changes
are merged.

Thoughts?

[1] http://lkml.kernel.org/r/20160731051121.GB307@x4
[2] http://lkml.kernel.org/r/201608120901.41463.a.miskiew...@gmail.com
[3] http://lkml.kernel.org/r/20160801192620.gd31...@dhcp22.suse.cz
[4] https://lists.opensuse.org/opensuse-kernel/2016-08/msg00021.html
[5] https://bugzilla.opensuse.org/show_bug.cgi?id=994066
[6] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz
[7] http://lkml.kernel.org/r/20160816031222.GC16913@js1304-P5Q-DELUXE
[8] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz

---
From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Mon, 22 Aug 2016 10:52:06 +0200
Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation for high
 order request

There have been several reports about pre-mature OOM killer invocation
in 4.7 kernel when order-2 allocation request (for the kernel stack)
invoked OOM killer even during basic workloads (light IO or even kernel
compile on some filesystems). In all reported cases the memory is
fragmented and there are no order-2+ pages available. There is usually
a large amount of slab memory (usually dentries/inodes) and further
debugging has shown that there are way too many unmovable blocks which
are skipped during the compaction. Multiple reporters have confirmed that
the current linux-next which includes [1] and [2] helped and OOMs are
not reproducible anymore. A simpler fix for the stable is to simply
ignore the compaction feedback and retry as long as there is a reclaim
progress for high order requests which we used to do before. We already
do that for CONFIG_COMPACTION=n so let's reuse the same code when
compaction is enabled as well.

[1] http://lkml.kernel.org/r/20160810091226.6709-1-vba...@suse.cz
[2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a9335593...@suse.cz

Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
Signed-off-by: Michal Hocko 
---
 mm/page_alloc.c | 50 ++
 1 file changed, 2 insertions(+), 48 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8b3e1341b754..6e354199151b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3254,53 +3254,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
return NULL;
 }
 
-static inline bool
-should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
-		     enum compact_result compact_result, enum migrate_mode *migrate_mode,
-int compaction_retries)
-{
-   int max_retries = MAX_COMPACT_RETRIES;
-
-   if (!order)
-   return false;
-
-   /*
-* compaction considers all the zone as desperately out of memory
-* so it doesn't really make much sense to retry except when the
-* failure could be caused by weak migration mode.
-*/
-   if (compaction_failed(compact_result)) {
-   if (*migrate_mode == MIGRATE_ASYNC) {
-   *migrate_mode = MIGRATE_SYNC_LIGHT;
-   return true;
-   }
-   return false;
-   }
-
-   /*
-* make sure the compaction wasn't deferred or didn't bail out early
-* due to locks contention before we declare that we should give up.
-* But do not retry if the given zonelist is not suitable for
-* compaction.
-*/
-   if (compaction_withdrawn(compact_result))
-   return compaction_zonelist_suitable(ac, order, alloc_flags);
-
-   /*
-* 
