Re: INFO: task hung in __blkdev_get (2)

2018-09-08 Thread Tetsuo Handa
This report contains "getblk(): executed=9 bh_count=0 bh_state=0" lines.
Thus,

#syz dup: INFO: task hung in generic_file_write_iter



Re: [PATCH] mm: memcontrol: print proper OOM header when no eligible victim left

2018-09-08 Thread Tetsuo Handa
On 2018/09/08 22:57, Johannes Weiner wrote:
> On Sat, Sep 08, 2018 at 10:36:06PM +0900, Tetsuo Handa wrote:
>> Due to commit d75da004c708c9fc ("oom: improve oom disable handling") and
>> commit 3100dab2aa09dc6e ("mm: memcontrol: print proper OOM header when
>> no eligible victim left"), all
>>
>>   kworker/0:1 invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), 
>> nodemask=(null), order=-1, oom_score_adj=0
>>   (...snipped...)
>>   Out of memory and no killable processes...
>>   OOM request ignored. No task eligible
>>
>> lines are printed.
> 
> This doesn't explain the context, what you were trying to do here, and
> what you expected to happen. Plus, you (...snipped...) the important
> part to understand why it failed in the first place.

I expect:

  When SysRq-f does not find a killable process, it should not emit any
  message other than "OOM request ignored. No task eligible".

There is no point in emitting memory information etc. in that case.
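For reference, my understanding (from the 4.19-rc sources, so treat this as a sketch
rather than an authoritative quote) is that SysRq-f is handled by moom_callback() in
drivers/tty/sysrq.c, which passes order = -1 so that is_sysrq_oom() can distinguish a
manually requested kill from a real allocation failure, and which already prints the
"OOM request ignored" line on failure:

  static void moom_callback(struct work_struct *ignored)
  {
          const gfp_t gfp_mask = GFP_KERNEL;
          struct oom_control oc = {
                  .zonelist = node_zonelist(first_memory_node, gfp_mask),
                  .nodemask = NULL,
                  .memcg = NULL,
                  .gfp_mask = gfp_mask,
                  .order = -1,    /* is_sysrq_oom(oc) tests oc->order == -1 */
          };

          mutex_lock(&oom_lock);
          if (!out_of_memory(&oc))
                  pr_info("OOM request ignored. No task eligible\n");
          mutex_unlock(&oom_lock);
  }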

> 
>> Let's not emit "invoked oom-killer" lines when SysRq-f failed.
> 
> I disagree. If the user asked for an OOM kill, it makes perfect sense
> to dump the memory context and the outcome of the operation - even if
> the outcome is "I didn't find anything to kill". I'd argue that the
> failure case *in particular* is where I want to know about and have
> all the information that could help me understand why it failed.

How does emitting memory information etc. help you understand why it failed?
"No task eligible" is sufficient for you to understand why, isn't it?

> 
> So NAK on the inferred patch premise, but please include way more
> rationale, reproduction scenario etc. in future patches. It's not at
> all clear *why* you think it should work the way you propose here.
> 



Re: [PATCH] mm: memcontrol: print proper OOM header when no eligible victim left

2018-09-08 Thread Tetsuo Handa
On 2018/08/22 1:04, Johannes Weiner wrote:
> When the memcg OOM killer runs out of killable tasks, it currently
> prints a WARN with no further OOM context. This has caused some user
> confusion.
> 
> Warnings indicate a kernel problem. In a reported case, however, the
> situation was triggered by a non-sensical memcg configuration (hard
> limit set to 0). But without any VM context this wasn't obvious from
> the report, and it took some back and forth on the mailing list to
> identify what is actually a trivial issue.
> 
> Handle this OOM condition like we handle it in the global OOM killer:
> dump the full OOM context and tell the user we ran out of tasks.
> 
> This way the user can identify misconfigurations easily by themselves
> and rectify the problem - without having to go through the hassle of
> running into an obscure but unsettling warning, finding the
> appropriate kernel mailing list and waiting for a kernel developer to
> remote-analyze that the memcg configuration caused this.
> 
> If users cannot make sense of why the OOM killer was triggered or why
> it failed, they will still report it to the mailing list, we know that
> from experience. So in case there is an actual kernel bug causing
> this, kernel developers will very likely hear about it.
> 
> Signed-off-by: Johannes Weiner 
> Acked-by: Michal Hocko 
> ---
>  mm/memcontrol.c |  2 --
>  mm/oom_kill.c   | 13 ++---
>  2 files changed, 10 insertions(+), 5 deletions(-)
> 

Now that the above patch went into 4.19-rc3, please apply the one below.

>From eb2bff2ed308da04785bcf541dd3f748286bfa23 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa 
Date: Sat, 8 Sep 2018 22:26:28 +0900
Subject: [PATCH] mm, oom: Don't emit noises for failed SysRq-f.

Due to commit d75da004c708c9fc ("oom: improve oom disable handling") and
commit 3100dab2aa09dc6e ("mm: memcontrol: print proper OOM header when
no eligible victim left"), all

  kworker/0:1 invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), 
nodemask=(null), order=-1, oom_score_adj=0
  (...snipped...)
  Out of memory and no killable processes...
  OOM request ignored. No task eligible

lines are printed.
Let's not emit "invoked oom-killer" lines when SysRq-f failed.

Signed-off-by: Tetsuo Handa 
---
 mm/oom_kill.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..92122ef 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1106,8 +1106,10 @@ bool out_of_memory(struct oom_control *oc)
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
-		dump_header(oc, NULL);
-		pr_warn("Out of memory and no killable processes...\n");
+		if (!is_sysrq_oom(oc)) {
+			dump_header(oc, NULL);
+			pr_warn("Out of memory and no killable processes...\n");
+		}
 		/*
 		 * If we got here due to an actual allocation at the
 		 * system level, we cannot survive this and will enter
-- 
1.8.3.1




Re: [PATCH 3/6] selinux: convert to kvmalloc

2018-09-07 Thread Tetsuo Handa
On 2018/09/08 1:56, Kent Overstreet wrote:
> @@ -329,8 +328,7 @@ int avtab_alloc(struct avtab *h, u32 nrules)
>   nslot = MAX_AVTAB_HASH_BUCKETS;
>   mask = nslot - 1;
>  
> - h->htable = flex_array_alloc(sizeof(struct avtab_node *), nslot,
> -  GFP_KERNEL | __GFP_ZERO);
> + h->htable = kvmalloc_array(nslot, sizeof(void *), GFP_KERNEL);
>   if (!h->htable)
>   return -ENOMEM;
>  

kvmalloc_array() does not imply __GFP_ZERO.
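If the zeroed allocation of the original flex_array_alloc(..., GFP_KERNEL | __GFP_ZERO)
call is still wanted, one minimal fix (a sketch against the hunk as posted, not a tested
patch) would be to allocate with kvcalloc(), which zeroes the buffer, or to pass
__GFP_ZERO explicitly:

	/* keep the zeroing behaviour of the flex_array version */
	h->htable = kvcalloc(nslot, sizeof(struct avtab_node *), GFP_KERNEL);
	if (!h->htable)
		return -ENOMEM;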



Re: BUG: bad usercopy in __check_object_size (2)

2018-09-07 Thread Tetsuo Handa
On 2018/09/08 0:29, syzbot wrote:
> syzbot has found a reproducer for the following crash on:
> 
> HEAD commit:    28619527b8a7 Merge git://git.kernel.org/pub/scm/linux/kern..
> git tree:   bpf
> console output: https://syzkaller.appspot.com/x/log.txt?x=124e64d140
> kernel config:  https://syzkaller.appspot.com/x/.config?x=62e9b447c16085cf
> dashboard link: https://syzkaller.appspot.com/bug?extid=a3c9d2673837ccc0f22b
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=179f9cd140
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11b3e8be40
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+a3c9d2673837ccc0f...@syzkaller.appspotmail.com
> 
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x440479
> usercopy: Kernel memory overwrite attempt detected to spans multiple pages 
> (offset 0, size 64)!

Kees, is this because check_page_span() is failing to allow the on-stack variable

   u8 opcodes[OPCODE_BUFSIZE];

which by chance crossed a PAGE_SIZE boundary?
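For anyone reading along: under CONFIG_HARDENED_USERCOPY_PAGESPAN the check roughly
boils down to the following (my paraphrase of the core of check_page_span() in
mm/usercopy.c, trimmed and simplified, not a verbatim quote), which is why a 64-byte
object that merely straddles a base-page boundary can be rejected with the
"spans multiple pages" message:

	const void *end = ptr + n - 1;

	/* Object entirely within one base page: accepted. */
	if (((unsigned long)ptr & PAGE_MASK) == ((unsigned long)end & PAGE_MASK))
		return;

	/* Start and end within the same compound page: accepted. */
	if (virt_to_head_page(end) == virt_to_head_page(ptr))
		return;

	/* Otherwise, unless the pages are Reserved or CMA, the copy is rejected. */
	usercopy_abort("spans multiple pages", NULL, to_user, 0, n);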


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-09-06 Thread Tetsuo Handa
Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > I assert that we should fix af5679fbc669f31f.
> > 
> > If you can come up with reasonable patch which doesn't complicate the
> > code and it is a clear win for both this particular workload as well as
> > others then why not.
> 
> Why can't we do "at least MMF_OOM_SKIP should be set under the lock to
> prevent from races" ?
> 

Well, serializing the setting of MMF_OOM_SKIP using oom_lock was not sufficient.
It seems that a little more time is needed for the last-second allocation attempt
at __alloc_pages_may_oom() to succeed. We need to migrate to the
"mm, oom: fix unnecessary killing of additional processes" thread.

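For readers not familiar with that path: the "last second allocation attempt" is the
high-watermark retry performed under oom_lock just before out_of_memory() is invoked.
Simplified from __alloc_pages_may_oom() in mm/page_alloc.c as of 4.19-rc (a sketch,
not a verbatim quote):

	if (!mutex_trylock(&oom_lock)) {
		*did_some_progress = 1;
		schedule_timeout_uninterruptible(1);
		return NULL;
	}

	/*
	 * Go through the zonelist one more time with a very high watermark,
	 * to catch a parallel OOM kill that already freed memory.
	 */
	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
				      ALLOC_WMARK_HIGH | ALLOC_CPUSET, ac);
	if (page)
		goto out;	/* memory became available, no further kill needed */

	if (out_of_memory(&oc))
		*did_some_progress = 1;

The problematic window is between that get_page_from_freelist() call and the moment
out_of_memory() observes MMF_OOM_SKIP: memory reaped in between is not noticed, and
another victim gets selected.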


[  702.895936] a.out invoked oom-killer: 
gfp_mask=0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, 
oom_score_adj=0
[  702.900717] a.out cpuset=/ mems_allowed=0
[  702.903210] CPU: 1 PID: 3359 Comm: a.out Tainted: GT 
4.19.0-rc2+ #692
[  702.906630] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
Desktop Reference Platform, BIOS 6.00 05/19/2017
[  702.910771] Call Trace:
[  702.912681]  dump_stack+0x85/0xcb
[  702.914918]  dump_header+0x69/0x2fe
[  702.917387]  ? _raw_spin_unlock_irqrestore+0x41/0x70
[  702.921267]  oom_kill_process+0x307/0x390
[  702.923809]  out_of_memory+0x2f3/0x5d0
[  702.926208]  __alloc_pages_slowpath+0xc01/0x1030
[  702.928817]  __alloc_pages_nodemask+0x333/0x390
[  702.931270]  alloc_pages_vma+0x77/0x1f0
[  702.933618]  __handle_mm_fault+0x81c/0xf40
[  702.935962]  handle_mm_fault+0x1b7/0x3c0
[  702.938215]  __do_page_fault+0x2a6/0x580
[  702.940481]  do_page_fault+0x32/0x270
[  702.942753]  ? page_fault+0x8/0x30
[  702.944860]  page_fault+0x1e/0x30
[  702.947138] RIP: 0033:0x4008d8
[  702.949722] Code: Bad RIP value.
[  702.952000] RSP: 002b:7ffc21fd99c0 EFLAGS: 00010206
[  702.954570] RAX: 7fb3457cc010 RBX: 0001 RCX: 
[  702.957631] RDX: b0b24000 RSI: 0002 RDI: 00020050
[  702.960599] RBP: 7fb3457cc010 R08: 00021000 R09: 00021000
[  702.963531] R10: 0022 R11: 1000 R12: 0006
[  702.966518] R13: 7ffc21fd9ab0 R14:  R15: 
[  702.971186] Mem-Info:
[  702.976959] active_anon:788641 inactive_anon:3457 isolated_anon:0
[  702.976959]  active_file:0 inactive_file:77 isolated_file:0
[  702.976959]  unevictable:0 dirty:0 writeback:0 unstable:0
[  702.976959]  slab_reclaimable:8152 slab_unreclaimable:24616
[  702.976959]  mapped:2806 shmem:3704 pagetables:4355 bounce:0
[  702.976959]  free:20831 free_pcp:136 free_cma:0
[  703.007374] Node 0 active_anon:3154564kB inactive_anon:13828kB 
active_file:304kB inactive_file:112kB unevictable:0kB isolated(anon):0kB 
isolated(file):0kB mapped:11028kB dirty:0kB writeback:0kB shmem:14816kB 
shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2846720kB writeback_tmp:0kB 
unstable:0kB all_unreclaimable? no
[  703.029994] Node 0 DMA free:13816kB min:308kB low:384kB high:460kB 
active_anon:1976kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB writepending:0kB present:15960kB managed:15876kB mlocked:0kB 
kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB 
free_cma:0kB
[  703.055331] lowmem_reserve[]: 0 2717 3378 3378
[  703.062881] Node 0 DMA32 free:58152kB min:54108kB low:67632kB high:81156kB 
active_anon:2718364kB inactive_anon:12kB active_file:424kB inactive_file:852kB 
unevictable:0kB writepending:0kB present:3129152kB managed:2782296kB 
mlocked:0kB kernel_stack:352kB pagetables:388kB bounce:0kB free_pcp:728kB 
local_pcp:0kB free_cma:0kB
[  703.091529] lowmem_reserve[]: 0 0 661 661
[  703.096552] Node 0 Normal free:12900kB min:13164kB low:16452kB high:19740kB 
active_anon:434248kB inactive_anon:13816kB active_file:0kB inactive_file:48kB 
unevictable:0kB writepending:0kB present:1048576kB managed:676908kB mlocked:0kB 
kernel_stack:6720kB pagetables:17028kB bounce:0kB free_pcp:0kB local_pcp:0kB 
free_cma:0kB
[  703.123046] lowmem_reserve[]: 0 0 0 0
[  703.127490] Node 0 DMA: 0*4kB 1*8kB (M) 1*16kB (U) 3*32kB (U) 4*64kB (UM) 
1*128kB (U) 0*256kB 0*512kB 1*1024kB (U) 0*2048kB 3*4096kB (M) = 13816kB
[  703.134763] Node 0 DMA32: 85*4kB (UM) 59*8kB (UM) 61*16kB (UM) 65*32kB (UME) 
47*64kB (UE) 44*128kB (UME) 41*256kB (UME) 27*512kB (UME) 20*1024kB (ME) 
0*2048kB 0*4096kB = 57308kB
[  703.144076] Node 0 Normal: 119*4kB (UM) 69*8kB (UM) 156*16kB (UME) 179*32kB 
(UME) 44*64kB (UE) 9*128kB (UE) 1*256kB (E) 0*512kB 0*1024kB 0*2048kB 0*4096kB 
= 13476kB
[  703.151746] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 
hugepages_size=2048kB
[  703.156537] 3744 total pagecache pages
[  703.159017] 0 pages in swap cache
[  703.163209] Swap cache stats: add 0, delete 0, find 0/0
[  703.167254] Free swap  = 0kB
[  703.170577] Total swap = 0kB
[  703.173758] 1048422 pages RAM
[  703.176843] 0 pages HighM

Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-09-06 Thread Tetsuo Handa
Michal Hocko wrote:
> > I assert that we should fix af5679fbc669f31f.
> 
> If you can come up with reasonable patch which doesn't complicate the
> code and it is a clear win for both this particular workload as well as
> others then why not.

Why can't we do "at least MMF_OOM_SKIP should be set under the lock to
prevent from races" ?

diff --git a/mm/mmap.c b/mm/mmap.c
index 5f2b2b1..e096bb8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3065,7 +3065,9 @@ void exit_mmap(struct mm_struct *mm)
 		 */
 		(void)__oom_reap_task_mm(mm);
 
+		mutex_lock(&oom_lock);
 		set_bit(MMF_OOM_SKIP, &mm->flags);
+		mutex_unlock(&oom_lock);
 		down_write(&mm->mmap_sem);
 		up_write(&mm->mmap_sem);
 	}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..b2a94c1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -606,7 +606,9 @@ static void oom_reap_task(struct task_struct *tsk)
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
 	 */
+	mutex_lock(&oom_lock);
 	set_bit(MMF_OOM_SKIP, &mm->flags);
+	mutex_unlock(&oom_lock);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-09-05 Thread Tetsuo Handa
Michal Hocko wrote:
> On Wed 05-09-18 22:53:33, Tetsuo Handa wrote:
> > On 2018/09/05 22:40, Michal Hocko wrote:
> > > Changelog said 
> > > 
> > > "Although this is possible in principle let's wait for it to actually
> > > happen in real life before we make the locking more complex again."
> > > 
> > > So what is the real life workload that hits it? The log you have pasted
> > > below doesn't tell much.
> > 
> > Nothing special. I just ran a multi-threaded memory eater on a 
> > CONFIG_PREEMPT=y kernel.
> 
> I strongly suspec that your test doesn't really represent or simulate
> any real and useful workload. Sure it triggers a rare race and we kill
> another oom victim. Does this warrant to make the code more complex?
> Well, I am not convinced, as I've said countless times.

Yes. Below is an example from a machine running an Apache web server, a Tomcat AP
server, and a PostgreSQL DB server. A memory eater needlessly killed Tomcat due to
this race. I assert that we should fix af5679fbc669f31f.



Before:

systemd(1)-+-NetworkManager(693)-+-dhclient(791)
   | |-{NetworkManager}(698)
   | `-{NetworkManager}(702)
   |-abrtd(630)
   |-agetty(1007)
   |-atd(653)
   |-auditd(600)---{auditd}(601)
   |-avahi-daemon(625)---avahi-daemon(631)
   |-crond(657)
   |-dbus-daemon(638)
   |-firewalld(661)---{firewalld}(788)
   |-httpd(1169)-+-httpd(1170)
   | |-httpd(1171)
   | |-httpd(1172)
   | |-httpd(1173)
   | `-httpd(1174)
   |-irqbalance(628)
   |-java(1074)-+-{java}(1092)
   ||-{java}(1093)
   ||-{java}(1094)
   ||-{java}(1095)
   ||-{java}(1096)
   ||-{java}(1097)
   ||-{java}(1098)
   ||-{java}(1099)
   ||-{java}(1100)
   ||-{java}(1101)
   ||-{java}(1102)
   ||-{java}(1103)
   ||-{java}(1104)
   ||-{java}(1105)
   ||-{java}(1106)
   ||-{java}(1107)
   ||-{java}(1108)
   ||-{java}(1109)
   ||-{java}(1110)
   ||-{java}()
   ||-{java}(1114)
   ||-{java}(1115)
   ||-{java}(1116)
   ||-{java}(1117)
   ||-{java}(1118)
   ||-{java}(1119)
   ||-{java}(1120)
   ||-{java}(1121)
   ||-{java}(1122)
   ||-{java}(1123)
   ||-{java}(1124)
   ||-{java}(1125)
   ||-{java}(1126)
   ||-{java}(1127)
   ||-{java}(1128)
   ||-{java}(1129)
   ||-{java}(1130)
   ||-{java}(1131)
   ||-{java}(1132)
   ||-{java}(1133)
   ||-{java}(1134)
   ||-{java}(1135)
   ||-{java}(1136)
   ||-{java}(1137)
   |`-{java}(1138)
   |-ksmtuned(659)---sleep(1727)
   |-login(1006)---bash(1052)---pstree(1728)
   |-polkitd(624)-+-{polkitd}(633)
   |  |-{polkitd}(642)
   |  |-{polkitd}(643)
   |  |-{polkitd}(645)
   |  `-{polkitd}(650)
   |-postgres(1154)-+-postgres(1155)
   ||-postgres(1157)
   ||-postgres(1158)
   ||-postgres(1159)
   ||-postgres(1160)
   |`-postgres(1161)
   |-rsyslogd(986)-+-{rsyslogd}(997)
   |   `-{rsyslogd}(999)
   |-sendmail(1008)
   |-sendmail(1023)
   |-smbd(983)-+-cleanupd(1027)
   |   |-lpqd(1032)
   |   `-smbd-notifyd(1026)
   |-sshd(981)
   |-systemd-journal(529)
   |-systemd-logind(627)
   |-systemd-udevd(560)
   `-tuned(980)-+-{tuned}(1030)
|-{tuned}(1031)
|-{tuned}(1033)
`-{tuned}(1047)



After:

systemd(1)-+-NetworkManager(693)-+-dhclient(791)
   | |-{NetworkManager}(698)
   | `-{NetworkManager}(702)
   |-abrtd(630)
   |-agetty(1007)
   |-atd(653)
   |-auditd(600)---{auditd}(601)
   |-avahi-daemon(625)---av

Re: INFO: task hung in vfat_lookup

2018-09-05 Thread Tetsuo Handa
On 2018/09/05 20:19, syzbot wrote:
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:    420f51f4ab6b Merge tag 'arm64-fixes' of git://git.kernel.o..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=11296c9240
> kernel config:  https://syzkaller.appspot.com/x/.config?x=531a917630d2a492
> dashboard link: https://syzkaller.appspot.com/bug?extid=72000baa7858f1703b04
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> 
> Unfortunately, I don't have any reproducer for this crash yet.

A report for linux-next contains

"getblk(): executed=9 bh_count=0 bh_state=0"

lines. Therefore,

#syz dup: INFO: task hung in generic_file_write_iter


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-09-05 Thread Tetsuo Handa
On 2018/09/05 22:40, Michal Hocko wrote:
> Changelog said 
> 
> "Although this is possible in principle let's wait for it to actually
> happen in real life before we make the locking more complex again."
> 
> So what is the real life workload that hits it? The log you have pasted
> below doesn't tell much.

Nothing special. I just ran a multi-threaded memory eater on a CONFIG_PREEMPT=y kernel.
The OOM reaper succeeded in reclaiming all memory, but the OOM killer still triggered
just because somebody was inside that race window.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/prctl.h>

static int memory_eater(void *unused)
{
	const int fd = open("/proc/self/exe", O_RDONLY);
	static cpu_set_t set = { { 1 } };
	sched_setaffinity(0, sizeof(set), &set);
	srand(getpid());
	while (1) {
		int size = rand() % 1048576;
		void *ptr = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
		munmap(ptr, size);
	}
	return 0;
}

static int child(void *unused)
{
	static char *stack[256] = { };
	char buf[32] = { };
	int i;
	sleep(1);
	snprintf(buf, sizeof(buf), "tgid=%u", getpid());
	prctl(PR_SET_NAME, (unsigned long) buf, 0, 0, 0);
	for (i = 0; i < 256; i++)
		stack[i] = malloc(4096 * 2);
	for (i = 1; i < 128; i++)
		if (clone(memory_eater, stack[i] + 8192,
			  CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL) == -1)
			_exit(1);
	for (; i < 254; i++)
		if (clone(memory_eater, stack[i] + 8192, CLONE_VM, NULL) == -1)
			_exit(1);
	return 0;
}

int main(int argc, char *argv[])
{
	char *stack = malloc(1048576);
	char *buf = NULL;
	unsigned long size;
	unsigned long i;
	if (clone(child, stack + 1048576, CLONE_VM, NULL) == -1)
		_exit(1);
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	/* Will cause OOM due to overcommit */
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	while (1)
		pause();
	return 0;
}

> 
>> [  278.147280] Out of memory: Kill process 9943 (a.out) score 919 or 
>> sacrifice child
>> [  278.148927] Killed process 9943 (a.out) total-vm:4267252kB, 
>> anon-rss:3430056kB, file-rss:0kB, shmem-rss:0kB
>> [  278.151586] vmtoolsd invoked oom-killer: 
>> gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, 
>> oom_score_adj=0
> [...]
>> [  278.331527] Out of memory: Kill process 8790 (firewalld) score 5 or 
>> sacrifice child
>> [  278.333267] Killed process 8790 (firewalld) total-vm:358012kB, 
>> anon-rss:21928kB, file-rss:0kB, shmem-rss:0kB
>> [  278.336430] oom_reaper: reaped process 8790 (firewalld), now 
>> anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> 

Another example from a different machine.

[  765.523676] a.out invoked oom-killer: 
gfp_mask=0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, 
oom_score_adj=0
[  765.528172] a.out cpuset=/ mems_allowed=0
[  765.530603] CPU: 5 PID: 4540 Comm: a.out Tainted: GT 
4.19.0-rc2+ #689
[  765.534307] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
Desktop Reference Platform, BIOS 6.00 05/19/2017
[  765.538975] Call Trace:
[  765.540920]  dump_stack+0x85/0xcb
[  765.543088]  dump_header+0x69/0x2fe
[  765.545375]  ? _raw_spin_unlock_irqrestore+0x41/0x70
[  765.547983]  oom_kill_process+0x307/0x390
[  765.550394]  out_of_memory+0x2f5/0x5b0
[  765.552818]  __alloc_pages_slowpath+0xc01/0x1030
[  765.555442]  __alloc_pages_nodemask+0x333/0x390
[  765.557966]  alloc_pages_vma+0x77/0x1f0
[  765.560292]  __handle_mm_fault+0x81c/0xf40
[  765.562736]  handle_mm_fault+0x1b7/0x3c0
[  765.565038]  __do_page_fault+0x2a6/0x580
[  765.567420]  do_page_fault+0x32/0x270
[  765.569670]  ? page_fault+0x8/0x30
[  765.571833]  page_fault+0x1e/0x30
[  765.573934] RIP: 0033:0x4008d8
[  765.575924] Code: Bad RIP value.
[  765.577842] RSP: 002b:7ffec6f7d420 EFLAGS: 00010206
[  765.580221] RAX: 7f6a201f4010 RBX: 0001 RCX: 
[  765.583253] RDX: bde23000 RSI: 0002 RDI: 00020050
[  765.586207] RBP: 7f6a201f4010 R08: 00021000 R09: 00021000
[  765.589047] R10: 0022 R11: 1000 R12: 0006
[  765.591996] R13: 7ffec6f7d510 R14:  R15: 
[  765.595732] Mem-Info:
[  765.597580] active_anon:794622 inactive_anon:2126 isolated_anon:0
[  765.597580]  active_file:7 inactive_file:11 isolated_file:0
[  765.597580]  unevictable:0 dirty:0 writeback:0 unstable:0
[  765.597580]  

Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-09-05 Thread Tetsuo Handa
On 2018/08/24 9:31, Tetsuo Handa wrote:
> For now, I don't think we need to add af5679fbc669f31f to the list for
> CVE-2016-10723, for af5679fbc669f31f might cause premature next OOM victim
> selection (especially with CONFIG_PREEMPT=y kernels) due to
> 
>    __alloc_pages_may_oom():                   oom_reap_task():
> 
>      mutex_trylock(&oom_lock) succeeds.
>      get_page_from_freelist() fails.
>      Preempted to other process.
>                                                  oom_reap_task_mm() succeeds.
>                                                  Sets MMF_OOM_SKIP.
>      Returned from preemption.
>      Finds that MMF_OOM_SKIP was already set.
>      Selects next OOM victim and kills it.
>      mutex_unlock(&oom_lock) is called.
> 
> race window like described as
> 
> Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock
> to prevent from races when the page allocator didn't manage to get the
> freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later
> on and move on to another victim.  Although this is possible in principle
> let's wait for it to actually happen in real life before we make the
> locking more complex again.
> 
> in that commit.
> 

Yes, that race window is real. We can needlessly select the next OOM victim.
I think that af5679fbc669f31f was too optimistic.
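To spell out why this leads to an extra kill: victim selection treats an existing OOM
victim whose mm already has MMF_OOM_SKIP set as "no longer in the way" and keeps
scanning. Roughly, paraphrasing the relevant check in oom_evaluate_task() from
mm/oom_kill.c as of 4.19-rc (trimmed, not a verbatim quote):

	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
			goto next;	/* skip this victim, keep scanning */
		goto abort;		/* a victim is still exiting; do not kill more */
	}

So once the OOM reaper (or exit_mmap()) sets MMF_OOM_SKIP, an allocator that already
failed its last-second allocation attempt no longer aborts and proceeds to pick the
next victim, even though the reaped memory may already be free.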

[  278.147280] Out of memory: Kill process 9943 (a.out) score 919 or sacrifice 
child
[  278.148927] Killed process 9943 (a.out) total-vm:4267252kB, 
anon-rss:3430056kB, file-rss:0kB, shmem-rss:0kB
[  278.151586] vmtoolsd invoked oom-killer: 
gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, 
oom_score_adj=0
[  278.156642] vmtoolsd cpuset=/ mems_allowed=0
[  278.158884] CPU: 2 PID: 8916 Comm: vmtoolsd Kdump: loaded Not tainted 
4.19.0-rc2+ #465
[  278.162252] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
Desktop Reference Platform, BIOS 6.00 05/19/2017
[  278.165499] Call Trace:
[  278.166693]  dump_stack+0x99/0xdc
[  278.167922]  dump_header+0x70/0x2fa
[  278.169414] oom_reaper: reaped process 9943 (a.out), now anon-rss:0kB, 
file-rss:0kB, shmem-rss:0kB
[  278.169629]  ? _raw_spin_unlock_irqrestore+0x6a/0x8c
[  278.169633]  oom_kill_process+0x2ee/0x380
[  278.169635]  out_of_memory+0x136/0x540
[  278.169636]  ? out_of_memory+0x1fc/0x540
[  278.169640]  __alloc_pages_slowpath+0x986/0xce4
[  278.169641]  ? get_page_from_freelist+0x16b/0x1600
[  278.169646]  __alloc_pages_nodemask+0x398/0x3d0
[  278.180594]  alloc_pages_current+0x65/0xb0
[  278.182173]  __page_cache_alloc+0x154/0x190
[  278.184200]  ? pagecache_get_page+0x27/0x250
[  278.185410]  filemap_fault+0x4df/0x880
[  278.186282]  ? filemap_fault+0x31b/0x880
[  278.187395]  ? xfs_ilock+0x1bd/0x220
[  278.188264]  ? __xfs_filemap_fault+0x76/0x270
[  278.189268]  ? down_read_nested+0x48/0x80
[  278.190229]  ? xfs_ilock+0x1bd/0x220
[  278.191061]  __xfs_filemap_fault+0x89/0x270
[  278.192059]  xfs_filemap_fault+0x27/0x30
[  278.192967]  __do_fault+0x1f/0x70
[  278.193777]  __handle_mm_fault+0xfbd/0x1470
[  278.194743]  handle_mm_fault+0x1f2/0x400
[  278.195679]  ? handle_mm_fault+0x47/0x400
[  278.196618]  __do_page_fault+0x217/0x4b0
[  278.197504]  do_page_fault+0x3c/0x21e
[  278.198303]  ? page_fault+0x8/0x30
[  278.199092]  page_fault+0x1e/0x30
[  278.199821] RIP: 0033:0x7f2322ebbfb0
[  278.200605] Code: Bad RIP value.
[  278.201370] RSP: 002b:7ffda96e7648 EFLAGS: 00010246
[  278.202518] RAX:  RBX: 7f23220f26b0 RCX: 0010
[  278.204280] RDX:  RSI:  RDI: 7f2321ecfb5b
[  278.205838] RBP: 02504b70 R08: 7f2321ecfb60 R09: 0250bd20
[  278.207426] R10: 383836312d646c69 R11:  R12: 7ffda96e76b0
[  278.208982] R13: 7f2322ea8540 R14: 0250ba90 R15: 7f2323173920
[  278.210840] Mem-Info:
[  278.211462] active_anon:18629 inactive_anon:2390 isolated_anon:0
[  278.211462]  active_file:19 inactive_file:1565 isolated_file:0
[  278.211462]  unevictable:0 dirty:0 writeback:0 unstable:0
[  278.211462]  slab_reclaimable:5820 slab_unreclaimable:8964
[  278.211462]  mapped:2128 shmem:2493 pagetables:1826 bounce:0
[  278.211462]  free:878043 free_pcp:909 free_cma:0
[  278.218830] Node 0 active_anon:74516kB inactive_anon:9560kB active_file:76kB 
inactive_file:6260kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
mapped:8512kB dirty:0kB writeback:0kB shmem:9972kB shmem_thp: 0kB 
shmem_pmdmapped: 0kB anon_thp: 43008kB writeback_tmp:0kB unstable:0kB 
all_unreclaimable? yes
[  278.224997] Node 0 DMA free:15888kB min:288kB low:360kB high:432kB 
active_anon:32kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB 
kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB 
free_cma:0kB
[  278.230602] lowmem_reserve[]: 0 2663 

Re: [PATCH] security: tomoyo: Fix obsolete function

2018-09-04 Thread Tetsuo Handa
On 2018/09/04 17:41, Ding Xiang wrote:
> simple_strtoul is obsolete, and use kstrtouint instead
> 
> Signed-off-by: Ding Xiang 

Acked-by: Tetsuo Handa 

> ---
>  security/tomoyo/common.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/security/tomoyo/common.c b/security/tomoyo/common.c
> index 03923a1..9b38f94 100644
> --- a/security/tomoyo/common.c
> +++ b/security/tomoyo/common.c
> @@ -1660,7 +1660,8 @@ static void tomoyo_read_pid(struct tomoyo_io_buffer *head)
>  	head->r.eof = true;
>  	if (tomoyo_str_starts(&buf, "global-pid "))
>  		global_pid = true;
> -	pid = (unsigned int) simple_strtoul(buf, NULL, 10);
> +	if (kstrtouint(buf, 10, &pid))
> +		return;
>  	rcu_read_lock();
>  	if (global_pid)
>  		p = find_task_by_pid_ns(pid, &init_pid_ns);
> 


[PATCH] kernel/{lockdep,hung_task}: Show locks and backtrace of running tasks.

2018-09-03 Thread Tetsuo Handa
We are getting reports from syzbot where a running task seems to be
relevant to a hung task problem but the NMI backtrace does not print useful
information [1].

Although commit 8cc05c71ba5f7936 ("locking/lockdep: Move sanity check to
inside lockdep_print_held_locks()") says that calling
lockdep_print_held_locks() on a running thread is considered unsafe,
it is useful for syzbot to show locks and backtrace of running tasks.
Thus, let's allow it if CONFIG_DEBUG_AID_FOR_SYZBOT is defined.

[1] 
https://syzkaller.appspot.com/bug?id=8bab7a6a5597bb10f90e8227a7d8a483748d93be

Signed-off-by: Tetsuo Handa 
Cc: Dmitry Vyukov 
---
 kernel/hung_task.c   | 20 
 kernel/locking/lockdep.c |  9 +
 2 files changed, 29 insertions(+)

diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index b9132d1..1ac49a5 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -201,6 +201,26 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
 	if (hung_task_show_lock)
 		debug_show_all_locks();
 	if (hung_task_call_panic) {
+#ifdef CONFIG_DEBUG_AID_FOR_SYZBOT
+		/*
+		 * debug_show_all_locks() above forcibly dumped locks held by
+		 * running tasks with locks held. Now, let's dump backtrace of
+		 * running tasks as well, for NMI backtrace below tends to show
+		 * current thread (i.e. khungtaskd thread itself) and idle CPU
+		 * which are useless for debugging hung task problems.
+		 */
+		rcu_read_lock();
+		for_each_process_thread(g, t) {
+			if (t->state != TASK_RUNNING || t == current)
+				continue;
+			pr_err("INFO: task %s:%d was running.\n", t->comm,
+			       t->pid);
+			sched_show_task(t);
+			touch_nmi_watchdog();
+			touch_all_softlockup_watchdogs();
+		}
+		rcu_read_unlock();
+#endif
 		trigger_all_cpu_backtrace();
 		panic("hung_task: blocked tasks");
 	}
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index e406c5f..efeebf6 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -565,12 +565,21 @@ static void lockdep_print_held_locks(struct task_struct *p)
 	else
 		printk("%d lock%s held by %s/%d:\n", depth,
 		       depth > 1 ? "s" : "", p->comm, task_pid_nr(p));
+#ifndef CONFIG_DEBUG_AID_FOR_SYZBOT
 	/*
 	 * It's not reliable to print a task's held locks if it's not sleeping
 	 * and it's not the current task.
 	 */
 	if (p->state == TASK_RUNNING && p != current)
 		return;
+#else
+	/*
+	 * But showing locks and backtrace of running tasks seems to be helpful
+	 * for debugging hung task problems. Since syzbot will call panic()
+	 * shortly, risking problems caused by accessing stale information is
+	 * acceptable here.
+	 */
+#endif
 	for (i = 0; i < depth; i++) {
 		printk(" #%d: ", i);
 		print_lock(p->held_locks + i);
-- 
1.8.3.1



Re: [PATCH 0/8] CaitSith LSM module

2018-09-01 Thread Tetsuo Handa
On 2017/10/22 2:17, Casey Schaufler wrote:
>> As one year elapsed since I proposed CaitSith for upstream, I'd like to
>> hear the status again. I looked at
>> http://schd.ws/hosted_files/lss2017/8b/201709-LinuxSecuritySummit-Stacking.pdf
>>  .
>> How is ETA for Security Module Stacking? Is it a half year or so?
> 
> Assuming that I can keep working on stacking at my current level,
> and that we can work out a couple issues with audit and networking
> there is a possibility we're looking at mid 2018 for stacking. The
> increased interest in security module namespaces for containers is
> helping make stacking seem important to more people.
> 
>> If it is likely take longer, should I resume proposing CaitSith for now
>> as one of "Minor modules" except security_module_enable() check added
>> until Security Module Stacking work completes? Or should I wait for
>> completion of stacking work? I want to know it, for recent proposals are
>> rather staying silent.
> 
> I wouldn't wait if it was my project, but I have been known
> to be more aggressive than is good for me from time to time.
> 

It seems that stacking needs some more time. A manual, with an updated patch
for the current source code, is available at https://caitsith.osdn.jp/index2.html .
John, if you can resume reviewing, I'd like to propose CaitSith as one of the
"Minor modules", except for the security_module_enable() check.

Regards.



Re: [PATCH 2/4] tty: Hold tty_ldisc_lock() during tty_reopen()

2018-08-31 Thread Tetsuo Handa
On 2018/08/31 15:51, Jiri Slaby wrote:
> On 08/29/2018, 05:19 PM, Tetsuo Handa wrote:
>> On 2018/08/29 11:23, Dmitry Safonov wrote:
>>> tty_ldisc_reinit() races with neither tty_ldisc_hangup()
>>> nor set_ldisc() nor tty_ldisc_release(), as they use the tty lock.
>>> But it races with anyone who expects the line discipline to be the same
>>> after holding the read semaphore in tty_ldisc_ref().
>>>
>>> We've seen the following crash on v4.9.108 stable:
>>>
>>> BUG: unable to handle kernel paging request at 2260
>>> IP: [..] n_tty_receive_buf_common+0x5f/0x86d
>>> Workqueue: events_unbound flush_to_ldisc
>>> Call Trace:
>>>  [..] n_tty_receive_buf2
>>>  [..] tty_ldisc_receive_buf
>>>  [..] flush_to_ldisc
>>>  [..] process_one_work
>>>  [..] worker_thread
>>>  [..] kthread
>>>  [..] ret_from_fork
>>>
>>> I think, tty_ldisc_reinit() should be called with ldisc_sem hold for
>>> writing, which will protect any reader against line discipline changes.
>>>
>>> Note: I failed to reproduce the described crash, so obviously can't
>>> guarantee that this is the place where line discipline was switched.
>>
>> This will be same with a report at
>> https://syzkaller.appspot.com/bug?id=f08670354701fa64cc0dd3c0128a491bdb16adcc
>>  .
>>
>> syzbot is now testing a patch from Jiri Slaby.
> 
> Yes, my patch passed, so could you add:
> Reported-by: syzbot+3aa9784721dfb90e9...@syzkaller.appspotmail.com
> 
> (not adding tested-by as this particular patch was not tested, but
> should work the same way.)
> 
> thanks,
> 

Tested with all 4 patches applied using syzbot-provided reproducer and
my simplified reproducer. No crashes and no lockdep warnings.
Also, noisy messages like

  pts pts4033: tty_release: tty->count(10529) != (#fd's(7) + #kopen's(0))

are gone. Very nice. Thank you.


Re: [PATCH 2/4] tty: Hold tty_ldisc_lock() during tty_reopen()

2018-08-29 Thread Tetsuo Handa
On 2018/08/29 11:23, Dmitry Safonov wrote:
> tty_ldisc_reinit() races with neither tty_ldisc_hangup()
> nor set_ldisc() nor tty_ldisc_release(), as they use the tty lock.
> But it races with anyone who expects the line discipline to be the same
> after holding the read semaphore in tty_ldisc_ref().
> 
> We've seen the following crash on v4.9.108 stable:
> 
> BUG: unable to handle kernel paging request at 2260
> IP: [..] n_tty_receive_buf_common+0x5f/0x86d
> Workqueue: events_unbound flush_to_ldisc
> Call Trace:
>  [..] n_tty_receive_buf2
>  [..] tty_ldisc_receive_buf
>  [..] flush_to_ldisc
>  [..] process_one_work
>  [..] worker_thread
>  [..] kthread
>  [..] ret_from_fork
> 
> I think, tty_ldisc_reinit() should be called with ldisc_sem hold for
> writing, which will protect any reader against line discipline changes.
> 
> Note: I failed to reproduce the described crash, so obviously can't
> guarantee that this is the place where line discipline was switched.

This will be same with a report at
https://syzkaller.appspot.com/bug?id=f08670354701fa64cc0dd3c0128a491bdb16adcc .

syzbot is now testing a patch from Jiri Slaby.


Re: [PATCH v2] n_tty: Protect tty->disc_data using refcount.

2018-08-29 Thread Tetsuo Handa
On 2018/08/29 22:53, Jiri Slaby wrote:
> On 07/24/2018, 05:22 PM, Tetsuo Handa wrote:
>> From 118c64e86641a97d44dec39e313a95b12d9bc3b2 Mon Sep 17 00:00:00 2001
>> From: Tetsuo Handa 
>> Date: Wed, 25 Jul 2018 00:15:18 +0900
>> Subject: [PATCH v2] n_tty: Protect tty->disc_data using refcount.
>>
>> syzbot is reporting NULL pointer dereference at n_tty_set_termios() [1].
>> This is because ioctl(TIOCVHANGUP) versus ioctl(TCSETS) can race.
>>
>> Since we don't want to introduce new locking dependency, this patch
>> converts "struct n_tty_data *ldata = tty->disc_data;" in individual
>> function into a function argument which follows "struct tty *", and
>> holds tty->disc_data at each "struct tty_ldisc_ops" hook using refcount
>> in order to ensure that memory which contains "struct n_tty_data" will
>> not be released while processing individual function.
> 
> This does not look correct and is way too complicated. ioctls should not
> be called while changing/killing/hanging/whatever a ldisc. But there is
> one missing lock in tty_reopen.
> 
> So does the attached patch help instead?
> 
> thanks,
> 

That patch seems to help avoid the crashes. (You can use the #syz test: command.)
But I think you need to check the tty_ldisc_lock() return value...
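
What I mean is roughly the following in tty_reopen() (a sketch only, not Jiri's actual patch; it assumes the existing tty_ldisc_lock()/tty_ldisc_unlock() helpers from drivers/tty/tty_ldisc.c, and the 5 second timeout is picked arbitrarily):

static int tty_reopen(struct tty_struct *tty)
{
	struct tty_driver *driver = tty->driver;
	int retval;

	/* ... the existing PTY-master / count / exclusive checks stay here ... */

	/*
	 * Take the ldisc write lock with a timeout and propagate a failure
	 * instead of ignoring it, so tty_ldisc_reinit() cannot run while
	 * another thread holds ldisc_sem for reading.
	 */
	retval = tty_ldisc_lock(tty, 5 * HZ);
	if (retval)
		return retval;

	tty->count++;

	retval = 0;
	if (!tty->ldisc)
		retval = tty_ldisc_reinit(tty, tty->termios.c_line);
	tty_ldisc_unlock(tty);

	return retval;
}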


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-08-23 Thread Tetsuo Handa
David Rientjes wrote:
> On Fri, 24 Aug 2018, Tetsuo Handa wrote:
> 
> > > For those of us who are tracking CVE-2016-10723 which has persistently 
> > > been 
> > > labeled as "disputed" and with no clear indication of what patches 
> > > address 
> > > it, I am assuming that commit 9bfe5ded054b ("mm, oom: remove sleep from 
> > > under oom_lock") and this patch are the intended mitigations?
> > > 
> > > A list of SHA1s for merged fixes and links to proposed patches to address 
> > > this issue would be appreciated.
> > > 
> > 
> > Commit 9bfe5ded054b ("mm, oom: remove sleep from under oom_lock") is a
> > mitigation for CVE-2016-10723.
> > 
> > "[PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at
> > should_reclaim_retry()." is independent from CVE-2016-10723.
> > 
> 
> Thanks, Tetsuo.  Should commit af5679fbc669 ("mm, oom: remove oom_lock 
> from oom_reaper") also be added to the list for CVE-2016-10723?
> 

Commit af5679fbc669f31f ("mm, oom: remove oom_lock from oom_reaper")
negated one of the two rationales for commit 9bfe5ded054b8e28 ("mm, oom:
remove sleep from under oom_lock"). If we didn't apply af5679fbc669f31f,
we could make sure that CPU resource is given to the owner of oom_lock
by replacing mutex_trylock() in __alloc_pages_may_oom() with mutex_lock().
But now that af5679fbc669f31f has already been applied, we don't know how to
give CPU resources to the OOM reaper / exit_mmap(). We might eventually arrive
at direct OOM reaping, but we haven't got there yet...

For now, I don't think we need to add af5679fbc669f31f to the list for
CVE-2016-10723, for af5679fbc669f31f might cause premature next OOM victim
selection (especially with CONFIG_PREEMPT=y kernels) due to

   __alloc_pages_may_oom():   oom_reap_task():

 mutex_trylock(&oom_lock) succeeds.
 get_page_from_freelist() fails.
 Preempted to other process.
oom_reap_task_mm() succeeds.
Sets MMF_OOM_SKIP.
 Returned from preemption.
 Finds that MMF_OOM_SKIP was already set.
 Selects next OOM victim and kills it.
 mutex_unlock(&oom_lock) is called.

a race window like the one described as

Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock
to prevent from races when the page allocator didn't manage to get the
freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later
on and move on to another victim.  Although this is possible in principle
let's wait for it to actually happen in real life before we make the
locking more complex again.

in that commit.
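
For completeness, the mutex_trylock() -> mutex_lock() alternative mentioned above would have looked roughly like this in __alloc_pages_may_oom() (a sketch, not a tested patch):

	*did_some_progress = 0;

	/*
	 * Sketch: instead of
	 *
	 *	if (!mutex_trylock(&oom_lock)) {
	 *		*did_some_progress = 1;
	 *		schedule_timeout_uninterruptible(1);
	 *		return NULL;
	 *	}
	 *
	 * sleep on the lock, so that CPU time is handed to the current
	 * oom_lock owner (the thread doing the OOM kill or waiting for the
	 * OOM reaper) rather than burned in the allocator retry loop.
	 */
	mutex_lock(&oom_lock);

	/* ... the rest of __alloc_pages_may_oom() is unchanged ... */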


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-08-23 Thread Tetsuo Handa
On 2018/08/24 5:06, David Rientjes wrote:
> For those of us who are tracking CVE-2016-10723 which has persistently been 
> labeled as "disputed" and with no clear indication of what patches address 
> it, I am assuming that commit 9bfe5ded054b ("mm, oom: remove sleep from 
> under oom_lock") and this patch are the intended mitigations?
> 
> A list of SHA1s for merged fixes and links to proposed patches to address 
> this issue would be appreciated.
> 

Commit 9bfe5ded054b ("mm, oom: remove sleep from under oom_lock") is a
mitigation for CVE-2016-10723.

"[PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at
should_reclaim_retry()." is independent from CVE-2016-10723.

We haven't made sure that the OOM reaper / exit_mmap() will get enough CPU
resources. For example, under a cluster of concurrently allocating realtime
scheduling priority threads, the OOM reaper takes about 1800 milliseconds
whereas direct OOM reaping takes only a few milliseconds.

Regards.


Re: [PATCH] apparmor: remove unused label

2018-08-23 Thread Tetsuo Handa
On 2018/08/23 23:21, Kees Cook wrote:
> On Thu, Aug 23, 2018 at 7:09 AM, Arnd Bergmann  wrote:
>> After the corresponding 'goto' was removed, we get a warning
>> for the 'fail' label:
>>
>> security/apparmor/policy_unpack.c: In function 'unpack_dfa':
>> security/apparmor/policy_unpack.c:426:1: error: label 'fail' defined but not 
>> used [-Werror=unused-label]
>>
>> Fixes: fb5841091f28 ("apparmor: remove no-op permission check in 
>> policy_unpack")
>> Signed-off-by: Arnd Bergmann 
> 
> Reviewed-by: Kees Cook 
> 
> -Kees
> 
>> ---
>>  security/apparmor/policy_unpack.c | 1 -
>>  1 file changed, 1 deletion(-)
>>
>> diff --git a/security/apparmor/policy_unpack.c 
>> b/security/apparmor/policy_unpack.c
>> index 3647b5834ace..96d8cf68ce65 100644
>> --- a/security/apparmor/policy_unpack.c
>> +++ b/security/apparmor/policy_unpack.c
>> @@ -423,7 +423,6 @@ static struct aa_dfa *unpack_dfa(struct aa_ext *e)
>>
>> return dfa;
>>
>> -fail:
>> aa_put_dfa(dfa);
>> return ERR_PTR(-EPROTO);

If these lines are unreachable, please remove them together...
And is that what Gustavo A. R. Silva reported before this patch?

>>  }
>> --
>> 2.18.0
>>
> 
> 
> 


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-08-21 Thread Tetsuo Handa
On 2018/08/03 15:16, Michal Hocko wrote:
> On Fri 03-08-18 07:05:54, Tetsuo Handa wrote:
>> On 2018/07/31 14:09, Michal Hocko wrote:
>>> On Tue 31-07-18 06:01:48, Tetsuo Handa wrote:
>>>> On 2018/07/31 4:10, Michal Hocko wrote:
>>>>> Since should_reclaim_retry() should be a natural reschedule point,
>>>>> let's do the short sleep for PF_WQ_WORKER threads unconditionally in
>>>>> order to guarantee that other pending work items are started. This will
>>>>> workaround this problem and it is less fragile than hunting down when
>>>>> the sleep is missed. E.g. we used to have a sleeping point in the oom
>>>>> path but this has been removed recently because it caused other issues.
>>>>> Having a single sleeping point is more robust.
>>>>
>>>> linux.git has not removed the sleeping point in the OOM path yet. Since 
>>>> removing the
>>>> sleeping point in the OOM path can mitigate CVE-2016-10723, please do so 
>>>> immediately.
>>>
>>> is this an {Acked,Reviewed,Tested}-by?
>>>
>>> I will send the patch to Andrew if the patch is ok. 
>>>
>>>> (And that change will conflict with Roman's cgroup aware OOM killer 
>>>> patchset. But it
>>>> should be easy to rebase.)
>>>
>>> That is still a WIP so I would lose sleep over it.
>>>
>>
>> Now that Roman's cgroup aware OOM killer patchset will be dropped from 
>> linux-next.git ,
>> linux-next.git will get the sleeping point removed. Please send this patch 
>> to linux-next.git .
> 
> I still haven't heard any explicit confirmation that the patch works for
> your workload. Should I beg for it? Or you simply do not want to have
> your stamp on the patch? If yes, I can live with that but this playing
> hide and catch is not really a lot of fun.
> 

I noticed that the patch has not been sent to linux-next.git yet.
Please send it to linux-next.git without my stamp on the patch.

[   44.863590] Out of memory: Kill process 1071 (a.out) score 865 or sacrifice 
child
[   44.867666] Killed process 1817 (a.out) total-vm:5244kB, anon-rss:1040kB, 
file-rss:0kB, shmem-rss:0kB
[   44.872176] oom_reaper: reaped process 1817 (a.out), now anon-rss:0kB, 
file-rss:0kB, shmem-rss:0kB
[   91.698761] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 
stuck for 48s!
[   91.702313] Showing busy workqueues and worker pools:
[   91.705011] workqueue events: flags=0x0
[   91.707482]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=8/256
[   91.710524] pending: vmpressure_work_fn, vmw_fb_dirty_flush [vmwgfx], 
e1000_watchdog [e1000], vmstat_shepherd, free_work, mmdrop_async_fn, 
mmdrop_async_fn, check_corruption
[   91.717439] workqueue events_freezable: flags=0x4
[   91.720161]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[   91.723304] pending: vmballoon_work [vmw_balloon]
[   91.726167] workqueue events_power_efficient: flags=0x80
[   91.729139]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256
[   91.732253] pending: fb_flashcursor, gc_worker [nf_conntrack], 
neigh_periodic_work, neigh_periodic_work
[   91.736471] workqueue events_freezable_power_: flags=0x84
[   91.739546]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[   91.742696] in-flight: 2097:disk_events_workfn
[   91.745517] workqueue mm_percpu_wq: flags=0x8
[   91.748069]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[   91.751182] pending: drain_local_pages_wq BAR(1830), vmstat_update
[   91.754661] workqueue mpt_poll_0: flags=0x8
[   91.757161]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[   91.759958] pending: mpt_fault_reset_work [mptbase]
[   91.762696] workqueue xfs-data/sda1: flags=0xc
[   91.765353]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
[   91.768248] pending: xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs]
[   91.771589] workqueue xfs-cil/sda1: flags=0xc
[   91.774009]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[   91.776800] pending: xlog_cil_push_work [xfs] BAR(703)
[   91.779464] workqueue xfs-reclaim/sda1: flags=0xc
[   91.782017]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[   91.784599] pending: xfs_reclaim_worker [xfs]
[   91.786930] workqueue xfs-sync/sda1: flags=0x4
[   91.789289]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[   91.792075] pending: xfs_log_worker [xfs]
[   91.794213] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=48s workers=4 idle: 
52 13 5
[  121.906640] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 
stuck for 78s!
[  121.909572] Showing busy workqueues and worker pools:
[  121.911703] workqueue events: flags=0x0
[  121.913531]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=8/256
[  121.915873] pending: vmpre

Re: INFO: task hung in generic_file_write_iter

2018-08-20 Thread Tetsuo Handa
On 2018/08/06 20:56, Tetsuo Handa wrote:
> On 2018/08/06 19:09, Jan Kara wrote:
>> Looks like some kind of a race where device block size gets changed while
>> getblk() runs (and creates buffers for underlying page). I don't have time
>> to nail it down at this moment can have a look into it later unless someone
>> beats me to it.
>>
> 
> It seems that loop device is relevant to this problem.

Speaking of the loop device, I have been waiting for Jens for three months
( http://lkml.kernel.org/r/1527297408-4428-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp )
but Jens is too busy to come up with an alternative. Since that is a big patch,
I wish we could start testing it before Jan starts writing a patch for fixing
the getblk() race problem.



Re: WARNING in try_charge

2018-08-09 Thread Tetsuo Handa
On 2018/08/10 0:07, Michal Hocko wrote:
> On Thu 09-08-18 22:57:43, Tetsuo Handa wrote:
>> >From b1f38168f14397c7af9c122cd8207663d96e02ec Mon Sep 17 00:00:00 2001
>> From: Tetsuo Handa 
>> Date: Thu, 9 Aug 2018 22:49:40 +0900
>> Subject: [PATCH] mm, oom: task_will_free_mem(current) should retry until
>>  memory reserve fails
>>
>> Commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
>> oom_reaped tasks") changed to select next OOM victim as soon as
>> MMF_OOM_SKIP is set. But we don't need to select next OOM victim as
>> long as ALLOC_OOM allocation can succeed. And syzbot is hitting WARN(1)
>> caused by this race window [1].
> 
> It is not because the syzbot was exercising a completely different code
> path (memcg charge rather than the page allocator).

I know syzbot is hitting memcg charge path.

> 
>> Since memcg OOM case uses forced charge if current thread is killed,
>> out_of_memory() can return true without selecting next OOM victim.
>> Therefore, this patch changes task_will_free_mem(current) to ignore
>> MMF_OOM_SKIP unless ALLOC_OOM allocation failed.
> 
> And the patch is simply wrong for memcg.
> 

Why? I think I should have done

-+  page = __alloc_pages_may_oom(gfp_mask, order, alloc_flags == ALLOC_OOM
-+   || (gfp_mask & __GFP_NOMEMALLOC), ac,
-+   &did_some_progress);
++  page = __alloc_pages_may_oom(gfp_mask, order, alloc_flags == ALLOC_OOM,
++   ac, &did_some_progress);

because nobody will use __GFP_NOMEMALLOC | __GFP_NOFAIL. But for memcg charge
path, task_will_free_mem(current, false) == true and out_of_memory() will return
true, which avoids unnecessary OOM killing.
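
For context, the forced charge path I am referring to looks roughly like this in try_charge() of mm/memcontrol.c (paraphrased from the current code, not quoted verbatim):

	/* After reclaim failed and the memcg OOM killer has been invoked: */
	if (unlikely(tsk_is_oom_victim(current) ||
		     fatal_signal_pending(current) ||
		     current->flags & PF_EXITING))
		goto force;

	/* ... normal charge failure handling ... */

force:
	/*
	 * Allow usage to go over the limit temporarily so that a dying task
	 * is not blocked on memory it is about to release anyway.
	 */
	page_counter_charge(&memcg->memory, nr_pages);
	if (do_memsw_account())
		page_counter_charge(&memcg->memsw, nr_pages);
	css_get_many(&memcg->css, nr_pages);

	return 0;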

Of course, this patch cannot avoid unnecessary OOM killing if out_of_memory()
is called by a process that has not been killed yet. But to mitigate that, what
can we do other than defer setting MMF_OOM_SKIP using a timeout-based mechanism?
Making the OOM reaper unconditionally reclaim all memory is not a valid answer.
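
To make the last sentence concrete, a timeout-based deferral could look roughly like the sketch below (illustrative only; OOM_SKIP_GRACE is a made-up constant, the rest mirrors today's oom_reap_task()):

#define OOM_SKIP_GRACE	HZ	/* hypothetical grace period */

static void oom_reap_task(struct task_struct *tsk)
{
	int attempts = 0;
	struct mm_struct *mm = tsk->signal->oom_mm;

	/* Existing behaviour: retry the reap a few times. */
	while (attempts++ < MAX_OOM_REAP_RETRIES &&
	       !oom_reap_task_mm(tsk, mm))
		schedule_timeout_idle(HZ / 10);

	/*
	 * Sketch: rather than setting MMF_OOM_SKIP immediately, give
	 * exit_mmap() a grace period to release the remaining memory, so
	 * that a racing allocator does not select the next victim while
	 * the current one is still exiting.
	 */
	if (!test_bit(MMF_OOM_SKIP, &mm->flags))
		schedule_timeout_idle(OOM_SKIP_GRACE);

	set_bit(MMF_OOM_SKIP, &mm->flags);

	/* Drop a reference taken by wake_oom_reaper(). */
	put_task_struct(tsk);
}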



Re: WARNING in try_charge

2018-08-09 Thread Tetsuo Handa
>From b1f38168f14397c7af9c122cd8207663d96e02ec Mon Sep 17 00:00:00 2001
From: Tetsuo Handa 
Date: Thu, 9 Aug 2018 22:49:40 +0900
Subject: [PATCH] mm, oom: task_will_free_mem(current) should retry until
 memory reserve fails

Commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
oom_reaped tasks") changed to select next OOM victim as soon as
MMF_OOM_SKIP is set. But we don't need to select next OOM victim as
long as ALLOC_OOM allocation can succeed. And syzbot is hitting WARN(1)
caused by this race window [1].

Since memcg OOM case uses forced charge if current thread is killed,
out_of_memory() can return true without selecting next OOM victim.
Therefore, this patch changes task_will_free_mem(current) to ignore
MMF_OOM_SKIP unless ALLOC_OOM allocation failed.

[1] 
https://syzkaller.appspot.com/bug?id=ea8c7912757d253537375e981b61749b2da69258

Signed-off-by: Tetsuo Handa 
Reported-by: syzbot 
Cc: Michal Hocko 
Cc: Oleg Nesterov 
Cc: Vladimir Davydov 
Cc: David Rientjes 
---
 include/linux/oom.h | 3 +++
 mm/oom_kill.c   | 8 
 mm/page_alloc.c | 7 +--
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 69864a5..b5abacd 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -38,6 +38,9 @@ struct oom_control {
 */
const int order;
 
+   /* Did we already try ALLOC_OOM allocation? */
+   const bool reserve_tried;
+
/* Used by oom implementation, do not set */
unsigned long totalpages;
struct task_struct *chosen;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e10b86..95453e8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -782,7 +782,7 @@ static inline bool __task_will_free_mem(struct task_struct 
*task)
  * Caller has to make sure that task->mm is stable (hold task_lock or
  * it operates on the current).
  */
-static bool task_will_free_mem(struct task_struct *task)
+static bool task_will_free_mem(struct task_struct *task, bool select_new)
 {
struct mm_struct *mm = task->mm;
struct task_struct *p;
@@ -803,7 +803,7 @@ static bool task_will_free_mem(struct task_struct *task)
 * This task has already been drained by the oom reaper so there are
 * only small chances it will free some more
 */
-   if (test_bit(MMF_OOM_SKIP, &mm->flags))
+   if (test_bit(MMF_OOM_SKIP, &mm->flags) && select_new)
return false;
 
if (atomic_read(&mm->mm_users) <= 1)
@@ -939,7 +939,7 @@ static void oom_kill_process(struct oom_control *oc, const 
char *message)
 * so it can die quickly
 */
task_lock(p);
-   if (task_will_free_mem(p)) {
+   if (task_will_free_mem(p, true)) {
mark_oom_victim(p);
wake_oom_reaper(p);
task_unlock(p);
@@ -1069,7 +1069,7 @@ bool out_of_memory(struct oom_control *oc)
 * select it.  The goal is to allow it to allocate so that it may
 * quickly exit and free its memory.
 */
-   if (task_will_free_mem(current)) {
+   if (task_will_free_mem(current, oc->reserve_tried)) {
mark_oom_victim(current);
wake_oom_reaper(current);
return true;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 879b861..03ca29a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3455,7 +3455,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, 
const char *fmt, ...)
 }
 
 static inline struct page *
-__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, bool reserve_tried,
const struct alloc_context *ac, unsigned long *did_some_progress)
 {
struct oom_control oc = {
@@ -3464,6 +3464,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, 
const char *fmt, ...)
.memcg = NULL,
.gfp_mask = gfp_mask,
.order = order,
+   .reserve_tried = reserve_tried,
};
struct page *page;
 
@@ -4239,7 +4240,9 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
goto retry_cpuset;
 
/* Reclaim has failed us, start killing things */
-   page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
+   page = __alloc_pages_may_oom(gfp_mask, order, alloc_flags == ALLOC_OOM
+|| (gfp_mask & __GFP_NOMEMALLOC), ac,
+&did_some_progress);
if (page)
goto got_pg;
 
-- 
1.8.3.1




Re: [PATCH RFC v2 02/10] mm: Make shrink_slab() lockless

2018-08-09 Thread Tetsuo Handa
On 2018/08/09 18:21, Kirill Tkhai wrote:
> 2)SRCU. Pros are there are no the above problems; we will have completely 
> unlocked and
>   scalable shrink_slab(). We will also have a possibility to avoid 
> unregistering delays,
>   like I did for superblock shrinker. There will be full scalability.
>   Cons is enabling SRCU.
> 

How can unregistering delays be avoided? Since you traverse all shrinkers
using one shrinker_srcu, synchronize_srcu(&shrinker_srcu) will block
unregistering threads until the longest in-flight srcu_read_lock() user calls
srcu_read_unlock().

Unless you use a per-shrinker counter like the one below, I wonder whether
unregistering delays can be avoided...

  https://marc.info/?l=linux-mm&m=151060909613004
  https://marc.info/?l=linux-mm&m=151060909713005
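
For reference, the SRCU variant under discussion would look roughly like the sketch below (shrinker_srcu is an assumed global, memcg-aware details and list locking omitted); the synchronize_srcu() call is exactly the unregistering delay I am asking about:

DEFINE_SRCU(shrinker_srcu);	/* assumed global for this sketch */

static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
				 struct mem_cgroup *memcg, int priority)
{
	struct shrinker *shrinker;
	unsigned long freed = 0;
	int idx;

	idx = srcu_read_lock(&shrinker_srcu);	/* never blocks writers */
	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
		struct shrink_control sc = {
			.gfp_mask = gfp_mask,
			.nid = nid,
			.memcg = memcg,
		};

		freed += do_shrink_slab(&sc, shrinker, priority);
	}
	srcu_read_unlock(&shrinker_srcu, idx);

	return freed;
}

void unregister_shrinker(struct shrinker *shrinker)
{
	list_del_rcu(&shrinker->list);
	/* Blocks until the slowest in-flight reader section above ends. */
	synchronize_srcu(&shrinker_srcu);
	kfree(shrinker->nr_deferred);
}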


Re: [PATCH] memcg, oom: be careful about races when warning about no reclaimable task

2018-08-08 Thread Tetsuo Handa
On 2018/08/08 5:38, Tetsuo Handa wrote:
> On 2018/08/08 5:19, Johannes Weiner wrote:
>> On Tue, Aug 07, 2018 at 07:15:11PM +0900, Tetsuo Handa wrote:
>>> On 2018/08/07 16:25, Michal Hocko wrote:
>>>> @@ -1703,7 +1703,8 @@ static enum oom_status mem_cgroup_oom(struct 
>>>> mem_cgroup *memcg, gfp_t mask, int
>>>>return OOM_ASYNC;
>>>>}
>>>>  
>>>> -  if (mem_cgroup_out_of_memory(memcg, mask, order))
>>>> +  if (mem_cgroup_out_of_memory(memcg, mask, order) ||
>>>> +  tsk_is_oom_victim(current))
>>>>return OOM_SUCCESS;
>>>>  
>>>>WARN(1,"Memory cgroup charge failed because of no reclaimable memory! "
>>>>
>>>
>>> I don't think this patch is appropriate. This patch only avoids hitting 
>>> WARN(1).
>>> This patch does not address the root cause:
>>>
>>> The task_will_free_mem(current) test in out_of_memory() is returning false
>>> because test_bit(MMF_OOM_SKIP, &mm->flags) test in task_will_free_mem() is
>>> returning false because MMF_OOM_SKIP was already set by the OOM reaper. The 
>>> OOM
>>> killer does not need to start selecting next OOM victim until "current 
>>> thread
>>> completes __mmput()" or "it fails to complete __mmput() within reasonable
>>> period".
>>
>> I don't see why it matters whether the OOM victim exits or not, unless
>> you count the memory consumed by struct task_struct.
> 
> We are not counting memory consumed by struct task_struct. But David is
> counting memory released between set_bit(MMF_OOM_SKIP, &mm->flags) and
> completion of exit_mmap().

Also, before the OOM reaper was introduced, we waited until TIF_MEMDIE was
cleared from the OOM victim thread. Compared to the pre-OOM-reaper era, giving
up so early is certainly a regression.


Re: [PATCH RFC 02/10] mm: Make shrink_slab() lockless

2018-08-08 Thread Tetsuo Handa
On 2018/08/08 20:51, Kirill Tkhai wrote:
> @@ -192,7 +193,6 @@ static int prealloc_memcg_shrinker(struct shrinker 
> *shrinker)
>   int id, ret = -ENOMEM;
>  
>   down_write(&shrinker_rwsem);
> - /* This may call shrinker, so it must use down_read_trylock() */
>   id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
>   if (id < 0)
>   goto unlock;

I don't know why perf reports down_read_trylock(&shrinker_rwsem), but the
above code is already bad. A GFP_KERNEL allocation involves shrinkers, and
the OOM killer would be invoked because the shrinkers are non-functional due to
this down_write(&shrinker_rwsem). Please avoid blocking memory allocation
with shrinker_rwsem held.
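
To spell out why the shrinkers become non-functional in that window: the reader side uses a trylock, so while the writer above sleeps inside direct reclaim every shrink_slab() call simply bails out (fragment paraphrased from mm/vmscan.c):

	unsigned long freed = 0;

	if (!down_read_trylock(&shrinker_rwsem))
		goto out;	/* writer holds it -> no slab is reclaimed at all */

	/* ... walk shrinker_list and call do_shrink_slab() for each entry ... */

	up_read(&shrinker_rwsem);
out:
	return freed;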


Re: [PATCH] memcg, oom: be careful about races when warning about no reclaimable task

2018-08-07 Thread Tetsuo Handa
On 2018/08/08 5:19, Johannes Weiner wrote:
> On Tue, Aug 07, 2018 at 07:15:11PM +0900, Tetsuo Handa wrote:
>> On 2018/08/07 16:25, Michal Hocko wrote:
>>> @@ -1703,7 +1703,8 @@ static enum oom_status mem_cgroup_oom(struct 
>>> mem_cgroup *memcg, gfp_t mask, int
>>> return OOM_ASYNC;
>>> }
>>>  
>>> -   if (mem_cgroup_out_of_memory(memcg, mask, order))
>>> +   if (mem_cgroup_out_of_memory(memcg, mask, order) ||
>>> +   tsk_is_oom_victim(current))
>>> return OOM_SUCCESS;
>>>  
>>> WARN(1,"Memory cgroup charge failed because of no reclaimable memory! "
>>>
>>
>> I don't think this patch is appropriate. This patch only avoids hitting 
>> WARN(1).
>> This patch does not address the root cause:
>>
>> The task_will_free_mem(current) test in out_of_memory() is returning false
>> because test_bit(MMF_OOM_SKIP, &mm->flags) test in task_will_free_mem() is
>> returning false because MMF_OOM_SKIP was already set by the OOM reaper. The 
>> OOM
>> killer does not need to start selecting next OOM victim until "current thread
>> completes __mmput()" or "it fails to complete __mmput() within reasonable
>> period".
> 
> I don't see why it matters whether the OOM victim exits or not, unless
> you count the memory consumed by struct task_struct.

We are not counting memory consumed by struct task_struct. But David is
counting memory released between set_bit(MMF_OOM_SKIP, &mm->flags) and
completion of exit_mmap().

> 
>> According to 
>> https://syzkaller.appspot.com/text?tag=CrashLog=15a1c77040 ,
>> PID=23767 selected PID=23766 as an OOM victim and the OOM reaper set 
>> MMF_OOM_SKIP
>> before PID=23766 unnecessarily selects PID=23767 as next OOM victim.
>> At uptime = 366.550949, out_of_memory() should have returned true without 
>> selecting
>> next OOM victim because tsk_is_oom_victim(current) == true.
> 
> The code works just fine. We have to kill tasks until we a) free
> enough memory or b) run out of tasks or c) kill current. When one of
> these outcomes is reached, we allow the charge and return.
> 
> The only problem here is a warning in the wrong place.
> 

If the forced charge path contained a bug, removing this WARN(1) would deprive
users of the chance to know that something is going wrong.


Re: WARNING in try_charge

2018-08-07 Thread Tetsuo Handa
On 2018/08/07 6:50, Tetsuo Handa wrote:
>   list_for_each_entry_safe(p, tmp, &oom_victim_list, oom_victim_list) {
>   struct mm_struct *mm = p->signal->oom_mm;
>  
>   if (oom_unkillable_task(p, oc->memcg, oc->nodemask))
>   continue;
>   ret = true;
> + /*
> +  * Since memcg OOM allows forced charge, we can safely wait
> +  * until __mmput() completes.
> +  */
> + if (is_memcg_oom(oc))
> + return true;

Oops. If this OOM victim was blocked on some lock which the current thread is
holding, waiting forever unconditionally is not safe.

>  #ifdef CONFIG_MMU
>   /*
>* Since the OOM reaper exists, we can safely wait until
>* MMF_OOM_SKIP is set.
>*/
>   if (!test_bit(MMF_OOM_SKIP, &mm->flags)) {
>   if (!oom_reap_target) {
>   get_task_struct(p);
>   oom_reap_target = p;
>   trace_wake_reaper(p->pid);
>   wake_up(&oom_reaper_wait);
>   }
>  #endif
>   continue;
>   }
>  #endif
>   /* We can wait as long as OOM score is decreasing over time. */
>   if (!victim_mm_stalling(p, mm))
>   continue;
>   gaveup = true;
>   list_del(&p->oom_victim_list);
>   /* Drop a reference taken by mark_oom_victim(). */
>   put_task_struct(p);
>   }
>   if (gaveup)
>   debug_show_all_locks();
>  
>   return ret;
>  }
> 
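
Regarding the "waiting forever unconditionally is not safe" note above, a
minimal correction sketch (reusing is_memcg_oom() and victim_mm_stalling()
from the quoted patch, and replacing the unconditional "return true"; a
sketch only, not a tested change) would be:

                /*
                 * Sketch: even for memcg OOM, do not wait unconditionally for
                 * __mmput(); the victim might be blocked on a lock which the
                 * current thread is holding.  Keep waiting only while the
                 * victim's mm is still shrinking, using the same
                 * victim_mm_stalling() feedback check as the code below.
                 */
                if (is_memcg_oom(oc) && !victim_mm_stalling(p, mm))
                        continue;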


Re: [PATCH] memcg, oom: be careful about races when warning about no reclaimable task

2018-08-07 Thread Tetsuo Handa
On 2018/08/07 16:25, Michal Hocko wrote:
> @@ -1703,7 +1703,8 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup 
> *memcg, gfp_t mask, int
>   return OOM_ASYNC;
>   }
>  
> - if (mem_cgroup_out_of_memory(memcg, mask, order))
> + if (mem_cgroup_out_of_memory(memcg, mask, order) ||
> + tsk_is_oom_victim(current))
>   return OOM_SUCCESS;
>  
>   WARN(1,"Memory cgroup charge failed because of no reclaimable memory! "
> 

I don't think this patch is appropriate. This patch only avoids hitting WARN(1).
This patch does not address the root cause:

The task_will_free_mem(current) test in out_of_memory() is returning false
because test_bit(MMF_OOM_SKIP, &mm->flags) test in task_will_free_mem() is
returning false because MMF_OOM_SKIP was already set by the OOM reaper. The OOM
killer does not need to start selecting next OOM victim until "current thread
completes __mmput()" or "it fails to complete __mmput() within reasonable
period".

According to https://syzkaller.appspot.com/text?tag=CrashLog=15a1c77040 ,
PID=23767 selected PID=23766 as an OOM victim and the OOM reaper set 
MMF_OOM_SKIP
before PID=23766 unnecessarily selects PID=23767 as next OOM victim.
At uptime = 366.550949, out_of_memory() should have returned true without 
selecting
next OOM victim because tsk_is_oom_victim(current) == true.

[  365.869417] syz-executor2 invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), 
order=0, oom_score_adj=0
[  365.878899] CPU: 0 PID: 23767 Comm: syz-executor2 Not tainted 
4.18.0-rc6-next-20180725+ #18
(...snipped...)
[  366.487490] Tasks state (memory values in pages):
[  366.492349] [  pid  ]   uid  tgid total_vm  rss pgtables_bytes swapents 
oom_score_adj name
[  366.501237] [  23766] 0 2376617620 8221   1269760
 0 syz-executor3
[  366.510367] [  23767] 0 2376717618 8218   1269760
 0 syz-executor2
[  366.519409] Memory cgroup out of memory: Kill process 23766 (syz-executor3) 
score 8252000 or sacrifice child
[  366.529422] Killed process 23766 (syz-executor3) total-vm:70480kB, 
anon-rss:116kB, file-rss:32768kB, shmem-rss:0kB
[  366.540456] oom_reaper: reaped process 23766 (syz-executor3), now 
anon-rss:0kB, file-rss:32000kB, shmem-rss:0kB
[  366.550949] syz-executor3 invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), 
order=0, oom_score_adj=0
[  366.560374] CPU: 1 PID: 23766 Comm: syz-executor3 Not tainted 
4.18.0-rc6-next-20180725+ #18
(...snipped...)
[  367.138136] Tasks state (memory values in pages):
[  367.142986] [  pid  ]   uid  tgid total_vm  rss pgtables_bytes swapents 
oom_score_adj name
[  367.151889] [  23766] 0 2376617620 8002   1269760
 0 syz-executor3
[  367.160946] [  23767] 0 2376717618 8218   1269760
 0 syz-executor2
[  367.169994] Memory cgroup out of memory: Kill process 23767 (syz-executor2) 
score 8249000 or sacrifice child
[  367.180119] Killed process 23767 (syz-executor2) total-vm:70472kB, 
anon-rss:104kB, file-rss:32768kB, shmem-rss:0kB
[  367.192101] oom_reaper: reaped process 23767 (syz-executor2), now 
anon-rss:0kB, file-rss:32000kB, shmem-rss:0kB
[  367.202986] [ cut here ]
[  367.207845] Memory cgroup charge failed because of no reclaimable memory! 
This looks like a misconfiguration or a kernel bug.
[  367.207965] WARNING: CPU: 1 PID: 23767 at mm/memcontrol.c:1710 
try_charge+0x734/0x1680
[  367.227540] Kernel panic - not syncing: panic_on_warn set ...

Of course, if the hard limit is 0, all processes will be killed after all. But
Michal is ignoring the fact that if the hard limit were not 0, there would be a
chance of saving the next process from being needlessly killed if we waited
until "mm of PID=23766 completed __mmput()" or "mm of PID=23766 failed to
complete __mmput() within a reasonable period".

We can make efforts not to return false at

/*
 * This task has already been drained by the oom reaper so there are
 * only small chances it will free some more
 */
if (test_bit(MMF_OOM_SKIP, &mm->flags))
return false;

(I admit that ignoring MMF_OOM_SKIP for once might not be sufficient for memcg
case), and we can use feedback based backoff like
"[PATCH 4/4] mm, oom: Fix unnecessary killing of additional processes." *UNTIL*
we come to the point where the OOM reaper can always reclaim all memory.
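
As a purely hypothetical sketch of what "ignoring MMF_OOM_SKIP for once" could
look like inside task_will_free_mem() (the signal->oom_skip_tolerated field
below is made up for this illustration and is not part of any posted patch):

        if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
                /*
                 * Hypothetical: let the exiting OOM victim pass this test one
                 * extra time so that out_of_memory() does not immediately
                 * select the next victim.
                 */
                if (task != current || task->signal->oom_skip_tolerated)
                        return false;
                task->signal->oom_skip_tolerated = true;
        }
        return true;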



Re: WARNING in try_charge

2018-08-06 Thread Tetsuo Handa
On 2018/08/07 5:55, Michal Hocko wrote:
> On Tue 07-08-18 05:46:04, Tetsuo Handa wrote:
>> On 2018/08/07 5:34, Michal Hocko wrote:
>>> On Tue 07-08-18 05:26:23, Tetsuo Handa wrote:
>>>> On 2018/08/07 2:56, Michal Hocko wrote:
>>>>> So the oom victim indeed passed the above force path after the oom
>>>>> invocation. But later on hit the page fault path and that behaved
>>>>> differently and for some reason the force path hasn't triggered. I am
>>>>> wondering how could we hit the page fault path in the first place. The
>>>>> task is already killed! So what the hell is going on here.
>>>>>
>>>>> I must be missing something obvious here.
>>>>>
>>>> YOU ARE OBVIOUSLY MISSING MY MAIL!
>>>>
>>>> I already said this is "mm, oom: task_will_free_mem(current) should ignore 
>>>> MMF_OOM_SKIP for once."
>>>> problem which you are refusing at 
>>>> https://www.spinics.net/lists/linux-mm/msg133774.html .
>>>> And you again ignored my mail. Very sad...
>>>
>>> Your suggestion simply didn't make much sense. There is nothing like
>>> first check is different from the rest.
>>>
>>
>> I don't think your patch is appropriate. It avoids hitting WARN(1) but does 
>> not avoid
>> unnecessary killing of OOM victims.
>>
>> If you look at 
>> https://syzkaller.appspot.com/text?tag=CrashLog=15a1c77040 , you will
>> notice that both 23766 and 23767 are killed due to 
>> task_will_free_mem(current) == false.
>> This is "unnecessary killing of additional processes".
> 
> Have you noticed the mere detail that the memcg has to kill any task
> attempting the charge because the hard limit is 0? There is simply no
> other way around. You cannot charge. There is no unnecessary killing.
> Full stop. We do allow temporary breach of the hard limit just to let
> the task die and uncharge on the way out.
> 

It is a bug that select_bad_process() is called just because
task_will_free_mem() returns false for the current thread, which has already
been killed but has not yet completed __mmput(). I'm saying that the OOM killer
should not give up as soon as MMF_OOM_SKIP is set.

 static bool oom_has_pending_victims(struct oom_control *oc)
 {
struct task_struct *p, *tmp;
bool ret = false;
bool gaveup = false;
 
if (is_sysrq_oom(oc))
return false;
/*
 * Wait for pending victims until __mmput() completes or stalled
 * too long.
 */
list_for_each_entry_safe(p, tmp, &oom_victim_list, oom_victim_list) {
struct mm_struct *mm = p->signal->oom_mm;
 
if (oom_unkillable_task(p, oc->memcg, oc->nodemask))
continue;
ret = true;
+   /*
+* Since memcg OOM allows forced charge, we can safely wait
+* until __mmput() completes.
+*/
+   if (is_memcg_oom(oc))
+   return true;
 #ifdef CONFIG_MMU
/*
 * Since the OOM reaper exists, we can safely wait until
 * MMF_OOM_SKIP is set.
 */
if (!test_bit(MMF_OOM_SKIP, &mm->flags)) {
if (!oom_reap_target) {
get_task_struct(p);
oom_reap_target = p;
trace_wake_reaper(p->pid);
wake_up(&oom_reaper_wait);
}
 #endif
continue;
}
 #endif
/* We can wait as long as OOM score is decreasing over time. */
if (!victim_mm_stalling(p, mm))
continue;
gaveup = true;
list_del(&p->oom_victim_list);
/* Drop a reference taken by mark_oom_victim(). */
put_task_struct(p);
}
if (gaveup)
debug_show_all_locks();
 
return ret;
 }
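
victim_mm_stalling() is not shown in this excerpt; a hypothetical sketch of
the "OOM score is decreasing over time" feedback it stands for (the
signal->prev_oom_score field is made up for this illustration) could look
like:

static bool victim_mm_stalling(struct task_struct *p, struct mm_struct *mm)
{
        unsigned long score = get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS);

        /* Still shrinking: remember the new value and keep waiting. */
        if (score < p->signal->prev_oom_score) {
                p->signal->prev_oom_score = score;
                return false;
        }
        /* No progress since the last check: give up waiting for this victim. */
        return true;
}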



Re: WARNING in try_charge

2018-08-06 Thread Tetsuo Handa
On 2018/08/07 5:34, Michal Hocko wrote:
> On Tue 07-08-18 05:26:23, Tetsuo Handa wrote:
>> On 2018/08/07 2:56, Michal Hocko wrote:
>>> So the oom victim indeed passed the above force path after the oom
>>> invocation. But later on hit the page fault path and that behaved
>>> differently and for some reason the force path hasn't triggered. I am
>>> wondering how could we hit the page fault path in the first place. The
>>> task is already killed! So what the hell is going on here.
>>>
>>> I must be missing something obvious here.
>>>
>> YOU ARE OBVIOUSLY MISSING MY MAIL!
>>
>> I already said this is "mm, oom: task_will_free_mem(current) should ignore 
>> MMF_OOM_SKIP for once."
>> problem which you are refusing at 
>> https://www.spinics.net/lists/linux-mm/msg133774.html .
>> And you again ignored my mail. Very sad...
> 
> Your suggestion simply didn't make much sense. There is nothing like
> first check is different from the rest.
> 

I don't think your patch is appropriate. It avoids hitting WARN(1) but does not 
avoid
unnecessary killing of OOM victims.

If you look at https://syzkaller.appspot.com/text?tag=CrashLog=15a1c77040 
, you will
notice that both 23766 and 23767 are killed due to task_will_free_mem(current) 
== false.
This is "unnecessary killing of additional processes".

[  365.869417] syz-executor2 invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), 
order=0, oom_score_adj=0
[  365.878899] CPU: 0 PID: 23767 Comm: syz-executor2 Not tainted 
4.18.0-rc6-next-20180725+ #18
(...snipped...)
[  366.487490] Tasks state (memory values in pages):
[  366.492349] [  pid  ]   uid  tgid total_vm  rss pgtables_bytes swapents 
oom_score_adj name
[  366.501237] [  23766] 0 2376617620 8221   1269760
 0 syz-executor3
[  366.510367] [  23767] 0 2376717618 8218   1269760
 0 syz-executor2
[  366.519409] Memory cgroup out of memory: Kill process 23766 (syz-executor3) 
score 8252000 or sacrifice child
[  366.529422] Killed process 23766 (syz-executor3) total-vm:70480kB, 
anon-rss:116kB, file-rss:32768kB, shmem-rss:0kB
[  366.540456] oom_reaper: reaped process 23766 (syz-executor3), now 
anon-rss:0kB, file-rss:32000kB, shmem-rss:0kB
[  366.550949] syz-executor3 invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), 
order=0, oom_score_adj=0
[  366.560374] CPU: 1 PID: 23766 Comm: syz-executor3 Not tainted 
4.18.0-rc6-next-20180725+ #18
(...snipped...)
[  367.138136] Tasks state (memory values in pages):
[  367.142986] [  pid  ]   uid  tgid total_vm  rss pgtables_bytes swapents 
oom_score_adj name
[  367.151889] [  23766] 0 2376617620 8002   1269760
 0 syz-executor3
[  367.160946] [  23767] 0 2376717618 8218   1269760
 0 syz-executor2
[  367.169994] Memory cgroup out of memory: Kill process 23767 (syz-executor2) 
score 8249000 or sacrifice child
[  367.180119] Killed process 23767 (syz-executor2) total-vm:70472kB, 
anon-rss:104kB, file-rss:32768kB, shmem-rss:0kB
[  367.192101] oom_reaper: reaped process 23767 (syz-executor2), now 
anon-rss:0kB, file-rss:32000kB, shmem-rss:0kB
[  367.202986] [ cut here ]
[  367.207845] Memory cgroup charge failed because of no reclaimable memory! 
This looks like a misconfiguration or a kernel bug.
[  367.207965] WARNING: CPU: 1 PID: 23767 at mm/memcontrol.c:1710 
try_charge+0x734/0x1680
[  367.227540] Kernel panic - not syncing: panic_on_warn set ...



Re: WARNING in try_charge

2018-08-06 Thread Tetsuo Handa
On 2018/08/07 2:56, Michal Hocko wrote:
> So the oom victim indeed passed the above force path after the oom
> invocation. But later on hit the page fault path and that behaved
> differently and for some reason the force path hasn't triggered. I am
> wondering how could we hit the page fault path in the first place. The
> task is already killed! So what the hell is going on here.
> 
> I must be missing something obvious here.
> 
YOU ARE OBVIOUSLY MISSING MY MAIL!

I already said this is "mm, oom: task_will_free_mem(current) should ignore 
MMF_OOM_SKIP for once."
problem which you are refusing at 
https://www.spinics.net/lists/linux-mm/msg133774.html .
And you again ignored my mail. Very sad...


Re: WARNING in try_charge

2018-08-06 Thread Tetsuo Handa
On 2018/08/07 0:42, syzbot wrote:
> Hello,
> 
> syzbot has tested the proposed patch but the reproducer still triggered crash:
> WARNING in try_charge
> 
> Killed process 6410 (syz-executor5) total-vm:37708kB, anon-rss:2128kB, 
> file-rss:0kB, shmem-rss:0kB
> oom_reaper: reaped process 6410 (syz-executor5), now anon-rss:0kB, 
> file-rss:0kB, shmem-rss:0kB
> task=syz-executor5 pid=6410 invoked memcg oom killer. oom_victim=1
> [ cut here ]
> Memory cgroup charge failed because of no reclaimable memory! This looks like 
> a misconfiguration or a kernel bug.
> WARNING: CPU: 1 PID: 6410 at mm/memcontrol.c:1707 mem_cgroup_oom 
> mm/memcontrol.c:1706 [inline]
> WARNING: CPU: 1 PID: 6410 at mm/memcontrol.c:1707 try_charge+0x734/0x1680 
> mm/memcontrol.c:2264
> Kernel panic - not syncing: panic_on_warn set ...

Michal, this is "mm, oom: task_will_free_mem(current) should ignore 
MMF_OOM_SKIP for once."
problem which you are refusing at 
https://www.spinics.net/lists/linux-mm/msg133774.html .


Re: WARNING in try_charge

2018-08-06 Thread Tetsuo Handa
Now, let's test next-20180803 .

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git 
116b181bb646afedd770985de20a68721bdb2648

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4603ad75c9a9..852cd3dbdcd9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1388,6 +1388,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	bool ret;
 
 	mutex_lock(&oom_lock);
+	pr_info("task=%s pid=%d invoked memcg oom killer. oom_victim=%d\n",
+		current->comm, current->pid, tsk_is_oom_victim(current));
 	ret = out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
 	return ret;



Re: WARNING in try_charge

2018-08-06 Thread Tetsuo Handa
On 2018/08/06 23:58, Michal Hocko wrote:
>> Since I can't find mm related changes between next-20180803 (syzbot can 
>> reproduce) and
>> next-20180806 (syzbot has not reproduced), I can't guess what makes this 
>> problem go away.
> 
> Hmm, but original report was against 4.18.0-rc6-next-20180725+ kernel.
> And that one had the old group oom code. /me confused.
> 

Yes. But I confirmed that syzbot can reproduce this problem with next-20180803
which already dropped the old group oom code. Therefore, I think that syzbot is
hitting a problem other than the old group oom code.


Re: WARNING in try_charge

2018-08-06 Thread Tetsuo Handa
On 2018/08/06 23:54, David Howells wrote:
> Do you have a link to the problem?
> 
> David
> 

https://groups.google.com/forum/#!msg/syzkaller-bugs/R03vI7RCVco/0PijCTrcCgAJ

syzbot found a reproducer, and the reproducer was working until next-20180803.
But the reproducer fails to reproduce this problem in next-20180806, despite
there being no mm-related change between next-20180803 and next-20180806.

Therefore, I suspect that the reproducer is no longer working as intended. And
there was parser change (your patch) between next-20180803 and next-20180806.



Re: WARNING in try_charge

2018-08-06 Thread Tetsuo Handa
+David Howells

On 2018/08/06 20:32, Michal Hocko wrote:
> On Mon 06-08-18 04:27:02, syzbot wrote:
>> Hello,
>>
>> syzbot has tested the proposed patch and the reproducer did not trigger
>> crash:
>>
>> Reported-and-tested-by:
>> syzbot+bab151e82a4e973fa...@syzkaller.appspotmail.com
>>
>> Tested on:
>>
>> commit: 8c8399e0a3fb Add linux-next specific files for 20180806
>> git tree:   linux-next
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=1b6bc1781e49e93e
>> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
>> patch:  https://syzkaller.appspot.com/x/patch.diff?x=14fe18e240
>>
>> Note: testing is done by a robot and is best-effort only.
> 
> OK, so this smells like a problem in the previous group oom changes. Or
> maybe it is not very easy to reproduce?
> 

Since I can't find mm related changes between next-20180803 (syzbot can 
reproduce) and
next-20180806 (syzbot has not reproduced), I can't guess what makes this 
problem go away.

But since this problem did not occur for 3.5 hours on next-20180806 (when this 
problem
was occurring once per 60-90 minutes), the reproducer might not be working as 
intended
due to "kernfs, sysfs, cgroup, intel_rdt: Support fs_context" or something...

  ./kernel/cgroup/cgroup-internal.h |   3
  ./kernel/cgroup/cgroup-v1.c       | 211
  ./kernel/cgroup/cgroup.c          |  81


Re: INFO: task hung in generic_file_write_iter

2018-08-06 Thread Tetsuo Handa
On 2018/08/06 19:09, Jan Kara wrote:
>> syzbot reproduced this problem ( 
>> https://syzkaller.appspot.com/text?tag=CrashLog=11f2fc4440 ) .
>> It says that grow_dev_page() is returning 1 but __find_get_block() is 
>> failing forever. Any idea?

So far syzbot reproduced 7 times, and all reports are saying that
grow_dev_page() is returning 1 but __find_get_block() is failing forever.

  INFO: task hung in __get_super
  https://syzkaller.appspot.com/text?tag=CrashLog=1734727240

  INFO: task hung in __blkdev_get (2)
  https://syzkaller.appspot.com/text?tag=CrashLog=144347fc40
  https://syzkaller.appspot.com/text?tag=CrashLog=12382bfc40

  INFO: task hung in __writeback_inodes_sb_nr
  https://syzkaller.appspot.com/text?tag=CrashLog=11f2fc4440
  https://syzkaller.appspot.com/text?tag=CrashLog=13c2449c40

  INFO: task hung in filename_create
  https://syzkaller.appspot.com/text?tag=CrashLog=174a672c40

  INFO: task hung in lookup_slow
  https://syzkaller.appspot.com/text?tag=CrashLog=16113bdc40

> 
> Looks like some kind of a race where device block size gets changed while
> getblk() runs (and creates buffers for underlying page). I don't have time
> to nail it down at this moment can have a look into it later unless someone
> beats me to it.
> 

It seems that loop device is relevant to this problem.

144347fc40:

[  557.484258] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  560.491589] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  563.500432] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  566.508502] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  569.516546] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  572.524800] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  575.532499] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  575.803688] INFO: task syz-executor7:7893 blocked for more than 140 seconds.
[  575.810957]   Not tainted 4.18.0-rc7-next-20180801+ #29
[  575.816685] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  575.824663] syz-executor7   D24464  7893   4272 0x0004
[  575.830298] Call Trace:
[  575.832911]  __schedule+0x87c/0x1ec0
[  575.879299]  schedule+0xfb/0x450
[  575.912448]  rwsem_down_read_failed+0x362/0x610
[  575.951922]  call_rwsem_down_read_failed+0x18/0x30
[  575.956843]  down_read+0xc3/0x1d0
[  575.990222]  __get_super.part.12+0x20f/0x2e0
[  575.994654]  get_super+0x2d/0x50
[  575.998016]  fsync_bdev+0x17/0xc0
[  576.001469]  invalidate_partition+0x35/0x60
[  576.005800]  drop_partitions.isra.13+0xe8/0x200
[  576.023366]  rescan_partitions+0x75/0x910
[  576.037534]  __blkdev_reread_part+0x1ad/0x230
[  576.042023]  blkdev_reread_part+0x26/0x40
[  576.046173]  loop_reread_partitions+0x163/0x190
[  576.055795]  loop_set_status+0xb95/0x1010
[  576.059938]  loop_set_status64+0xaa/0x100
[  576.078629]  lo_ioctl+0x90e/0x1d70
[  576.095290]  blkdev_ioctl+0x9cd/0x2030
[  576.146746]  block_ioctl+0xee/0x130
[  576.154704]  do_vfs_ioctl+0x1de/0x1720
[  576.183069]  ksys_ioctl+0xa9/0xd0
[  576.186519]  __x64_sys_ioctl+0x73/0xb0
[  576.190423]  do_syscall_64+0x1b9/0x820



It seems that there were loop_reread_partitions() failures before these
messages started.

12382bfc40:

[  205.467816] loop_reread_partitions: partition scan of loop0 
(wS]}d^MI^A!<92>f<9B>8WF5Eice^W^O<80>Z^H%<8D>}
[  205.467816] <8D> <8F>^OESC<88>\<8B>) failed (rc=1)

[  205.905418] loop_reread_partitions: partition scan of loop0 
(wS]}d^MI^A!<92>f<9B>8WF5Eice^W^O<80>Z^H%<8D>}
[  205.905418] <8D> <8F>^OESC<88>\<8B>) failed (rc=-16)

[  206.113592] loop_reread_partitions: partition scan of loop0 
(wS]}d^MI^A!<92>f<9B>8WF5Eice^W^O<80>Z^H%<8D>}
[  206.113592] <8D> <8F>^OESC<88>\<8B>) failed (rc=-16)

[  206.767225] loop_reread_partitions: partition scan of loop0 
(wS]}d^MI^A!<92>f<9B>8WF5Eice^W^O<80>Z^H%<8D>}
[  206.767225] <8D> <8F>^OESC<88>\<8B>) failed (rc=-16)

[  206.990060] loop_reread_partitions: partition scan of loop0 
(wS]}d^MI^A!<92>f<9B>8WF5Eice^W^O<80>Z^H%<8D>}
[  206.990060] <8D> <8F>^OESC<88>\<8B>) failed (rc=1)

[  208.630329] loop_reread_partitions: partition scan of loop0 
(wS]}d^MI^A!<92>f<9B>8WF5Eice^W^O<80>Z^H%<8D>}
[  208.630329] <8D> <8F>^OESC<88>\<8B>) failed (rc=-16)

[  228.101929] loop_reread_partitions: partition scan of loop0 
(wS]}d^MI^A!<92>f<9B>8WF5Eice^W^O<80>Z^H%<8D>}
[  228.101929] <8D> <8F>^OESC<88>\<8B>) failed (rc=-16)

[  228.605840] loop_reread_partitions: partition scan of loop0 
(wS]}d^MI^A!<92>f<9B>8WF5Eice^W^O<80>Z^H%<8D>}
[  228.605840] <8D> <8F>^OESC<88>\<8B>) failed (rc=1)

[  229.400629] loop_reread_partitions: partition scan of loop0 
(wS]}d^MI^A!<92>f<9B>8WF5Eice^W^O<80>Z^H%<8D>}
[  229.400629] <8D> <8F>^OESC<88>\<8B>) failed (rc=1)

[  229.622379] loop_reread_partitions: partition scan of loop0 
(wS]}d^MI^A!<92>f<9B>8WF5Eice^W^O<80>Z^H%<8D>}
[  229.622379] <8D> 

Re: WARNING in try_charge

2018-08-06 Thread Tetsuo Handa
On 2018/08/06 19:39, Dmitry Vyukov wrote:
> On Mon, Aug 6, 2018 at 11:48 AM, Michal Hocko  wrote:
>> Btw. running with the above diff on top might help us to ideantify
>> whether this is a pre-mature warning or a valid one. Still useful to
>> find out.

Since syzbot already found a syz reproducer, you can ask syzbot to test it.

> 
> The bug report has a reproducer, so you can run it with the patch. Or
> ask syzbot to test your patch:
> https://github.com/google/syzkaller/blob/master/docs/syzbot.md#testing-patches
> Which basically boils down to saying:
> 
> #syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> master

Excuse me, but this is a linux-next-only problem. Therefore,

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git 
master

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4603ad75c9a9..852cd3dbdcd9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1388,6 +1388,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	bool ret;
 
 	mutex_lock(&oom_lock);
+	pr_info("task=%s pid=%d invoked memcg oom killer. oom_victim=%d\n",
+		current->comm, current->pid, tsk_is_oom_victim(current));
 	ret = out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
 	return ret;

F.Y.I. Waiting until __mmput() completes (with timeout using OOM score feedback)
( https://syzkaller.appspot.com/x/patch.diff?x=101e449c40 ) solves this 
race.


Re: WARNING in try_charge

2018-08-05 Thread Tetsuo Handa
On 2018/08/04 22:45, Tetsuo Handa wrote:
> syzbot is hitting WARN(1) because of mem_cgroup_out_of_memory() == false.

Since syzbot found a syz reproducer, I asked syzbot to try two patches.

Setting MMF_OOM_SKIP under oom_lock to prevent from races
( https://syzkaller.appspot.com/x/patch.diff?x=10fb3fd040 ) was not 
sufficient.

Waiting until __mmput() completes (with timeout using OOM score feedback)
( https://syzkaller.appspot.com/x/patch.diff?x=101e449c40 ) solved this 
race.


Re: WARNING in try_charge

2018-08-04 Thread Tetsuo Handa
syzbot is hitting WARN(1) because of mem_cgroup_out_of_memory() == false.
At first I suspected that syzbot is hitting

  static bool oom_kill_memcg_victim(struct oom_control *oc)
  {
  if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
  return oc->chosen_memcg;

case because

  /* We have one or more terminating processes at this point. */
  oc->chosen_task = INFLIGHT_VICTIM;

is not called. But since that patch was dropped from next-20180803, syzbot
seems to be hitting a different race condition
( https://syzkaller.appspot.com/text?tag=CrashLog=1207165440 ).

Therefore, next culprit I suspect is

mm, oom: remove oom_lock from oom_reaper

oom_reaper used to rely on the oom_lock since e2fe14564d33 ("oom_reaper:
close race with exiting task").  We do not really need the lock anymore
though.  212925802454 ("mm: oom: let oom_reap_task and exit_mmap run
concurrently") has removed serialization with the exit path based on the
mm reference count and so we do not really rely on the oom_lock anymore.

Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock
to prevent from races when the page allocator didn't manage to get the
freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later
on and move on to another victim.  Although this is possible in principle
let's wait for it to actually happen in real life before we make the
locking more complex again.

Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and
oom_reap_task_mm).  The reaper serializes with exit_mmap by mmap_sem +
MMF_OOM_SKIP flag.  There is no synchronization with out_of_memory path
now.

which is in next-20180803, and my "mm, oom: Fix unnecessary killing of 
additional processes."
( 
https://marc.info/?i=1533389386-3501-4-git-send-email-penguin-ker...@i-love.sakura.ne.jp
 )
could mitigate it. Michal and David, please respond.
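
For reference, the "MMF_OOM_SKIP should be set under the lock" idea mentioned
in the quoted changelog amounts to roughly the following (a rough sketch of
the idea only, not an actual posted patch):

        /*
         * Sketch: publish MMF_OOM_SKIP only while holding oom_lock, so that a
         * thread which failed its allocation attempt under oom_lock cannot
         * observe the flag (and needlessly select the next OOM victim) before
         * the memory reclaimed by the OOM reaper has become allocatable.
         */
        mutex_lock(&oom_lock);
        set_bit(MMF_OOM_SKIP, &mm->flags);
        mutex_unlock(&oom_lock);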



Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-08-02 Thread Tetsuo Handa
On 2018/07/31 14:09, Michal Hocko wrote:
> On Tue 31-07-18 06:01:48, Tetsuo Handa wrote:
>> On 2018/07/31 4:10, Michal Hocko wrote:
>>> Since should_reclaim_retry() should be a natural reschedule point,
>>> let's do the short sleep for PF_WQ_WORKER threads unconditionally in
>>> order to guarantee that other pending work items are started. This will
>>> workaround this problem and it is less fragile than hunting down when
>>> the sleep is missed. E.g. we used to have a sleeping point in the oom
>>> path but this has been removed recently because it caused other issues.
>>> Having a single sleeping point is more robust.
>>
>> linux.git has not removed the sleeping point in the OOM path yet. Since 
>> removing the
>> sleeping point in the OOM path can mitigate CVE-2016-10723, please do so 
>> immediately.
> 
> is this an {Acked,Reviewed,Tested}-by?
> 
> I will send the patch to Andrew if the patch is ok. 
> 
>> (And that change will conflict with Roman's cgroup aware OOM killer 
>> patchset. But it
>> should be easy to rebase.)
> 
> That is still a WIP so I would lose sleep over it.
> 

Now that Roman's cgroup aware OOM killer patchset will be dropped from 
linux-next.git ,
linux-next.git will get the sleeping point removed. Please send this patch to 
linux-next.git .
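
For reference, the approach described in the quoted text boils down to roughly
the following in should_reclaim_retry() (a sketch based on that description,
not necessarily the exact patch):

        /*
         * Sketch: make should_reclaim_retry() the single guaranteed scheduling
         * point.  A PF_WQ_WORKER thread must really sleep so that workqueue
         * concurrency management can start other pending work items; everyone
         * else just reschedules if needed.
         */
        if (current->flags & PF_WQ_WORKER)
                schedule_timeout_uninterruptible(1);
        else
                cond_resched();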


Re: [PATCH v2 3/3] mm, oom: introduce memory.oom.group

2018-08-02 Thread Tetsuo Handa
On 2018/08/02 20:21, Michal Hocko wrote:
> On Thu 02-08-18 19:53:13, Tetsuo Handa wrote:
>> On 2018/08/02 9:32, Roman Gushchin wrote:
> [...]
>>> +struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
>>> +   struct mem_cgroup *oom_domain)
>>> +{
>>> +   struct mem_cgroup *oom_group = NULL;
>>> +   struct mem_cgroup *memcg;
>>> +
>>> +   if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
>>> +   return NULL;
>>> +
>>> +   if (!oom_domain)
>>> +   oom_domain = root_mem_cgroup;
>>> +
>>> +   rcu_read_lock();
>>> +
>>> +   memcg = mem_cgroup_from_task(victim);
>>
>> Isn't this racy? I guess that memcg of this "victim" can change to
>> somewhere else from the one as of determining the final candidate.
> 
> How is this any different from the existing code? We select a victim and
> then kill it. The victim might move away and won't be part of the oom
> memcg anymore but we will still kill it. I do not remember this ever
> being a problem. Migration is a privileged operation. If you loose this
> restriction you shouldn't allow to move outside of the oom domain.

The existing code kills one process (plus other processes sharing the mm, if any).
But oom_cgroup kills multiple processes. Thus, whether the decision was made based
on the correct memcg becomes important.

> 
>> This "victim" might have already passed exit_mm()/cgroup_exit() from 
>> do_exit().
> 
> Why does this matter? The victim hasn't been killed yet so if it exists
> by its own I do not think we really have to tear the whole cgroup down.

The existing code does not send SIGKILL if find_lock_task_mm() failed. Who can
guarantee that the victim is not inside do_exit() yet when this code is executed?
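
For reference, the guard in the existing single-victim path looks roughly like the
sketch below (paraphrased from __oom_kill_process(); treat it as an illustration,
not the exact code). The whole-cgroup kill decided from mem_cgroup_from_task(victim)
has no equivalent check, which is the window I am asking about:

  p = find_lock_task_mm(victim);
  if (!p) {
          /* victim already passed exit_mm(); nothing left to kill */
          put_task_struct(victim);
          return;
  } else if (victim != p) {
          /* operate on the thread which still owns the mm */
          get_task_struct(p);
          put_task_struct(victim);
          victim = p;
  }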



Re: [PATCH v2 3/3] mm, oom: introduce memory.oom.group

2018-08-02 Thread Tetsuo Handa
On 2018/08/02 9:32, Roman Gushchin wrote:
> For some workloads an intervention from the OOM killer
> can be painful. Killing a random task can bring
> the workload into an inconsistent state.
> 
> Historically, there are two common solutions for this
> problem:
> 1) enabling panic_on_oom,
> 2) using a userspace daemon to monitor OOMs and kill
>all outstanding processes.
> 
> Both approaches have their downsides:
> rebooting on each OOM is an obvious waste of capacity,
> and handling all in userspace is tricky and requires
> a userspace agent, which will monitor all cgroups
> for OOMs.

We could start a one-time userspace agent which handles
a cgroup OOM event and then terminates...
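
For what it is worth, below is a minimal sketch of such a one-time agent, just to
illustrate the idea (it is not part of the patch under review). It assumes a
cgroup v2 hierarchy mounted at /sys/fs/cgroup and a hypothetical target group
called "workload"; it waits for the oom_kill counter in memory.events to change
(cgroup v2 event files can be watched with poll()), SIGKILLs every pid listed in
cgroup.procs, and then terminates:

  /* One-shot cgroup OOM agent (illustration only; the paths are examples). */
  #include <fcntl.h>
  #include <poll.h>
  #include <signal.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/types.h>
  #include <unistd.h>

  static const char *events_file = "/sys/fs/cgroup/workload/memory.events";
  static const char *procs_file  = "/sys/fs/cgroup/workload/cgroup.procs";

  static long oom_kill_count(int fd)
  {
          char buf[256];
          char *p;
          ssize_t len;

          /* Re-reading the polled fd also acknowledges the kernfs event. */
          if (lseek(fd, 0, SEEK_SET) < 0)
                  exit(1);
          len = read(fd, buf, sizeof(buf) - 1);
          if (len < 0)
                  exit(1);
          buf[len] = '\0';
          p = strstr(buf, "oom_kill ");
          return p ? strtol(p + strlen("oom_kill "), NULL, 10) : 0;
  }

  int main(void)
  {
          int fd = open(events_file, O_RDONLY);
          struct pollfd pfd = { .fd = fd, .events = POLLPRI };
          long before;

          if (fd < 0)
                  return 1;
          before = oom_kill_count(fd);
          while (poll(&pfd, 1, -1) > 0) {
                  FILE *fp;
                  long pid;

                  if (oom_kill_count(fd) == before)
                          continue;       /* some other counter changed */
                  /* The kernel OOM killer fired: kill the rest of the group. */
                  fp = fopen(procs_file, "r");
                  if (!fp)
                          return 1;
                  while (fscanf(fp, "%ld", &pid) == 1)
                          kill((pid_t)pid, SIGKILL);
                  fclose(fp);
                  return 0;               /* handled one OOM event, terminate */
          }
          return 1;
  }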



> +/**
> + * mem_cgroup_get_oom_group - get a memory cgroup to clean up after OOM
> + * @victim: task to be killed by the OOM killer
> + * @oom_domain: memcg in case of memcg OOM, NULL in case of system-wide OOM
> + *
> + * Returns a pointer to a memory cgroup, which has to be cleaned up
> + * by killing all belonging OOM-killable tasks.
> + *
> + * Caller has to call mem_cgroup_put() on the returned non-NULL memcg.
> + */
> +struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
> + struct mem_cgroup *oom_domain)
> +{
> + struct mem_cgroup *oom_group = NULL;
> + struct mem_cgroup *memcg;
> +
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> + return NULL;
> +
> + if (!oom_domain)
> + oom_domain = root_mem_cgroup;
> +
> + rcu_read_lock();
> +
> + memcg = mem_cgroup_from_task(victim);

Isn't this racy? I guess that the memcg of this "victim" can have changed
from the one it was in when the final candidate was determined.
This "victim" might have already passed exit_mm()/cgroup_exit() from do_exit().
This "victim" might be moving to a memcg which is different from the one
used to determine the final candidate.

> + if (memcg == root_mem_cgroup)
> + goto out;
> +
> + /*
> +  * Traverse the memory cgroup hierarchy from the victim task's
> +  * cgroup up to the OOMing cgroup (or root) to find the
> +  * highest-level memory cgroup with oom.group set.
> +  */
> + for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> + if (memcg->oom_group)
> + oom_group = memcg;
> +
> + if (memcg == oom_domain)
> + break;
> + }
> +
> + if (oom_group)
> + css_get(&oom_group->css);
> +out:
> + rcu_read_unlock();
> +
> + return oom_group;
> +}



> @@ -974,7 +988,23 @@ static void oom_kill_process(struct oom_control *oc, 
> const char *message)
>   }
>   read_unlock(&tasklist_lock);
>  
> + /*
> +  * Do we need to kill the entire memory cgroup?
> +  * Or even one of the ancestor memory cgroups?
> +  * Check this out before killing the victim task.
> +  */
> + oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
> +
>   __oom_kill_process(victim);
> +
> + /*
> +  * If necessary, kill all tasks in the selected memory cgroup.
> +  */
> + if (oom_group) {

Isn't combining "killing a child process of the biggest memory hog" with "killing all
processes which belong to the memcg which the child process of the biggest
memory hog belongs to" strange? The intent of selecting a child is to try
to minimize lost work, while the intent of oom_cgroup is to try to discard
all work. If oom_cgroup is enabled, I feel that we should

  pr_err("%s: Kill all processes in ", message);
  pr_cont_cgroup_path(memcg->css.cgroup);
  pr_cont(" due to memory.oom.group set\n");

without

  pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", message,
         task_pid_nr(p), p->comm, points);

(I mean, don't try to select a child).

> + mem_cgroup_print_oom_group(oom_group);
> + mem_cgroup_scan_tasks(oom_group, oom_kill_memcg_member, NULL);
> + mem_cgroup_put(oom_group);
> + }
>  }
>  
>  /*




Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-31 Thread Tetsuo Handa
On 2018/07/31 20:15, Michal Hocko wrote:
>>> I will send the patch to Andrew if the patch is ok. 
>>
>> Andrew, can we send the "we used to have a sleeping point in the oom path 
>> but this has
>> been removed recently" patch to linux.git ?
> 
> This can really wait for the next merge window IMHO.
> 

"mm, oom: cgroup-aware OOM killer" in linux-next.git is reviving that sleeping point.
Current "mm, oom: cgroup-aware OOM killer" will not be sent to linux.git in the next
merge window? I'm confused...



Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-31 Thread Tetsuo Handa
On 2018/07/31 14:09, Michal Hocko wrote:
> On Tue 31-07-18 06:01:48, Tetsuo Handa wrote:
>> On 2018/07/31 4:10, Michal Hocko wrote:
>>> Since should_reclaim_retry() should be a natural reschedule point,
>>> let's do the short sleep for PF_WQ_WORKER threads unconditionally in
>>> order to guarantee that other pending work items are started. This will
>>> workaround this problem and it is less fragile than hunting down when
>>> the sleep is missed. E.g. we used to have a sleeping point in the oom
>>> path but this has been removed recently because it caused other issues.
>>> Having a single sleeping point is more robust.
>>
>> linux.git has not removed the sleeping point in the OOM path yet. Since 
>> removing the
>> sleeping point in the OOM path can mitigate CVE-2016-10723, please do so 
>> immediately.
> 
> is this an {Acked,Reviewed,Tested}-by?

I'm saying that "we used to have a sleeping point in the oom path but this has been
removed recently" is not true. You need to send that patch to linux.git first if you
want to refer to that patch in this patch.

> 
> I will send the patch to Andrew if the patch is ok. 

Andrew, can we send the "we used to have a sleeping point in the oom path but this has
been removed recently" patch to linux.git ?

> 
>> (And that change will conflict with Roman's cgroup aware OOM killer 
>> patchset. But it
>> should be easy to rebase.)
> 
> That is still a WIP so I would lose sleep over it.
> 



Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-30 Thread Tetsuo Handa
On 2018/07/31 4:10, Michal Hocko wrote:
> Since should_reclaim_retry() should be a natural reschedule point,
> let's do the short sleep for PF_WQ_WORKER threads unconditionally in
> order to guarantee that other pending work items are started. This will
> workaround this problem and it is less fragile than hunting down when
> the sleep is missed. E.g. we used to have a sleeping point in the oom
> path but this has been removed recently because it caused other issues.
> Having a single sleeping point is more robust.

linux.git has not removed the sleeping point in the OOM path yet. Since removing the
sleeping point in the OOM path can mitigate CVE-2016-10723, please do so immediately.
(And that change will conflict with Roman's cgroup aware OOM killer patchset. But it
should be easy to rebase.)



Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-30 Thread Tetsuo Handa
On 2018/07/30 23:54, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jul 30, 2018 at 04:46:47PM +0200, Michal Hocko wrote:
>> On Mon 30-07-18 23:34:23, Tetsuo Handa wrote:
>>> On 2018/07/30 18:32, Michal Hocko wrote:
>> [...]
>>>> This one is waiting for draining and we are in mm_percpu_wq WQ context
>>>> which has its rescuer so no other activity can block us for ever. So
>>>> this certainly shouldn't deadlock. It can be dead slow but well, this is
>>>> what you will get when your shoot your system to death.
>>>
>>> We need schedule_timeout_*() to allow such WQ_MEM_RECLAIM workqueues to 
>>> wake up. (Tejun,
>>> is my understanding correct?) Lack of schedule_timeout_*() does block 
>>> WQ_MEM_RECLAIM
>>> workqueues forever.
>>
>> Hmm. This doesn't match my understanding of what WQ_MEM_RECLAIM actually
>> guarantees. If you are right then the whole thing sounds quite fragile
>> to me TBH.
> 
> Workqueue doesn't think the cpu is stalled as long as one of the
> per-cpu kworkers is running.  The assumption is that kernel threads
> are not supposed to be busy-looping indefinitely (and they really
> shouldn't).

WQ_MEM_RECLAIM guarantees that a "struct task_struct" (the rescuer) is preallocated. But
WQ_MEM_RECLAIM does not guarantee that a pending work item is started as soon
as it was queued. The same rule applies to both WQ_MEM_RECLAIM workqueues
and !WQ_MEM_RECLAIM workqueues regarding when a pending work item is started (i.e.
when schedule_timeout_*() is called).

Is this correct?

>  We can add timeout mechanism to workqueue so that it
> kicks off other kworkers if one of them is in running state for too
> long, but idk, if there's an indefinite busy loop condition in kernel
> threads, we really should get rid of them and hung task watchdog is
> pretty effective at finding these cases (at least with preemption
> disabled).

Currently the page allocator has a path which can loop forever with
only cond_resched().

> 
> Thanks.
> 
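
By the way, the change being discussed in this thread is essentially to make the
tail of should_reclaim_retry() the single guaranteed scheduling point for the path
mentioned above that loops with only cond_resched(). A rough sketch of that change
(an illustration, not the exact patch) is:

  /*
   * A PF_WQ_WORKER which keeps looping in the allocator never looks
   * blocked to the workqueue concurrency management, so pending work
   * items on the same per-CPU pool (e.g. drain_local_pages_wq) are not
   * started unless the worker actually sleeps.
   */
  if (current->flags & PF_WQ_WORKER)
          schedule_timeout_uninterruptible(1);
  else
          cond_resched();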



Re: INFO: task hung in generic_file_write_iter

2018-07-30 Thread Tetsuo Handa
On 2018/07/21 5:06, Andrew Morton wrote:
> On Fri, 20 Jul 2018 19:36:23 +0900 Tetsuo Handa 
>  wrote:
> 
>>>
>>> This report is stalling after mount() completed and process used 
>>> remap_file_pages().
>>> I think that we might need to use debug printk(). But I don't know what to 
>>> examine.
>>>
>>
>> Andrew, can you pick up this debug printk() patch?
>> I guess we can get the result within one week.
> 
> Sure, let's toss it in -next for a while.
> 
>> >From 8f55e00b21fefffbc6abd9085ac503c52a302464 Mon Sep 17 00:00:00 2001
>> From: Tetsuo Handa 
>> Date: Fri, 20 Jul 2018 19:29:06 +0900
>> Subject: [PATCH] fs/buffer.c: add debug print for __getblk_gfp() stall 
>> problem
>>
>> Among syzbot's unresolved hung task reports, 18 out of 65 reports contain
>> __getblk_gfp() line in the backtrace. Since there is a comment block that
>> says that __getblk_gfp() will lock up the machine if try_to_free_buffers()
>> attempt from grow_dev_page() is failing, let's start from checking whether
>> syzbot is hitting that case. This change will be removed after the bug is
>> fixed.
> 
> I'm not sure that grow_dev_page() is hanging.  It has often been
> suspected, but always is proven innocent.  Lets see.

syzbot reproduced this problem
( https://syzkaller.appspot.com/text?tag=CrashLog=11f2fc4440 ) .
It says that grow_dev_page() is returning 1 but __find_get_block() is failing forever.
Any idea?
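
For anyone trying to follow along: the loop in question is the retry loop in
__getblk_slow(), and the kind of counter-plus-printk instrumentation one could add
there to confirm this behaviour is roughly the sketch below (an illustration of the
approach, not the actual debug patch):

  unsigned long executed = 0;

  for (;;) {
          struct buffer_head *bh = __find_get_block(bdev, block, size);

          if (bh)
                  return bh;
          if (grow_buffers(bdev, block, size, gfp) < 0)
                  return NULL;
          /* grow_dev_page() keeps succeeding, but the lookup keeps failing. */
          if (!(++executed % 1000))
                  printk(KERN_INFO "getblk(): executed=%lu block=%llu size=%u\n",
                         executed, (unsigned long long)block, size);
  }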



Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-30 Thread Tetsuo Handa
On 2018/07/30 18:32, Michal Hocko wrote:
>>>> Since the patch shown below was suggested by Michal Hocko at
>>>> https://marc.info/?l=linux-mm=152723708623015 , it is from Michal Hocko.
>>>>
>>>> >From cd8095242de13ace61eefca0c3d6f2a5a7b40032 Mon Sep 17 00:00:00 2001
>>>> From: Michal Hocko 
>>>> Date: Thu, 26 Jul 2018 14:40:03 +0900
>>>> Subject: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at 
>>>> should_reclaim_retry().
>>>>
>>>> Tetsuo Handa has reported that it is possible to bypass the short sleep
>>>> for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5
>>>> ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make
>>>> any progress") and moved by commit ede37713737834d9 ("mm: throttle on IO
>>>> only when there are too many dirty and writeback pages") and lock up the
>>>> system if OOM.
>>>>
>>>> This is because we are implicitly counting on falling back to
>>>> schedule_timeout_uninterruptible() in __alloc_pages_may_oom() when
>>>> schedule_timeout_uninterruptible() in should_reclaim_retry() was not
>>>> called due to __zone_watermark_ok() == false.
>>>
>>> How do we rely on that?
>>
>> Unless disk_events_workfn() (PID=5) sleeps at 
>> schedule_timeout_uninterruptible()
>> in __alloc_pages_may_oom(), drain_local_pages_wq() which a thread doing 
>> __GFP_FS
>> allocation (PID=498) is waiting for cannot be started.
> 
> Now you are losing me again. What is going on about
> drain_local_pages_wq? I can see that all __GFP_FS allocations are
> waiting for a completion which depends on the kworker context to run but
> I fail to see why drain_local_pages_wq matters. The memory on the pcp
> lists is not accounted to NR_FREE_PAGES IIRC but that should be a
> relatively small amount of memory so this cannot be a fundamental
> problem here.

If you look at "busy workqueues and worker pools", you will find that not only
vmstat_update and drain_local_pages_wq (which is WQ_MEM_RECLAIM) but also
other work items such as xlog_cil_push_work and xfs_reclaim_worker (which are also
WQ_MEM_RECLAIM) remain in the "pending:" state.

[  324.960731] Showing busy workqueues and worker pools:
[  324.962577] workqueue events: flags=0x0
[  324.964137]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=13/256
[  324.966231] pending: vmw_fb_dirty_flush [vmwgfx], vmstat_shepherd, 
vmpressure_work_fn, free_work, mmdrop_async_fn, mmdrop_async_fn, 
mmdrop_async_fn, mmdrop_async_fn, e1000_watchdog [e1000], mmdrop_async_fn, 
mmdrop_async_fn, check_corruption, console_callback
[  324.973425] workqueue events_freezable: flags=0x4
[  324.975247]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  324.977393] pending: vmballoon_work [vmw_balloon]
[  324.979310] workqueue events_power_efficient: flags=0x80
[  324.981298]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=5/256
[  324.983543] pending: gc_worker [nf_conntrack], fb_flashcursor, 
neigh_periodic_work, neigh_periodic_work, check_lifetime
[  324.987240] workqueue events_freezable_power_: flags=0x84
[  324.989292]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  324.991482] in-flight: 5:disk_events_workfn
[  324.993371] workqueue mm_percpu_wq: flags=0x8
[  324.995167]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[  324.997363] pending: vmstat_update, drain_local_pages_wq BAR(498)
[  324.77] workqueue ipv6_addrconf: flags=0x40008
[  325.001899]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
[  325.004092] pending: addrconf_verify_work
[  325.005911] workqueue mpt_poll_0: flags=0x8
[  325.007686]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  325.009914] pending: mpt_fault_reset_work [mptbase]
[  325.012044] workqueue xfs-cil/sda1: flags=0xc
[  325.013897]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  325.016190] pending: xlog_cil_push_work [xfs] BAR(2344)
[  325.018354] workqueue xfs-reclaim/sda1: flags=0xc
[  325.020293]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  325.022549] pending: xfs_reclaim_worker [xfs]
[  325.024540] workqueue xfs-sync/sda1: flags=0x4
[  325.026425]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  325.028691] pending: xfs_log_worker [xfs]
[  325.030546] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=189s workers=4 idle: 
977 65 13

[  444.206334] Showing busy workqueues and worker pools:
[  444.208472] workqueue events: flags=0x0
[  444.210193]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=15/256
[  444.212389] pending: vmw_fb_dirty_flush [vmwgfx], vmstat_shepherd, 
vmpressure_work_fn, free_work, mmdrop_async_fn, mmdrop_async_fn, 
mmdrop_async_fn, mmdrop_async_fn, e1000_w