On Wed 10-10-18 09:55:57, Dmitry Vyukov wrote:
> On Wed, Oct 10, 2018 at 6:11 AM, 'David Rientjes' via syzkaller-bugs
> <syzkaller-b...@googlegroups.com> wrote:
> > On Wed, 10 Oct 2018, Tetsuo Handa wrote:
> >
> >> syzbot is hitting RCU stall due to memcg-OOM event.
> >> https://syzkaller.appspot.com/bug?id=4ae3fff7fcf4c33a47c1192d2d62d2e03efffa64
> >>
> >> What should we do if memcg-OOM found no killable task because the
> >> allocating task was oom_score_adj == -1000 ? Flooding printk() until
> >> RCU stall watchdog fires (which seems to be caused by commit
> >> 3100dab2aa09dc6e ("mm: memcontrol: print proper OOM header when no
> >> eligible victim left") because syzbot was terminating the test upon
> >> WARN(1) removed by that commit) is not a good behavior.
> >
>
> You want to say that most of the recent hangs and stalls are actually
> caused by our attempt to sandbox test processes with memory cgroup?
> The process with oom_score_adj == -1000 is not supposed to consume any
> significant memory; we have another (test) process with oom_score_adj
> == 0 that's actually consuming memory.
> But should we refrain from using -1000? Perhaps it would be better to
> use -500/500 for control/test process, or -999/1000?

oom disable on a task (especially when this is the only task in the
memcg) is tricky. Look at the memcg report

[ 935.562389] Memory limit reached of cgroup /syz0
[ 935.567398] memory: usage 204808kB, limit 204800kB, failcnt 6081
[ 935.573768] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[ 935.580650] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[ 935.586923] Memory cgroup stats for /syz0: cache:152KB rss:176336KB rss_huge:163840KB shmem:344KB mapped_file:264KB dirty:0KB writeback:0KB swap:0KB inactive_anon:260KB active_anon:176448KB inactive_file:4KB active_file:0KB

There is still somebody holding anonymous (THP) memory. If there is no
other eligible oom victim then it must be some of the oom disabled ones.
You have suppressed the task list information so we do not know who that
might be though.

So it looks like there is some misconfiguration or a bug in the oom
victim selection.
-- 
Michal Hocko
SUSE Labs
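
Since the task list was suppressed in the report, one way to check from userspace which tasks in the memcg are OOM-disabled is to walk the cgroup's cgroup.procs file and read each task's /proc/<pid>/oom_score_adj. The following is only a rough sketch and not part of the original message: the helper names and the cgroup v1 mount path /sys/fs/cgroup/memory/syz0 are assumptions chosen for illustration.

```python
import os

# -1000 is the oom_score_adj value that exempts a task from the OOM killer.
OOM_SCORE_ADJ_MIN = -1000


def read_int(path):
    """Return the integer stored in path, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def oom_disabled_tasks(cgroup_dir, proc_root="/proc"):
    """List PIDs in cgroup_dir whose oom_score_adj equals -1000.

    cgroup_dir is a hypothetical memcg directory; tasks that exit while
    we are scanning are silently skipped.
    """
    try:
        with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
            pids = [int(line) for line in f if line.strip()]
    except OSError:
        return []

    disabled = []
    for pid in pids:
        adj = read_int(os.path.join(proc_root, str(pid), "oom_score_adj"))
        if adj == OOM_SCORE_ADJ_MIN:
            disabled.append(pid)
    return disabled


if __name__ == "__main__":
    # Assumed cgroup v1 layout; adjust the path for the actual setup.
    for pid in oom_disabled_tasks("/sys/fs/cgroup/memory/syz0"):
        print(f"pid {pid} is OOM-disabled")
```

If the list contains every task in the memcg, no eligible victim exists and the memcg OOM path has nothing to kill, which would match the situation described above.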