Re: pids.current with invalid value for hours [5.0.0 rc3 git]
On 26/01/2019 12:09, Tetsuo Handa wrote:
> Arkadiusz, will you try this patch?

Works. Several tries, and pids.current always drops to 0 after ~1s.

Please use
Reported-by: Arkadiusz Miśkiewicz
(using gmail for transport only since vger postmasters aren't reasonable)

And also
Tested-by: Arkadiusz Miśkiewicz

Testing was done on: 5.0.0-rc3-00104-gc04e2a780caf-dirty

Thanks!

> From 48744b6339cf649a69b55997e138c17df1ecc897 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa
> Date: Sat, 26 Jan 2019 20:00:51 +0900
> Subject: [PATCH] oom, oom_reaper: do not enqueue same task twice
>
> Arkadiusz reported that enabling memcg's group oom killing causes
> strange memcg statistics where there is no task in a memcg despite
> the number of tasks in that memcg is not 0. It turned out that there
> is a bug in wake_oom_reaper() which allows enqueuing the same task
> twice, which makes it impossible to decrease the number of tasks in
> that memcg due to a refcount leak.
>
> This bug has existed since the OOM reaper became invokable from the
> task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
> but memcg's group oom killing made it easier to trigger this bug by
> calling wake_oom_reaper() on the same task from one out_of_memory()
> request.
>
> Fix this bug using an approach used by commit 855b018325737f76
> ("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task").
> Since task_will_free_mem(p) == false if p->mm == NULL, we can assume that
> p->mm != NULL when wake_oom_reaper() is called from task_will_free_mem()
> paths. As a side effect, this patch also avoids enqueuing multiple
> threads sharing memory via the task_will_free_mem(current) path.
>
> Signed-off-by: Tetsuo Handa
> Reported-by: Arkadiusz Miśkiewicz
> Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on
> the oom_reaper_list head")
> ---
>  mm/oom_kill.c | 28 +++++++++++++---------------
>  1 file changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index f0e8cd9..457f240 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -505,14 +505,6 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
>  	struct vm_area_struct *vma;
>  	bool ret = true;
>
> -	/*
> -	 * Tell all users of get_user/copy_from_user etc... that the content
> -	 * is no longer stable. No barriers really needed because unmapping
> -	 * should imply barriers already and the reader would hit a page fault
> -	 * if it stumbled over a reaped memory.
> -	 */
> -	set_bit(MMF_UNSTABLE, &mm->flags);
> -
>  	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
>  		if (!can_madv_dontneed_vma(vma))
>  			continue;
> @@ -645,10 +637,15 @@ static int oom_reaper(void *unused)
>  	return 0;
>  }
>
> -static void wake_oom_reaper(struct task_struct *tsk)
> +static void wake_oom_reaper(struct task_struct *tsk, struct mm_struct *mm)
>  {
> -	/* tsk is already queued? */
> -	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
> +	/*
> +	 * Tell all users of get_user/copy_from_user etc... that the content
> +	 * is no longer stable. No barriers really needed because unmapping
> +	 * should imply barriers already and the reader would hit a page fault
> +	 * if it stumbled over a reaped memory.
> +	 */
> +	if (test_and_set_bit(MMF_UNSTABLE, &mm->flags))
>  		return;
>
>  	get_task_struct(tsk);
> @@ -668,7 +665,8 @@ static int __init oom_init(void)
>  }
>  subsys_initcall(oom_init)
>  #else
> -static inline void wake_oom_reaper(struct task_struct *tsk)
> +static inline void wake_oom_reaper(struct task_struct *tsk,
> +				   struct mm_struct *mm)
>  {
>  }
>  #endif /* CONFIG_MMU */
> @@ -915,7 +913,7 @@ static void __oom_kill_process(struct task_struct *victim)
>  	rcu_read_unlock();
>
>  	if (can_oom_reap)
> -		wake_oom_reaper(victim);
> +		wake_oom_reaper(victim, mm);
>
>  	mmdrop(mm);
>  	put_task_struct(victim);
> @@ -955,7 +953,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	task_lock(p);
>  	if (task_will_free_mem(p)) {
>  		mark_oom_victim(p);
> -		wake_oom_reaper(p);
> +		wake_oom_reaper(p, p->mm);
>  		task_unlock(p);
>  		put_task_struct(p);
>  		return;
> @@ -1085,7 +1083,7 @@ bool out_of_memory(struct oom_control *oc)
>  	 */
>  	if (task_will_free_mem(current)) {
>  		mark_oom_victim(current);
> -		wake_oom_reaper(current);
> +		wake_oom_reaper(current, current->mm);
>  		return true;
>  	}

-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )
Re: pids.current with invalid value for hours [5.0.0 rc3 git]
Arkadiusz, will you try this patch?

From 48744b6339cf649a69b55997e138c17df1ecc897 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa
Date: Sat, 26 Jan 2019 20:00:51 +0900
Subject: [PATCH] oom, oom_reaper: do not enqueue same task twice

Arkadiusz reported that enabling memcg's group oom killing causes
strange memcg statistics where there is no task in a memcg despite
the number of tasks in that memcg is not 0. It turned out that there
is a bug in wake_oom_reaper() which allows enqueuing the same task
twice, which makes it impossible to decrease the number of tasks in
that memcg due to a refcount leak.

This bug has existed since the OOM reaper became invokable from the
task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
but memcg's group oom killing made it easier to trigger this bug by
calling wake_oom_reaper() on the same task from one out_of_memory()
request.

Fix this bug using an approach used by commit 855b018325737f76
("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task").
Since task_will_free_mem(p) == false if p->mm == NULL, we can assume that
p->mm != NULL when wake_oom_reaper() is called from task_will_free_mem()
paths. As a side effect, this patch also avoids enqueuing multiple
threads sharing memory via the task_will_free_mem(current) path.

Signed-off-by: Tetsuo Handa
Reported-by: Arkadiusz Miśkiewicz
Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on
the oom_reaper_list head")
---
 mm/oom_kill.c | 28 +++++++++++++---------------
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f0e8cd9..457f240 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -505,14 +505,6 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 	struct vm_area_struct *vma;
 	bool ret = true;

-	/*
-	 * Tell all users of get_user/copy_from_user etc... that the content
-	 * is no longer stable. No barriers really needed because unmapping
-	 * should imply barriers already and the reader would hit a page fault
-	 * if it stumbled over a reaped memory.
-	 */
-	set_bit(MMF_UNSTABLE, &mm->flags);
-
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
 		if (!can_madv_dontneed_vma(vma))
 			continue;
@@ -645,10 +637,15 @@ static int oom_reaper(void *unused)
 	return 0;
 }

-static void wake_oom_reaper(struct task_struct *tsk)
+static void wake_oom_reaper(struct task_struct *tsk, struct mm_struct *mm)
 {
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+	/*
+	 * Tell all users of get_user/copy_from_user etc... that the content
+	 * is no longer stable. No barriers really needed because unmapping
+	 * should imply barriers already and the reader would hit a page fault
+	 * if it stumbled over a reaped memory.
+	 */
+	if (test_and_set_bit(MMF_UNSTABLE, &mm->flags))
 		return;

 	get_task_struct(tsk);
@@ -668,7 +665,8 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static inline void wake_oom_reaper(struct task_struct *tsk)
+static inline void wake_oom_reaper(struct task_struct *tsk,
+				   struct mm_struct *mm)
 {
 }
 #endif /* CONFIG_MMU */
@@ -915,7 +913,7 @@ static void __oom_kill_process(struct task_struct *victim)
 	rcu_read_unlock();

 	if (can_oom_reap)
-		wake_oom_reaper(victim);
+		wake_oom_reaper(victim, mm);

 	mmdrop(mm);
 	put_task_struct(victim);
@@ -955,7 +953,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	task_lock(p);
 	if (task_will_free_mem(p)) {
 		mark_oom_victim(p);
-		wake_oom_reaper(p);
+		wake_oom_reaper(p, p->mm);
 		task_unlock(p);
 		put_task_struct(p);
 		return;
@@ -1085,7 +1083,7 @@ bool out_of_memory(struct oom_control *oc)
 	 */
 	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		wake_oom_reaper(current);
+		wake_oom_reaper(current, current->mm);
 		return true;
 	}
-- 
1.8.3.1
Re: pids.current with invalid value for hours [5.0.0 rc3 git]
On 2019/01/26 15:10, Tetsuo Handa wrote:
> On 2019/01/26 11:41, Arkadiusz Miśkiewicz wrote:
>> dmesg:
>> http://ixion.pld-linux.org/~arekm/cgroup-oom-2.txt

OK. There is a refcount leak bug in wake_oom_reaper() which became
visible by enabling the memory.oom.group setting.

static void wake_oom_reaper(struct task_struct *tsk)
{
	/* tsk is already queued? */
	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
		return;

	get_task_struct(tsk);

	spin_lock(&oom_reaper_lock);
	tsk->oom_reaper_list = oom_reaper_list;
	oom_reaper_list = tsk;
	spin_unlock(&oom_reaper_lock);
	trace_wake_reaper(tsk->pid);
	wake_up(&oom_reaper_wait);
}

(1) oom_reaper_list and tsk(*)->oom_reaper_list are initially NULL.
(2) Since tsk(2160) != oom_reaper_list && tsk(2160)->oom_reaper_list == NULL,
    get_task_struct(tsk(2160)) is called.
(3) tsk(2160)->oom_reaper_list = oom_reaper_list (which is NULL).
(4) oom_reaper_list = tsk(2160) (which is not NULL).
(5) Since tsk(2150) != oom_reaper_list (which is tsk(2160)) &&
    tsk(2150)->oom_reaper_list == NULL, get_task_struct(tsk(2150)) is called.
(6) Step (5) repeats on tsk(2153, 2164, 2166, 2163, 2159, 2161, 2154).
(7) Since tsk(2160) != oom_reaper_list (which is tsk(2154)) &&
    tsk(2160)->oom_reaper_list == NULL (because it was NULL as of (2)),
    get_task_struct(tsk(2160)) is called again.
(8) oom_reap_task() calls put_task_struct(tsk(2160)) only once, because
    tsk(2160) appears on the oom_reaper_list only once; the refcount leaks.
Jan 26 03:33:49 xps kernel: Memory cgroup out of memory: Kill process 2160 (python3) score 66 or sacrifice child
Jan 26 03:33:49 xps kernel: Killed process 2160 (python3) total-vm:272176kB, anon-rss:31668kB, file-rss:3248kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Tasks in /test_2149 are going to be killed due to memory.oom.group set
Jan 26 03:33:49 xps kernel: Killed process 2150 (python3) total-vm:272176kB, anon-rss:27572kB, file-rss:3248kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2153 (python3) total-vm:261936kB, anon-rss:17396kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2164 (python3) total-vm:272176kB, anon-rss:27572kB, file-rss:3248kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2166 (python3) total-vm:261936kB, anon-rss:15088kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2163 (python3) total-vm:261936kB, anon-rss:21496kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2159 (python3) total-vm:282420kB, anon-rss:23944kB, file-rss:3128kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2161 (python3) total-vm:251692kB, anon-rss:11172kB, file-rss:3060kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2154 (python3) total-vm:261936kB, anon-rss:15084kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2160 (python3) total-vm:272176kB, anon-rss:31668kB, file-rss:3248kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2155 (python3) total-vm:261936kB, anon-rss:17364kB, file-rss:3056kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2168 (python3) total-vm:272176kB, anon-rss:29620kB, file-rss:3108kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2156 (python3) total-vm:251692kB, anon-rss:11204kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2157 (python3) total-vm:251692kB, anon-rss:11172kB, file-rss:3088kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2169 (python3) total-vm:261936kB, anon-rss:19448kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2158 (python3) total-vm:272176kB, anon-rss:23944kB, file-rss:3192kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2170 (python3) total-vm:251692kB, anon-rss:13252kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2171 (python3) total-vm:261936kB, anon-rss:17396kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2173 (python3) total-vm:251692kB, anon-rss:11204kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2152 (python3) total-vm:261936kB, anon-rss:13724kB, file-rss:3100kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2174 (python3) total-vm:251692kB, anon-rss:11160kB, file-rss:3120kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2172 (python3) total-vm:251692kB, anon-rss:9128kB, file-rss:3116kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2187 (python3) total-vm:261936kB, anon-rss:23516kB, file-rss:3116kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2184 (python3) total-vm:251692kB, anon-rss:13500kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2151 (python3) total-vm:251692kB, anon-rss:9108kB, file-rss:3096kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2182 (python3
Re: pids.current with invalid value for hours [5.0.0 rc3 git]
On 2019/01/26 11:41, Arkadiusz Miśkiewicz wrote:
> dmesg:
> http://ixion.pld-linux.org/~arekm/cgroup-oom-2.txt

What is wrong with this output? It seems to me that the OOM killer is
killing processes as soon as python3 fork()s one. The "potentially
unexpected fatal signal 7." messages are a bit noisy (they come from
handling SIGBUS at get_signal() after the OOM reaper reclaimed the OOM
victim's memory), but I don't think that something is wrong.

Jan 26 03:33:49 xps kernel: python3 invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=0
Jan 26 03:33:52 xps kernel: potentially unexpected fatal signal 7.

You seem to be monitoring the cgroup statistic values for only 1 minute,
but since the fork bombs might take longer than 1 minute, I'm not sure
whether the statistic values would still be unexpected after terminating
cg.py and waiting for a while (e.g. 1 minute).
Re: pids.current with invalid value for hours [5.0.0 rc3 git]
On 26/01/2019 02:27, Tetsuo Handa wrote:
> On 2019/01/26 4:47, Arkadiusz Miśkiewicz wrote:
>>> Can you please see whether the problem can be reproduced on the
>>> current linux-next?
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>>
>> I can reproduce on next (5.0.0-rc3-next-20190125), too:
>
> Please try this patch.

Doesn't help:

[root@xps test]# python3 cg.py
Created cgroup: /sys/fs/cgroup/test_2149
Start: pids.current: 0
Start: cgroup.procs:
0: pids.current: 97
0: cgroup.procs:
1: pids.current: 14
1: cgroup.procs:
2: pids.current: 14
2: cgroup.procs:
3: pids.current: 14
3: cgroup.procs:
4: pids.current: 14
4: cgroup.procs:
5: pids.current: 14
5: cgroup.procs:
6: pids.current: 14
6: cgroup.procs:
7: pids.current: 14
7: cgroup.procs:
8: pids.current: 14
8: cgroup.procs:
9: pids.current: 14
9: cgroup.procs:
10: pids.current: 14
10: cgroup.procs:
11: pids.current: 14
11: cgroup.procs:
[root@xps test]# ps aux|grep python
root 3160 0.0 0.0 234048 2160 pts/2 S+ 03:34 0:00 grep python
[root@xps test]# uname -a
Linux xps 5.0.0-rc3-00104-gc04e2a780caf-dirty #289 SMP PREEMPT Sat Jan 26 03:29:45 CET 2019 x86_64 Intel(R)_Core(TM)_i9-8950HK_CPU_@_2.90GHz PLD Linux

kernel config: http://ixion.pld-linux.org/~arekm/cgroup-oom-kernelconf-2.txt
dmesg: http://ixion.pld-linux.org/~arekm/cgroup-oom-2.txt

> Subject: [PATCH v2] memcg: killed threads should not invoke memcg OOM killer
> From: Tetsuo Handa
> To: Andrew Morton , Johannes Weiner , David Rientjes
> Cc: Michal Hocko , linux...@kvack.org, Kirill Tkhai , Linus Torvalds
> Message-ID: <01370f70-e1f6-ebe4-b95e-0df21a0bc...@i-love.sakura.ne.jp>
> Date: Tue, 15 Jan 2019 19:17:27 +0900
>
> If $N > $M, a single process with $N threads in a memcg group can easily
> kill all $M processes in that memcg group, for mem_cgroup_out_of_memory()
> does not check if current thread needs to invoke the memcg OOM killer.
>
>  T1@P1     |T2...$N@P1 |P2...$M    |OOM reaper
>  ----------+-----------+-----------+----------
>                                     # all sleeping
>  try_charge()
>    mem_cgroup_out_of_memory()
>      mutex_lock(oom_lock)
>              try_charge()
>                mem_cgroup_out_of_memory()
>                  mutex_lock(oom_lock)
>      out_of_memory()
>        select_bad_process()
>        oom_kill_process(P1)
>        wake_oom_reaper()
>                                     oom_reap_task() # ignores P1
>      mutex_unlock(oom_lock)
>                  out_of_memory()
>                    select_bad_process(P2...$M)
>                          # all killed by T2...$N@P1
>                    wake_oom_reaper()
>                                     oom_reap_task() # ignores P2...$M
>                  mutex_unlock(oom_lock)
>
> We don't need to invoke the memcg OOM killer if current thread was killed
> when waiting for oom_lock, for mem_cgroup_oom_synchronize(true) can count
> on try_charge() when mem_cgroup_oom_synchronize(true) can not make forward
> progress because try_charge() allows already killed/exiting threads to
> make forward progress, and memory_max_write() can bail out upon signals.
>
> At first Michal thought that fatal signal check is racy compared to
> tsk_is_oom_victim() check. But an experiment showed that trying to call
> mark_oom_victim() on all killed thread groups is more racy than fatal
> signal check due to task_will_free_mem(current) path in out_of_memory().
>
> Therefore, this patch changes mem_cgroup_out_of_memory() to bail out upon
> should_force_charge() == T rather than upon fatal_signal_pending() == T,
> for should_force_charge() == T && signal_pending(current) == F at
> memory_max_write() can't happen because current thread won't call
> memory_max_write() after getting PF_EXITING.
>
> Signed-off-by: Tetsuo Handa
> Acked-by: Michal Hocko
> Fixes: 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
> Fixes: 3100dab2aa09 ("mm: memcontrol: print proper OOM header when no
> eligible victim left")
> Cc: sta...@vger.kernel.org # 4.19+
> ---
>  mm/memcontrol.c | 19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index af7f18b..79a7d2a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -248,6 +248,12 @@ enum res_type {
>  	     iter != NULL;				\
>  	     iter = mem_cgroup_iter(NULL, iter, NULL))
>
> +static inline bool should_force_charge(void)
> +{
> +	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
> +		(current->flags & PF_EXITING);
> +}
> +
>  /* Some nice accessors for the vmpressure. */
>  struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
>  {
> @@ -1389,8 +1395,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	};
>  	bool ret;
>
>  -	mutex_lock(&oom_lock);
>  -	ret = out_of_memory(&oc);
>  +	if (mutex_lock_killable(&oo
Re: pids.current with invalid value for hours [5.0.0 rc3 git]
On 26/01/2019 02:41, Roman Gushchin wrote:
> On Fri, Jan 25, 2019 at 08:47:57PM +0100, Arkadiusz Miśkiewicz wrote:
>> On 25/01/2019 17:37, Tejun Heo wrote:
>>> On Fri, Jan 25, 2019 at 08:52:11AM +0100, Arkadiusz Miśkiewicz wrote:
>>>> On 24/01/2019 12:21, Arkadiusz Miśkiewicz wrote:
>>>>> On 17/01/2019 14:17, Arkadiusz Miśkiewicz wrote:
>>>>>> On 17/01/2019 13:25, Aleksa Sarai wrote:
>>>>>>> On 2019-01-17, Arkadiusz Miśkiewicz wrote:
>>>>>>>> Using kernel 4.19.13.
>>>>>>>>
>>>>>>>> For one cgroup I noticed weird behaviour:
>>>>>>>>
>>>>>>>> # cat pids.current
>>>>>>>> 60
>>>>>>>> # cat cgroup.procs
>>>>>>>> #
>>>>>>>
>>>>>>> Are there any zombies in the cgroup? pids.current is linked up directly
>>>>>>> to __put_task_struct (so exit(2) won't decrease it, only the task_struct
>>>>>>> actually being freed will decrease it).
>>>>>>
>>>>>> There are no zombie processes.
>>>>>>
>>>>>> In the mean time the problem shows on multiple servers, and so far I saw
>>>>>> it only in cgroups that were OOMed.
>>>>>>
>>>>>> What has changed on these servers (yesterday) is turning on
>>>>>> memory.oom.group=1 for all cgroups and changing memory.high from 1G to
>>>>>> "max" (leaving memory.max=2G limit only).
>>>>>>
>>>>>> Previously there was no such problem.
>>>>>
>>>>> I'm attaching a reproducer. This time tried on a different distribution
>>>>> kernel (arch linux).
>>>>>
>>>>> After 60s pids.current still shows 37 processes even if there are no
>>>>> processes running (according to ps aux).
>>>>
>>>> The same test on 5.0.0-rc3-00104-gc04e2a780caf and it's easy to
>>>> reproduce the bug. No processes in the cgroup but pids.current reports 91.
>>>
>>> Can you please see whether the problem can be reproduced on the
>>> current linux-next?
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>>
>> I can reproduce on next (5.0.0-rc3-next-20190125), too:
>
> How reliably can you reproduce it? I've tried to run your reproducer
> several times with different parameters, but wasn't lucky so far.

Hmm, I'm able to reproduce it each time using that python3 script.

> What's your cpu number and total ram size?

On different machines:

1) old pc, 1 x intel E6600, 4GB of ram (arch linux 4.20.3 distro kernel),
   using python script
2) virtualbox vm, 1 cpu with 2 cores, 8GB of ram (pld kernel and custom
   built kernels like 5.0rc3 and 5.0rc3 next), using python script
3) server with 2 x intel E5405, 32GB of ram (4.19.13), with real life scenario
4) server with 1 x intel E5-2630 v2, 64GB of ram (4.19.15), with real life scenario

> Can you, please, provide the corresponding dmesg output?

From my virtualbox vm, after 60s pids.current reports 7 despite no
processes: http://ixion.pld-linux.org/~arekm/cgroup-oom-1.txt
kernel config: http://ixion.pld-linux.org/~arekm/cgroup-oom-kernelconf-1.txt
Using 5.0.0-rc3-next-20190125 on that vm.

> I've checked the code again, and my wild guess is that these missing
> tasks are waiting (maybe hopelessly) for the OOM reaper. Dmesg output
> might be very useful here.
>
> Thanks!

-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )
Re: pids.current with invalid value for hours [5.0.0 rc3 git]
On Fri, Jan 25, 2019 at 08:47:57PM +0100, Arkadiusz Miśkiewicz wrote:
> On 25/01/2019 17:37, Tejun Heo wrote:
> > On Fri, Jan 25, 2019 at 08:52:11AM +0100, Arkadiusz Miśkiewicz wrote:
> >> On 24/01/2019 12:21, Arkadiusz Miśkiewicz wrote:
> >>> On 17/01/2019 14:17, Arkadiusz Miśkiewicz wrote:
> >>>> On 17/01/2019 13:25, Aleksa Sarai wrote:
> >>>>> On 2019-01-17, Arkadiusz Miśkiewicz wrote:
> >>>>>> Using kernel 4.19.13.
> >>>>>>
> >>>>>> For one cgroup I noticed weird behaviour:
> >>>>>>
> >>>>>> # cat pids.current
> >>>>>> 60
> >>>>>> # cat cgroup.procs
> >>>>>> #
> >>>>>
> >>>>> Are there any zombies in the cgroup? pids.current is linked up directly
> >>>>> to __put_task_struct (so exit(2) won't decrease it, only the task_struct
> >>>>> actually being freed will decrease it).
> >>>>
> >>>> There are no zombie processes.
> >>>>
> >>>> In the mean time the problem shows on multiple servers, and so far I saw
> >>>> it only in cgroups that were OOMed.
> >>>>
> >>>> What has changed on these servers (yesterday) is turning on
> >>>> memory.oom.group=1 for all cgroups and changing memory.high from 1G to
> >>>> "max" (leaving memory.max=2G limit only).
> >>>>
> >>>> Previously there was no such problem.
> >>>
> >>> I'm attaching a reproducer. This time tried on a different distribution
> >>> kernel (arch linux).
> >>>
> >>> After 60s pids.current still shows 37 processes even if there are no
> >>> processes running (according to ps aux).
> >>
> >> The same test on 5.0.0-rc3-00104-gc04e2a780caf and it's easy to
> >> reproduce the bug. No processes in the cgroup but pids.current reports 91.
> >
> > Can you please see whether the problem can be reproduced on the
> > current linux-next?
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>
> I can reproduce on next (5.0.0-rc3-next-20190125), too:

How reliably can you reproduce it? I've tried to run your reproducer
several times with different parameters, but wasn't lucky so far.

What's your cpu number and total ram size?

Can you, please, provide the corresponding dmesg output?

I've checked the code again, and my wild guess is that these missing
tasks are waiting (maybe hopelessly) for the OOM reaper. Dmesg output
might be very useful here.

Thanks!
Re: pids.current with invalid value for hours [5.0.0 rc3 git]
On 2019/01/26 4:47, Arkadiusz Miśkiewicz wrote:
>> Can you please see whether the problem can be reproduced on the
>> current linux-next?
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>
> I can reproduce on next (5.0.0-rc3-next-20190125), too:

Please try this patch.

Subject: [PATCH v2] memcg: killed threads should not invoke memcg OOM killer
From: Tetsuo Handa
To: Andrew Morton , Johannes Weiner , David Rientjes
Cc: Michal Hocko , linux...@kvack.org, Kirill Tkhai , Linus Torvalds
Message-ID: <01370f70-e1f6-ebe4-b95e-0df21a0bc...@i-love.sakura.ne.jp>
Date: Tue, 15 Jan 2019 19:17:27 +0900

If $N > $M, a single process with $N threads in a memcg group can easily
kill all $M processes in that memcg group, for mem_cgroup_out_of_memory()
does not check if current thread needs to invoke the memcg OOM killer.

 T1@P1     |T2...$N@P1 |P2...$M    |OOM reaper
 ----------+-----------+-----------+----------
                                    # all sleeping
 try_charge()
   mem_cgroup_out_of_memory()
     mutex_lock(oom_lock)
             try_charge()
               mem_cgroup_out_of_memory()
                 mutex_lock(oom_lock)
     out_of_memory()
       select_bad_process()
       oom_kill_process(P1)
       wake_oom_reaper()
                                    oom_reap_task() # ignores P1
     mutex_unlock(oom_lock)
                 out_of_memory()
                   select_bad_process(P2...$M)
                         # all killed by T2...$N@P1
                   wake_oom_reaper()
                                    oom_reap_task() # ignores P2...$M
                 mutex_unlock(oom_lock)

We don't need to invoke the memcg OOM killer if current thread was killed
when waiting for oom_lock, for mem_cgroup_oom_synchronize(true) can count
on try_charge() when mem_cgroup_oom_synchronize(true) can not make forward
progress because try_charge() allows already killed/exiting threads to
make forward progress, and memory_max_write() can bail out upon signals.

At first Michal thought that fatal signal check is racy compared to
tsk_is_oom_victim() check. But an experiment showed that trying to call
mark_oom_victim() on all killed thread groups is more racy than fatal
signal check due to task_will_free_mem(current) path in out_of_memory().

Therefore, this patch changes mem_cgroup_out_of_memory() to bail out upon
should_force_charge() == T rather than upon fatal_signal_pending() == T,
for should_force_charge() == T && signal_pending(current) == F at
memory_max_write() can't happen because current thread won't call
memory_max_write() after getting PF_EXITING.

Signed-off-by: Tetsuo Handa
Acked-by: Michal Hocko
Fixes: 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
Fixes: 3100dab2aa09 ("mm: memcontrol: print proper OOM header when no
eligible victim left")
Cc: sta...@vger.kernel.org # 4.19+
---
 mm/memcontrol.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index af7f18b..79a7d2a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -248,6 +248,12 @@ enum res_type {
 	     iter != NULL;				\
 	     iter = mem_cgroup_iter(NULL, iter, NULL))

+static inline bool should_force_charge(void)
+{
+	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
+		(current->flags & PF_EXITING);
+}
+
 /* Some nice accessors for the vmpressure. */
 struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
 {
@@ -1389,8 +1395,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	};
 	bool ret;

-	mutex_lock(&oom_lock);
-	ret = out_of_memory(&oc);
+	if (mutex_lock_killable(&oom_lock))
+		return true;
+	/*
+	 * A few threads which were not waiting at mutex_lock_killable() can
+	 * fail to bail out. Therefore, check again after holding oom_lock.
+	 */
+	ret = should_force_charge() || out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
 	return ret;
 }
@@ -2209,9 +2220,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * bypass the last charges so that they can exit quickly and
 	 * free their memory.
 	 */
-	if (unlikely(tsk_is_oom_victim(current) ||
-		     fatal_signal_pending(current) ||
-		     current->flags & PF_EXITING))
+	if (unlikely(should_force_charge()))
 		goto force;

 	/*
-- 
1.8.3.1
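The effect of the bail-out can be modelled in user space. This is an illustrative sketch only, not kernel code: the function names echo mm/memcontrol.c, the lock is modelled by simply calling the function sequentially, and should_force_charge() is reduced to "current's process was already killed".

```python
# Model of the race fixed by this patch: $N threads of P1 racing with
# $M victim processes in one memcg (sketch with made-up names).

class Proc:
    def __init__(self, name):
        self.name = name
        self.killed = False

def out_of_memory(procs):
    # select_bad_process() + oom_kill_process(): kill the first live process.
    for p in procs:
        if not p.killed:
            p.killed = True
            return True
    return False

def mem_cgroup_out_of_memory(current_proc, procs, patched):
    # oom_lock serializes callers; we model that by calling sequentially.
    if patched and current_proc.killed:
        return True          # bail out instead of selecting another victim
    return out_of_memory(procs)

def run(n_threads, n_procs, patched):
    # All $N threads of P1 hit the memcg limit and queue up on oom_lock.
    procs = [Proc(f"P{i}") for i in range(1, n_procs + 1)]
    for _ in range(n_threads):
        mem_cgroup_out_of_memory(procs[0], procs, patched)
    return sum(p.killed for p in procs)

print(run(8, 4, patched=False))  # 4: threads T2...$N wipe out P2...$M as well
print(run(8, 4, patched=True))   # 1: only P1 dies; later threads bail out
```

In the real kernel the later threads additionally bail out of mutex_lock_killable() before ever reaching the check; the model collapses both exits into one.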
Re: pids.current with invalid value for hours [5.0.0 rc3 git]
On 25/01/2019 17:37, Tejun Heo wrote:
> On Fri, Jan 25, 2019 at 08:52:11AM +0100, Arkadiusz Miśkiewicz wrote:
>> On 24/01/2019 12:21, Arkadiusz Miśkiewicz wrote:
>>> On 17/01/2019 14:17, Arkadiusz Miśkiewicz wrote:
>>>> On 17/01/2019 13:25, Aleksa Sarai wrote:
>>>>> On 2019-01-17, Arkadiusz Miśkiewicz wrote:
>>>>>> Using kernel 4.19.13.
>>>>>>
>>>>>> For one cgroup I noticed weird behaviour:
>>>>>>
>>>>>> # cat pids.current
>>>>>> 60
>>>>>> # cat cgroup.procs
>>>>>> #
>>>>>
>>>>> Are there any zombies in the cgroup? pids.current is linked up directly
>>>>> to __put_task_struct (so exit(2) won't decrease it, only the task_struct
>>>>> actually being freed will decrease it).
>>>>
>>>> There are no zombie processes.
>>>>
>>>> In the mean time the problem shows on multiple servers, and so far I saw
>>>> it only in cgroups that were OOMed.
>>>>
>>>> What has changed on these servers (yesterday) is turning on
>>>> memory.oom.group=1 for all cgroups and changing memory.high from 1G to
>>>> "max" (leaving memory.max=2G limit only).
>>>>
>>>> Previously there was no such problem.
>>>
>>> I'm attaching a reproducer. This time tried on a different distribution
>>> kernel (arch linux).
>>>
>>> After 60s pids.current still shows 37 processes even if there are no
>>> processes running (according to ps aux).
>>
>> The same test on 5.0.0-rc3-00104-gc04e2a780caf and it's easy to
>> reproduce the bug. No processes in the cgroup but pids.current reports 91.
>
> Can you please see whether the problem can be reproduced on the
> current linux-next?
>
> git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git

I can reproduce on next (5.0.0-rc3-next-20190125), too:

[root@xps test]# python3 cg.py
Created cgroup: /sys/fs/cgroup/test_2501
Start: pids.current: 0
Start: cgroup.procs:
0: pids.current: 65
0: cgroup.procs:
1: pids.current: 44
1: cgroup.procs:
2: pids.current: 44
2: cgroup.procs:
3: pids.current: 44
3: cgroup.procs:
4: pids.current: 44
4: cgroup.procs:
5: pids.current: 44
5: cgroup.procs:
6: pids.current: 44
6: cgroup.procs:
7: pids.current: 44
7: cgroup.procs:
8: pids.current: 44
8: cgroup.procs:
9: pids.current: 44
9: cgroup.procs:
10: pids.current: 44
10: cgroup.procs:
11: pids.current: 44
11: cgroup.procs:
[root@xps test]# uname -a
Linux xps 5.0.0-rc3-next-20190125 #2 SMP PREEMPT Fri Jan 25 19:11:40 CET 2019 x86_64 Intel(R)_Core(TM)_i9-8950HK_CPU_@_2.90GHz PLD Linux
[root@xps test]# mount |grep cgroup2
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)

I'm booting the kernel with cgroup_no_v1=all

-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )
Re: pids.current with invalid value for hours [5.0.0 rc3 git]
On Fri, Jan 25, 2019 at 08:52:11AM +0100, Arkadiusz Miśkiewicz wrote:
> On 24/01/2019 12:21, Arkadiusz Miśkiewicz wrote:
> > On 17/01/2019 14:17, Arkadiusz Miśkiewicz wrote:
> >> On 17/01/2019 13:25, Aleksa Sarai wrote:
> >>> On 2019-01-17, Arkadiusz Miśkiewicz wrote:
> >>>> Using kernel 4.19.13.
> >>>>
> >>>> For one cgroup I noticed weird behaviour:
> >>>>
> >>>> # cat pids.current
> >>>> 60
> >>>> # cat cgroup.procs
> >>>> #
> >>>
> >>> Are there any zombies in the cgroup? pids.current is linked up directly
> >>> to __put_task_struct (so exit(2) won't decrease it, only the task_struct
> >>> actually being freed will decrease it).
> >>
> >> There are no zombie processes.
> >>
> >> In the mean time the problem shows on multiple servers, and so far I saw
> >> it only in cgroups that were OOMed.
> >>
> >> What has changed on these servers (yesterday) is turning on
> >> memory.oom.group=1 for all cgroups and changing memory.high from 1G to
> >> "max" (leaving memory.max=2G limit only).
> >>
> >> Previously there was no such problem.
> >
> > I'm attaching a reproducer. This time tried on a different distribution
> > kernel (arch linux).
> >
> > After 60s pids.current still shows 37 processes even if there are no
> > processes running (according to ps aux).
>
> The same test on 5.0.0-rc3-00104-gc04e2a780caf and it's easy to
> reproduce the bug. No processes in the cgroup but pids.current reports 91.

Can you please see whether the problem can be reproduced on the
current linux-next?

git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git

Thanks.

-- 
tejun
Re: pids.current with invalid value for hours [5.0.0 rc3 git]
On 24/01/2019 12:21, Arkadiusz Miśkiewicz wrote:
> On 17/01/2019 14:17, Arkadiusz Miśkiewicz wrote:
>> On 17/01/2019 13:25, Aleksa Sarai wrote:
>>> On 2019-01-17, Arkadiusz Miśkiewicz wrote:
>>>> Using kernel 4.19.13.
>>>>
>>>> For one cgroup I noticed weird behaviour:
>>>>
>>>> # cat pids.current
>>>> 60
>>>> # cat cgroup.procs
>>>> #
>>>
>>> Are there any zombies in the cgroup? pids.current is linked up directly
>>> to __put_task_struct (so exit(2) won't decrease it, only the task_struct
>>> actually being freed will decrease it).
>>
>> There are no zombie processes.
>>
>> In the mean time the problem shows on multiple servers, and so far I saw
>> it only in cgroups that were OOMed.
>>
>> What has changed on these servers (yesterday) is turning on
>> memory.oom.group=1 for all cgroups and changing memory.high from 1G to
>> "max" (leaving memory.max=2G limit only).
>>
>> Previously there was no such problem.
>
> I'm attaching a reproducer. This time tried on a different distribution
> kernel (arch linux).
>
> After 60s pids.current still shows 37 processes even if there are no
> processes running (according to ps aux).

The same test on 5.0.0-rc3-00104-gc04e2a780caf and it's easy to
reproduce the bug. No processes in the cgroup but pids.current reports 91.

memory.oom.group=0 - everything works fine, pids are counted properly
memory.oom.group=1 - bug becomes visible

[root@xps test]# python3 cg.py
Created cgroup: /sys/fs/cgroup/test_5277
Start: pids.current: 0
Start: cgroup.procs:
0: pids.current: 103
0: cgroup.procs:
1: pids.current: 91
1: cgroup.procs:
2: pids.current: 91
2: cgroup.procs:
3: pids.current: 91
3: cgroup.procs:
4: pids.current: 91
4: cgroup.procs:
5: pids.current: 91
5: cgroup.procs:
6: pids.current: 91
6: cgroup.procs:
7: pids.current: 91
7: cgroup.procs:
8: pids.current: 91
8: cgroup.procs:
9: pids.current: 91
9: cgroup.procs:
10: pids.current: 91
10: cgroup.procs:
11: pids.current: 91
11: cgroup.procs:
[root@xps test]# uname -a
Linux xps 5.0.0-rc3-00104-gc04e2a780caf #288 SMP PREEMPT Thu Jan 24 19:00:32 CET 2019 x86_64 Intel(R)_Core(TM)_i9-8950HK_CPU_@_2.90GHz PLD Linux

cc relevant people

script is here: https://www.spinics.net/lists/cgroups/msg21330.html

> [root@warm ~]# uname -a
> Linux warm 4.20.3-arch1-1-ARCH #1 SMP PREEMPT Wed Jan 16 22:38:58 UTC
> 2019 x86_64 GNU/Linux
> [root@warm ~]# python3 cg.py
> Created cgroup: /sys/fs/cgroup/test_26207
> Start: pids.current: 0
> Start: cgroup.procs:
> 0: pids.current: 62
> 0: cgroup.procs:
> 1: pids.current: 37
> 1: cgroup.procs:
> 2: pids.current: 37
> 2: cgroup.procs:
> 3: pids.current: 37
> 3: cgroup.procs:
> 4: pids.current: 37
> 4: cgroup.procs:
> 5: pids.current: 37
> 5: cgroup.procs:
> 6: pids.current: 37
> 6: cgroup.procs:
> 7: pids.current: 37
> 7: cgroup.procs:
> 8: pids.current: 37
> 8: cgroup.procs:
> 9: pids.current: 37
> 9: cgroup.procs:
> 10: pids.current: 37
> 10: cgroup.procs:
> 11: pids.current: 37
> 11: cgroup.procs:

-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )