Re: pids.current with invalid value for hours [5.0.0 rc3 git]

2019-01-26 Thread Arkadiusz Miśkiewicz
On 26/01/2019 12:09, Tetsuo Handa wrote:
> Arkadiusz, will you try this patch?


Works. Several tries, and pids.current always drops to 0 after ~1s.

Please use

Reported-by: Arkadiusz Miśkiewicz 

(using gmail for transport only since vger postmasters aren't reasonable)

And also
Tested-by: Arkadiusz Miśkiewicz 

Testing was done on:
5.0.0-rc3-00104-gc04e2a780caf-dirty

Thanks!

> [Patch "oom, oom_reaper: do not enqueue same task twice" quoted in full
> in Tetsuo's message below; snipped.]


-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )


Re: pids.current with invalid value for hours [5.0.0 rc3 git]

2019-01-26 Thread Tetsuo Handa
Arkadiusz, will you try this patch?

From 48744b6339cf649a69b55997e138c17df1ecc897 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa 
Date: Sat, 26 Jan 2019 20:00:51 +0900
Subject: [PATCH] oom, oom_reaper: do not enqueue same task twice

Arkadiusz reported that enabling memcg's group oom killing causes
strange memcg statistics where there is no task in a memcg despite
the number of tasks in that memcg not being 0. It turned out that there
is a bug in wake_oom_reaper() which allows enqueuing the same task twice,
which makes it impossible to decrease the number of tasks in that memcg
due to a refcount leak.

This bug has existed since the OOM reaper became invokable from the
task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
but memcg's group oom killing made it easier to trigger by calling
wake_oom_reaper() on the same task from one out_of_memory() request.

Fix this bug using the approach used by commit 855b018325737f76
("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task").
Since task_will_free_mem(p) == false if p->mm == NULL, we can assume
that p->mm != NULL when wake_oom_reaper() is called from the
task_will_free_mem() paths. As a side effect, this patch also avoids
enqueuing multiple threads sharing memory via the
task_will_free_mem(current) path.

Signed-off-by: Tetsuo Handa 
Reported-by: Arkadiusz Miśkiewicz 
Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
---
 mm/oom_kill.c | 28 +++++++++++++---------------
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f0e8cd9..457f240 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -505,14 +505,6 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
struct vm_area_struct *vma;
bool ret = true;
 
-   /*
-* Tell all users of get_user/copy_from_user etc... that the content
-* is no longer stable. No barriers really needed because unmapping
-* should imply barriers already and the reader would hit a page fault
-* if it stumbled over a reaped memory.
-*/
-   set_bit(MMF_UNSTABLE, &mm->flags);
-
for (vma = mm->mmap ; vma; vma = vma->vm_next) {
if (!can_madv_dontneed_vma(vma))
continue;
@@ -645,10 +637,15 @@ static int oom_reaper(void *unused)
return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+static void wake_oom_reaper(struct task_struct *tsk, struct mm_struct *mm)
 {
-   /* tsk is already queued? */
-   if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+   /*
+* Tell all users of get_user/copy_from_user etc... that the content
+* is no longer stable. No barriers really needed because unmapping
+* should imply barriers already and the reader would hit a page fault
+* if it stumbled over a reaped memory.
+*/
+   if (test_and_set_bit(MMF_UNSTABLE, &mm->flags))
return;
 
get_task_struct(tsk);
@@ -668,7 +665,8 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static inline void wake_oom_reaper(struct task_struct *tsk)
+static inline void wake_oom_reaper(struct task_struct *tsk,
+  struct mm_struct *mm)
 {
 }
 #endif /* CONFIG_MMU */
@@ -915,7 +913,7 @@ static void __oom_kill_process(struct task_struct *victim)
rcu_read_unlock();
 
if (can_oom_reap)
-   wake_oom_reaper(victim);
+   wake_oom_reaper(victim, mm);
 
mmdrop(mm);
put_task_struct(victim);
@@ -955,7 +953,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
task_lock(p);
if (task_will_free_mem(p)) {
mark_oom_victim(p);
-   wake_oom_reaper(p);
+   wake_oom_reaper(p, p->mm);
task_unlock(p);
put_task_struct(p);
return;
@@ -1085,7 +1083,7 @@ bool out_of_memory(struct oom_control *oc)
 */
if (task_will_free_mem(current)) {
mark_oom_victim(current);
-   wake_oom_reaper(current);
+   wake_oom_reaper(current, current->mm);
return true;
}
 
-- 
1.8.3.1



Re: pids.current with invalid value for hours [5.0.0 rc3 git]

2019-01-25 Thread Tetsuo Handa
On 2019/01/26 15:10, Tetsuo Handa wrote:
> On 2019/01/26 11:41, Arkadiusz Miśkiewicz wrote:
>> dmesg:
>> http://ixion.pld-linux.org/~arekm/cgroup-oom-2.txt

OK. There is a refcount leak bug in wake_oom_reaper()
which became visible by enabling oom_group setting.

static void wake_oom_reaper(struct task_struct *tsk)
{
	/* tsk is already queued? */
	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
		return;

	get_task_struct(tsk);

	spin_lock(&oom_reaper_lock);
	tsk->oom_reaper_list = oom_reaper_list;
	oom_reaper_list = tsk;
	spin_unlock(&oom_reaper_lock);
	trace_wake_reaper(tsk->pid);
	wake_up(&oom_reaper_wait);
}

(1) oom_reaper_list and tsk(*)->oom_reaper_list are initially NULL.
(2) Since tsk(2160) != oom_reaper_list && tsk(2160)->oom_reaper_list == NULL,
    get_task_struct(tsk(2160)) is called.
(3) tsk(2160)->oom_reaper_list = oom_reaper_list (which is NULL).
(4) oom_reaper_list = tsk(2160) (which is not NULL).
(5) Since tsk(2150) != oom_reaper_list (which is tsk(2160)) &&
    tsk(2150)->oom_reaper_list == NULL,
    get_task_struct(tsk(2150)) is called.
(6) Step (5) repeats on tsk(2153, 2164, 2166, 2163, 2159, 2161, 2154).
(7) Since tsk(2160) != oom_reaper_list (which is tsk(2154)) &&
    tsk(2160)->oom_reaper_list == NULL (because it was NULL as of (2)),
    get_task_struct(tsk(2160)) is called again.
(8) oom_reap_task() calls put_task_struct(tsk(2160)) only once, because
    tsk(2160) appears on the oom_reaper_list only once; the extra
    reference is leaked.
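
Condensed into a standalone user-space sketch (stand-in types and helpers
here, not kernel API), the broken check and the resulting double get look
like this:

#include <assert.h>
#include <stddef.h>

struct task { int refcount; struct task *next; };
static struct task *head;            /* plays oom_reaper_list */

static void buggy_enqueue(struct task *tsk)
{
	/* same test as "tsk == oom_reaper_list || tsk->oom_reaper_list":
	 * the first task ever pushed keeps next == NULL (step (3) above),
	 * so once it is no longer the head this test cannot see it */
	if (tsk == head || tsk->next)
		return;
	tsk->refcount++;             /* plays get_task_struct() */
	tsk->next = head;
	head = tsk;
}

int main(void)
{
	struct task t2160 = { .refcount = 1 }, t2150 = { .refcount = 1 };

	buggy_enqueue(&t2160);       /* steps (2)-(4): t2160.next == NULL */
	buggy_enqueue(&t2150);       /* step (5): t2150 becomes the head  */
	buggy_enqueue(&t2160);       /* step (7): enqueued a second time  */
	assert(t2160.refcount == 3); /* step (8): one reference leaked    */
	return 0;
}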

Jan 26 03:33:49 xps kernel: Memory cgroup out of memory: Kill process 2160 (python3) score 66 or sacrifice child
Jan 26 03:33:49 xps kernel: Killed process 2160 (python3) total-vm:272176kB, anon-rss:31668kB, file-rss:3248kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Tasks in /test_2149 are going to be killed due to memory.oom.group set
Jan 26 03:33:49 xps kernel: Killed process 2150 (python3) total-vm:272176kB, anon-rss:27572kB, file-rss:3248kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2153 (python3) total-vm:261936kB, anon-rss:17396kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2164 (python3) total-vm:272176kB, anon-rss:27572kB, file-rss:3248kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2166 (python3) total-vm:261936kB, anon-rss:15088kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2163 (python3) total-vm:261936kB, anon-rss:21496kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2159 (python3) total-vm:282420kB, anon-rss:23944kB, file-rss:3128kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2161 (python3) total-vm:251692kB, anon-rss:11172kB, file-rss:3060kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2154 (python3) total-vm:261936kB, anon-rss:15084kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2160 (python3) total-vm:272176kB, anon-rss:31668kB, file-rss:3248kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2155 (python3) total-vm:261936kB, anon-rss:17364kB, file-rss:3056kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2168 (python3) total-vm:272176kB, anon-rss:29620kB, file-rss:3108kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2156 (python3) total-vm:251692kB, anon-rss:11204kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2157 (python3) total-vm:251692kB, anon-rss:11172kB, file-rss:3088kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2169 (python3) total-vm:261936kB, anon-rss:19448kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2158 (python3) total-vm:272176kB, anon-rss:23944kB, file-rss:3192kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2170 (python3) total-vm:251692kB, anon-rss:13252kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2171 (python3) total-vm:261936kB, anon-rss:17396kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2173 (python3) total-vm:251692kB, anon-rss:11204kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2152 (python3) total-vm:261936kB, anon-rss:13724kB, file-rss:3100kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2174 (python3) total-vm:251692kB, anon-rss:11160kB, file-rss:3120kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2172 (python3) total-vm:251692kB, anon-rss:9128kB, file-rss:3116kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2187 (python3) total-vm:261936kB, anon-rss:23516kB, file-rss:3116kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2184 (python3) total-vm:251692kB, anon-rss:13500kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2151 (python3) total-vm:251692kB, anon-rss:9108kB, file-rss:3096kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2182 (python3) 

Re: pids.current with invalid value for hours [5.0.0 rc3 git]

2019-01-25 Thread Tetsuo Handa
On 2019/01/26 11:41, Arkadiusz Miśkiewicz wrote:
> dmesg:
> http://ixion.pld-linux.org/~arekm/cgroup-oom-2.txt

What is wrong with this output? It seems to me that the OOM killer is killing
processes as soon as python3 fork()s one. Although the "potentially unexpected
fatal signal 7." messages (due to handling SIGBUS at get_signal() after the
OOM reaper reclaimed the OOM victim's memory) are a bit noisy, I don't think
that anything is wrong.

  Jan 26 03:33:49 xps kernel: python3 invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=0
  Jan 26 03:33:52 xps kernel: potentially unexpected fatal signal 7.

You seem to be monitoring the cgroup statistic values for only 1 minute, but
since the fork bombs might take longer than 1 minute, I'm not sure whether
the statistic values are still unexpected after cg.py has terminated and you
have waited for a while (e.g. 1 minute).
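
For example, something as small as the following would answer that (a
minimal sketch; the test_2149 cgroup path is only an assumed example
taken from the log in this thread):

/* Sample pids.current once a minute after cg.py exits. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[32];

	for (int i = 1; i <= 5; i++) {
		sleep(60);
		FILE *f = fopen("/sys/fs/cgroup/test_2149/pids.current", "r");
		if (!f)
			return 1;     /* cgroup already removed */
		if (fgets(buf, sizeof(buf), f))
			printf("after %d min: pids.current: %s", i, buf);
		fclose(f);
	}
	return 0;
}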



Re: pids.current with invalid value for hours [5.0.0 rc3 git]

2019-01-25 Thread Arkadiusz Miśkiewicz
On 26/01/2019 02:27, Tetsuo Handa wrote:
> On 2019/01/26 4:47, Arkadiusz Miśkiewicz wrote:
>>> Can you please see whether the problem can be reproduced on the
>>> current linux-next?
>>>
>>>  git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>>
>> I can reproduce on next (5.0.0-rc3-next-20190125), too:
>>
> 
> Please try this patch.

Doesn't help:

[root@xps test]# python3 cg.py
Created cgroup: /sys/fs/cgroup/test_2149
Start: pids.current: 0
Start: cgroup.procs:
0: pids.current: 97
0: cgroup.procs:
1: pids.current: 14
1: cgroup.procs:
2: pids.current: 14
2: cgroup.procs:
3: pids.current: 14
3: cgroup.procs:
4: pids.current: 14
4: cgroup.procs:
5: pids.current: 14
5: cgroup.procs:
6: pids.current: 14
6: cgroup.procs:
7: pids.current: 14
7: cgroup.procs:
8: pids.current: 14
8: cgroup.procs:
9: pids.current: 14
9: cgroup.procs:
10: pids.current: 14
10: cgroup.procs:
11: pids.current: 14
11: cgroup.procs:
[root@xps test]# ps aux|grep python
root  3160  0.0  0.0 234048  2160 pts/2S+   03:34   0:00 grep python
[root@xps test]# uname -a
Linux xps 5.0.0-rc3-00104-gc04e2a780caf-dirty #289 SMP PREEMPT Sat Jan
26 03:29:45 CET 2019 x86_64 Intel(R)_Core(TM)_i9-8950HK_CPU_@_2.90GHz
PLD Linux


kernel config:
http://ixion.pld-linux.org/~arekm/cgroup-oom-kernelconf-2.txt

dmesg:
http://ixion.pld-linux.org/~arekm/cgroup-oom-2.txt


> [PATCH v2 "memcg: killed threads should not invoke memcg OOM killer"
> snipped; quoted in full in Tetsuo's message below.]

Re: pids.current with invalid value for hours [5.0.0 rc3 git]

2019-01-25 Thread Arkadiusz Miśkiewicz
On 26/01/2019 02:41, Roman Gushchin wrote:
> On Fri, Jan 25, 2019 at 08:47:57PM +0100, Arkadiusz Miśkiewicz wrote:
>> On 25/01/2019 17:37, Tejun Heo wrote:
>>> On Fri, Jan 25, 2019 at 08:52:11AM +0100, Arkadiusz Miśkiewicz wrote:
 On 24/01/2019 12:21, Arkadiusz Miśkiewicz wrote:
> On 17/01/2019 14:17, Arkadiusz Miśkiewicz wrote:
>> On 17/01/2019 13:25, Aleksa Sarai wrote:
>>> On 2019-01-17, Arkadiusz Miśkiewicz  wrote:
 Using kernel 4.19.13.

 For one cgroup I noticed weird behaviour:

 # cat pids.current
 60
 # cat cgroup.procs
 #
>>>
>>> Are there any zombies in the cgroup? pids.current is linked up directly
>>> to __put_task_struct (so exit(2) won't decrease it, only the task_struct
>>> actually being freed will decrease it).
>>>
>>
>> There are no zombie processes.
>>
>> In the meantime the problem has shown up on multiple servers, and so far
>> I have seen it only in cgroups that were OOMed.
>>
>> What has changed on these servers (yesterday) is turning on
>> memory.oom.group=1 for all cgroups and changing memory.high from 1G to
>> "max" (leaving memory.max=2G limit only).
>>
>> Previously there was no such problem.
>>
>
> I'm attaching reproducer. This time tried on different distribution
> kernel (arch linux).
>
> After 60s pids.current still shows 37 processes even if there are no
> processes running (according to ps aux).


 The same test on 5.0.0-rc3-00104-gc04e2a780caf makes it easy to
 reproduce the bug. No processes in the cgroup, but pids.current reports 91.
>>>
>>> Can you please see whether the problem can be reproduced on the
>>> current linux-next?
>>>
>>>  git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>>
>> I can reproduce on next (5.0.0-rc3-next-20190125), too:
> 
> How reliably can you reproduce it? I've tried to run your reproducer
> several times with different parameters, but no luck so far.

Hmm, I'm able to reproduce each time using that python3 script.


> What's your CPU count and total RAM size?

On different machines:

1) old pc, 1 x intel E6600, 4GB of ram (arch linux 4.20.3 distro
kernel), using python script

2) virtualbox vm, 1 cpu with 2 cores, 8GB of ram (pld kernel and custom
built kernels like 5.0rc3 and 5.0rc3 next), using python script

3) server with 2 x intel E5405, 32GB of ram (4.19.13), with real life
scenario

4) server with 1 x intel E5-2630 v2, 64GB of ram (4.19.15), with real
life scenario
> 
> Can you please provide the corresponding dmesg output?

From my virtualbox vm, after 60s pids.current reports 7 despite no
processes:

http://ixion.pld-linux.org/~arekm/cgroup-oom-1.txt

kernel config:
http://ixion.pld-linux.org/~arekm/cgroup-oom-kernelconf-1.txt

Using 5.0.0-rc3-next-20190125 on that vm.

> I've checked the code again, and my wild guess is that these missing
> tasks are waiting (maybe hopelessly) for the OOM reaper. Dmesg output
> might be very useful here.



> 
> Thanks!
> 


-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )


Re: pids.current with invalid value for hours [5.0.0 rc3 git]

2019-01-25 Thread Roman Gushchin
On Fri, Jan 25, 2019 at 08:47:57PM +0100, Arkadiusz Miśkiewicz wrote:
> On 25/01/2019 17:37, Tejun Heo wrote:
> > On Fri, Jan 25, 2019 at 08:52:11AM +0100, Arkadiusz Miśkiewicz wrote:
> >> On 24/01/2019 12:21, Arkadiusz Miśkiewicz wrote:
> >>> On 17/01/2019 14:17, Arkadiusz Miśkiewicz wrote:
>  On 17/01/2019 13:25, Aleksa Sarai wrote:
> > On 2019-01-17, Arkadiusz Miśkiewicz  wrote:
> >> Using kernel 4.19.13.
> >>
> >> For one cgroup I noticed weird behaviour:
> >>
> >> # cat pids.current
> >> 60
> >> # cat cgroup.procs
> >> #
> >
> > Are there any zombies in the cgroup? pids.current is linked up directly
> > to __put_task_struct (so exit(2) won't decrease it, only the task_struct
> > actually being freed will decrease it).
> >
> 
>  There are no zombie processes.
> 
>  In the meantime the problem has shown up on multiple servers, and so far
>  I have seen it only in cgroups that were OOMed.
> 
>  What has changed on these servers (yesterday) is turning on
>  memory.oom.group=1 for all cgroups and changing memory.high from 1G to
>  "max" (leaving memory.max=2G limit only).
> 
>  Previously there was no such problem.
> 
> >>>
> >>> I'm attaching reproducer. This time tried on different distribution
> >>> kernel (arch linux).
> >>>
> >>> After 60s pids.current still shows 37 processes even if there are no
> >>> processes running (according to ps aux).
> >>
> >>
> >> The same test on 5.0.0-rc3-00104-gc04e2a780caf makes it easy to
> >> reproduce the bug. No processes in the cgroup, but pids.current reports 91.
> > 
> > Can you please see whether the problem can be reproduced on the
> > current linux-next?
> > 
> >  git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
> 
> I can reproduce on next (5.0.0-rc3-next-20190125), too:

How reliably can you reproduce it? I've tried to run your reproducer
several times with different parameters, but no luck so far.
What's your CPU count and total RAM size?

Can you please provide the corresponding dmesg output?

I've checked the code again, and my wild guess is that these missing
tasks are waiting (maybe hopelessly) for the OOM reaper. Dmesg output
might be very useful here.

Thanks!


Re: pids.current with invalid value for hours [5.0.0 rc3 git]

2019-01-25 Thread Tetsuo Handa
On 2019/01/26 4:47, Arkadiusz Miśkiewicz wrote:
>> Can you please see whether the problem can be reproduced on the
>> current linux-next?
>>
>>  git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
> 
> I can reproduce on next (5.0.0-rc3-next-20190125), too:
> 

Please try this patch.

Subject: [PATCH v2] memcg: killed threads should not invoke memcg OOM killer
From: Tetsuo Handa 
To: Andrew Morton ,
 Johannes Weiner , David Rientjes 
Cc: Michal Hocko , linux...@kvack.org,
 Kirill Tkhai ,
 Linus Torvalds 
Message-ID: <01370f70-e1f6-ebe4-b95e-0df21a0bc...@i-love.sakura.ne.jp>
Date: Tue, 15 Jan 2019 19:17:27 +0900

If $N > $M, a single process with $N threads in a memcg group can easily
kill all $M processes in that memcg group, for mem_cgroup_out_of_memory()
does not check whether the current thread needs to invoke the memcg OOM
killer.

  T1@P1     |T2...$N@P1|P2...$M   |OOM reaper
  ----------+----------+----------+----------
                                   # all sleeping
  try_charge()
    mem_cgroup_out_of_memory()
      mutex_lock(oom_lock)
             try_charge()
               mem_cgroup_out_of_memory()
                 mutex_lock(oom_lock)
      out_of_memory()
        select_bad_process()
        oom_kill_process(P1)
        wake_oom_reaper()
                                   oom_reap_task() # ignores P1
      mutex_unlock(oom_lock)
                 out_of_memory()
                   select_bad_process(P2...$M)
                        # all killed by T2...$N@P1
                   wake_oom_reaper()
                                   oom_reap_task() # ignores P2...$M
                 mutex_unlock(oom_lock)

We don't need to invoke the memcg OOM killer if the current thread was
killed while waiting for oom_lock, for mem_cgroup_oom_synchronize(true)
can count on try_charge() when mem_cgroup_oom_synchronize(true) cannot
make forward progress, because try_charge() allows already killed/exiting
threads to make forward progress, and memory_max_write() can bail out
upon signals.

At first Michal thought that the fatal signal check is racy compared to
the tsk_is_oom_victim() check. But an experiment showed that trying to
call mark_oom_victim() on all killed thread groups is racier than the
fatal signal check due to the task_will_free_mem(current) path in
out_of_memory().

Therefore, this patch changes mem_cgroup_out_of_memory() to bail out upon
should_force_charge() == T rather than upon fatal_signal_pending() == T,
for should_force_charge() == T && signal_pending(current) == F at
memory_max_write() can't happen because the current thread won't call
memory_max_write() after getting PF_EXITING.

Signed-off-by: Tetsuo Handa 
Acked-by: Michal Hocko 
Fixes: 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
Fixes: 3100dab2aa09 ("mm: memcontrol: print proper OOM header when no eligible victim left")
Cc: sta...@vger.kernel.org # 4.19+
---
 mm/memcontrol.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index af7f18b..79a7d2a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -248,6 +248,12 @@ enum res_type {
 iter != NULL;  \
 iter = mem_cgroup_iter(NULL, iter, NULL))
 
+static inline bool should_force_charge(void)
+{
+   return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
+   (current->flags & PF_EXITING);
+}
+
 /* Some nice accessors for the vmpressure. */
 struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
 {
@@ -1389,8 +1395,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
};
bool ret;
 
-   mutex_lock(&oom_lock);
-   ret = out_of_memory(&oc);
+   if (mutex_lock_killable(&oom_lock))
+   return true;
+   /*
+* A few threads which were not waiting at mutex_lock_killable() can
+* fail to bail out. Therefore, check again after holding oom_lock.
+*/
+   ret = should_force_charge() || out_of_memory(&oc);
mutex_unlock(&oom_lock);
return ret;
 }
@@ -2209,9 +2220,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 * bypass the last charges so that they can exit quickly and
 * free their memory.
 */
-   if (unlikely(tsk_is_oom_victim(current) ||
-                fatal_signal_pending(current) ||
-                current->flags & PF_EXITING))
+   if (unlikely(should_force_charge()))
goto force;
 
/*
-- 
1.8.3.1



Re: pids.current with invalid value for hours [5.0.0 rc3 git]

2019-01-25 Thread Arkadiusz Miśkiewicz
On 25/01/2019 17:37, Tejun Heo wrote:
> On Fri, Jan 25, 2019 at 08:52:11AM +0100, Arkadiusz Miśkiewicz wrote:
>> On 24/01/2019 12:21, Arkadiusz Miśkiewicz wrote:
>>> On 17/01/2019 14:17, Arkadiusz Miśkiewicz wrote:
 On 17/01/2019 13:25, Aleksa Sarai wrote:
> On 2019-01-17, Arkadiusz Miśkiewicz  wrote:
>> Using kernel 4.19.13.
>>
>> For one cgroup I noticed weird behaviour:
>>
>> # cat pids.current
>> 60
>> # cat cgroup.procs
>> #
>
> Are there any zombies in the cgroup? pids.current is linked up directly
> to __put_task_struct (so exit(2) won't decrease it, only the task_struct
> actually being freed will decrease it).
>

 There are no zombie processes.

 In the meantime the problem has shown up on multiple servers, and so far
 I have seen it only in cgroups that were OOMed.

 What has changed on these servers (yesterday) is turning on
 memory.oom.group=1 for all cgroups and changing memory.high from 1G to
 "max" (leaving memory.max=2G limit only).

 Previously there was no such problem.

>>>
>>> I'm attaching reproducer. This time tried on different distribution
>>> kernel (arch linux).
>>>
>>> After 60s pids.current still shows 37 processes even if there are no
>>> processes running (according to ps aux).
>>
>>
>> The same test on 5.0.0-rc3-00104-gc04e2a780caf makes it easy to
>> reproduce the bug. No processes in the cgroup, but pids.current reports 91.
> 
> Can you please see whether the problem can be reproduced on the
> current linux-next?
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git

I can reproduce on next (5.0.0-rc3-next-20190125), too:

[root@xps test]# python3 cg.py
Created cgroup: /sys/fs/cgroup/test_2501
Start: pids.current: 0
Start: cgroup.procs:
0: pids.current: 65
0: cgroup.procs:
1: pids.current: 44
1: cgroup.procs:
2: pids.current: 44
2: cgroup.procs:
3: pids.current: 44
3: cgroup.procs:
4: pids.current: 44
4: cgroup.procs:
5: pids.current: 44
5: cgroup.procs:
6: pids.current: 44
6: cgroup.procs:
7: pids.current: 44
7: cgroup.procs:
8: pids.current: 44
8: cgroup.procs:
9: pids.current: 44
9: cgroup.procs:
10: pids.current: 44
10: cgroup.procs:
11: pids.current: 44
11: cgroup.procs:
[root@xps test]# uname -a
Linux xps 5.0.0-rc3-next-20190125 #2 SMP PREEMPT Fri Jan 25 19:11:40 CET
2019 x86_64 Intel(R)_Core(TM)_i9-8950HK_CPU_@_2.90GHz PLD Linux
[root@xps test]# mount |grep cgroup2
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)


I'm booting the kernel with cgroup_no_v1=all

-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )


Re: pids.current with invalid value for hours [5.0.0 rc3 git]

2019-01-25 Thread Tejun Heo
On Fri, Jan 25, 2019 at 08:52:11AM +0100, Arkadiusz Miśkiewicz wrote:
> On 24/01/2019 12:21, Arkadiusz Miśkiewicz wrote:
> > On 17/01/2019 14:17, Arkadiusz Miśkiewicz wrote:
> >> On 17/01/2019 13:25, Aleksa Sarai wrote:
> >>> On 2019-01-17, Arkadiusz Miśkiewicz  wrote:
>  Using kernel 4.19.13.
> 
>  For one cgroup I noticed weird behaviour:
> 
>  # cat pids.current
>  60
>  # cat cgroup.procs
>  #
> >>>
> >>> Are there any zombies in the cgroup? pids.current is linked up directly
> >>> to __put_task_struct (so exit(2) won't decrease it, only the task_struct
> >>> actually being freed will decrease it).
> >>>
> >>
> >> There are no zombie processes.
> >>
> >> In the meantime the problem has shown up on multiple servers, and so far
> >> I have seen it only in cgroups that were OOMed.
> >>
> >> What has changed on these servers (yesterday) is turning on
> >> memory.oom.group=1 for all cgroups and changing memory.high from 1G to
> >> "max" (leaving memory.max=2G limit only).
> >>
> >> Previously there was no such problem.
> >>
> > 
> > I'm attaching reproducer. This time tried on different distribution
> > kernel (arch linux).
> > 
> > After 60s pids.current still shows 37 processes even if there are no
> > processes running (according to ps aux).
> 
> 
> The same test on 5.0.0-rc3-00104-gc04e2a780caf makes it easy to
> reproduce the bug. No processes in the cgroup, but pids.current reports 91.

Can you please see whether the problem can be reproduced on the
current linux-next?

 git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git

Thanks.

-- 
tejun


Re: pids.current with invalid value for hours [5.0.0 rc3 git]

2019-01-24 Thread Arkadiusz Miśkiewicz
On 24/01/2019 12:21, Arkadiusz Miśkiewicz wrote:
> On 17/01/2019 14:17, Arkadiusz Miśkiewicz wrote:
>> On 17/01/2019 13:25, Aleksa Sarai wrote:
>>> On 2019-01-17, Arkadiusz Miśkiewicz  wrote:
 Using kernel 4.19.13.

 For one cgroup I noticed weird behaviour:

 # cat pids.current
 60
 # cat cgroup.procs
 #
>>>
>>> Are there any zombies in the cgroup? pids.current is linked up directly
>>> to __put_task_struct (so exit(2) won't decrease it, only the task_struct
>>> actually being freed will decrease it).
>>>
>>
>> There are no zombie processes.
>>
>> In the meantime the problem has shown up on multiple servers, and so far
>> I have seen it only in cgroups that were OOMed.
>>
>> What has changed on these servers (yesterday) is turning on
>> memory.oom.group=1 for all cgroups and changing memory.high from 1G to
>> "max" (leaving memory.max=2G limit only).
>>
>> Previously there was no such problem.
>>
> 
> I'm attaching reproducer. This time tried on different distribution
> kernel (arch linux).
> 
> After 60s pids.current still shows 37 processes even if there are no
> processes running (according to ps aux).


The same test on 5.0.0-rc3-00104-gc04e2a780caf makes it easy to
reproduce the bug. No processes in the cgroup, but pids.current reports 91.

memory.oom.group=0 - everything works fine, pids are counted properly
memory.oom.group=1 - bug becomes visible

[root@xps test]# python3 cg.py
Created cgroup: /sys/fs/cgroup/test_5277
Start: pids.current: 0
Start: cgroup.procs:
0: pids.current: 103
0: cgroup.procs:
1: pids.current: 91
1: cgroup.procs:
2: pids.current: 91
2: cgroup.procs:
3: pids.current: 91
3: cgroup.procs:
4: pids.current: 91
4: cgroup.procs:
5: pids.current: 91
5: cgroup.procs:
6: pids.current: 91
6: cgroup.procs:
7: pids.current: 91
7: cgroup.procs:
8: pids.current: 91
8: cgroup.procs:
9: pids.current: 91
9: cgroup.procs:
10: pids.current: 91
10: cgroup.procs:
11: pids.current: 91
11: cgroup.procs:
[root@xps test]# uname -a
Linux xps 5.0.0-rc3-00104-gc04e2a780caf #288 SMP PREEMPT Thu Jan 24
19:00:32 CET 2019 x86_64 Intel(R)_Core(TM)_i9-8950HK_CPU_@_2.90GHz PLD Linux


cc relevant people

script is here: https://www.spinics.net/lists/cgroups/msg21330.html
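
For readers without access to that list post, here is a hedged sketch of
what such a reproducer does, based only on what this thread describes
(cgroup v2, memory.max=2G, memory.oom.group=1, fork memory hogs, poll
pids.current). It is not the actual cg.py; the cgroup path, hog count,
and poll intervals are assumptions:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

#define CG "/sys/fs/cgroup/test_repro"   /* assumed path */

static void write_cg(const char *file, const char *val)
{
	char path[256];
	snprintf(path, sizeof(path), CG "/%s", file);
	int fd = open(path, O_WRONLY);
	if (fd < 0)
		return;
	(void)write(fd, val, strlen(val));
	close(fd);
}

int main(void)
{
	char buf[32];

	mkdir(CG, 0755);
	write_cg("memory.max", "2G");        /* limits from this thread */
	write_cg("memory.oom.group", "1");

	for (int i = 0; i < 50; i++) {
		if (fork() == 0) {           /* child: join cgroup, hog memory */
			snprintf(buf, sizeof(buf), "%d", getpid());
			write_cg("cgroup.procs", buf);
			for (;;) {
				char *p = malloc(1 << 20);
				if (p)
					memset(p, 1, 1 << 20);
			}
		}
	}
	for (int i = 0; i < 12; i++) {       /* poll like cg.py's output */
		sleep(5);
		while (waitpid(-1, NULL, WNOHANG) > 0)
			;                    /* reap, so zombies can't explain it */
		FILE *f = fopen(CG "/pids.current", "r");
		if (f) {
			if (fgets(buf, sizeof(buf), f))
				printf("%d: pids.current: %s", i, buf);
			fclose(f);
		}
	}
	return 0;
}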

> 
> [root@warm ~]# uname -a
> Linux warm 4.20.3-arch1-1-ARCH #1 SMP PREEMPT Wed Jan 16 22:38:58 UTC
> 2019 x86_64 GNU/Linux
> [root@warm ~]# python3 cg.py
> Created cgroup: /sys/fs/cgroup/test_26207
> Start: pids.current: 0
> Start: cgroup.procs:
> 0: pids.current: 62
> 0: cgroup.procs:
> 1: pids.current: 37
> 1: cgroup.procs:
> 2: pids.current: 37
> 2: cgroup.procs:
> 3: pids.current: 37
> 3: cgroup.procs:
> 4: pids.current: 37
> 4: cgroup.procs:
> 5: pids.current: 37
> 5: cgroup.procs:
> 6: pids.current: 37
> 6: cgroup.procs:
> 7: pids.current: 37
> 7: cgroup.procs:
> 8: pids.current: 37
> 8: cgroup.procs:
> 9: pids.current: 37
> 9: cgroup.procs:
> 10: pids.current: 37
> 10: cgroup.procs:
> 11: pids.current: 37
> 11: cgroup.procs:
> 


-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )