Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm

2018-10-09 Thread Michal Hocko
On Tue 09-10-18 22:51:00, Tetsuo Handa wrote:
> On 2018/10/09 22:26, Michal Hocko wrote:
> > On Tue 09-10-18 22:14:24, Tetsuo Handa wrote:
> >> On 2018/10/09 21:58, Michal Hocko wrote:
> >>> On Tue 09-10-18 21:52:12, Tetsuo Handa wrote:
>  On 2018/10/09 20:10, Michal Hocko wrote:
> > On Tue 09-10-18 19:00:44, Tetsuo Handa wrote:
> >>> 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
> >>> reap the mm in the rare case of the race.
> >>
> >> That is no problem. The mistake we made in 4.6 was that we updated
> >> oom_score_adj to -1000 (and allowed unprivileged users to OOM-lockup the system).
> >
> > I do not follow.
> >
> 
>  http://tomoyo.osdn.jp/cgi-bin/lxr/source/mm/oom_kill.c?v=linux-4.6.7#L493
> >>>
> >>> Ahh, so you are not referring to the current upstream code. Do you see
> >>> any specific problem with the current one (well, except for the possible
> >>> race which I have tried to evaluate).
> >>>
> >>
> >> Yes. "task_will_free_mem(current) in out_of_memory() returns false due to
> >> MMF_OOM_SKIP being already set" is a problem for
> >> clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND) with the current code.
> > 
> > a) I fail to see how that is related to your previous post and b) could
> > you be more specific. Is there any other scenario from the two described
> > in my earlier email?
> > 
> 
> I do not follow. Just reverting commit 44a70adec910d692 and commit
> 97fd49c2355ffded is sufficient for closing the copy_process() versus
> __set_oom_adj() race.

Please go back and see why this has been done in the first place.

> We went too far towards complete "struct mm_struct" based OOM handling. But
> stepping back to "struct signal_struct" based OOM handling solves Yong-Taek's
> for_each_process() latency problem and your copy_process() versus
> __set_oom_adj() race problem and my task_will_free_mem(current) race problem.

And again, I have put together an evaluation of the race and tried to see
what its effect is. Then you started firing off hard-to-follow notes, and it
is not clear whether the analysis/conclusions are wrong or incomplete.

So can we get back to that analysis and stick to the topic, please?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm

2018-10-09 Thread Tetsuo Handa
On 2018/10/09 22:26, Michal Hocko wrote:
> On Tue 09-10-18 22:14:24, Tetsuo Handa wrote:
>> On 2018/10/09 21:58, Michal Hocko wrote:
>>> On Tue 09-10-18 21:52:12, Tetsuo Handa wrote:
 On 2018/10/09 20:10, Michal Hocko wrote:
> On Tue 09-10-18 19:00:44, Tetsuo Handa wrote:
>>> 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
>>> reap the mm in the rare case of the race.
>>
>> That is no problem. The mistake we made in 4.6 was that we updated
>> oom_score_adj to -1000 (and allowed unprivileged users to OOM-lockup the system).
>
> I do not follow.
>

 http://tomoyo.osdn.jp/cgi-bin/lxr/source/mm/oom_kill.c?v=linux-4.6.7#L493
>>>
>>> Ahh, so you are not referring to the current upstream code. Do you see
>>> any specific problem with the current one (well, except for the possible
>>> race which I have tried to evaluate).
>>>
>>
>> Yes. "task_will_free_mem(current) in out_of_memory() returns false due to
>> MMF_OOM_SKIP being already set" is a problem for
>> clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND) with the current code.
> 
> a) I fail to see how that is related to your previous post and b) could
> you be more specific. Is there any other scenario from the two described
> in my earlier email?
> 

I do not follow. Just reverting commit 44a70adec910d692 and commit
97fd49c2355ffded is sufficient for closing the copy_process() versus
__set_oom_adj() race.

We went too far towards complete "struct mm_struct" based OOM handling. But
stepping back to "struct signal_struct" based OOM handling solves Yong-Taek's
for_each_process() latency problem and your copy_process() versus
__set_oom_adj() race problem and my task_will_free_mem(current) race problem.



Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm

2018-10-09 Thread Michal Hocko
On Tue 09-10-18 22:14:24, Tetsuo Handa wrote:
> On 2018/10/09 21:58, Michal Hocko wrote:
> > On Tue 09-10-18 21:52:12, Tetsuo Handa wrote:
> >> On 2018/10/09 20:10, Michal Hocko wrote:
> >>> On Tue 09-10-18 19:00:44, Tetsuo Handa wrote:
> > 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
> > reap the mm in the rare case of the race.
> 
>  That is no problem. The mistake we made in 4.6 was that we updated
>  oom_score_adj to -1000 (and allowed unprivileged users to OOM-lockup the system).
> >>>
> >>> I do not follow.
> >>>
> >>
> >> http://tomoyo.osdn.jp/cgi-bin/lxr/source/mm/oom_kill.c?v=linux-4.6.7#L493
> > 
> > Ahh, so you are not referring to the current upstream code. Do you see
> > any specific problem with the current one (well, except for the possible
> > race which I have tried to evaluate).
> > 
> 
> Yes. "task_will_free_mem(current) in out_of_memory() returns false due to
> MMF_OOM_SKIP being already set" is a problem for
> clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND) with the current code.

a) I fail to see how that is related to your previous post and b) could
you be more specific. Is there any other scenario from the two described
in my earlier email?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm

2018-10-09 Thread Tetsuo Handa
On 2018/10/09 21:58, Michal Hocko wrote:
> On Tue 09-10-18 21:52:12, Tetsuo Handa wrote:
>> On 2018/10/09 20:10, Michal Hocko wrote:
>>> On Tue 09-10-18 19:00:44, Tetsuo Handa wrote:
> 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
> reap the mm in the rare case of the race.

 That is no problem. The mistake we made in 4.6 was that we updated
 oom_score_adj to -1000 (and allowed unprivileged users to OOM-lockup the system).
>>>
>>> I do not follow.
>>>
>>
>> http://tomoyo.osdn.jp/cgi-bin/lxr/source/mm/oom_kill.c?v=linux-4.6.7#L493
> 
> Ahh, so you are not referring to the current upstream code. Do you see
> any specific problem with the current one (well, except for the possible
> race which I have tried to evaluate).
> 

Yes. "task_will_free_mem(current) in out_of_memory() returns false due to
MMF_OOM_SKIP being already set" is a problem for
clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND) with the current code.



Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm

2018-10-09 Thread Michal Hocko
On Tue 09-10-18 21:52:12, Tetsuo Handa wrote:
> On 2018/10/09 20:10, Michal Hocko wrote:
> > On Tue 09-10-18 19:00:44, Tetsuo Handa wrote:
> >>> 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
> >>> reap the mm in the rare case of the race.
> >>
> >> That is no problem. The mistake we made in 4.6 was that we updated
> >> oom_score_adj to -1000 (and allowed unprivileged users to OOM-lockup the system).
> > 
> > I do not follow.
> > 
> 
> http://tomoyo.osdn.jp/cgi-bin/lxr/source/mm/oom_kill.c?v=linux-4.6.7#L493

Ahh, so you are not referring to the current upstream code. Do you see
any specific problem with the current one (well, except for the possible
race which I have tried to evaluate)?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm

2018-10-09 Thread Tetsuo Handa
On 2018/10/09 20:10, Michal Hocko wrote:
> On Tue 09-10-18 19:00:44, Tetsuo Handa wrote:
>>> 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
>>> reap the mm in the rare case of the race.
>>
>> That is no problem. The mistake we made in 4.6 was that we updated
>> oom_score_adj to -1000 (and allowed unprivileged users to OOM-lockup the system).
> 
> I do not follow.
> 

http://tomoyo.osdn.jp/cgi-bin/lxr/source/mm/oom_kill.c?v=linux-4.6.7#L493

[  177.722853] a.out invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[  177.724956] a.out cpuset=/ mems_allowed=0
[  177.725735] CPU: 3 PID: 3962 Comm: a.out Not tainted 4.5.0-rc2-next-20160204 #291
(...snipped...)
[  177.802889] [ pid ]   uid  tgid total_vm  rss nr_ptes nr_pmds swapents oom_score_adj name
(...snipped...)
[  177.872248] [ 3941]  1000  394128880  124  14   30     0 bash
[  177.874279] [ 3962]  1000  3962   541717   395780 784   60     0 a.out
[  177.876274] [ 3963]  1000  3963 1078   21   7   30    1000 a.out
[  177.878261] [ 3964]  1000  3964 1078   21   7   30    1000 a.out
[  177.880194] [ 3965]  1000  3965 1078   21   7   30    1000 a.out
[  177.882262] Out of memory: Kill process 3963 (a.out) score 998 or sacrifice child
[  177.884129] Killed process 3963 (a.out) total-vm:4312kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  177.887100] oom_reaper: reaped process 3963 (a.out) anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  179.638399] crond invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
[  179.647708] crond cpuset=/ mems_allowed=0
[  179.652996] CPU: 3 PID: 742 Comm: crond Not tainted 4.5.0-rc2-next-20160204 #291
(...snipped...)
[  179.771311] [ pid ]   uid  tgid total_vm  rss nr_ptes nr_pmds swapents oom_score_adj name
(...snipped...)
[  179.836221] [ 3941]  1000  394128880  124  14   30     0 bash
[  179.838278] [ 3962]  1000  3962   541717   396308 785   60     0 a.out
[  179.840328] [ 3963]  1000  3963 10780   7   30   -1000 a.out
[  179.842443] [ 3965]  1000  3965 10780   7   30    1000 a.out
[  179.844557] Out of memory: Kill process 3965 (a.out) score 998 or sacrifice child
[  179.846404] Killed process 3965 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB


Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm

2018-10-09 Thread Michal Hocko
On Tue 09-10-18 19:00:44, Tetsuo Handa wrote:
> On 2018/10/09 16:50, Michal Hocko wrote:
[...]
> >   Well, that is unfortunate indeed and it
> > breaks the OOM_SCORE_ADJ_MIN contract. There are basically two ways here
> > 1) do not care and encourage users to use a saner way to set
> > OOM_SCORE_ADJ_MIN because doing that externally is racy anyway e.g.
> > setting it before [v]fork & exec. Btw. do we know about an actual user
> > who would care?
> 
> I'm not talking about [v]fork & exec. Why are you talking about [v]fork & exec?

Because that is the only raceless way to set your oom_score_adj.

> > 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
> > reap the mm in the rare case of the race.
> 
> That is no problem. The mistake we made in 4.6 was that we updated
> oom_score_adj to -1000 (and allowed unprivileged users to OOM-lockup the system).

I do not follow.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm

2018-10-09 Thread Tetsuo Handa
On 2018/10/09 16:50, Michal Hocko wrote:
> On Tue 09-10-18 08:35:41, Michal Hocko wrote:
>> [I have only now noticed that the patch has been reposted]
>>
>> On Mon 08-10-18 18:27:39, Tetsuo Handa wrote:
>>> On 2018/10/08 17:38, Yong-Taek Lee wrote:
> [...]
 Thank you for your suggestion. But I think it would be better to separate
 this into 2 issues. How about considering these issues separately, because
 there is no dependency between the race issue and my patch? As I already
 explained, the for_each_process path is meaningless if there is only one
 thread group with many threads (mm_users > 1 but no other thread group
 sharing the same mm). Do you have any other idea to avoid the meaningless
 loop?
>>>
>>> Yes. I suggest reverting commit 44a70adec910d692 ("mm, oom_adj: make sure
>>> processes sharing mm have same view of oom_score_adj") and commit
>>> 97fd49c2355ffded ("mm, oom: kill all tasks sharing the mm").
>>
>> This would require a lot of other work for something as border line as
>> weird threading model like this. I will think about something more
>> appropriate - e.g. we can take mmap_sem for read while doing this check
>> and that should prevent from races with [v]fork.
> 
> Not really. We do not even take the mmap_sem when CLONE_VM. So this is
> not the way. Doing a proper synchronization seems much harder. So let's
> consider what is the worst case scenario. We would basically hit a race
> window between copy_signal and copy_mm and the only relevant case would
> be OOM_SCORE_ADJ_MIN which wouldn't propagate to the new "thread".

The "between copy_signal() and copy_mm()" race window merely determines
whether we need to run the for_each_process() loop. The race window is much
larger than that; it is between "copy_signal() copies
oom_score_adj/oom_score_adj_min" and "the created thread becomes accessible
from the for_each_process() loop".

> OOM killer could then pick up the "thread" and kill it along with the
> whole process group sharing the mm.

Just reverting commit 44a70adec910d692 and commit 97fd49c2355ffded is
sufficient.

>   Well, that is unfortunate indeed and it
> breaks the OOM_SCORE_ADJ_MIN contract. There are basically two ways here
> 1) do not care and encourage users to use a saner way to set
> OOM_SCORE_ADJ_MIN because doing that externally is racy anyway e.g.
> setting it before [v]fork & exec. Btw. do we know about an actual user
> who would care?

I'm not talking about [v]fork & exec. Why are you talking about [v]fork & exec?

> 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
> reap the mm in the rare case of the race.

That is no problem. The mistake we made in 4.6 was that we updated
oom_score_adj to -1000 (and allowed unprivileged users to OOM-lockup the
system). Now that we set MMF_OOM_SKIP, there is no need to worry about an
"oom_score_adj != -1000" thread group and an "oom_score_adj == -1000" thread
group sharing the same mm. Since updating oom_score_adj to -1000 is a
privileged operation, it is the administrator's wish if such a case happens;
the kernel should respect the administrator's wish.

> 
> I would prefer the first, but if this race really has to be addressed then
> the second sounds more reasonable than the wholesale revert.
> 



Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm

2018-10-09 Thread Michal Hocko
On Tue 09-10-18 08:23:30, Michal Hocko wrote:
> [Cc Oleg]

JFYI, there was a new submission:
http://lkml.kernel.org/r/20181008011931epcms1p82dd01b7e5c067ea99946418bc97de46a@epcms1p8
> 
> On Fri 05-10-18 15:32:08, Yong-Taek Lee wrote:
> > It is introduced by commit 44a70adec910 ("mm, oom_adj: make sure
> > processes sharing mm have same view of oom_score_adj"). Most user
> > processes have mm_users bigger than 1 but only one thread group. In this
> > case, the for_each_process loop meaninglessly tries to find processes
> > sharing the same mm even though there is only one thread group.
> > 
> > My idea is that the target task's thread count is smaller than mm_users
> > if there are more thread groups sharing the same mm. So we can skip the loop
> 
> I remember trying to optimize this but ended up with nothing that would
> work reliably. E.g. what prevents a thread from terminating right after we
> read the mm reference count, resulting in an early break and other
> processes not being updated properly?
> 
> > if mm_users and nr_threads are the same.
> > 
> > test result
> > while true; do count=0; time while [ $count -lt 1 ]; do echo -1000 > /proc/1457/oom_score_adj; count=$((count+1)); done; done;
> 
> Is this overhead noticeable in real-world use cases though? Or are you
> really updating oom_score_adj that often?
> 
> > before patch
> > 0m00.59s real 0m00.09s user 0m00.51s system
> > 0m00.59s real 0m00.14s user 0m00.45s system
> > 0m00.58s real 0m00.11s user 0m00.47s system
> > 0m00.58s real 0m00.10s user 0m00.48s system
> > 0m00.59s real 0m00.11s user 0m00.48s system
> > 
> > after patch
> > 0m00.15s real 0m00.07s user 0m00.08s system
> > 0m00.14s real 0m00.10s user 0m00.04s system
> > 0m00.14s real 0m00.10s user 0m00.05s system
> > 0m00.14s real 0m00.08s user 0m00.07s system
> > 0m00.14s real 0m00.08s user 0m00.07s system
> > 
> > Signed-off-by: Lee YongTaek 
> > ---
> >  fs/proc/base.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index f9f72aee6d45..54b2fb5e9c51 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> >  	struct mm_struct *mm = NULL;
> >  	struct task_struct *task;
> >  	int err = 0;
> > +	int mm_users = 0;
> >  
> >  	task = get_proc_task(file_inode(file));
> >  	if (!task)
> > @@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> >  		struct task_struct *p = find_lock_task_mm(task);
> >  
> >  		if (p) {
> > -			if (atomic_read(&p->mm->mm_users) > 1) {
> > +			mm_users = atomic_read(&p->mm->mm_users);
> > +			if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {
> >  				mm = p->mm;
> >  				atomic_inc(&mm->mm_count);
> >  			}
> > --
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs


Re: Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm

2018-10-09 Thread Michal Hocko
On Mon 08-10-18 17:38:55, Yong-Taek Lee wrote:
> Do you have any other idea to avoid meaningless loop ? 

I have already asked in the earlier posting but let's follow up here.
Could you please expand on why this actually matters and what are the
consequences please?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm

2018-10-09 Thread Michal Hocko
On Tue 09-10-18 08:35:41, Michal Hocko wrote:
> [I have only now noticed that the patch has been reposted]
> 
> On Mon 08-10-18 18:27:39, Tetsuo Handa wrote:
> > On 2018/10/08 17:38, Yong-Taek Lee wrote:
[...]
> > > Thank you for your suggestion. But i think it would be better to seperate 
> > > to 2 issues. How about think these
> > > issues separately because there are no dependency between race issue and 
> > > my patch. As i already explained,
> > > for_each_process path is meaningless if there is only one thread group 
> > > with many threads(mm_users > 1 but 
> > > no other thread group sharing same mm). Do you have any other idea to 
> > > avoid meaningless loop ? 
> > 
> > Yes. I suggest reverting commit 44a70adec910d692 ("mm, oom_adj: make sure 
> > processes
> > sharing mm have same view of oom_score_adj") and commit 97fd49c2355ffded 
> > ("mm, oom:
> > kill all tasks sharing the mm").
> 
> This would require a lot of other work for something as border line as
> weird threading model like this. I will think about something more
> appropriate - e.g. we can take mmap_sem for read while doing this check
> and that should prevent from races with [v]fork.

Not really. We do not even take the mmap_sem when CLONE_VM. So this is
not the way. Doing a proper synchronization seems much harder. So let's
consider what is the worst case scenario. We would basically hit a race
window between copy_signal and copy_mm and the only relevant case would
be OOM_SCORE_ADJ_MIN which wouldn't propagate to the new "thread". OOM
killer could then pick up the "thread" and kill it along with the whole
process group sharing the mm. Well, that is unfortunate indeed and it
breaks the OOM_SCORE_ADJ_MIN contract. There are basically two ways here
1) do not care and encourage users to use a saner way to set
OOM_SCORE_ADJ_MIN because doing that externally is racy anyway e.g.
setting it before [v]fork & exec. Btw. do we know about an actual user
who would care?
2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
reap the mm in the rare case of the race.

I would prefer the first but if this race really has to be addressed then
option 2 sounds more reasonable than the wholesale revert.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm

2018-10-09 Thread Michal Hocko
[I have only now noticed that the patch has been reposted]

On Mon 08-10-18 18:27:39, Tetsuo Handa wrote:
> On 2018/10/08 17:38, Yong-Taek Lee wrote:
> >>
> >> On 2018/10/08 15:14, Yong-Taek Lee wrote:
>  On 2018/10/08 10:19, Yong-Taek Lee wrote:
> > @@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> > struct mm_struct *mm = NULL;
> > struct task_struct *task;
> > int err = 0;
> > +   int mm_users = 0;
> >
> > task = get_proc_task(file_inode(file));
> > if (!task)
> > @@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> > struct task_struct *p = find_lock_task_mm(task);
> >
> > if (p) {
> > -   if (atomic_read(&p->mm->mm_users) > 1) {
> > +   mm_users = atomic_read(&p->mm->mm_users);
> > +   if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {
> 
>  How can this work (even before this patch)? When clone(CLONE_VM without 
>  CLONE_THREAD/CLONE_SIGHAND)
>  is requested, copy_process() calls copy_signal() in order to copy 
>  sig->oom_score_adj and
>  sig->oom_score_adj_min before calling copy_mm() in order to increment 
>  mm->mm_users, doesn't it?
>  Then, we will get two different "struct signal_struct" with different 
>  oom_score_adj/oom_score_adj_min
>  but one "struct mm_struct" shared by two thread groups.
> 
> >>>
> >>> Are you talking about race between __set_oom_adj and copy_process?
> >>> If so, i agree with your opinion. It can not set oom_score_adj properly 
> >>> for copied process if __set_oom_adj
> >>> check mm_users before copy_process calls copy_mm after copy_signal. 
> >>> Please correct me if i misunderstood anything.
> >>
> >> You understand it correctly.
> >>
> >> Reversing copy_signal() and copy_mm() is not sufficient either. We need to 
> >> use a read/write lock
> >> (read lock for copy_process() and write lock for __set_oom_adj()) in order 
> >> to make sure that
> >> the thread created by clone() becomes reachable from for_each_process() 
> >> path in __set_oom_adj().
> >>
> > 
> > Thank you for your suggestion. But i think it would be better to seperate 
> > to 2 issues. How about think these
> > issues separately because there are no dependency between race issue and my 
> > patch. As i already explained,
> > for_each_process path is meaningless if there is only one thread group with 
> > many threads(mm_users > 1 but 
> > no other thread group sharing same mm). Do you have any other idea to avoid 
> > meaningless loop ? 
> 
> Yes. I suggest reverting commit 44a70adec910d692 ("mm, oom_adj: make sure 
> processes
> sharing mm have same view of oom_score_adj") and commit 97fd49c2355ffded 
> ("mm, oom:
> kill all tasks sharing the mm").

This would require a lot of other work for something as border line as
weird threading model like this. I will think about something more
appropriate - e.g. we can take mmap_sem for read while doing this check
and that should prevent from races with [v]fork.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm

2018-10-09 Thread Michal Hocko
[Cc Oleg]

On Fri 05-10-18 15:32:08, Yong-Taek Lee wrote:
> It is introduced by commit 44a70adec910 ("mm, oom_adj: make sure
> processes sharing mm have same view of oom_score_adj"). Most of
> user process's mm_users is bigger than 1 but only one thread group.
> In this case, for_each_process loop meaninglessly try to find processes
> which sharing same mm even though there is only one thread group.
> 
> My idea is that target task's nr thread is smaller than mm_users if there
> are more thread groups sharing the same mm. So we can skip loop

I remember trying to optimize this but ended up with nothing that would
work reliably. E.g. what prevents a thread from terminating right after we
read the mm reference count, resulting in an early break and other processes
not being updated properly?

> if mm_user and nr_thread are same.
> 
> test result
> while true; do count=0; time while [ $count -lt 1 ]; do echo -1000 > 
> /proc/
> 1457/oom_score_adj; count=$((count+1)); done; done;

Is this overhead noticeable in real world use cases though? Or are you
really updating oom_score_adj that often?

> before patch
> 0m00.59s real 0m00.09s user 0m00.51s system
> 0m00.59s real 0m00.14s user 0m00.45s system
> 0m00.58s real 0m00.11s user 0m00.47s system
> 0m00.58s real 0m00.10s user 0m00.48s system
> 0m00.59s real 0m00.11s user 0m00.48s system
> 
> after patch
> 0m00.15s real 0m00.07s user 0m00.08s system
> 0m00.14s real 0m00.10s user 0m00.04s system
> 0m00.14s real 0m00.10s user 0m00.05s system
> 0m00.14s real 0m00.08s user 0m00.07s system
> 0m00.14s real 0m00.08s user 0m00.07s system
> 
> Signed-off-by: Lee YongTaek 
> ---
>  fs/proc/base.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index f9f72aee6d45..54b2fb5e9c51 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> struct mm_struct *mm = NULL;
> struct task_struct *task;
> int err = 0;
> +   int mm_users = 0;
> 
> task = get_proc_task(file_inode(file));
> if (!task)
> @@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> struct task_struct *p = find_lock_task_mm(task);
> 
> if (p) {
> -   if (atomic_read(&p->mm->mm_users) > 1) {
> +   mm_users = atomic_read(&p->mm->mm_users);
> +   if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {
> mm = p->mm;
> atomic_inc(&mm->mm_count);
> }
> --
> 
> *

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm

2018-10-08 Thread Tetsuo Handa
On 2018/10/08 17:38, Yong-Taek Lee wrote:
>>
>> On 2018/10/08 15:14, Yong-Taek Lee wrote:
 On 2018/10/08 10:19, Yong-Taek Lee wrote:
> @@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> struct mm_struct *mm = NULL;
> struct task_struct *task;
> int err = 0;
> +   int mm_users = 0;
>
> task = get_proc_task(file_inode(file));
> if (!task)
> @@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> struct task_struct *p = find_lock_task_mm(task);
>
> if (p) {
> -   if (atomic_read(&p->mm->mm_users) > 1) {
> +   mm_users = atomic_read(&p->mm->mm_users);
> +   if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {

 How can this work (even before this patch)? When clone(CLONE_VM without 
 CLONE_THREAD/CLONE_SIGHAND)
 is requested, copy_process() calls copy_signal() in order to copy 
 sig->oom_score_adj and
 sig->oom_score_adj_min before calling copy_mm() in order to increment 
 mm->mm_users, doesn't it?
 Then, we will get two different "struct signal_struct" with different 
 oom_score_adj/oom_score_adj_min
 but one "struct mm_struct" shared by two thread groups.

>>>
>>> Are you talking about race between __set_oom_adj and copy_process?
>>> If so, i agree with your opinion. It can not set oom_score_adj properly for 
>>> copied process if __set_oom_adj
>>> check mm_users before copy_process calls copy_mm after copy_signal. Please 
>>> correct me if i misunderstood anything.
>>
>> You understand it correctly.
>>
>> Reversing copy_signal() and copy_mm() is not sufficient either. We need to 
>> use a read/write lock
>> (read lock for copy_process() and write lock for __set_oom_adj()) in order 
>> to make sure that
>> the thread created by clone() becomes reachable from for_each_process() path 
>> in __set_oom_adj().
>>
> 
> Thank you for your suggestion. But i think it would be better to seperate to 
> 2 issues. How about think these
> issues separately because there are no dependency between race issue and my 
> patch. As i already explained,
> for_each_process path is meaningless if there is only one thread group with 
> many threads(mm_users > 1 but 
> no other thread group sharing same mm). Do you have any other idea to avoid 
> meaningless loop ? 

Yes. I suggest reverting commit 44a70adec910d692 ("mm, oom_adj: make sure 
processes
sharing mm have same view of oom_score_adj") and commit 97fd49c2355ffded ("mm, 
oom:
kill all tasks sharing the mm").

> 
>>>
> mm = p->mm;
> atomic_inc(&mm->mm_count);
> }
>>>
>>
> 





RE: Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm

2018-10-08 Thread Yong-Taek Lee
>
>On 2018/10/08 15:14, Yong-Taek Lee wrote:
>>> On 2018/10/08 10:19, Yong-Taek Lee wrote:
 @@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 struct mm_struct *mm = NULL;
 struct task_struct *task;
 int err = 0;
 +   int mm_users = 0;

 task = get_proc_task(file_inode(file));
 if (!task)
 @@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 struct task_struct *p = find_lock_task_mm(task);

 if (p) {
 -   if (atomic_read(&p->mm->mm_users) > 1) {
 +   mm_users = atomic_read(&p->mm->mm_users);
 +   if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {
>>>
>>> How can this work (even before this patch)? When clone(CLONE_VM without 
>>> CLONE_THREAD/CLONE_SIGHAND)
>>> is requested, copy_process() calls copy_signal() in order to copy 
>>> sig->oom_score_adj and
>>> sig->oom_score_adj_min before calling copy_mm() in order to increment 
>>> mm->mm_users, doesn't it?
>>> Then, we will get two different "struct signal_struct" with different 
>>> oom_score_adj/oom_score_adj_min
>>> but one "struct mm_struct" shared by two thread groups.
>>>
>>
>> Are you talking about race between __set_oom_adj and copy_process?
>> If so, i agree with your opinion. It can not set oom_score_adj properly for 
>> copied process if __set_oom_adj
>> check mm_users before copy_process calls copy_mm after copy_signal. Please 
>> correct me if i misunderstood anything.
>
> You understand it correctly.
>
> Reversing copy_signal() and copy_mm() is not sufficient either. We need to 
> use a read/write lock
> (read lock for copy_process() and write lock for __set_oom_adj()) in order to 
> make sure that
> the thread created by clone() becomes reachable from for_each_process() path 
> in __set_oom_adj().
>

Thank you for your suggestion. But i think it would be better to seperate to 2 
issues. How about think these
issues separately because there are no dependency between race issue and my 
patch. As i already explained,
for_each_process path is meaningless if there is only one thread group with 
many threads(mm_users > 1 but 
no other thread group sharing same mm). Do you have any other idea to avoid 
meaningless loop ? 

>>
 mm = p->mm;
 atomic_inc(&mm->mm_count);
 }
>>
>


Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm

2018-10-08 Thread Tetsuo Handa
On 2018/10/08 15:14, Yong-Taek Lee wrote:
>> On 2018/10/08 10:19, Yong-Taek Lee wrote:
>>> @@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>>> struct mm_struct *mm = NULL;
>>> struct task_struct *task;
>>> int err = 0;
>>> +   int mm_users = 0;
>>>
>>> task = get_proc_task(file_inode(file));
>>> if (!task)
>>> @@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>>> struct task_struct *p = find_lock_task_mm(task);
>>>
>>> if (p) {
>>> -   if (atomic_read(&p->mm->mm_users) > 1) {
>>> +   mm_users = atomic_read(&p->mm->mm_users);
>>> +   if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {
>>
>> How can this work (even before this patch)? When clone(CLONE_VM without 
>> CLONE_THREAD/CLONE_SIGHAND)
>> is requested, copy_process() calls copy_signal() in order to copy 
>> sig->oom_score_adj and
>> sig->oom_score_adj_min before calling copy_mm() in order to increment 
>> mm->mm_users, doesn't it?
>> Then, we will get two different "struct signal_struct" with different 
>> oom_score_adj/oom_score_adj_min
>> but one "struct mm_struct" shared by two thread groups.
>>
> 
> Are you talking about race between __set_oom_adj and copy_process?
> If so, i agree with your opinion. It can not set oom_score_adj properly for 
> copied process if __set_oom_adj
> check mm_users before copy_process calls copy_mm after copy_signal. Please 
> correct me if i misunderstood anything.

You understand it correctly.

Reversing copy_signal() and copy_mm() is not sufficient either. We need to use 
a read/write lock
(read lock for copy_process() and write lock for __set_oom_adj()) in order to 
make sure that
the thread created by clone() becomes reachable from for_each_process() path in 
__set_oom_adj().

> 
>>> mm = p->mm;
>>> atomic_inc(&mm->mm_count);
>>> }
> 


Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm

2018-10-08 Thread Tetsuo Handa
On 2018/10/08 15:14, Yong-Taek Lee wrote:
>> On 2018/10/08 10:19, Yong-Taek Lee wrote:
>>> @@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int 
>>> oom_adj, bool legacy)
>>> struct mm_struct *mm = NULL;
>>> struct task_struct *task;
>>> int err = 0;
>>> +   int mm_users = 0;
>>>
>>> task = get_proc_task(file_inode(file));
>>> if (!task)
>>> @@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int 
>>> oom_adj, bool legacy)
>>> struct task_struct *p = find_lock_task_mm(task);
>>>
>>> if (p) {
>>> -   if (atomic_read(&p->mm->mm_users) > 1) {
>>> +   mm_users = atomic_read(&p->mm->mm_users);
>>> +   if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {
>>
>> How can this work (even before this patch)? When clone(CLONE_VM without 
>> CLONE_THREAD/CLONE_SIGHAND)
>> is requested, copy_process() calls copy_signal() in order to copy 
>> sig->oom_score_adj and
>> sig->oom_score_adj_min before calling copy_mm() in order to increment 
>> mm->mm_users, doesn't it?
>> Then, we will get two different "struct signal_struct" with different 
>> oom_score_adj/oom_score_adj_min
>> but one "struct mm_struct" shared by two thread groups.
>>
> 
> Are you talking about the race between __set_oom_adj() and copy_process()?
> If so, I agree. __set_oom_adj() cannot set oom_score_adj properly for the
> copied process if it reads mm_users before copy_process() calls copy_mm()
> (which runs after copy_signal()). Please correct me if I misunderstood
> anything.

You understand it correctly.

Reversing copy_signal() and copy_mm() is not sufficient either. We need to use
a read/write lock (read lock for copy_process() and write lock for
__set_oom_adj()) in order to make sure that the thread created by clone()
becomes reachable from the for_each_process() path in __set_oom_adj().

> 
>>> mm = p->mm;
>>> atomic_inc(&mm->mm_count);
>>> }
> 
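The read/write lock being proposed could look roughly like the sketch below.
This is illustrative only: the `oom_adj_lock` name and its placement are
hypothetical, and both function bodies are heavily abbreviated.

```
/* Hypothetical lock; name and scope are illustrative, not actual kernel code. */
static DECLARE_RWSEM(oom_adj_lock);

copy_process(/* ... */)
{
	/* ... */
	down_read(&oom_adj_lock);
	retval = copy_signal(clone_flags, p);	/* child snapshots oom_score_adj */
	/* ... */
	retval = copy_mm(clone_flags, p);	/* mm_users++ when CLONE_VM is set */
	/* ... */
	list_add_tail_rcu(&p->tasks, &init_task.tasks);	/* child becomes visible */
	up_read(&oom_adj_lock);
	/* ... */
}

__set_oom_adj(/* ... */)
{
	/* ... */
	down_write(&oom_adj_lock);	/* excludes the whole window above */
	for_each_process(p) {
		/* update oom_score_adj of every thread group sharing mm */
	}
	up_write(&oom_adj_lock);
	/* ... */
}
```

Concurrent clone() callers can still proceed in parallel (they only take the
lock for reading); only a writer to oom_score_adj serializes against the
copy_signal()-to-tasklist window.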


RE: Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm

2018-10-08 Thread Yong-Taek Lee
>On 2018/10/08 10:19, Yong-Taek Lee wrote:
>> @@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>> struct mm_struct *mm = NULL;
>> struct task_struct *task;
>> int err = 0;
>> +   int mm_users = 0;
>>
>> task = get_proc_task(file_inode(file));
>> if (!task)
>> @@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>> struct task_struct *p = find_lock_task_mm(task);
>>
>> if (p) {
>> -   if (atomic_read(&p->mm->mm_users) > 1) {
>> +   mm_users = atomic_read(&p->mm->mm_users);
>> +   if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {
>
> How can this work (even before this patch)? When clone(CLONE_VM without 
> CLONE_THREAD/CLONE_SIGHAND)
> is requested, copy_process() calls copy_signal() in order to copy 
> sig->oom_score_adj and
> sig->oom_score_adj_min before calling copy_mm() in order to increment 
> mm->mm_users, doesn't it?
> Then, we will get two different "struct signal_struct" with different 
> oom_score_adj/oom_score_adj_min
> but one "struct mm_struct" shared by two thread groups.
>

Are you talking about the race between __set_oom_adj() and copy_process()?
If so, I agree. __set_oom_adj() cannot set oom_score_adj properly for the
copied process if it reads mm_users before copy_process() calls copy_mm()
(which runs after copy_signal()). Please correct me if I misunderstood
anything.

>> mm = p->mm;
>> atomic_inc(&mm->mm_count);
>> }
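The interleaving being discussed, spelled out step by step (simplified; only
the relevant steps of each path are shown):

```
  clone(CLONE_VM) caller              writer of /proc/<pid>/oom_score_adj
  ----------------------              -----------------------------------
  copy_process()
    copy_signal()    // child gets its own
                     // signal_struct with a
                     // copy of oom_score_adj
                                      __set_oom_adj()
                                        // for_each_process() cannot see the
                                        // child yet, and mm_users is still 1,
                                        // so only the parent is updated
    copy_mm()        // mm_users++
    attach to tasklist                // child visible from now on
  // result: two thread groups share one mm with different oom_score_adj
```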



Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm

2018-10-07 Thread Tetsuo Handa
On 2018/10/08 10:19, Yong-Taek Lee wrote:
> @@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> struct mm_struct *mm = NULL;
> struct task_struct *task;
> int err = 0;
> +   int mm_users = 0;
> 
> task = get_proc_task(file_inode(file));
> if (!task)
> @@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> struct task_struct *p = find_lock_task_mm(task);
> 
> if (p) {
> -   if (atomic_read(&p->mm->mm_users) > 1) {
> +   mm_users = atomic_read(&p->mm->mm_users);
> +   if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {

How can this work (even before this patch)? When clone(CLONE_VM without 
CLONE_THREAD/CLONE_SIGHAND)
is requested, copy_process() calls copy_signal() in order to copy 
sig->oom_score_adj and
sig->oom_score_adj_min before calling copy_mm() in order to increment 
mm->mm_users, doesn't it?
Then, we will get two different "struct signal_struct" with different 
oom_score_adj/oom_score_adj_min
but one "struct mm_struct" shared by two thread groups.

> mm = p->mm;
> atomic_inc(&mm->mm_count);
> }



[PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm

2018-10-07 Thread Yong-Taek Lee
The for_each_process() loop was introduced by commit 44a70adec910 ("mm, oom_adj:
make sure processes sharing mm have same view of oom_score_adj"). For most user
processes mm_users is bigger than 1 even though there is only one thread group.
In this case the for_each_process() loop pointlessly searches for processes
sharing the same mm even though no other thread group shares it.

My observation is that the target task's thread count is smaller than mm_users
if more thread groups share the same mm. So we can skip the loop when mm_users
and the thread count are equal.

test result:
while true; do count=0; time while [ $count -lt 1 ]; do echo -1000 > /proc/1457/oom_score_adj; count=$((count+1)); done; done;

before patch
0m00.59s real 0m00.09s user 0m00.51s system
0m00.59s real 0m00.14s user 0m00.45s system
0m00.58s real 0m00.11s user 0m00.47s system
0m00.58s real 0m00.10s user 0m00.48s system
0m00.59s real 0m00.11s user 0m00.48s system

after patch
0m00.15s real 0m00.07s user 0m00.08s system
0m00.14s real 0m00.10s user 0m00.04s system
0m00.14s real 0m00.10s user 0m00.05s system
0m00.14s real 0m00.08s user 0m00.07s system
0m00.14s real 0m00.08s user 0m00.07s system

Signed-off-by: Lee YongTaek 
---
 fs/proc/base.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index f9f72aee6d45..54b2fb5e9c51 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
struct mm_struct *mm = NULL;
struct task_struct *task;
int err = 0;
+   int mm_users = 0;

task = get_proc_task(file_inode(file));
if (!task)
@@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
struct task_struct *p = find_lock_task_mm(task);

if (p) {
-   if (atomic_read(&p->mm->mm_users) > 1) {
+   mm_users = atomic_read(&p->mm->mm_users);
+   if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {
mm = p->mm;
atomic_inc(&mm->mm_count);
}
--

