Re: [PATCH] mm, oom: Fix race when selecting process to kill

2013-11-07 Thread Sameer Nanda
On Wed, Nov 6, 2013 at 4:35 PM, David Rientjes wrote: > On Wed, 6 Nov 2013, Sameer Nanda wrote: > >> David -- I think we can make the duration that the tasklist_lock is >> held smaller by consolidating the process selection logic that is >> currently split across select_bad_process and

Re: [PATCH] mm, oom: Fix race when selecting process to kill

2013-11-07 Thread Sameer Nanda
On Wed, Nov 6, 2013 at 4:35 PM, David Rientjes rient...@google.com wrote: On Wed, 6 Nov 2013, Sameer Nanda wrote: David -- I think we can make the duration that the tasklist_lock is held smaller by consolidating the process selection logic that is currently split across select_bad_process and

Re: [PATCH] mm, oom: Fix race when selecting process to kill

2013-11-06 Thread David Rientjes
On Wed, 6 Nov 2013, Sameer Nanda wrote: > David -- I think we can make the duration that the tasklist_lock is > held smaller by consolidating the process selection logic that is > currently split across select_bad_process and oom_kill_process into > one place in select_bad_process. The

Re: [PATCH] mm, oom: Fix race when selecting process to kill

2013-11-06 Thread Sameer Nanda
(adding back context from thread history) On Tue, Nov 5, 2013 at 5:18 PM, David Rientjes wrote: > On Tue, 5 Nov 2013, Sameer Nanda wrote: > >> The selection of the process to be killed happens in two spots -- first >> in select_bad_process and then a further refinement by looking for >> child

Re: [PATCH] mm, oom: Fix race when selecting process to kill

2013-11-06 Thread Sameer Nanda
(adding back context from thread history) On Tue, Nov 5, 2013 at 5:18 PM, David Rientjes rient...@google.com wrote: On Tue, 5 Nov 2013, Sameer Nanda wrote: The selection of the process to be killed happens in two spots -- first in select_bad_process and then a further refinement by looking for

Re: [PATCH] mm, oom: Fix race when selecting process to kill

2013-11-06 Thread David Rientjes
On Wed, 6 Nov 2013, Sameer Nanda wrote: David -- I think we can make the duration that the tasklist_lock is held smaller by consolidating the process selection logic that is currently split across select_bad_process and oom_kill_process into one place in select_bad_process. The tasklist_lock

Re: [PATCH] mm, oom: Fix race when selecting process to kill

2013-11-05 Thread Luigi Semenzato
Regarding other fixes: would it be possible to have the thread iterator insert a dummy marker element in the thread list before the scan? There would be one such dummy element per CPU, so that multiple CPUs can scan the list in parallel. The loop would skip such elements, and each dummy element

Re: [PATCH] mm, oom: Fix race when selecting process to kill

2013-11-05 Thread Luigi Semenzato
It's interesting that this was known for 3+ years, but nobody bothered adding a small warning to the code. We noticed this because it's actually happening on Chromebooks in the field. We try to minimize OOM kills, but we can deal with them. Of course, a hung kernel we cannot deal with. On Tue,

Re: [PATCH] mm, oom: Fix race when selecting process to kill

2013-11-05 Thread Vladimir Murzin
Ccing Oleg and Sergey On Tue, Nov 05, 2013 at 05:27:29PM -0800, David Rientjes wrote: > On Tue, 5 Nov 2013, Luigi Semenzato wrote: > > > It's not enough to hold a reference to the task struct, because it can > > still be taken out of the circular list of threads. The RCU > > assumptions don't

Re: [PATCH] mm, oom: Fix race when selecting process to kill

2013-11-05 Thread David Rientjes
On Tue, 5 Nov 2013, Luigi Semenzato wrote: > It's not enough to hold a reference to the task struct, because it can > still be taken out of the circular list of threads. The RCU > assumptions don't hold in that case. > Could you please post a proper bug report that isolates this at the cause?

Re: [PATCH] mm, oom: Fix race when selecting process to kill

2013-11-05 Thread Luigi Semenzato
It's not enough to hold a reference to the task struct, because it can still be taken out of the circular list of threads. The RCU assumptions don't hold in that case. On Tue, Nov 5, 2013 at 5:18 PM, David Rientjes wrote: > On Tue, 5 Nov 2013, Sameer Nanda wrote: > >> The selection of the

Re: [PATCH] mm, oom: Fix race when selecting process to kill

2013-11-05 Thread David Rientjes
On Tue, 5 Nov 2013, Sameer Nanda wrote: > The selection of the process to be killed happens in two spots -- first > in select_bad_process and then a further refinement by looking for > child processes in oom_kill_process. Since this is a two step process, > it is possible that the process

[PATCH] mm, oom: Fix race when selecting process to kill

2013-11-05 Thread Sameer Nanda
The selection of the process to be killed happens in two spots -- first in select_bad_process and then a further refinement by looking for child processes in oom_kill_process. Since this is a two step process, it is possible that the process selected by select_bad_process may get a SIGKILL just

[PATCH] mm, oom: Fix race when selecting process to kill

2013-11-05 Thread Sameer Nanda
The selection of the process to be killed happens in two spots -- first in select_bad_process and then a further refinement by looking for child processes in oom_kill_process. Since this is a two step process, it is possible that the process selected by select_bad_process may get a SIGKILL just

Re: [PATCH] mm, oom: Fix race when selecting process to kill

2013-11-05 Thread David Rientjes
On Tue, 5 Nov 2013, Sameer Nanda wrote: The selection of the process to be killed happens in two spots -- first in select_bad_process and then a further refinement by looking for child processes in oom_kill_process. Since this is a two step process, it is possible that the process selected by

Re: [PATCH] mm, oom: Fix race when selecting process to kill

2013-11-05 Thread Luigi Semenzato
It's not enough to hold a reference to the task struct, because it can still be taken out of the circular list of threads. The RCU assumptions don't hold in that case. On Tue, Nov 5, 2013 at 5:18 PM, David Rientjes rient...@google.com wrote: On Tue, 5 Nov 2013, Sameer Nanda wrote: The

Re: [PATCH] mm, oom: Fix race when selecting process to kill

2013-11-05 Thread David Rientjes
On Tue, 5 Nov 2013, Luigi Semenzato wrote: It's not enough to hold a reference to the task struct, because it can still be taken out of the circular list of threads. The RCU assumptions don't hold in that case. Could you please post a proper bug report that isolates this at the cause?

Re: [PATCH] mm, oom: Fix race when selecting process to kill

2013-11-05 Thread Vladimir Murzin
Ccing Oleg and Sergey On Tue, Nov 05, 2013 at 05:27:29PM -0800, David Rientjes wrote: On Tue, 5 Nov 2013, Luigi Semenzato wrote: It's not enough to hold a reference to the task struct, because it can still be taken out of the circular list of threads. The RCU assumptions don't hold in

Re: [PATCH] mm, oom: Fix race when selecting process to kill

2013-11-05 Thread Luigi Semenzato
It's interesting that this was known for 3+ years, but nobody bothered adding a small warning to the code. We noticed this because it's actually happening on Chromebooks in the field. We try to minimize OOM kills, but we can deal with them. Of course, a hung kernel we cannot deal with. On Tue,

Re: [PATCH] mm, oom: Fix race when selecting process to kill

2013-11-05 Thread Luigi Semenzato
Regarding other fixes: would it be possible to have the thread iterator insert a dummy marker element in the thread list before the scan? There would be one such dummy element per CPU, so that multiple CPUs can scan the list in parallel. The loop would skip such elements, and each dummy element