Re: [PATCH] intel_pstate: track and export frequency residency stats via sysfs.

2014-09-10 Thread Sameer Nanda
On Wed, Sep 10, 2014 at 5:04 PM, Rafael J. Wysocki  wrote:
> On Wednesday, September 10, 2014 04:39:05 PM Anup Chenthamarakshan wrote:
>> On Thu, Sep 11, 2014 at 12:49:48AM +0200, Rafael J. Wysocki wrote:
>> > On Wednesday, September 10, 2014 03:15:08 PM Anup Chenthamarakshan wrote:
>> > >
>> > > Tools like powertop and turbostat are not present by default on all 
>> > > systems,
>> > > so it is not always possible to use them :(
>> >
>> > Which systems are you referring to in particular?
>>
>> We're testing on Chrome OS devices (Chromebooks).
>
> How big of a deal is it to install the tools mentioned above on such a system?
>
> At least turbostat is shipped with the kernel source.

Given the web-browser-based front end of Chrome OS, installing these
tools will only get us so far -- if the system is in developer mode
the tools are accessible, but when the system is in normal (verified
boot) mode these tools cannot be launched directly.

We are in the process of switching Chrome OS x86 kernels from the
ondemand governor to intel_pstate.  When debugging power consumption
issues, losing the ability to easily get CPU frequency-related
information as a side effect of this switch is less than ideal.

We are happy to spin this patch to expose aperf/mperf based CPU
frequency information if you think that is the better route to take
longer term.

>
> --
> I speak only for myself.
> Rafael J. Wysocki, Intel Open Source Technology Center.



-- 
Sameer
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Fix race between oom kill and task exit

2013-12-06 Thread Sameer Nanda
On Fri, Dec 6, 2013 at 7:19 AM, Oleg Nesterov  wrote:
> On 12/05, David Rientjes wrote:
>>
>> On Thu, 5 Dec 2013, Oleg Nesterov wrote:
>>
>> > > Your v2 series looks good and I suspect anybody trying them doesn't have
>> > > additional reports of the infinite loop?  Should they be marked for
>> > > stable?
>> >
>> > Unlikely...
>> >
>> > I think the patch from Sameer makes more sense for stable as a temporary
>> > (and obviously incomplete) fix.
>>
>> There's a problem because none of this is currently even in linux-next.  I
>> think we could make a case for getting Sameer's patch at
>> http://marc.info/?l=linux-kernel&m=138436313021133 to be merged for
>> stable,
>
> Probably.
>
> Ah, I just noticed that this change
>
> -   if (p->flags & PF_EXITING) {
> +   if (p->flags & PF_EXITING || !pid_alive(p)) {
>
> is not needed. !pid_alive(p) obviously implies PF_EXITING.

Ah right.

>
>> but then we'd have to revert it in linux-next
>
> Or perhaps Sameer can just send his fix to stable/gregkh.
>
> Just the changelog should clearly explain that this is the minimal
> workaround for stable. Once again it doesn't (and can't) fix all
> problems even in oom_kill_process() paths, but it helps anyway to
> avoid the easy-to-trigger hang.

I don't mind doing that if that seems to be the consensus.  FWIW, I've
already added my patch to the Chrome OS kernel repo.

>
>> before merging your
>> series at http://marc.info/?l=linux-kernel&m=138616217925981.
>
> Just in case, I won't mind to rediff my patches on top of Sameer's
> patch and then add git-revert patch.
>
>> All of the
>> issues you present in that series seem to be stable material, so why not
>> just go ahead with your series and mark it for stable for 3.13?
>
> OK... I can do this too.
>
> I do not really like this because it adds thread_head/node but doesn't
> remove the old ->thread_group. We will do this later, but obviously
> this is not the stable material.
>
> IOW, if we send this to stable, thread_head/node/for_each_thread will
> be only used by oom_kill.c.
>
> And this is risky. For example, 1/4 depends on (at least) another patch
> I sent in preparation for this change, commit 81907739851
> "kernel/fork.c:copy_process(): don't add the uninitialized
> child to thread/task/pid lists", perhaps on something else.
>
> So personally I'd prefer to simply send the workaround for stable.
>
> Oleg.
>



-- 
Sameer


Re: [PATCH 1/2] introduce for_each_thread() to replace the buggy while_each_thread()

2013-12-03 Thread Sameer Nanda
On Mon, Dec 2, 2013 at 7:24 AM, Oleg Nesterov  wrote:
> while_each_thread() and next_thread() should die, almost every
> lockless usage is wrong.
>
> 1. Unless g == current, the lockless while_each_thread() is not safe.
>
>while_each_thread(g, t) can loop forever if g exits, next_thread()
>can't reach the unhashed thread in this case. Note that this can
>happen even if g is the group leader, it can exec.
>
> 2. Even if while_each_thread() itself was correct, people often use
>it wrongly.
>
>It was never safe to just take rcu_read_lock() and loop unless
>you verify that pid_alive(g) == T, even the first next_thread()
>can point to the already freed/reused memory.
>
> This patch adds signal_struct->thread_head and task->thread_node
> to create the normal rcu-safe list with the stable head. The new
> for_each_thread(g, t) helper is always safe under rcu_read_lock()
> as long as this task_struct can't go away.
>
> Note: of course it is ugly to have both task_struct->thread_node
> and the old task_struct->thread_group, we will kill it later, after
> we change the users of while_each_thread() to use for_each_thread().
>
> Perhaps we can kill it even before we convert all users, we can
> reimplement next_thread(t) using the new thread_head/thread_node.
> But we can't do this right now because this will lead to subtle
> behavioural changes. For example, do/while_each_thread() always
> sees at least one task, while for_each_thread() can do nothing if
> the whole thread group has died. Or thread_group_empty(), currently
> its semantics is not clear unless thread_group_leader(p) and we
> need to audit the callers before we can change it.
>
> So this patch adds the new interface which has to coexist with the
> old one for some time, hopefully the next changes will be more or
> less straightforward and the old one will go away soon.
>
> Signed-off-by: Oleg Nesterov 
Reviewed-by: Sameer Nanda 

> ---
>  include/linux/init_task.h |2 ++
>  include/linux/sched.h |   12 
>  kernel/exit.c |1 +
>  kernel/fork.c |7 +++
>  4 files changed, 22 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index b0ed422..668aae0 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -40,6 +40,7 @@ extern struct fs_struct init_fs;
>
>  #define INIT_SIGNALS(sig) {\
> .nr_threads = 1,\
> +   .thread_head= LIST_HEAD_INIT(init_task.thread_node),\
> .wait_chldexit  = __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
> .shared_pending = { \
> .list = LIST_HEAD_INIT(sig.shared_pending.list),\
> @@ -213,6 +214,7 @@ extern struct task_group root_task_group;
> [PIDTYPE_SID]  = INIT_PID_LINK(PIDTYPE_SID),\
> },  \
> .thread_group   = LIST_HEAD_INIT(tsk.thread_group), \
> +   .thread_node= LIST_HEAD_INIT(init_signals.thread_head), \
> INIT_IDS\
> INIT_PERF_EVENTS(tsk)   \
> INIT_TRACE_IRQFLAGS \
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index dc48056..0550083 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -487,6 +487,7 @@ struct signal_struct {
> atomic_tsigcnt;
> atomic_tlive;
> int nr_threads;
> +   struct list_headthread_head;
>
> wait_queue_head_t   wait_chldexit;  /* for wait4() */
>
> @@ -1162,6 +1163,7 @@ struct task_struct {
> /* PID/PID hash table linkage. */
> struct pid_link pids[PIDTYPE_MAX];
> struct list_head thread_group;
> +   struct list_head thread_node;
>
> struct completion *vfork_done;  /* for vfork() */
> int __user *set_child_tid;  /* CLONE_CHILD_SETTID */
> @@ -2225,6 +2227,16 @@ extern bool current_is_single_threaded(void);
>  #define while_each_thread(g, t) \
> while ((t = next_thread(t)) != g)
>
> +#define __for_each_thread(signal, t)   \
> +   list_for_each_entry_rcu(t, &(signal)->thread_head, thread_node)
> +
> +#define for_each_thread(p, t)  \
> +   __for_each_thread((p)->signal, t)
> +
> +/* Careful: this is a double loop, 'break' won't work */
> +#define for_each_process_thread(p, t)  \
> +   for_each_process(p) for_each_thread(p, t)

Re: [PATCH 2/2] oom_kill: change oom_kill.c to use for_each_thread()

2013-12-03 Thread Sameer Nanda
Thanks for helping get this fixed!  A couple of comments below.

On Mon, Dec 2, 2013 at 7:24 AM, Oleg Nesterov  wrote:
> 1. Change oom_kill.c to use for_each_thread() rather than the racy
>while_each_thread() which can loop forever if we race with exit.
>
>Note also that most users were buggy even if while_each_thread()
>was fine, the task can exit even _before_ rcu_read_lock().
>
>Fortunately the new for_each_thread() only requires the stable
>task_struct, so this change fixes both problems.
>
> 2. At least out_of_memory() calls has_intersects_mems_allowed()
>without even rcu_read_lock(), this is obviously buggy.
>
>Add the necessary rcu_read_lock(). This means that we can not
>simply return from the loop, we need "bool ret" and "break".
>
> Signed-off-by: Oleg Nesterov 
Reviewed-by: Sameer Nanda 

> ---
>  mm/oom_kill.c |   37 -
>  1 files changed, 20 insertions(+), 17 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1e4a600..47dd4ce 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -54,12 +54,14 @@ static DEFINE_SPINLOCK(zone_scan_lock);
>   * shares the same mempolicy nodes as current if it is bound by such a policy
>   * and whether or not it has the same set of allowed cpuset nodes.
>   */
> -static bool has_intersects_mems_allowed(struct task_struct *tsk,
> +static bool has_intersects_mems_allowed(struct task_struct *start,
The comment block needs to be fixed up due to the variable name change
from tsk to start.

> const nodemask_t *mask)
>  {
> -   struct task_struct *start = tsk;
> +   struct task_struct *tsk;
> +   bool ret = false;
>
> -   do {
> +   rcu_read_lock();
> +   for_each_thread(start, tsk) {
> if (mask) {
> /*
>  * If this is a mempolicy constrained oom, tsk's
> @@ -67,19 +69,20 @@ static bool has_intersects_mems_allowed(struct 
> task_struct *tsk,
>  * mempolicy intersects current, otherwise it may be
>  * needlessly killed.
>  */
> -   if (mempolicy_nodemask_intersects(tsk, mask))
> -   return true;
> +   ret = mempolicy_nodemask_intersects(tsk, mask);
> } else {
> /*
>  * This is not a mempolicy constrained oom, so only
>  * check the mems of tsk's cpuset.
>  */
> -   if (cpuset_mems_allowed_intersects(current, tsk))
> -   return true;
> +   ret = cpuset_mems_allowed_intersects(current, tsk);
> }
> -   } while_each_thread(start, tsk);
> +   if (ret)
> +   break;
> +   }
> +   rcu_read_unlock();
>
> -   return false;
> +   return ret;
>  }
>  #else
>  static bool has_intersects_mems_allowed(struct task_struct *tsk,
> @@ -97,14 +100,14 @@ static bool has_intersects_mems_allowed(struct 
> task_struct *tsk,
>   */
>  struct task_struct *find_lock_task_mm(struct task_struct *p)
>  {
> -   struct task_struct *t = p;
> +   struct task_struct *t;
>
> -   do {
> +   for_each_thread(p, t) {
> task_lock(t);
> if (likely(t->mm))
> return t;
> task_unlock(t);
> -   } while_each_thread(p, t);
> +   }
>
> return NULL;
>  }
> @@ -301,7 +304,7 @@ static struct task_struct *select_bad_process(unsigned 
> int *ppoints,
> unsigned long chosen_points = 0;
>
> rcu_read_lock();
> -   do_each_thread(g, p) {
> +   for_each_process_thread(g, p) {
> unsigned int points;
>
> switch (oom_scan_process_thread(p, totalpages, nodemask,
> @@ -323,7 +326,7 @@ static struct task_struct *select_bad_process(unsigned 
> int *ppoints,
> chosen = p;
> chosen_points = points;
> }
> -   } while_each_thread(g, p);
> +   }
> if (chosen)
> get_task_struct(chosen);
> rcu_read_unlock();
> @@ -406,7 +409,7 @@ void oom_kill_process(struct task_struct *p, gfp_t 
> gfp_mask, int order,
>  {
> struct task_struct *victim = p;
> struct task_struct *child;
> -   struct task_struct *t = p;
> +   struct task_struct *t;
> struct mm_struct *mm;
> 

Re: [PATCH 3/3] signals: change do_signal_stop/do_sigaction to use while_each_thread()

2013-11-22 Thread Sameer Nanda
On Fri, Nov 22, 2013 at 11:18 AM, Oleg Nesterov  wrote:
> Change do_signal_stop() and do_sigaction() to avoid next_thread()
> and use while_each_thread() instead.
>
> Signed-off-by: Oleg Nesterov 
Reviewed-by: Sameer Nanda 

> ---
>  kernel/signal.c |7 +++
>  1 files changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index ded28b9..086071d 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2047,8 +2047,8 @@ static bool do_signal_stop(int signr)
> if (task_set_jobctl_pending(current, signr | gstop))
> sig->group_stop_count++;
>
> -   for (t = next_thread(current); t != current;
> -t = next_thread(t)) {
> +   t = current;
> +   while_each_thread(current, t) {
> /*
>  * Setting state to TASK_STOPPED for a group
>  * stop is always done with the siglock held,
> @@ -3125,8 +3125,7 @@ int do_sigaction(int sig, struct k_sigaction *act, 
> struct k_sigaction *oact)
> rm_from_queue_full(&mask, &t->signal->shared_pending);
> do {
> rm_from_queue_full(&mask, &t->pending);
> -   t = next_thread(t);
> -   } while (t != current);
> +   } while_each_thread(current, t);
> }
> }
>
> --
> 1.5.5.1
>



-- 
Sameer


Re: [PATCH 2/3] k_getrusage() can use while_each_thread()

2013-11-22 Thread Sameer Nanda
On Fri, Nov 22, 2013 at 11:18 AM, Oleg Nesterov  wrote:
> Change k_getrusage() to use while_each_thread(), no changes in
> the compiled code.
>
> Signed-off-by: Oleg Nesterov 
Reviewed-by: Sameer Nanda 

> ---
>  kernel/sys.c |3 +--
>  1 files changed, 1 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 1d9a218..70d56d6 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1572,8 +1572,7 @@ static void k_getrusage(struct task_struct *p, int who, 
> struct rusage *r)
> t = p;
> do {
> accumulate_thread_rusage(t, r);
> -   t = next_thread(t);
> -   } while (t != p);
> +   } while_each_thread(p, t);
> break;
>
> default:
> --
> 1.5.5.1
>


Re: [PATCH 1/3] proc: change do_task_stat() to use while_each_thread()

2013-11-22 Thread Sameer Nanda
On Fri, Nov 22, 2013 at 11:18 AM, Oleg Nesterov  wrote:
> do_task_stat() can use while_each_thread(), no changes in
> the compiled code.
>
> Signed-off-by: Oleg Nesterov 
Reviewed-by: Sameer Nanda 

> ---
>  fs/proc/array.c |3 +--
>  1 files changed, 1 insertions(+), 2 deletions(-)
>
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index c995807..c59e74f 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -442,8 +442,7 @@ static int do_task_stat(struct seq_file *m, struct 
> pid_namespace *ns,
> min_flt += t->min_flt;
> maj_flt += t->maj_flt;
> gtime += task_gtime(t);
> -   t = next_thread(t);
> -   } while (t != task);
> +   } while_each_thread(task, t);
>
> min_flt += sig->min_flt;
> maj_flt += sig->maj_flt;
> --
> 1.5.5.1
>


Re: [PATCH v6] mm, oom: Fix race when selecting process to kill

2013-11-14 Thread Sameer Nanda
On Thu, Nov 14, 2013 at 5:43 AM, dserrg  wrote:
> (sorry for html)
>
> Why do we even bother with locking?
> Why not just merge my original patch? (The link is in Vladimir's message)
> It provides much more elegant (and working!) solution for this problem.

As Oleg alluded to in that thread, that patch makes the race window
smaller but doesn't close it completely.  Imagine a SIGKILL sent to
the task p immediately after the fatal_signal_pending check.  In that
case, the infinite loop in while_each_thread will still happen, since
__unhash_process would delete the task p from the thread_group list
while the while_each_thread loop is in progress on another CPU.  This
is precisely why we need to hold read_lock(&tasklist_lock) _before_
checking the state of the process p and entering the while_each_thread
loop.

> David, how did you miss it in the first place?
>
> Oh.. and by the way. I was hitting the same bug in other
> while_each_thread loops in oom_kill.c.

> Anyway, goodluck ;)

Thanks!

>
> On Nov 14, 2013 at 2:18, "Sameer Nanda"  wrote:
>
>> The selection of the process to be killed happens in two spots:
>> first in select_bad_process and then a further refinement by
>> looking for child processes in oom_kill_process. Since this is
>> a two step process, it is possible that the process selected by
>> select_bad_process may get a SIGKILL just before oom_kill_process
>> executes. If this were to happen, __unhash_process deletes this
>> process from the thread_group list. This results in oom_kill_process
>> getting stuck in an infinite loop when traversing the thread_group
>> list of the selected process.
>>
>> Fix this race by adding a pid_alive check for the selected process
>> with tasklist_lock held in oom_kill_process.
>>
>> Signed-off-by: Sameer Nanda 
>> ---
>>  include/linux/sched.h |  5 +
>>  mm/oom_kill.c | 34 +-
>>  2 files changed, 26 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index e27baee..8975dbb 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2156,6 +2156,11 @@ extern bool current_is_single_threaded(void);
>>  #define do_each_thread(g, t) \
>> for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; )
>> do
>>
>> +/*
>> + * Careful: while_each_thread is not RCU safe. Callers should hold
>> + * read_lock(tasklist_lock) across while_each_thread loops.
>> + */
>> +
>>  #define while_each_thread(g, t) \
>> while ((t = next_thread(t)) != g)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 6738c47..0d1f804 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -412,31 +412,33 @@ void oom_kill_process(struct task_struct *p, gfp_t
>> gfp_mask, int order,
>> static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>>   DEFAULT_RATELIMIT_BURST);
>>
>> +   if (__ratelimit(&oom_rs))
>> +   dump_header(p, gfp_mask, order, memcg, nodemask);
>> +
>> +   task_lock(p);
>> +   pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
>> +   message, task_pid_nr(p), p->comm, points);
>> +   task_unlock(p);
>> +
>> +   read_lock(&tasklist_lock);
>> +
>> /*
>>  * If the task is already exiting, don't alarm the sysadmin or
>> kill
>>  * its children or threads, just set TIF_MEMDIE so it can die
>> quickly
>>  */
>> -   if (p->flags & PF_EXITING) {
>> +   if (p->flags & PF_EXITING || !pid_alive(p)) {
>> set_tsk_thread_flag(p, TIF_MEMDIE);
>> put_task_struct(p);
>> +   read_unlock(&tasklist_lock);
>> return;
>> }
>>
>> -   if (__ratelimit(&oom_rs))
>> -   dump_header(p, gfp_mask, order, memcg, nodemask);
>> -
>> -   task_lock(p);
>> -   pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
>> -   message, task_pid_nr(p), p->comm, points);
>> -   task_unlock(p);
>> -
>> /*
>>  * If any of p's children has a different mm and is eligible for
>> kill,
>>  * the one with the highest oom_badness() score is sacrificed for
>> its
>>  * parent.  This attempts to lose the minimal amount of work done
>

[PATCH v6] mm, oom: Fix race when selecting process to kill

2013-11-13 Thread Sameer Nanda
The selection of the process to be killed happens in two spots:
first in select_bad_process and then a further refinement by
looking for child processes in oom_kill_process. Since this is
a two step process, it is possible that the process selected by
select_bad_process may get a SIGKILL just before oom_kill_process
executes. If this were to happen, __unhash_process deletes this
process from the thread_group list. This results in oom_kill_process
getting stuck in an infinite loop when traversing the thread_group
list of the selected process.

Fix this race by adding a pid_alive check for the selected process
with tasklist_lock held in oom_kill_process.

Signed-off-by: Sameer Nanda 
---
 include/linux/sched.h |  5 +
 mm/oom_kill.c | 34 +-
 2 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e27baee..8975dbb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2156,6 +2156,11 @@ extern bool current_is_single_threaded(void);
 #define do_each_thread(g, t) \
for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do
 
+/*
+ * Careful: while_each_thread is not RCU safe. Callers should hold
+ * read_lock(tasklist_lock) across while_each_thread loops.
+ */
+
 #define while_each_thread(g, t) \
while ((t = next_thread(t)) != g)
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6738c47..0d1f804 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -412,31 +412,33 @@ void oom_kill_process(struct task_struct *p, gfp_t 
gfp_mask, int order,
static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
  DEFAULT_RATELIMIT_BURST);
 
+   if (__ratelimit(&oom_rs))
+   dump_header(p, gfp_mask, order, memcg, nodemask);
+
+   task_lock(p);
+   pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
+   message, task_pid_nr(p), p->comm, points);
+   task_unlock(p);
+
+   read_lock(&tasklist_lock);
+
/*
 * If the task is already exiting, don't alarm the sysadmin or kill
 * its children or threads, just set TIF_MEMDIE so it can die quickly
 */
-   if (p->flags & PF_EXITING) {
+   if (p->flags & PF_EXITING || !pid_alive(p)) {
set_tsk_thread_flag(p, TIF_MEMDIE);
put_task_struct(p);
+   read_unlock(&tasklist_lock);
return;
}
 
-   if (__ratelimit(&oom_rs))
-   dump_header(p, gfp_mask, order, memcg, nodemask);
-
-   task_lock(p);
-   pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
-   message, task_pid_nr(p), p->comm, points);
-   task_unlock(p);
-
/*
 * If any of p's children has a different mm and is eligible for kill,
 * the one with the highest oom_badness() score is sacrificed for its
 * parent.  This attempts to lose the minimal amount of work done while
 * still freeing memory.
 */
-   read_lock(&tasklist_lock);
do {
list_for_each_entry(child, &t->children, sibling) {
unsigned int child_points;
@@ -456,12 +458,17 @@ void oom_kill_process(struct task_struct *p, gfp_t 
gfp_mask, int order,
}
}
} while_each_thread(p, t);
-   read_unlock(&tasklist_lock);
 
-   rcu_read_lock();
p = find_lock_task_mm(victim);
+
+   /*
+* Since while_each_thread is currently not RCU safe, this unlock of
+* tasklist_lock may need to be moved further down if any additional
+* while_each_thread loops get added to this function.
+*/
+   read_unlock(&tasklist_lock);
+
if (!p) {
-   rcu_read_unlock();
put_task_struct(victim);
return;
} else if (victim != p) {
@@ -487,6 +494,7 @@ void oom_kill_process(struct task_struct *p, gfp_t 
gfp_mask, int order,
 * That thread will now get access to memory reserves since it has a
 * pending fatal signal.
 */
+   rcu_read_lock();
for_each_process(p)
if (p->mm == mm && !same_thread_group(p, victim) &&
!(p->flags & PF_KTHREAD)) {
-- 
1.8.4.1



Re: [PATCH v5] mm, oom: Fix race when selecting process to kill

2013-11-13 Thread Sameer Nanda
On Tue, Nov 12, 2013 at 6:33 PM, David Rientjes  wrote:
> On Tue, 12 Nov 2013, Sameer Nanda wrote:
>
>> The selection of the process to be killed happens in two spots:
>> first in select_bad_process and then a further refinement by
>> looking for child processes in oom_kill_process. Since this is
>> a two step process, it is possible that the process selected by
>> select_bad_process may get a SIGKILL just before oom_kill_process
>> executes. If this were to happen, __unhash_process deletes this
>> process from the thread_group list. This results in oom_kill_process
>> getting stuck in an infinite loop when traversing the thread_group
>> list of the selected process.
>>
>> Fix this race by adding a pid_alive check for the selected process
>> with tasklist_lock held in oom_kill_process.
>>
>> Change-Id: I62f9652a780863467a8174e18ea5e19bbcd78c31
>
> Is this needed?

No, it's not.  It's a side effect of using Chrome OS tools to manage
the patches.  I will make sure to remove it on the next version of the
patch.

>
>> Signed-off-by: Sameer Nanda 
>> ---
>>  mm/oom_kill.c | 42 +-
>>  1 file changed, 29 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 6738c47..5108c2b 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -412,31 +412,40 @@ void oom_kill_process(struct task_struct *p, gfp_t 
>> gfp_mask, int order,
>>   static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>> DEFAULT_RATELIMIT_BURST);
>>
>> + if (__ratelimit(&oom_rs))
>> + dump_header(p, gfp_mask, order, memcg, nodemask);
>> +
>> + task_lock(p);
>> + pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
>> + message, task_pid_nr(p), p->comm, points);
>> + task_unlock(p);
>> +
>> + /*
>> +  * while_each_thread is currently not RCU safe. Lets hold the
>> +  * tasklist_lock across all invocations of while_each_thread (including
>> +  * the one in find_lock_task_mm) in this function.
>> +  */
>> + read_lock(&tasklist_lock);
>> +
>>   /*
>>* If the task is already exiting, don't alarm the sysadmin or kill
>>* its children or threads, just set TIF_MEMDIE so it can die quickly
>>*/
>> - if (p->flags & PF_EXITING) {
>> + if (p->flags & PF_EXITING || !pid_alive(p)) {
>> + pr_info("%s: Not killing process %d, just setting 
>> TIF_MEMDIE\n",
>> + message, task_pid_nr(p));
>
> That makes no sense in the kernel log to have
>
> Out of Memory: Kill process 1234 (comm) score 50 or sacrifice child
> Out of Memory: Not killing process 1234, just setting TIF_MEMDIE
>
> Those are contradictory statements (and will actually mess with kernel log
> parsing at Google) and nobody other than kernel developers are going to
> know what TIF_MEMDIE is.

Since the "Kill process" printk has now moved above the (p->flags &
PF_EXITING || !pid_alive(p)) check, it is possible that
oom_kill_process will emit the "Kill process" message but will not
actually try to kill the process or its child.  The new "Not killing
process" printk helps disambiguate this case from when a process is
actually killed by oom_kill_process.  However, since you are finding
it confusing, let me remove it.

>
>>   set_tsk_thread_flag(p, TIF_MEMDIE);
>>   put_task_struct(p);
>> + read_unlock(&tasklist_lock);
>>   return;
>>   }
>>
>> - if (__ratelimit(&oom_rs))
>> - dump_header(p, gfp_mask, order, memcg, nodemask);
>> -
>> - task_lock(p);
>> - pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
>> - message, task_pid_nr(p), p->comm, points);
>> - task_unlock(p);
>> -
>>   /*
>>* If any of p's children has a different mm and is eligible for kill,
>>* the one with the highest oom_badness() score is sacrificed for its
>>* parent.  This attempts to lose the minimal amount of work done while
>>* still freeing memory.
>>*/
>> - read_lock(&tasklist_lock);
>>   do {
>>   list_for_each_entry(child, &t->children, sibling) {
>>   unsigned int child_points;
>> @@ -456,12 +465,17 @@ void oom_kill_process(stru

[PATCH v5] mm, oom: Fix race when selecting process to kill

2013-11-12 Thread Sameer Nanda
The selection of the process to be killed happens in two spots:
first in select_bad_process and then a further refinement by
looking for child processes in oom_kill_process. Since this is
a two step process, it is possible that the process selected by
select_bad_process may get a SIGKILL just before oom_kill_process
executes. If this were to happen, __unhash_process deletes this
process from the thread_group list. This results in oom_kill_process
getting stuck in an infinite loop when traversing the thread_group
list of the selected process.

Fix this race by adding a pid_alive check for the selected process
with tasklist_lock held in oom_kill_process.

Change-Id: I62f9652a780863467a8174e18ea5e19bbcd78c31
Signed-off-by: Sameer Nanda 
---
 mm/oom_kill.c | 42 +-
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6738c47..5108c2b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -412,31 +412,40 @@ void oom_kill_process(struct task_struct *p, gfp_t 
gfp_mask, int order,
static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
  DEFAULT_RATELIMIT_BURST);
 
+   if (__ratelimit(&oom_rs))
+   dump_header(p, gfp_mask, order, memcg, nodemask);
+
+   task_lock(p);
+   pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
+   message, task_pid_nr(p), p->comm, points);
+   task_unlock(p);
+
+   /*
+* while_each_thread is currently not RCU safe. Lets hold the
+* tasklist_lock across all invocations of while_each_thread (including
+* the one in find_lock_task_mm) in this function.
+*/
+   read_lock(&tasklist_lock);
+
/*
 * If the task is already exiting, don't alarm the sysadmin or kill
 * its children or threads, just set TIF_MEMDIE so it can die quickly
 */
-   if (p->flags & PF_EXITING) {
+   if (p->flags & PF_EXITING || !pid_alive(p)) {
+   pr_info("%s: Not killing process %d, just setting TIF_MEMDIE\n",
+   message, task_pid_nr(p));
set_tsk_thread_flag(p, TIF_MEMDIE);
put_task_struct(p);
+   read_unlock(&tasklist_lock);
return;
}
 
-   if (__ratelimit(&oom_rs))
-   dump_header(p, gfp_mask, order, memcg, nodemask);
-
-   task_lock(p);
-   pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
-   message, task_pid_nr(p), p->comm, points);
-   task_unlock(p);
-
/*
 * If any of p's children has a different mm and is eligible for kill,
 * the one with the highest oom_badness() score is sacrificed for its
 * parent.  This attempts to lose the minimal amount of work done while
 * still freeing memory.
 */
-   read_lock(&tasklist_lock);
do {
list_for_each_entry(child, &t->children, sibling) {
unsigned int child_points;
@@ -456,12 +465,17 @@ void oom_kill_process(struct task_struct *p, gfp_t 
gfp_mask, int order,
}
}
} while_each_thread(p, t);
-   read_unlock(&tasklist_lock);
 
-   rcu_read_lock();
p = find_lock_task_mm(victim);
+
+   /*
+* Since while_each_thread is currently not RCU safe, this unlock of
+* tasklist_lock may need to be moved further down if any additional
+* while_each_thread loops get added to this function.
+*/
+   read_unlock(&tasklist_lock);
+
if (!p) {
-   rcu_read_unlock();
put_task_struct(victim);
return;
} else if (victim != p) {
@@ -478,6 +492,8 @@ void oom_kill_process(struct task_struct *p, gfp_t 
gfp_mask, int order,
K(get_mm_counter(victim->mm, MM_FILEPAGES)));
task_unlock(victim);
 
+   rcu_read_lock();
+
/*
 * Kill all user processes sharing victim->mm in other thread groups, if
 * any.  They don't get access to memory reserves, though, to avoid
-- 
1.8.4.1



Re: [PATCH v4] mm, oom: Fix race when selecting process to kill

2013-11-12 Thread Sameer Nanda
On Tue, Nov 12, 2013 at 12:01 PM, Oleg Nesterov  wrote:
> On 11/11, Sameer Nanda wrote:
>>
>> The selection of the process to be killed happens in two spots:
>> first in select_bad_process and then a further refinement by
>> looking for child processes in oom_kill_process. Since this is
>> a two step process, it is possible that the process selected by
>> select_bad_process may get a SIGKILL just before oom_kill_process
>> executes. If this were to happen, __unhash_process deletes this
>> process from the thread_group list. This results in oom_kill_process
>> getting stuck in an infinite loop when traversing the thread_group
>> list of the selected process.
>>
>> Fix this race by adding a pid_alive check for the selected process
>> with tasklist_lock held in oom_kill_process.
>
> OK, looks correct to me. Thanks.
>
>
> Yes, this is a step backwards, hopefully we will revert this patch soon.
> I am starting to think something like while_each_thread_lame_but_safe()
> makes sense before we really fix this nasty (and afaics not simple)
> problem with while_each_thread() (which should die).

Looking forward to a real fix for the nasty problems with
while_each_thread.  In the meanwhile, let me float one more
(hopefully, the last) version of this patch that should address
Michal's concern.  Thanks for your feedback!

>
> Oleg.
>



-- 
Sameer


[PATCH v4] mm, oom: Fix race when selecting process to kill

2013-11-11 Thread Sameer Nanda
The selection of the process to be killed happens in two spots:
first in select_bad_process and then a further refinement by
looking for child processes in oom_kill_process. Since this is
a two step process, it is possible that the process selected by
select_bad_process may get a SIGKILL just before oom_kill_process
executes. If this were to happen, __unhash_process deletes this
process from the thread_group list. This results in oom_kill_process
getting stuck in an infinite loop when traversing the thread_group
list of the selected process.

Fix this race by adding a pid_alive check for the selected process
with tasklist_lock held in oom_kill_process.

Signed-off-by: Sameer Nanda 
---
 mm/oom_kill.c | 24 +++-
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6738c47..57638ef 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -413,12 +413,20 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
  DEFAULT_RATELIMIT_BURST);
 
/*
+* while_each_thread is currently not RCU safe. Lets hold the
+* tasklist_lock across all invocations of while_each_thread (including
+* the one in find_lock_task_mm) in this function.
+*/
+   read_lock(&tasklist_lock);
+
+   /*
 * If the task is already exiting, don't alarm the sysadmin or kill
 * its children or threads, just set TIF_MEMDIE so it can die quickly
 */
-   if (p->flags & PF_EXITING) {
+   if (p->flags & PF_EXITING || !pid_alive(p)) {
set_tsk_thread_flag(p, TIF_MEMDIE);
put_task_struct(p);
+   read_unlock(&tasklist_lock);
return;
}
 
@@ -436,7 +444,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 * parent.  This attempts to lose the minimal amount of work done while
 * still freeing memory.
 */
-   read_lock(&tasklist_lock);
do {
list_for_each_entry(child, &t->children, sibling) {
unsigned int child_points;
@@ -456,12 +463,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
}
}
} while_each_thread(p, t);
-   read_unlock(&tasklist_lock);
 
-   rcu_read_lock();
p = find_lock_task_mm(victim);
+
+   /*
+* Since while_each_thread is currently not RCU safe, this unlock of
+* tasklist_lock may need to be moved further down if any additional
+* while_each_thread loops get added to this function.
+*/
+   read_unlock(&tasklist_lock);
+
if (!p) {
-   rcu_read_unlock();
put_task_struct(victim);
return;
} else if (victim != p) {
@@ -478,6 +490,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
K(get_mm_counter(victim->mm, MM_FILEPAGES)));
task_unlock(victim);
 
+   rcu_read_lock();
+
/*
 * Kill all user processes sharing victim->mm in other thread groups, if
 * any.  They don't get access to memory reserves, though, to avoid
-- 
1.8.4.1



[PATCH v3] mm, oom: Fix race when selecting process to kill

2013-11-08 Thread Sameer Nanda
The selection of the process to be killed happens in two spots:
first in select_bad_process and then a further refinement by
looking for child processes in oom_kill_process. Since this is
a two step process, it is possible that the process selected by
select_bad_process may get a SIGKILL just before oom_kill_process
executes. If this were to happen, __unhash_process deletes this
process from the thread_group list. This results in oom_kill_process
getting stuck in an infinite loop when traversing the thread_group
list of the selected process.

Fix this race by adding a pid_alive check for the selected process
with tasklist_lock held in oom_kill_process.

Signed-off-by: Sameer Nanda 
---
 mm/oom_kill.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6738c47..7b28d9f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -413,12 +413,20 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
  DEFAULT_RATELIMIT_BURST);
 
/*
+* while_each_thread is currently not RCU safe. Lets hold the
+* tasklist_lock across all invocations of while_each_thread (including
+* the one in find_lock_task_mm) in this function.
+*/
+   read_lock(&tasklist_lock);
+
+   /*
 * If the task is already exiting, don't alarm the sysadmin or kill
 * its children or threads, just set TIF_MEMDIE so it can die quickly
 */
-   if (p->flags & PF_EXITING) {
+   if (p->flags & PF_EXITING || !pid_alive(p)) {
set_tsk_thread_flag(p, TIF_MEMDIE);
put_task_struct(p);
+   read_unlock(&tasklist_lock);
return;
}
 
@@ -436,7 +444,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 * parent.  This attempts to lose the minimal amount of work done while
 * still freeing memory.
 */
-   read_lock(&tasklist_lock);
do {
list_for_each_entry(child, &t->children, sibling) {
unsigned int child_points;
@@ -456,10 +463,18 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
}
}
} while_each_thread(p, t);
-   read_unlock(&tasklist_lock);
 
rcu_read_lock();
+
p = find_lock_task_mm(victim);
+
+   /*
+* Since while_each_thread is currently not RCU safe, this unlock of
+* tasklist_lock may need to be moved further down if any additional
+* while_each_thread loops get added to this function.
+*/
+   read_unlock(&tasklist_lock);
+
if (!p) {
rcu_read_unlock();
put_task_struct(victim);
-- 
1.8.4.1



[PATCH v2] mm, oom: Fix race when selecting process to kill

2013-11-08 Thread Sameer Nanda
The selection of the process to be killed happens in two spots:
first in select_bad_process and then a further refinement by
looking for child processes in oom_kill_process. Since this is
a two step process, it is possible that the process selected by
select_bad_process may get a SIGKILL just before oom_kill_process
executes. If this were to happen, __unhash_process deletes this
process from the thread_group list. This results in oom_kill_process
getting stuck in an infinite loop when traversing the thread_group
list of the selected process.

Fix this race by adding a pid_alive check for the selected process
with tasklist_lock held in oom_kill_process.

Change-Id: I865c64486ccfc0e4818e7045a8fa3353e2fb63f8
Signed-off-by: Sameer Nanda 
---
 mm/oom_kill.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6738c47..af99b1a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -412,13 +412,16 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
  DEFAULT_RATELIMIT_BURST);
 
+   read_lock(&tasklist_lock);
+
/*
 * If the task is already exiting, don't alarm the sysadmin or kill
 * its children or threads, just set TIF_MEMDIE so it can die quickly
 */
-   if (p->flags & PF_EXITING) {
+   if (p->flags & PF_EXITING || !pid_alive(p)) {
set_tsk_thread_flag(p, TIF_MEMDIE);
put_task_struct(p);
+   read_unlock(&tasklist_lock);
return;
}
 
@@ -436,7 +439,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 * parent.  This attempts to lose the minimal amount of work done while
 * still freeing memory.
 */
-   read_lock(&tasklist_lock);
do {
list_for_each_entry(child, &t->children, sibling) {
unsigned int child_points;
-- 
1.8.4.1



Re: [PATCH] mm, oom: Fix race when selecting process to kill

2013-11-07 Thread Sameer Nanda
On Wed, Nov 6, 2013 at 4:35 PM, David Rientjes  wrote:
> On Wed, 6 Nov 2013, Sameer Nanda wrote:
>
>> David -- I think we can make the duration that the tasklist_lock is
>> held smaller by consolidating the process selection logic that is
>> currently split across select_bad_process and oom_kill_process into
>> one place in select_bad_process.  The tasklist_lock would then need to
>> be held only when the thread lists are being traversed.  Would you be
>> ok with that?  I can re-spin the patch if that sounds like a workable
>> option.
>>
>
> No, this caused hundreds of machines to hit soft lockups for Google
> because there's no synchronization that prevents dozens of cpus to take
> tasklist_lock in the oom killer during parallel memcg oom conditions and
> never allow the write_lock_irq() on fork() or exit() to make progress.  We
> absolutely must hold tasklist_lock for as little time as possible in the
> oom killer.
>
> That said, I've never actually seen your reported bug manifest in our
> production environment so let's see if Oleg has any ideas.

Is the path you are referring to mem_cgroup_out_of_memory calling
oom_kill_process?  If so, then that path doesn't appear to suffer from
the two step select_bad_process, oom_kill_process race since
mem_cgroup_out_of_memory directly calls oom_kill_process without going
through select_bad_process.  This also means that the patch I sent is
incorrect since it removes the existing tasklist_lock protection in
oom_kill_process.

Respinning patch to take care of this case.

-- 
Sameer


Re: [PATCH] mm, oom: Fix race when selecting process to kill

2013-11-06 Thread Sameer Nanda
(adding back context from thread history)
On Tue, Nov 5, 2013 at 5:18 PM, David Rientjes  wrote:
> On Tue, 5 Nov 2013, Sameer Nanda wrote:
>
>> The selection of the process to be killed happens in two spots -- first
>> in select_bad_process and then a further refinement by looking for
>> child processes in oom_kill_process. Since this is a two step process,
>> it is possible that the process selected by select_bad_process may get a
>> SIGKILL just before oom_kill_process executes. If this were to happen,
>> __unhash_process deletes this process from the thread_group list. This
>> then results in oom_kill_process getting stuck in an infinite loop when
>> traversing the thread_group list of the selected process.
>>
>> Fix this race by holding the tasklist_lock across the calls to both
>> select_bad_process and oom_kill_process.
>>
>> Change-Id: I8f96b106b3257b5c103d6497bac7f04f4dff4e60
>> Signed-off-by: Sameer Nanda 
>
> Nack, we had to avoid taking tasklist_lock for this duration since it
> stalls out forks and exits on other cpus trying to take the writeside with
> irqs disabled to avoid watchdog problems.
>

David -- I think we can make the duration that the tasklist_lock is
held smaller by consolidating the process selection logic that is
currently split across select_bad_process and oom_kill_process into
one place in select_bad_process.  The tasklist_lock would then need to
be held only when the thread lists are being traversed.  Would you be
ok with that?  I can re-spin the patch if that sounds like a workable
option.

> What kernel version are you patching?  If you check the latest Linus tree,
> we hold a reference to the task_struct of the chosen process before
> calling oom_kill_process() so the hypothesis would seem incorrect.


On Tue, Nov 5, 2013 at 11:17 PM, Luigi Semenzato  wrote:
> Regarding other fixes: would it be possible to have the thread
> iterator insert a dummy marker element in the thread list before the
> scan?  There would be one such dummy element per CPU, so that multiple
> CPUs can scan the list in parallel.  The loop would skip such
> elements, and each dummy element would be removed at the end of each
> scan.
>
> I think this would work, i.e. it would have all the right properties,
> but I don't have a sense of whether the performance impact is
> acceptable.  Probably not, or it would have been proposed earlier.
>
>
>
> On Tue, Nov 5, 2013 at 8:45 PM, Luigi Semenzato  wrote:
>> It's interesting that this was known for 3+ years, but nobody bothered
>> adding a small warning to the code.
>>
>> We noticed this because it's actually happening on Chromebooks in the
>> field.  We try to minimize OOM kills, but we can deal with them.  Of
>> course, a hung kernel we cannot deal with.
>>
>> On Tue, Nov 5, 2013 at 7:04 PM, Sameer Nanda  wrote:
>>>
>>>
>>>
>>> On Tue, Nov 5, 2013 at 5:27 PM, David Rientjes  wrote:
>>>>
>>>> On Tue, 5 Nov 2013, Luigi Semenzato wrote:
>>>>
>>>> > It's not enough to hold a reference to the task struct, because it can
>>>> > still be taken out of the circular list of threads.  The RCU
>>>> > assumptions don't hold in that case.
>>>> >
>>>>
>>>> Could you please post a proper bug report that isolates this as the cause?
>>>
>>>
>>> We've been running into this issue on Chrome OS. crbug.com/256326 has
>>> additional
>>> details.  The issue manifests itself as a soft lockup.
>>>
>>> The kernel we've been seeing this on is 3.8.
>>>
>>> We have a pretty consistent repro currently.  Happy to try out other
>>> suggestions
>>> for a fix.
>>>
>>>>
>>>>
>>>> Thanks.
>>>
>>>
>>>
>>>
>>> --
>>> Sameer



-- 
Sameer


[PATCH] mm, oom: Fix race when selecting process to kill

2013-11-05 Thread Sameer Nanda
The selection of the process to be killed happens in two spots -- first
in select_bad_process and then a further refinement by looking for
child processes in oom_kill_process. Since this is a two step process,
it is possible that the process selected by select_bad_process may get a
SIGKILL just before oom_kill_process executes. If this were to happen,
__unhash_process deletes this process from the thread_group list. This
then results in oom_kill_process getting stuck in an infinite loop when
traversing the thread_group list of the selected process.

Fix this race by holding the tasklist_lock across the calls to both
select_bad_process and oom_kill_process.

Change-Id: I8f96b106b3257b5c103d6497bac7f04f4dff4e60
Signed-off-by: Sameer Nanda 
---
 mm/oom_kill.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6738c47..7bd3587 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -436,7 +436,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 * parent.  This attempts to lose the minimal amount of work done while
 * still freeing memory.
 */
-   read_lock(&tasklist_lock);
do {
list_for_each_entry(child, &t->children, sibling) {
unsigned int child_points;
@@ -456,7 +455,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
}
}
} while_each_thread(p, t);
-   read_unlock(&tasklist_lock);
 
rcu_read_lock();
p = find_lock_task_mm(victim);
@@ -641,6 +639,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL;
check_panic_on_oom(constraint, gfp_mask, order, mpol_mask);
 
+   read_lock(&tasklist_lock);
if (sysctl_oom_kill_allocating_task && current->mm &&
!oom_unkillable_task(current, NULL, nodemask) &&
current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
@@ -663,6 +662,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
killed = 1;
}
 out:
+   read_unlock(&tasklist_lock);
/*
 * Give the killed threads a good chance of exiting before trying to
 * allocate memory again.
-- 
1.8.4.1
