Re: lockdep trace from posix timers

2012-08-29 Thread Oleg Nesterov
On 08/28, Peter Zijlstra wrote:
>
> Surely we can do this locklessly.. I'll go try harder still.

I doubt...

Even ignoring the work->func check, you somehow need to ensure that
work->next == new can't be changed during cmpxchg(..., new).

Anyway, if this is possible, can't you do it on top of patches 1-4
I sent? They are simple, and solve the problems we discussed.

Off-topic. This is really minor, but can't we simplify llist_add()?

static inline bool llist_add(struct llist_node *new, struct llist_head *head)
{
	struct llist_node *old;

	do {
		old = ACCESS_ONCE(head->first);
		new->next = old;
	} while (cmpxchg(&head->first, old, new) != old);

	return old == NULL;
}

looks simpler and saves a couple of insns. The likely case should
assume that cmpxchg() succeeds after the 1st attempt. If it fails,
another LOAD from head->first should be very cheap.

And note this ACCESS_ONCE(head->first) above. I think that (in theory)
the current code needs it too. But only in theory, I guess.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: lockdep trace from posix timers

2012-08-28 Thread Peter Zijlstra
On Tue, 2012-08-28 at 19:01 +0200, Oleg Nesterov wrote:
> >  struct callback_head *
> >  task_work_cancel(struct task_struct *task, task_work_func_t func)
> >  {
> > + struct callback_head **workp, *work;
> > +
> > +again:
> > + workp = &task->task_works;
> > + work = *workp;
> > + while (work) {
> > + if (work->func == func) {
> 
> But you can't dereference this pointer. Without some locking this
> can race with another task_work_cancel() or task_work_run(), this
> work can be free/unmapped/reused.
> 
> > + if (cmpxchg(workp, work, work->next) == work)
> > + return work;
> 
> Or this can race with task_work_cancel(work) + task_work_add(work).
> cmpxchg() can succeed even if work->func is already different. 

Bah.. you and your races ;-)

Surely we can do this locklessly.. I'll go try harder still.



Re: lockdep trace from posix timers

2012-08-28 Thread Oleg Nesterov
On 08/28, Oleg Nesterov wrote:
>
> On 08/28, Peter Zijlstra wrote:
> >
> > +again:
> > +   workp = &task->task_works;
> > +   work = *workp;
> > +   while (work) {
> > +   if (work->func == func) {
>
> But you can't dereference this pointer. Without some locking this
> can race with another task_work_cancel() or task_work_run(), this
> work can be free/unmapped/reused.
>
> > +   if (cmpxchg(workp, work, work->next) == work)
> > +   return work;
>
> Or this can race with task_work_cancel(work) + task_work_add(work).
> cmpxchg() can succeed even if work->func is already different.

Even simpler, this can race with another task_work_cancel() which
is going to remove work->next.

Oleg.



Re: lockdep trace from posix timers

2012-08-28 Thread Oleg Nesterov
On 08/28, Peter Zijlstra wrote:
>
> On Fri, 2012-08-24 at 20:56 +0200, Oleg Nesterov wrote:
> >
> > Peter, if you think it can work for you and if you agree with
> > the implementation I will be happy to send the patch.
>
> Yeah I think it would work, but I'm not sure why you're introducing the
> cmp_xchg helper just for this..

Please look at patches 1-4 I sent (only 1-2 are relevant), I removed
this helper. I still think it makes sense, but of course not in
task_work.c.

>  struct callback_head *
>  task_work_cancel(struct task_struct *task, task_work_func_t func)
>  {
> - unsigned long flags;
> - struct callback_head *last, *res = NULL;
> -
> - raw_spin_lock_irqsave(&task->pi_lock, flags);
> - last = task->task_works;
> - if (last) {
> - struct callback_head *q = last, *p = q->next;
> - while (1) {
> - if (p->func == func) {
> - q->next = p->next;
> - if (p == last)
> - task->task_works = q == p ? NULL : q;
> - res = p;
> - break;
> - }
> - if (p == last)
> - break;
> - q = p;
> - p = q->next;
> + struct callback_head **workp, *work;
> +
> +again:
> + workp = &task->task_works;
> + work = *workp;
> + while (work) {
> + if (work->func == func) {

But you can't dereference this pointer. Without some locking this
can race with another task_work_cancel() or task_work_run(), this
work can be free/unmapped/reused.

> + if (cmpxchg(workp, work, work->next) == work)
> + return work;

Or this can race with task_work_cancel(work) + task_work_add(work).
cmpxchg() can succeed even if work->func is already different.

> +static struct callback_head *task_work_pop(void)
>  {
> - struct task_struct *task = current;
> - struct callback_head *p, *q;
> -
> - while (1) {
> - raw_spin_lock_irq(&task->pi_lock);
> - p = task->task_works;
> - task->task_works = NULL;
> - raw_spin_unlock_irq(&task->pi_lock);
> -
> - if (unlikely(!p))
> - return;
> -
> - q = p->next; /* head */
> - p->next = NULL; /* cut it */
> - while (q) {
> - p = q->next;
> - q->func(q);
> - q = p;
> - }
> + struct callback_head **head = &current->task_works;
> + struct callback_head *entry, *old_entry;
> +
> + entry = *head;
> + for (;;) {
> + if (!entry || entry == &dead)
> + return NULL;
> +
> + old_entry = entry;
> + entry = cmpxchg(head, entry, entry->next);

Well, this obviously means cmpxchg() for each entry...

> ( And yeah, I know, its not FIFO ;-)

Cough. akpm didn't like fifo, Linus disliked it too...

And now you! Whats going on??? ;)

Oleg.



Re: lockdep trace from posix timers

2012-08-28 Thread Peter Zijlstra
On Fri, 2012-08-24 at 20:56 +0200, Oleg Nesterov wrote:
> 
> Peter, if you think it can work for you and if you agree with
> the implementation I will be happy to send the patch. 

Yeah I think it would work, but I'm not sure why you're introducing the
cmp_xchg helper just for this..

Anyway, how about something like the below? It pops the works one by one
when running; that way cancel will only return NULL when the work is
either being executed or has already been executed.

( And yeah, I know, its not FIFO ;-)

---
 include/linux/task_work.h |7 +--
 kernel/exit.c |2 +-
 kernel/task_work.c|  130 +
 3 files changed, 75 insertions(+), 64 deletions(-)

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index fb46b03..f365416 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -15,11 +15,6 @@ init_task_work(struct callback_head *twork, task_work_func_t 
func)
 int task_work_add(struct task_struct *task, struct callback_head *twork, bool);
 struct callback_head *task_work_cancel(struct task_struct *, task_work_func_t);
 void task_work_run(void);
-
-static inline void exit_task_work(struct task_struct *task)
-{
-   if (unlikely(task->task_works))
-   task_work_run();
-}
+void task_work_exit(void);
 
 #endif /* _LINUX_TASK_WORK_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index f65345f..92aa94b 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -992,7 +992,7 @@ void do_exit(long code)
exit_shm(tsk);
exit_files(tsk);
exit_fs(tsk);
-   exit_task_work(tsk);
+   task_work_exit();
check_stack_usage();
exit_thread();
 
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 91d4e17..7767924 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -2,79 +2,95 @@
 #include <linux/task_work.h>
 #include <linux/tracehook.h>
 
+static void task_work_nop(struct callback_head *work)
+{
+}
+
+static struct callback_head dead = {
+   .next = NULL,
+   .func = task_work_nop,
+};
+
 int
-task_work_add(struct task_struct *task, struct callback_head *twork, bool 
notify)
+task_work_add(struct task_struct *task, struct callback_head *work, bool 
notify)
 {
-   struct callback_head *last, *first;
-   unsigned long flags;
-
-   /*
-* Not inserting the new work if the task has already passed
-* exit_task_work() is the responisbility of callers.
-*/
-   raw_spin_lock_irqsave(&task->pi_lock, flags);
-   last = task->task_works;
-   first = last ? last->next : twork;
-   twork->next = first;
-   if (last)
-   last->next = twork;
-   task->task_works = twork;
-   raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+   struct callback_head **head = &task->task_works;
+   struct callback_head *entry, *old_entry;
+
+   entry = *head;
+   for (;;) {
+   if (entry == &dead)
+   return -ESRCH;
+
+   old_entry = entry;
+   work->next = entry;
+   entry = cmpxchg(head, old_entry, work);
+   if (entry == old_entry)
+   break;
+   }
 
/* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
if (notify)
set_notify_resume(task);
+
return 0;
 }
 
 struct callback_head *
 task_work_cancel(struct task_struct *task, task_work_func_t func)
 {
-   unsigned long flags;
-   struct callback_head *last, *res = NULL;
-
-   raw_spin_lock_irqsave(&task->pi_lock, flags);
-   last = task->task_works;
-   if (last) {
-   struct callback_head *q = last, *p = q->next;
-   while (1) {
-   if (p->func == func) {
-   q->next = p->next;
-   if (p == last)
-   task->task_works = q == p ? NULL : q;
-   res = p;
-   break;
-   }
-   if (p == last)
-   break;
-   q = p;
-   p = q->next;
+   struct callback_head **workp, *work;
+
+again:
+   workp = &task->task_works;
+   work = *workp;
+   while (work) {
+   if (work->func == func) {
+   if (cmpxchg(workp, work, work->next) == work)
+   return work;
+   goto again;
}
+
+   workp = &work->next;
+   work = *workp;
}
-   raw_spin_unlock_irqrestore(&task->pi_lock, flags);
-   return res;
+
+   return NULL;
 }
 
-void task_work_run(void)
+static struct callback_head *task_work_pop(void)
 {
-   struct task_struct *task = current;
-   struct callback_head *p, *q;
-
-   while (1) {
-   raw_spin_lock_irq(&task->pi_lock);
-   p = task->task_works;
-   task->task_works = NULL;
-   


Re: lockdep trace from posix timers

2012-08-24 Thread Oleg Nesterov
On 08/21, Oleg Nesterov wrote:
>
> > task_work_add(struct task_struct *task, struct callback_head *twork, bool 
> > notify)
> > {
> > do {
> > twork->next = ACCESS_ONCE(task->task_works);
> > if (unlikely(twork->next == TWORK_EXITED))
> > return -ESRCH;
> } while (cmp_xchg(&task->task_works, twork->next, twork));
>   ^^^
>
> while (!cmp_xchg(...))

and _cancel can be simplified a bit.

OK, I actually tested the code below, seems to work.

Peter, if you think it can work for you and if you agree with
the implementation I will be happy to send the patch.

The only change outside of task_work.c is that exit_task_work()
should call task_work_run() unconditionally.

As for cmp_xchg below... Of course it is not strictly needed.
But otoh, personally I'd like to have this helper (probably
renamed) somewhere in include/linux. Perhaps this is just me,
but cmpxchg() always confuses me, and most users only need
success_or_not from cmpxchg.

Oleg.


#define cmp_xchg(ptr, _old, new)			\
({							\
	__typeof__(_old) old = (_old);			\
	cmpxchg(ptr, old, new) == old;			\
})

#define TWORK_EXITED	((struct callback_head *)1)

int
task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
{
	do {
		work->next = ACCESS_ONCE(task->task_works);
		if (unlikely(work->next == TWORK_EXITED))
			return -ESRCH;
	} while (!cmp_xchg(&task->task_works, work->next, work));

	/* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
	if (notify)
		set_notify_resume(task);
	return 0;
}

struct callback_head *
task_work_cancel(struct task_struct *task, task_work_func_t func)
{
	struct callback_head **pprev = &task->task_works;
	struct callback_head *work = NULL;
	unsigned long flags;

	raw_spin_lock_irqsave(&task->pi_lock, flags);
	if (likely(*pprev != TWORK_EXITED)) {
		while ((work = *pprev)) {
			read_barrier_depends();
			if (work->func != func)
				pprev = &work->next;
			else if (cmp_xchg(pprev, work, work->next))
				break;
		}
	}
	raw_spin_unlock_irqrestore(&task->pi_lock, flags);

	return work;
}

void task_work_run(void)
{
	struct task_struct *task = current;
	struct callback_head *work, *head, *next;

	for (;;) {
		raw_spin_lock_irq(&task->pi_lock);
		do {
			work = ACCESS_ONCE(task->task_works);
			head = !work && (task->flags & PF_EXITING) ?
				TWORK_EXITED : NULL;
		} while (!cmp_xchg(&task->task_works, work, head));
		raw_spin_unlock_irq(&task->pi_lock);

		if (!work)
			break;

#if PETERZ_WONT_ARGUE_AGAINST_FIFO_TOO_MUCH
		head = NULL;
		do {
			next = work->next;
			work->next = head;
			head = work;
			work = next;
		} while (work);

		work = head;
#endif
		do {
			next = work->next;
			work->func(work);
			work = next;
			cond_resched();
		} while (work);
	}
}




Re: lockdep trace from posix timers

2012-08-21 Thread Oleg Nesterov
On 08/21, Oleg Nesterov wrote:
>
> static inline bool cmp_xchg(struct callback_head **ptr, void *old, void *new)
> {
>   return cmpxchg(ptr, old, new) == old;
> }
>
> int
> task_work_add(struct task_struct *task, struct callback_head *twork, bool 
> notify)
> {
>   do {
>   twork->next = ACCESS_ONCE(task->task_works);
>   if (unlikely(twork->next == TWORK_EXITED))
>   return -ESRCH;
>   } while (cmp_xchg(&task->task_works, twork->next, twork));
  ^^^

while (!cmp_xchg(...))

Oleg.



Re: lockdep trace from posix timers

2012-08-21 Thread Oleg Nesterov
On 08/20, Oleg Nesterov wrote:
>
> On 08/20, Peter Zijlstra wrote:
> >
> > Anyway, would taking ->pi_lock over _cancel and _run suffice?
>
> I thinks yes... But I can't think properly today.

OK, how about the code below?

Oleg.


#define TWORK_EXITED	((struct callback_head *)1)

static inline bool cmp_xchg(struct callback_head **ptr, void *old, void *new)
{
	return cmpxchg(ptr, old, new) == old;
}

int
task_work_add(struct task_struct *task, struct callback_head *twork, bool notify)
{
	do {
		twork->next = ACCESS_ONCE(task->task_works);
		if (unlikely(twork->next == TWORK_EXITED))
			return -ESRCH;
	} while (cmp_xchg(&task->task_works, twork->next, twork));

	/* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
	if (notify)
		set_notify_resume(task);
	return 0;
}

struct callback_head *
task_work_cancel(struct task_struct *task, task_work_func_t func)
{
	struct callback_head *twork = NULL, **pprev = &task->task_works;
	unsigned long flags;

	raw_spin_lock_irqsave(&task->pi_lock, flags);
	if (likely(*pprev != TWORK_EXITED)) {
		for (; (twork = *pprev); pprev = &twork->next) {
			read_barrier_depends();
			if (twork->func != func)
				continue;

			while (!cmp_xchg(pprev, twork, twork->next))
				// COMMENT ABOUT THE RACE WITH _add()
				pprev = &(*pprev)->next;
			break;
		}
	}
	raw_spin_unlock_irqrestore(&task->pi_lock, flags);

	return twork;
}

void task_work_run(void)
{
	struct task_struct *task = current;
	struct callback_head *twork, *marker;

	for (;;) {
		raw_spin_lock_irq(&task->pi_lock);
		do {
			twork = ACCESS_ONCE(task->task_works);
			marker = (!twork && (task->flags & PF_EXITING)) ?
				TWORK_EXITED : NULL;
		} while (cmp_xchg(&task->task_works, twork, marker));
		raw_spin_unlock_irq(&task->pi_lock);

		if (!twork)
			break;

		...run works...
	}
}



Re: lockdep trace from posix timers

2012-08-20 Thread Oleg Nesterov
On 08/20, Peter Zijlstra wrote:
>
> Anyway, would taking ->pi_lock over _cancel and _run suffice?

I think yes... But I can't think properly today.

Oleg.



Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 18:10 +0200, Oleg Nesterov wrote:
> task->task_works points to the last node in the circular single-linked list,
> task_work_add() adds the new element after the last one and updates
> task->task_works. This is O(1).

Agreed, the way I was looking at that is: ->task_works points to the
head and we put a new one in front, that too is O(1) ;-)

> > > But the list should be short, we can reverse it in _run() if we change
> > > task_work_add() to add to the head.
> >
> > Reversing a (single linked) list is O(n^2)..
> 
> Hmm. This is O(n). You can simply iterate over this list once, changing
> the ->next pointer to point back. 

OK, I'm going to stop and step away from the computer now.. clearly I'm
more than useless today :/

But yeah.. that could be done.

Anyway, would taking ->pi_lock over _cancel and _run suffice?


Re: lockdep trace from posix timers

2012-08-20 Thread Oleg Nesterov
On 08/20, Peter Zijlstra wrote:
>
> On Mon, 2012-08-20 at 17:41 +0200, Oleg Nesterov wrote:
>
> > IMHO, this is just more natural.
>
> Depends on what you're used to I guess ;-)

I have to agree ;)

> > For example. keyctl_session_to_parent() does _cancel only to protect
> > from exploits doing keyctl(KEYCTL_SESSION_TO_PARENT) in an endless
> > loop. It could simply do task_work_add(), but in this case we need
> > fifo for correctness.
>
> I'm not entirely sure I see, not doing the cancel would delay the free
> until the executing of key_change_session_keyring()? doing that keyctl()
> in an indefinite loop involves going back to userspace, so where's the
> resource issue?

But the child does task_work_add(current->parent). IOW, there are 2
different tasks. Now suppose that ->parent sleeps.

> Also, I'm not seeing where the FIFO requirement comes from.

Again, suppose that ->parent sleeps. The last KEYCTL_SESSION_TO_PARENT
request should "win". Although I agree, this is not that important.

> > > Iterating a single linked queue in fifo
> > > seems more expensive than useful.
> >
> > Currently the list is fifo (we add to the last element), this is O(1).
>
> depends on what way you look at the list I guess, with a single linked
> list there's only one end you can add to in O(1), so we're calling that
> the tail?

Sorry, can't understand...

task->task_works points to the last node in the circular single-linked list,
task_work_add() adds the new element after the last one and updates
task->task_works. This is O(1).

> > But the list should be short, we can reverse it in _run() if we change
> > task_work_add() to add to the head.
>
> Reversing a (single linked) list is O(n^2)..

Hmm. This is O(n). You can simply iterate over this list once, changing
the ->next pointer to point back.

> which is indeed doable for
> short lists, but why assume its short?

I agree, it is better to not do this.

Oleg.
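(For reference, the O(n) single-pass reversal Oleg describes looks like this in illustrative user-space C; struct node and its fields are made up for the sketch.)

```c
#include <stddef.h>

struct node {
	struct node *next;
	int val;
};

/* One pass over the list, so O(n): each iteration re-points one
 * ->next at the node we just left behind. */
static struct node *reverse(struct node *head)
{
	struct node *prev = NULL;

	while (head) {
		struct node *next = head->next;	/* save the rest */
		head->next = prev;		/* point back */
		prev = head;
		head = next;
	}
	return prev;	/* new head is the old tail */
}
```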



Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 17:58 +0200, Oleg Nesterov wrote:
> On 08/20, Peter Zijlstra wrote:
> >
> > On Mon, 2012-08-20 at 17:32 +0200, Oleg Nesterov wrote:
> >
> > > > I guess we could steal the entire list and requeue it afterwards and
> > > > lift TIF_NOTIFY_RESUME along with it..
> > >
> > > We can't. This can race with exit_task_work(). And this can break
> > > fput(), the task can return to the userspace without __fput().
> >
> > So we could put that spinlock back around cancel and run and leave add
> > lockless. That'd solve my immediate problem but its not something I'm
> > proud of.
> 
> Which problem?

/me doing task_work_add() from under rq->lock..

> We can probably use bit_spin_lock() and avoid ->pi_lock.

tglx will kill us both for even thinking of bit-spinlocks.

> Of course, we can add the new lock into task_struct, but this is not nice.

If we can limit the lock to cancel/run I'm ok.


Re: lockdep trace from posix timers

2012-08-20 Thread Oleg Nesterov
On 08/20, Peter Zijlstra wrote:
>
> On Mon, 2012-08-20 at 17:43 +0200, Oleg Nesterov wrote:
> > BTW, do you think it is really important to try to avoid ->pi_lock?
>
> It is for me, I'm doing task_work_add() from under rq->lock.. :/ I could
> fudge around and delay, but not having to would be nice.

Aha, got it.

At first glance, bit_spin_lock() can work. I'll try to make the patches
tomorrow.

Oleg.



Re: lockdep trace from posix timers

2012-08-20 Thread Oleg Nesterov
On 08/20, Peter Zijlstra wrote:
>
> On Mon, 2012-08-20 at 17:32 +0200, Oleg Nesterov wrote:
>
> > > I guess we could steal the entire list and requeue it afterwards and
> > > lift TIF_NOTIFY_RESUME along with it..
> >
> > We can't. This can race with exit_task_work(). And this can break
> > fput(), the task can return to the userspace without __fput().
>
> So we could put that spinlock back around cancel and run and leave add
> lockless. That'd solve my immediate problem but its not something I'm
> proud of.

Which problem?

We can probably use bit_spin_lock() and avoid ->pi_lock.

Of course, we can add the new lock into task_struct, but this is not nice.

Oleg.



Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 17:41 +0200, Oleg Nesterov wrote:

> I won't insist. The patch I sent uses PF_EXITING and the fake
> "struct callback_head* TWORK_EXITED", but this looks almost the same.

Right, I used a fake callback_head because it avoided a few special
cases, since it's a dereferenceable pointer.

> > > Note also your patch breaks fifo, but this is fixable.
> >
> > Why do you care about the order?
> 
> IMHO, this is just more natural.

Depends on what you're used to I guess ;-) Both RCU and irq_work are
filo, this seems to be the natural way for single linked lists.

> For example. keyctl_session_to_parent() does _cancel only to protect
> from exploits doing keyctl(KEYCTL_SESSION_TO_PARENT) in an endless
> loop. It could simply do task_work_add(), but in this case we need
> fifo for correctness.

I'm not entirely sure I see, not doing the cancel would delay the free
until the executing of key_change_session_keyring()? doing that keyctl()
in an indefinite loop involves going back to userspace, so where's the
resource issue?

Also, I'm not seeing where the FIFO requirement comes from.

> > Iterating a single linked queue in fifo
> > seems more expensive than useful.
> 
> Currently the list is fifo (we add to the last element), this is O(1).

depends on what way you look at the list I guess, with a single linked
list there's only one end you can add to in O(1), so we're calling that
the tail?

> But the list should be short, we can reverse it in _run() if we change
> task_work_add() to add to the head.

Reversing a (single linked) list is O(n^2).. which is indeed doable for
short lists, but why assume its short?


Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 17:43 +0200, Oleg Nesterov wrote:
> BTW, do you think it is really important to try to avoid ->pi_lock?

It is for me, I'm doing task_work_add() from under rq->lock.. :/ I could
fudge around and delay, but not having to would be nice.


Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 17:32 +0200, Oleg Nesterov wrote:

> > I guess we could steal the entire list and requeue it afterwards and
> > lift TIF_NOTIFY_RESUME along with it..
> 
> We can't. This can race with exit_task_work(). And this can break
> fput(), the task can return to the userspace without __fput().

So we could put that spinlock back around cancel and run and leave add
lockless. That'd solve my immediate problem but its not something I'm
proud of.

All schemes I can come up with end up being broken or effectively the
same as the above proposal.


Re: lockdep trace from posix timers

2012-08-20 Thread Oleg Nesterov
On 08/20, Peter Zijlstra wrote:
>
> Also, can't task_work use llist stuff? That would also avoid using
> ->pi_lock.

BTW, do you think it is really important to try to avoid ->pi_lock?

Oleg.



Re: lockdep trace from posix timers

2012-08-20 Thread Oleg Nesterov
On 08/20, Peter Zijlstra wrote:
>
> On Mon, 2012-08-20 at 17:05 +0200, Oleg Nesterov wrote:
> >
> > Yes, we can add the explicit argument to __task_work_run(), but it can
> > check PF_EXITING instead, this looks simpler to me.
>
> I guess we could.. but I thought the explicit callback was simpler ;-)

I won't insist. The patch I sent uses PF_EXITING and the fake
"struct callback_head* TWORK_EXITED", but this looks almost the same.

> > Note also your patch breaks fifo, but this is fixable.
>
> Why do you care about the order?

IMHO, this is just more natural.

For example. keyctl_session_to_parent() does _cancel only to protect
from exploits doing keyctl(KEYCTL_SESSION_TO_PARENT) in an endless
loop. It could simply do task_work_add(), but in this case we need
fifo for correctness.

> Iterating a single linked queue in fifo
> seems more expensive than useful.

Currently the list is fifo (we add to the last element), this is O(1).

But the list should be short, we can reverse it in _run() if we change
task_work_add() to add to the head.

Oleg.
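(A hedged sketch of the circular-list trick described above, in plain user-space C with invented names: keeping the external pointer on the *last* node makes tail insertion O(1), while last->next is always the fifo head.)

```c
#include <stddef.h>

struct node {
	struct node *next;
};

/* '*works' points at the LAST node of a circular list, so
 * (*works)->next is the FIRST node and fifo order comes for free. */
static void add_tail(struct node **works, struct node *n)
{
	struct node *last = *works;

	if (last) {
		n->next = last->next;	/* new tail points at old head */
		last->next = n;
	} else {
		n->next = n;		/* single node points at itself */
	}
	*works = n;			/* new node becomes the tail */
}
```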



Re: lockdep trace from posix timers

2012-08-20 Thread Oleg Nesterov
On 08/20, Peter Zijlstra wrote:
>
> On Mon, 2012-08-20 at 17:10 +0200, Peter Zijlstra wrote:
> > On Mon, 2012-08-20 at 16:59 +0200, Oleg Nesterov wrote:
> > > On 08/20, Peter Zijlstra wrote:
> > > >
> > > >  task_work_cancel(struct task_struct *task, task_work_func_t func)
> > > >  {
> > > > ...
> > > > +
> > > > +again:
> > > > +   workp = &task->task_works;
> > > > +   work = *workp;
> > > > +   while (work) {
> > > > +   if (work->func == func) {
> > >
> > > But this all can race with task_work_run() if task != current.
> > >
> > > This "work" can be freed/reused. And it should only return !NULL
> > > if twork->func() was not called.
> >
> > Ah, because we could be iterating the list after the xchg done by run.
>
> I guess we could steal the entire list and requeue it afterwards and
> lift TIF_NOTIFY_RESUME along with it..

We can't. This can race with exit_task_work(). And this can break
fput(), the task can return to the userspace without __fput().

> but I can't say that's pretty.

Yes ;)

Oleg.



Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 17:10 +0200, Peter Zijlstra wrote:
> On Mon, 2012-08-20 at 16:59 +0200, Oleg Nesterov wrote:
> > On 08/20, Peter Zijlstra wrote:
> > >
> > >  task_work_cancel(struct task_struct *task, task_work_func_t func)
> > >  {
> > > ...
> > > +
> > > +again:
> > > + workp = &task->task_works;
> > > + work = *workp;
> > > + while (work) {
> > > + if (work->func == func) {
> > 
> > But this all can race with task_work_run() if task != current.
> > 
> > This "work" can be freed/reused. And it should only return !NULL
> > if twork->func() was not called.
> 
> Ah, because we could be iterating the list after the xchg done by run.

I guess we could steal the entire list and requeue it afterwards and
lift TIF_NOTIFY_RESUME along with it.. but I can't say that's pretty.






Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 17:05 +0200, Oleg Nesterov wrote:
> On 08/20, Peter Zijlstra wrote:
> >
> > +static void __task_work_run(struct callback_head *tail)
> >  {
> > -   struct task_struct *task = current;
> > -   struct callback_head *p, *q;
> > -
> > -   while (1) {
> > -   raw_spin_lock_irq(&task->pi_lock);
> > -   p = task->task_works;
> > -   task->task_works = NULL;
> > -   raw_spin_unlock_irq(&task->pi_lock);
> > -
> > -   if (unlikely(!p))
> > -   return;
> > -
> > -   q = p->next; /* head */
> > -   p->next = NULL; /* cut it */
> > -   while (q) {
> > -   p = q->next;
> > -   q->func(q);
> > -   q = p;
> > +   struct callback_head **head = &task->task_works;
> > +
> > +   do {
> > +   struct callback_head *work = xchg(head, NULL);
> > +   while (work) {
> > +   struct callback_head *next = ACCESS_ONCE(work->next);
> > +
> > +   WARN_ON_ONCE(work == &dead);
> > +
> > +   work->func(work);
> > +   work = next;
> > }
> > -   }
> > +   } while (cmpxchg(head, NULL, tail) != NULL);
> 
> Yes, we can add the explicit argument to __task_work_run(), but it can
> check PF_EXITING instead, this looks simpler to me.

I guess we could.. but I thought the explicit callback was simpler ;-)

> Note also your patch breaks fifo, but this is fixable.

Why do you care about the order? Iterating a single linked queue in fifo
seems more expensive than useful.
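(The xchg/cmpxchg shape of __task_work_run() quoted above can be mimicked with C11 atomics. This is an illustrative user-space sketch, not the kernel code; push(), run_all() and the counter are invented names. A convenient side effect: a failed atomic_compare_exchange_strong() writes the current head back into the expected value, so the push retry needs no explicit reload.)

```c
#include <stdatomic.h>
#include <stddef.h>

struct callback_head {
	struct callback_head *next;
	void (*func)(struct callback_head *);
};

static _Atomic(struct callback_head *) works;

static void push(struct callback_head *w)
{
	w->next = atomic_load(&works);
	/* On failure the CAS refreshes w->next with the current head,
	 * so the loop body is empty. */
	while (!atomic_compare_exchange_strong(&works, &w->next, w))
		;
}

/* Steal the whole chain with one exchange, run it lock-free, and
 * repeat until the list stays empty (entries pushed while we were
 * running are picked up on the next round). */
static void run_all(void)
{
	struct callback_head *work;

	while ((work = atomic_exchange(&works, NULL)) != NULL) {
		while (work) {
			struct callback_head *next = work->next;
			work->func(work);
			work = next;
		}
	}
}

/* Instrumentation for the usage example below. */
static int ran;
static void count_cb(struct callback_head *w) { (void)w; ran++; }
```

Note that the stolen chain runs in push-reverse (lifo) order, which is exactly the fifo-vs-filo point debated in this thread.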


Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 16:59 +0200, Oleg Nesterov wrote:
> On 08/20, Peter Zijlstra wrote:
> >
> >  task_work_cancel(struct task_struct *task, task_work_func_t func)
> >  {
> > ...
> > +
> > +again:
> > +   workp = &task->task_works;
> > +   work = *workp;
> > +   while (work) {
> > +   if (work->func == func) {
> 
> But this all can race with task_work_run() if task != current.
> 
> This "work" can be freed/reused. And it should only return !NULL
> if twork->func() was not called.

Ah, because we could be iterating the list after the xchg done by run.

Hmm.. 


Re: lockdep trace from posix timers

2012-08-20 Thread Oleg Nesterov
On 08/20, Peter Zijlstra wrote:
>
> +static void __task_work_run(struct callback_head *tail)
>  {
> - struct task_struct *task = current;
> - struct callback_head *p, *q;
> -
> - while (1) {
> - raw_spin_lock_irq(&task->pi_lock);
> - p = task->task_works;
> - task->task_works = NULL;
> - raw_spin_unlock_irq(&task->pi_lock);
> -
> - if (unlikely(!p))
> - return;
> -
> - q = p->next; /* head */
> - p->next = NULL; /* cut it */
> - while (q) {
> - p = q->next;
> - q->func(q);
> - q = p;
> + struct callback_head **head = &task->task_works;
> +
> + do {
> + struct callback_head *work = xchg(head, NULL);
> + while (work) {
> + struct callback_head *next = ACCESS_ONCE(work->next);
> +
> + WARN_ON_ONCE(work == &dead);
> +
> + work->func(work);
> + work = next;
>   }
> - }
> + } while (cmpxchg(head, NULL, tail) != NULL);

Yes, we can add the explicit argument to __task_work_run(), but it can
check PF_EXITING instead, this looks simpler to me.

Note also your patch breaks fifo, but this is fixable.

Oleg.



Re: lockdep trace from posix timers

2012-08-20 Thread Oleg Nesterov
On 08/20, Peter Zijlstra wrote:
>
>  task_work_cancel(struct task_struct *task, task_work_func_t func)
>  {
> ...
> +
> +again:
> + workp = &task->task_works;
> + work = *workp;
> + while (work) {
> + if (work->func == func) {

But this all can race with task_work_run() if task != current.

This "work" can be freed/reused. And it should only return !NULL
if twork->func() was not called.

Oleg.



Re: lockdep trace from posix timers

2012-08-20 Thread Oleg Nesterov
On 08/20, Peter Zijlstra wrote:
>
> On Fri, 2012-08-17 at 17:14 +0200, Oleg Nesterov wrote:
> I still think that task_work_add() should synchronize with exit_task_work()
> > itself and fail if necessary. But I wasn't able to convince Al ;)
>
> I'm not at all sure how that relates to needing task_lock() in the
> keyctl stuff.

To serialize with exit_mm() which clears ->mm.

We shouldn't do task_work_add(task) if the exiting task has already
passed exit_task_work(). There is no way to do this after ed3e694d7
(and I think this is wrong), so keyctl relies on the fact that
exit_task_work() is called after exit_mm().

> Also, can't task_work use llist stuff? That would also avoid using
> ->pi_lock.

Not sure task_work_add/run can use cmpxchg, but _cancel can't.

Oleg.
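(To illustrate the asymmetry Oleg points out, a hedged user-space C11 sketch with invented names: a lone cmpxchg can only unlink the *head* of a singly linked list. Unlinking an interior node, as _cancel must, needs a stable predecessor pointer, and a concurrent run can pop and free that predecessor underneath you; hence cancel keeps a lock.)

```c
#include <stdatomic.h>
#include <stddef.h>

struct cb {
	struct cb *next;
	int tag;
};

static _Atomic(struct cb *) head;

/* Single-CAS unlink works only at the head: the CAS target is the
 * list head itself. An interior unlink would have to CAS pred->next
 * while another thread may already have run and freed pred or the
 * victim, which no single compare-exchange can guard against. */
static struct cb *pop_if(int tag)
{
	struct cb *h = atomic_load(&head);

	while (h && h->tag == tag &&
	       !atomic_compare_exchange_strong(&head, &h, h->next))
		;	/* failed CAS reloaded h; re-check and retry */

	return (h && h->tag == tag) ? h : NULL;
}
```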



Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 08:19 -0400, Steven Rostedt wrote:
> 
> > > +   for (;;) {
> > > +   if (entry == &dead)
> 
> But your compiler likes "entry == &dead"? ;-)
> 
Yes, fancy as GCC is these days, it doesn't quite bother with the
meta-physical questions just yet.


Re: lockdep trace from posix timers

2012-08-20 Thread Steven Rostedt
On Mon, 2012-08-20 at 13:50 +0200, Peter Zijlstra wrote:
> On Mon, 2012-08-20 at 13:44 +0200, Peter Zijlstra wrote:
> >  int
> > +task_work_add(struct task_struct *task, struct callback_head *work, bool 
> > notify)
> >  {
> > +   struct callback_head **head = &task->task_works;
> > +   struct callback_head *entry, *old_entry;
> > +
> > +   entry = &head;
> 
> My compiler just alerted me that: 
> 
>   entry = *head; 
> 
> works a lot better..
> 
> > +   for (;;) {
> > +   if (entry == &dead)

But your compiler likes "entry == &dead"? ;-)

-- Steve

> > +   return -ESRCH;
> > +
> > +   old_entry = entry;
> > +   work->next = entry;
> > +   entry = cmpxchg(head, old_entry, work);
> > +   if (entry == old_entry)
> > +   break;
> > +   }
> >  
> > /* test_and_set_bit() implies mb(), see tracehook_notify_resume(). 
> > */
> > if (notify)
> > set_notify_resume(task);
> > +
> > return 0;
> >  } 




Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 13:44 +0200, Peter Zijlstra wrote:
>  int
> +task_work_add(struct task_struct *task, struct callback_head *work, bool 
> notify)
>  {
> +   struct callback_head **head = &task->task_works;
> +   struct callback_head *entry, *old_entry;
> +
> +   entry = &head;

My compiler just alerted me that: 

entry = *head; 

works a lot better..

> +   for (;;) {
> +   if (entry == &dead)
> +   return -ESRCH;
> +
> +   old_entry = entry;
> +   work->next = entry;
> +   entry = cmpxchg(head, old_entry, work);
> +   if (entry == old_entry)
> +   break;
> +   }
>  
> /* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
> if (notify)
> set_notify_resume(task);
> +
> return 0;
>  } 


Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 13:44 +0200, Peter Zijlstra wrote:
> > I'm not at all sure how that relates to needing task_lock() in the
> > keyctl stuff. 

Ah, is it used to serialize against exit_mm() so that the !->mm check
will suffice to avoid queuing works past exit_task_work() ?





Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 09:15 +0200, Peter Zijlstra wrote:
> On Fri, 2012-08-17 at 17:14 +0200, Oleg Nesterov wrote:
> > I still think that task_work_add() should synchronize with exit_task_work()
> > itself and fail if necessary. But I wasn't able to convince Al ;)
> 
> I'm not at all sure how that relates to needing task_lock() in the
> keyctl stuff.
> 
> Also, can't task_work use llist stuff? That would also avoid using
> ->pi_lock.

How about something like the below?

---
 include/linux/task_work.h |   7 +--
 kernel/exit.c |   2 +-
 kernel/task_work.c| 120 --
 3 files changed, 65 insertions(+), 64 deletions(-)

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index fb46b03..f365416 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -15,11 +15,6 @@ init_task_work(struct callback_head *twork, task_work_func_t 
func)
 int task_work_add(struct task_struct *task, struct callback_head *twork, bool);
 struct callback_head *task_work_cancel(struct task_struct *, task_work_func_t);
 void task_work_run(void);
-
-static inline void exit_task_work(struct task_struct *task)
-{
-   if (unlikely(task->task_works))
-   task_work_run();
-}
+void task_work_exit(void);
 
 #endif /* _LINUX_TASK_WORK_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index f65345f..92aa94b 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -992,7 +992,7 @@ void do_exit(long code)
exit_shm(tsk);
exit_files(tsk);
exit_fs(tsk);
-   exit_task_work(tsk);
+   task_work_exit();
check_stack_usage();
exit_thread();
 
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 91d4e17..e5eac14 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -2,79 +2,85 @@
 #include <linux/task_work.h>
 #include <linux/tracehook.h>
 
+static void task_work_nop(struct callback_head *work)
+{
+}
+
+static struct callback_head dead = { 
+   .next = NULL,
+   .func = task_work_nop,
+};
+
 int
-task_work_add(struct task_struct *task, struct callback_head *twork, bool 
notify)
+task_work_add(struct task_struct *task, struct callback_head *work, bool 
notify)
 {
-   struct callback_head *last, *first;
-   unsigned long flags;
-
-   /*
-* Not inserting the new work if the task has already passed
-* exit_task_work() is the responsibility of callers.
-*/
-   raw_spin_lock_irqsave(&task->pi_lock, flags);
-   last = task->task_works;
-   first = last ? last->next : twork;
-   twork->next = first;
-   if (last)
-   last->next = twork;
-   task->task_works = twork;
-   raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+   struct callback_head **head = &task->task_works;
+   struct callback_head *entry, *old_entry;
+
+   entry = &head;
+   for (;;) {
+   if (entry == &dead)
+   return -ESRCH;
+
+   old_entry = entry;
+   work->next = entry;
+   entry = cmpxchg(head, old_entry, work);
+   if (entry == old_entry)
+   break;
+   }
 
/* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
if (notify)
set_notify_resume(task);
+
return 0;
 }
 
 struct callback_head *
 task_work_cancel(struct task_struct *task, task_work_func_t func)
 {
-   unsigned long flags;
-   struct callback_head *last, *res = NULL;
-
-   raw_spin_lock_irqsave(&task->pi_lock, flags);
-   last = task->task_works;
-   if (last) {
-   struct callback_head *q = last, *p = q->next;
-   while (1) {
-   if (p->func == func) {
-   q->next = p->next;
-   if (p == last)
-   task->task_works = q == p ? NULL : q;
-   res = p;
-   break;
-   }
-   if (p == last)
-   break;
-   q = p;
-   p = q->next;
+   struct callback_head **workp, *work;
+
+again:
+   workp = &task->task_works;
+   work = *workp;
+   while (work) {
+   if (work->func == func) {
+   if (cmpxchg(workp, work, work->next) == work)
+   return work;
+   goto again;
}
+
+   workp = &work->next;
+   work = *workp;
}
-   raw_spin_unlock_irqrestore(&task->pi_lock, flags);
-   return res;
+
+   return NULL;
 }
 
-void task_work_run(void)
+static void __task_work_run(struct callback_head *tail)
 {
-   struct task_struct *task = current;
-   struct callback_head *p, *q;
-
-   while (1) {
-   raw_spin_lock_irq(&task->pi_lock);
-   p = task->task_works;
-   task->task_works = NULL;
-   raw_spin_unlock_irq(&task->pi_lock);
-

Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Fri, 2012-08-17 at 17:14 +0200, Oleg Nesterov wrote:
> I still think that task_work_add() should synchronize with exit_task_work()
> itself and fail if necessary. But I wasn't able to convince Al ;)

I'm not at all sure how that relates to needing task_lock() in the
keyctl stuff.

Also, can't task_work use llist stuff? That would also avoid using
->pi_lock.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Fri, 2012-08-17 at 17:14 +0200, Oleg Nesterov wrote:
 I still think that task_work_add() should synhronize with exit_task_work()
 itself and fail if necessary. But I wasn't able to convince Al ;)

I'm not at all sure how that relates to needing task_lock() in the
keyctl stuff.

Also, can't task_work use llist stuff? That would also avoid using
-pi_lock.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 09:15 +0200, Peter Zijlstra wrote:
 On Fri, 2012-08-17 at 17:14 +0200, Oleg Nesterov wrote:
  I still think that task_work_add() should synhronize with exit_task_work()
  itself and fail if necessary. But I wasn't able to convince Al ;)
 
 I'm not at all sure how that relates to needing task_lock() in the
 keyctl stuff.
 
 Also, can't task_work use llist stuff? That would also avoid using
 -pi_lock.

How about something like the below?

---
 include/linux/task_work.h |   7 +--
 kernel/exit.c |   2 +-
 kernel/task_work.c| 120 --
 3 files changed, 65 insertions(+), 64 deletions(-)

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index fb46b03..f365416 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -15,11 +15,6 @@ init_task_work(struct callback_head *twork, task_work_func_t 
func)
 int task_work_add(struct task_struct *task, struct callback_head *twork, bool);
 struct callback_head *task_work_cancel(struct task_struct *, task_work_func_t);
 void task_work_run(void);
-
-static inline void exit_task_work(struct task_struct *task)
-{
-   if (unlikely(task-task_works))
-   task_work_run();
-}
+void task_work_exit(void);
 
 #endif /* _LINUX_TASK_WORK_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index f65345f..92aa94b 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -992,7 +992,7 @@ void do_exit(long code)
exit_shm(tsk);
exit_files(tsk);
exit_fs(tsk);
-   exit_task_work(tsk);
+   task_work_exit();
check_stack_usage();
exit_thread();
 
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 91d4e17..e5eac14 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -2,79 +2,85 @@
 #include linux/task_work.h
 #include linux/tracehook.h
 
+static void task_work_nop(struct callback_head *work)
+{
+}
+
+static struct callback_head dead = { 
+   .next = NULL,
+   .func = task_work_nop,
+};
+
 int
-task_work_add(struct task_struct *task, struct callback_head *twork, bool 
notify)
+task_work_add(struct task_struct *task, struct callback_head *work, bool 
notify)
 {
-   struct callback_head *last, *first;
-   unsigned long flags;
-
-   /*
-* Not inserting the new work if the task has already passed
-* exit_task_work() is the responisbility of callers.
-*/
-   raw_spin_lock_irqsave(task-pi_lock, flags);
-   last = task-task_works;
-   first = last ? last-next : twork;
-   twork-next = first;
-   if (last)
-   last-next = twork;
-   task-task_works = twork;
-   raw_spin_unlock_irqrestore(task-pi_lock, flags);
+   struct callback_head **head = task-task_works;
+   struct callback_head *entry, *old_entry;
+
+   entry = head;
+   for (;;) {
+   if (entry == dead)
+   return -ESRCH;
+
+   old_entry = entry;
+   work-next = entry;
+   entry = cmpxchg(head, old_entry, work);
+   if (entry == old_entry)
+   break;
+   }
 
/* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
if (notify)
set_notify_resume(task);
+
return 0;
 }
 
 struct callback_head *
 task_work_cancel(struct task_struct *task, task_work_func_t func)
 {
-   unsigned long flags;
-   struct callback_head *last, *res = NULL;
-
-   raw_spin_lock_irqsave(&task->pi_lock, flags);
-   last = task->task_works;
-   if (last) {
-   struct callback_head *q = last, *p = q->next;
-   while (1) {
-   if (p->func == func) {
-   q->next = p->next;
-   if (p == last)
-   task->task_works = q == p ? NULL : q;
-   res = p;
-   break;
-   }
-   if (p == last)
-   break;
-   q = p;
-   p = q->next;
+   struct callback_head **workp, *work;
+
+again:
+   workp = &task->task_works;
+   work = *workp;
+   while (work) {
+   if (work->func == func) {
+   if (cmpxchg(workp, work, work->next) == work)
+   return work;
+   goto again;
}
+
+   workp = &work->next;
+   work = *workp;
}
-   raw_spin_unlock_irqrestore(&task->pi_lock, flags);
-   return res;
+
+   return NULL;
 }
 
-void task_work_run(void)
+static void __task_work_run(struct callback_head *tail)
 {
-   struct task_struct *task = current;
-   struct callback_head *p, *q;
-
-   while (1) {
-   raw_spin_lock_irq(&task->pi_lock);
-   p = task->task_works;
-   task->task_works = NULL;
-   

Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 13:44 +0200, Peter Zijlstra wrote:
  I'm not at all sure how that relates to needing task_lock() in the
  keyctl stuff. 

Ah, is it used to serialize against exit_mm() so that the !->mm check
will suffice to avoid queuing works past exit_task_work() ?



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 13:44 +0200, Peter Zijlstra wrote:
  int
 +task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
  {
 +   struct callback_head **head = &task->task_works;
 +   struct callback_head *entry, *old_entry;
 +
 +   entry = head;

My compiler just alerted me that: 

entry = *head; 

works a lot better..

 +   for (;;) {
 +   if (entry == dead)
 +   return -ESRCH;
 +
 +   old_entry = entry;
 +   work->next = entry;
 +   entry = cmpxchg(head, old_entry, work);
 +   if (entry == old_entry)
 +   break;
 +   }
  
 /* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
 if (notify)
 set_notify_resume(task);
 +
 return 0;
  } 
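Peter's loop, with his `entry = *head` correction folded in, is essentially a lock-free stack push guarded by a sentinel. A user-space sketch of the same shape, using C11 atomics in place of the kernel's cmpxchg() (the `work_exited` sentinel and the helper names here are illustrative, not kernel API):

```c
#include <stdatomic.h>
#include <stddef.h>

struct callback_head {
	struct callback_head *next;
	void (*func)(struct callback_head *);
};

typedef _Atomic(struct callback_head *) cb_list_t;

/* Sentinel playing the role of "dead": once installed, pushes must fail. */
static struct callback_head work_exited;

static int work_push(cb_list_t *head, struct callback_head *work)
{
	struct callback_head *entry = atomic_load(head);	/* entry = *head */

	for (;;) {
		if (entry == &work_exited)
			return -1;	/* the kernel would return -ESRCH */

		work->next = entry;
		/* On failure the CAS reloads 'entry' with the current head. */
		if (atomic_compare_exchange_weak(head, &entry, work))
			return 0;
	}
}

/* Exercise the loop: two successful pushes, then a refused one. */
static int work_push_demo(void)
{
	static struct callback_head a, b, c;
	cb_list_t head = NULL;

	if (work_push(&head, &a) != 0 || work_push(&head, &b) != 0)
		return 0;
	if (atomic_load(&head) != &b || b.next != &a || a.next != NULL)
		return 0;

	atomic_store(&head, &work_exited);	/* close the list, as exit would */
	return work_push(&head, &c) == -1;
}
```

Note the weak CAS may fail spuriously; the retry loop absorbs that, just as the kernel loop absorbs a lost race.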


Re: lockdep trace from posix timers

2012-08-20 Thread Steven Rostedt
On Mon, 2012-08-20 at 13:50 +0200, Peter Zijlstra wrote:
 On Mon, 2012-08-20 at 13:44 +0200, Peter Zijlstra wrote:
   int
  +task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
   {
  +   struct callback_head **head = &task->task_works;
  +   struct callback_head *entry, *old_entry;
  +
  +   entry = head;
 
 My compiler just alerted me that: 
 
   entry = *head; 
 
 works a lot better..
 
  +   for (;;) {
  +   if (entry == dead)

But your compiler likes entry == dead? ;-)

-- Steve

  +   return -ESRCH;
  +
  +   old_entry = entry;
  +   work->next = entry;
  +   entry = cmpxchg(head, old_entry, work);
  +   if (entry == old_entry)
  +   break;
  +   }
   
  /* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
  if (notify)
  set_notify_resume(task);
  +
  return 0;
   } 




Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 08:19 -0400, Steven Rostedt wrote:
 
   +   for (;;) {
   +   if (entry == dead)
 
 But your compiler likes entry == dead? ;-)
 
Yes, fancy as GCC is these days, it doesn't quite bother with the
meta-physical questions just yet.


Re: lockdep trace from posix timers

2012-08-20 Thread Oleg Nesterov
On 08/20, Peter Zijlstra wrote:

 On Fri, 2012-08-17 at 17:14 +0200, Oleg Nesterov wrote:
  I still think that task_work_add() should synchronize with exit_task_work()
  itself and fail if necessary. But I wasn't able to convince Al ;)

 I'm not at all sure how that relates to needing task_lock() in the
 keyctl stuff.

To serialize with exit_mm() which clears ->mm.

We shouldn't do task_work_add(task) if the exiting task has already
passed exit_task_work(). There is no way to do this after ed3e694d7
(and I think this is wrong), so keyctl relies on the fact that
exit_task_work() is called after exit_mm().

 Also, can't task_work use llist stuff? That would also avoid using
 ->pi_lock.

Not sure task_work_add/run can use cmpxchg, but _cancel can't.

Oleg.



Re: lockdep trace from posix timers

2012-08-20 Thread Oleg Nesterov
On 08/20, Peter Zijlstra wrote:

  task_work_cancel(struct task_struct *task, task_work_func_t func)
  {
 ...
 +
 +again:
 + workp = &task->task_works;
 + work = *workp;
 + while (work) {
 + if (work->func == func) {

But this all can race with task_work_run() if task != current.

This work can be freed/reused. And it should only return !NULL
if twork->func() was not called.

Oleg.



Re: lockdep trace from posix timers

2012-08-20 Thread Oleg Nesterov
On 08/20, Peter Zijlstra wrote:

 +static void __task_work_run(struct callback_head *tail)
  {
 - struct task_struct *task = current;
 - struct callback_head *p, *q;
 -
 - while (1) {
 - raw_spin_lock_irq(&task->pi_lock);
 - p = task->task_works;
 - task->task_works = NULL;
 - raw_spin_unlock_irq(&task->pi_lock);
 -
 - if (unlikely(!p))
 - return;
 -
 - q = p->next; /* head */
 - p->next = NULL; /* cut it */
 - while (q) {
 - p = q->next;
 - q->func(q);
 - q = p;
 + struct callback_head **head = &current->task_works;
 +
 + do {
 + struct callback_head *work = xchg(head, NULL);
 + while (work) {
 + struct callback_head *next = ACCESS_ONCE(work->next);
 +
 + WARN_ON_ONCE(work == dead);
 +
 + work->func(work);
 + work = next;
   }
 - }
 + } while (cmpxchg(head, NULL, tail) != NULL);

Yes, we can add the explicit argument to __task_work_run(), but it can
check PF_EXITING instead, this looks simpler to me.

Note also your patch breaks fifo, but this is fixable.

Oleg.



Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 16:59 +0200, Oleg Nesterov wrote:
 On 08/20, Peter Zijlstra wrote:
 
   task_work_cancel(struct task_struct *task, task_work_func_t func)
   {
  ...
  +
  +again:
  +   workp = &task->task_works;
  +   work = *workp;
  +   while (work) {
  +   if (work->func == func) {
 
 But this all can race with task_work_run() if task != current.
 
 This work can be freed/reused. And it should only return !NULL
 if twork-func() was not called.

Ah, because we could be iterating the list after the xchg done by run.

Hmm.. 


Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 17:05 +0200, Oleg Nesterov wrote:
 On 08/20, Peter Zijlstra wrote:
 
  +static void __task_work_run(struct callback_head *tail)
   {
  -   struct task_struct *task = current;
  -   struct callback_head *p, *q;
  -
  -   while (1) {
  -   raw_spin_lock_irq(&task->pi_lock);
  -   p = task->task_works;
  -   task->task_works = NULL;
  -   raw_spin_unlock_irq(&task->pi_lock);
  -
  -   if (unlikely(!p))
  -   return;
  -
  -   q = p->next; /* head */
  -   p->next = NULL; /* cut it */
  -   while (q) {
  -   p = q->next;
  -   q->func(q);
  -   q = p;
  +   struct callback_head **head = &current->task_works;
  +
  +   do {
  +   struct callback_head *work = xchg(head, NULL);
  +   while (work) {
  +   struct callback_head *next = ACCESS_ONCE(work->next);
  +
  +   WARN_ON_ONCE(work == dead);
  +
  +   work->func(work);
  +   work = next;
  }
  -   }
  +   } while (cmpxchg(head, NULL, tail) != NULL);
 
 Yes, we can add the explicit argument to __task_work_run(), but it can
 check PF_EXITING instead, this looks simpler to me.

I guess we could.. but I thought the explicit callback was simpler ;-)

 Note also your patch breaks fifo, but this is fixable.

Why do you care about the order? Iterating a single linked queue in fifo
seems more expensive than useful.


Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 17:10 +0200, Peter Zijlstra wrote:
 On Mon, 2012-08-20 at 16:59 +0200, Oleg Nesterov wrote:
  On 08/20, Peter Zijlstra wrote:
  
task_work_cancel(struct task_struct *task, task_work_func_t func)
{
   ...
   +
   +again:
   + workp = &task->task_works;
   + work = *workp;
   + while (work) {
   + if (work->func == func) {
  
  But this all can race with task_work_run() if task != current.
  
  This work can be freed/reused. And it should only return !NULL
  if twork-func() was not called.
 
 Ah, because we could be iterating the list after the xchg done by run.

I guess we could steal the entire list and requeue it afterwards and
lift TIF_NOTIFY_RESUME along with it.. but I can't say that's pretty.






Re: lockdep trace from posix timers

2012-08-20 Thread Oleg Nesterov
On 08/20, Peter Zijlstra wrote:

 On Mon, 2012-08-20 at 17:10 +0200, Peter Zijlstra wrote:
  On Mon, 2012-08-20 at 16:59 +0200, Oleg Nesterov wrote:
   On 08/20, Peter Zijlstra wrote:
   
 task_work_cancel(struct task_struct *task, task_work_func_t func)
 {
...
+
+again:
+   workp = &task->task_works;
+   work = *workp;
+   while (work) {
+   if (work->func == func) {
  
   But this all can race with task_work_run() if task != current.
  
   This work can be freed/reused. And it should only return !NULL
   if twork-func() was not called.
 
  Ah, because we could be iterating the list after the xchg done by run.

 I guess we could steal the entire list and requeue it afterwards and
 lift TIF_NOTIFY_RESUME along with it..

We can't. This can race with exit_task_work(). And this can break
fput(), the task can return to the userspace without __fput().

 but I can't say that's pretty.

Yes ;)

Oleg.



Re: lockdep trace from posix timers

2012-08-20 Thread Oleg Nesterov
On 08/20, Peter Zijlstra wrote:

 On Mon, 2012-08-20 at 17:05 +0200, Oleg Nesterov wrote:
 
  Yes, we can add the explicit argument to __task_work_run(), but it can
  check PF_EXITING instead, this looks simpler to me.

 I guess we could.. but I thought the explicit callback was simpler ;-)

I won't insist. The patch I sent uses PF_EXITING and the fake
struct callback_head* TWORK_EXITED, but this looks almost the same.

  Note also your patch breaks fifo, but this is fixable.

 Why do you care about the order?

IMHO, this is just more natural.

For example. keyctl_session_to_parent() does _cancel only to protect
from exploits doing keyctl(KEYCTL_SESSION_TO_PARENT) in an endless
loop. It could simply do task_work_add(), but in this case we need
fifo for correctness.

 Iterating a single linked queue in fifo
 seems more expensive than useful.

Currently the list is fifo (we add to the last element), this is O(1).

But the list should be short, we can reverse it in _run() if we change
task_work_add() to add to the head.

Oleg.



Re: lockdep trace from posix timers

2012-08-20 Thread Oleg Nesterov
On 08/20, Peter Zijlstra wrote:

 Also, can't task_work use llist stuff? That would also avoid using
 ->pi_lock.

BTW, do you think it is really important to try to avoid -pi_lock?

Oleg.



Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 17:32 +0200, Oleg Nesterov wrote:

  I guess we could steal the entire list and requeue it afterwards and
  lift TIF_NOTIFY_RESUME along with it..
 
 We can't. This can race with exit_task_work(). And this can break
 fput(), the task can return to the userspace without __fput().

So we could put that spinlock back around cancel and run and leave add
lockless. That'd solve my immediate problem but its not something I'm
proud of.

All schemes I can come up with end up being broken or effectively the
same as the above proposal.


Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 17:43 +0200, Oleg Nesterov wrote:
 BTW, do you think it is really important to try to avoid ->pi_lock?

It is for me, I'm doing task_work_add() from under rq->lock.. :/ I could
fudge around and delay, but not having to would be nice.


Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 17:41 +0200, Oleg Nesterov wrote:

 I won't insist. The patch I sent uses PF_EXITING and the fake
 struct callback_head* TWORK_EXITED, but this looks almost the same.

Right, I used a fake callback_head because it avoided a few special
cases since its a dereferencable pointer.

   Note also your patch breaks fifo, but this is fixable.
 
  Why do you care about the order?
 
 IMHO, this is just more natural.

Depends on what you're used to I guess ;-) Both RCU and irq_work are
filo, this seems to be the natural way for single linked lists.

 For example. keyctl_session_to_parent() does _cancel only to protect
 from exploits doing keyctl(KEYCTL_SESSION_TO_PARENT) in an endless
 loop. It could simply do task_work_add(), but in this case we need
 fifo for correctness.

I'm not entirely sure I see, not doing the cancel would delay the free
until the executing of key_change_session_keyring()? doing that keyctl()
in an indefinite loop involves going back to userspace, so where's the
resource issue?

Also, I'm not seeing where the FIFO requirement comes from.

  Iterating a single linked queue in fifo
  seems more expensive than useful.
 
 Currently the list is fifo (we add to the last element), this is O(1).

depends on what way you look at the list I guess, with a single linked
list there's only one end you can add to in O(1), so we're calling that
the tail?

 But the list should be short, we can reverse it in _run() if we change
 task_work_add() to add to the head.

Reversing a (single linked) list is O(n^2).. which is indeed doable for
short lists, but why assume its short?


Re: lockdep trace from posix timers

2012-08-20 Thread Oleg Nesterov
On 08/20, Peter Zijlstra wrote:

 On Mon, 2012-08-20 at 17:32 +0200, Oleg Nesterov wrote:

   I guess we could steal the entire list and requeue it afterwards and
   lift TIF_NOTIFY_RESUME along with it..
 
  We can't. This can race with exit_task_work(). And this can break
  fput(), the task can return to the userspace without __fput().

 So we could put that spinlock back around cancel and run and leave add
 lockless. That'd solve my immediate problem but its not something I'm
 proud of.

Which problem?

We can probably use bit_spin_lock() and avoid ->pi_lock.

Of course, we can add the new lock into task_struct, but this is not nice.

Oleg.



Re: lockdep trace from posix timers

2012-08-20 Thread Oleg Nesterov
On 08/20, Peter Zijlstra wrote:

 On Mon, 2012-08-20 at 17:43 +0200, Oleg Nesterov wrote:
  BTW, do you think it is really important to try to avoid ->pi_lock?

 It is for me, I'm doing task_work_add() from under rq->lock.. :/ I could
 fudge around and delay, but not having to would be nice.

Aha, got it.

At first glance, bit_spin_lock() can work. I'll try to make the patches
tomorrow.

Oleg.



Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 17:58 +0200, Oleg Nesterov wrote:
 On 08/20, Peter Zijlstra wrote:
 
  On Mon, 2012-08-20 at 17:32 +0200, Oleg Nesterov wrote:
 
I guess we could steal the entire list and requeue it afterwards and
lift TIF_NOTIFY_RESUME along with it..
  
   We can't. This can race with exit_task_work(). And this can break
   fput(), the task can return to the userspace without __fput().
 
  So we could put that spinlock back around cancel and run and leave add
  lockless. That'd solve my immediate problem but its not something I'm
  proud of.
 
 Which problem?

/me doing task_work_add() from under rq->lock..

 We can probably use bit_spin_lock() and avoid ->pi_lock.

tglx will kill us both for even thinking of bit-spinlocks.

 Of course, we can add the new lock into task_struct, but this is not nice.

If we can limit the lock to cancel/run I'm ok.


Re: lockdep trace from posix timers

2012-08-20 Thread Oleg Nesterov
On 08/20, Peter Zijlstra wrote:

 On Mon, 2012-08-20 at 17:41 +0200, Oleg Nesterov wrote:

  IMHO, this is just more natural.

 Depends on what you're used to I guess ;-)

I have to agree ;)

  For example. keyctl_session_to_parent() does _cancel only to protect
  from exploits doing keyctl(KEYCTL_SESSION_TO_PARENT) in an endless
  loop. It could simply do task_work_add(), but in this case we need
  fifo for correctness.

 I'm not entirely sure I see, not doing the cancel would delay the free
 until the executing of key_change_session_keyring()? doing that keyctl()
 in an indefinite loop involves going back to userspace, so where's the
 resource issue?

But the child does task_work_add(current->parent). IOW, there are 2
different tasks. Now suppose that ->parent sleeps.

 Also, I'm not seeing where the FIFO requirement comes from.

Again, suppose that ->parent sleeps. The last KEYCTL_SESSION_TO_PARENT
request should win. Although I agree, this is not that important.

   Iterating a single linked queue in fifo
   seems more expensive than useful.
 
  Currently the list is fifo (we add to the last element), this is O(1).

 depends on what way you look at the list I guess, with a single linked
 list there's only one end you can add to in O(1), so we're calling that
 the tail?

Sorry, can't understand...

task->task_works points to the last node in the circular single-linked list,
task_work_add() adds the new element after the last one and updates
task->task_works. This is O(1).
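The circular layout Oleg describes can be sketched in a few lines of user-space C (illustrative types and names, not the kernel's): the external pointer tracks the *last* node, so `last->next` is the oldest entry and a FIFO append is O(1).

```c
#include <stddef.h>

struct cnode {
	struct cnode *next;
	int id;
};

/* Append 'n' to a circular singly-linked list whose external pointer
 * '*last' tracks the tail; last->next is the head (oldest entry). */
static void cqueue_add(struct cnode **last, struct cnode *n)
{
	if (*last) {
		n->next = (*last)->next;	/* new tail -> old head */
		(*last)->next = n;
	} else {
		n->next = n;			/* one-element circle */
	}
	*last = n;				/* external pointer = tail */
}

/* Visit nodes oldest-first; record up to 'max' ids, return the count. */
static int cqueue_walk_fifo(struct cnode *last, int *ids, int max)
{
	struct cnode *p;
	int n = 0;

	if (!last)
		return 0;
	p = last->next;				/* head of the circle */
	do {
		if (n < max)
			ids[n] = p->id;
		n++;
		p = p->next;
	} while (p != last->next);
	return n;
}
```

Walking from `last->next` around to `last` yields insertion (FIFO) order without any extra head pointer.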

  But the list should be short, we can reverse it in _run() if we change
  task_work_add() to add to the head.

 Reversing a (single linked) list is O(n^2)..

Hmm. This is O(n). You can simply iterate over this list once, changing
the ->next pointer to point back.
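That single-pass reversal looks like this (a standalone sketch with illustrative names, not kernel code): walk the list once, flipping each ->next pointer to its predecessor, so the cost is O(n), not O(n^2).

```c
#include <stddef.h>

struct lnode {
	struct lnode *next;
	int id;
};

/* In-place, one-pass reversal of a NULL-terminated singly-linked list.
 * Returns the new head (the old tail). */
static struct lnode *list_reverse(struct lnode *head)
{
	struct lnode *prev = NULL;

	while (head) {
		struct lnode *next = head->next;	/* save successor */

		head->next = prev;			/* flip the link */
		prev = head;
		head = next;
	}
	return prev;
}
```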

 which is indeed doable for
 short lists, but why assume its short?

I agree, it is better to not do this.

Oleg.



Re: lockdep trace from posix timers

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 18:10 +0200, Oleg Nesterov wrote:
 task->task_works points to the last node in the circular single-linked list,
 task_work_add() adds the new element after the last one and updates
 task->task_works. This is O(1).

Agreed, the way I was looking at that is: ->task_works points to the
head and we put a new one in front, that too is O(1) ;-)

   But the list should be short, we can reverse it in _run() if we change
   task_work_add() to add to the head.
 
  Reversing a (single linked) list is O(n^2)..
 
 Hmm. This is O(n). You can simply iterate over this list once, changing
 the ->next pointer to point back.

OK, I'm going to stop and step away from the computer now.. clearly I'm
more than useless today :/

But yeah.. that could be done.

Anyway, would taking ->pi_lock over _cancel and _run suffice?


Re: lockdep trace from posix timers

2012-08-20 Thread Oleg Nesterov
On 08/20, Peter Zijlstra wrote:

 Anyway, would taking ->pi_lock over _cancel and _run suffice?

I think yes... But I can't think properly today.

Oleg.



Re: lockdep trace from posix timers

2012-08-17 Thread Oleg Nesterov
On 08/17, Oleg Nesterov wrote:
>
> On 08/16, Peter Zijlstra wrote:
> >
> >  write_lock_irq(&tasklist_lock)
> >  task_lock(parent)  parent->alloc_lock
>
> And this is already wrong. See the comment above task_lock().
>
> > And since it_lock is IRQ-safe and alloc_lock isn't, you've got the IRQ
> > inversion deadlock reported.
>
> Yes. Or, IOW, write_lock(tasklist) is IRQ-safe and thus it can't nest
> with alloc_lock.
>
> > David, Al, anybody want to have a go at fixing this?
>
> I still think that task_work_add() should synchronize with exit_task_work()
> itself and fail if necessary. But I wasn't able to convince Al ;)

And this is my old patch: http://marc.info/?l=linux-kernel&m=134082268721700
It should be re-diffed of course.

Oleg.



Re: lockdep trace from posix timers

2012-08-17 Thread Oleg Nesterov
On 08/16, Peter Zijlstra wrote:
>
>  write_lock_irq(&tasklist_lock)
>  task_lock(parent)  parent->alloc_lock

And this is already wrong. See the comment above task_lock().

> And since it_lock is IRQ-safe and alloc_lock isn't, you've got the IRQ
> inversion deadlock reported.

Yes. Or, IOW, write_lock(tasklist) is IRQ-safe and thus it can't nest
with alloc_lock.

> David, Al, anybody want to have a go at fixing this?

I still think that task_work_add() should synchronize with exit_task_work()
itself and fail if necessary. But I wasn't able to convince Al ;)

Oleg.



Re: lockdep trace from posix timers

2012-08-16 Thread Peter Zijlstra
On Tue, 2012-07-24 at 16:36 -0400, Dave Jones wrote:

> ==
> [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
> 3.5.0+ #122 Not tainted
> --
> trinity-child2/5327 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> blocked:  (tasklist_lock){.+.+..}, instance: 81c05098, at: 
> [] posix_cpu_timer_del+0x2b/0xe0
> 
> and this task is already holding:
> blocked:  (&(_timer->it_lock)->rlock){-.-...}, instance: 
> 880143bce170, at: [] __lock_timer+0x89/0x1f0
> which would create a new lock dependency:
>  (&(_timer->it_lock)->rlock){-.-...} -> (tasklist_lock){.+.+..}
> 
> but this new dependency connects a HARDIRQ-irq-safe lock:

> to a HARDIRQ-irq-unsafe lock:
>  (&(>alloc_lock)->rlock){+.+...}

> other info that might help us debug this:
> 
> Chain exists of:
>   &(_timer->it_lock)->rlock --> tasklist_lock --> &(>alloc_lock)->rlock
> 
>  Possible interrupt unsafe locking scenario:
> 
>CPU0CPU1
>
>   lock(&(>alloc_lock)->rlock);
>local_irq_disable();
>lock(&(_timer->it_lock)->rlock);
>lock(tasklist_lock);
>   
> lock(&(_timer->it_lock)->rlock);
> 
>  *** DEADLOCK ***
> 
> 1 lock on stack by trinity-child2/5327:
>  #0: blocked:  (&(_timer->it_lock)->rlock){-.-...}, instance: 
> 880143bce170, at: [] __lock_timer+0x89/0x1f0


> the dependencies between the lock to be acquired and HARDIRQ-irq-unsafe lock:

>[] lock_acquire+0xad/0x220
>[] _raw_spin_lock+0x46/0x80
>[] keyctl_session_to_parent+0xde/0x490
>[] sys_keyctl+0x6d/0x1a0
>[] system_call_fastpath+0x1a/0x1f

> stack backtrace:
> Pid: 5327, comm: trinity-child2 Not tainted 3.5.0+ #122
> Call Trace:
>  [] check_usage+0x4e4/0x500
>  [] ? native_sched_clock+0x19/0x80
>  [] ? trace_hardirqs_off_caller+0x28/0xd0
>  [] ? native_sched_clock+0x19/0x80
>  [] check_irq_usage+0x5b/0xe0
>  [] __lock_acquire+0xd8a/0x1ae0
>  [] ? __lock_acquire+0x306/0x1ae0
>  [] ? trace_hardirqs_off_caller+0x28/0xd0
>  [] ? lock_release_non_nested+0x175/0x320
>  [] lock_acquire+0xad/0x220
>  [] ? posix_cpu_timer_del+0x2b/0xe0
>  [] _raw_read_lock+0x49/0x80
>  [] ? posix_cpu_timer_del+0x2b/0xe0
>  [] ? __lock_timer+0xd5/0x1f0
>  [] posix_cpu_timer_del+0x2b/0xe0
>  [] sys_timer_delete+0x26/0x100
>  [] system_call_fastpath+0x1a/0x1f


So we have:


 sys_keyctl()
   keyctl_session_to_parent()
 write_lock_irq(&tasklist_lock)
 task_lock(parent)  parent->alloc_lock

VS

  sys_timer_delete()
lock_timer()timer->it_lock
posix_cpu_timer_del()
   read_lock(&tasklist_lock)


Creating:

  timer->it_lock -> tasklist_lock -> task->alloc_lock

And since it_lock is IRQ-safe and alloc_lock isn't, you've got the IRQ
inversion deadlock reported.

The task_lock() in keyctl_session_to_parent() comes from Al who didn't
think it necessary to write a changelog in d35abdb2.

David, Al, anybody want to have a go at fixing this?


Re: lockdep trace from posix timers

2012-08-16 Thread Dave Jones
On Thu, Aug 16, 2012 at 08:54:31PM +0800, Ming Lei wrote:
 > On Sat, Jul 28, 2012 at 12:20 AM, Dave Jones  wrote:
 > > On Tue, Jul 24, 2012 at 04:36:13PM -0400, Dave Jones wrote:
 > >  > Linus tree as of 5fecc9d8f59e765c2a48379dd7c6f5cf88c7d75a
 > >  >
 > >  >  Dave
 > >  >
 > >  > ==
 > >  > [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
 > >  > 3.5.0+ #122 Not tainted
 > >  > --
 > >  > trinity-child2/5327 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
 > >  > blocked:  (tasklist_lock){.+.+..}, instance: 81c05098, at: 
 > > [] posix_cpu_timer_del+0x2b/0xe0
 > >  >
 > >  > and this task is already holding:
 > >  > blocked:  (&(_timer->it_lock)->rlock){-.-...}, instance: 
 > > 880143bce170, at: [] __lock_timer+0x89/0x1f0
 > >  > which would create a new lock dependency:
 > >  >  (&(_timer->it_lock)->rlock){-.-...} -> (tasklist_lock){.+.+..}
 > >  >
 > >  > but this new dependency connects a HARDIRQ-irq-safe lock:
 > >  >  (&(_timer->it_lock)->rlock){-.-...}
 > >  > ... which became HARDIRQ-irq-safe at:
 > >
 > > Shall I start bisecting this ? I can trigger it very easily, but if you
 > > can give me a set of commits to narrow down, it'll speed up the bisection.
 > 
 > It should be a real possible deadlock, could you test the below patch to
 > see if it can fix the warning?

I've not managed to hit it in a while. It seems very dependent upon
specific builds for some reason.  Very strange.

Dave 


Re: lockdep trace from posix timers

2012-08-16 Thread Ming Lei
On Sat, Jul 28, 2012 at 12:20 AM, Dave Jones  wrote:
> On Tue, Jul 24, 2012 at 04:36:13PM -0400, Dave Jones wrote:
>  > Linus tree as of 5fecc9d8f59e765c2a48379dd7c6f5cf88c7d75a
>  >
>  >  Dave
>  >
>  > ==
>  > [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
>  > 3.5.0+ #122 Not tainted
>  > --
>  > trinity-child2/5327 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
>  > blocked:  (tasklist_lock){.+.+..}, instance: 81c05098, at: 
> [] posix_cpu_timer_del+0x2b/0xe0
>  >
>  > and this task is already holding:
>  > blocked:  (&(new_timer->it_lock)->rlock){-.-...}, instance: 
> 880143bce170, at: [] __lock_timer+0x89/0x1f0
>  > which would create a new lock dependency:
>  >  (&(new_timer->it_lock)->rlock){-.-...} -> (tasklist_lock){.+.+..}
>  >
>  > but this new dependency connects a HARDIRQ-irq-safe lock:
>  >  (&(new_timer->it_lock)->rlock){-.-...}
>  > ... which became HARDIRQ-irq-safe at:
>
> Shall I start bisecting this ? I can trigger it very easily, but if you
> can give me a set of commits to narrow down, it'll speed up the bisection.

It should be a real possible deadlock, could you test the below patch to
see if it can fix the warning?

--
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 125cb67..29f6a8e 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -420,7 +420,7 @@ static int posix_cpu_timer_del(struct k_itimer *timer)
 	int ret = 0;
 
 	if (likely(p != NULL)) {
-		read_lock(&tasklist_lock);
+		/* tasklist_lock held already in timer_delete */
 		if (unlikely(p->sighand == NULL)) {
 			/*
 			 * We raced with the reaping of the task.
@@ -435,7 +435,6 @@ static int posix_cpu_timer_del(struct k_itimer *timer)
 			list_del(&timer->it.cpu.entry);
 			spin_unlock(&p->sighand->siglock);
 		}
-		read_unlock(&tasklist_lock);
 
 	if (!ret)
 		put_task_struct(p);
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index 69185ae..222d24c 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -884,15 +884,31 @@ SYSCALL_DEFINE1(timer_delete, timer_t, timer_id)
 	struct k_itimer *timer;
 	unsigned long flags;
 
+	/*
+	 * hold tasklist_lock to protect tsk->sighand which might be
+	 * accessed inside ->timer_del. It should be held before
+	 * timer->it_lock to avoid the below deadlock:
+	 *	CPU0			CPU1
+	 *	lock(tasklist_lock)
+	 *				timer_delete()
+	 *				lock(timer->it_lock)
+	 *				lock(tasklist_lock)
+	 *	<timer interrupt>
+	 *	lock(timer->it_lock)
+	 */
+	read_lock(&tasklist_lock);
 retry_delete:
 	timer = lock_timer(timer_id, &flags);
-	if (!timer)
+	if (!timer) {
+		read_unlock(&tasklist_lock);
 		return -EINVAL;
+	}
 
 	if (timer_delete_hook(timer) == TIMER_RETRY) {
 		unlock_timer(timer, flags);
 		goto retry_delete;
 	}
+	read_unlock(&tasklist_lock);
 
 	spin_lock(&current->sighand->siglock);
 	list_del(&timer->list);


Thanks,
-- 
Ming Lei








Re: lockdep trace from posix timers

2012-07-27 Thread Dave Jones
On Tue, Jul 24, 2012 at 04:36:13PM -0400, Dave Jones wrote:
 > Linus tree as of 5fecc9d8f59e765c2a48379dd7c6f5cf88c7d75a
 > 
 >  Dave
 > 
 > ==
 > [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
 > 3.5.0+ #122 Not tainted
 > --
 > trinity-child2/5327 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
 > blocked:  (tasklist_lock){.+.+..}, instance: 81c05098, at: 
 > [] posix_cpu_timer_del+0x2b/0xe0
 > 
 > and this task is already holding:
 > blocked:  (&(new_timer->it_lock)->rlock){-.-...}, instance: 
 > 880143bce170, at: [] __lock_timer+0x89/0x1f0
 > which would create a new lock dependency:
 >  (&(new_timer->it_lock)->rlock){-.-...} -> (tasklist_lock){.+.+..}
 > 
 > but this new dependency connects a HARDIRQ-irq-safe lock:
 >  (&(new_timer->it_lock)->rlock){-.-...}
 > ... which became HARDIRQ-irq-safe at:

Shall I start bisecting this ? I can trigger it very easily, but if you
can give me a set of commits to narrow down, it'll speed up the bisection.

Dave

