Re: [RFC PATCH 4/9] irq_work: Let the arch tell us about self-IPI support

2012-10-29 Thread Frederic Weisbecker
2012/10/29 Steven Rostedt rost...@goodmis.org:
 On Mon, 2012-10-29 at 14:28 +0100, Frederic Weisbecker wrote:
 This prepares us to make printk work on nohz CPUs
 using irq work.

 -ENOTENOUGHINFO

 Please state how this prepares printk for nohz CPUs using irq_work.

Right, I'll add the details in the next version.


Re: [RFC PATCH 9/9] printk: Wake up klogd using irq_work

2012-10-29 Thread Frederic Weisbecker
2012/10/29 Steven Rostedt rost...@goodmis.org:
 On Mon, 2012-10-29 at 14:28 +0100, Frederic Weisbecker wrote:
 klogd is woken up asynchronously from the tick in order
 to do it safely.

 However if printk is called when the tick is stopped, the reader
 won't be woken up until the next interrupt, which might not fire
 before a while. As a result, the user may miss some message.

 Just a grammar nit (and goes for your previous patch as well). We say
 "might not fire for a while." or you could say "... for some time", but
 not "before a while" ;-)

Right!


 I wonder what the French translation of that is.

You don't really want to know ;)
But as you guess, this is quite close to something like "before a while" ;)


[GIT PULL] cputime: Cleanups and optimizations

2012-10-29 Thread Frederic Weisbecker
Ingo,

Please pull the latest cputime cleanups that can be found at:

  git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git tags/cputime-cleanups-for-mingo

for you to fetch changes up to 3e1df4f506836e6bea1ab61cf88c75c8b1840643:

  cputime: Separate irqtime accounting from generic vtime (2012-10-29 21:31:32 +0100)

It is based on top of today's tip:sched/core.

Tested on x86 and powerpc. Build-tested only on ia64 and s390.

Thanks.


Cputime cleanups and optimizations:

* Gather vtime headers that were a bit scattered around

* Separate irqtime and vtime namespaces that were
colliding, resulting in useless calls to irqtime accounting.

* Slightly optimize irq and guest vtime accounting.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com


Frederic Weisbecker (5):
  vtime: Gather vtime declarations to their own header file
  vtime: Make vtime_account_system() irqsafe
  kvm: Directly account vtime to system on guest switch
  cputime: Specialize irq vtime hooks
  cputime: Separate irqtime accounting from generic vtime

 arch/ia64/kernel/time.c |8 
 arch/powerpc/kernel/time.c  |4 ++--
 arch/s390/kernel/vtime.c|4 
 arch/s390/kvm/kvm-s390.c|4 
 include/linux/hardirq.h |   15 +++---
 include/linux/kernel_stat.h |9 +
 include/linux/kvm_host.h|   12 +--
 include/linux/vtime.h   |   47 +++
 kernel/sched/cputime.c  |   20 +-
 kernel/softirq.c|6 +++---
 10 files changed, 89 insertions(+), 40 deletions(-)
 create mode 100644 include/linux/vtime.h


[PATCH 1/2] irq_work: Fix racy check on work pending flag

2012-10-30 Thread Frederic Weisbecker
Work claiming semantics require this operation
to be SMP-safe.

So we want strict ordering between the data we
want the work to handle and the flags that describe
the work state, such that either we claim and enqueue
the work that will see our data, or we fail the claim
but the CPU where the work is still pending will
see the data we prepared when it executes the work:

CPU 0                       CPU 1

data = something            claim work
smp_mb()                    smp_mb()
try claim work              execute work (data)

The early check for the pending flag in irq_work_claim()
fails to meet this ordering requirement. As a result,
when we fail to claim a work, it may already have been
processed by the CPU that previously owned it, leaving
our data unprocessed.

Discussing this with Steven Rostedt, we can use memory
barriers to fix this, or we can rely on cmpxchg(), which
already implies a full barrier.

To fix this, we start by speculating about the value we
wish to see in work->flags, but we only draw any conclusion
from the value returned by the cmpxchg() call, which either
claims the work or provides the ordering such that the CPU
where the work is pending handles our data.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 kernel/irq_work.c |   22 +-
 1 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 1588e3b..764240a 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -34,15 +34,27 @@ static DEFINE_PER_CPU(struct llist_head, irq_work_list);
  */
 static bool irq_work_claim(struct irq_work *work)
 {
-	unsigned long flags, nflags;
+	unsigned long flags, oflags, nflags;
 
+	/*
+	 * Can't check IRQ_WORK_PENDING bit right now because the work
+	 * can be running on another CPU and we need correct ordering
+	 * between work flags and data to compute in work execution
+	 * such that either we claim and we execute the work or the work
+	 * is still pending on another CPU but it's guaranteed it will see
+	 * our data when it executes the work.
+	 * Start with our best wish as a premise but only deal with
+	 * flags value for real after cmpxchg() ordering.
+	 */
+	flags = work->flags & ~IRQ_WORK_PENDING;
 	for (;;) {
-		flags = work->flags;
-		if (flags & IRQ_WORK_PENDING)
-			return false;
 		nflags = flags | IRQ_WORK_FLAGS;
-		if (cmpxchg(&work->flags, flags, nflags) == flags)
+		oflags = cmpxchg(&work->flags, flags, nflags);
+		if (oflags == flags)
 			break;
+		if (oflags & IRQ_WORK_PENDING)
+			return false;
+		flags = oflags;
 		cpu_relax();
 	}
 
-- 
1.7.5.4



[PATCH 2/2] irq_work: Fix racy IRQ_WORK_BUSY flag setting

2012-10-30 Thread Frederic Weisbecker
The IRQ_WORK_BUSY flag is set right before we execute the
work. Once this flag value is set, the work enters a
claimable state again.

This is necessary because if we want to enqueue a work but we
fail the claim, we want to ensure that the CPU where that work
is still pending will see and handle the data we expected the
work to compute.

This might not work as expected, though, because IRQ_WORK_BUSY
isn't set atomically. By the time a CPU fails a work claim,
this work may well have been executed already by the CPU where
it was previously pending.

Due to the lack of an appropriate memory barrier, the IRQ_WORK_BUSY
flag value may not be visible to the CPU trying to claim while
the work is executing, at least not until we clear the busy bit
in the work flags using cmpxchg(), which implies a full barrier.

One solution could involve a full barrier between setting
IRQ_WORK_BUSY flag and the work execution. This way we
ensure that the work execution site sees the expected data
and the claim site sees the IRQ_WORK_BUSY:

CPU 0                                 CPU 1

data = something                      flags = IRQ_WORK_BUSY
smp_mb() (implicit with cmpxchg       smp_mb()
         on flags in claim)           execute_work (sees data from CPU 0)
try to claim

As a shortcut, let's just use xchg(), which implies a full memory
barrier.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 kernel/irq_work.c |7 +--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 764240a..ea79365 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -130,9 +130,12 @@ void irq_work_run(void)
 
 		/*
 		 * Clear the PENDING bit, after this point the @work
-		 * can be re-used.
+		 * can be re-used. Use xchg to force ordering against
+		 * data to process, such that if claiming fails on
+		 * another CPU, we see and handle the data it wants
+		 * us to process on the work.
 		 */
-		work->flags = IRQ_WORK_BUSY;
+		xchg(&work->flags, IRQ_WORK_BUSY);
 		work->func(work);
 		/*
 		 * Clear the BUSY bit and return to the free state if
-- 
1.7.5.4



[PATCH 0/2] irq_work: A couple fixes

2012-10-30 Thread Frederic Weisbecker
Hi,

The first patch is extracted from my printk patches, with changelog
reworked. The second patch is an addition.

And I still wonder if cpu_relax() is enough to ensure the compiler
correctly reloads work->flags in the irq_work_sync() loop.
Do we need ACCESS_ONCE()?
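
For reference, the loop in question looks roughly like this (a sketch
of kernel/irq_work.c from that era, not part of these patches):

void irq_work_sync(struct irq_work *work)
{
	WARN_ON_ONCE(irqs_disabled());

	/* Busy-wait until the work stops executing; the question above
	 * is whether cpu_relax() forces work->flags to be re-read. */
	while (work->flags & IRQ_WORK_BUSY)
		cpu_relax();
}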

Thanks.

Frederic Weisbecker (2):
  irq_work: Fix racy check on work pending flag
  irq_work: Fix racy IRQ_WORK_BUSY flag setting

 kernel/irq_work.c |   29 ++---
 1 files changed, 22 insertions(+), 7 deletions(-)

-- 
1.7.5.4



Re: [PATCH 0/2] irq_work: A couple fixes

2012-10-30 Thread Frederic Weisbecker
2012/10/30 Steven Rostedt rost...@goodmis.org:
 On Tue, 2012-10-30 at 16:34 +0100, Frederic Weisbecker wrote:
 Hi,

 And I still wonder if cpu_relax() is enough to prevent the compiler
 from correctly reloading work-flags in irq_work_sync() loop.
 Do we need ACCESS_ONCE()?

 You mean this loop:

	flags = work->flags & ~IRQ_WORK_PENDING;
	for (;;) {
		nflags = flags | IRQ_WORK_FLAGS;
		oflags = cmpxchg(&work->flags, flags, nflags);
		if (oflags == flags)
			break;
		if (oflags & IRQ_WORK_PENDING)
			return false;
		flags = oflags;
		cpu_relax();
	}

 After the first loading of work->flags, you are worried about the
 work->flags in the cmpxchg()?  The cmpxchg() will handle that itself. I
 don't see any place that an ACCESS_ONCE() is required here. The cmpxchg()
 acts on the address of work->flags, the compiler isn't involved with the
 value at that address.

No, I was worried about the cpu_relax() in irq_work_sync().


Re: [PATCH 0/2] irq_work: A couple fixes

2012-10-30 Thread Frederic Weisbecker
2012/10/30 Steven Rostedt rost...@goodmis.org:
 On Tue, 2012-10-30 at 17:25 +0100, Frederic Weisbecker wrote:

 No I was worried about the cpu_relax() in irq_work_sync()

 That one is fine too, as this is the purpose of cpu_relax(). Not only to
 relax the cpu, but also to tell gcc that the loop needs to be reread.

Ok, should be fine then.
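
For the record, this works because the x86 cpu_relax() of the time
boiled down to something like the snippet below (a sketch from
arch/x86/include/asm/processor.h); the "memory" clobber is a compiler
barrier, which is what forces gcc to re-read work->flags on each
iteration:

/* REP;NOP (PAUSE): be friendly to the sibling hyperthread in busy loops. */
static inline void rep_nop(void)
{
	/* The "memory" clobber makes gcc assume memory may have changed,
	 * so cached values like work->flags get reloaded. */
	asm volatile("rep; nop" ::: "memory");
}

#define cpu_relax()	rep_nop()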

Thanks!


Re: [PATCH 04/32] x86: New cpuset nohz irq vector

2012-10-30 Thread Frederic Weisbecker
2012/10/30 Steven Rostedt rost...@goodmis.org:
 On Mon, 2012-10-29 at 16:27 -0400, Steven Rostedt wrote:
 plain text document attachment
 (0004-x86-New-cpuset-nohz-irq-vector.patch)
 From: Frederic Weisbecker fweis...@gmail.com

 We need a way to send an IPI (remote or local) in order to
 asynchronously restart the tick for CPUs in nohz adaptive mode.

 This must be asynchronous such that we can trigger it with irqs
 disabled. This must be usable as a self-IPI as well, for example
 in cases where we want to avoid a random deadlock scenario while
 restarting the tick inline otherwise.

 This only settles the x86 backend. The core tick restart function
 will be defined in a later patch.

 [CHECKME: Perhaps we instead need to use irq work for self IPIs.
 But we also need a way to send async remote IPIs.]

 Probably just use irq_work for self ipis, and normal ipis for other
 CPUs.

Right. And that's one more reason why we want to know if the arch
implements irq work with self ipis or not. If the arch can't, then we
just don't stop the tick.
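
A minimal sketch of what such an arch capability could look like (all
names here are assumptions for illustration, not the actual RFC patch):

/*
 * Hypothetical: the arch says whether irq_work_queue() is backed by a
 * real self-IPI.  CONFIG_ARCH_HAS_IRQ_WORK_SELF_IPI is an assumed
 * symbol that an arch would select.
 */
static inline bool arch_irq_work_has_self_ipi(void)
{
#ifdef CONFIG_ARCH_HAS_IRQ_WORK_SELF_IPI
	return true;
#else
	/* irq_work only runs from the tick here: keep the tick running. */
	return false;
#endif
}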

 Also, what reason do we have to force a task out of nohz? IOW, do we
 really need this?

When a posix CPU timer is enqueued, when a new task is enqueued, etc...


 Also, perhaps we could just tag onto the schedule_ipi() function instead
 of having to create a new IPI for all archs?

irq work should be just fine. No need to add more overhead on the
schedule ipi I think.


Re: [PATCH 10/32] nohz/cpuset: Restart the tick if printk needs it

2012-10-30 Thread Frederic Weisbecker
2012/10/30 Steven Rostedt rost...@goodmis.org:
 Probably need to at least disable preemption. I don't see any
 requirement that wake_up_klogd() needs to be called with preemption
 disabled.

 The this_cpu_or() doesn't care which CPU it triggers, but the enabling
 of nohz does.

This patch is due to be replaced by the printk-in-nohz patchset
I'm working on. But it indeed needs to disable preemption as well,
and its irq work should be made per-CPU.
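
A rough sketch of what the per-CPU variant could look like (a
hypothetical shape only; it assumes klogd sleeps on log_wait and is
not the actual patch):

static void wake_up_klogd_work_func(struct irq_work *irq_work)
{
	/* Runs from the irq_work interrupt: a safe context to wake klogd. */
	wake_up_interruptible(&log_wait);
}

static DEFINE_PER_CPU(struct irq_work, wake_up_klogd_work) = {
	.func = wake_up_klogd_work_func,
};

void wake_up_klogd(void)
{
	/* Preemption disabled so the enqueue and the self-IPI hit this CPU. */
	preempt_disable();
	if (waitqueue_active(&log_wait))
		irq_work_queue(&__get_cpu_var(wake_up_klogd_work));
	preempt_enable();
}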


Re: [PATCH 2/2] irq_work: Fix racy IRQ_WORK_BUSY flag setting

2012-10-30 Thread Frederic Weisbecker
2012/10/30 anish kumar anish198519851...@gmail.com:
 As I understand without the memory barrier proposed by you the situation
 would be as below:
 CPU 0                                 CPU 1

 data = something                      flags = IRQ_WORK_BUSY
 smp_mb() (implicit with cmpxchg       execute_work (sees data from CPU 0)
          on flags in claim)
 _success_ in claiming and goes
 ahead and execute the work (wrong?)
                                       cmpxchg cause flag to IRQ_WORK_BUSY

 Now knows the flag == IRQ_WORK_BUSY

 Am I right?

(Adding Paul in Cc because I'm again confused with memory barriers)

Actually what I had in mind is rather that CPU 0 fails its claim
because it's not seeing the IRQ_WORK_BUSY flag as it should:


CPU 0                       CPU 1

data = something            flags = IRQ_WORK_BUSY
cmpxchg() for claim         execute_work (sees data from CPU 0)

CPU 0 should see IRQ_WORK_BUSY, but it may not, because CPU 1 sets
this value in a non-atomic way.

Also, while browsing Paul's perfbook, I realize my changelog is buggy.
It seems we can't reliably use memory barriers here because we would
be in the following case:

CPU 0                   CPU 1

store(work data)        store(flags)
smp_mb()                smp_mb()
load(flags)             load(work data)

Even with this barrier pairing, we can't make the assumption that, for
example, if CPU 1 sees the work data stored by CPU 0 then CPU 0 sees
the flags stored by CPU 1.

So now I wonder if cmpxchg() can give us more confidence:


CPU 0                           CPU 1

store(work data)                xchg(flags, IRQ_WORK_BUSY)
cmpxchg(flags, IRQ_WORK_FLAGS)  load(work data)

Can I make this assumption?

- If CPU 0 fails the cmpxchg() (which means CPU 1 has not yet xchg())
then CPU 1 will execute the work and see our data.

At least the cmpxchg()/xchg() pair orders correctly enough to ensure
somebody will execute our work. Some locking is probably still needed
in the work function itself if it's not per-CPU.


 Probably a stupid question. Why do we return the bool from irq_work_queue
 when no one bothers to check the return value? Wouldn't it be better if
 this function were void, as used by the users of this function, or am I
 looking at the wrong code.

No idea. Probably Peter had plans there.


Re: [PATCH 04/32] x86: New cpuset nohz irq vector

2012-10-30 Thread Frederic Weisbecker
2012/10/31 Steven Rostedt rost...@goodmis.org:
 On Wed, 2012-10-31 at 00:51 +0100, Frederic Weisbecker wrote:

  Probably just use irq_work for self ipis, and normal ipis for other
  CPUs.

 Right. And that's one more reason why we want to know if the arch
 implements irq work with self ipis or not. If the arch can't, then we
 just don't stop the tick.

 We can just allow certain archs to have cpuset/nohz. Make it depend on
 features that you want (or makes nohz easier to implement).

Right.


  Also, what reason do we have to force a task out of nohz? IOW, do we
  really need this?

 When a posix CPU timer is enqueued, when a new task is enqueued, etc...

 I was thinking about something other than itself. That is, who would
 enqueue a posix cpu timer on the cpu other than the task running with
 nohz on that cpu?

If the posix CPU timer is process wide (i.e. spans the whole thread group), this can happen.

 A new task would send the schedule ipi too. Which would enqueue the task
 and take the cpu out of nohz, no?

Not if it's enqueued locally. And in this case we don't want to
restart the tick from the ttwu path, in order to avoid a funny locking
scenario. So a self-IPI would do the trick.

 irq work should be just fine. No need to add more overhead on the
 schedule ipi I think.

 irq_work can send the work to another CPU right? This part I wasn't sure
 about.

Claiming a work itself can be a cross-CPU competition: multiple CPUs
may want to queue the work at the same time; only one should succeed.
Once claimed, though, the work can only be enqueued locally.
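
For reference, the queueing path of the time looked roughly like this
(a slightly abridged sketch of kernel/irq_work.c circa 3.6): the claim
is the cross-CPU cmpxchg() race, while the enqueue itself always
targets the local per-CPU list.

bool irq_work_queue(struct irq_work *work)
{
	/* Cross-CPU race: only one CPU wins the cmpxchg()-based claim. */
	if (!irq_work_claim(work))
		return false;	/* already pending somewhere */

	/* Local part: the winner enqueues on its own CPU's list. */
	preempt_disable();
	if (llist_add(&work->llnode, &__get_cpu_var(irq_work_list)))
		arch_irq_work_raise();	/* list was empty: raise the IPI */
	preempt_enable();

	return true;
}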


Re: [PATCH 2/2] irq_work: Fix racy IRQ_WORK_BUSY flag setting

2012-10-30 Thread Frederic Weisbecker
2012/10/31 Steven Rostedt rost...@goodmis.org:
 More confidence over what? The xchg()? They are equivalent (wrt memory
 barriers).

 Here's the issue that currently exists. Let's look at the code:


 /*
  * Claim the entry so that no one else will poke at it.
  */
 static bool irq_work_claim(struct irq_work *work)
 {
 	unsigned long flags, nflags;
 
 	for (;;) {
 		flags = work->flags;
 		if (flags & IRQ_WORK_PENDING)
 			return false;
 		nflags = flags | IRQ_WORK_FLAGS;
 		if (cmpxchg(&work->flags, flags, nflags) == flags)
 			break;
 		cpu_relax();
 	}
 
 	return true;
 }

 and

 	llnode = llist_del_all(this_list);
 	while (llnode != NULL) {
 		work = llist_entry(llnode, struct irq_work, llnode);
 
 		llnode = llist_next(llnode);
 
 		/*
 		 * Clear the PENDING bit, after this point the @work
 		 * can be re-used.
 		 */
 		work->flags = IRQ_WORK_BUSY;
 		work->func(work);
 		/*
 		 * Clear the BUSY bit and return to the free state if
 		 * no-one else claimed it meanwhile.
 		 */
 		(void)cmpxchg(&work->flags, IRQ_WORK_BUSY, 0);
 	}

 The irq_work_claim() will only queue its work if it's not already
 pending. If it is pending, then someone is going to process it anyway.
 But once we start running the function, new work needs to be processed
 again.

 Thus we have:

 CPU 1                                   CPU 2
 -----                                   -----
 (flags = 0)
 cmpxchg(flags, 0, IRQ_WORK_FLAGS)
 (flags = 3)
 [...]
                                         if (flags & IRQ_WORK_PENDING)
                                                 return false
 flags = IRQ_WORK_BUSY
 (flags = 2)
 func()

 The above shows the normal case, where CPU 2 doesn't need to queue work
 because it's going to be done for it by CPU 1. But...



 CPU 1                                   CPU 2
 -----                                   -----
 (flags = 0)
 cmpxchg(flags, 0, IRQ_WORK_FLAGS)
 (flags = 3)
 [...]
 flags = IRQ_WORK_BUSY
 (flags = 2)
 func()
                                         (sees flags = 3)
                                         if (flags & IRQ_WORK_PENDING)
                                                 return false
 cmpxchg(flags, 2, 0);
 (flags = 0)


 Here, because we did not do a memory barrier after
 flags = IRQ_WORK_BUSY, CPU 2 saw stale data and did not queue its work,
 and missed the opportunity. Now if you had this fix with the xchg() as
 you have in your patch, then CPU 2 would not see the stale flags.
 Except, with the code I showed above it still can!

 CPU 1                                   CPU 2
 -----                                   -----
 (flags = 0)
 cmpxchg(flags, 0, IRQ_WORK_FLAGS)
 (flags = 3)
 [...]
                                         (fetch flags)
 xchg(flags, IRQ_WORK_BUSY)
 (flags = 2)
 func()
                                         (sees flags = 3)
                                         if (flags & IRQ_WORK_PENDING)
                                                 return false
 cmpxchg(flags, 2, 0);
 (flags = 0)


 Even with the update of xchg(), if CPU2 fetched the flags before CPU1
 did the xchg, then it would still lose out. But that's where your
 previous patch comes in that does:

	flags = work->flags & ~IRQ_WORK_PENDING;
	for (;;) {
		nflags = flags | IRQ_WORK_FLAGS;
		oflags = cmpxchg(&work->flags, flags, nflags);
		if (oflags == flags)
			break;
		if (oflags & IRQ_WORK_PENDING)
			return false;
		flags = oflags;
		cpu_relax();
	}


 This now does:

 CPU 1                                   CPU 2
 -----                                   -----
 (flags = 0)
 cmpxchg(flags, 0, IRQ_WORK_FLAGS)
 (flags = 3)
 [...]
 xchg(flags, IRQ_WORK_BUSY)
 (flags = 2)
 func()
                                         oflags = cmpxchg(flags, flags, nflags);
                                         (sees flags = 2)
                                         if (flags & IRQ_WORK_PENDING)
                                                 (not true)
                                         (loop)
 cmpxchg(flags, 2, 0);
 (flags = 2)
                                         flags = 

Re: [PATCH 2/2] irq_work: Fix racy IRQ_WORK_BUSY flag setting

2012-10-31 Thread Frederic Weisbecker
2012/10/31 Steven Rostedt rost...@goodmis.org:
 On Wed, 2012-10-31 at 20:04 +0900, anish kumar wrote:
 nflags = 1 | 3
 nflags = 2 | 3
 In both cases the result would be the same. If I am right, then wouldn't
 this operation be redundant?

 Right. Actually we could change the new loop to:

 	for (;;) {
 		oflags = cmpxchg(&work->flags, flags, IRQ_WORK_FLAGS);
 		if (oflags == flags)
 			break;
 		if (oflags & IRQ_WORK_PENDING)
 			return false;
 		flags = oflags;
 		cpu_relax();
 	}

We could. But I wanted to keep the code able to handle new flags in
the future (such as IRQ_WORK_LAZY).
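
To make that concrete, a small illustration (IRQ_WORK_LAZY did not
exist yet at this point; its value below is an assumption):

#define IRQ_WORK_PENDING	1UL
#define IRQ_WORK_BUSY		2UL
#define IRQ_WORK_FLAGS		3UL	/* PENDING | BUSY */
#define IRQ_WORK_LAZY		4UL	/* hypothetical future flag */

/*
 * With nflags = flags | IRQ_WORK_FLAGS, claiming a lazy work turns
 * flags 4 into 4 | 3 = 7: the extra bit survives the claim.  The
 * shortcut cmpxchg(&work->flags, flags, IRQ_WORK_FLAGS) would install
 * 3 instead, silently dropping the LAZY bit.
 */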

 Frederic,

 Would you like to add my explanation to your change log? You can add the
 entire thing, which I think would explain a lot to people.

It's indeed a very clear explanation. I'll put that in the changelog, thanks!


Re: [PATCH tip/core/rcu 1/2] rcu: Add callback-free CPUs

2012-10-31 Thread Frederic Weisbecker
2012/10/31 Paul E. McKenney paul...@linux.vnet.ibm.com:
 +/*
 + * Per-rcu_data kthread, but only for no-CBs CPUs.  Each kthread invokes
 + * callbacks queued by the corresponding no-CBs CPU.
 + */
 +static int rcu_nocb_kthread(void *arg)
 +{
 +	int c, cl;
 +	struct rcu_head *list;
 +	struct rcu_head *next;
 +	struct rcu_head **tail;
 +	struct rcu_data *rdp = arg;
 +
 +	/* Each pass through this loop invokes one batch of callbacks */
 +	for (;;) {
 +		/* If not polling, wait for next batch of callbacks. */
 +		if (!rcu_nocb_poll)
 +			wait_event(rdp->nocb_wq, rdp->nocb_head);
 +		list = ACCESS_ONCE(rdp->nocb_head);
 +		if (!list) {
 +			schedule_timeout_interruptible(1);
 +			continue;
 +		}
 +
 +		/*
 +		 * Extract queued callbacks, update counts, and wait
 +		 * for a grace period to elapse.
 +		 */
 +		ACCESS_ONCE(rdp->nocb_head) = NULL;
 +		tail = xchg(&rdp->nocb_tail, &rdp->nocb_head);
 +		c = atomic_long_xchg(&rdp->nocb_q_count, 0);
 +		cl = atomic_long_xchg(&rdp->nocb_q_count_lazy, 0);
 +		ACCESS_ONCE(rdp->nocb_p_count) += c;
 +		ACCESS_ONCE(rdp->nocb_p_count_lazy) += cl;
 +		wait_rcu_gp(rdp->rsp->call_remote);
 +
 +		/* Each pass through the following loop invokes a callback. */
 +		trace_rcu_batch_start(rdp->rsp->name, cl, c, -1);
 +		c = cl = 0;
 +		while (list) {
 +			next = list->next;
 +			/* Wait for enqueuing to complete, if needed. */
 +			while (next == NULL && list->next != tail) {
 +				schedule_timeout_interruptible(1);
 +				next = list->next;
 +			}
 +			debug_rcu_head_unqueue(list);
 +			local_bh_disable();
 +			if (__rcu_reclaim(rdp->rsp->name, list))
 +				cl++;
 +			c++;
 +			local_bh_enable();
 +			list = next;
 +		}
 +		trace_rcu_batch_end(rdp->rsp->name, c, !!list, 0, 0, 1);
 +		ACCESS_ONCE(rdp->nocb_p_count) -= c;
 +		ACCESS_ONCE(rdp->nocb_p_count_lazy) -= cl;
 +		rdp->n_cbs_invoked += c;
 +	}
 +	return 0;
 +}
 +
 +/* Initialize per-rcu_data variables for no-CBs CPUs. */
 +static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
 +{
 +	rdp->nocb_tail = &rdp->nocb_head;
 +	init_waitqueue_head(&rdp->nocb_wq);
 +}
 +
 +/* Create a kthread for each RCU flavor for each no-CBs CPU. */
 +static void __init rcu_spawn_nocb_kthreads(struct rcu_state *rsp)
 +{
 +	int cpu;
 +	struct rcu_data *rdp;
 +	struct task_struct *t;
 +
 +	if (rcu_nocb_mask == NULL)
 +		return;
 +	for_each_cpu(cpu, rcu_nocb_mask) {
 +		rdp = per_cpu_ptr(rsp->rda, cpu);
 +		t = kthread_run(rcu_nocb_kthread, rdp, "rcuo%d", cpu);

Sorry, I think I left my brain in the middle of the diff. But there is
something I'm misunderstanding, I think. Here you're creating one
rcu_nocb_kthread per no-CBs CPU. Looking at the code of
rcu_nocb_kthread(), it seems to execute the callbacks with
__rcu_reclaim().

So, in the end, the no-callbacks CPUs execute their callbacks. Isn't
that the opposite of what is expected? (Again, just referring to my
misunderstanding.)

Thanks.

 +		BUG_ON(IS_ERR(t));
 +		ACCESS_ONCE(rdp->nocb_kthread) = t;
 +	}
 +}


[PATCH 0/2] irq_work: A couple fixes v2

2012-11-02 Thread Frederic Weisbecker
Hey,

After some discussion with Steve, this is a respin with changelogs and
comments sanitized. The code itself hasn't changed.

Thanks.

Frederic Weisbecker (2):
  irq_work: Fix racy IRQ_WORK_BUSY flag setting
  irq_work: Fix racy check on work pending flag

 kernel/irq_work.c |   21 +++--
 1 files changed, 15 insertions(+), 6 deletions(-)

-- 
1.7.5.4



[PATCH 1/2] irq_work: Fix racy IRQ_WORK_BUSY flag setting

2012-11-02 Thread Frederic Weisbecker
The IRQ_WORK_BUSY flag is set right before we execute the
work. Once this flag value is set, the work enters a
claimable state again.

So if we have specific data to compute in our work, we ensure it's
either handled by another CPU or locally, by enqueuing the work again.
This state machine is guaranteed by atomic operations on the flags.

So when we set IRQ_WORK_BUSY without using an xchg-like operation,
we break this guarantee as in the following summarized scenario:

CPU 1                                   CPU 2
-----                                   -----
                                        (flags = 0)
                                        old_flags = flags;
(flags = 0)
cmpxchg(flags, old_flags,
        old_flags | IRQ_WORK_FLAGS)
(flags = 3)
[...]
flags = IRQ_WORK_BUSY
(flags = 2)
func()
                                        (sees flags = 3)
                                        cmpxchg(flags, old_flags,
                                                old_flags | IRQ_WORK_FLAGS)
                                        (give up)

cmpxchg(flags, 2, 0);
(flags = 0)

CPU 1 claims a work and executes it, so it sets IRQ_WORK_BUSY and
the work is again in a claimable state. Now CPU 2 has new data to
process and tries to claim that work, but it may see a stale value of
the flags and think the work is still pending somewhere that will
handle our data. This is because CPU 1 doesn't set IRQ_WORK_BUSY
atomically.

As a result, the data expected to be handled by CPU 2 won't get
handled.

To fix this, use xchg() to set IRQ_WORK_BUSY; this way we ensure that
CPU 2 will see the correct value, with cmpxchg() providing the expected
ordering.

Changelog-heavily-inspired-by: Steven Rostedt rost...@goodmis.org
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Anish Kumar anish198519851...@gmail.com
---
 kernel/irq_work.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 1588e3b..57be1a6 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -119,8 +119,11 @@ void irq_work_run(void)
 		/*
 		 * Clear the PENDING bit, after this point the @work
 		 * can be re-used.
+		 * Make it immediately visible so that other CPUs trying
+		 * to claim that work don't rely on us to handle their data
+		 * while we are in the middle of the func.
 		 */
-		work->flags = IRQ_WORK_BUSY;
+		xchg(&work->flags, IRQ_WORK_BUSY);
 		work->func(work);
 		/*
 		 * Clear the BUSY bit and return to the free state if
-- 
1.7.5.4



[PATCH 2/2] irq_work: Fix racy check on work pending flag

2012-11-02 Thread Frederic Weisbecker
Work claiming wants to be SMP-safe.

And by the time we try to claim a work, if it is already executing
concurrently on another CPU, we want to succeed the claim and queue
the work again, because the other CPU may have missed the data we
wanted to handle in our work if it's about to complete there.

This scenario is summarized below:

CPU 1                                   CPU 2
-----                                   -----
(flags = 0)
cmpxchg(flags, 0, IRQ_WORK_FLAGS)
(flags = 3)
[...]
xchg(flags, IRQ_WORK_BUSY)
(flags = 2)
func()
                                        if (flags & IRQ_WORK_PENDING)
                                                (not true)
                                        cmpxchg(flags, flags, IRQ_WORK_FLAGS)
                                        (flags = 3)
[...]
cmpxchg(flags, IRQ_WORK_BUSY, 0);
(fail, pending on CPU 2)

This state machine is synchronized using [cmp]xchg() on the flags.
As such, the early IRQ_WORK_PENDING check on CPU 2 above is racy.
By the time we check it, we may be dealing with a stale value because
we aren't using an atomic accessor. As a result, CPU 2 may see
that the work is still pending on another CPU while that CPU may
actually already be completing the work function execution, leaving
our data unprocessed.

To fix this, we start by speculating about the value we wish to see
in work->flags, but we only draw any conclusion from the value
returned by the cmpxchg() call, which either claims the work or lets
the current owner handle the pending work for us.

Changelog-heavily-inspired-by: Steven Rostedt rost...@goodmis.org
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Anish Kumar anish198519851...@gmail.com
---
 kernel/irq_work.c |   16 +++-
 1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 57be1a6..64eddd5 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -34,15 +34,21 @@ static DEFINE_PER_CPU(struct llist_head, irq_work_list);
  */
 static bool irq_work_claim(struct irq_work *work)
 {
-	unsigned long flags, nflags;
+	unsigned long flags, oflags, nflags;
 
+	/*
+	 * Start with our best wish as a premise but only trust any
+	 * flag value after cmpxchg() result.
+	 */
+	flags = work->flags & ~IRQ_WORK_PENDING;
 	for (;;) {
-		flags = work->flags;
-		if (flags & IRQ_WORK_PENDING)
-			return false;
 		nflags = flags | IRQ_WORK_FLAGS;
-		if (cmpxchg(&work->flags, flags, nflags) == flags)
+		oflags = cmpxchg(&work->flags, flags, nflags);
+		if (oflags == flags)
 			break;
+		if (oflags & IRQ_WORK_PENDING)
+			return false;
+		flags = oflags;
 		cpu_relax();
 	}
 
-- 
1.7.5.4



Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Frederic Weisbecker
2012/9/22 Paul E. McKenney paul...@linux.vnet.ibm.com:
 On Fri, Sep 21, 2012 at 01:31:49PM -0700, Tony Lindgren wrote:
 * Paul E. McKenney paul...@linux.vnet.ibm.com [120921 12:58]:
 
  Just to make sure I understand the combinations:
 
  o   All stalls have happened when running a minimal userspace.
  o   CONFIG_NO_HZ=n suppresses the stalls.
  o   CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
  no observable effect on the stalls.

 The reason why you may need minimal userspace is to cut down
 the number of timers waking up the system with NO_HZ.
 Booting with init=/bin/sh might also do the trick for that.

 Good point!  This does make for a very quiet system, but does not
 reproduce the problem under kvm, even after waiting for four minutes.
 I will leave it for more time, but it looks like I really might need to
 ask Linaro for remote access to a Panda.

I have one. I'm currently installing Ubuntu on it, and I'll try to
build a kernel and reproduce the issue.

I'll give more news soon.


Re: RCU idle CPU detection is broken in linux-next

2012-09-24 Thread Frederic Weisbecker
2012/9/23 Sasha Levin levinsasha...@gmail.com:
 On 09/23/2012 02:21 AM, Paul E. McKenney wrote:
 On Sat, Sep 22, 2012 at 02:27:35PM -0700, Paul E. McKenney wrote:
 On Sat, Sep 22, 2012 at 07:50:29PM +0200, Sasha Levin wrote:
 On 09/22/2012 05:56 PM, Paul E. McKenney wrote:
 And now the prime suspect is the new CONFIG_RCU_USER_QS=y.  Do these
 warnings ever show up with CONFIG_RCU_USER_QS=n?

 It seems that disabling that does make the warnings go away.

 I'll keep the tests running in case it just reduces the chances or 
 something
 like that.

 Thank you for testing this!

 And of course the reason that I didn't see these problems is that I
 failed to update my tests to enable CONFIG_RCU_USER_QS.  :-/

 Also the fact that I run 32-bit guests on x86.  Sigh!

 I take it that you are running 64-bit guests?

 Yes, that's correct.

Sasha,

Can you please test the following branch:

git://github.com/fweisbec/linux-dynticks.git  rcu/idle-for-v3.7-take3

with CONFIG_RCU_USER_QS and CONFIG_RCU_USER_QS_FORCE enabled.
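
For concreteness, that means building with:

CONFIG_RCU_USER_QS=y
CONFIG_RCU_USER_QS_FORCE=y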

I hope this fixes the warning.
The changes are:

* add "x86: Unspaghettize do_general_protection()"
* updated "x86: Exception hooks for userspace RCU extended QS" to
handle some missed trap handlers, especially do_general_protection(),
because I can see the problem triggered there in Sasha's warnings. I
fixed more handlers along the way.

Thanks.


Re: RCU idle CPU detection is broken in linux-next

2012-09-24 Thread Frederic Weisbecker
2012/9/25 Sasha Levin levinsasha...@gmail.com:
 On 09/25/2012 12:47 AM, Sasha Levin wrote:
  - While I no longer see the warnings I've originally noticed, if I run with 
 Paul's last debug patch I see the following warning:

 Correction: Original warnings are still there, they just got buried in the 
 huge spew that was caused by additional debug warnings
 so I've missed them initially.

Are they the same? Could you send me your dmesg?

Thanks.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RCU idle CPU detection is broken in linux-next

2012-09-24 Thread Frederic Weisbecker
2012/9/25 Sasha Levin levinsasha...@gmail.com:
 On 09/25/2012 01:06 AM, Frederic Weisbecker wrote:
 2012/9/25 Sasha Levin levinsasha...@gmail.com:
 On 09/25/2012 12:47 AM, Sasha Levin wrote:
  - While I no longer see the warnings I've originally noticed, if I run 
 with Paul's last debug patch I see the following warning:

 Correction: Original warnings are still there, they just got buried in the 
 huge spew that was caused by additional debug warnings
 so I've missed them initially.

 Are they the same? Could you send me your dmesg?

 Thanks.


 Log is attached, you can go directly to 168.703017 when the warnings begin.

Thanks!

So here is the first relevant warning:

[  168.703017] [ cut here ]
[  168.708117] WARNING: at kernel/rcutree.c:502 rcu_eqs_exit_common+0x4a/0x3a0()
[  168.710034] Pid: 7871, comm: trinity-child65 Tainted: GW
3.6.0-rc6-next-20120924-sasha-00030-g71f256c #5
[  168.710034] Call Trace:
[  168.710034]  IRQ  [811c737a] ? rcu_eqs_exit_common+0x4a/0x3a0
[  168.710034]  [811078b6] warn_slowpath_common+0x86/0xb0
[  168.710034]  [811079a5] warn_slowpath_null+0x15/0x20
[  168.710034]  [811c737a] rcu_eqs_exit_common+0x4a/0x3a0
[  168.710034]  [811c79cc] rcu_eqs_exit+0x9c/0xb0
[  168.710034]  [811c7a4c] rcu_user_exit+0x6c/0xd0
[  168.710034]  [8106eb1f] do_general_protection+0x1f/0x170
[  168.710034]  [83a0e624] ? restore_args+0x30/0x30
[  168.710034]  [83a0e875] general_protection+0x25/0x30
[  168.710034]  [810a3f06] ? native_read_msr_safe+0x6/0x20
[  168.710034]  [81a0b34b] __rdmsr_safe_on_cpu+0x2b/0x50
[  168.710034]  [819ec971] ? list_del+0x11/0x40
[  168.710034]  [811886dc]
generic_smp_call_function_single_interrupt+0xec/0x120
[  168.710034]  [81151147] ? account_system_vtime+0xd7/0x140
[  168.710034]  [81096f72]
smp_call_function_single_interrupt+0x22/0x40
[  168.710034]  [83a0fe2f] call_function_single_interrupt+0x6f/0x80
[  168.710034]  EOI  [83a0e5f4] ? retint_restore_args+0x13/0x13
[  168.710034]  [811c7285] ? rcu_user_enter+0x105/0x110
[  168.710034]  [8107e06d] syscall_trace_leave+0xfd/0x150
[  168.710034]  [83a0f1ef] int_check_syscall_exit_work+0x34/0x3d
[  168.710034] ---[ end trace fd408dd21b70b87c ]---

This is an exception inside an interrupt, and the interrupt
interrupted RCU user mode.
And we have that nesting:

rcu_irq_enter();  <-- irq entry
rcu_user_exit();  <-- exception entry

And rcu_eqs_exit() doesn't handle that very well...


Re: RCU idle CPU detection is broken in linux-next

2012-09-24 Thread Frederic Weisbecker
2012/9/25 Frederic Weisbecker fweis...@gmail.com:
 2012/9/25 Sasha Levin levinsasha...@gmail.com:
 On 09/25/2012 01:06 AM, Frederic Weisbecker wrote:
 2012/9/25 Sasha Levin levinsasha...@gmail.com:
 On 09/25/2012 12:47 AM, Sasha Levin wrote:
  - While I no longer see the warnings I've originally noticed, if I run 
 with Paul's last debug patch I see the following warning:

 Correction: Original warnings are still there, they just got buried in the 
 huge spew that was caused by additional debug warnings
 so I've missed them initially.

 Are they the same? Could you send me your dmesg?

 Thanks.


 Log is attached, you can go directly to 168.703017 when the warnings begin.

 Thanks!

 So here is the first relevant warning:

 [  168.703017] [ cut here ]
 [  168.708117] WARNING: at kernel/rcutree.c:502 
 rcu_eqs_exit_common+0x4a/0x3a0()
 [  168.710034] Pid: 7871, comm: trinity-child65 Tainted: GW
 3.6.0-rc6-next-20120924-sasha-00030-g71f256c #5
 [  168.710034] Call Trace:
 [  168.710034]  IRQ  [811c737a] ? rcu_eqs_exit_common+0x4a/0x3a0
 [  168.710034]  [811078b6] warn_slowpath_common+0x86/0xb0
 [  168.710034]  [811079a5] warn_slowpath_null+0x15/0x20
 [  168.710034]  [811c737a] rcu_eqs_exit_common+0x4a/0x3a0
 [  168.710034]  [811c79cc] rcu_eqs_exit+0x9c/0xb0
 [  168.710034]  [811c7a4c] rcu_user_exit+0x6c/0xd0
 [  168.710034]  [8106eb1f] do_general_protection+0x1f/0x170
 [  168.710034]  [83a0e624] ? restore_args+0x30/0x30
 [  168.710034]  [83a0e875] general_protection+0x25/0x30
 [  168.710034]  [810a3f06] ? native_read_msr_safe+0x6/0x20
 [  168.710034]  [81a0b34b] __rdmsr_safe_on_cpu+0x2b/0x50
 [  168.710034]  [819ec971] ? list_del+0x11/0x40
 [  168.710034]  [811886dc]
 generic_smp_call_function_single_interrupt+0xec/0x120
 [  168.710034]  [81151147] ? account_system_vtime+0xd7/0x140
 [  168.710034]  [81096f72]
 smp_call_function_single_interrupt+0x22/0x40
 [  168.710034]  [83a0fe2f] call_function_single_interrupt+0x6f/0x80
 [  168.710034]  EOI  [83a0e5f4] ? retint_restore_args+0x13/0x13
 [  168.710034]  [811c7285] ? rcu_user_enter+0x105/0x110
 [  168.710034]  [8107e06d] syscall_trace_leave+0xfd/0x150
 [  168.710034]  [83a0f1ef] int_check_syscall_exit_work+0x34/0x3d
 [  168.710034] ---[ end trace fd408dd21b70b87c ]---

 This is an exception inside an interrupt, and the interrupt
 interrupted RCU user mode.
 And we have that nesting:

 rcu_irq_enter();  <-- irq entry
 rcu_user_exit();  <-- exception entry

 And rcu_eqs_exit() doesn't handle that very well...

So either I should return immediately from rcu_user_exit() if
we are in an interrupt, or we make rcu_user_exit() able to nest
on rcu_irq_enter()   :)


Re: RCU idle CPU detection is broken in linux-next

2012-09-25 Thread Frederic Weisbecker
On Mon, Sep 24, 2012 at 09:04:20PM -0700, Paul E. McKenney wrote:
 On Tue, Sep 25, 2012 at 01:41:18AM +0200, Frederic Weisbecker wrote:
  
   [  168.703017] [ cut here ]
   [  168.708117] WARNING: at kernel/rcutree.c:502 
   rcu_eqs_exit_common+0x4a/0x3a0()
   [  168.710034] Pid: 7871, comm: trinity-child65 Tainted: GW
   3.6.0-rc6-next-20120924-sasha-00030-g71f256c #5
   [  168.710034] Call Trace:
   [  168.710034]  IRQ  [811c737a] ? 
   rcu_eqs_exit_common+0x4a/0x3a0
   [  168.710034]  [811078b6] warn_slowpath_common+0x86/0xb0
   [  168.710034]  [811079a5] warn_slowpath_null+0x15/0x20
   [  168.710034]  [811c737a] rcu_eqs_exit_common+0x4a/0x3a0
   [  168.710034]  [811c79cc] rcu_eqs_exit+0x9c/0xb0
   [  168.710034]  [811c7a4c] rcu_user_exit+0x6c/0xd0
   [  168.710034]  [8106eb1f] do_general_protection+0x1f/0x170
   [  168.710034]  [83a0e624] ? restore_args+0x30/0x30
   [  168.710034]  [83a0e875] general_protection+0x25/0x30
   [  168.710034]  [810a3f06] ? native_read_msr_safe+0x6/0x20
   [  168.710034]  [81a0b34b] __rdmsr_safe_on_cpu+0x2b/0x50
   [  168.710034]  [819ec971] ? list_del+0x11/0x40
   [  168.710034]  [811886dc]
   generic_smp_call_function_single_interrupt+0xec/0x120
   [  168.710034]  [81151147] ? account_system_vtime+0xd7/0x140
   [  168.710034]  [81096f72]
   smp_call_function_single_interrupt+0x22/0x40
   [  168.710034]  [83a0fe2f] 
   call_function_single_interrupt+0x6f/0x80
   [  168.710034]  EOI  [83a0e5f4] ? 
   retint_restore_args+0x13/0x13
   [  168.710034]  [811c7285] ? rcu_user_enter+0x105/0x110
   [  168.710034]  [8107e06d] syscall_trace_leave+0xfd/0x150
   [  168.710034]  [83a0f1ef] int_check_syscall_exit_work+0x34/0x3d
   [  168.710034] ---[ end trace fd408dd21b70b87c ]---
  
   This is an exception inside an interrupt, and the interrupt
   interrupted RCU user mode.
   And we have that nesting:
  
   rcu_irq_enter();  <-- irq entry
   rcu_user_exit();  <-- exception entry
  
   And rcu_eqs_exit() doesn't handle that very well...
  
  So either I should return immediately from rcu_user_exit() if
  we are in an interrupt, or we make rcu_user_exit() able to nest
  on rcu_irq_enter()   :)
 
 Both of the two are eminently doable, with varying degrees of hackery.
 
 What makes the most sense from an adaptive-idle viewpoint?

Given that we have:

rcu_irq_enter()
rcu_user_exit()
rcu_user_enter()
rcu_irq_exit()

And we already have rcu_user_exit_after_irq(), so this starts to get
confusing if we allow that nesting. Although if we find a solution
that, in the end, merges rcu_user_exit() with rcu_user_exit_after_irq(),
and the same for the enter version, this would probably be a good thing.
Provided this doesn't involve more complicated rdtp->dyntick_nesting
trickery nor more overhead.

Otherwise we could avoid calling rcu_user_* when we are in an irq. Once
we have the user_hooks layer, we can perhaps manage that from there. For
now, maybe we can return early after an in_interrupt() check in the RCU
user APIs.

Let's first make sure I diagnosed it correctly and that we don't have
other problems detected by Sasha. I'm cooking a testing patch.


Re: RCU idle CPU detection is broken in linux-next

2012-09-25 Thread Frederic Weisbecker
On Tue, Sep 25, 2012 at 01:10:27AM +0200, Sasha Levin wrote:
 On 09/25/2012 01:06 AM, Frederic Weisbecker wrote:
  2012/9/25 Sasha Levin levinsasha...@gmail.com:
  On 09/25/2012 12:47 AM, Sasha Levin wrote:
   - While I no longer see the warnings I've originally noticed, if I run 
  with Paul's last debug patch I see the following warning:
 
  Correction: Original warnings are still there, they just got buried in the 
  huge spew that was caused by additional debug warnings
  so I've missed them initially.
  
  Are they the same? Could you send me your dmesg?
  
  Thanks.
  
 
 Log is attached, you can go directly to 168.703017 when the warnings begin.

Sasha, sorry to burden you with more testing requests.
Could you please try out this new branch? It includes some fixes after
reports from Wu Fengguang and Dan Carpenter (not related to your
warnings though), plus a patch on top of the pile that returns
immediately from the rcu_user_*() APIs if we are in an interrupt, to
confirm my diagnosis of the problem.

This way we'll have a clearer view. I also would like to know if there are other
problems with the rcu user mode.

Thanks!

Branch is:

git://github.com/fweisbec/linux-dynticks.git
rcu/idle-for-v3.7-take4

Diff is:

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index cb20776..3789675 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -232,8 +232,8 @@ DO_ERROR_INFO(X86_TRAP_AC, SIGBUS, "alignment check", alignment_check,
 dotraplinkage void do_stack_segment(struct pt_regs *regs, long error_code)
 {
 	exception_enter(regs);
-	if (!notify_die(DIE_TRAP, "stack segment", regs, error_code,
-			X86_TRAP_SS, SIGBUS) == NOTIFY_STOP) {
+	if (notify_die(DIE_TRAP, "stack segment", regs, error_code,
+			X86_TRAP_SS, SIGBUS) != NOTIFY_STOP) {
 		preempt_conditional_sti(regs);
 		do_trap(X86_TRAP_SS, SIGBUS, "stack segment", regs,
 			error_code, NULL);
 		preempt_conditional_cli(regs);
@@ -285,8 +285,8 @@ do_general_protection(struct pt_regs *regs, long error_code)
 
 		tsk->thread.error_code = error_code;
 		tsk->thread.trap_nr = X86_TRAP_GP;
-		if (!notify_die(DIE_GPF, "general protection fault", regs, error_code,
-				X86_TRAP_GP, SIGSEGV) == NOTIFY_STOP)
+		if (notify_die(DIE_GPF, "general protection fault", regs, error_code,
+				X86_TRAP_GP, SIGSEGV) != NOTIFY_STOP)
 			die("general protection fault", regs, error_code);
 		goto exit;
 	}
@@ -678,8 +678,8 @@ dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code)
 	info.si_errno = 0;
 	info.si_code = ILL_BADSTK;
 	info.si_addr = NULL;
-	if (!notify_die(DIE_TRAP, "iret exception", regs, error_code,
-			X86_TRAP_IRET, SIGILL) == NOTIFY_STOP) {
+	if (notify_die(DIE_TRAP, "iret exception", regs, error_code,
+			X86_TRAP_IRET, SIGILL) != NOTIFY_STOP) {
 		do_trap(X86_TRAP_IRET, SIGILL, "iret exception", regs,
 			error_code, &info);
 	}
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index d249719..e0500c6 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -446,6 +446,9 @@ void rcu_user_enter(void)
 
 	WARN_ON_ONCE(!current->mm);
 
+	if (in_interrupt())
+		return;
+
 	local_irq_save(flags);
 	rdtp = &__get_cpu_var(rcu_dynticks);
 	if (!rdtp->ignore_user_qs && !rdtp->in_user) {
@@ -592,6 +595,9 @@ void rcu_user_exit(void)
 	unsigned long flags;
 	struct rcu_dynticks *rdtp;
 
+	if (in_interrupt())
+		return;
+
 	local_irq_save(flags);
 	rdtp = &__get_cpu_var(rcu_dynticks);
 	if (rdtp->in_user) {



[PATCH] x86: Unspaghettize do_trap()

2012-09-25 Thread Frederic Weisbecker
Clean up the label maze in this function. Having a
separate function to first handle the traps that don't
generate a signal makes it easier to convert into
more readable conditional paths.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: H. Peter Anvin h...@zytor.com
---
 arch/x86/kernel/traps.c |   58 ++
 1 files changed, 28 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index b481341..1a8dd87 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -107,30 +107,45 @@ static inline void preempt_conditional_cli(struct pt_regs *regs)
 	dec_preempt_count();
 }
 
-static void __kprobes
-do_trap(int trapnr, int signr, char *str, struct pt_regs *regs,
-	long error_code, siginfo_t *info)
+static int __kprobes
+do_trap_no_signal(struct task_struct *tsk, int trapnr, char *str,
+		  struct pt_regs *regs, long error_code)
 {
-	struct task_struct *tsk = current;
-
 #ifdef CONFIG_X86_32
 	if (regs->flags & X86_VM_MASK) {
 		/*
 		 * traps 0, 1, 3, 4, and 5 should be forwarded to vm86.
 		 * On nmi (interrupt 2), do_trap should not be called.
 		 */
-		if (trapnr < X86_TRAP_UD)
-			goto vm86_trap;
-		goto trap_signal;
+		if (trapnr < X86_TRAP_UD) {
+			if (!handle_vm86_trap((struct kernel_vm86_regs *) regs,
+						error_code, trapnr))
+				return 0;
+		}
+		return -1;
 	}
 #endif
+	if (!user_mode(regs)) {
+		if (!fixup_exception(regs)) {
+			tsk->thread.error_code = error_code;
+			tsk->thread.trap_nr = trapnr;
+			die(str, regs, error_code);
+		}
+		return 0;
+	}
 
-	if (!user_mode(regs))
-		goto kernel_trap;
+	return -1;
+}
 
-#ifdef CONFIG_X86_32
-trap_signal:
-#endif
+static void __kprobes
+do_trap(int trapnr, int signr, char *str, struct pt_regs *regs,
+	long error_code, siginfo_t *info)
+{
+	struct task_struct *tsk = current;
+
+
+	if (!do_trap_no_signal(tsk, trapnr, str, regs, error_code))
+		return;
 	/*
 	 * We want error_code and trap_nr set for userspace faults and
 	 * kernelspace faults which result in die(), but not
@@ -158,23 +173,6 @@ trap_signal:
 		force_sig_info(signr, info, tsk);
 	else
 		force_sig(signr, tsk);
-	return;
-
-kernel_trap:
-	if (!fixup_exception(regs)) {
-		tsk->thread.error_code = error_code;
-		tsk->thread.trap_nr = trapnr;
-		die(str, regs, error_code);
-	}
-	return;
-
-#ifdef CONFIG_X86_32
-vm86_trap:
-	if (handle_vm86_trap((struct kernel_vm86_regs *) regs,
-			error_code, trapnr))
-		goto trap_signal;
-	return;
-#endif
 }
 
 #define DO_ERROR(trapnr, signr, str, name) \
-- 
1.7.5.4



[GIT PULL] cputime: More cleanups v2

2012-09-25 Thread Frederic Weisbecker
Hi Ingo, Thomas,

Second take for these cleanups (and more are to come). I haven't
received any reviews in two weeks, so I'm resending with a pull request.

Changes in v2:

* Fix wrong condition when using is_idle_task() (patch 2/6)
* Remove unnecessary !S390 dependency for finegrained irqtime accounting
option (reported by Russell King)

Please pull from:

git://github.com/fweisbec/linux-dynticks.git
cputime/cleanups-v2

It is based on tip:sched/core

Thanks.

Frederic Weisbecker (6):
  cputime: Use a proper subsystem naming for vtime related APIs
  vtime: Consolidate system/idle context detection
  ia64: Consolidate user vtime accounting
  ia64: Reuse system and user vtime accounting functions on task switch
  cputime: Gather time/stats accounting config options into a single
menu
  cputime: Make finegrained irqtime accounting generally available

 arch/Kconfig|6 ++
 arch/ia64/kernel/time.c |   64 +-
 arch/powerpc/kernel/time.c  |   53 --
 arch/s390/include/asm/cputime.h |3 +
 arch/s390/kernel/vtime.c|6 +-
 arch/x86/Kconfig|   12 +---
 include/linux/hardirq.h |8 +-
 include/linux/kernel_stat.h |6 +-
 include/linux/kvm_host.h|4 +-
 init/Kconfig|  146 ---
 kernel/sched/core.c |2 +-
 kernel/sched/cputime.c  |   34 -
 kernel/softirq.c|6 +-
 13 files changed, 209 insertions(+), 141 deletions(-)

-- 
1.7.5.4



[PATCH 1/6] cputime: Use a proper subsystem naming for vtime related APIs

2012-09-25 Thread Frederic Weisbecker
Use a naming based on vtime as a prefix for virtual based
cputime accounting APIs:

- account_system_vtime() -> vtime_account()
- account_switch_vtime() -> vtime_task_switch()

It makes it easier to allow for further declensions such
as vtime_account_system(), vtime_account_idle(), ... if we
want to find out the context we account to from generic code.

This also makes it clearer which subsystem these APIs
refer to.
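
A sketch of the naming scheme this opens up (the system/idle
declensions only land in the following patches; listed here purely for
illustration):

void vtime_account(struct task_struct *tsk);        /* generic entry point */
void vtime_account_system(struct task_struct *tsk); /* later declensions */
void vtime_account_idle(struct task_struct *tsk);
void vtime_task_switch(struct task_struct *prev);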

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Tony Luck tony.l...@intel.com
Cc: Fenghua Yu fenghua...@intel.com
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Heiko Carstens heiko.carst...@de.ibm.com
Cc: Martin Schwidefsky schwidef...@de.ibm.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Peter Zijlstra pet...@infradead.org
---
 arch/ia64/kernel/time.c |6 +++---
 arch/powerpc/kernel/time.c  |   10 +-
 arch/s390/kernel/vtime.c|6 +++---
 include/linux/hardirq.h |8 
 include/linux/kernel_stat.h |4 ++--
 include/linux/kvm_host.h|4 ++--
 kernel/sched/core.c |2 +-
 kernel/sched/cputime.c  |8 
 kernel/softirq.c|6 +++---
 9 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 6247197..16bb6ed 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -88,7 +88,7 @@ extern cputime_t cycle_to_cputime(u64 cyc);
  * accumulated times to the current process, and to prepare accounting on
  * the next process.
  */
-void account_switch_vtime(struct task_struct *prev)
+void vtime_task_switch(struct task_struct *prev)
 {
struct thread_info *pi = task_thread_info(prev);
struct thread_info *ni = task_thread_info(current);
@@ -116,7 +116,7 @@ void account_switch_vtime(struct task_struct *prev)
  * Account time for a transition between system, hard irq or soft irq state.
  * Note that this function is called with interrupts enabled.
  */
-void account_system_vtime(struct task_struct *tsk)
+void vtime_account(struct task_struct *tsk)
 {
struct thread_info *ti = task_thread_info(tsk);
unsigned long flags;
@@ -138,7 +138,7 @@ void account_system_vtime(struct task_struct *tsk)
 
local_irq_restore(flags);
 }
-EXPORT_SYMBOL_GPL(account_system_vtime);
+EXPORT_SYMBOL_GPL(vtime_account);
 
 /*
  * Called from the timer interrupt handler to charge accumulated user time
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 49da7f0..39899d7 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -291,7 +291,7 @@ static inline u64 calculate_stolen_time(u64 stop_tb)
  * Account time for a transition between system, hard irq
  * or soft irq state.
  */
-void account_system_vtime(struct task_struct *tsk)
+void vtime_account(struct task_struct *tsk)
 {
u64 now, nowscaled, delta, deltascaled;
unsigned long flags;
@@ -343,14 +343,14 @@ void account_system_vtime(struct task_struct *tsk)
}
local_irq_restore(flags);
 }
-EXPORT_SYMBOL_GPL(account_system_vtime);
+EXPORT_SYMBOL_GPL(vtime_account);
 
 /*
  * Transfer the user and system times accumulated in the paca
  * by the exception entry and exit code to the generic process
  * user and system time records.
  * Must be called with interrupts disabled.
- * Assumes that account_system_vtime() has been called recently
+ * Assumes that vtime_account() has been called recently
  * (i.e. since the last entry from usermode) so that
 * get_paca()->user_time_scaled is up to date.
 */
@@ -366,9 +366,9 @@ void account_process_tick(struct task_struct *tsk, int user_tick)
account_user_time(tsk, utime, utimescaled);
 }
 
-void account_switch_vtime(struct task_struct *prev)
+void vtime_task_switch(struct task_struct *prev)
 {
-   account_system_vtime(prev);
+   vtime_account(prev);
account_process_tick(prev, 0);
 }
 
diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index 449ac22..cb5093c 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -99,7 +99,7 @@ static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
return virt_timer_forward(user + system);
 }
 
-void account_switch_vtime(struct task_struct *prev)
+void vtime_task_switch(struct task_struct *prev)
 {
struct thread_info *ti;
 
@@ -122,7 +122,7 @@ void account_process_tick(struct task_struct *tsk, int user_tick)
  * Update process times based on virtual cpu times stored by entry.S
  * to the lowcore fields user_timer, system_timer  steal_clock.
  */
-void account_system_vtime(struct task_struct *tsk)
+void vtime_account(struct task_struct *tsk)
 {
struct thread_info *ti = task_thread_info(tsk);
u64 timer, system;
@@ -138,7 +138,7 @@ void account_system_vtime(struct task_struct *tsk)
 
virt_timer_forward(system);
 }
-EXPORT_SYMBOL_GPL

[PATCH 2/6] vtime: Consolidate system/idle context detection

2012-09-25 Thread Frederic Weisbecker
Move the code that finds out to which context we account the
cputime into generic layer.

Archs that consider the whole time spent in the idle task as idle
time (ia64, powerpc) can rely on the generic vtime_account()
and implement vtime_account_system() and vtime_account_idle(),
letting the generic code decide when to call which API.

Archs that have their own meaning of idle time, such as s390
that only considers the time spent in CPU low power mode as idle
time, can just override vtime_account().
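
For reference, the generic dispatcher this adds to kernel/sched/cputime.c
boils down to something like this (a sketch, ignoring the
__ARCH_HAS_VTIME_ACCOUNT override and the exact locking details):

	void vtime_account(struct task_struct *tsk)
	{
		unsigned long flags;

		local_irq_save(flags);

		/*
		 * Time spent in hardirq/softirq or in a non-idle task is
		 * accounted as system time; only the idle loop itself is
		 * accounted as idle time.
		 */
		if (in_interrupt() || !is_idle_task(tsk))
			vtime_account_system(tsk);
		else
			vtime_account_idle(tsk);

		local_irq_restore(flags);
	}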

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Tony Luck tony.l...@intel.com
Cc: Fenghua Yu fenghua...@intel.com
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Martin Schwidefsky schwidef...@de.ibm.com
Cc: Heiko Carstens heiko.carst...@de.ibm.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Peter Zijlstra pet...@infradead.org
---
 arch/ia64/kernel/time.c |   25 +++-
 arch/powerpc/kernel/time.c  |   47 +++---
 arch/s390/include/asm/cputime.h |3 ++
 include/linux/kernel_stat.h |2 +
 kernel/sched/cputime.c  |   26 +
 5 files changed, 73 insertions(+), 30 deletions(-)

diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 16bb6ed..01cd43e 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -116,29 +116,32 @@ void vtime_task_switch(struct task_struct *prev)
  * Account time for a transition between system, hard irq or soft irq state.
  * Note that this function is called with interrupts enabled.
  */
-void vtime_account(struct task_struct *tsk)
+static cputime_t vtime_delta(struct task_struct *tsk)
 {
struct thread_info *ti = task_thread_info(tsk);
-   unsigned long flags;
cputime_t delta_stime;
__u64 now;
 
-   local_irq_save(flags);
-
now = ia64_get_itc();
 
	delta_stime = cycle_to_cputime(ti->ac_stime + (now - ti->ac_stamp));
-	if (irq_count() || idle_task(smp_processor_id()) != tsk)
-		account_system_time(tsk, 0, delta_stime, delta_stime);
-	else
-		account_idle_time(delta_stime);
	ti->ac_stime = 0;
-
	ti->ac_stamp = now;
 
-   local_irq_restore(flags);
+   return delta_stime;
+}
+
+void vtime_account_system(struct task_struct *tsk)
+{
+   cputime_t delta = vtime_delta(tsk);
+
+   account_system_time(tsk, 0, delta, delta);
+}
+
+void vtime_account_idle(struct task_struct *tsk)
+{
+   account_idle_time(vtime_delta(tsk));
 }
-EXPORT_SYMBOL_GPL(vtime_account);
 
 /*
  * Called from the timer interrupt handler to charge accumulated user time
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 39899d7..29b6d3e 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -291,13 +291,12 @@ static inline u64 calculate_stolen_time(u64 stop_tb)
  * Account time for a transition between system, hard irq
  * or soft irq state.
  */
-void vtime_account(struct task_struct *tsk)
+static u64 vtime_delta(struct task_struct *tsk,
+   u64 *sys_scaled, u64 *stolen)
 {
-   u64 now, nowscaled, delta, deltascaled;
-   unsigned long flags;
-   u64 stolen, udelta, sys_scaled, user_scaled;
+   u64 now, nowscaled, deltascaled;
+   u64 udelta, delta, user_scaled;
 
-   local_irq_save(flags);
now = mftb();
nowscaled = read_spurr(now);
	get_paca()->system_time += now - get_paca()->starttime;
@@ -305,7 +304,7 @@ void vtime_account(struct task_struct *tsk)
	deltascaled = nowscaled - get_paca()->startspurr;
	get_paca()->startspurr = nowscaled;

-	stolen = calculate_stolen_time(now);
+	*stolen = calculate_stolen_time(now);

	delta = get_paca()->system_time;
	get_paca()->system_time = 0;
@@ -322,28 +321,38 @@ void vtime_account(struct task_struct *tsk)
	 * the user ticks get saved up in paca->user_time_scaled to be
 * used by account_process_tick.
 */
-   sys_scaled = delta;
+   *sys_scaled = delta;
user_scaled = udelta;
if (deltascaled != delta + udelta) {
if (udelta) {
-   sys_scaled = deltascaled * delta / (delta + udelta);
-   user_scaled = deltascaled - sys_scaled;
+   *sys_scaled = deltascaled * delta / (delta + udelta);
+   user_scaled = deltascaled - *sys_scaled;
} else {
-   sys_scaled = deltascaled;
+   *sys_scaled = deltascaled;
}
}
	get_paca()->user_time_scaled += user_scaled;
 
-   if (in_interrupt() || idle_task(smp_processor_id()) != tsk) {
-   account_system_time(tsk, 0, delta, sys_scaled);
-   if (stolen)
-   account_steal_time(stolen);
-   } else {
-   account_idle_time

[PATCH 3/6] ia64: Consolidate user vtime accounting

2012-09-25 Thread Frederic Weisbecker
Factorize the code that accounts user time into a
single function to avoid code duplication.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Tony Luck tony.l...@intel.com
Cc: Fenghua Yu fenghua...@intel.com
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Martin Schwidefsky schwidef...@de.ibm.com
Cc: Heiko Carstens heiko.carst...@de.ibm.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Peter Zijlstra pet...@infradead.org
---
 arch/ia64/kernel/time.c |   28 +++-
 1 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 01cd43e..351df58 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -83,6 +83,18 @@ static struct clocksource *itc_clocksource;
 
 extern cputime_t cycle_to_cputime(u64 cyc);
 
+static void vtime_account_user(struct task_struct *tsk)
+{
+   cputime_t delta_utime;
+   struct thread_info *ti = task_thread_info(tsk);
+
+	if (ti->ac_utime) {
+		delta_utime = cycle_to_cputime(ti->ac_utime);
+		account_user_time(tsk, delta_utime, delta_utime);
+		ti->ac_utime = 0;
+	}
+}
+
 /*
  * Called from the context switch with interrupts disabled, to charge all
  * accumulated times to the current process, and to prepare accounting on
@@ -92,7 +104,7 @@ void vtime_task_switch(struct task_struct *prev)
 {
struct thread_info *pi = task_thread_info(prev);
struct thread_info *ni = task_thread_info(current);
-   cputime_t delta_stime, delta_utime;
+   cputime_t delta_stime;
__u64 now;
 
now = ia64_get_itc();
@@ -103,10 +115,7 @@ void vtime_task_switch(struct task_struct *prev)
else
account_idle_time(delta_stime);
 
-	if (pi->ac_utime) {
-		delta_utime = cycle_to_cputime(pi->ac_utime);
-		account_user_time(prev, delta_utime, delta_utime);
-	}
+	vtime_account_user(prev);

	pi->ac_stamp = ni->ac_stamp = now;
	ni->ac_stime = ni->ac_utime = 0;
@@ -149,14 +158,7 @@ void vtime_account_idle(struct task_struct *tsk)
  */
 void account_process_tick(struct task_struct *p, int user_tick)
 {
-   struct thread_info *ti = task_thread_info(p);
-   cputime_t delta_utime;
-
-	if (ti->ac_utime) {
-		delta_utime = cycle_to_cputime(ti->ac_utime);
-		account_user_time(p, delta_utime, delta_utime);
-		ti->ac_utime = 0;
-	}
+   vtime_account_user(p);
 }
 
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING */
-- 
1.7.5.4



[PATCH 4/6] ia64: Reuse system and user vtime accounting functions on task switch

2012-09-25 Thread Frederic Weisbecker
To avoid code duplication.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Tony Luck tony.l...@intel.com
Cc: Fenghua Yu fenghua...@intel.com
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Martin Schwidefsky schwidef...@de.ibm.com
Cc: Heiko Carstens heiko.carst...@de.ibm.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Peter Zijlstra pet...@infradead.org
---
 arch/ia64/kernel/time.c |   11 +++
 1 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 351df58..80ff9ac 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -104,20 +104,15 @@ void vtime_task_switch(struct task_struct *prev)
 {
struct thread_info *pi = task_thread_info(prev);
struct thread_info *ni = task_thread_info(current);
-   cputime_t delta_stime;
-   __u64 now;
-
-   now = ia64_get_itc();
 
-	delta_stime = cycle_to_cputime(pi->ac_stime + (now - pi->ac_stamp));
if (idle_task(smp_processor_id()) != prev)
-   account_system_time(prev, 0, delta_stime, delta_stime);
+   vtime_account_system(prev);
else
-   account_idle_time(delta_stime);
+   vtime_account_idle(prev);
 
vtime_account_user(prev);
 
-	pi->ac_stamp = ni->ac_stamp = now;
+	pi->ac_stamp = ni->ac_stamp;
	ni->ac_stime = ni->ac_utime = 0;
 }
 
-- 
1.7.5.4



[PATCH 5/6] cputime: Gather time/stats accounting config options into a single menu

2012-09-25 Thread Frederic Weisbecker
This debloats the general config menu a bit and makes these
config options easier to find.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Tony Luck tony.l...@intel.com
Cc: Fenghua Yu fenghua...@intel.com
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Martin Schwidefsky schwidef...@de.ibm.com
Cc: Heiko Carstens heiko.carst...@de.ibm.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Peter Zijlstra pet...@infradead.org
---
 init/Kconfig |  116 ++
 1 files changed, 60 insertions(+), 56 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index c40d0fb..2c5aa34 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -267,6 +267,65 @@ config POSIX_MQUEUE_SYSCTL
depends on SYSCTL
default y
 
+config FHANDLE
+	bool "open by fhandle syscalls"
+   select EXPORTFS
+   help
+ If you say Y here, a user level program will be able to map
+ file names to handle and then later use the handle for
+ different file system operations. This is useful in implementing
+ userspace file servers, which now track files using handles instead
+ of names. The handle would remain the same even if file names
+ get renamed. Enables open_by_handle_at(2) and name_to_handle_at(2)
+ syscalls.
+
+config AUDIT
+	bool "Auditing support"
+   depends on NET
+   help
+ Enable auditing infrastructure that can be used with another
+ kernel subsystem, such as SELinux (which requires this for
+ logging of avc messages output).  Does not do system-call
+ auditing without CONFIG_AUDITSYSCALL.
+
+config AUDITSYSCALL
+	bool "Enable system-call auditing support"
+	depends on AUDIT && (X86 || PPC || S390 || IA64 || UML || SPARC64 || SUPERH || (ARM && AEABI && !OABI_COMPAT))
+   default y if SECURITY_SELINUX
+   help
+ Enable low-overhead system-call auditing infrastructure that
+ can be used independently or with another kernel subsystem,
+ such as SELinux.
+
+config AUDIT_WATCH
+   def_bool y
+   depends on AUDITSYSCALL
+   select FSNOTIFY
+
+config AUDIT_TREE
+   def_bool y
+   depends on AUDITSYSCALL
+   select FSNOTIFY
+
+config AUDIT_LOGINUID_IMMUTABLE
+	bool "Make audit loginuid immutable"
+   depends on AUDIT
+   help
+	  The config option toggles if a task setting its loginuid requires
+	  CAP_SYS_AUDITCONTROL or if that task should require no special permissions
+	  but should instead only allow setting its loginuid if it was never
+	  previously set.  On systems which use systemd or a similar central
+	  process to restart login services this should be set to true.  On older
+	  systems in which an admin would typically have to directly stop and
+	  start processes this should be set to false.  Setting this to true allows
+	  one to drop potentially dangerous capabilities from the login tasks,
+	  but may not be backwards compatible with older init systems.
+
+source "kernel/irq/Kconfig"
+source "kernel/time/Kconfig"
+
+menu "CPU/Task time and stats accounting"
+
 config VIRT_CPU_ACCOUNTING
	bool "Deterministic task and CPU time accounting"
depends on HAVE_VIRT_CPU_ACCOUNTING
@@ -305,18 +364,6 @@ config BSD_PROCESS_ACCT_V3
  for processing it. A preliminary version of these tools is available
	  at <http://www.gnu.org/software/acct/>.
 
-config FHANDLE
-	bool "open by fhandle syscalls"
-   select EXPORTFS
-   help
- If you say Y here, a user level program will be able to map
- file names to handle and then later use the handle for
- different file system operations. This is useful in implementing
- userspace file servers, which now track files using handles instead
- of names. The handle would remain the same even if file names
- get renamed. Enables open_by_handle_at(2) and name_to_handle_at(2)
- syscalls.
-
 config TASKSTATS
	bool "Export task/process statistics through netlink (EXPERIMENTAL)"
depends on NET
@@ -359,50 +406,7 @@ config TASK_IO_ACCOUNTING
 
  Say N if unsure.
 
-config AUDIT
-	bool "Auditing support"
-   depends on NET
-   help
- Enable auditing infrastructure that can be used with another
- kernel subsystem, such as SELinux (which requires this for
- logging of avc messages output).  Does not do system-call
- auditing without CONFIG_AUDITSYSCALL.
-
-config AUDITSYSCALL
-	bool "Enable system-call auditing support"
-	depends on AUDIT && (X86 || PPC || S390 || IA64 || UML || SPARC64 || SUPERH || (ARM && AEABI && !OABI_COMPAT))
-   default y if SECURITY_SELINUX
-   help
- Enable low-overhead system-call auditing infrastructure that
- can be used independently

[PATCH 6/6] cputime: Make finegrained irqtime accounting generally available

2012-09-25 Thread Frederic Weisbecker
There is no known reason for this option to be unavailable on other
archs than x86. They just need to call enable_sched_clock_irqtime()
if they have a sufficiently finegrained clock to make it work.

Move it to the general option and let the user choose between
it and pure tick based or virtual cputime accounting.

Note that virtual cputime accounting already performs a finegrained
irqtime accounting. CONFIG_IRQ_TIME_ACCOUNTING is a kind of middle ground
between tick and virtual based accounting. So CONFIG_IRQ_TIME_ACCOUNTING
and CONFIG_VIRT_CPU_ACCOUNTING are mutually exclusive choices.
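
For a hypothetical arch, the wiring then amounts to a
"select HAVE_IRQ_TIME_ACCOUNTING" in its Kconfig plus a call like this
somewhere in its time init code (a sketch; the function and the clock
predicate below are made-up names):

	void __init myarch_time_init(void)
	{
		/* ... clocksource / sched_clock setup ... */

		/* Opt in only if the sched clock is fine grained enough. */
		if (myarch_sched_clock_is_stable())
			enable_sched_clock_irqtime();
	}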

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Tony Luck tony.l...@intel.com
Cc: Fenghua Yu fenghua...@intel.com
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Martin Schwidefsky schwidef...@de.ibm.com
Cc: Heiko Carstens heiko.carst...@de.ibm.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Peter Zijlstra pet...@infradead.org
Cc: Russell King li...@arm.linux.org.uk
---
 arch/Kconfig |6 ++
 arch/x86/Kconfig |   12 +---
 init/Kconfig |   30 +-
 3 files changed, 36 insertions(+), 12 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index f78de57..101c31a 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -284,4 +284,10 @@ config SECCOMP_FILTER
 config HAVE_VIRT_CPU_ACCOUNTING
bool
 
+config HAVE_IRQ_TIME_ACCOUNTING
+   bool
+   help
+	  Archs need to ensure they use a high enough resolution clock to
+	  support irq time accounting and then call enable_sched_clock_irqtime().
+
 source kernel/gcov/Kconfig
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8ec3a1a..b86833a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -97,6 +97,7 @@ config X86
select KTIME_SCALAR if X86_32
select GENERIC_STRNCPY_FROM_USER
select GENERIC_STRNLEN_USER
+   select HAVE_IRQ_TIME_ACCOUNTING
 
 config INSTRUCTION_DECODER
def_bool (KPROBES || PERF_EVENTS || UPROBES)
@@ -796,17 +797,6 @@ config SCHED_MC
  making when dealing with multi-core CPU chips at a cost of slightly
  increased overhead in some places. If unsure say N here.
 
-config IRQ_TIME_ACCOUNTING
-	bool "Fine granularity task level IRQ time accounting"
-   default n
-   ---help---
- Select this option to enable fine granularity task irq time
- accounting. This is done by reading a timestamp on each
- transitions between softirq and hardirq state, so there can be a
- small performance impact.
-
- If in doubt, say N here.
-
 source kernel/Kconfig.preempt
 
 config X86_UP_APIC
diff --git a/init/Kconfig b/init/Kconfig
index 2c5aa34..1862c68 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -326,10 +326,25 @@ source "kernel/time/Kconfig"
 
 menu "CPU/Task time and stats accounting"
 
+choice
+	prompt "Cputime accounting"
+   default TICK_CPU_ACCOUNTING if !PPC64
+   default VIRT_CPU_ACCOUNTING if PPC64
+
+# Kind of a stub config for the pure tick based cputime accounting
+config TICK_CPU_ACCOUNTING
+	bool "Simple tick based cputime accounting"
+   depends on !S390
+   help
+ This is the basic tick based cputime accounting that maintains
+ statistics about user, system and idle time spent on per jiffies
+ granularity.
+
+ If unsure, say Y.
+
 config VIRT_CPU_ACCOUNTING
	bool "Deterministic task and CPU time accounting"
depends on HAVE_VIRT_CPU_ACCOUNTING
-   default y if PPC64
help
  Select this option to enable more accurate task and CPU time
  accounting.  This is done by reading a CPU counter on each
@@ -339,6 +354,19 @@ config VIRT_CPU_ACCOUNTING
  this also enables accounting of stolen time on logically-partitioned
  systems.
 
+config IRQ_TIME_ACCOUNTING
+	bool "Fine granularity task level IRQ time accounting"
+   depends on HAVE_IRQ_TIME_ACCOUNTING
+   help
+ Select this option to enable fine granularity task irq time
+ accounting. This is done by reading a timestamp on each
+ transitions between softirq and hardirq state, so there can be a
+ small performance impact.
+
+ If in doubt, say N here.
+
+endchoice
+
 config BSD_PROCESS_ACCT
	bool "BSD Process Accounting"
help
-- 
1.7.5.4



Re: [PATCH] x86: Unspaghettize do_trap()

2012-09-26 Thread Frederic Weisbecker
2012/9/26 Ingo Molnar mi...@kernel.org:

 * Frederic Weisbecker fweis...@gmail.com wrote:

 + return -1

 this bit wasn't very well tested. I applied it with the obvious
 fix, lets hope it holds up in testing.

Ouch. Bad indeed. I booted with x86-64 but forgot to try x86-32.

Sorry about that.


Re: RCU idle CPU detection is broken in linux-next

2012-09-26 Thread Frederic Weisbecker
2012/9/25 Paul E. McKenney paul...@linux.vnet.ibm.com:
 On Tue, Sep 25, 2012 at 01:59:26PM +0200, Frederic Weisbecker wrote:
 Given that we have:

 rcu_irq_enter()
   rcu_user_exit()
   rcu_user_enter()
 rcu_irq_exit()

 Indeed, the code to deal with irq misnestings won't like that at all.
 And we are in the kernel between rcu_user_exit() and rcu_user_enter()
 (right?), so we could in fact see irq misnestings.

Exactly.


 And we already have rcu_user_exit_after_irq(), this starts to be confusing
 if we allow that nesting. Although if we find a solution that, in the end,
 merge rcu_user_exit() with rcu_user_exit_after_irq() and same for the enter 
 version,
 this would probably be a good thing. Provided this doesn't involve some more
 complicated rdtp-dyntick_nesting trickies nor more overhead.

 Otherwise we could avoid to call rcu_user_* when we are in an irq. When 
 we'll have
 the user_hooks layer, we can perhaps manage that from that place. For
 now may be we can return after in_interrupt() in the rcu user apis.

 This last sounds best.

Ok.


 My main concern is irq misnesting.  We might need to do something ugly
 like record the interrupt nesting level at rcu_user_exit() and restore
 it at rcu_user_enter().  Sigh!!!

Right, but that doesn't seem to apply on x86? I suggest we start
simple and think about some wider solution when more architectures
implement this.
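
(The simple stop-gap we're talking about is just an early return at the
top of both APIs, along these lines:

	void rcu_user_enter(void)
	{
		/*
		 * Exceptions can fire inside an irq handler, where
		 * rcu_irq_*() already protects RCU. Don't touch
		 * dynticks_nesting in that case.
		 */
		if (in_interrupt())
			return;
		/* ... usual user-mode entry below ... */
	}

and the same check in rcu_user_exit().)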


Re: RCU idle CPU detection is broken in linux-next

2012-09-26 Thread Frederic Weisbecker
2012/9/25 Sasha Levin levinsasha...@gmail.com:
 On 09/25/2012 02:06 PM, Frederic Weisbecker wrote:
 Sasha, sorry to burden you with more testing request.
 Could you please try out this new branch? It includes some fixes after Wu 
 Fenguang and
 Dan Carpenter reports (not related to your warnings though) and a patch on 
 the top
 of the pile to ensure I diagnosed well the problem, which return immediately 
 from
 rcu_user_*() APIs if we are in an interrupt.

 This way we'll have a clearer view. I also would like to know if there are 
 other
 problems with the rcu user mode.

 Thanks!

 Alrighty, I don't see any warnings anymore.

 I'll keep everything running just in case.

Cool! I'm pushing the fix tree to Paul then.

Thanks a lot for your testing!


Re: RCU idle CPU detection is broken in linux-next

2012-09-26 Thread Frederic Weisbecker
On Tue, Sep 25, 2012 at 11:36:54AM -0700, Paul E. McKenney wrote:
 On Tue, Sep 25, 2012 at 08:28:23PM +0200, Sasha Levin wrote:
  On 09/25/2012 02:06 PM, Frederic Weisbecker wrote:
   Sasha, sorry to burden you with more testing request.
   Could you please try out this new branch? It includes some fixes after Wu 
   Fenguang and
   Dan Carpenter reports (not related to your warnings though) and a patch 
   on the top
   of the pile to ensure I diagnosed well the problem, which return 
   immediately from
   rcu_user_*() APIs if we are in an interrupt.
   
   This way we'll have a clearer view. I also would like to know if there 
   are other
   problems with the rcu user mode.
   
   Thanks!
  
  Alrighty, I don't see any warnings anymore.
  
  I'll keep everything running just in case.
 
 Very good news!!!  Thank you both!!!

So, I've pushed the fixes in a new branch.

Changes are:

* Added in_interrupt() checks in "rcu: New rcu_user_enter() and rcu_user_exit() APIs"
so that rcu_user_enter/exit() don't nest in rcu_irq_enter/exit().

We'll need a longer term solution I guess because of:
- Irq bad nesting
- An exception could happen in the middle of irq_exit(), although that's quite
  unlikely (ie: between sub_preempt_count(HARDIRQ) and add_preempt_count(SOFTIRQ),
  or between sub_preempt_count(SOFTIRQ or HARDIRQ) and rcu_irq_exit()).
  Maybe I should rather check for (rdtp->dynticks_nesting & (DYNTICK_TASK_FLAG - 1))
  to find out if we are in the middle of an irq from an RCU POV.


* Fix the !notify_die(...) == NOTIFY_STOP to become notify_die(...) != NOTIFY_STOP.
Reported by Wu Fenguang and Dan Carpenter. Fixes are folded inside:
"x86: Exception hooks for userspace RCU extended QS".

The branch (rebase on top of your rcu/idle) is:

git://github.com/fweisbec/linux-dynticks.git
rcu/idle-for-v3.7-take5

Diff against your rcu/idle:

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index cb20776..3789675 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -232,8 +232,8 @@ DO_ERROR_INFO(X86_TRAP_AC, SIGBUS, "alignment check", alignment_check,
 dotraplinkage void do_stack_segment(struct pt_regs *regs, long error_code)
 {
	exception_enter(regs);
-	if (!notify_die(DIE_TRAP, "stack segment", regs, error_code,
-		       X86_TRAP_SS, SIGBUS) == NOTIFY_STOP) {
+	if (notify_die(DIE_TRAP, "stack segment", regs, error_code,
+		       X86_TRAP_SS, SIGBUS) != NOTIFY_STOP) {
		preempt_conditional_sti(regs);
		do_trap(X86_TRAP_SS, SIGBUS, "stack segment", regs, error_code, NULL);
		preempt_conditional_cli(regs);
@@ -285,8 +285,8 @@ do_general_protection(struct pt_regs *regs, long error_code)
 
	tsk->thread.error_code = error_code;
	tsk->thread.trap_nr = X86_TRAP_GP;
-	if (!notify_die(DIE_GPF, "general protection fault", regs, error_code,
-		       X86_TRAP_GP, SIGSEGV) == NOTIFY_STOP)
+	if (notify_die(DIE_GPF, "general protection fault", regs, error_code,
+		       X86_TRAP_GP, SIGSEGV) != NOTIFY_STOP)
		die("general protection fault", regs, error_code);
		goto exit;
	}
@@ -678,8 +678,8 @@ dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code)
	info.si_errno = 0;
	info.si_code = ILL_BADSTK;
	info.si_addr = NULL;
-	if (!notify_die(DIE_TRAP, "iret exception", regs, error_code,
-			X86_TRAP_IRET, SIGILL) == NOTIFY_STOP) {
+	if (notify_die(DIE_TRAP, "iret exception", regs, error_code,
+			X86_TRAP_IRET, SIGILL) != NOTIFY_STOP) {
		do_trap(X86_TRAP_IRET, SIGILL, "iret exception", regs, error_code,
			&info);
	}
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 72453cf..4fb2376 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -418,6 +418,17 @@ void rcu_user_enter(void)
unsigned long flags;
struct rcu_dynticks *rdtp;
 
+	/*
+	 * Some contexts may involve an exception occurring in an irq,
+	 * leading to that nesting:
+	 * rcu_irq_enter() rcu_user_exit() rcu_user_enter() rcu_irq_exit()
+	 * This would mess up the dyntick_nesting count though. And rcu_irq_*()
+	 * helpers are enough to protect RCU uses inside the exception. So
+	 * just return immediately if we detect we are in an IRQ.
+	 */
+	if (in_interrupt())
+		return;
+
	WARN_ON_ONCE(!current->mm);
 
local_irq_save(flags);
@@ -566,6 +577,17 @@ void rcu_user_exit(void)
unsigned long flags;
struct rcu_dynticks *rdtp;
 
+	/*
+	 * Some contexts may involve an exception occurring in an irq,
+	 * leading to that nesting:
+	 * rcu_irq_enter() rcu_user_exit() rcu_user_enter() rcu_irq_exit()
+* This would mess

Re: powerpc/perf: hw breakpoints return ENOSPC

2012-09-27 Thread Frederic Weisbecker
2012/9/25 Michael Neuling mi...@neuling.org:
 Michael Neuling mi...@neuling.org wrote:

 Frederic Weisbecker fweis...@gmail.com wrote:

  On Thu, Aug 16, 2012 at 02:23:54PM +1000, Michael Neuling wrote:
   Hi,
  
   I've been trying to get hardware breakpoints with perf to work on POWER7
   but I'm getting the following:
  
 % perf record -e mem:0x1000 true
  
   Error: sys_perf_event_open() syscall returned with 28 (No space left 
   on device).  /bin/dmesg may provide additional information.
  
   Fatal: No CONFIG_PERF_EVENTS=y kernel support configured?
  
 true: Terminated
  
   (FWIW adding -a and it works fine)
  
   Debugging it seems that __reserve_bp_slot() is returning ENOSPC because
   it thinks there are no free breakpoint slots on this CPU.
  
   I have a 2 CPUs, so perf userspace is doing two perf_event_open syscalls
   to add a counter to each CPU [1].  The first syscall succeeds but the
   second is failing.
  
   On this second syscall, fetch_bp_busy_slots() sets slots.pinned to be 1,
   despite there being no breakpoint on this CPU.  This is because the call
   the task_bp_pinned, checks all CPUs, rather than just the current CPU.
   POWER7 only has one hardware breakpoint per CPU (ie. HBP_NUM=1), so we
   return ENOSPC.
  
   The following patch fixes this by checking the associated CPU for each
   breakpoint in task_bp_pinned.  I'm not familiar with this code, so it's
   provided as a reference to the above issue.
  
   Mikey
  
   1. not sure why it doesn't just do one syscall and specify all CPUs, but
   that's another issue.  Using two syscalls should work.
 
  This patch seems to make sense. I'll try it and run some tests.
  Can I have your Signed-off-by ?

 Frederic,

 Did you ever get to testing or integrating this patch?

 Mikey

Sorry, I forgot this in my mailbox. I'm going to look at this in the
next few days.
Feel free to harass me by email or IRC if I don't give news on this soon.
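
For the record, the idea of Mikey's patch is to make task_bp_pinned()
take the CPU into account, so a breakpoint pinned to another CPU doesn't
consume a slot on the one being checked. Roughly (a sketch from his
reference patch, not necessarily the final form):

	static unsigned int task_bp_pinned(int cpu, struct perf_event *bp,
					   enum bp_type_idx type)
	{
		struct perf_event *iter;
		unsigned int count = 0;

		list_for_each_entry(iter, &bp_task_head, hw.bp_list) {
			if (iter->hw.bp_target == bp->hw.bp_target &&
			    find_slot_idx(iter) == type &&
			    (iter->cpu < 0 || cpu == iter->cpu))
				count += hw_breakpoint_weight(iter);
		}

		return count;
	}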


Re: [RFC/PATCHSET 00/15] perf report: Add support to accumulate hist periods

2012-09-27 Thread Frederic Weisbecker
On Tue, Sep 25, 2012 at 01:57:26PM +0900, Namhyung Kim wrote:
 Ping.  Any comments for this?
 
 Arun, thanks for testing!
 Namhyung

When Arun was working on this, I asked him to explore if it could make sense
to reuse the "-b, --branch-stack" perf report option. Because after all, this
feature is doing about the same as -b except it's using callchains instead of
full branch tracing. But callchains are branches. Just a limited subset of all
branches taken on execution. So you can probably reuse some interface and even
ground code there.

What do you think?


Re: [PATCHv3] perf x86_64: Fix rsp register for system call fast path

2012-10-03 Thread Frederic Weisbecker
On Wed, Oct 03, 2012 at 02:29:47PM +0200, Jiri Olsa wrote:
 +#ifdef CONFIG_X86_64
 +__weak void

Only annotate with __weak the default implementation you want to be
overridden. Here you want it to actually override the default __weak version.

 +arch_sample_regs_user_fixup(struct perf_regs_user *uregs, int kernel)
 +{
 + /*
 +  * If the perf event was triggered within the kernel code
 +  * path, then it was either syscall or interrupt. While
 +  * interrupt stores almost all user registers, the syscall
 +  * fast path does not. At this point we can at least set
 +  * rsp register right, which is crucial for dwarf unwind.
 +  *
 +  * The syscall_get_nr function returns -1 (orig_ax) for
 +  * interrupt, and positive value for syscall.
 +  *
 +  * We have two race windows in here:
 +  *
 +  * 1) Few instructions from syscall entry until old_rsp is
 +  *set.
 +  *
 +  * 2) In syscall/interrupt path from entry until the orig_ax
 +  *is set.
 +  *
 +  * Above described race windows are fractional opposed to
 +  * the syscall fast path, so we get much better results
 +  * fixing rsp this way.
 +  */
 + if (kernel && (syscall_get_nr(current, uregs->regs) >= 0)) {
 + /* Make a copy and link it to regs pointer. */
 + memcpy(&uregs->regs_copy, uregs->regs, sizeof(*uregs->regs));
 + uregs->regs = &uregs->regs_copy;
 +
 + /* And fix the rsp. */
 + uregs->regs->sp = this_cpu_read(old_rsp);
 + }
 +}
 +#endif
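
In other words, the convention is (a sketch, using the names from your patch):

	/* Generic default, meant to be overridden by the arch: */
	__weak void
	arch_sample_regs_user_fixup(struct perf_regs_user *uregs, int kernel)
	{
	}

	/* x86-64 version: a plain, non-weak definition wins at link time. */
	void arch_sample_regs_user_fixup(struct perf_regs_user *uregs, int kernel)
	{
		/* ... fix up uregs->regs->sp here ... */
	}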


Re: [PATCH] make CONFIG_EXPERIMENTAL invisible and default

2012-10-03 Thread Frederic Weisbecker
On Wed, Oct 03, 2012 at 10:21:42AM -0700, Greg Kroah-Hartman wrote:
 On Wed, Oct 03, 2012 at 09:47:12AM -0700, Paul E. McKenney wrote:
  On Wed, Oct 03, 2012 at 09:17:02AM -0700, Greg Kroah-Hartman wrote:
   On Wed, Oct 03, 2012 at 06:25:38AM -0700, Paul E. McKenney wrote:
On Tue, Oct 02, 2012 at 12:50:42PM -0700, Kees Cook wrote:
 This config item has not carried much meaning for a while now and is
 almost always enabled by default. As agreed during the Linux kernel
 summit, it should be removed. As a first step, remove it from being
 listed, and default it to on. Once it has been removed from all
 subsystem Kconfigs, it will be dropped entirely.
 
 CC: Greg KH gre...@linuxfoundation.org
 CC: Eric W. Biederman ebied...@xmission.com
 CC: Serge Hallyn serge.hal...@canonical.com
 CC: Paul E. McKenney paul...@linux.vnet.ibm.com
 CC: Andrew Morton a...@linux-foundation.org
 CC: Frederic Weisbecker fweis...@gmail.com
 Signed-off-by: Kees Cook keesc...@chromium.org
 ---
 
 This is the first of a series of 202 patches removing EXPERIMENTAL 
 from
 all the Kconfigs in the tree. Should I send them all to lkml (with all
 the associated CCs), or do people want to cherry-pick changes from my
 tree? I don't want to needlessly flood the list.
 
 http://git.kernel.org/?p=linux/kernel/git/kees/linux.git;a=shortlog;h=refs/heads/experimental
 
 I figure this patch can stand alone to at least make EXPERIMENTAL go
 away from the menus, and give us a taste of what the removal would do
 to builds.

OK, I will bite...  How should I flag an option that is initially only
intended for those willing to take some level of risk?
   
    In the text say "You really don't want to enable this option, use at
    your own risk!"  Or something like that :)
  
  OK, so the only real hope for experimental features is to refrain from
  creating a config option for them, so that people wishing to use them
  must modify the code?  Or is the philosophy that we keep things out of
  tree until we are comfortable with distros turning them on?
 
 I think that should have been your philosophy for a long time, as they
 turn on everything, and I don't blame them.
 Why would we have included
 it in the kernel tree, unless we wanted people to use the option?

A solution could be to add that option under CONFIG_DEBUG_KERNEL and specify
that it must only be enabled by developers for specific reasons (overhead,
security). CONFIG_PROVE_LOCKING falls into that category, right?

We have CONFIG_RCU_USER_QS that is a specific case. It's an intermediate state
before we implement a true CONFIG_NO_HZ_FULL. But the option is useless on its
own for users. Worse, it introduces a real overhead. OTOH we want it to be
upstream to make the development of the full tickless feature more incremental.

Perhaps we should put that under CONFIG_DEBUG_KERNEL.


[PATCH] rcu: Remove rcu_switch()

2012-10-03 Thread Frederic Weisbecker
It's only there to call rcu_user_hooks_switch(). Let's
just call rcu_user_hooks_switch() directly; we don't need this
intermediate function.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Josh Triplett j...@joshtriplett.org
Cc: Peter Zijlstra pet...@infradead.org
---
 include/linux/rcupdate.h |2 ++
 include/linux/sched.h|8 
 kernel/sched/core.c  |2 +-
 3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 7c968e4..5d009de 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -204,6 +204,8 @@ static inline void rcu_user_enter(void) { }
 static inline void rcu_user_exit(void) { }
 static inline void rcu_user_enter_after_irq(void) { }
 static inline void rcu_user_exit_after_irq(void) { }
+static inline void rcu_user_hooks_switch(struct task_struct *prev,
+struct task_struct *next) { }
 #endif /* CONFIG_RCU_USER_QS */
 
 extern void exit_rcu(void);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9d51e26..65e2694 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1886,14 +1886,6 @@ static inline void rcu_copy_process(struct task_struct *p)
 
 #endif
 
-static inline void rcu_switch(struct task_struct *prev,
- struct task_struct *next)
-{
-#ifdef CONFIG_RCU_USER_QS
-   rcu_user_hooks_switch(prev, next);
-#endif
-}
-
 static inline void tsk_restore_flags(struct task_struct *task,
unsigned long orig_flags, unsigned long flags)
 {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c177472..a48bef7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1887,7 +1887,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 #endif
 
/* Here we just switch the register state and the stack. */
-   rcu_switch(prev, next);
+   rcu_user_hooks_switch(prev, next);
switch_to(prev, next, prev);
 
barrier();
-- 
1.7.5.4



Re: [PATCH] rcu: Remove rcu_switch()

2012-10-03 Thread Frederic Weisbecker
On Wed, Oct 03, 2012 at 11:39:45AM -0700, Paul E. McKenney wrote:
 On Wed, Oct 03, 2012 at 08:21:52PM +0200, Frederic Weisbecker wrote:
  It's only there to call rcu_user_hooks_switch(). Let's
  just call rcu_user_hooks_switch() directly, we don't need this
  function in the middle.
 
 Hello, Frederic!
 
 Doesn't this also require an empty definition of rcu_user_hooks_switch()
 to handle the CONFIG_RCU_USER_QS=n case?  Or is there already such
 a definition that I am too blind to see?

There is, look below:

 
   Thanx, Paul
 
  Signed-off-by: Frederic Weisbecker fweis...@gmail.com
  Cc: Josh Triplett j...@joshtriplett.org
  Cc: Peter Zijlstra pet...@infradead.org
  ---
   include/linux/rcupdate.h |2 ++
   include/linux/sched.h|8 
   kernel/sched/core.c  |2 +-
   3 files changed, 3 insertions(+), 9 deletions(-)
  
  diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
  index 7c968e4..5d009de 100644
  --- a/include/linux/rcupdate.h
  +++ b/include/linux/rcupdate.h
  @@ -204,6 +204,8 @@ static inline void rcu_user_enter(void) { }
   static inline void rcu_user_exit(void) { }
   static inline void rcu_user_enter_after_irq(void) { }
   static inline void rcu_user_exit_after_irq(void) { }
  +static inline void rcu_user_hooks_switch(struct task_struct *prev,
  +struct task_struct *next) { }

Here.

   #endif /* CONFIG_RCU_USER_QS */


Re: [PATCH] make CONFIG_EXPERIMENTAL invisible and default

2012-10-03 Thread Frederic Weisbecker
On Wed, Oct 03, 2012 at 11:43:32AM -0700, Kees Cook wrote:
 I would expect a simple addition of "this is dangerous/buggy" to the
 description and "default n" is likely the way to go for that kind of
 thing.

Agreed.

 I think the history of CONFIG_EXPERIMENTAL has proven there
 isn't a sensible way to create a global flag for this kind of thing.
 To paraphrase Serge: my experimental options are not your experimental
 options.
 For example, some of the things that already had the experimental
 config removed, they left the (EXPERIMENTAL) in their config title.

Right.


Re: [PATCH] make CONFIG_EXPERIMENTAL invisible and default

2012-10-03 Thread Frederic Weisbecker
On Wed, Oct 03, 2012 at 03:36:53PM -0400, Dave Jones wrote:
 On Wed, Oct 03, 2012 at 07:46:18PM +0200, Frederic Weisbecker wrote:
it in the kernel tree, unless we wanted people to use the option?
   
   A solution could be to add that option under CONFIG_DEBUG_KERNEL and 
 specify
   that it must only be enabled by developers for specific reasons (overhead,
   security). CONFIG_PROVE_LOCKING falls into that category, right?
   
   We have CONFIG_RCU_USER_QS that is a specific case. It's an intermediate 
 state
   before we implement a true CONFIG_NO_HZ_FULL. But the option is useless on 
 its
   own for users. Worse, it introduces a real overhead. OTOH we want it to be 
 upstream
   to make the development of full tickless feature more incremental.
   
   Perhaps we should put that under CONFIG_DEBUG_KERNEL.
 
 Overloading an existing config option for something unrelated seems 
 unpleasant to me.
 It will only take a few people to start doing this, before it turns into a 
 landslide
 where everyone ends up with DEBUG_KERNEL set.
 And what of people who already have DEBUG_KERNEL set ?

Sorry, my wording wasn't clear. I didn't mean overloading CONFIG_DEBUG_KERNEL
but rather depending on it.

 
 Just state what you wrote above in the kconfig.
 Currently, RCU_USER_QS says nothing about the fact that it's work in progress.

Yeah I much prefer that. I'll add some details on the Kconfig.

 The missing part that I don't have an answer for however, is what happens
 when you deem this production ready? Distro maintainers won't notice the
 kconfig text changing. But perhaps that's a good thing, and will lead to 
 things
 only being enabled when people explicitly ask for them in distros.

That Kconfig option is likely going to disappear inside a new CONFIG_NO_HZ_FULL
that will enable individual features like the RCU user mode and such.

And if it stays, it will be enabled by CONFIG_NO_HZ_FULL. So it's not an option
anybody will ever have to deal with directly.

 Alternatively, if you really do want to go the path of a new config option,
 perhaps CONFIG_NOT_DISTRO_READY would spell things out more clearly.
 EXPERIMENTAL is such a wasteland it would take too much manpower to audit
 every case, and update accordingly, but scorching the earth and starting
 afresh might be feasible.

CONFIG_STAGING already does that kind of thing I guess. Although I suspect
people are reluctant to put core features in -staging.


Re: [PATCH 1/3] perf tools: Check existence of _get_comp_words_by_ref when bash completing

2012-10-04 Thread Frederic Weisbecker
On Thu, Oct 04, 2012 at 10:43:03AM +0900, Namhyung Kim wrote:
 Hi Frederic,
 
 On Tue, 2 Oct 2012 17:54:10 +0200, Frederic Weisbecker wrote:
  On Wed, Oct 03, 2012 at 12:21:32AM +0900, Namhyung Kim wrote:
  The '_get_comp_words_by_ref' function is available from the bash
  completion v1.2 so that earlier version emits following warning:
  
    $ perf re<TAB>_get_comp_words_by_ref: command not found
  
  Use older '_get_cword' method when the above function doesn't exist.
 
  May be only use _get_cword then, if it works everywhere?
 
 It'll work but it's deprecated.

Ok.


Re: RCU_USER_QS traces.

2012-10-05 Thread Frederic Weisbecker
On Thu, Oct 04, 2012 at 10:41:06AM -0400, Dave Jones wrote:
   We have CONFIG_RCU_USER_QS that is a specific case. It's an intermediate 
 state
   before we implement a true CONFIG_NO_HZ_FULL. But the option is useless on 
 its
   own for users. Worse, it introduces a real overhead. OTOH we want it to be 
 upstream
   to make the development of full tickless feature more incremental.
 
 I couldn't resist trying it.. Did these get reported yet ?

Hi Dave,

Thanks for this report. I need to find the source of the issue.

If you don't mind, could you please apply the following patch? You'll also
need to enable:

- CONFIG_EVENT_TRACING and CONFIG_RCU_TRACE

And you also need this boot parameter:

- trace_event=rcu_dyntick

Thanks a lot!

---
From 824f2ef855597d6dd263bb363727e4585db88ca3 Mon Sep 17 00:00:00 2001
From: Frederic Weisbecker fweis...@gmail.com
Date: Thu, 4 Oct 2012 17:30:36 +0200
Subject: [PATCH] rcu: Dump rcu traces and stacktraces on dynticks warnings

Temp patch for debugging.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
---
 kernel/rcutree.c |   38 --
 1 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 4fb2376..1bc96fd 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -59,6 +59,12 @@
 
 #include "rcu.h"
 
+#define trace_rcu_dyntick_stacktrace(...) {\
+   trace_rcu_dyntick(__VA_ARGS__);\
+   trace_dump_stack();\
+}
+
+
 /* Data structures. */
 
 static struct lock_class_key rcu_node_class[RCU_NUM_LVLS];
@@ -334,11 +340,11 @@ static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
 static void rcu_eqs_enter_common(struct rcu_dynticks *rdtp, long long oldval,
				bool user)
 {
-	trace_rcu_dyntick("Start", oldval, 0);
+	trace_rcu_dyntick_stacktrace("Start", oldval, 0);
	if (!user && !is_idle_task(current)) {
		struct task_struct *idle = idle_task(smp_processor_id());

-		trace_rcu_dyntick("Error on entry: not idle task", oldval, 0);
+		trace_rcu_dyntick_stacktrace("Error on entry: not idle task", oldval, 0);
		ftrace_dump(DUMP_ORIG);
		WARN_ONCE(1, "Current pid: %d comm: %s / Idle pid: %d comm: %s",
			  current->pid, current->comm,
@@ -349,7 +355,8 @@ static void rcu_eqs_enter_common(struct rcu_dynticks *rdtp, long long oldval,
	smp_mb__before_atomic_inc();  /* See above. */
	atomic_inc(&rdtp->dynticks);
	smp_mb__after_atomic_inc();  /* Force ordering with next sojourn. */
-	WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
+	if (WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1))
+		ftrace_dump(DUMP_ORIG);
 
/*
 * It is illegal to enter an extended quiescent state while
@@ -374,7 +381,8 @@ static void rcu_eqs_enter(bool user)
 
	rdtp = &__get_cpu_var(rcu_dynticks);
	oldval = rdtp->dynticks_nesting;
-	WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0);
+	if (WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0))
+		ftrace_dump(DUMP_ORIG);
	if ((oldval & DYNTICK_TASK_NEST_MASK) == DYNTICK_TASK_NEST_VALUE)
		rdtp->dynticks_nesting = 0;
	else
@@ -489,9 +497,9 @@ void rcu_irq_exit(void)
	oldval = rdtp->dynticks_nesting;
	rdtp->dynticks_nesting--;
	WARN_ON_ONCE(rdtp->dynticks_nesting < 0);
-	if (rdtp->dynticks_nesting)
-		trace_rcu_dyntick("--=", oldval, rdtp->dynticks_nesting);
-	else
+	if (rdtp->dynticks_nesting) {
+		trace_rcu_dyntick_stacktrace("--=", oldval, rdtp->dynticks_nesting);
+	} else
		rcu_eqs_enter_common(rdtp, oldval, true);
	local_irq_restore(flags);
 }
@@ -510,13 +518,14 @@ static void rcu_eqs_exit_common(struct rcu_dynticks *rdtp, long long oldval,
	atomic_inc(&rdtp->dynticks);
	/* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
	smp_mb__after_atomic_inc();  /* See above. */
-	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
+	if (WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1)))
+		ftrace_dump(DUMP_ORIG);
	rcu_cleanup_after_idle(smp_processor_id());
-	trace_rcu_dyntick("End", oldval, rdtp->dynticks_nesting);
+	trace_rcu_dyntick_stacktrace("End", oldval, rdtp->dynticks_nesting);
	if (!user && !is_idle_task(current)) {
		struct task_struct *idle = idle_task(smp_processor_id());

-		trace_rcu_dyntick("Error on exit: not idle task",
+		trace_rcu_dyntick_stacktrace("Error on exit: not idle task",
			  oldval, rdtp->dynticks_nesting);
		ftrace_dump(DUMP_ORIG);
		WARN_ONCE(1, "Current pid: %d comm: %s / Idle pid: %d comm: %s",
@@ -536,7 +545,8 @@ static void rcu_eqs_exit(bool user)

	rdtp = &__get_cpu_var(rcu_dynticks);
	oldval = rdtp->dynticks_nesting

Re: [PATCH 33/42] perf tools: Complete tracepoint event names

2012-10-05 Thread Frederic Weisbecker
On Thu, Oct 04, 2012 at 03:08:33PM -0300, Arnaldo Carvalho de Melo wrote:
 From: Namhyung Kim namhyung@lge.com
 
 Currently tracepoint events cannot be completed because they contain a
 colon (:) character.  The colon is considered as a word separator when
 bash completion is done - variable COMP_WORDBREAKS contains colon - so
 if a word being completed contains a colon it can be a problem.
 
 Recent versions of bash completion provide -n switch to
 _get_comp_words_by_ref and __ltrim_colon_completions functions in order
 to resolve this issue.  Copy the latter in case not exists.

Thanks for fixing this! I scratched my head on that bug.


Re: [PATCH] make CONFIG_EXPERIMENTAL invisible and default

2012-10-06 Thread Frederic Weisbecker
2012/10/5 Paul E. McKenney paul...@linux.vnet.ibm.com:
 On Thu, Oct 04, 2012 at 07:31:50AM -0700, Paul E. McKenney wrote:
 On Thu, Oct 04, 2012 at 02:55:39AM +0100, Matthew Garrett wrote:
  On Wed, Oct 03, 2012 at 01:03:14PM -0700, Paul E. McKenney wrote:
 
   That has not proven sufficient for me in the past, RCU_FAST_NO_HZ
   being a case in point.
 
  Taint the kernel at boot time? That'd be sufficient to force distros to
  disable it.

 Cool!  That does sound much more socially responsible than my thought
 of forcing a splat (e.g., WARN_ON(1)) during boot.  ;-)

 So, from what I can see, here is the list of the ways of warning distros
 off of a given kernel config option, taken in terms of CONFIG_RCU_USER_QS:

 1.  Make CONFIG_RCU_USER_QS depend on CONFIG_BROKEN.

 It sounds to me like distros would avoid adding this (do they?),
 but tester would probably avoid it as well.

 2.  Make CONFIG_RCU_USER_QS depend on CONFIG_STAGING.

 As Frederic noted, this is more of a driver thing than a
 core-kernel thing, so probably not appropriate.

 3.  Boot-time WARN_ON() if CONFIG_RCU_USER_QS=y.

 This seems to me to be a tad excessive.  But the place to do it
 might be rcu_bootup_announce_oddness() in kernel/rcutree_plugin.h.

 4.  Remove CONFIG_RCU_USER_QS from Kconfig, so that users have to
 patch their kernel to enable it.

 This also seems a tad excessive.

 5.  Maintain CONFIG_RCU_USER_QS out of tree, for example in the
 -rt patchset.

 This is a good place to start, but it has been where
 CONFIG_RCU_USER_QS has been for some time, and although it
 got some good testing, it clearly needs more.  In my view,
 CONFIG_RCU_USER_QS has outgrown its out-of-tree roots.

 6.  Boot-time add_taint() if CONFIG_RCU_USER_QS=y, as suggested
 by Matthew Garrett.  The taint value might be TAINT_CRAP,
 TAINT_OOT_MODULE, TAINT_WARN, or TAINT_FIRMWARE_WORKAROUND --
 all the other taint values disable lockdep.  Of these four,
 TAINT_OOT_MODULE and TAINT_FIRMWARE_WORKAROUND are clearly
 off-topic, leaving TAINT_CRAP and TAINT_WARN.  Taking them one
 at a time:

 TAINT_CRAP: Used when loading modules from staging.

 TAINT_WARN: Used when scheduling while atomic is encountered.

 So I have my tongue only halfway in my cheek when I suggest
 starting with TAINT_CRAP, then moving to TAINT_WARN, then
 removing the tainting altogether.  The place to do this might
 be rcu_bootup_announce_oddness() in kernel/rcutree_plugin.h.

 So how about the following progression?

 A.  Early days, only a few crazies should test.  In this case, the
 code should be out of tree, perhaps in something like -rt,
 perhaps as a set of patches.

 B.  Need more testers, but still not expected to work reasonably.
 Mainline, but depending on CONFIG_BROKEN.  (I am not all that
 enthusiastic about this option, but am including it for
 completeness.)

Yeah neither am I. With a dependency on CONFIG_BROKEN, it considerably
reduce the testing coverage too.


 C.  Need wide testing, but don't want 100,000,000 unsuspecting
 test subjects.  Taint the kernel with TAINT_CRAP.

 D.  OK for production in special situations, but definitely not
 for typical users.  Taint the kernel with TAINT_WARN.

 E.  Ready for general production use.  Mainlined without restrictions.

 I would say that CONFIG_RCU_USER_QS is currently at point C above, it
 clearly now needs testing on a wide variety of hardware, but also is
 clearly not ready for 100,000,000 users.

 Thoughts?

Really I would much prefer to add some "Don't enable it unless you're
doing kernel hacking. If unsure say N" text in the Kconfig.

I can understand that distros want to cover as many features as they
can for their users. But should it be an excuse for not reading
outstanding warnings in Kconfig help text?

Or maybe add some specific warning, yeah. I wouldn't mind much.


[PATCH 1/3] context_tracking: New context tracking susbsystem

2012-11-03 Thread Frederic Weisbecker
Create a new subsystem that probes kernel boundaries
to keep track of the transitions between context levels,
with two basic initial contexts: user or kernel.

This is an abstraction of some RCU code that uses such tracking
to implement its userspace extended quiescent state.

We need to pull this up from RCU into this new level of indirection
because this tracking is also going to be used to implement an on
demand generic virtual cputime accounting. A necessary step to
shutdown the tick while still accounting the cputime.
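
The heart of the subsystem is a pair of per-cpu probes, roughly like this
(a simplified sketch of kernel/context_tracking.c; the real code also
checks in_interrupt(), disables irqs and has a per-cpu active toggle):

	struct context_tracking {
		enum { IN_KERNEL, IN_USER } state;
	};
	static DEFINE_PER_CPU(struct context_tracking, context_tracking);

	void user_enter(void)
	{
		if (__this_cpu_read(context_tracking.state) != IN_USER) {
			__this_cpu_write(context_tracking.state, IN_USER);
			/* Enter the RCU userspace extended quiescent state */
			rcu_user_enter();
		}
	}

	void user_exit(void)
	{
		if (__this_cpu_read(context_tracking.state) == IN_USER) {
			__this_cpu_write(context_tracking.state, IN_KERNEL);
			/* Exit the RCU extended quiescent state */
			rcu_user_exit();
		}
	}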

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Andrew Morton a...@linux-foundation.org
Cc: H. Peter Anvin h...@zytor.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 arch/Kconfig   |   12 ++--
 arch/x86/Kconfig   |2 +-
 arch/x86/include/asm/{rcu.h => context_tracking.h} |   13 ++--
 arch/x86/kernel/entry_64.S |2 +-
 arch/x86/kernel/ptrace.c   |8 +-
 arch/x86/kernel/signal.c   |5 +-
 arch/x86/kernel/traps.c|2 +-
 arch/x86/mm/fault.c|2 +-
 include/linux/context_tracking.h   |   18 
 include/linux/rcupdate.h   |2 -
 include/linux/sched.h  |8 --
 init/Kconfig   |   30 
 kernel/Makefile|1 +
 kernel/context_tracking.c  |   83 
 kernel/rcutree.c   |   64 +---
 kernel/sched/core.c|9 +-
 16 files changed, 147 insertions(+), 114 deletions(-)
 rename arch/x86/include/asm/{rcu.h => context_tracking.h} (69%)
 create mode 100644 include/linux/context_tracking.h
 create mode 100644 kernel/context_tracking.c

diff --git a/arch/Kconfig b/arch/Kconfig
index 366ec06..3855e06 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -300,15 +300,15 @@ config SECCOMP_FILTER
 
  See Documentation/prctl/seccomp_filter.txt for details.
 
-config HAVE_RCU_USER_QS
+config HAVE_CONTEXT_TRACKING
bool
help
- Provide kernel entry/exit hooks necessary for userspace
+ Provide kernel/user boundaries probes necessary for userspace
  RCU extended quiescent state. Syscalls need to be wrapped inside
-	  rcu_user_exit()->rcu_user_enter() through the slow path using
- TIF_NOHZ flag. Exceptions handlers must be wrapped as well. Irqs
- are already protected inside rcu_irq_enter/rcu_irq_exit() but
- preemption or signal handling on irq exit still need to be protected.
+	  user_exit()->user_enter() through the slow path using TIF_NOHZ flag.
+ Exceptions handlers must be wrapped as well. Irqs are already
+ protected inside rcu_irq_enter/rcu_irq_exit() but preemption or
+ signal handling on irq exit still need to be protected.
 
 config HAVE_VIRT_CPU_ACCOUNTING
bool
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 46c3bff..110cfad 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -106,7 +106,7 @@ config X86
select KTIME_SCALAR if X86_32
select GENERIC_STRNCPY_FROM_USER
select GENERIC_STRNLEN_USER
-   select HAVE_RCU_USER_QS if X86_64
+   select HAVE_CONTEXT_TRACKING if X86_64
select HAVE_IRQ_TIME_ACCOUNTING
select GENERIC_KERNEL_THREAD
select GENERIC_KERNEL_EXECVE
diff --git a/arch/x86/include/asm/rcu.h b/arch/x86/include/asm/context_tracking.h
similarity index 69%
rename from arch/x86/include/asm/rcu.h
rename to arch/x86/include/asm/context_tracking.h
index d1ac07a..4d5b661 100644
--- a/arch/x86/include/asm/rcu.h
+++ b/arch/x86/include/asm/context_tracking.h
@@ -1,21 +1,20 @@
-#ifndef _ASM_X86_RCU_H
-#define _ASM_X86_RCU_H
+#ifndef _ASM_X86_CONTEXT_TRACKING_H
+#define _ASM_X86_CONTEXT_TRACKING_H
 
 #ifndef __ASSEMBLY__
-
-#include <linux/rcupdate.h>
+#include <linux/context_tracking.h>
 #include <asm/ptrace.h>
 
 static inline void exception_enter(struct pt_regs *regs)
 {
-   rcu_user_exit();
+   user_exit();
 }
 
 static inline void exception_exit(struct pt_regs *regs)
 {
-#ifdef CONFIG_RCU_USER_QS
+#ifdef CONFIG_CONTEXT_TRACKING
if (user_mode(regs))
-   rcu_user_enter();
+   user_enter();
 #endif
 }
 
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index b51b2c7..1a1c2ba 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -56,7 +56,7 @@
 #include <asm/ftrace.h>
 #include <asm/percpu.h>
 #include <asm/asm.h>
-#include <asm/rcu.h>
+#include <asm/context_tracking.h>
 #include <asm/smap.h>
 #include <linux/err.h>
 
diff --git

[PATCH 3/3] cputime: Generic on-demand virtual cputime accounting

2012-11-03 Thread Frederic Weisbecker
If we want to stop the tick further than idle, we need to be
able to account the cputime without using the tick.

Virtual based cputime accounting solves that problem by
hooking into kernel/user boundaries.

However implementing CONFIG_VIRT_CPU_ACCOUNTING requires
setting low level hooks and involves more overhead. But
we already have a generic context tracking subsystem
that is required for RCU by archs which want to
shut down the tick outside idle.

This patch implements a generic virtual based cputime
accounting that relies on these generic kernel/user hooks.

There are some upsides of doing this:

- This requires no arch code to implement CONFIG_VIRT_CPU_ACCOUNTING
if context tracking is already built (already necessary for RCU in full
tickless mode).

- We can rely on the generic context tracking subsystem to dynamically
(de)activate the hooks, so that we can switch anytime between virtual
and tick based accounting. This way we don't have the overhead
of the virtual accounting when the tick is running periodically.

And a few downsides:

- It relies on jiffies and the hooks are set in high level code. This
results in less precise cputime accounting than with a true native
virtual based cputime accounting, which hooks into low level code and
uses a CPU hardware clock. Precision is not the goal of this though.

- There is probably more overhead than a native virtual based cputime
accounting. But this relies on hooks that are already set anyway.
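
To make the jiffies-granularity point concrete, a minimal sketch of what
the boundary hook can boil down to. The per cpu vtime_jiffies snapshot is
a name made up for illustration; only account_user_time() and
jiffies_to_cputime() are the real APIs:

static DEFINE_PER_CPU(unsigned long, vtime_jiffies);	/* hypothetical */

void __vtime_account_user(struct task_struct *tsk)
{
	unsigned long now = jiffies;
	cputime_t delta = jiffies_to_cputime(now - __this_cpu_read(vtime_jiffies));

	__this_cpu_write(vtime_jiffies, now);
	account_user_time(tsk, delta, delta);	/* one jiffy granularity */
}

Anything shorter than a tick period is lost, which is exactly the
precision trade-off stated above.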

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Andrew Morton a...@linux-foundation.org
Cc: H. Peter Anvin h...@zytor.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 include/linux/context_tracking.h |   28 ++
 include/linux/vtime.h|7 +++
 init/Kconfig |   11 -
 kernel/context_tracking.c|   16 +-
 kernel/sched/cputime.c   |  112 --
 5 files changed, 154 insertions(+), 20 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index e24339c..3b63210 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,12 +3,40 @@
 
 #ifdef CONFIG_CONTEXT_TRACKING
 #include <linux/sched.h>
+#include <linux/percpu.h>
+
+struct context_tracking {
+   /*
+* When active is false, hooks are not set to
+* minimize overhead: TIF flags are cleared
+* and calls to user_enter/exit are ignored. This
+* may be further optimized using static keys.
+*/
+   bool active;
+   enum {
+   IN_KERNEL = 0,
+   IN_USER,
+   } state;
+};
+
+DECLARE_PER_CPU(struct context_tracking, context_tracking);
+
+static inline bool context_tracking_in_user(void)
+{
+   return __this_cpu_read(context_tracking.state) == IN_USER;
+}
+
+static inline bool context_tracking_active(void)
+{
+   return __this_cpu_read(context_tracking.active);
+}
 
 extern void user_enter(void);
 extern void user_exit(void);
 extern void context_tracking_task_switch(struct task_struct *prev,
 struct task_struct *next);
 #else
+static inline bool context_tracking_in_user(void) { return false; }
 static inline void user_enter(void) { }
 static inline void user_exit(void) { }
 static inline void context_tracking_task_switch(struct task_struct *prev,
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 85a1f0f..3ea63a1 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -23,6 +23,13 @@ static inline void vtime_account(struct task_struct *tsk) { }
 static inline bool vtime_accounting(void) { return false; }
 #endif
 
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+extern void __vtime_account_user(struct task_struct *tsk);
+extern bool vtime_accounting(void);
+#else
+static inline void __vtime_account_user(struct task_struct *tsk) { }
+#endif
+
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 extern void irqtime_account_irq(struct task_struct *tsk);
 #else
diff --git a/init/Kconfig b/init/Kconfig
index 15e44e7..ad96572 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -344,7 +344,9 @@ config TICK_CPU_ACCOUNTING
 
 config VIRT_CPU_ACCOUNTING
 	bool "Deterministic task and CPU time accounting"
-   depends on HAVE_VIRT_CPU_ACCOUNTING
+   depends on HAVE_VIRT_CPU_ACCOUNTING || HAVE_CONTEXT_TRACKING
+   select VIRT_CPU_ACCOUNTING_GEN if !HAVE_VIRT_CPU_ACCOUNTING
+   default y if PPC64
help
  Select this option to enable more accurate task and CPU time
  accounting.  This is done by reading a CPU counter on each
@@ -367,6 +369,13 @@ config IRQ_TIME_ACCOUNTING
 
 endchoice
 
+config VIRT_CPU_ACCOUNTING_GEN
+   select CONTEXT_TRACKING
+   bool
+   help
+ Implement a generic virtual based cputime accounting by using

[PATCH 0/3] cputime: Generic virtual based cputime accounting v4

2012-11-03 Thread Frederic Weisbecker
Hi,

I'm back on this patchset now that the necessary cputime cleanups are
merged, although more cputime consolidation (as in the ctx switch and tick
paths) should also be done in the future, when I get time to clean up
the s390 part.

So this version of the generic vtime is essentially a rebase against
latest changes (tip:sched/core). Once we get that thing in, we'll need
to handle the cputime read side when the write side is in nohz mode. Probably
no big deal but let's move step by step, as usual.

Comments?

This can be fetched from:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
vtime/generic-v4


Frederic Weisbecker (3):
  context_tracking: New context tracking subsystem
  cputime: Allow dynamic switch between tick/virtual based cputime
accounting
  cputime: Generic on-demand virtual cputime accounting

 arch/Kconfig   |   12 +-
 arch/ia64/include/asm/cputime.h|5 +
 arch/ia64/kernel/time.c|2 +-
 arch/powerpc/include/asm/cputime.h |5 +
 arch/powerpc/kernel/time.c |2 +-
 arch/s390/include/asm/cputime.h|5 +
 arch/s390/kernel/vtime.c   |2 +-
 arch/x86/Kconfig   |2 +-
 arch/x86/include/asm/{rcu.h => context_tracking.h} |   13 +-
 arch/x86/kernel/entry_64.S |2 +-
 arch/x86/kernel/ptrace.c   |8 +-
 arch/x86/kernel/signal.c   |5 +-
 arch/x86/kernel/traps.c|2 +-
 arch/x86/mm/fault.c|2 +-
 include/linux/context_tracking.h   |   46 ++
 include/linux/rcupdate.h   |2 -
 include/linux/sched.h  |   13 +--
 include/linux/vtime.h  |   14 ++
 init/Kconfig   |   41 --
 kernel/Makefile|1 +
 kernel/context_tracking.c  |   71 +
 kernel/fork.c  |3 +-
 kernel/rcutree.c   |   64 +
 kernel/sched/core.c|9 +-
 kernel/sched/cputime.c |  152 
 kernel/time/tick-sched.c   |5 +-
 26 files changed, 335 insertions(+), 153 deletions(-)
 rename arch/x86/include/asm/{rcu.h => context_tracking.h} (69%)
 create mode 100644 include/linux/context_tracking.h
 create mode 100644 kernel/context_tracking.c

-- 
1.7.5.4



[PATCH 2/3] cputime: Allow dynamic switch between tick/virtual based cputime accounting

2012-11-03 Thread Frederic Weisbecker
Allow dynamic switching between tick and virtual based cputime accounting.
This way we can provide a kind of on-demand virtual based cputime
accounting. In this mode, the kernel will rely on the user hooks
subsystem to dynamically hook on kernel boundaries.

This is in preparation for being able to stop the timer tick further
than idle. Doing so will depend on CONFIG_VIRT_CPU_ACCOUNTING, which
makes it possible to account the cputime without the tick by hooking
on kernel/user boundaries.

Depending on whether the tick is stopped or not, we can switch between
tick and vtime based accounting anytime, in order to minimize the
overhead associated with the user hooks.
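
As a sketch, the generic predicate could simply mirror the context
tracking state (this is my assumption of its shape; the native archs in
the diff below hardcode it to true):

static inline bool vtime_accounting(void)
{
	/* vtime is in effect whenever context tracking runs on this CPU */
	return context_tracking_active();
}

account_process_tick() can then pick between the tick based path and
vtime_account_process_tick() at runtime.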

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Andrew Morton a...@linux-foundation.org
Cc: H. Peter Anvin h...@zytor.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 arch/ia64/include/asm/cputime.h|5 
 arch/ia64/kernel/time.c|2 +-
 arch/powerpc/include/asm/cputime.h |5 
 arch/powerpc/kernel/time.c |2 +-
 arch/s390/include/asm/cputime.h|5 
 arch/s390/kernel/vtime.c   |2 +-
 include/linux/sched.h  |5 +---
 include/linux/vtime.h  |7 ++
 kernel/fork.c  |3 +-
 kernel/sched/cputime.c |   40 ---
 kernel/time/tick-sched.c   |5 ++-
 11 files changed, 48 insertions(+), 33 deletions(-)

diff --git a/arch/ia64/include/asm/cputime.h b/arch/ia64/include/asm/cputime.h
index 3deac95..49782fe 100644
--- a/arch/ia64/include/asm/cputime.h
+++ b/arch/ia64/include/asm/cputime.h
@@ -103,5 +103,10 @@ static inline void cputime_to_timeval(const cputime_t ct, struct timeval *val)
 #define cputime64_to_clock_t(__ct) \
cputime_to_clock_t((__force cputime_t)__ct)
 
+static inline bool vtime_accounting(void)
+{
+   return true;
+}
+
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING */
 #endif /* __IA64_CPUTIME_H */
diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 5e48503..7b1fa3d 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -151,7 +151,7 @@ void __vtime_account_idle(struct task_struct *tsk)
  * Called from the timer interrupt handler to charge accumulated user time
  * to the current process.  Must be called with interrupts disabled.
  */
-void account_process_tick(struct task_struct *p, int user_tick)
+void vtime_account_process_tick(struct task_struct *p, int user_tick)
 {
vtime_account_user(p);
 }
diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h
index 487d46f..e84c2b3 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -228,6 +228,11 @@ static inline cputime_t clock_t_to_cputime(const unsigned long clk)
 
 #define cputime64_to_clock_t(ct)   cputime_to_clock_t((cputime_t)(ct))
 
+static inline bool vtime_accounting(void)
+{
+   return true;
+}
+
 #endif /* __KERNEL__ */
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING */
 #endif /* __POWERPC_CPUTIME_H */
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 0db456f..1c9a3b8 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -363,7 +363,7 @@ void __vtime_account_idle(struct task_struct *tsk)
  * (i.e. since the last entry from usermode) so that
 * get_paca()->user_time_scaled is up to date.
  */
-void account_process_tick(struct task_struct *tsk, int user_tick)
+void vtime_account_process_tick(struct task_struct *tsk, int user_tick)
 {
cputime_t utime, utimescaled;
 
diff --git a/arch/s390/include/asm/cputime.h b/arch/s390/include/asm/cputime.h
index 023d5ae..a96161b 100644
--- a/arch/s390/include/asm/cputime.h
+++ b/arch/s390/include/asm/cputime.h
@@ -191,4 +191,9 @@ static inline int s390_nohz_delay(int cpu)
 
 #define arch_needs_cpu(cpu) s390_nohz_delay(cpu)
 
+static inline bool vtime_accounting(void)
+{
+   return true;
+}
+
 #endif /* _S390_CPUTIME_H */
diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index 783e988..ab180de 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -112,7 +112,7 @@ void vtime_task_switch(struct task_struct *prev)
	S390_lowcore.system_timer = ti->system_timer;
 }
 
-void account_process_tick(struct task_struct *tsk, int user_tick)
+void vtime_account_process_tick(struct task_struct *tsk, int user_tick)
 {
if (do_account_vtime(tsk, HARDIRQ_OFFSET))
virt_timer_expire();
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6c13fe3..2f9bba0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -580,9 +580,7 @@ struct signal_struct {
cputime_t utime, stime, cutime, cstime;
cputime_t gtime;
cputime_t cgtime;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING

Re: [PATCH 00/32] [RFC] nohz/cpuset: Start discussions on nohz CPUs

2012-11-05 Thread Frederic Weisbecker
2012/11/2 Christoph Lameter c...@linux.com:
 Also could we have this support without cpusets? There are multiple means
 to do system segmentation (f.e. cgroups) and something like hz control is
 pretty basic. Control via some cpumask like irq affinities in f.e.

 /sys/devices/system/cpu/nohz

 or a per cpu flag in

 /sys/devices/system/cpu/cpu0/hz

 would be easier and not be tied to something like cpusets.

You really don't want that cpuset interface, do you? ;-)

Yeah, I think I agree with you. This adds a dependency on
cpusets/cgroups, and I wish we could avoid that if possible. Also cpuset
may be a bit counter-intuitive for this usecase. What if a cpu is
included in both a nohz cpuset and a non-nohz cpuset? What is the
behaviour to adopt? An OR on the nohz flag, such that as long as the
CPU is in at least one nohz cpuset it's considered a nohz CPU? Or
only shut down the tick for the tasks attached to the nohz cpusets? Do
we really want that per-cgroup granularity and the overhead /
complexity that comes along?

No, I think we should stay simple and have a simple per-CPU property
for that, without involving cgroups.

So indeed a cpumask in /sys/devices/system/cpu/nohz looks like a
better interface.

 This has been long asked for by those in the RT community. If a task
 requires uninterruptible CPU time, this would be able to give a task
 that, even without the full PREEMPT-RT patch set.

 Also those interested in low latency are very very interested in this
 feature in particular in support without any preempt support on in the
 kernel.

Sure, we are trying to make that full dynticks approach as generic
as possible.


Re: [PATCH 00/32] [RFC] nohz/cpuset: Start discussions on nohz CPUs

2012-11-05 Thread Frederic Weisbecker
2012/11/5 Christoph Lameter c...@linux.com:
 On Fri, 2 Nov 2012, Paul E. McKenney wrote:

 On Fri, Nov 02, 2012 at 04:51:50PM -0400, Steven Rostedt wrote:
  On Fri, 2012-11-02 at 13:41 -0700, Paul E. McKenney wrote:
 
   The no-CBs mask would be read-only for some time -- changed only at
   boot.  Longer term, I hope to allow run-time modification, but...
 
  but what? You're not looking to retire already are you? ;-)

 Not for a few decades.  ;-)

 But let's add the no-CBs mask to sysfs when I add the ability to run-time
 modify that mask.

 Well we are creating a user ABI with the boot time option. It would be
 best to get it right out of the door.

I believe that a static setting through a boot option is a nice first
step already. Runtime tuning may involve dynamic migration and other
headaches. The nocb patch is tricky enough to review ;)


Re: [RFC PATCH 7/9] irq_work: Make self-IPIs optable

2012-11-06 Thread Frederic Weisbecker
2012/10/29 Steven Rostedt rost...@goodmis.org:
 On Mon, 2012-10-29 at 14:28 +0100, Frederic Weisbecker wrote:
 On irq work initialization, let the user choose to define it
 as lazy or not. Lazy means that we don't want to send
 an IPI (provided the arch can anyway) when we enqueue this
 work; we would rather wait for the next timer tick
 to execute our work if possible.

 This is going to be a benefit for non-urgent enqueuers
 (like printk in the future) that may prefer not to raise
 an IPI storm in case of frequent enqueuing over short periods
 of time.

 Signed-off-by: Frederic Weisbecker fweis...@gmail.com
 Cc: Peter Zijlstra pet...@infradead.org
 Cc: Thomas Gleixner t...@linutronix.de
 Cc: Ingo Molnar mi...@kernel.org
 Cc: Andrew Morton a...@linux-foundation.org
 Cc: Steven Rostedt rost...@goodmis.org
 Cc: Paul Gortmaker paul.gortma...@windriver.com
 ---
  include/linux/irq_work.h |   31 ++
  kernel/irq_work.c|   53 -
  kernel/time/tick-sched.c |3 +-
  3 files changed, 70 insertions(+), 17 deletions(-)

 diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
 index b39ea0b..7b60c87 100644
 --- a/include/linux/irq_work.h
 +++ b/include/linux/irq_work.h
 @@ -4,6 +4,20 @@
  #include <linux/llist.h>
  #include <asm/irq_work.h>

 +/*
 + * An entry can be in one of four states:
 + *

 Can you add a comment to what the pointer value is. I know you just
 moved it to the header, but it's still confusing.

Which pointer value? You mean the flags? Or does the comment below need
more details?


 + * free      NULL, 0 -> {claimed}       : free to be used
 + * claimed   NULL, 3 -> {pending}       : claimed to be enqueued
 + * pending   next, 3 -> {busy}          : queued, pending callback
 + * busy      NULL, 2 -> {free, claimed} : callback in progress, can be claimed
 + */
 +
 +#define IRQ_WORK_PENDING	1UL
 +#define IRQ_WORK_BUSY	2UL
 +#define IRQ_WORK_FLAGS	3UL
 +#define IRQ_WORK_LAZY	4UL /* Doesn't want IPI, wait for tick */
[...]
 @@ -66,10 +56,28 @@ static void __irq_work_queue(struct irq_work *work)
   preempt_disable();

   empty = llist_add(&work->llnode, &__get_cpu_var(irq_work_list));
 - /* The list was empty, raise self-interrupt to start processing. */
 - if (empty)
 +
 + /*
 +  * In any case, raise an IPI if requested and possible in case
 +  * the queue is empty or it's filled with lazy works.
 +  */
 + if (!(work->flags & IRQ_WORK_LAZY) && arch_irq_work_has_ipi()) {
   arch_irq_work_raise();

 Doesn't this mean that now if we queue up a bunch of work (say in
 tracing), that we will send out an IPI for each queue? We only want to
 send out an IPI if the list isn't empty. Perhaps we should make two
 lists. One for lazy work and one for immediate work. Then, when adding a
 non-lazy work item, we can check the empty variable for that. No need to
 check the result for the lazy queue. That will be done during tick.

Indeed, if an IPI is pending while we queue another work, we'll raise
another one. I would prefer to avoid the complication of adding
another queue though. Perhaps a per cpu ipi_pending flag would be
enough. I'll try something.
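
For reference, this is the shape it takes in the v4 series (patch 6/7
below in this archive): a per cpu flag claimed with a cmpxchg, so only
the first enqueuer since the last run raises the IPI. A sketch:

static DEFINE_PER_CPU(int, irq_work_raised);

static void raise_unless_pending(void)	/* hypothetical helper name */
{
	/* only the 0 -> 1 transition wins and sends the IPI */
	if (!this_cpu_cmpxchg(irq_work_raised, 0, 1))
		arch_irq_work_raise();
}

The runner resets the flag before walking the list, so an enqueue that
races with the run still gets a fresh IPI.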


[PATCH 1/7] irq_work: Fix racy IRQ_WORK_BUSY flag setting

2012-11-08 Thread Frederic Weisbecker
The IRQ_WORK_BUSY flag is set right before we execute the
work. Once this flag value is set, the work enters a
claimable state again.

So if we have specific data to compute in our work, we ensure it's
either handled by another CPU or locally by enqueuing the work again.
This state machine is guaranteed by atomic operations on the flags.

So when we set IRQ_WORK_BUSY without using an xchg-like operation,
we break this guarantee as in the following summarized scenario:

CPU 1   CPU 2
-   -
(flags = 0)
old_flags = flags;
(flags = 0)
cmpxchg(flags, old_flags,
old_flags | IRQ_WORK_FLAGS)
(flags = 3)
[...]
flags = IRQ_WORK_BUSY
(flags = 2)
func()
(sees flags = 3)
cmpxchg(flags, old_flags,
        old_flags | IRQ_WORK_FLAGS)
(give up)

cmpxchg(flags, 2, 0);
(flags = 0)

CPU 1 claims a work and executes it, so it sets IRQ_WORK_BUSY and
the work is again in a claimable state. Now CPU 2 has new data to process
and tries to claim that work, but it may see a stale value of the flags
and think the work is still pending somewhere that will handle our data.
This is because CPU 1 doesn't set IRQ_WORK_BUSY atomically.

As a result, the data expected to be handled by CPU 2 won't get handled.

To fix this, use xchg() to set IRQ_WORK_BUSY. This way we ensure that
CPU 2 will see the correct value with its cmpxchg(), with the expected
ordering.
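
In code terms, the one-line difference this patch makes (sketch; the
real hunk follows):

	work->flags = IRQ_WORK_BUSY;		/* before: plain store, no ordering */
	xchg(&work->flags, IRQ_WORK_BUSY);	/* after: atomic with full barrier */

xchg() implies a full memory barrier, so the BUSY transition is globally
visible before func() runs, and a concurrent claimer's cmpxchg() cannot
act on a stale PENDING value.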

Changelog-heavily-inspired-by: Steven Rostedt rost...@goodmis.org
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Anish Kumar anish198519851...@gmail.com
---
 kernel/irq_work.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 1588e3b..57be1a6 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -119,8 +119,11 @@ void irq_work_run(void)
/*
 * Clear the PENDING bit, after this point the @work
 * can be re-used.
+* Make it immediately visible so that other CPUs trying
+* to claim that work don't rely on us to handle their data
+* while we are in the middle of the func.
 */
-	work->flags = IRQ_WORK_BUSY;
+	xchg(&work->flags, IRQ_WORK_BUSY);
work-func(work);
/*
 * Clear the BUSY bit and return to the free state if
-- 
1.7.5.4



[PATCH 6/7] irq_work: Make self-IPIs optable

2012-11-08 Thread Frederic Weisbecker
On irq work initialization, let the user choose to define it
as lazy or not. Lazy means that we don't want to send
an IPI (provided the arch can anyway) when we enqueue this
work; we would rather wait for the next timer tick
to execute our work if possible.

This is going to be a benefit for non-urgent enqueuers
(like printk in the future) that may prefer not to raise
an IPI storm in case of frequent enqueuing over short periods
of time.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/irq_work.h |   14 ++
 kernel/irq_work.c|   46 ++
 2 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index a69704f..b28eb60 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -3,6 +3,20 @@
 
 #include <linux/llist.h>
 
+/*
+ * An entry can be in one of four states:
+ *
+ * free      NULL, 0 -> {claimed}       : free to be used
+ * claimed   NULL, 3 -> {pending}       : claimed to be enqueued
+ * pending   next, 3 -> {busy}          : queued, pending callback
+ * busy      NULL, 2 -> {free, claimed} : callback in progress, can be claimed
+ */
+
+#define IRQ_WORK_PENDING   1UL
+#define IRQ_WORK_BUSY  2UL
+#define IRQ_WORK_FLAGS 3UL
+#define IRQ_WORK_LAZY  4UL /* Doesn't want IPI, wait for tick */
+
 struct irq_work {
unsigned long flags;
struct llist_node llnode;
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index b3c113a..7090b36 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -12,22 +12,13 @@
 #include linux/percpu.h
 #include linux/hardirq.h
 #include linux/irqflags.h
+#include <linux/tick.h>
+#include <linux/sched.h>
 #include <asm/processor.h>
 
-/*
- * An entry can be in one of four states:
- *
- * free      NULL, 0 -> {claimed}       : free to be used
- * claimed   NULL, 3 -> {pending}       : claimed to be enqueued
- * pending   next, 3 -> {busy}          : queued, pending callback
- * busy      NULL, 2 -> {free, claimed} : callback in progress, can be claimed
- */
-
-#define IRQ_WORK_PENDING   1UL
-#define IRQ_WORK_BUSY  2UL
-#define IRQ_WORK_FLAGS 3UL
 
 static DEFINE_PER_CPU(struct llist_head, irq_work_list);
+static DEFINE_PER_CPU(int, irq_work_raised);
 
 /*
  * Claim the entry so that no one else will poke at it.
@@ -67,14 +58,18 @@ void __weak arch_irq_work_raise(void)
  */
 static void __irq_work_queue(struct irq_work *work)
 {
-   bool empty;
-
preempt_disable();
 
-	empty = llist_add(&work->llnode, &__get_cpu_var(irq_work_list));
-   /* The list was empty, raise self-interrupt to start processing. */
-   if (empty)
-   arch_irq_work_raise();
+	llist_add(&work->llnode, &__get_cpu_var(irq_work_list));
+
+   /*
+* If the work is flagged as lazy, just wait for the next tick
+* to run it. Otherwise, or if the tick is stopped, raise the irq work.
+*/
+	if (!(work->flags & IRQ_WORK_LAZY) || tick_nohz_tick_stopped()) {
+   if (!this_cpu_cmpxchg(irq_work_raised, 0, 1))
+   arch_irq_work_raise();
+   }
 
preempt_enable();
 }
@@ -116,10 +111,19 @@ bool irq_work_needs_cpu(void)
  */
 void irq_work_run(void)
 {
+   unsigned long flags;
struct irq_work *work;
struct llist_head *this_list;
struct llist_node *llnode;
 
+
+   /*
+* Reset the raised state right before we check the list because
+* an NMI may enqueue after we find the list empty from the runner.
+*/
+   __this_cpu_write(irq_work_raised, 0);
+   barrier();
+
 	this_list = &__get_cpu_var(irq_work_list);
if (llist_empty(this_list))
return;
@@ -140,13 +144,15 @@ void irq_work_run(void)
 * to claim that work don't rely on us to handle their data
 * while we are in the middle of the func.
 */
-	xchg(&work->flags, IRQ_WORK_BUSY);
+	flags = work->flags & ~IRQ_WORK_PENDING;
+	xchg(&work->flags, flags);
+
 	work->func(work);
/*
 * Clear the BUSY bit and return to the free state if
 * no-one else claimed it meanwhile.
 */
-	(void)cmpxchg(&work->flags, IRQ_WORK_BUSY, 0);
+	(void)cmpxchg(&work->flags, flags, flags & ~IRQ_WORK_BUSY);
}
 }
 EXPORT_SYMBOL_GPL(irq_work_run);
-- 
1.7.5.4


[PATCH 0/7] printk: Make it usable on nohz CPUs v4

2012-11-08 Thread Frederic Weisbecker
Hi,

There are rough changes in this series, the patchset is much simpler:

* Include the two fixes for SMP-safe irq_work claim [1/7],[2/7] 

* Split out the part that prevents stopping the tick with pending work
on the queue, to make the review easier.

* Don't care anymore whether the arch can send self-IPIs or not. The real point
of it is for when we stop the tick outside idle: we'll require self-IPI
support from the arch then. We'll deal with that later.

* Drop the set_need_resched() hack when we enqueue in dyntick idle mode. I'm
just too worried that printk() calls in the idle loop may prevent the tick from
ever being stopped, and even the CPU from halting.

* When the work is lazy, just never raise unless the tick is stopped. We don't
care whether the arch has some obscure way to raise it or not. In any case,
the hook in update_process_times() can take care of it.

* Don't raise an IPI after queueing if there is a pending one already.

* Use per cpu irq work for printk because the printk_pending flags are per CPU
and thus can't be handled remotely.

* Queue an irq work in case of printk_sched() as well.

You can fetch from:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
nohz/printk-v4

Thanks.

Frederic Weisbecker (7):
  irq_work: Fix racy IRQ_WORK_BUSY flag setting
  irq_work: Fix racy check on work pending flag
  irq_work: Remove CONFIG_HAVE_IRQ_WORK
  nohz: Add API to check tick state
  irq_work: Don't stop the tick with pending works
  irq_work: Make self-IPIs optable
  printk: Wake up klogd using irq_work

 arch/alpha/Kconfig  |1 -
 arch/arm/Kconfig|1 -
 arch/arm64/Kconfig  |1 -
 arch/blackfin/Kconfig   |1 -
 arch/frv/Kconfig|1 -
 arch/hexagon/Kconfig|1 -
 arch/mips/Kconfig   |1 -
 arch/parisc/Kconfig |1 -
 arch/powerpc/Kconfig|1 -
 arch/s390/Kconfig   |1 -
 arch/sh/Kconfig |1 -
 arch/sparc/Kconfig  |1 -
 arch/x86/Kconfig|1 -
 drivers/staging/iio/trigger/Kconfig |1 -
 include/linux/irq_work.h|   20 +
 include/linux/printk.h  |3 -
 include/linux/tick.h|   17 +++-
 init/Kconfig|5 +--
 kernel/irq_work.c   |   76 +++---
 kernel/printk.c |   36 +---
 kernel/time/tick-sched.c|7 ++-
 kernel/timer.c  |1 -
 22 files changed, 112 insertions(+), 67 deletions(-)

-- 
1.7.5.4



[PATCH 3/7] irq_work: Remove CONFIG_HAVE_IRQ_WORK

2012-11-08 Thread Frederic Weisbecker
irq work can run on any arch even without IPI
support because of the hook on update_process_times().

So lets remove HAVE_IRQ_WORK because it doesn't reflect
any backend requirement.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 arch/alpha/Kconfig  |1 -
 arch/arm/Kconfig|1 -
 arch/arm64/Kconfig  |1 -
 arch/blackfin/Kconfig   |1 -
 arch/frv/Kconfig|1 -
 arch/hexagon/Kconfig|1 -
 arch/mips/Kconfig   |1 -
 arch/parisc/Kconfig |1 -
 arch/powerpc/Kconfig|1 -
 arch/s390/Kconfig   |1 -
 arch/sh/Kconfig |1 -
 arch/sparc/Kconfig  |1 -
 arch/x86/Kconfig|1 -
 drivers/staging/iio/trigger/Kconfig |1 -
 init/Kconfig|4 
 15 files changed, 0 insertions(+), 18 deletions(-)

diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 5dd7f5d..e56c2d1 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -5,7 +5,6 @@ config ALPHA
select HAVE_IDE
select HAVE_OPROFILE
select HAVE_SYSCALL_WRAPPERS
-   select HAVE_IRQ_WORK
select HAVE_PCSPKR_PLATFORM
select HAVE_PERF_EVENTS
select HAVE_DMA_ATTRS
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index ade7e92..22d378b 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -36,7 +36,6 @@ config ARM
select HAVE_GENERIC_HARDIRQS
	select HAVE_HW_BREAKPOINT if (PERF_EVENTS && (CPU_V6 || CPU_V6K || CPU_V7))
select HAVE_IDE if PCI || ISA || PCMCIA
-   select HAVE_IRQ_WORK
select HAVE_KERNEL_GZIP
select HAVE_KERNEL_LZMA
select HAVE_KERNEL_LZO
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index ef54a59..dd50d72 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -17,7 +17,6 @@ config ARM64
select HAVE_GENERIC_DMA_COHERENT
select HAVE_GENERIC_HARDIRQS
select HAVE_HW_BREAKPOINT if PERF_EVENTS
-   select HAVE_IRQ_WORK
select HAVE_MEMBLOCK
select HAVE_PERF_EVENTS
select HAVE_SPARSE_IRQ
diff --git a/arch/blackfin/Kconfig b/arch/blackfin/Kconfig
index b6f3ad5..86f891f 100644
--- a/arch/blackfin/Kconfig
+++ b/arch/blackfin/Kconfig
@@ -24,7 +24,6 @@ config BLACKFIN
select HAVE_FUNCTION_TRACER
select HAVE_FUNCTION_TRACE_MCOUNT_TEST
select HAVE_IDE
-   select HAVE_IRQ_WORK
select HAVE_KERNEL_GZIP if RAMKERNEL
select HAVE_KERNEL_BZIP2 if RAMKERNEL
select HAVE_KERNEL_LZMA if RAMKERNEL
diff --git a/arch/frv/Kconfig b/arch/frv/Kconfig
index df2eb4b..c44fd6e 100644
--- a/arch/frv/Kconfig
+++ b/arch/frv/Kconfig
@@ -3,7 +3,6 @@ config FRV
default y
select HAVE_IDE
select HAVE_ARCH_TRACEHOOK
-   select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select HAVE_UID16
select HAVE_GENERIC_HARDIRQS
diff --git a/arch/hexagon/Kconfig b/arch/hexagon/Kconfig
index 0744f7d..40a3185 100644
--- a/arch/hexagon/Kconfig
+++ b/arch/hexagon/Kconfig
@@ -14,7 +14,6 @@ config HEXAGON
# select HAVE_CLK
# select IRQ_PER_CPU
# select GENERIC_PENDING_IRQ if SMP
-   select HAVE_IRQ_WORK
select GENERIC_ATOMIC64
select HAVE_PERF_EVENTS
select HAVE_GENERIC_HARDIRQS
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index dba9390..3d86d69 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -4,7 +4,6 @@ config MIPS
select HAVE_GENERIC_DMA_COHERENT
select HAVE_IDE
select HAVE_OPROFILE
-   select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select PERF_USE_VMALLOC
select HAVE_ARCH_KGDB
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 11def45..8f0df47 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -9,7 +9,6 @@ config PARISC
select RTC_DRV_GENERIC
select INIT_ALL_POSSIBLE
select BUG
-   select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select GENERIC_ATOMIC64 if !64BIT
select HAVE_GENERIC_HARDIRQS
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index a902a5c..a90f0c9 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -118,7 +118,6 @@ config PPC
select HAVE_SYSCALL_WRAPPERS if PPC64
select GENERIC_ATOMIC64 if PPC32
select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE
-   select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select HAVE_REGS_AND_STACK_ACCESS_API
	select HAVE_HW_BREAKPOINT if PERF_EVENTS && PPC_BOOK3S_64
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 5dba755..0816ff0 100644

[PATCH 7/7] printk: Wake up klogd using irq_work

2012-11-08 Thread Frederic Weisbecker
klogd is woken up asynchronously from the tick in order
to do it safely.

However if printk is called when the tick is stopped, the reader
won't be woken up until the next interrupt, which might not fire
for a while. As a result, the user may miss some messages.

To fix this, let's implement the printk tick using a lazy irq work.
This subsystem takes care of the timer tick state and can
fix up accordingly.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/printk.h   |3 ---
 init/Kconfig |1 +
 kernel/printk.c  |   36 
 kernel/time/tick-sched.c |2 +-
 kernel/timer.c   |1 -
 5 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/include/linux/printk.h b/include/linux/printk.h
index 9afc01e..86c4b62 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -98,9 +98,6 @@ int no_printk(const char *fmt, ...)
 extern asmlinkage __printf(1, 2)
 void early_printk(const char *fmt, ...);
 
-extern int printk_needs_cpu(int cpu);
-extern void printk_tick(void);
-
 #ifdef CONFIG_PRINTK
 asmlinkage __printf(5, 0)
 int vprintk_emit(int facility, int level,
diff --git a/init/Kconfig b/init/Kconfig
index cdc152c..c575566 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1196,6 +1196,7 @@ config HOTPLUG
 config PRINTK
default y
 	bool "Enable support for printk" if EXPERT
+   select IRQ_WORK
help
  This option enables normal printk support. Removing it
  eliminates most of the message strings from the kernel image
diff --git a/kernel/printk.c b/kernel/printk.c
index 2d607f4..c9104fe 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -42,6 +42,7 @@
 #include <linux/notifier.h>
 #include <linux/rculist.h>
 #include <linux/poll.h>
+#include <linux/irq_work.h>
 
 #include <asm/uaccess.h>
 
@@ -1955,30 +1956,32 @@ int is_console_locked(void)
 static DEFINE_PER_CPU(int, printk_pending);
 static DEFINE_PER_CPU(char [PRINTK_BUF_SIZE], printk_sched_buf);
 
-void printk_tick(void)
+static void wake_up_klogd_work_func(struct irq_work *irq_work)
 {
-   if (__this_cpu_read(printk_pending)) {
-   int pending = __this_cpu_xchg(printk_pending, 0);
-		if (pending & PRINTK_PENDING_SCHED) {
-			char *buf = __get_cpu_var(printk_sched_buf);
-			printk(KERN_WARNING "[sched_delayed] %s", buf);
-		}
-		if (pending & PRINTK_PENDING_WAKEUP)
-			wake_up_interruptible(&log_wait);
+   int pending = __this_cpu_xchg(printk_pending, 0);
+
+	if (pending & PRINTK_PENDING_SCHED) {
+		char *buf = __get_cpu_var(printk_sched_buf);
+		printk(KERN_WARNING "[sched_delayed] %s", buf);
}
-}
 
-int printk_needs_cpu(int cpu)
-{
-   if (cpu_is_offline(cpu))
-   printk_tick();
-   return __this_cpu_read(printk_pending);
+	if (pending & PRINTK_PENDING_WAKEUP)
+		wake_up_interruptible(&log_wait);
 }
 
+static DEFINE_PER_CPU(struct irq_work, wake_up_klogd_work) = {
+   .func = wake_up_klogd_work_func,
+   .flags = IRQ_WORK_LAZY,
+};
+
 void wake_up_klogd(void)
 {
-	if (waitqueue_active(&log_wait))
+	preempt_disable();
+	if (waitqueue_active(&log_wait)) {
 		this_cpu_or(printk_pending, PRINTK_PENDING_WAKEUP);
+		irq_work_queue(&__get_cpu_var(wake_up_klogd_work));
+   }
+   preempt_enable();
 }
 
 static void console_cont_flush(char *text, size_t size)
@@ -2458,6 +2461,7 @@ int printk_sched(const char *fmt, ...)
va_end(args);
 
__this_cpu_or(printk_pending, PRINTK_PENDING_SCHED);
+	irq_work_queue(&__get_cpu_var(wake_up_klogd_work));
local_irq_restore(flags);
 
return r;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f249e8c..822d757 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -289,7 +289,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
time_delta = timekeeping_max_deferment();
} while (read_seqretry(xtime_lock, seq));
 
-	if (rcu_needs_cpu(cpu, &rcu_delta_jiffies) || printk_needs_cpu(cpu) ||
+	if (rcu_needs_cpu(cpu, &rcu_delta_jiffies) ||
arch_needs_cpu(cpu) || irq_work_needs_cpu()) {
next_jiffies = last_jiffies + 1;
delta_jiffies = 1;
diff --git a/kernel/timer.c b/kernel/timer.c
index 367d008..ff3b516 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1351,7 +1351,6 @@ void update_process_times(int user_tick)
account_process_tick(p, user_tick);
run_local_timers();
rcu_check_callbacks(cpu, user_tick);
-   printk_tick();
 #ifdef

[PATCH 5/7] irq_work: Don't stop the tick with pending works

2012-11-08 Thread Frederic Weisbecker
Don't stop the tick if we have pending irq works on the
queue, otherwise if the arch can't raise self-IPIs, we may not
find an opportunity to execute the pending works for a while.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/irq_work.h |6 ++
 kernel/irq_work.c|   11 +++
 kernel/time/tick-sched.c |3 ++-
 3 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index 6a9e8f5..a69704f 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -20,4 +20,10 @@ bool irq_work_queue(struct irq_work *work);
 void irq_work_run(void);
 void irq_work_sync(struct irq_work *work);
 
+#ifdef CONFIG_IRQ_WORK
+bool irq_work_needs_cpu(void);
+#else
+static inline bool irq_work_needs_cpu(void) { return false; }
+#endif
+
 #endif /* _LINUX_IRQ_WORK_H */
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 64eddd5..b3c113a 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -99,6 +99,17 @@ bool irq_work_queue(struct irq_work *work)
 }
 EXPORT_SYMBOL_GPL(irq_work_queue);
 
+bool irq_work_needs_cpu(void)
+{
+   struct llist_head *this_list;
+
+	this_list = &__get_cpu_var(irq_work_list);
+   if (llist_empty(this_list))
+   return false;
+
+   return true;
+}
+
 /*
  * Run the irq_work entries on this cpu. Requires to be ran from hardirq
  * context with local IRQs disabled.
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9e945aa..f249e8c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -20,6 +20,7 @@
 #include <linux/profile.h>
 #include <linux/sched.h>
 #include <linux/module.h>
+#include <linux/irq_work.h>
 
 #include <asm/irq_regs.h>
 
@@ -289,7 +290,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
} while (read_seqretry(xtime_lock, seq));
 
 	if (rcu_needs_cpu(cpu, &rcu_delta_jiffies) || printk_needs_cpu(cpu) ||
-   arch_needs_cpu(cpu)) {
+   arch_needs_cpu(cpu) || irq_work_needs_cpu()) {
next_jiffies = last_jiffies + 1;
delta_jiffies = 1;
} else {
-- 
1.7.5.4



[PATCH 4/7] nohz: Add API to check tick state

2012-11-08 Thread Frederic Weisbecker
We need some quick way to check if the CPU has stopped
its tick. This will be useful to implement the printk tick
using the irq work subsystem.
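
For illustration, this is roughly how the enqueue path consumes the new
API (matching patch 6/7 above):

	/* lazy works wait for the tick; if the tick is stopped,
	 * fall back to a self-IPI so the work still runs soon */
	if (!(work->flags & IRQ_WORK_LAZY) || tick_nohz_tick_stopped())
		arch_irq_work_raise();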

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/tick.h |   17 -
 kernel/time/tick-sched.c |2 +-
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index f37fceb..2307dd3 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -8,6 +8,8 @@
 
 #include <linux/clockchips.h>
 #include <linux/irqflags.h>
+#include <linux/percpu.h>
+#include <linux/hrtimer.h>
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 
@@ -122,13 +124,26 @@ static inline int tick_oneshot_mode_active(void) { return 0; }
 #endif /* !CONFIG_GENERIC_CLOCKEVENTS */
 
 # ifdef CONFIG_NO_HZ
+DECLARE_PER_CPU(struct tick_sched, tick_cpu_sched);
+
+static inline int tick_nohz_tick_stopped(void)
+{
+   return __this_cpu_read(tick_cpu_sched.tick_stopped);
+}
+
 extern void tick_nohz_idle_enter(void);
 extern void tick_nohz_idle_exit(void);
 extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
-# else
+
+# else /* !CONFIG_NO_HZ */
+static inline int tick_nohz_tick_stopped(void)
+{
+   return 0;
+}
+
 static inline void tick_nohz_idle_enter(void) { }
 static inline void tick_nohz_idle_exit(void) { }
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a402608..9e945aa 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -28,7 +28,7 @@
 /*
  * Per cpu nohz control structure
  */
-static DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched);
+DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched);
 
 /*
  * The time, when the last jiffy update happened. Protected by xtime_lock.
-- 
1.7.5.4



[PATCH 2/7] irq_work: Fix racy check on work pending flag

2012-11-08 Thread Frederic Weisbecker
Work claiming wants to be SMP-safe.

And by the time we try to claim a work, if it is already executing
concurrently on another CPU, we want to succeed the claiming and queue
the work again because the other CPU may have missed the data we wanted
to handle in our work if it's about to complete there.

This scenario is summarized below:

CPU 1   CPU 2
-   -
(flags = 0)
cmpxchg(flags, 0, IRQ_WORK_FLAGS)
(flags = 3)
[...]
xchg(flags, IRQ_WORK_BUSY)
(flags = 2)
func()
                                    if (flags & IRQ_WORK_PENDING)
                                            (not true)
                                    cmpxchg(flags, flags, IRQ_WORK_FLAGS)
(flags = 3)
[...]
cmpxchg(flags, IRQ_WORK_BUSY, 0);
(fail, pending on CPU 2)

This state machine is synchronized using [cmp]xchg() on the flags.
As such, the early IRQ_WORK_PENDING check in CPU 2 above is racy.
By the time we check it, we may be dealing with a stale value because
we aren't using an atomic accessor. As a result, CPU 2 may see
that the work is still pending on another CPU while it may actually
be completing the work function execution already, leaving
our data unprocessed.

To fix this, we start by speculating about the value we wish to find
in work->flags, but we only draw any conclusion after the value
returned by the cmpxchg() call that either claims the work or lets
the current owner handle the pending work for us.
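
An annotated restatement of the claiming loop from the hunk below (same
logic, comments added for readability):

	flags = work->flags & ~IRQ_WORK_PENDING;	/* optimistic guess */
	for (;;) {
		nflags = flags | IRQ_WORK_FLAGS;
		oflags = cmpxchg(&work->flags, flags, nflags);
		if (oflags == flags)
			break;			/* claimed it */
		if (oflags & IRQ_WORK_PENDING)
			return false;		/* trusted PENDING: let the owner run it */
		flags = oflags;			/* retry from what cmpxchg really saw */
		cpu_relax();
	}

The only PENDING test that matters is on oflags, the value returned by
the atomic operation, never on a plain read of work->flags.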

Changelog-heavily-inspired-by: Steven Rostedt rost...@goodmis.org
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Anish Kumar anish198519851...@gmail.com
---
 kernel/irq_work.c |   16 +++-
 1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 57be1a6..64eddd5 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -34,15 +34,21 @@ static DEFINE_PER_CPU(struct llist_head, irq_work_list);
  */
 static bool irq_work_claim(struct irq_work *work)
 {
-   unsigned long flags, nflags;
+   unsigned long flags, oflags, nflags;
 
+   /*
+* Start with our best wish as a premise but only trust any
+* flag value after cmpxchg() result.
+*/
+	flags = work->flags & ~IRQ_WORK_PENDING;
for (;;) {
-	flags = work->flags;
-	if (flags & IRQ_WORK_PENDING)
-   return false;
nflags = flags | IRQ_WORK_FLAGS;
-	if (cmpxchg(&work->flags, flags, nflags) == flags)
+	oflags = cmpxchg(&work->flags, flags, nflags);
+   if (oflags == flags)
break;
+	if (oflags & IRQ_WORK_PENDING)
+   return false;
+   flags = oflags;
cpu_relax();
}
 
-- 
1.7.5.4



Re: [RFC PATCH 0/5] printk: Make it usable on nohz CPUs

2012-10-19 Thread Frederic Weisbecker
2012/10/12 Frederic Weisbecker fweis...@gmail.com:
 Hi,

 So here is a proposition on what we can do to make printk
 work correctly on a tickless CPU.

 Although it's targeted to be part of the adaptive tickless
 implementation, it's pretty standalone and generic and also
 works for printk() calls in idle.

 It is based on latest linus tree.

 Waiting for your comments.

I've been thinking about this more and I now think we may want to
actually keep the irq_work_run() call in the sched tick but then
always implement the printk_tick() through an irq_work. This means:

- we can remove the direct call to printk_tick() in the sched tick. So
we have a single consolidated way to call that printk tick.
- we only send an IPI if the tick is stopped. Otherwise we wait for
the next tick to do the job (so we avoid IPI storms in case of
frequent printk calls)
- we remove printk_needs_cpu() and introduce irq_work_needs_cpu()
instead, in case we have pending irq works (not triggered with
self-IPIs) before we stop the tick.

I'm going to do this in the next round of this patchset.
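
A sketch of where this ends up (the v4 series earlier in this archive
implements exactly this): printk's wakeup becomes a lazy per cpu irq
work, so the sched tick runs it for free and an IPI is only raised when
the tick is stopped:

/* wake_up_klogd_work_func is the printk_pending handler from v4 */
static DEFINE_PER_CPU(struct irq_work, wake_up_klogd_work) = {
	.func	= wake_up_klogd_work_func,
	.flags	= IRQ_WORK_LAZY,
};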


 Thanks.

 PS: only built-tested for now.

 Frederic Weisbecker (5):
   irq_work: Move irq_work_raise() declaration/default definition to
 arch headers
   irq_work: Only run irq_work from tick if arch needs it
   x86: Implement arch_irq_work_use_tick
   nohz: Add API to check tick state
   printk: Wake up klogd with irq_work on nohz CPU

  arch/alpha/include/asm/irq_work.h|9 +++
  arch/alpha/kernel/time.c |2 +-
  arch/arm/include/asm/irq_work.h  |1 +
  arch/arm64/include/asm/irq_work.h|1 +
  arch/blackfin/include/asm/irq_work.h |1 +
  arch/frv/include/asm/irq_work.h  |1 +
  arch/hexagon/include/asm/irq_work.h  |1 +
  arch/mips/include/asm/irq_work.h |1 +
  arch/parisc/include/asm/irq_work.h   |1 +
  arch/powerpc/include/asm/irq_work.h  |8 ++
  arch/powerpc/kernel/time.c   |2 +-
  arch/s390/include/asm/irq_work.h |1 +
  arch/sh/include/asm/irq_work.h   |1 +
  arch/sparc/include/asm/irq_work.h|8 ++
  arch/sparc/kernel/pcr.c  |2 +-
  arch/x86/include/asm/irq_work.h  |   15 
  arch/x86/kernel/irq_work.c   |6 ++--
  include/asm-generic/irq_work.h   |   22 +
  include/linux/irq_work.h |1 +
  include/linux/tick.h |   16 -
  kernel/irq_work.c|7 -
  kernel/printk.c  |   42 ++
  kernel/time/tick-sched.c |2 +-
  kernel/timer.c   |2 +-
  24 files changed, 137 insertions(+), 16 deletions(-)
  create mode 100644 arch/alpha/include/asm/irq_work.h
  create mode 100644 arch/arm/include/asm/irq_work.h
  create mode 100644 arch/arm64/include/asm/irq_work.h
  create mode 100644 arch/blackfin/include/asm/irq_work.h
  create mode 100644 arch/frv/include/asm/irq_work.h
  create mode 100644 arch/hexagon/include/asm/irq_work.h
  create mode 100644 arch/mips/include/asm/irq_work.h
  create mode 100644 arch/parisc/include/asm/irq_work.h
  create mode 100644 arch/powerpc/include/asm/irq_work.h
  create mode 100644 arch/s390/include/asm/irq_work.h
  create mode 100644 arch/sh/include/asm/irq_work.h
  create mode 100644 arch/sparc/include/asm/irq_work.h
  create mode 100644 arch/x86/include/asm/irq_work.h
  create mode 100644 include/asm-generic/irq_work.h

 --
 1.7.5.4



Re: [PATCH cgroup/for-3.7-fixes 1/2] Revert cgroup: Remove task_lock() from cgroup_post_fork()

2012-10-19 Thread Frederic Weisbecker
2012/10/19 Tejun Heo t...@kernel.org:
 On Fri, Oct 19, 2012 at 09:35:26AM -0400, Frederic Weisbecker wrote:
 2012/10/18 Tejun Heo t...@kernel.org:
  From d935a5d6832a264ce52f4257e176f4f96cbaf048 Mon Sep 17 00:00:00 2001
  From: Tejun Heo t...@kernel.org
  Date: Thu, 18 Oct 2012 17:40:30 -0700
 
  This reverts commit 7e3aa30ac8c904a706518b725c451bb486daaae9.
 
  The commit incorrectly assumed that fork path always performed
  threadgroup_change_begin/end() and depended on that for
  synchronization against task exit and cgroup migration paths instead
  of explicitly grabbing task_lock().
 
  threadgroup_change is not locked when forking a new process (as
  opposed to a new thread in the same process) and even if it were it
  wouldn't be effective as different processes use different threadgroup
  locks.
 
  Revert the incorrect optimization.

 Ok, but there is still no good reason to task_lock() there. The
 comment is indeed wrong though; how about fixing it instead? I can
 send you a patch for that.

 For -stable, I think it's better to revert.  If you want to remove
 task_lock, let's do it for 3.8.

I don't think that a wrong comment justifies a patch to stable.


[RFC PATCH 3/8] x86: Implement arch_irq_work_has_ipi()

2012-10-20 Thread Frederic Weisbecker
Most of the time, x86 can trigger self-IPIs. Tell the
irq work subsystem about it.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 arch/x86/include/asm/irq_work.h |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/irq_work.h b/arch/x86/include/asm/irq_work.h
index dad8266..c7489cd 100644
--- a/arch/x86/include/asm/irq_work.h
+++ b/arch/x86/include/asm/irq_work.h
@@ -2,8 +2,12 @@
 #define _ASM_X86_IRQ_WORK_H
 
 #ifdef CONFIG_X86_LOCAL_APIC
+#include <asm/cpufeature.h>
+
 extern void __arch_irq_work_raise(void);
 #define arch_irq_work_raise __arch_irq_work_raise
+
+#define arch_irq_work_has_ipi() (cpu_has_apic)
 #endif
 
 #include <asm-generic/irq_work.h>
-- 
1.7.5.4



[RFC PATCH 4/8] nohz: Add API to check tick state

2012-10-20 Thread Frederic Weisbecker
We need some quick way to check if the CPU has stopped
its tick. This will be useful to implement the printk tick
using the irq work subsystem.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/tick.h |   17 -
 kernel/time/tick-sched.c |2 +-
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index f37fceb..2307dd3 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -8,6 +8,8 @@
 
 #include <linux/clockchips.h>
 #include <linux/irqflags.h>
+#include <linux/percpu.h>
+#include <linux/hrtimer.h>
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 
@@ -122,13 +124,26 @@ static inline int tick_oneshot_mode_active(void) { return 0; }
 #endif /* !CONFIG_GENERIC_CLOCKEVENTS */
 
 # ifdef CONFIG_NO_HZ
+DECLARE_PER_CPU(struct tick_sched, tick_cpu_sched);
+
+static inline int tick_nohz_tick_stopped(void)
+{
+   return __this_cpu_read(tick_cpu_sched.tick_stopped);
+}
+
 extern void tick_nohz_idle_enter(void);
 extern void tick_nohz_idle_exit(void);
 extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
-# else
+
+# else /* !CONFIG_NO_HZ */
+static inline int tick_nohz_tick_stopped(void)
+{
+   return 0;
+}
+
 static inline void tick_nohz_idle_enter(void) { }
 static inline void tick_nohz_idle_exit(void) { }
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f423bdd..ccc1971 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -28,7 +28,7 @@
 /*
  * Per cpu nohz control structure
  */
-static DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched);
+DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched);
 
 /*
  * The time, when the last jiffy update happened. Protected by xtime_lock.
-- 
1.7.5.4



[RFC PATCH 7/8] irq_work: Remove CONFIG_HAVE_IRQ_WORK

2012-10-20 Thread Frederic Weisbecker
irq work is supposed to work everywhere because of the irq work
hook in the generic timer tick function.

I might be missing something though...

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 arch/alpha/Kconfig  |1 -
 arch/arm/Kconfig|1 -
 arch/arm64/Kconfig  |1 -
 arch/blackfin/Kconfig   |1 -
 arch/frv/Kconfig|1 -
 arch/hexagon/Kconfig|1 -
 arch/mips/Kconfig   |1 -
 arch/parisc/Kconfig |1 -
 arch/powerpc/Kconfig|1 -
 arch/s390/Kconfig   |1 -
 arch/sh/Kconfig |1 -
 arch/sparc/Kconfig  |1 -
 arch/x86/Kconfig|1 -
 drivers/staging/iio/trigger/Kconfig |1 -
 init/Kconfig|4 
 15 files changed, 0 insertions(+), 18 deletions(-)

diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 7da9124..ea86275 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -5,7 +5,6 @@ config ALPHA
select HAVE_IDE
select HAVE_OPROFILE
select HAVE_SYSCALL_WRAPPERS
-   select HAVE_IRQ_WORK
select HAVE_PCSPKR_PLATFORM
select HAVE_PERF_EVENTS
select HAVE_DMA_ATTRS
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 767aae8..44bdbfe 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -30,7 +30,6 @@ config ARM
select HAVE_KERNEL_LZO
select HAVE_KERNEL_LZMA
select HAVE_KERNEL_XZ
-   select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select PERF_USE_VMALLOC
select HAVE_REGS_AND_STACK_ACCESS_API
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7ff68c9..efa0627 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -17,7 +17,6 @@ config ARM64
select HAVE_GENERIC_DMA_COHERENT
select HAVE_GENERIC_HARDIRQS
select HAVE_HW_BREAKPOINT if PERF_EVENTS
-   select HAVE_IRQ_WORK
select HAVE_MEMBLOCK
select HAVE_PERF_EVENTS
select HAVE_SPARSE_IRQ
diff --git a/arch/blackfin/Kconfig b/arch/blackfin/Kconfig
index ccd9193..9c588f7 100644
--- a/arch/blackfin/Kconfig
+++ b/arch/blackfin/Kconfig
@@ -24,7 +24,6 @@ config BLACKFIN
select HAVE_FUNCTION_TRACER
select HAVE_FUNCTION_TRACE_MCOUNT_TEST
select HAVE_IDE
-   select HAVE_IRQ_WORK
select HAVE_KERNEL_GZIP if RAMKERNEL
select HAVE_KERNEL_BZIP2 if RAMKERNEL
select HAVE_KERNEL_LZMA if RAMKERNEL
diff --git a/arch/frv/Kconfig b/arch/frv/Kconfig
index 9d26264..17df48f 100644
--- a/arch/frv/Kconfig
+++ b/arch/frv/Kconfig
@@ -3,7 +3,6 @@ config FRV
default y
select HAVE_IDE
select HAVE_ARCH_TRACEHOOK
-   select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select HAVE_UID16
select HAVE_GENERIC_HARDIRQS
diff --git a/arch/hexagon/Kconfig b/arch/hexagon/Kconfig
index b2fdfb7..8a902a7 100644
--- a/arch/hexagon/Kconfig
+++ b/arch/hexagon/Kconfig
@@ -14,7 +14,6 @@ config HEXAGON
# select HAVE_CLK
# select IRQ_PER_CPU
# select GENERIC_PENDING_IRQ if SMP
-   select HAVE_IRQ_WORK
select GENERIC_ATOMIC64
select HAVE_PERF_EVENTS
select HAVE_GENERIC_HARDIRQS
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 35453ea..045bf4d 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -4,7 +4,6 @@ config MIPS
select HAVE_GENERIC_DMA_COHERENT
select HAVE_IDE
select HAVE_OPROFILE
-   select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select PERF_USE_VMALLOC
select HAVE_ARCH_KGDB
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index b87438b..abbdb7a 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -9,7 +9,6 @@ config PARISC
select RTC_DRV_GENERIC
select INIT_ALL_POSSIBLE
select BUG
-   select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select GENERIC_ATOMIC64 if !64BIT
select HAVE_GENERIC_HARDIRQS
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index df7edb8..c071bcb 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -118,7 +118,6 @@ config PPC
select HAVE_SYSCALL_WRAPPERS if PPC64
select GENERIC_ATOMIC64 if PPC32
select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE
-   select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select HAVE_REGS_AND_STACK_ACCESS_API
	select HAVE_HW_BREAKPOINT if PERF_EVENTS && PPC_BOOK3S_64
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 99d2d79..d3e251b 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -78,7 +78,6 @@ config S390
select

[RFC PATCH 8/8] printk: Wake up klogd using irq_work

2012-10-20 Thread Frederic Weisbecker
klogd is woken up asynchronously from the tick in order
to do it safely.

However if printk is called when the tick is stopped, the reader
won't be woken up until the next interrupt, which might not fire
for a while. As a result, the user may miss some messages.

To fix this, let's implement the printk tick using irq work.
This subsystem takes care of the timer tick state and can
fix up accordingly.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/printk.h   |3 ---
 init/Kconfig |1 +
 kernel/printk.c  |   14 +++---
 kernel/time/tick-sched.c |2 +-
 kernel/timer.c   |1 -
 5 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/include/linux/printk.h b/include/linux/printk.h
index 9afc01e..86c4b62 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -98,9 +98,6 @@ int no_printk(const char *fmt, ...)
 extern asmlinkage __printf(1, 2)
 void early_printk(const char *fmt, ...);
 
-extern int printk_needs_cpu(int cpu);
-extern void printk_tick(void);
-
 #ifdef CONFIG_PRINTK
 asmlinkage __printf(5, 0)
 int vprintk_emit(int facility, int level,
diff --git a/init/Kconfig b/init/Kconfig
index 8ba9fc3..9c42e6a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1184,6 +1184,7 @@ config HOTPLUG
 config PRINTK
default y
bool "Enable support for printk" if EXPERT
+   select IRQ_WORK
help
  This option enables normal printk support. Removing it
  eliminates most of the message strings from the kernel image
diff --git a/kernel/printk.c b/kernel/printk.c
index 66a2ea3..721f760 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -42,6 +42,7 @@
 #include <linux/notifier.h>
 #include <linux/rculist.h>
 #include <linux/poll.h>
+#include <linux/irq_work.h>
 
 #include <asm/uaccess.h>
 
@@ -1956,7 +1957,7 @@ int is_console_locked(void)
 static DEFINE_PER_CPU(int, printk_pending);
 static DEFINE_PER_CPU(char [PRINTK_BUF_SIZE], printk_sched_buf);
 
-void printk_tick(void)
+static void wake_up_klogd_work_func(struct irq_work *irq_work)
 {
if (__this_cpu_read(printk_pending)) {
int pending = __this_cpu_xchg(printk_pending, 0);
@@ -1969,17 +1970,16 @@ void printk_tick(void)
}
 }
 
-int printk_needs_cpu(int cpu)
-{
-   if (cpu_is_offline(cpu))
-   printk_tick();
-   return __this_cpu_read(printk_pending);
-}
+static struct irq_work wake_up_klogd_work = {
+   .func = wake_up_klogd_work_func
+};
 
 void wake_up_klogd(void)
 {
if (waitqueue_active(&log_wait))
this_cpu_or(printk_pending, PRINTK_PENDING_WAKEUP);
+
+   irq_work_queue(&wake_up_klogd_work, false);
 }
 
 static void console_cont_flush(char *text, size_t size)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 5f87bb5..04dd027 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -288,7 +288,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
time_delta = timekeeping_max_deferment();
} while (read_seqretry(&xtime_lock, seq));
 
-   if (rcu_needs_cpu(cpu, &rcu_delta_jiffies) || printk_needs_cpu(cpu) ||
+   if (rcu_needs_cpu(cpu, &rcu_delta_jiffies) ||
    arch_needs_cpu(cpu) || irq_work_needs_cpu()) {
next_jiffies = last_jiffies + 1;
delta_jiffies = 1;
diff --git a/kernel/timer.c b/kernel/timer.c
index d5de1b2..5d6d0f1 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1349,7 +1349,6 @@ void update_process_times(int user_tick)
account_process_tick(p, user_tick);
run_local_timers();
rcu_check_callbacks(cpu, user_tick);
-   printk_tick();
 #ifdef CONFIG_IRQ_WORK
if (in_irq())
irq_work_run();
-- 
1.7.5.4



[RFC PATCH 6/8] irq_work: Handle queuing without IPI support in dyntick idle mode

2012-10-20 Thread Frederic Weisbecker
If we enqueue a work while in dyntick idle mode and the arch doesn't
have self-IPI support, we may not find an opportunity to run the work
for a while.

In this case, exit the idle loop to re-evaluate irq_work_needs_cpu()
and restart the tick.
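For reference, this is the consumer side the trick relies on (a
sketch, using the tick-sched check added earlier in the series):

/*
 * Sketch: once the idle loop restarts, the nohz code re-evaluates
 * whether the tick may stay stopped; pending irq work keeps it alive.
 */
static bool example_can_stop_tick(void)
{
	return !irq_work_needs_cpu();
}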

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 kernel/irq_work.c |   11 +++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 19f537b..f3bdcf4 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -71,6 +71,17 @@ static void __irq_work_queue(struct irq_work *work, bool ipi)
 */
if (ipi || !arch_irq_work_has_ipi() || tick_nohz_tick_stopped())
arch_irq_work_raise();
+
+   /*
+    * If we rely on the timer tick or some obscure way to run the work
+    * while the CPU is in dyntick idle mode, we may not have an opportunity
+    * to do so for a while. Let's just exit the idle loop and hope we
+    * haven't yet reached the last need_resched() check before the CPU goes
+    * to low power mode.
+    */
+   if (!arch_irq_work_has_ipi() && tick_nohz_tick_stopped() &&
+       is_idle_task(current))
+           set_need_resched();
}
preempt_enable();
 }
-- 
1.7.5.4



[RFC PATCH 0/8] printk: Make it usable on nohz CPUs v2

2012-10-20 Thread Frederic Weisbecker
Hi,

So the design is not quite the same here. Instead of using the ad hoc
printk_tick() in periodic mode and irq_work on nohz mode, printk
is now always using irq_work.

In turn, the irq_work subsystem lets enqueuers choose between an IPI
and the lazy tick hook to execute works. This avoids an IPI storm when
lots of non-urgent works, like the klogd wakeup, are enqueued in a
short period of time, and it keeps the old printk_tick behaviour.

It also teaches irq_work to handle nohz mode.
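Concretely, enqueuers now look like this (a sketch of the new calling
convention):

#include <linux/irq_work.h>

static struct irq_work urgent_work, lazy_work;

static void example_enqueuers(void)
{
	/* Urgent: raise a self-IPI right away if the arch supports it */
	irq_work_queue(&urgent_work, true);

	/* Non-urgent (e.g. klogd wakeup): defer to the next tick; an IPI
	 * is still raised if the tick is stopped or the arch has no
	 * self-IPI support */
	irq_work_queue(&lazy_work, false);
}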

Warning: only compile tested in x86 for now.

Frederic Weisbecker (8):
  irq_work: Move irq_work_raise() declaration/default definition to
arch headers
  irq_work: Let the arch tell us about self-IPI support
  x86: Implement arch_irq_work_has_ipi()
  nohz: Add API to check tick state
  irq_work: Make self-IPIs optable
  irq_work: Handle queuing without IPI support in dyntick idle mode
  irq_work: Remove CONFIG_HAVE_IRQ_WORK
  printk: Wake up klogd using irq_work

 arch/alpha/Kconfig   |1 -
 arch/alpha/include/asm/irq_work.h|9 +
 arch/alpha/kernel/time.c |2 +-
 arch/arm/Kconfig |1 -
 arch/arm/include/asm/irq_work.h  |1 +
 arch/arm64/Kconfig   |1 -
 arch/arm64/include/asm/irq_work.h|1 +
 arch/blackfin/Kconfig|1 -
 arch/blackfin/include/asm/irq_work.h |1 +
 arch/frv/Kconfig |1 -
 arch/frv/include/asm/irq_work.h  |1 +
 arch/hexagon/Kconfig |1 -
 arch/hexagon/include/asm/irq_work.h  |1 +
 arch/mips/Kconfig|1 -
 arch/mips/include/asm/irq_work.h |1 +
 arch/parisc/Kconfig  |1 -
 arch/parisc/include/asm/irq_work.h   |1 +
 arch/powerpc/Kconfig |1 -
 arch/powerpc/include/asm/irq_work.h  |8 
 arch/powerpc/kernel/time.c   |2 +-
 arch/s390/Kconfig|1 -
 arch/s390/include/asm/irq_work.h |1 +
 arch/sh/Kconfig  |1 -
 arch/sh/include/asm/irq_work.h   |1 +
 arch/sparc/Kconfig   |1 -
 arch/sparc/include/asm/irq_work.h|8 
 arch/sparc/kernel/pcr.c  |2 +-
 arch/x86/Kconfig |1 -
 arch/x86/include/asm/irq_work.h  |   15 
 arch/x86/kernel/cpu/mcheck/mce.c |2 +-
 arch/x86/kernel/irq_work.c   |6 ++--
 arch/x86/kvm/pmu.c   |2 +-
 drivers/acpi/apei/ghes.c |2 +-
 drivers/staging/iio/trigger/Kconfig  |1 -
 drivers/staging/iio/trigger/iio-trig-sysfs.c |2 +-
 include/asm-generic/irq_work.h   |   23 
 include/linux/irq_work.h |9 -
 include/linux/printk.h   |3 --
 include/linux/tick.h |   17 -
 init/Kconfig |5 +--
 kernel/events/core.c |4 +-
 kernel/events/ring_buffer.c  |2 +-
 kernel/irq_work.c|   48 +++--
 kernel/printk.c  |   14 
 kernel/time/tick-sched.c |6 ++--
 kernel/timer.c   |1 -
 46 files changed, 156 insertions(+), 59 deletions(-)
 create mode 100644 arch/alpha/include/asm/irq_work.h
 create mode 100644 arch/arm/include/asm/irq_work.h
 create mode 100644 arch/arm64/include/asm/irq_work.h
 create mode 100644 arch/blackfin/include/asm/irq_work.h
 create mode 100644 arch/frv/include/asm/irq_work.h
 create mode 100644 arch/hexagon/include/asm/irq_work.h
 create mode 100644 arch/mips/include/asm/irq_work.h
 create mode 100644 arch/parisc/include/asm/irq_work.h
 create mode 100644 arch/powerpc/include/asm/irq_work.h
 create mode 100644 arch/s390/include/asm/irq_work.h
 create mode 100644 arch/sh/include/asm/irq_work.h
 create mode 100644 arch/sparc/include/asm/irq_work.h
 create mode 100644 arch/x86/include/asm/irq_work.h
 create mode 100644 include/asm-generic/irq_work.h

-- 
1.7.5.4



[RFC PATCH 5/8] irq_work: Make self-IPIs optable

2012-10-20 Thread Frederic Weisbecker
While queuing an irq work, let the caller choose between
triggering a self-IPI right away, provided the arch is able
to do so, or waiting for the next timer interrupt to run the work.

Some non-urgent enqueuers like printk may prefer not to raise
an IPI storm in case of frequent calls over a short period of
time.
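The raise decision on the queueing path then boils down to this
(a sketch; the kernel/irq_work.c change implements the same logic):

static void example_raise_policy(bool ipi)
{
	/*
	 * Raise now if the caller asked for an IPI, if the arch has no
	 * self-IPI anyway (its raise hooks on the tick), or if the tick
	 * is stopped and thus can't run the work for us.
	 */
	if (ipi || !arch_irq_work_has_ipi() || tick_nohz_tick_stopped())
		arch_irq_work_raise();
}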

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 arch/x86/kernel/cpu/mcheck/mce.c |2 +-
 arch/x86/kvm/pmu.c   |2 +-
 drivers/acpi/apei/ghes.c |2 +-
 drivers/staging/iio/trigger/iio-trig-sysfs.c |2 +-
 include/linux/irq_work.h |8 +-
 kernel/events/core.c |4 +-
 kernel/events/ring_buffer.c  |2 +-
 kernel/irq_work.c|   32 +-
 kernel/time/tick-sched.c |2 +-
 9 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 29e87d3..3020e95 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -549,7 +549,7 @@ static void mce_report_event(struct pt_regs *regs)
return;
}
 
-   irq_work_queue(&__get_cpu_var(mce_irq_work));
+   irq_work_queue(&__get_cpu_var(mce_irq_work), true);
 }
 
 /*
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index cfc258a..0dfc716 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -128,7 +128,7 @@ static void kvm_perf_overflow_intr(struct perf_event *perf_event,
 * NMI context. Do it from irq work instead.
 */
if (!kvm_is_in_guest())
-   irq_work_queue(&pmc->vcpu->arch.pmu.irq_work);
+   irq_work_queue(&pmc->vcpu->arch.pmu.irq_work, true);
else
kvm_make_request(KVM_REQ_PMI, pmc-vcpu);
}
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 1599566..44be554 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -874,7 +874,7 @@ next:
ghes_clear_estatus(ghes);
}
 #ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
-   irq_work_queue(&ghes_proc_irq_work);
+   irq_work_queue(&ghes_proc_irq_work, true);
 #endif
 
 out:
diff --git a/drivers/staging/iio/trigger/iio-trig-sysfs.c b/drivers/staging/iio/trigger/iio-trig-sysfs.c
index 3bac972..7d6f9a9 100644
--- a/drivers/staging/iio/trigger/iio-trig-sysfs.c
+++ b/drivers/staging/iio/trigger/iio-trig-sysfs.c
@@ -105,7 +105,7 @@ static ssize_t iio_sysfs_trigger_poll(struct device *dev,
struct iio_trigger *trig = to_iio_trigger(dev);
struct iio_sysfs_trig *sysfs_trig = trig->private_data;
 
-   irq_work_queue(&sysfs_trig->work);
+   irq_work_queue(&sysfs_trig->work, true);
 
return count;
 }
diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index b39ea0b..71a33b7 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -17,8 +17,14 @@ void init_irq_work(struct irq_work *work, void (*func)(struct irq_work *))
work->func = func;
 }
 
-bool irq_work_queue(struct irq_work *work);
+bool irq_work_queue(struct irq_work *work, bool ipi);
 void irq_work_run(void);
 void irq_work_sync(struct irq_work *work);
 
+#ifdef CONFIG_IRQ_WORK
+bool irq_work_needs_cpu(void);
+#else
+static inline bool irq_work_needs_cpu(void) { return false; }
+#endif
+
 #endif /* _LINUX_IRQ_WORK_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index cda3ebd..e7cbbcc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4900,7 +4900,7 @@ static int __perf_event_overflow(struct perf_event *event,
ret = 1;
event->pending_kill = POLL_HUP;
event->pending_disable = 1;
-   irq_work_queue(&event->pending);
+   irq_work_queue(&event->pending, true);
}
 
if (event->overflow_handler)
@@ -4910,7 +4910,7 @@ static int __perf_event_overflow(struct perf_event *event,
 
if (event->fasync && event->pending_kill) {
event->pending_wakeup = 1;
-   irq_work_queue(&event->pending);
+   irq_work_queue(&event->pending, true);
}
 
return ret;
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 23cb34f..620df7a 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -39,7 +39,7 @@ static void perf_output_wakeup(struct perf_output_handle *handle)
atomic_set(&handle->rb->poll, POLL_IN);
 
handle->event->pending_wakeup = 1;
-   irq_work_queue(&handle->event->pending);
+   irq_work_queue(&handle->event->pending, true);
 }
 
 /*
diff

[RFC PATCH 2/8] irq_work: Let the arch tell us about self-IPI support

2012-10-20 Thread Frederic Weisbecker
This prepares us to make printk work on nohz CPUs
using irq work.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 arch/alpha/include/asm/irq_work.h   |5 -
 arch/alpha/kernel/time.c|2 +-
 arch/powerpc/include/asm/irq_work.h |4 +++-
 arch/powerpc/kernel/time.c  |2 +-
 arch/sparc/include/asm/irq_work.h   |4 +++-
 arch/sparc/kernel/pcr.c |2 +-
 arch/x86/include/asm/irq_work.h |9 +
 arch/x86/kernel/irq_work.c  |2 +-
 include/asm-generic/irq_work.h  |   14 ++
 9 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/arch/alpha/include/asm/irq_work.h b/arch/alpha/include/asm/irq_work.h
index 814ff3d..3d32132 100644
--- a/arch/alpha/include/asm/irq_work.h
+++ b/arch/alpha/include/asm/irq_work.h
@@ -1,6 +1,9 @@
 #ifndef _ALPHA_IRQ_WORK_H
 #define _ALPHA_IRQ_WORK_H
 
-extern void arch_irq_work_raise(void);
+extern void __arch_irq_work_raise(void);
+#define arch_irq_work_raise __arch_irq_work_raise
+
+#include <asm-generic/irq_work.h>
 
 #endif
diff --git a/arch/alpha/kernel/time.c b/arch/alpha/kernel/time.c
index e336694..91c5eec 100644
--- a/arch/alpha/kernel/time.c
+++ b/arch/alpha/kernel/time.c
@@ -90,7 +90,7 @@ DEFINE_PER_CPU(u8, irq_work_pending);
 #define test_irq_work_pending()  __get_cpu_var(irq_work_pending)
 #define clear_irq_work_pending() __get_cpu_var(irq_work_pending) = 0
 
-void arch_irq_work_raise(void)
+void __arch_irq_work_raise(void)
 {
set_irq_work_pending_flag();
 }
diff --git a/arch/powerpc/include/asm/irq_work.h b/arch/powerpc/include/asm/irq_work.h
index 8b9927f..8aa36aa 100644
--- a/arch/powerpc/include/asm/irq_work.h
+++ b/arch/powerpc/include/asm/irq_work.h
@@ -1,6 +1,8 @@
 #ifndef _ASM_POWERPC_IRQ_WORK_H
 #define _ASM_POWERPC_IRQ_WORK_H
 
-extern void arch_irq_work_raise(void);
+extern void __arch_irq_work_raise(void);
+#define arch_irq_work_raise __arch_irq_work_raise
 
+#include <asm-generic/irq_work.h>
 #endif
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index c9986fd..31565ac 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -466,7 +466,7 @@ DEFINE_PER_CPU(u8, irq_work_pending);
 
 #endif /* 32 vs 64 bit */
 
-void arch_irq_work_raise(void)
+void __arch_irq_work_raise(void)
 {
preempt_disable();
set_irq_work_pending_flag();
diff --git a/arch/sparc/include/asm/irq_work.h b/arch/sparc/include/asm/irq_work.h
index 1d062a6..383772d 100644
--- a/arch/sparc/include/asm/irq_work.h
+++ b/arch/sparc/include/asm/irq_work.h
@@ -1,6 +1,8 @@
 #ifndef ___ASM_SPARC_IRQ_H
 #define ___ASM_SPARC_IRQ_H
 
-extern void arch_irq_work_raise(void);
+extern void __arch_irq_work_raise(void);
+#define arch_irq_work_raise __arch_irq_work_raise
 
+#include <asm-generic/irq_work.h>
 #endif
diff --git a/arch/sparc/kernel/pcr.c b/arch/sparc/kernel/pcr.c
index 269af58..d1e1ecf 100644
--- a/arch/sparc/kernel/pcr.c
+++ b/arch/sparc/kernel/pcr.c
@@ -43,7 +43,7 @@ void __irq_entry deferred_pcr_work_irq(int irq, struct pt_regs *regs)
set_irq_regs(old_regs);
 }
 
-void arch_irq_work_raise(void)
+void __arch_irq_work_raise(void)
 {
set_softint(1 << PIL_DEFERRED_PCR_WORK);
 }
diff --git a/arch/x86/include/asm/irq_work.h b/arch/x86/include/asm/irq_work.h
index 38eed96..dad8266 100644
--- a/arch/x86/include/asm/irq_work.h
+++ b/arch/x86/include/asm/irq_work.h
@@ -1,10 +1,11 @@
 #ifndef _ASM_X86_IRQ_WORK_H
 #define _ASM_X86_IRQ_WORK_H
 
-#ifndef CONFIG_X86_LOCAL_APIC
-#include <asm-generic/irq_work.h>
-# else
-extern void arch_irq_work_raise(void);
+#ifdef CONFIG_X86_LOCAL_APIC
+extern void __arch_irq_work_raise(void);
+#define arch_irq_work_raise __arch_irq_work_raise
 #endif
 
+#include <asm-generic/irq_work.h>
+
 #endif
diff --git a/arch/x86/kernel/irq_work.c b/arch/x86/kernel/irq_work.c
index 95f5d4e..7389d5e 100644
--- a/arch/x86/kernel/irq_work.c
+++ b/arch/x86/kernel/irq_work.c
@@ -19,7 +19,7 @@ void smp_irq_work_interrupt(struct pt_regs *regs)
 }
 
 #ifdef CONFIG_X86_LOCAL_APIC
-void arch_irq_work_raise(void)
+void __arch_irq_work_raise(void)
 {
if (!cpu_has_apic)
return;
diff --git a/include/asm-generic/irq_work.h b/include/asm-generic/irq_work.h
index a2d4108..e3e4f02 100644
--- a/include/asm-generic/irq_work.h
+++ b/include/asm-generic/irq_work.h
@@ -4,6 +4,20 @@
 /*
  * Lame architectures will get the timer tick callback
  */
+#ifndef arch_irq_work_raise
 static inline void arch_irq_work_raise(void) { }
+#endif
+
+/*
+ * Unless told otherwise, consider the arch doesn't implement irq work
+ * using self IPIs but through another way like defaulting to the hook
+ * on the sched tick.
+ */
+#ifndef arch_irq_work_has_ipi
+static inline bool arch_irq_work_has_ipi(void)
+{
+	return false;
+}
+#endif

[RFC PATCH 1/8] irq_work: Move irq_work_raise() declaration/default definition to arch headers

2012-10-20 Thread Frederic Weisbecker
This optimization doesn't matter much. But it prepares the arch
headers we need in order to add a new API that detects whether the
arch can trigger self-IPIs to implement irq work.

This is necessary later to make printk work on nohz CPUs.
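The pattern is the usual asm-generic override dance; what the generic
header ends up providing looks like this (sketch):

/*
 * include/asm-generic/irq_work.h: archs with a real implementation
 * declare arch_irq_work_raise() in their own header and #define the
 * symbol before including this file; everyone else falls back to the
 * empty stub and relies on the timer tick to run the works.
 */
#ifndef arch_irq_work_raise
static inline void arch_irq_work_raise(void) { }
#endif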

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 arch/alpha/include/asm/irq_work.h|6 ++
 arch/arm/include/asm/irq_work.h  |1 +
 arch/arm64/include/asm/irq_work.h|1 +
 arch/blackfin/include/asm/irq_work.h |1 +
 arch/frv/include/asm/irq_work.h  |1 +
 arch/hexagon/include/asm/irq_work.h  |1 +
 arch/mips/include/asm/irq_work.h |1 +
 arch/parisc/include/asm/irq_work.h   |1 +
 arch/powerpc/include/asm/irq_work.h  |6 ++
 arch/s390/include/asm/irq_work.h |1 +
 arch/sh/include/asm/irq_work.h   |1 +
 arch/sparc/include/asm/irq_work.h|6 ++
 arch/x86/include/asm/irq_work.h  |   10 ++
 arch/x86/kernel/irq_work.c   |4 ++--
 include/asm-generic/irq_work.h   |9 +
 include/linux/irq_work.h |1 +
 kernel/irq_work.c|7 ---
 17 files changed, 49 insertions(+), 9 deletions(-)
 create mode 100644 arch/alpha/include/asm/irq_work.h
 create mode 100644 arch/arm/include/asm/irq_work.h
 create mode 100644 arch/arm64/include/asm/irq_work.h
 create mode 100644 arch/blackfin/include/asm/irq_work.h
 create mode 100644 arch/frv/include/asm/irq_work.h
 create mode 100644 arch/hexagon/include/asm/irq_work.h
 create mode 100644 arch/mips/include/asm/irq_work.h
 create mode 100644 arch/parisc/include/asm/irq_work.h
 create mode 100644 arch/powerpc/include/asm/irq_work.h
 create mode 100644 arch/s390/include/asm/irq_work.h
 create mode 100644 arch/sh/include/asm/irq_work.h
 create mode 100644 arch/sparc/include/asm/irq_work.h
 create mode 100644 arch/x86/include/asm/irq_work.h
 create mode 100644 include/asm-generic/irq_work.h

diff --git a/arch/alpha/include/asm/irq_work.h b/arch/alpha/include/asm/irq_work.h
new file mode 100644
index 000..814ff3d
--- /dev/null
+++ b/arch/alpha/include/asm/irq_work.h
@@ -0,0 +1,6 @@
+#ifndef _ALPHA_IRQ_WORK_H
+#define _ALPHA_IRQ_WORK_H
+
+extern void arch_irq_work_raise(void);
+
+#endif
diff --git a/arch/arm/include/asm/irq_work.h b/arch/arm/include/asm/irq_work.h
new file mode 100644
index 000..f1bffa2
--- /dev/null
+++ b/arch/arm/include/asm/irq_work.h
@@ -0,0 +1 @@
+#include <asm-generic/irq_work.h>
diff --git a/arch/arm64/include/asm/irq_work.h b/arch/arm64/include/asm/irq_work.h
new file mode 100644
index 000..f1bffa2
--- /dev/null
+++ b/arch/arm64/include/asm/irq_work.h
@@ -0,0 +1 @@
+#include <asm-generic/irq_work.h>
diff --git a/arch/blackfin/include/asm/irq_work.h b/arch/blackfin/include/asm/irq_work.h
new file mode 100644
index 000..f1bffa2
--- /dev/null
+++ b/arch/blackfin/include/asm/irq_work.h
@@ -0,0 +1 @@
+#include <asm-generic/irq_work.h>
diff --git a/arch/frv/include/asm/irq_work.h b/arch/frv/include/asm/irq_work.h
new file mode 100644
index 000..f1bffa2
--- /dev/null
+++ b/arch/frv/include/asm/irq_work.h
@@ -0,0 +1 @@
+#include <asm-generic/irq_work.h>
diff --git a/arch/hexagon/include/asm/irq_work.h b/arch/hexagon/include/asm/irq_work.h
new file mode 100644
index 000..f1bffa2
--- /dev/null
+++ b/arch/hexagon/include/asm/irq_work.h
@@ -0,0 +1 @@
+#include <asm-generic/irq_work.h>
diff --git a/arch/mips/include/asm/irq_work.h b/arch/mips/include/asm/irq_work.h
new file mode 100644
index 000..f1bffa2
--- /dev/null
+++ b/arch/mips/include/asm/irq_work.h
@@ -0,0 +1 @@
+#include <asm-generic/irq_work.h>
diff --git a/arch/parisc/include/asm/irq_work.h b/arch/parisc/include/asm/irq_work.h
new file mode 100644
index 000..f1bffa2
--- /dev/null
+++ b/arch/parisc/include/asm/irq_work.h
@@ -0,0 +1 @@
+#include <asm-generic/irq_work.h>
diff --git a/arch/powerpc/include/asm/irq_work.h b/arch/powerpc/include/asm/irq_work.h
new file mode 100644
index 000..8b9927f
--- /dev/null
+++ b/arch/powerpc/include/asm/irq_work.h
@@ -0,0 +1,6 @@
+#ifndef _ASM_POWERPC_IRQ_WORK_H
+#define _ASM_POWERPC_IRQ_WORK_H
+
+extern void arch_irq_work_raise(void);
+
+#endif
diff --git a/arch/s390/include/asm/irq_work.h b/arch/s390/include/asm/irq_work.h
new file mode 100644
index 000..f1bffa2
--- /dev/null
+++ b/arch/s390/include/asm/irq_work.h
@@ -0,0 +1 @@
+#include <asm-generic/irq_work.h>
diff --git a/arch/sh/include/asm/irq_work.h b/arch/sh/include/asm/irq_work.h
new file mode 100644
index 000..f1bffa2
--- /dev/null
+++ b/arch/sh/include/asm/irq_work.h
@@ -0,0 +1 @@
+#include <asm-generic/irq_work.h>
diff --git a/arch/sparc/include/asm/irq_work.h b/arch/sparc/include/asm/irq_work.h
new file mode 100644
index 000..1d062a6

Re: [RFC PATCH 8/8] printk: Wake up klogd using irq_work

2012-10-20 Thread Frederic Weisbecker
2012/10/20 Joe Perches j...@perches.com:
 On Sat, 2012-10-20 at 12:22 -0400, Frederic Weisbecker wrote:
 lets implement the printk tick using irq work.

 Hi Frederic.

 Can you redo this change please against -next in a few days?

 Andrew Morton picked up this series,
 https://lkml.org/lkml/2012/10/17/41
 but it's not yet in -next.

 https://lkml.org/lkml/2012/10/18/600

 kernel/printk.c has been moved to kernel/printk/printk.c
 and broken up into multiple files.

Sure, I will rebase to your work on the next version.

Thanks.


Re: [PATCH cgroup/for-3.7-fixes 1/2] Revert cgroup: Remove task_lock() from cgroup_post_fork()

2012-10-20 Thread Frederic Weisbecker
2012/10/19 Tejun Heo t...@kernel.org:
 Hello, Frederic.

 On Fri, Oct 19, 2012 at 03:44:20PM -0400, Frederic Weisbecker wrote:
  For -stable, I think it's better to revert.  If you want to remove
  task_lock, let's do it for 3.8.

 I don't think that a wrong comment justifies a patch to stable.

 I'm not really sure whether it's safe or not.  It seems all usages are
 protected by write locking css_set_lock but maybe I'm missing
 something and as the commit is born out of confusion, I'm very
 inclined to revert it by default.  Are you sure this one is safe?

Thinking about it further, one scenario was worrying me, but it
eventually looks safe, though only by accident.

CPU 0                                          CPU 1

cgroup_task_migrate {
    task_lock(p)
    rcu_assign_pointer(tsk->cgroups, newcg);
    task_unlock(tsk);

    write_lock(&css_set_lock);
    if (!list_empty(&tsk->cg_list))
        list_move(&tsk->cg_list, &newcg->tasks);
    write_unlock(&css_set_lock);

                                               write_lock(&css_set_lock);
    put_css_set(oldcg);
                                               list_add(&child->cg_list,
                                                        &child->cgroups->tasks); (1)

On (1), child->cgroups should have the value of newcg and not oldcg
due to the memory ordering implied by the locking of css_set_lock:
CPU 1 can only take css_set_lock after CPU 0 has released it, so the
unlock/lock pair orders the tsk->cgroups store before the list_add.
Now I can't guarantee that because I'm no memory ordering expert. And even
if it's safe, it's so very non obvious that I now agree with you:
let's revert  the patch and restart with a better base by gathering
all the cgroup fork code in the current cgroup_post_fork place.

Thanks.


Re: [PATCH cgroup/for-3.7-fixes 1/2] Revert cgroup: Remove task_lock() from cgroup_post_fork()

2012-10-20 Thread Frederic Weisbecker
2012/10/20 Frederic Weisbecker fweis...@gmail.com:
 2012/10/19 Tejun Heo t...@kernel.org:
 Hello, Frederic.

 On Fri, Oct 19, 2012 at 03:44:20PM -0400, Frederic Weisbecker wrote:
  For -stable, I think it's better to revert.  If you want to remove
  task_lock, let's do it for 3.8.

 I don't think that a wrong comment justifies a patch to stable.

 I'm not really sure whether it's safe or not.  It seems all usages are
 protected by write locking css_set_lock but maybe I'm missing
 something and as the commit is born out of confusion, I'm very
 inclined to revert it by default.  Are you sure this one is safe?

 Thinking about it further, one scenario was worrying me, but it
 eventually looks safe, though only by accident.

 CPU 0                                          CPU 1

 cgroup_task_migrate {
     task_lock(p)
     rcu_assign_pointer(tsk->cgroups, newcg);
     task_unlock(tsk);

     write_lock(&css_set_lock);
     if (!list_empty(&tsk->cg_list))
         list_move(&tsk->cg_list, &newcg->tasks);
     write_unlock(&css_set_lock);

                                                write_lock(&css_set_lock);
     put_css_set(oldcg);
                                                list_add(&child->cg_list,
                                                         &child->cgroups->tasks); (1)

gmail mangled everything :(


Re: [PATCH cgroup/for-3.7-fixes 1/2] Revert cgroup: Remove task_lock() from cgroup_post_fork()

2012-10-22 Thread Frederic Weisbecker
2012/10/21 Tejun Heo t...@kernel.org:
 Hello, Frederic.

 On Sat, Oct 20, 2012 at 02:21:43PM -0400, Frederic Weisbecker wrote:
 CPU 0                                          CPU 1

 cgroup_task_migrate {
     task_lock(p)
     rcu_assign_pointer(tsk->cgroups, newcg);
     task_unlock(tsk);

     write_lock(&css_set_lock);
     if (!list_empty(&tsk->cg_list))
         list_move(&tsk->cg_list, &newcg->tasks);
     write_unlock(&css_set_lock);

                                                write_lock(&css_set_lock);
     put_css_set(oldcg);
                                                list_add(&child->cg_list,
                                                         &child->cgroups->tasks); (1)

 Man, that's confusing. :)

Sorry and I'm currently stuck in some airport and too lazy to reorder
the above lines :)


 On (1), child->cgroups should have the value of newcg and not oldcg
 due to the memory ordering implied by the locking of css_set_lock. Now
 I can't guarantee that because I'm no memory ordering expert. And even
 if it's safe, it's so very non obvious that I now agree with you:
 let's revert  the patch and restart with a better base by gathering
 all the cgroup fork code in the current cgroup_post_fork place.

 Aye aye, let's move everything to cgroup_post_fork() and then we don't
 have to worry about grabbing task_lock multiple times.

Agreed. and Acked-by: Frederic Weisbecker fweis...@gmail.com


Re: [RFC PATCH 5/8] irq_work: Make self-IPIs optable

2012-10-23 Thread Frederic Weisbecker
2012/10/22 Peter Zijlstra pet...@infradead.org:
 On Sat, 2012-10-20 at 12:22 -0400, Frederic Weisbecker wrote:
 +   if (empty) {
 +   /*
 +    * If an IPI is requested, raise it right away. Otherwise wait
 +    * for the next tick unless it's stopped. Now if the arch uses
 +    * some other obscure way than IPI to raise an irq work, just raise
 +    * and don't think further.
 +    */
 +   if (ipi || !arch_irq_work_has_ipi() || tick_nohz_tick_stopped())
 +   arch_irq_work_raise();
 +   }
 preempt_enable();
  }

 Doesn't this have a problem where we enqueue the first lazy and then one
 with ipi? In that case it appears we won't send the IPI because the
 queue wasn't empty.

Good point! I need to send an ipi in that case. Will fix on the next version.

Thanks.


[PATCH 1/3] kvm: Directly account vtime to system on guest switch

2012-10-08 Thread Frederic Weisbecker
Switching to or from guest context is done in ioctl context.
So by the time we call kvm_guest_enter() or kvm_guest_exit()
we know we are not running the idle task.

As a result, we can directly account the cputime using
vtime_account_system().

There are two good reasons to do this:

* We avoid one indirect call and some useless checks on guest switch.
It optimizes this fast path a bit.

* In the case of CONFIG_IRQ_TIME_ACCOUNTING, calling vtime_account()
checks for irq time to account. This is pointless since we know
we are not in an irq on guest switch. This is wasting cpu cycles
for no good reason. vtime_account_system() OTOH is a no-op in
this config option.

A further optimization may consist in introducing a vtime_account_guest()
that directly calls account_guest_time().
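Such a helper might look like this (hypothetical sketch only, not part
of this patch; vtime_delta() stands in for whatever arch hook computes
the cputime elapsed since the last snapshot):

static inline void vtime_account_guest(struct task_struct *tsk)
{
	/* Bypass the system/guest dispatch in account_system_time()
	 * and feed the elapsed time straight to the guest bucket. */
	cputime_t delta = vtime_delta(tsk);	/* assumed arch-provided */

	account_guest_time(tsk, delta, delta);
}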

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Tony Luck tony.l...@intel.com
Cc: Fenghua Yu fenghua...@intel.com
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Heiko Carstens heiko.carst...@de.ibm.com
Cc: Martin Schwidefsky schwidef...@de.ibm.com
Cc: Avi Kivity a...@redhat.com
Cc: Marcelo Tosatti mtosa...@redhat.com
Cc: Joerg Roedel joerg.roe...@amd.com
Cc: Alexander Graf ag...@suse.de
Cc: Xiantao Zhang xiantao.zh...@intel.com
Cc: Christian Borntraeger borntrae...@de.ibm.com
Cc: Cornelia Huck cornelia.h...@de.ibm.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
---
 arch/ia64/kernel/time.c |1 +
 arch/powerpc/kernel/time.c  |1 +
 arch/s390/kernel/vtime.c|4 
 include/linux/kernel_stat.h |1 +
 include/linux/kvm_host.h|   12 ++--
 5 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 80ff9ac..3337e97 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -141,6 +141,7 @@ void vtime_account_system(struct task_struct *tsk)
 
account_system_time(tsk, 0, delta, delta);
 }
+EXPORT_SYMBOL_GPL(vtime_account_system);
 
 void vtime_account_idle(struct task_struct *tsk)
 {
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index eaa9d0e..5547452 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -345,6 +345,7 @@ void vtime_account_system(struct task_struct *tsk)
if (stolen)
account_steal_time(stolen);
 }
+EXPORT_SYMBOL_GPL(vtime_account_system);
 
 void vtime_account_idle(struct task_struct *tsk)
 {
diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index cb5093c..972082d 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -140,6 +140,10 @@ void vtime_account(struct task_struct *tsk)
 }
 EXPORT_SYMBOL_GPL(vtime_account);
 
+void vtime_account_system(struct task_struct *tsk)
+__attribute__((alias("vtime_account")));
+EXPORT_SYMBOL_GPL(vtime_account_system);
+
 void __kprobes vtime_stop_cpu(void)
 {
struct s390_idle_data *idle = &__get_cpu_var(s390_idle);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 36d12f0..6747d4b 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -136,6 +136,7 @@ extern void vtime_account_system(struct task_struct *tsk);
 extern void vtime_account_idle(struct task_struct *tsk);
 #else
 static inline void vtime_task_switch(struct task_struct *prev) { }
+static inline void vtime_account_system(struct task_struct *tsk) { }
 #endif
 
 #endif /* _LINUX_KERNEL_STAT_H */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8a59e0a..148db3e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -685,7 +685,11 @@ static inline int kvm_deassign_device(struct kvm *kvm,
 static inline void kvm_guest_enter(void)
 {
BUG_ON(preemptible());
-   vtime_account(current);
+   /*
+* This is running in ioctl context so we can avoid
+* the call to vtime_account() with its unnecessary idle check.
+*/
+   vtime_account_system(current);
current->flags |= PF_VCPU;
/* KVM does not hold any references to rcu protected data when it
 * switches CPU into a guest mode. In fact switching to a guest mode
@@ -699,7 +703,11 @@ static inline void kvm_guest_enter(void)
 
 static inline void kvm_guest_exit(void)
 {
-   vtime_account(current);
+   /*
+* This is running in ioctl context so we can avoid
+* the call to vtime_account() with its unnecessary idle check.
+*/
+   vtime_account_system(current);
current->flags &= ~PF_VCPU;
 }
 
-- 
1.7.5.4



[PATCH 2/3] cputime: Specialize irq vtime hooks

2012-10-08 Thread Frederic Weisbecker
With CONFIG_VIRT_CPU_ACCOUNTING, when vtime_account()
is called in irq entry/exit, we perform a check on the
context: if we are interrupting the idle task when the
interrupt fires, account the pending cputime to idle,
otherwise account to system time or its sub-areas: tsk->stime,
hardirq time, softirq time, ...

However this check for idle only concerns the hardirq entry.
We only account pending idle time when a hardirq interrupts idle.

In the other cases we always account to system/irq time:

* On hardirq exit we account the time to hardirq time.

* Softirqs don't interrupt idle directly. They are either
following a hardirq that has already accounted the pending
idle time or we are running in ksoftirqd and idle time has been
accounted in a previous context switch.

To optimize this and avoid the indirect call to vtime_account()
and the checks it performs, specialize the vtime irq APIs and
only perform the check on hard irq entry. Other vtime calls
can directly call vtime_account_system().

CONFIG_IRQ_TIME_ACCOUNTING behaviour doesn't change and directly
maps to its own vtime_account() implementation. One may want
to take advantage of the new APIs to optimize irq time accounting
as well in the future.
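The resulting calling convention is (sketch):

static void example_irq_paths(void)
{
	vtime_account_irq_enter(current, HARDIRQ_OFFSET); /* may account idle */
	/* ... hardirq handler ... */
	vtime_account_irq_exit(current, HARDIRQ_OFFSET);  /* always system */

	vtime_account_irq_enter(current, SOFTIRQ_OFFSET); /* always system */
	/* ... softirq handler ... */
	vtime_account_irq_exit(current, SOFTIRQ_OFFSET);
}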

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
---
 include/linux/hardirq.h |   82 +++
 include/linux/kernel_stat.h |9 -
 kernel/softirq.c|6 ++--
 3 files changed, 70 insertions(+), 27 deletions(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index cab3da3..c126ffb 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -131,13 +131,65 @@ extern void synchronize_irq(unsigned int irq);
 
 struct task_struct;
 
-#if !defined(CONFIG_VIRT_CPU_ACCOUNTING) && !defined(CONFIG_IRQ_TIME_ACCOUNTING)
-static inline void vtime_account(struct task_struct *tsk)
+#ifdef CONFIG_TICK_CPU_ACCOUNTING
+static inline void vtime_account(struct task_struct *tsk) { }
+static inline void vtime_account_irq_enter(struct task_struct *tsk,
+  unsigned long offset) { }
+static inline void vtime_account_irq_exit(struct task_struct *tsk,
+ unsigned long offset) { }
+#else /* !CONFIG_TICK_CPU_ACCOUNTING */
+extern void vtime_account(struct task_struct *tsk);
+#endif /* !CONFIG_TICK_CPU_ACCOUNTING */
+
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+extern void vtime_task_switch(struct task_struct *prev);
+extern void vtime_account_system(struct task_struct *tsk);
+extern void vtime_account_idle(struct task_struct *tsk);
+
+static inline void vtime_account_irq_enter(struct task_struct *tsk,
+  unsigned long offset)
 {
+   /*
+    * On hardirq entry, we need to check which context we are interrupting.
+    * Time may be accounted to either idle or system.
+    */
+   if (offset == HARDIRQ_OFFSET) {
+   vtime_account(tsk);
+   } else {
+   /*
+* Softirqs never interrupt idle directly. Either the hardirq
+* already did and accounted the idle time or we run in
+* ksoftirqd and idle time was accounted on context switch.
+*/
+   vtime_account_system(tsk);
+   }
 }
-#else
-extern void vtime_account(struct task_struct *tsk);
-#endif
+
+static inline void vtime_account_irq_exit(struct task_struct *tsk,
+ unsigned long offset)
+{
+   /* On hard|softirq exit we always account to hard|softirq cputime */
+   vtime_account_system(tsk);
+}
+#else /* !CONFIG_VIRT_CPU_ACCOUNTING */
+static inline void vtime_task_switch(struct task_struct *prev) { }
+static inline void vtime_account_system(struct task_struct *tsk) { }
+#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
+
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static inline void vtime_account_irq_enter(struct task_struct *tsk,
+  unsigned long offset)
+{
+   vtime_account(tsk);
+}
+
+static inline void vtime_account_irq_exit(struct task_struct *tsk,
+ unsigned long offset)
+{
+   vtime_account(tsk);
+}
+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
 
 #if defined(CONFIG_TINY_RCU) || defined(CONFIG_TINY_PREEMPT_RCU)
 
@@ -160,11 +212,11 @@ extern void rcu_nmi_exit(void);
 * always balanced, so the interrupted value of ->hardirq_context
  * will always be restored.
  */
-#define __irq_enter()  \
-   do {\
-   vtime_account(current); \
-   add_preempt_count(HARDIRQ_OFFSET);  \
-   trace_hardirq_enter();  \
+#define __irq_enter()					\
+	do {						\
+		vtime_account_irq_enter(current, HARDIRQ_OFFSET); \
+		add_preempt_count(HARDIRQ_OFFSET);	\
+		trace_hardirq_enter();			\
+	} while (0)

[PATCH 0/3] cputime: Moar cleanups / enhancements

2012-10-08 Thread Frederic Weisbecker
Hi,

I think that the 1st and 3rd patches are pretty uncontroversial given
how vtime_account() confusingly tries to do everything for
CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_IRQ_TIME_ACCOUNTING.

I believe the 2nd is also desired. vtime_account() is called
two times per irq, sometimes more if softirqs are involved. So
I think we want to optimize that by directly calling its specialized
APIs when possible.

There is still some work to do but I'm proceeding step by step.
I may focus more on that generic vtime implementation next
time to implement cputime accounting for the tickmess patchset.
That, for sure, will inspire even more cputime optimizations/cleanups.

Thanks.

PS: tested on x86 and ppc64 (checked reliability of times and /proc/stat),
but only build-tested on s390 and ia64.

Frederic Weisbecker (3):
  kvm: Directly account vtime to system on guest switch
  cputime: Specialize irq vtime hooks
  cputime: Separate irqtime accounting from generic vtime

 arch/ia64/kernel/time.c |1 +
 arch/powerpc/kernel/time.c  |1 +
 arch/s390/kernel/vtime.c|4 ++
 include/linux/hardirq.h |   80 +++
 include/linux/kernel_stat.h |8 
 include/linux/kvm_host.h|   12 +-
 kernel/sched/cputime.c  |8 ++--
 kernel/softirq.c|6 ++--
 8 files changed, 88 insertions(+), 32 deletions(-)

-- 
1.7.5.4



[PATCH 3/3] cputime: Separate irqtime accounting from generic vtime

2012-10-08 Thread Frederic Weisbecker
vtime_account() doesn't have the same role in
CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_IRQ_TIME_ACCOUNTING.

In the first case it handles time accounting in any context. In
the second case it only handles irq time accounting.

So when vtime_account() is called from outside vtime_account_irq_*()
this call is pointless to CONFIG_IRQ_TIME_ACCOUNTING.

To fix the confusion, change vtime_account() to irqtime_account_irq()
in CONFIG_IRQ_TIME_ACCOUNTING. This way we ensure future vtime_account()
calls won't waste useless cycles in the irqtime APIs.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
---
 include/linux/hardirq.h |   26 --
 kernel/sched/cputime.c  |8 
 2 files changed, 16 insertions(+), 18 deletions(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index c126ffb..dc2052c 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -131,17 +131,8 @@ extern void synchronize_irq(unsigned int irq);
 
 struct task_struct;
 
-#ifdef CONFIG_TICK_CPU_ACCOUNTING
-static inline void vtime_account(struct task_struct *tsk) { }
-static inline void vtime_account_irq_enter(struct task_struct *tsk,
-  unsigned long offset) { }
-static inline void vtime_account_irq_exit(struct task_struct *tsk,
- unsigned long offset) { }
-#else /* !CONFIG_TICK_CPU_ACCOUNTING */
-extern void vtime_account(struct task_struct *tsk);
-#endif /* !CONFIG_TICK_CPU_ACCOUNTING */
-
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING
+extern void vtime_account(struct task_struct *tsk);
 extern void vtime_task_switch(struct task_struct *prev);
 extern void vtime_account_system(struct task_struct *tsk);
 extern void vtime_account_idle(struct task_struct *tsk);
@@ -174,21 +165,28 @@ static inline void vtime_account_irq_exit(struct task_struct *tsk,
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING */
 static inline void vtime_task_switch(struct task_struct *prev) { }
 static inline void vtime_account_system(struct task_struct *tsk) { }
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
+extern void irqtime_account_irq(struct task_struct *tsk);
+
 static inline void vtime_account_irq_enter(struct task_struct *tsk,
   unsigned long offset)
 {
-   vtime_account(tsk);
+   irqtime_account_irq(tsk);
 }
 
 static inline void vtime_account_irq_exit(struct task_struct *tsk,
  unsigned long offset)
 {
-   vtime_account(tsk);
+   irqtime_account_irq(tsk);
 }
-#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+#else /* !CONFIG_IRQ_TIME_ACCOUNTING */
+static inline void vtime_account_irq_enter(struct task_struct *tsk,
+  unsigned long offset) { }
+static inline void vtime_account_irq_exit(struct task_struct *tsk,
+ unsigned long offset) { }
+#endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
+#endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 
 #if defined(CONFIG_TINY_RCU) || defined(CONFIG_TINY_PREEMPT_RCU)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 81b763b..7ad407a 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -10,11 +10,11 @@
 
 /*
  * There are no locks covering percpu hardirq/softirq time.
- * They are only modified in vtime_account, on corresponding CPU
+ * They are only modified in irqtime_account_irq, on corresponding CPU
  * with interrupts disabled. So, writes are safe.
  * They are read and saved off onto struct rq in update_rq_clock().
  * This may result in other CPU reading this CPU's irq time and can
- * race with irq/vtime_account on this CPU. We would either get old
+ * race with irqtime_account_irq on this CPU. We would either get old
  * or new value with a side effect of accounting a slice of irq time to wrong
 * task when irq is in progress while we read rq->clock. That is a worthy
  * compromise in place of having locks on each irq in account_system_time.
@@ -43,7 +43,7 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
  * Called before incrementing preempt_count on {soft,}irq_enter
  * and before decrementing preempt_count on {soft,}irq_exit.
  */
-void vtime_account(struct task_struct *curr)
+void irqtime_account_irq(struct task_struct *curr)
 {
unsigned long flags;
s64 delta;
@@ -73,7 +73,7 @@ void vtime_account(struct task_struct *curr)
irq_time_write_end();
local_irq_restore(flags);
 }
-EXPORT_SYMBOL_GPL(vtime_account);
+EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
 static int irqtime_account_hi_update(void)
 {
-- 
1.7.5.4



[PATCH v2] rcu: Remove rcu_switch()

2012-10-10 Thread Frederic Weisbecker
It's only there to call rcu_user_hooks_switch(). Let's
just call rcu_user_hooks_switch() directly; we don't need this
function in the middle.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Josh Triplett j...@joshtriplett.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Richard Weinberger rich...@nod.at
Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com
---
 arch/um/drivers/mconsole_kern.c |2 +-
 include/linux/rcupdate.h|2 ++
 include/linux/sched.h   |8 
 kernel/sched/core.c |2 +-
 4 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/arch/um/drivers/mconsole_kern.c b/arch/um/drivers/mconsole_kern.c
index c17de0d..0b989af 100644
--- a/arch/um/drivers/mconsole_kern.c
+++ b/arch/um/drivers/mconsole_kern.c
@@ -705,7 +705,7 @@ static void stack_proc(void *arg)
struct task_struct *from = current, *to = arg;
 
to->thread.saved_task = from;
-   rcu_switch(from, to);
+   rcu_user_hooks_switch(from, to);
switch_to(from, to, from);
 }
 
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 7c968e4..5d009de 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -204,6 +204,8 @@ static inline void rcu_user_enter(void) { }
 static inline void rcu_user_exit(void) { }
 static inline void rcu_user_enter_after_irq(void) { }
 static inline void rcu_user_exit_after_irq(void) { }
+static inline void rcu_user_hooks_switch(struct task_struct *prev,
+struct task_struct *next) { }
 #endif /* CONFIG_RCU_USER_QS */
 
 extern void exit_rcu(void);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fade317..c300c7c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1887,14 +1887,6 @@ static inline void rcu_copy_process(struct task_struct *p)
 
 #endif
 
-static inline void rcu_switch(struct task_struct *prev,
- struct task_struct *next)
-{
-#ifdef CONFIG_RCU_USER_QS
-   rcu_user_hooks_switch(prev, next);
-#endif
-}
-
 static inline void tsk_restore_flags(struct task_struct *task,
unsigned long orig_flags, unsigned long flags)
 {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dd036fe..b53a485 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2081,7 +2081,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 #endif
 
/* Here we just switch the register state and the stack. */
-   rcu_switch(prev, next);
+   rcu_user_hooks_switch(prev, next);
switch_to(prev, next, prev);
 
barrier();
-- 
1.7.5.4



[PATCH] rcu: Advise most users not to enable RCU user mode

2012-10-10 Thread Frederic Weisbecker
Discourage distros from enabling CONFIG_RCU_USER_QS
because it brings overhead for no benefits yet.

It's not a useful feature on its own until we can
fully run an adaptive tickless kernel.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
---
 init/Kconfig |   12 
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index c26b8a1..89093e1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -451,6 +451,12 @@ config RCU_USER_QS
  excluded from the global RCU state machine and thus doesn't
  need to keep the timer tick on for RCU.
 
+ Unless you want to hack and help the development of the full
+ tickless feature, you shouldn't enable this option. It adds
+ unnecessary overhead.
+
+ If unsure say N
+
 config RCU_USER_QS_FORCE
	bool "Force userspace extended QS by default"
depends on RCU_USER_QS
@@ -459,6 +465,12 @@ config RCU_USER_QS_FORCE
  test this feature that treats userspace as an extended quiescent
  state until we have a real user like a full adaptive nohz option.
 
+ Unless you want to hack and help the development of the full
+ tickless feature, you shouldn't enable this option. It adds
+ unnecessary overhead.
+
+ If unsure say N
+
 config RCU_FANOUT
	int "Tree-based hierarchical RCU fanout value"
range 2 64 if 64BIT
-- 
1.7.5.4



[RFC PATCH 1/5] irq_work: Move irq_work_raise() declaration/default definition to arch headers

2012-10-12 Thread Frederic Weisbecker
This optimization doesn't matter much. But it prepares the arch
headers we need in order to add a new API that detects when the
arch hooks on the tick to implement irq work.

This is necessary later to make printk work on nohz CPUs.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
---
 arch/alpha/include/asm/irq_work.h|6 ++
 arch/arm/include/asm/irq_work.h  |1 +
 arch/arm64/include/asm/irq_work.h|1 +
 arch/blackfin/include/asm/irq_work.h |1 +
 arch/frv/include/asm/irq_work.h  |1 +
 arch/hexagon/include/asm/irq_work.h  |1 +
 arch/mips/include/asm/irq_work.h |1 +
 arch/parisc/include/asm/irq_work.h   |1 +
 arch/powerpc/include/asm/irq_work.h  |6 ++
 arch/s390/include/asm/irq_work.h |1 +
 arch/sh/include/asm/irq_work.h   |1 +
 arch/sparc/include/asm/irq_work.h|6 ++
 arch/x86/include/asm/irq_work.h  |   10 ++
 arch/x86/kernel/irq_work.c   |4 ++--
 include/asm-generic/irq_work.h   |9 +
 include/linux/irq_work.h |1 +
 kernel/irq_work.c|7 ---
 17 files changed, 49 insertions(+), 9 deletions(-)
 create mode 100644 arch/alpha/include/asm/irq_work.h
 create mode 100644 arch/arm/include/asm/irq_work.h
 create mode 100644 arch/arm64/include/asm/irq_work.h
 create mode 100644 arch/blackfin/include/asm/irq_work.h
 create mode 100644 arch/frv/include/asm/irq_work.h
 create mode 100644 arch/hexagon/include/asm/irq_work.h
 create mode 100644 arch/mips/include/asm/irq_work.h
 create mode 100644 arch/parisc/include/asm/irq_work.h
 create mode 100644 arch/powerpc/include/asm/irq_work.h
 create mode 100644 arch/s390/include/asm/irq_work.h
 create mode 100644 arch/sh/include/asm/irq_work.h
 create mode 100644 arch/sparc/include/asm/irq_work.h
 create mode 100644 arch/x86/include/asm/irq_work.h
 create mode 100644 include/asm-generic/irq_work.h

diff --git a/arch/alpha/include/asm/irq_work.h b/arch/alpha/include/asm/irq_work.h
new file mode 100644
index 000..814ff3d
--- /dev/null
+++ b/arch/alpha/include/asm/irq_work.h
@@ -0,0 +1,6 @@
+#ifndef _ALPHA_IRQ_WORK_H
+#define _ALPHA_IRQ_WORK_H
+
+extern void arch_irq_work_raise(void);
+
+#endif
diff --git a/arch/arm/include/asm/irq_work.h b/arch/arm/include/asm/irq_work.h
new file mode 100644
index 000..f1bffa2
--- /dev/null
+++ b/arch/arm/include/asm/irq_work.h
@@ -0,0 +1 @@
+#include <asm-generic/irq_work.h>
diff --git a/arch/arm64/include/asm/irq_work.h b/arch/arm64/include/asm/irq_work.h
new file mode 100644
index 000..f1bffa2
--- /dev/null
+++ b/arch/arm64/include/asm/irq_work.h
@@ -0,0 +1 @@
+#include <asm-generic/irq_work.h>
diff --git a/arch/blackfin/include/asm/irq_work.h b/arch/blackfin/include/asm/irq_work.h
new file mode 100644
index 000..f1bffa2
--- /dev/null
+++ b/arch/blackfin/include/asm/irq_work.h
@@ -0,0 +1 @@
+#include <asm-generic/irq_work.h>
diff --git a/arch/frv/include/asm/irq_work.h b/arch/frv/include/asm/irq_work.h
new file mode 100644
index 000..f1bffa2
--- /dev/null
+++ b/arch/frv/include/asm/irq_work.h
@@ -0,0 +1 @@
+#include <asm-generic/irq_work.h>
diff --git a/arch/hexagon/include/asm/irq_work.h b/arch/hexagon/include/asm/irq_work.h
new file mode 100644
index 000..f1bffa2
--- /dev/null
+++ b/arch/hexagon/include/asm/irq_work.h
@@ -0,0 +1 @@
+#include <asm-generic/irq_work.h>
diff --git a/arch/mips/include/asm/irq_work.h b/arch/mips/include/asm/irq_work.h
new file mode 100644
index 000..f1bffa2
--- /dev/null
+++ b/arch/mips/include/asm/irq_work.h
@@ -0,0 +1 @@
+#include <asm-generic/irq_work.h>
diff --git a/arch/parisc/include/asm/irq_work.h b/arch/parisc/include/asm/irq_work.h
new file mode 100644
index 000..f1bffa2
--- /dev/null
+++ b/arch/parisc/include/asm/irq_work.h
@@ -0,0 +1 @@
+#include <asm-generic/irq_work.h>
diff --git a/arch/powerpc/include/asm/irq_work.h b/arch/powerpc/include/asm/irq_work.h
new file mode 100644
index 000..8b9927f
--- /dev/null
+++ b/arch/powerpc/include/asm/irq_work.h
@@ -0,0 +1,6 @@
+#ifndef _ASM_POWERPC_IRQ_WORK_H
+#define _ASM_POWERPC_IRQ_WORK_H
+
+extern void arch_irq_work_raise(void);
+
+#endif
diff --git a/arch/s390/include/asm/irq_work.h b/arch/s390/include/asm/irq_work.h
new file mode 100644
index 000..f1bffa2
--- /dev/null
+++ b/arch/s390/include/asm/irq_work.h
@@ -0,0 +1 @@
+#include <asm-generic/irq_work.h>
diff --git a/arch/sh/include/asm/irq_work.h b/arch/sh/include/asm/irq_work.h
new file mode 100644
index 000..f1bffa2
--- /dev/null
+++ b/arch/sh/include/asm/irq_work.h
@@ -0,0 +1 @@
+#include <asm-generic/irq_work.h>
diff --git a/arch/sparc/include/asm/irq_work.h b/arch/sparc/include/asm/irq_work.h
new file mode 100644
index 000..1d062a6
--- /dev/null
+++ b/arch/sparc/include/asm/irq_work.h

[RFC PATCH 3/5] x86: Implement arch_irq_work_use_tick

2012-10-12 Thread Frederic Weisbecker
Most of the time, x86 can trigger self-IPIs. Tell the
irq work subsystem about it.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
---
 arch/x86/include/asm/irq_work.h |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/irq_work.h b/arch/x86/include/asm/irq_work.h
index dad8266..fc62b60 100644
--- a/arch/x86/include/asm/irq_work.h
+++ b/arch/x86/include/asm/irq_work.h
@@ -2,8 +2,12 @@
 #define _ASM_X86_IRQ_WORK_H
 
 #ifdef CONFIG_X86_LOCAL_APIC
+#include <asm/cpufeature.h>
+
 extern void __arch_irq_work_raise(void);
 #define arch_irq_work_raise __arch_irq_work_raise
+
+#define arch_irq_work_use_tick() (!cpu_has_apic)
 #endif
 
#include <asm-generic/irq_work.h>
-- 
1.7.5.4



[RFC PATCH 5/5] printk: Wake up klogd with irq_work on nohz CPU

2012-10-12 Thread Frederic Weisbecker
klogd is woken up asynchronously from the tick in order
to do it safely.

However, if printk is called when the tick is stopped, the reader
won't be woken up until the next interrupt, which might not fire
for a while. As a result, the user may miss some messages.

To fix this we try to schedule the wake up into an irq work
when the tick is stopped and irq work is not implemented on top
of the tick.

Ideally we could always rely on irq work for this to simplify the
code. But this may result in too many interrupts when we have
a lot of printk calls in a short period of time. So we only do this
when the tick is stopped.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
---
 kernel/printk.c |   42 ++
 1 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/kernel/printk.c b/kernel/printk.c
index 66a2ea3..c8ab918 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -42,6 +42,8 @@
#include <linux/notifier.h>
#include <linux/rculist.h>
#include <linux/poll.h>
+#include <linux/tick.h>
+#include <linux/irq_work.h>

#include <asm/uaccess.h>
 
@@ -1976,10 +1978,50 @@ int printk_needs_cpu(int cpu)
return __this_cpu_read(printk_pending);
 }
 
+#ifdef CONFIG_IRQ_WORK
+static void wake_klogd_irq_work(struct irq_work *irq_work)
+{
+   printk_tick();
+}
+#endif
+
+/*
+ * When the tick is stopped, we need another way to wake up
+ * klogd safely.
+ */
+static void wake_up_klogd_nohz(void)
+{
+   /*
+* If irq work is not itself implemented using the tick
+* it's a safe and fast way to wake up the reader.
+*/
+#ifdef CONFIG_IRQ_WORK
+   if (!arch_irq_work_use_tick()) {
+   static struct irq_work klogd_irq_work = {
+   .func = wake_klogd_irq_work
+   };
+
+   irq_work_queue(&klogd_irq_work);
+   return;
+   }
+#endif
+   /*
+* Our last resort in the case of idle is to bet
+* on the fact we haven't yet reached the last need_resched()
+* check before the CPU goes to halt. This way we go through
+* another idle loop to recheck printk_needs_cpu().
+*/
+   if (is_idle_task(current))
+   set_need_resched();
+}
+
 void wake_up_klogd(void)
 {
if (waitqueue_active(&log_wait))
this_cpu_or(printk_pending, PRINTK_PENDING_WAKEUP);
+
+   if (tick_nohz_tick_stopped())
+   wake_up_klogd_nohz();
 }
 
 static void console_cont_flush(char *text, size_t size)
-- 
1.7.5.4



[RFC PATCH 2/5] irq_work: Only run irq_work from tick if arch needs it

2012-10-12 Thread Frederic Weisbecker
This optimizes the tick path a bit for archs that have their
own way to run irq work.

This may be further optimized using static keys.
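
For illustration only, here is a rough sketch of the static key
direction (not part of this patch; irq_work_needs_tick and
irq_work_tick() are hypothetical names):

#include <linux/jump_label.h>
#include <linux/irq_work.h>

/*
 * Sketch: the key defaults to true; an arch with real self-IPI
 * support could do static_key_slow_dec() at boot so the tick path
 * compiles down to a patched-out branch.
 */
static struct static_key irq_work_needs_tick = STATIC_KEY_INIT_TRUE;

static inline void irq_work_tick(void)
{
	if (static_key_true(&irq_work_needs_tick))
		irq_work_run();
}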

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
---
 arch/alpha/include/asm/irq_work.h   |5 -
 arch/alpha/kernel/time.c|2 +-
 arch/powerpc/include/asm/irq_work.h |4 +++-
 arch/powerpc/kernel/time.c  |2 +-
 arch/sparc/include/asm/irq_work.h   |4 +++-
 arch/sparc/kernel/pcr.c |2 +-
 arch/x86/include/asm/irq_work.h |9 +
 arch/x86/kernel/irq_work.c  |2 +-
 include/asm-generic/irq_work.h  |   13 +
 kernel/timer.c  |2 +-
 10 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/arch/alpha/include/asm/irq_work.h b/arch/alpha/include/asm/irq_work.h
index 814ff3d..3d32132 100644
--- a/arch/alpha/include/asm/irq_work.h
+++ b/arch/alpha/include/asm/irq_work.h
@@ -1,6 +1,9 @@
 #ifndef _ALPHA_IRQ_WORK_H
 #define _ALPHA_IRQ_WORK_H
 
-extern void arch_irq_work_raise(void);
+extern void __arch_irq_work_raise(void);
+#define arch_irq_work_raise __arch_irq_work_raise
+
+#include <asm-generic/irq_work.h>
 
 #endif
diff --git a/arch/alpha/kernel/time.c b/arch/alpha/kernel/time.c
index e336694..91c5eec 100644
--- a/arch/alpha/kernel/time.c
+++ b/arch/alpha/kernel/time.c
@@ -90,7 +90,7 @@ DEFINE_PER_CPU(u8, irq_work_pending);
 #define test_irq_work_pending()  __get_cpu_var(irq_work_pending)
 #define clear_irq_work_pending() __get_cpu_var(irq_work_pending) = 0
 
-void arch_irq_work_raise(void)
+void __arch_irq_work_raise(void)
 {
set_irq_work_pending_flag();
 }
diff --git a/arch/powerpc/include/asm/irq_work.h b/arch/powerpc/include/asm/irq_work.h
index 8b9927f..8aa36aa 100644
--- a/arch/powerpc/include/asm/irq_work.h
+++ b/arch/powerpc/include/asm/irq_work.h
@@ -1,6 +1,8 @@
 #ifndef _ASM_POWERPC_IRQ_WORK_H
 #define _ASM_POWERPC_IRQ_WORK_H
 
-extern void arch_irq_work_raise(void);
+extern void __arch_irq_work_raise(void);
+#define arch_irq_work_raise __arch_irq_work_raise
 
+#include <asm-generic/irq_work.h>
 #endif
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index c9986fd..31565ac 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -466,7 +466,7 @@ DEFINE_PER_CPU(u8, irq_work_pending);
 
 #endif /* 32 vs 64 bit */
 
-void arch_irq_work_raise(void)
+void __arch_irq_work_raise(void)
 {
preempt_disable();
set_irq_work_pending_flag();
diff --git a/arch/sparc/include/asm/irq_work.h b/arch/sparc/include/asm/irq_work.h
index 1d062a6..383772d 100644
--- a/arch/sparc/include/asm/irq_work.h
+++ b/arch/sparc/include/asm/irq_work.h
@@ -1,6 +1,8 @@
 #ifndef ___ASM_SPARC_IRQ_H
 #define ___ASM_SPARC_IRQ_H
 
-extern void arch_irq_work_raise(void);
+extern void __arch_irq_work_raise(void);
+#define arch_irq_work_raise __arch_irq_work_raise
 
+#include <asm-generic/irq_work.h>
 #endif
diff --git a/arch/sparc/kernel/pcr.c b/arch/sparc/kernel/pcr.c
index 269af58..d1e1ecf 100644
--- a/arch/sparc/kernel/pcr.c
+++ b/arch/sparc/kernel/pcr.c
@@ -43,7 +43,7 @@ void __irq_entry deferred_pcr_work_irq(int irq, struct pt_regs *regs)
set_irq_regs(old_regs);
 }
 
-void arch_irq_work_raise(void)
+void __arch_irq_work_raise(void)
 {
set_softint(1 << PIL_DEFERRED_PCR_WORK);
 }
diff --git a/arch/x86/include/asm/irq_work.h b/arch/x86/include/asm/irq_work.h
index 38eed96..dad8266 100644
--- a/arch/x86/include/asm/irq_work.h
+++ b/arch/x86/include/asm/irq_work.h
@@ -1,10 +1,11 @@
 #ifndef _ASM_X86_IRQ_WORK_H
 #define _ASM_X86_IRQ_WORK_H
 
-#ifndef CONFIG_X86_LOCAL_APIC
-#include <asm-generic/irq_work.h>
-# else
-extern void arch_irq_work_raise(void);
+#ifdef CONFIG_X86_LOCAL_APIC
+extern void __arch_irq_work_raise(void);
+#define arch_irq_work_raise __arch_irq_work_raise
 #endif
 
+#include <asm-generic/irq_work.h>
+
 #endif
diff --git a/arch/x86/kernel/irq_work.c b/arch/x86/kernel/irq_work.c
index 95f5d4e..7389d5e 100644
--- a/arch/x86/kernel/irq_work.c
+++ b/arch/x86/kernel/irq_work.c
@@ -19,7 +19,7 @@ void smp_irq_work_interrupt(struct pt_regs *regs)
 }
 
 #ifdef CONFIG_X86_LOCAL_APIC
-void arch_irq_work_raise(void)
+void __arch_irq_work_raise(void)
 {
if (!cpu_has_apic)
return;
diff --git a/include/asm-generic/irq_work.h b/include/asm-generic/irq_work.h
index a2d4108..b172da0 100644
--- a/include/asm-generic/irq_work.h
+++ b/include/asm-generic/irq_work.h
@@ -4,6 +4,19 @@
 /*
  * Lame architectures will get the timer tick callback
  */
+#ifndef arch_irq_work_raise
 static inline void arch_irq_work_raise(void) { }
+#endif
+
+/*
+ * Unless told otherwise, consider the arch implements irq work
+ * through a hook to the timer tick.
+ */
+#ifndef arch_irq_work_use_tick
+static inline bool arch_irq_work_use_tick(void)
+{
+	return true;
+}
+#endif

[RFC PATCH 0/5] printk: Make it usable on nohz CPUs

2012-10-12 Thread Frederic Weisbecker
Hi,

So here is a proposal for making printk work correctly
on a tickless CPU.

Although it's targeted at the adaptive tickless
implementation, it's pretty standalone and generic, and also
works for printk() calls in idle.

It is based on latest linus tree.

Waiting for your comments.

Thanks.

PS: only built-tested for now.

Frederic Weisbecker (5):
  irq_work: Move irq_work_raise() declaration/default definition to
arch headers
  irq_work: Only run irq_work from tick if arch needs it
  x86: Implement arch_irq_work_use_tick
  nohz: Add API to check tick state
  printk: Wake up klogd with irq_work on nohz CPU

 arch/alpha/include/asm/irq_work.h|9 +++
 arch/alpha/kernel/time.c |2 +-
 arch/arm/include/asm/irq_work.h  |1 +
 arch/arm64/include/asm/irq_work.h|1 +
 arch/blackfin/include/asm/irq_work.h |1 +
 arch/frv/include/asm/irq_work.h  |1 +
 arch/hexagon/include/asm/irq_work.h  |1 +
 arch/mips/include/asm/irq_work.h |1 +
 arch/parisc/include/asm/irq_work.h   |1 +
 arch/powerpc/include/asm/irq_work.h  |8 ++
 arch/powerpc/kernel/time.c   |2 +-
 arch/s390/include/asm/irq_work.h |1 +
 arch/sh/include/asm/irq_work.h   |1 +
 arch/sparc/include/asm/irq_work.h|8 ++
 arch/sparc/kernel/pcr.c  |2 +-
 arch/x86/include/asm/irq_work.h  |   15 
 arch/x86/kernel/irq_work.c   |6 ++--
 include/asm-generic/irq_work.h   |   22 +
 include/linux/irq_work.h |1 +
 include/linux/tick.h |   16 -
 kernel/irq_work.c|7 -
 kernel/printk.c  |   42 ++
 kernel/time/tick-sched.c |2 +-
 kernel/timer.c   |2 +-
 24 files changed, 137 insertions(+), 16 deletions(-)
 create mode 100644 arch/alpha/include/asm/irq_work.h
 create mode 100644 arch/arm/include/asm/irq_work.h
 create mode 100644 arch/arm64/include/asm/irq_work.h
 create mode 100644 arch/blackfin/include/asm/irq_work.h
 create mode 100644 arch/frv/include/asm/irq_work.h
 create mode 100644 arch/hexagon/include/asm/irq_work.h
 create mode 100644 arch/mips/include/asm/irq_work.h
 create mode 100644 arch/parisc/include/asm/irq_work.h
 create mode 100644 arch/powerpc/include/asm/irq_work.h
 create mode 100644 arch/s390/include/asm/irq_work.h
 create mode 100644 arch/sh/include/asm/irq_work.h
 create mode 100644 arch/sparc/include/asm/irq_work.h
 create mode 100644 arch/x86/include/asm/irq_work.h
 create mode 100644 include/asm-generic/irq_work.h

-- 
1.7.5.4



[RFC PATCH 4/5] nohz: Add API to check tick state

2012-10-12 Thread Frederic Weisbecker
We need a quick way to check whether the CPU has stopped
its tick. This will be useful for printk when it wants
to wake up klogd on a nohz CPU.
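
As a hedged usage sketch (mirroring what the printk patch in this
series does; the work item below is a placeholder, not kernel code):

#include <linux/tick.h>
#include <linux/irq_work.h>

static void deferred_wakeup_func(struct irq_work *work)
{
	/* Runs from the irq work interrupt, where a wake up is safe */
}

static struct irq_work deferred_wakeup = {
	.func = deferred_wakeup_func,
};

static void poke_reader(void)
{
	/* Only pay for a self-IPI when the tick can't do it for us */
	if (tick_nohz_tick_stopped())
		irq_work_queue(&deferred_wakeup);
}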

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
---
 include/linux/tick.h |   16 +++-
 kernel/time/tick-sched.c |2 +-
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index f37fceb..05d1919 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -8,6 +8,7 @@
 
#include <linux/clockchips.h>
#include <linux/irqflags.h>
+#include <linux/percpu.h>
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 
@@ -122,13 +123,26 @@ static inline int tick_oneshot_mode_active(void) { return 0; }
 #endif /* !CONFIG_GENERIC_CLOCKEVENTS */
 
 # ifdef CONFIG_NO_HZ
+DECLARE_PER_CPU(struct tick_sched, tick_cpu_sched);
+
+static inline int tick_nohz_tick_stopped(void)
+{
+   return __this_cpu_read(tick_cpu_sched.tick_stopped);
+}
+
 extern void tick_nohz_idle_enter(void);
 extern void tick_nohz_idle_exit(void);
 extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
-# else
+
+# else /* !CONFIG_NO_HZ */
+static inline int tick_nohz_tick_stopped(void)
+{
+   return 0;
+}
+
 static inline void tick_nohz_idle_enter(void) { }
 static inline void tick_nohz_idle_exit(void) { }
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f423bdd..ccc1971 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -28,7 +28,7 @@
 /*
  * Per cpu nohz control structure
  */
-static DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched);
+DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched);
 
 /*
  * The time, when the last jiffy update happened. Protected by xtime_lock.
-- 
1.7.5.4



Re: lots of suspicious RCU traces

2012-10-24 Thread Frederic Weisbecker
First of all, thanks a lot for your report.

2012/10/24 Sergey Senozhatsky sergey.senozhat...@gmail.com:
 On (10/24/12 20:06), Oleg Nesterov wrote:
 On 10/24, Sergey Senozhatsky wrote:
 
  small question,
 
  ptrace_notify() and forward calls are able to both indirectly and directly 
  call schedule(),
  /* direct call from ptrace_stop()*/,
  should, in this case, rcu_user_enter() be called before 
  tracehook_report_syscall_exit(regs, step)
  and ptrace chain?

 Well, I don't really understand this magic... but why?


 My understanding is (I may be wrong) that we can schedule() from ptrace chain 
 to
 some arbitrary task, which will continue its execution from the point where 
 RCU assumes
 CPU as not idle, while CPU in fact still in idle state -- no one said 
 rcu_idle_exit()
 (or similar) prior to schedule() call.

Yeah but when we are in syscall_trace_leave(), the CPU shouldn't be in
RCU idle mode. That's where the bug is. How do you manage to trigger
this bug?


 if so, does the same apply to in_user?


[PATCH 1/5] vtime: Gather vtime declarations to their own header file

2012-10-24 Thread Frederic Weisbecker
These APIs are scattered around and are going to expand a bit.
Let's create a dedicated header file for sanity.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/hardirq.h |   11 +--
 include/linux/kernel_stat.h |9 +
 include/linux/vtime.h   |   22 ++
 3 files changed, 24 insertions(+), 18 deletions(-)
 create mode 100644 include/linux/vtime.h

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index cab3da3..b083a47 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -4,6 +4,7 @@
#include <linux/preempt.h>
#include <linux/lockdep.h>
#include <linux/ftrace_irq.h>
+#include <linux/vtime.h>
#include <asm/hardirq.h>
 
 /*
@@ -129,16 +130,6 @@ extern void synchronize_irq(unsigned int irq);
 # define synchronize_irq(irq)  barrier()
 #endif
 
-struct task_struct;
-
-#if !defined(CONFIG_VIRT_CPU_ACCOUNTING) && !defined(CONFIG_IRQ_TIME_ACCOUNTING)
-static inline void vtime_account(struct task_struct *tsk)
-{
-}
-#else
-extern void vtime_account(struct task_struct *tsk);
-#endif
-
 #if defined(CONFIG_TINY_RCU) || defined(CONFIG_TINY_PREEMPT_RCU)
 
 static inline void rcu_nmi_enter(void)
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 36d12f0..1865b1f 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -7,6 +7,7 @@
#include <linux/cpumask.h>
#include <linux/interrupt.h>
#include <linux/sched.h>
+#include <linux/vtime.h>
#include <asm/irq.h>
#include <asm/cputime.h>
 
@@ -130,12 +131,4 @@ extern void account_process_tick(struct task_struct *, int user);
 extern void account_steal_ticks(unsigned long ticks);
 extern void account_idle_ticks(unsigned long ticks);
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
-extern void vtime_task_switch(struct task_struct *prev);
-extern void vtime_account_system(struct task_struct *tsk);
-extern void vtime_account_idle(struct task_struct *tsk);
-#else
-static inline void vtime_task_switch(struct task_struct *prev) { }
-#endif
-
 #endif /* _LINUX_KERNEL_STAT_H */
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
new file mode 100644
index 000..7199c24
--- /dev/null
+++ b/include/linux/vtime.h
@@ -0,0 +1,22 @@
+#ifndef _LINUX_KERNEL_VTIME_H
+#define _LINUX_KERNEL_VTIME_H
+
+struct task_struct;
+
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+extern void vtime_task_switch(struct task_struct *prev);
+extern void vtime_account_system(struct task_struct *tsk);
+extern void vtime_account_idle(struct task_struct *tsk);
+#else
+static inline void vtime_task_switch(struct task_struct *prev) { }
+#endif
+
+#if !defined(CONFIG_VIRT_CPU_ACCOUNTING) && !defined(CONFIG_IRQ_TIME_ACCOUNTING)
+static inline void vtime_account(struct task_struct *tsk)
+{
+}
+#else
+extern void vtime_account(struct task_struct *tsk);
+#endif
+
+#endif /* _LINUX_KERNEL_VTIME_H */
-- 
1.7.5.4



[PATCH 4/5] cputime: Specialize irq vtime hooks

2012-10-24 Thread Frederic Weisbecker
With CONFIG_VIRT_CPU_ACCOUNTING, when vtime_account()
is called on irq entry/exit, we perform a check on the
context: if we are interrupting the idle task we
account the pending cputime to idle, otherwise we account
to system time or its sub-areas: tsk->stime, hardirq time,
softirq time, ...

However this idle check only matters on hardirq entry
and softirq entry:

* A hardirq may directly interrupt the idle task, in which case
we need to flush the pending cputime to idle.

* The idle task may be directly interrupted by a softirq if
it calls local_bh_enable(). There is probably no such call
in any idle task but we need to cover every case. Ksoftirqd
is not concerned because the idle time is flushed on context
switch, and softirqs at the end of a hardirq already have the
idle time flushed by the hardirq entry.

In the other cases we always account to system/irq time:

* On hardirq exit we account the time to hardirq time.
* On softirq exit we account the time to softirq time.

To optimize this and avoid the indirect call to vtime_account()
and the checks it performs, specialize the vtime irq APIs so that
the check is only performed on irq entry. Irq exit can directly
call vtime_account_system().

CONFIG_IRQ_TIME_ACCOUNTING behaviour doesn't change and directly
maps to its own vtime_account() implementation. One may want
to take advantage of the new APIs to optimize irq time accounting
as well in the future.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/hardirq.h |4 ++--
 include/linux/vtime.h   |   25 +
 kernel/softirq.c|6 +++---
 3 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index b083a47..624ef3f 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -153,7 +153,7 @@ extern void rcu_nmi_exit(void);
  */
 #define __irq_enter()  \
do {\
-   vtime_account(current); \
+   vtime_account_irq_enter(current);   \
add_preempt_count(HARDIRQ_OFFSET);  \
trace_hardirq_enter();  \
} while (0)
@@ -169,7 +169,7 @@ extern void irq_enter(void);
 #define __irq_exit()   \
do {\
trace_hardirq_exit();   \
-   vtime_account(current); \
+   vtime_account_irq_exit(current);\
sub_preempt_count(HARDIRQ_OFFSET);  \
} while (0)
 
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 57f290e..40634fb 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -21,4 +21,29 @@ static inline void vtime_account(struct task_struct *tsk)
 extern void vtime_account(struct task_struct *tsk);
 #endif
 
+static inline void vtime_account_irq_enter(struct task_struct *tsk)
+{
+   /*
+    * Hardirq can interrupt the idle task anytime. So we need vtime_account()
+    * that performs the idle check in CONFIG_VIRT_CPU_ACCOUNTING.
+    * Softirq can also interrupt the idle task directly if it calls
+    * local_bh_enable(). Such a case probably doesn't exist but we never know.
+    * Ksoftirqd is not concerned because idle time is flushed on context
+    * switch. Softirqs at the end of hardirqs are also not a problem because
+    * the idle time has already been flushed on hardirq entry.
+    */
+   vtime_account(tsk);
+}
+
+static inline void vtime_account_irq_exit(struct task_struct *tsk)
+{
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+   /* On hard|softirq exit we always account to hard|softirq cputime */
+   vtime_account_system(tsk);
+#endif
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+   vtime_account(tsk);
+#endif
+}
+
 #endif /* _LINUX_KERNEL_VTIME_H */
diff --git a/kernel/softirq.c b/kernel/softirq.c
index cc96bdc..ed567ba 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -221,7 +221,7 @@ asmlinkage void __do_softirq(void)
current->flags &= ~PF_MEMALLOC;
 
pending = local_softirq_pending();
-   vtime_account(current);
+   vtime_account_irq_enter(current);
 
__local_bh_disable((unsigned long)__builtin_return_address(0),
SOFTIRQ_OFFSET);
@@ -272,7 +272,7 @@ restart:
 
lockdep_softirq_exit();
 
-   vtime_account(current);
+   vtime_account_irq_exit(current);
__local_bh_enable(SOFTIRQ_OFFSET);
tsk_restore_flags(current, old_flags, PF_MEMALLOC);
 }
@@ -341,7 +341,7 @@ static inline void invoke_softirq(void)
  */
 void irq_exit(void)
 {
-   vtime_account(current);
+   vtime_account_irq_exit(current);

[PATCH 3/5] kvm: Directly account vtime to system on guest switch

2012-10-24 Thread Frederic Weisbecker
Switching to or from guest context is done in ioctl context.
So by the time we call kvm_guest_enter() or kvm_guest_exit()
we know we are not running the idle task.

As a result, we can directly account the cputime using
vtime_account_system_irqsafe().

There are three good reasons to do this:

* We avoid some useless checks on guest switch. It optimizes
a bit this fast path.

* In the case of CONFIG_IRQ_TIME_ACCOUNTING, calling vtime_account()
checks for irq time to account. This is pointless since we know
we are not in an irq on guest switch. This is wasting cpu cycles
for no good reason. vtime_account_system() OTOH is a no-op in
this config option.

* s390 doesn't disable irqs in its implementation of vtime_account().
If vtime_account() in kvm races with an irq, the pending time might
be accounted twice. With vtime_account_system_irqsafe() we are protected.
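
To make the race concrete, here is a hedged sketch (the helper below
just restates vtime_account_system_irqsafe() from patch 2/5):

#include <linux/irqflags.h>
#include <linux/sched.h>
#include <linux/vtime.h>

/*
 * Without irqs disabled, an irq can fire between sampling the pending
 * delta and flushing it; the irq exit path then accounts that same
 * delta, and the interrupted caller accounts it a second time.
 * Disabling irqs makes the sample-and-flush atomic wrt interrupts:
 */
static void guest_switch_account(struct task_struct *tsk)
{
	unsigned long flags;

	local_irq_save(flags);
	vtime_account_system(tsk);
	local_irq_restore(flags);
}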

A further optimization may consist in introducing a vtime_account_guest()
that directly calls account_guest_time().
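
A hedged sketch of what that helper could look like, assuming it sits
in kernel/sched/cputime.c next to account_guest_time(), and with
get_vtime_delta() standing in for a pending-delta helper that does
not exist yet:

/* Hypothetical: flush the pending cputime straight to guest time */
void vtime_account_guest(struct task_struct *tsk)
{
	cputime_t delta = get_vtime_delta(tsk);	/* assumed helper */

	/* Scaled cputime assumed equal to raw cputime in this sketch */
	account_guest_time(tsk, delta, delta);
}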

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Tony Luck tony.l...@intel.com
Cc: Fenghua Yu fenghua...@intel.com
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Heiko Carstens heiko.carst...@de.ibm.com
Cc: Martin Schwidefsky schwidef...@de.ibm.com
Cc: Avi Kivity a...@redhat.com
Cc: Marcelo Tosatti mtosa...@redhat.com
Cc: Joerg Roedel joerg.roe...@amd.com
Cc: Alexander Graf ag...@suse.de
Cc: Xiantao Zhang xiantao.zh...@intel.com
Cc: Christian Borntraeger borntrae...@de.ibm.com
Cc: Cornelia Huck cornelia.h...@de.ibm.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/kvm_host.h |   12 ++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 93bfc9f..f17158b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -737,7 +737,11 @@ static inline int kvm_deassign_device(struct kvm *kvm,
 static inline void kvm_guest_enter(void)
 {
BUG_ON(preemptible());
-   vtime_account(current);
+   /*
+* This is running in ioctl context so we can avoid
+* the call to vtime_account() with its unnecessary idle check.
+*/
+   vtime_account_system_irqsafe(current);
current->flags |= PF_VCPU;
/* KVM does not hold any references to rcu protected data when it
 * switches CPU into a guest mode. In fact switching to a guest mode
@@ -751,7 +755,11 @@ static inline void kvm_guest_enter(void)
 
 static inline void kvm_guest_exit(void)
 {
-   vtime_account(current);
+   /*
+* This is running in ioctl context so we can avoid
+* the call to vtime_account() with its unnecessary idle check.
+*/
+   vtime_account_system_irqsafe(current);
current->flags &= ~PF_VCPU;
 }
 
-- 
1.7.5.4



[PATCH 0/5] cputime: Moar cleanups / enhancements v2

2012-10-24 Thread Frederic Weisbecker
I made some changes to the series:

- Handle possible softirq interrupting idle on local_bh_enable()
- Have a dedicated vtime.h header file
- Do some clearer ifdeffery
- Use an irqsafe vtime_account_system() on kvm.

Patches 1/5 and 5/5 are something I really think we want. Patch 3/5
implicitly fixes a bug in s390. If you prefer I can just fix s390
and drop the rest of the patch that is just a micro-optimization
in the kvm guest switch path.

The rest of the patches are micro-optimizations on the irq path.
If you think these are pointless over-optimizations, I can just drop
these and only keep 1/5, rebase 5/5 and extract the bug fix in s390 that
resides in 3/5.

Otherwise I'll send a pull request to Ingo in a week or so.

If you want to test, it is pullable from:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
cputime/cleanups-v3

Tested on powerpc and x86. Build-tested on ia64. s390 doesn't build defconfig
on v3.7-rc2.

Frederic Weisbecker (5):
  vtime: Gather vtime declarations to their own header file
  vtime: Provide an irq safe version of vtime_account_system()
  kvm: Directly account vtime to system on guest switch
  cputime: Specialize irq vtime hooks
  cputime: Separate irqtime accounting from generic vtime

 arch/s390/kernel/vtime.c|4 +++
 include/linux/hardirq.h |   15 ++--
 include/linux/kernel_stat.h |9 +---
 include/linux/kvm_host.h|   12 +-
 include/linux/vtime.h   |   48 +++
 kernel/sched/cputime.c  |   13 +-
 kernel/softirq.c|6 ++--
 7 files changed, 80 insertions(+), 27 deletions(-)
 create mode 100644 include/linux/vtime.h

-- 
1.7.5.4



[PATCH 5/5] cputime: Separate irqtime accounting from generic vtime

2012-10-24 Thread Frederic Weisbecker
vtime_account() doesn't have the same role in
CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_IRQ_TIME_ACCOUNTING.

In the first case it handles time accounting in any context. In
the second case it only handles irq time accounting.

So when vtime_account() is called from outside vtime_account_irq_*(),
this call is pointless for CONFIG_IRQ_TIME_ACCOUNTING.

To fix the confusion, change vtime_account() to irqtime_account_irq()
in CONFIG_IRQ_TIME_ACCOUNTING. This way we ensure future vtime_account()
calls won't waste cycles in the irqtime APIs.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/vtime.h  |   19 +--
 kernel/sched/cputime.c |4 ++--
 2 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 40634fb..f1f91ac 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -8,17 +8,18 @@ extern void vtime_task_switch(struct task_struct *prev);
 extern void vtime_account_system(struct task_struct *tsk);
 extern void vtime_account_system_irqsafe(struct task_struct *tsk);
 extern void vtime_account_idle(struct task_struct *tsk);
+extern void vtime_account(struct task_struct *tsk);
 #else
 static inline void vtime_task_switch(struct task_struct *prev) { }
+static inline void vtime_account_system(struct task_struct *tsk) { }
 static inline void vtime_account_system_irqsafe(struct task_struct *tsk) { }
+static inline void vtime_account(struct task_struct *tsk) { }
 #endif
 
-#if !defined(CONFIG_VIRT_CPU_ACCOUNTING) && !defined(CONFIG_IRQ_TIME_ACCOUNTING)
-static inline void vtime_account(struct task_struct *tsk)
-{
-}
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+extern void irqtime_account_irq(struct task_struct *tsk);
 #else
-extern void vtime_account(struct task_struct *tsk);
+static inline void irqtime_account_irq(struct task_struct *tsk) { }
 #endif
 
 static inline void vtime_account_irq_enter(struct task_struct *tsk)
@@ -33,17 +34,15 @@ static inline void vtime_account_irq_enter(struct task_struct *tsk)
 * the idle time has already been flushed on hardirq entry.
 */
vtime_account(tsk);
+   irqtime_account_irq(tsk);
 }
 
 static inline void vtime_account_irq_exit(struct task_struct *tsk)
 {
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
/* On hard|softirq exit we always account to hard|softirq cputime */
vtime_account_system(tsk);
-#endif
-#ifdef CONFIG_IRQ_TIME_ACCOUNTING
-   vtime_account(tsk);
-#endif
+   irqtime_account_irq(tsk);
+
 }
 
 #endif /* _LINUX_KERNEL_VTIME_H */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3ccbea0..1a7e1369 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -43,7 +43,7 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
  * Called before incrementing preempt_count on {soft,}irq_enter
  * and before decrementing preempt_count on {soft,}irq_exit.
  */
-void vtime_account(struct task_struct *curr)
+void irqtime_account_irq(struct task_struct *curr)
 {
unsigned long flags;
s64 delta;
@@ -73,7 +73,7 @@ void vtime_account(struct task_struct *curr)
irq_time_write_end();
local_irq_restore(flags);
 }
-EXPORT_SYMBOL_GPL(vtime_account);
+EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
 static int irqtime_account_hi_update(void)
 {
-- 
1.7.5.4



[PATCH 2/5] vtime: Provide an irq safe version of vtime_account_system()

2012-10-24 Thread Frederic Weisbecker
vtime_account_system() currently has only one caller,
vtime_account(), which is irq safe.

Now we are going to call it from other places like kvm, so
let's provide an irqsafe version.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 arch/s390/kernel/vtime.c |4 
 include/linux/vtime.h|2 ++
 kernel/sched/cputime.c   |9 +
 3 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index 7903344..80d1dbc 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -140,6 +140,10 @@ void vtime_account(struct task_struct *tsk)
 }
 EXPORT_SYMBOL_GPL(vtime_account);
 
+void vtime_account_system(struct task_struct *tsk)
+__attribute__((alias("vtime_account")));
+EXPORT_SYMBOL_GPL(vtime_account_system);
+
 void __kprobes vtime_stop_cpu(void)
 {
struct s390_idle_data *idle = &__get_cpu_var(s390_idle);
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 7199c24..57f290e 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -6,9 +6,11 @@ struct task_struct;
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING
 extern void vtime_task_switch(struct task_struct *prev);
 extern void vtime_account_system(struct task_struct *tsk);
+extern void vtime_account_system_irqsafe(struct task_struct *tsk);
 extern void vtime_account_idle(struct task_struct *tsk);
 #else
 static inline void vtime_task_switch(struct task_struct *prev) { }
+static inline void vtime_account_system_irqsafe(struct task_struct *tsk) { }
 #endif
 
#if !defined(CONFIG_VIRT_CPU_ACCOUNTING) && !defined(CONFIG_IRQ_TIME_ACCOUNTING)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 81b763b..3ccbea0 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -433,6 +433,15 @@ void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
*st = cputime.stime;
 }
 
+void vtime_account_system_irqsafe(struct task_struct *tsk)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   vtime_account_system(tsk);
+   local_irq_restore(flags);
+}
+
 /*
  * Archs that account the whole time spent in the idle task
  * (outside irq) as idle time can rely on this and just implement
-- 
1.7.5.4



Re: lots of suspicious RCU traces

2012-10-25 Thread Frederic Weisbecker
2012/10/25 Sergey Senozhatsky sergey.senozhat...@gmail.com:
 On (10/25/12 00:32), Frederic Weisbecker wrote:
 First of all, thanks a lot for your report.

 2012/10/24 Sergey Senozhatsky sergey.senozhat...@gmail.com:
  On (10/24/12 20:06), Oleg Nesterov wrote:
  On 10/24, Sergey Senozhatsky wrote:
  
   small question,
  
   ptrace_notify() and forward calls are able to both indirectly and 
   directly call schedule(),
   /* direct call from ptrace_stop()*/,
   should, in this case, rcu_user_enter() be called before 
   tracehook_report_syscall_exit(regs, step)
   and ptrace chain?
 
  Well, I don't really understand this magic... but why?
 
 
  My understanding is (I may be wrong) that we can schedule() from ptrace 
  chain to
  some arbitrary task, which will continue its execution from the point 
  where RCU assumes
  CPU as not idle, while CPU in fact still in idle state -- no one said 
  rcu_idle_exit()
  (or similar) prior to schedule() call.

 Yeah but when we are in syscall_trace_leave(), the CPU shouldn't be in
 RCU idle mode. That's where the bug is. How do you manage to trigger
 this bug?


 strace -f anything

I can't reproduce. Can you send me your config?

Thanks.

