[PATCH 10/27] rcu: Restart the tick on non-responding full dynticks CPUs

2012-12-29 Thread Frederic Weisbecker
When a CPU in full dynticks mode doesn't respond in time to help
complete a grace period, send it a specific IPI so that it restarts
the tick and eventually reports a quiescent state.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
Signed-off-by: Steven Rostedt rost...@goodmis.org
---
 kernel/rcutree.c |   10 ++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index e441b77..302d360 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -53,6 +53,7 @@
 #include <linux/delay.h>
 #include <linux/stop_machine.h>
 #include <linux/random.h>
+#include <linux/tick.h>
 
 #include "rcutree.h"
 #include <trace/events/rcu.h>
@@ -743,6 +744,12 @@ static int dyntick_save_progress_counter(struct rcu_data *rdp)
 	return (rdp->dynticks_snap & 0x1) == 0;
 }
 
+static void rcu_kick_nohz_cpu(int cpu)
+{
+	if (tick_nohz_full_cpu(cpu))
+		smp_send_reschedule(cpu);
+}
+
 /*
  * Return true if the specified CPU has passed through a quiescent
  * state by virtue of being in or having passed through an dynticks
@@ -790,6 +797,9 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
 		rdp->offline_fqs++;
 		return 1;
 	}
+
+	rcu_kick_nohz_cpu(rdp->cpu);
+
return 0;
 }
 
-- 
1.7.5.4


[PATCH 12/27] sched: Update rq clock on nohz CPU before migrating tasks

2012-12-29 Thread Frederic Weisbecker
The sched_class::put_prev_task() callbacks of the rt and fair
classes refer to the rq clock to update their runtime
statistics, and a CPU running in tickless mode may carry a stale
value. We need to update the clock there before migrating tasks.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/sched/core.c  |6 ++
 kernel/sched/sched.h |7 +++
 2 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bfac40f..2fcbb03 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4894,6 +4894,12 @@ static void migrate_tasks(unsigned int dead_cpu)
 */
 	rq->stop = NULL;
 
+	/*
+	 * ->put_prev_task() need to have an up-to-date value
+	 * of rq->clock[_task]
+	 */
+	update_nohz_rq_clock(rq);
+
for ( ; ; ) {
/*
 * There's this thread running, bail when that's the only
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fc88644..f24d91e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3,6 +3,7 @@
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
 #include <linux/stop_machine.h>
+#include <linux/tick.h>
 
 #include "cpupri.h"
 
@@ -963,6 +964,12 @@ static inline void dec_nr_running(struct rq *rq)
 
 extern void update_rq_clock(struct rq *rq);
 
+static inline void update_nohz_rq_clock(struct rq *rq)
+{
+	if (tick_nohz_full_cpu(cpu_of(rq)))
+		update_rq_clock(rq);
+}
+
 extern void activate_task(struct rq *rq, struct task_struct *p, int flags);
 extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
 
-- 
1.7.5.4


[PATCH 14/27] sched: Update rq clock on tickless CPUs before calling check_preempt_curr()

2012-12-29 Thread Frederic Weisbecker
check_preempt_wakeup() of the fair class needs an up-to-date sched
clock value to update the runtime stats of the current task.

When a task is woken up, activate_task() is usually called right before
ttwu_do_wakeup() unless the task is already on the runqueue. In that
case we need to update the rq clock manually if the CPU runs
tickless, because ttwu_do_wakeup() calls check_preempt_wakeup().

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/sched/core.c |   17 -
 1 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2fcbb03..3c1a806 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1346,6 +1346,12 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
 
 	rq = __task_rq_lock(p);
 	if (p->on_rq) {
+		/*
+		 * Ensure check_preempt_curr() won't deal with a stale value
+		 * of rq clock if the CPU is tickless. BTW do we actually need
+		 * check_preempt_curr() to be called here?
+		 */
+		update_nohz_rq_clock(rq);
ttwu_do_wakeup(rq, p, wake_flags);
ret = 1;
}
@@ -1523,8 +1529,17 @@ static void try_to_wake_up_local(struct task_struct *p)
 	if (!(p->state & TASK_NORMAL))
 		goto out;
 
-	if (!p->on_rq)
+	if (!p->on_rq) {
 		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
+	} else {
+		/*
+		 * Even if the task is on the runqueue we still
+		 * need to ensure check_preempt_curr() won't
+		 * deal with a stale rq clock value on a tickless
+		 * CPU
+		 */
+		update_nohz_rq_clock(rq);
+	}
 
ttwu_do_wakeup(rq, p, 0);
ttwu_stat(p, smp_processor_id(), 0);
-- 
1.7.5.4


[PATCH 18/27] sched: Update nohz rq clock before searching busiest group on load balancing

2012-12-29 Thread Frederic Weisbecker
While load balancing an rq target, we look for the busiest group.
This operation may require an up-to-date rq clock if we end up calling
scale_rt_power(). To this end, update it manually if the target is
running tickless.

DOUBT: don't we actually also need this in the vanilla kernel, in case
this_cpu is in dyntick-idle mode?

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/sched/fair.c |   13 +
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 698137d..473f50f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5023,6 +5023,19 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 
schedstat_inc(sd, lb_count[idle]);
 
+	/*
+	 * find_busiest_group() may need an uptodate cpu clock
+	 * for find_busiest_group() (see scale_rt_power()). If
+	 * the CPU is nohz, it's clock may be stale.
+	 */
+	if (tick_nohz_full_cpu(this_cpu)) {
+		local_irq_save(flags);
+		raw_spin_lock(&this_rq->lock);
+		update_rq_clock(this_rq);
+		raw_spin_unlock(&this_rq->lock);
+		local_irq_restore(flags);
+	}
+
 redo:
group = find_busiest_group(env, balance);
 
-- 
1.7.5.4


[PATCH 21/27] nohz: Only stop the tick on RCU nocb CPUs

2012-12-29 Thread Frederic Weisbecker
On a full dynticks CPU, we want the RCU callbacks to be
offloaded to another CPU, otherwise we need to keep
the tick to wait for the grace period completion.

Ensure the full dynticks CPU is also an rcu_nocb one.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 include/linux/rcupdate.h |7 +++
 kernel/rcutree.c |6 +++---
 kernel/rcutree_plugin.h  |   13 -
 kernel/time/tick-sched.c |   20 +---
 4 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 275aa3f..829312e 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -992,4 +992,11 @@ static inline notrace void rcu_read_unlock_sched_notrace(void)
 #define kfree_rcu(ptr, rcu_head)					\
 	__kfree_rcu(&((ptr)->rcu_head), offsetof(typeof(*(ptr)), rcu_head))
 
+#ifdef CONFIG_RCU_NOCB_CPU
+bool rcu_is_nocb_cpu(int cpu);
+#else
+static inline bool rcu_is_nocb_cpu(int cpu) { return false; };
+#endif
+
+
 #endif /* __LINUX_RCUPDATE_H */
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 302d360..e9e0ffa 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1589,7 +1589,7 @@ rcu_send_cbs_to_orphanage(int cpu, struct rcu_state *rsp,
  struct rcu_node *rnp, struct rcu_data *rdp)
 {
/* No-CBs CPUs do not have orphanable callbacks. */
-	if (is_nocb_cpu(rdp->cpu))
+	if (rcu_is_nocb_cpu(rdp->cpu))
 		return;
return;
 
/*
@@ -2651,10 +2651,10 @@ static void _rcu_barrier(struct rcu_state *rsp)
 * corresponding CPU's preceding callbacks have been invoked.
 */
for_each_possible_cpu(cpu) {
-		if (!cpu_online(cpu) && !is_nocb_cpu(cpu))
+		if (!cpu_online(cpu) && !rcu_is_nocb_cpu(cpu))
 			continue;
 		rdp = per_cpu_ptr(rsp->rda, cpu);
-		if (is_nocb_cpu(cpu)) {
+		if (rcu_is_nocb_cpu(cpu)) {
 			_rcu_barrier_trace(rsp, "OnlineNoCB", cpu,
 					   rsp->n_barrier_done);
 			atomic_inc(&rsp->barrier_cpu_count);
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index f6e5ec2..625b327 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2160,7 +2160,7 @@ static int __init rcu_nocb_setup(char *str)
 __setup("rcu_nocbs=", rcu_nocb_setup);
 
 /* Is the specified CPU a no-CPUs CPU? */
-static bool is_nocb_cpu(int cpu)
+bool rcu_is_nocb_cpu(int cpu)
 {
 	if (have_rcu_nocb_mask)
 		return cpumask_test_cpu(cpu, rcu_nocb_mask);
@@ -2218,7 +2218,7 @@ static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
 			    bool lazy)
 {
 
-	if (!is_nocb_cpu(rdp->cpu))
+	if (!rcu_is_nocb_cpu(rdp->cpu))
 		return 0;
 	__call_rcu_nocb_enqueue(rdp, rhp, &rhp->next, 1, lazy);
 	return 1;
@@ -2235,7 +2235,7 @@ static bool __maybe_unused rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp,
 	long qll = rsp->qlen_lazy;
 
 	/* If this is not a no-CBs CPU, tell the caller to do it the old way. */
-	if (!is_nocb_cpu(smp_processor_id()))
+	if (!rcu_is_nocb_cpu(smp_processor_id()))
 		return 0;
 	rsp->qlen = 0;
 	rsp->qlen_lazy = 0;
@@ -2275,7 +2275,7 @@ static bool nocb_cpu_expendable(int cpu)
 * If there are no no-CB CPUs or if this CPU is not a no-CB CPU,
 * then offlining this CPU is harmless.  Let it happen.
 */
-   if (!have_rcu_nocb_mask || is_nocb_cpu(cpu))
+   if (!have_rcu_nocb_mask || rcu_is_nocb_cpu(cpu))
return 1;
 
/* If no memory, play it safe and keep the CPU around. */
@@ -2456,11 +2456,6 @@ static void __init rcu_init_nocb(void)
 
 #else /* #ifdef CONFIG_RCU_NOCB_CPU */
 
-static bool is_nocb_cpu(int cpu)
-{
-   return false;
-}
-
 static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
bool lazy)
 {
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9d31b08..78e5341 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -587,6 +587,19 @@ void tick_nohz_idle_enter(void)
local_irq_enable();
 }
 
+#ifdef CONFIG_NO_HZ_FULL
+static bool can_stop_full_tick(int cpu)
+{
+   if (!sched_can_stop_tick())
+   return false;
+
+   if (!rcu_is_nocb_cpu(cpu

[PATCH 23/27] nohz: Don't stop the tick if posix cpu timers are running

2012-12-29 Thread Frederic Weisbecker
If either a per-thread or a per-process posix cpu timer is running,
don't stop the tick.

TODO: restart the tick if it is stopped and a posix cpu timer gets
enqueued. Also check whether we need a memory barrier for the
per-process posix timer that can be enqueued from another task
of the group.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 include/linux/posix-timers.h |1 +
 kernel/posix-cpu-timers.c|   11 +++
 kernel/time/tick-sched.c |4 
 3 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 042058f..97480c2 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -119,6 +119,7 @@ int posix_timer_event(struct k_itimer *timr, int si_private);
 void posix_cpu_timer_schedule(struct k_itimer *timer);
 
 void run_posix_cpu_timers(struct task_struct *task);
+bool posix_cpu_timers_running(struct task_struct *tsk);
 void posix_cpu_timers_exit(struct task_struct *task);
 void posix_cpu_timers_exit_group(struct task_struct *task);
 
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 165d476..15f8f4f 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -1269,6 +1269,17 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
 	return 0;
 }
 
+bool posix_cpu_timers_running(struct task_struct *tsk)
+{
+	if (!task_cputime_zero(&tsk->cputime_expires))
+		return true;
+
+	if (tsk->signal->cputimer.running)
+		return true;
+
+	return false;
+}
+
 /*
  * This is called from the timer interrupt handler.  The irq handler has
  * already updated our counts.  We need to check if any timers fire now.
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 04504c4..eb6ad3d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -21,6 +21,7 @@
 #include <linux/sched.h>
 #include <linux/module.h>
 #include <linux/irq_work.h>
+#include <linux/posix-timers.h>
 
 #include asm/irq_regs.h
 
@@ -599,6 +600,9 @@ static bool can_stop_full_tick(int cpu)
if (rcu_pending(cpu))
return false;
 
+   if (posix_cpu_timers_running(current))
+   return false;
+
return true;
 }
 #endif
-- 
1.7.5.4


[PATCH 25/27] rcu: Don't keep the tick for RCU while in userspace

2012-12-29 Thread Frederic Weisbecker
If we are interrupting userspace, we don't need to keep
the tick for RCU: quiescent states don't need to be reported
because we will soon resume running in userspace, and local
callbacks are handled by the nocb threads.

CHECKME: Do the nocb threads actually handle the global
grace period completion for local callbacks?

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/time/tick-sched.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 0e1ebff..76d1b08 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -22,6 +22,7 @@
 #include <linux/module.h>
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
+#include <linux/context_tracking.h>
 
 #include asm/irq_regs.h
 
@@ -604,10 +605,9 @@ static bool can_stop_full_tick(int cpu)
 
 	/*
 	 * Keep the tick if we are asked to report a quiescent state.
-	 * This must be further optimized (avoid checks for local callbacks,
-	 * ignore RCU in userspace, etc...
+	 * This must be further optimized (avoid checks for local callbacks)
 	 */
-	if (rcu_pending(cpu)) {
+	if (!context_tracking_in_user() && rcu_pending(cpu)) {
 		trace_printk("Can't stop: RCU pending\n");
 		return false;
 	}
-- 
1.7.5.4


[PATCH 26/27] profiling: Remove unused timer hook

2012-12-29 Thread Frederic Weisbecker
The last remaining user was oprofile, and its use was removed
a while ago in commit bc078e4eab65f11bbaeed380593ab8151b30d703
("oprofile: convert oprofile from timer_hook to hrtimer").

There doesn't seem to have been any upstream user of this hook
for about two years now, and I'm not even aware of any out-of-tree
user.

Let's remove it.
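
For anyone still carrying an out-of-tree user, the migration path is the
same one oprofile took: a plain hrtimer. A minimal sketch, with made-up
names (sample_timer, sample_timer_fn) and assuming roughly one callback
per tick:

	#include <linux/hrtimer.h>
	#include <linux/ktime.h>

	static struct hrtimer sample_timer;	/* illustrative example timer */
	static ktime_t sample_period;

	static enum hrtimer_restart sample_timer_fn(struct hrtimer *timer)
	{
		/* do here what the old timer hook used to do once per tick */
		hrtimer_forward_now(timer, sample_period);
		return HRTIMER_RESTART;
	}

	static void sample_timer_start(void)
	{
		sample_period = ns_to_ktime(NSEC_PER_SEC / HZ);
		hrtimer_init(&sample_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
		sample_timer.function = sample_timer_fn;
		hrtimer_start(&sample_timer, sample_period, HRTIMER_MODE_REL);
	}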

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 include/linux/profile.h |   13 -
 kernel/profile.c|   24 
 2 files changed, 0 insertions(+), 37 deletions(-)

diff --git a/include/linux/profile.h b/include/linux/profile.h
index a0fc322..2112390 100644
--- a/include/linux/profile.h
+++ b/include/linux/profile.h
@@ -82,9 +82,6 @@ int task_handoff_unregister(struct notifier_block * n);
 int profile_event_register(enum profile_type, struct notifier_block * n);
 int profile_event_unregister(enum profile_type, struct notifier_block * n);
 
-int register_timer_hook(int (*hook)(struct pt_regs *));
-void unregister_timer_hook(int (*hook)(struct pt_regs *));
-
 struct pt_regs;
 
 #else
@@ -135,16 +132,6 @@ static inline int profile_event_unregister(enum profile_type t, struct notifier_
 #define profile_handoff_task(a) (0)
 #define profile_munmap(a) do { } while (0)
 
-static inline int register_timer_hook(int (*hook)(struct pt_regs *))
-{
-   return -ENOSYS;
-}
-
-static inline void unregister_timer_hook(int (*hook)(struct pt_regs *))
-{
-   return;
-}
-
 #endif /* CONFIG_PROFILING */
 
 #endif /* _LINUX_PROFILE_H */
diff --git a/kernel/profile.c b/kernel/profile.c
index 1f39181..dc3384e 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -37,9 +37,6 @@ struct profile_hit {
 #define NR_PROFILE_HIT (PAGE_SIZE/sizeof(struct profile_hit))
 #define NR_PROFILE_GRP (NR_PROFILE_HIT/PROFILE_GRPSZ)
 
-/* Oprofile timer tick hook */
-static int (*timer_hook)(struct pt_regs *) __read_mostly;
-
 static atomic_t *prof_buffer;
 static unsigned long prof_len, prof_shift;
 
@@ -208,25 +205,6 @@ int profile_event_unregister(enum profile_type type, struct notifier_block *n)
 }
 EXPORT_SYMBOL_GPL(profile_event_unregister);
 
-int register_timer_hook(int (*hook)(struct pt_regs *))
-{
-   if (timer_hook)
-   return -EBUSY;
-   timer_hook = hook;
-   return 0;
-}
-EXPORT_SYMBOL_GPL(register_timer_hook);
-
-void unregister_timer_hook(int (*hook)(struct pt_regs *))
-{
-   WARN_ON(hook != timer_hook);
-   timer_hook = NULL;
-   /* make sure all CPUs see the NULL hook */
-   synchronize_sched();  /* Allow ongoing interrupts to complete. */
-}
-EXPORT_SYMBOL_GPL(unregister_timer_hook);
-
-
 #ifdef CONFIG_SMP
 /*
  * Each cpu has a pair of open-addressed hashtables for pending
@@ -436,8 +414,6 @@ void profile_tick(int type)
 {
struct pt_regs *regs = get_irq_regs();
 
-	if (type == CPU_PROFILING && timer_hook)
-		timer_hook(regs);
 	if (!user_mode(regs) && prof_cpu_mask != NULL &&
 	    cpumask_test_cpu(smp_processor_id(), prof_cpu_mask))
 		profile_hit(type, (void *)profile_pc(regs));
-- 
1.7.5.4


[PATCH 27/27] timer: Don't run non-pinned timer to full dynticks CPUs

2012-12-29 Thread Frederic Weisbecker
While trying to find a target for a non-pinned timer, use
the following logic:

- Use the closest (from a sched domain POV) busy CPU that
is not full dynticks

- If none, use the closest idle CPU that is not full dynticks.

So this is biased toward isolation over powersaving. This is
a quick hack until we provide a way for the user to tune that
policy. A cpumask affinity for non-pinned timers could be such
a solution.
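
Stripped of the sched domain walk and locking, the selection boils down
to something like this simplified sketch (the pick_timer_cpu() name is
made up; it is not the literal code in the patch below):

	/* Pick a CPU for a non-pinned timer, preferring isolation over powersaving. */
	static int pick_timer_cpu(int this_cpu)
	{
		int i, fallback = -1;

		for_each_online_cpu(i) {	/* the patch walks sched domains instead */
			if (tick_nohz_full_cpu(i))
				continue;	/* never disturb a full dynticks CPU */
			if (!idle_cpu(i))
				return i;	/* best case: a busy, ticking CPU */
			if (fallback == -1)
				fallback = i;	/* remember the first idle candidate */
		}
		return fallback == -1 ? this_cpu : fallback;
	}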

Original-patch-by: Thomas Gleixner t...@linutronix.de
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/hrtimer.c|3 ++-
 kernel/sched/core.c |   26 +++---
 kernel/timer.c  |3 ++-
 3 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 6db7a5e..f5da6fb 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -159,7 +159,8 @@ struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
 static int hrtimer_get_target(int this_cpu, int pinned)
 {
 #ifdef CONFIG_NO_HZ
-	if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu))
+	if (!pinned && get_sysctl_timer_migration() &&
+	    (idle_cpu(this_cpu) || tick_nohz_full_cpu(this_cpu)))
 		return get_nohz_timer_target();
 #endif
 	return this_cpu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7b6156a..e2884c5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -560,22 +560,42 @@ void resched_cpu(int cpu)
  */
 int get_nohz_timer_target(void)
 {
-	int cpu = smp_processor_id();
 	int i;
 	struct sched_domain *sd;
+	int cpu = smp_processor_id();
+	int target = -1;
 
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
 		for_each_cpu(i, sched_domain_span(sd)) {
+			/*
+			 * This is biased toward CPU isolation usecase:
+			 * try to migrate the timer to a busy non-full-nohz
+			 * CPU. If there is none, then prefer an idle CPU
+			 * than a full nohz one.
+			 * We shouldn't do policy here (isolation VS powersaving)
+			 * so this is a temporary hack. Being able to affine
+			 * non-pinned timers would be a better thing.
+			 */
+			if (tick_nohz_full_cpu(i))
+				continue;
+
 			if (!idle_cpu(i)) {
-				cpu = i;
+				target = i;
 				goto unlock;
 			}
+
+			if (target == -1)
+				target = i;
 		}
 	}
+	/* Fallback in case of NULL domain */
+	if (target == -1)
+		target = cpu;
 unlock:
 	rcu_read_unlock();
-	return cpu;
+
+	return target;
 }
 /*
  * When add_timer_on() enqueues a timer into the timer wheel of an
diff --git a/kernel/timer.c b/kernel/timer.c
index 970b57d..51dd02b 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -738,7 +738,8 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
cpu = smp_processor_id();
 
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
-	if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu))
+	if (!pinned && get_sysctl_timer_migration() &&
+	    (idle_cpu(cpu) || tick_nohz_full_cpu(cpu)))
 		cpu = get_nohz_timer_target();
 #endif
 	new_base = per_cpu(tvec_bases, cpu);
-- 
1.7.5.4


[PATCH 24/27] nohz: Add some tracing

2012-12-29 Thread Frederic Weisbecker
Not for merge, just for debugging.
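
The trace_printk() output goes to the regular ftrace ring buffer, so the
reason why the tick could not be stopped on a given CPU can be read back
from /sys/kernel/debug/tracing/trace (assuming debugfs is mounted at the
usual place).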

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/time/tick-sched.c |   27 ++-
 1 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index eb6ad3d..0e1ebff 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -142,6 +142,7 @@ static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
 		ts->idle_jiffies++;
 	}
 #endif
+	trace_printk("tick\n");
 	update_process_times(user_mode(regs));
 	profile_tick(CPU_PROFILING);
 }
@@ -591,17 +592,30 @@ void tick_nohz_idle_enter(void)
 #ifdef CONFIG_NO_HZ_FULL
 static bool can_stop_full_tick(int cpu)
 {
-	if (!sched_can_stop_tick())
+	if (!sched_can_stop_tick()) {
+		trace_printk("Can't stop: sched\n");
 		return false;
+	}
 
-	if (!rcu_is_nocb_cpu(cpu))
+	if (!rcu_is_nocb_cpu(cpu)) {
+		trace_printk("Can't stop: not RCU nocb\n");
 		return false;
+	}
 
-	if (rcu_pending(cpu))
+	/*
+	 * Keep the tick if we are asked to report a quiescent state.
+	 * This must be further optimized (avoid checks for local callbacks,
+	 * ignore RCU in userspace, etc...
+	 */
+	if (rcu_pending(cpu)) {
+		trace_printk("Can't stop: RCU pending\n");
 		return false;
+	}
 
-	if (posix_cpu_timers_running(current))
+	if (posix_cpu_timers_running(current)) {
+		trace_printk("Can't stop: posix CPU timers running\n");
 		return false;
+	}
 
 	return true;
 }
@@ -615,12 +629,15 @@ static void tick_nohz_full_stop_tick(struct tick_sched *ts)
 	if (!tick_nohz_full_cpu(cpu) || is_idle_task(current))
 		return;
 
-	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
+	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE) {
+		trace_printk("Can't stop: NOHZ_MODE_INACTIVE\n");
 		return;
+	}
 
 	if (!can_stop_full_tick(cpu))
 		return;
 
+	trace_printk("Stop tick\n");
 	tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
 #endif
 }
-- 
1.7.5.4


[PATCH 22/27] nohz: Don't turn off the tick if rcu needs it

2012-12-29 Thread Frederic Weisbecker
If RCU is waiting for the current CPU to complete a grace
period, don't turn off the tick. Unlike dynticks-idle, we
are not necessarily going to enter an RCU extended quiescent
state, so we may need to keep the tick to note the current CPU's
quiescent states.

[added build fix from Zen Lin]

CHECKME: OTOH we don't want to handle a locally started
grace period, this should be offloaded for rcu_nocb CPUs.
What we want is to be kicked if we stay dynticks in the kernel
for too long (ie: to report a quiescent state).
rcu_pending() is perhaps an overkill just for that.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
Signed-off-by: Steven Rostedt rost...@goodmis.org
---
 include/linux/rcupdate.h |1 +
 kernel/rcutree.c |3 +--
 kernel/time/tick-sched.c |3 +++
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 829312e..2ebadac 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -211,6 +211,7 @@ static inline int rcu_preempt_depth(void)
 extern void rcu_sched_qs(int cpu);
 extern void rcu_bh_qs(int cpu);
 extern void rcu_check_callbacks(int cpu, int user);
+extern int rcu_pending(int cpu);
 struct notifier_block;
 extern void rcu_idle_enter(void);
 extern void rcu_idle_exit(void);
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index e9e0ffa..6ba3e02 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -232,7 +232,6 @@ module_param(jiffies_till_next_fqs, ulong, 0644);
 
 static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *));
 static void force_quiescent_state(struct rcu_state *rsp);
-static int rcu_pending(int cpu);
 
 /*
  * Return the number of RCU-sched batches processed thus far for debug  stats.
@@ -2521,7 +2520,7 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
  * by the current CPU, returning 1 if so.  This function is part of the
  * RCU implementation; it is -not- an exported member of the RCU API.
  */
-static int rcu_pending(int cpu)
+int rcu_pending(int cpu)
 {
struct rcu_state *rsp;
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 78e5341..04504c4 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -596,6 +596,9 @@ static bool can_stop_full_tick(int cpu)
if (!rcu_is_nocb_cpu(cpu))
return false;
 
+   if (rcu_pending(cpu))
+   return false;
+
return true;
 }
 #endif
-- 
1.7.5.4


[PATCH 20/27] nohz: Full dynticks mode

2012-12-29 Thread Frederic Weisbecker
When a CPU is in full dynticks mode, try to switch
it to nohz mode from the interrupt exit path if it is
running a single non-idle task.

Then restart the tick if necessary when a second task gets
enqueued while the tick is stopped, so that the scheduler
tick is rearmed.

[TODO: Check remaining things to be done from scheduler_tick()]

[ Included build fix from Geoff Levand ]

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 include/linux/sched.h|6 +
 include/linux/tick.h |2 +
 kernel/sched/core.c  |   22 -
 kernel/sched/sched.h |   10 +
 kernel/softirq.c |5 ++-
 kernel/time/tick-sched.c |   47 -
 6 files changed, 83 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 32860ae..132897d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2846,6 +2846,12 @@ static inline void inc_syscw(struct task_struct *tsk)
 #define TASK_SIZE_OF(tsk)  TASK_SIZE
 #endif
 
+#ifdef CONFIG_NO_HZ_FULL
+extern bool sched_can_stop_tick(void);
+#else
+static inline bool sched_can_stop_tick(void) { return false; }
+#endif
+
 #ifdef CONFIG_MM_OWNER
 extern void mm_update_next_owner(struct mm_struct *mm);
 extern void mm_init_owner(struct mm_struct *mm, struct task_struct *p);
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 2d4f6f0..dfb90ea 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -159,8 +159,10 @@ static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
 
 #ifdef CONFIG_NO_HZ_FULL
 int tick_nohz_full_cpu(int cpu);
+extern void tick_nohz_full_check(void);
 #else
 static inline int tick_nohz_full_cpu(int cpu) { return 0; }
+static inline void tick_nohz_full_check(void) { }
 #endif
 
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3c1a806..7b6156a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1238,6 +1238,24 @@ static void update_avg(u64 *avg, u64 sample)
 }
 #endif
 
+#ifdef CONFIG_NO_HZ_FULL
+bool sched_can_stop_tick(void)
+{
+	struct rq *rq;
+
+	rq = this_rq();
+
+	/* Make sure rq->nr_running update is visible after the IPI */
+	smp_rmb();
+
+	/* More than one running task need preemption */
+	if (rq->nr_running > 1)
+		return false;
+
+	return true;
+}
+#endif
+
 static void
 ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
 {
@@ -1380,7 +1398,8 @@ static void sched_ttwu_pending(void)
 
 void scheduler_ipi(void)
 {
-	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
+	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()
+	    && !tick_nohz_full_cpu(smp_processor_id()))
 		return;
 
/*
@@ -1397,6 +1416,7 @@ void scheduler_ipi(void)
 * somewhat pessimize the simple resched case.
 */
irq_enter();
+   tick_nohz_full_check();
sched_ttwu_pending();
 
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f24d91e..63915fe 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -955,6 +955,16 @@ static inline u64 steal_ticks(u64 steal)
 static inline void inc_nr_running(struct rq *rq)
 {
 	rq->nr_running++;
+
+#ifdef CONFIG_NO_HZ_FULL
+	if (rq->nr_running == 2) {
+		if (tick_nohz_full_cpu(rq->cpu)) {
+			/* Order rq->nr_running write against the IPI */
+			smp_wmb();
+			smp_send_reschedule(rq->cpu);
+		}
+	}
+#endif
 }
 
 static inline void dec_nr_running(struct rq *rq)
diff --git a/kernel/softirq.c b/kernel/softirq.c
index f5cc25f..6342078 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -307,7 +307,8 @@ void irq_enter(void)
 	int cpu = smp_processor_id();
 
 	rcu_irq_enter();
-	if (is_idle_task(current) && !in_interrupt()) {
+
+	if ((is_idle_task(current) || tick_nohz_full_cpu(cpu)) && !in_interrupt()) {
/*
 * Prevent raise_softirq from needlessly waking up ksoftirqd
 * here, as softirq will be serviced on return from interrupt.
@@ -349,7 +350,7 @@ void irq_exit(void)
 
 #ifdef CONFIG_NO_HZ
/* Make sure that timer wheel updates are propagated */
-	if (idle_cpu(smp_processor_id()) && !in_interrupt() && !need_resched())
+	if (!in_interrupt())
 		tick_nohz_irq_exit();
 #endif

[PATCH 17/27] sched: Update rq clock before idle balancing

2012-12-29 Thread Frederic Weisbecker
idle_balance() is called from schedule() right before we schedule the
idle task. It needs to record the idle timestamp at that time and for
this the rq clock must be accurate. If the CPU is running tickless
we need to update the rq clock manually.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/sched/fair.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e78d81104..698137d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5241,6 +5241,7 @@ void idle_balance(int this_cpu, struct rq *this_rq)
int pulled_task = 0;
unsigned long next_balance = jiffies + HZ;
 
+	update_nohz_rq_clock(this_rq);
 	this_rq->idle_stamp = this_rq->clock;
 
 	if (this_rq->avg_idle < sysctl_sched_migration_cost)
-- 
1.7.5.4


[PATCH 19/27] nohz: Move nohz load balancer selection into idle logic

2012-12-29 Thread Frederic Weisbecker
[ ** BUGGY PATCH: I need to put more thinking into this ** ]

We want the nohz load balancer to be an idle CPU, thus
move that selection to strict dyntick idle logic.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
[ added movement of calc_load_exit_idle() ]
Signed-off-by: Steven Rostedt rost...@goodmis.org
---
 kernel/time/tick-sched.c |   11 ++-
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index ab3aa14..164db94 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -444,9 +444,6 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
 	 * the scheduler tick in nohz_restart_sched_tick.
 	 */
 	if (!ts->tick_stopped) {
-		nohz_balance_enter_idle(cpu);
-		calc_load_enter_idle();
-
 		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
 		ts->tick_stopped = 1;
 	}
@@ -542,8 +539,11 @@ static void __tick_nohz_idle_enter(struct tick_sched *ts)
 			ts->idle_expires = expires;
 		}
 
-	if (!was_stopped && ts->tick_stopped)
+	if (!was_stopped && ts->tick_stopped) {
 		ts->idle_jiffies = ts->last_jiffies;
+		nohz_balance_enter_idle(cpu);
+		calc_load_enter_idle();
+	}
 	}
 }
 
@@ -651,7 +651,6 @@ static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 	tick_do_update_jiffies64(now);
 	update_cpu_load_nohz();
 
-	calc_load_exit_idle();
 	touch_softlockup_watchdog();
 	/*
 	 * Cancel the scheduled timer and restore the tick
@@ -711,6 +710,8 @@ void tick_nohz_idle_exit(void)
tick_nohz_stop_idle(cpu, now);
 
 	if (ts->tick_stopped) {
+		nohz_balance_enter_idle(cpu);
+		calc_load_exit_idle();
 		tick_nohz_restart_sched_tick(ts, now);
 		tick_nohz_account_idle_ticks(ts);
 	}
-- 
1.7.5.4


[PATCH 16/27] sched: Update clock of nohz busiest rq before balancing

2012-12-29 Thread Frederic Weisbecker
move_tasks() and active_load_balance_cpu_stop() both need
the busiest rq clock to be up to date because they may end
up calling can_migrate_task(), which uses rq->clock_task
to determine whether the task running in the busiest runqueue
is cache hot.

Hence if the busiest runqueue is tickless, update its clock
before reading it.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
[ Forward port conflicts ]
Signed-off-by: Steven Rostedt rost...@goodmis.org
---
 kernel/sched/fair.c |   17 +
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d65ac7..e78d81104 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5002,6 +5002,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 {
int ld_moved, cur_ld_moved, active_balance = 0;
int lb_iterations, max_lb_iterations;
+   int clock_updated;
struct sched_group *group;
struct rq *busiest;
unsigned long flags;
@@ -5045,6 +5046,7 @@ redo:
 
ld_moved = 0;
lb_iterations = 1;
+   clock_updated = 0;
 	if (busiest->nr_running > 1) {
/*
 * Attempt to move tasks. If find_busiest_group has found
@@ -5068,6 +5070,14 @@ more_balance:
 */
 		cur_ld_moved = move_tasks(&env);
 		ld_moved += cur_ld_moved;
+
+		/*
+		 * Move tasks may end up calling can_migrate_task() which
+		 * requires an uptodate value of the rq clock.
+		 */
+		update_nohz_rq_clock(busiest);
+		clock_updated = 1;
+
 		double_rq_unlock(env.dst_rq, busiest);
local_irq_restore(flags);
 
@@ -5163,6 +5173,13 @@ more_balance:
 			busiest->active_balance = 1;
 			busiest->push_cpu = this_cpu;
 			active_balance = 1;
+			/*
+			 * active_load_balance_cpu_stop may end up calling
+			 * can_migrate_task() which requires an uptodate
+			 * value of the rq clock.
+			 */
+			if (!clock_updated)
+				update_nohz_rq_clock(busiest);
 		}
 		raw_spin_unlock_irqrestore(&busiest->lock, flags);
 
-- 
1.7.5.4


[PATCH 15/27] sched: Update rq clock earlier in unthrottle_cfs_rq

2012-12-29 Thread Frederic Weisbecker
In this function we make use of rq->clock right before the
point where the rq clock gets updated; let's call update_rq_clock()
just before that use to avoid reading a stale rq clock value.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/sched/fair.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a96f0f2..3d65ac7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2279,14 +2279,15 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	long task_delta;
 
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
-
 	cfs_rq->throttled = 0;
+
+	update_rq_clock(rq);
+
 	raw_spin_lock(&cfs_b->lock);
 	cfs_b->throttled_time += rq->clock - cfs_rq->throttled_clock;
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
 
-	update_rq_clock(rq);
 	/* update hierarchical throttle state */
 	walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
 
-- 
1.7.5.4


[PATCH 13/27] sched: Update rq clock on nohz CPU before setting fair group shares

2012-12-29 Thread Frederic Weisbecker
When updating the group shares, we may update the execution time
(sched_group_set_shares() -> update_cfs_shares() -> reweight_entity()
-> update_curr()) before reweighting the entity, and this requires
an up-to-date version of the runqueue clock. Let's update it on the
target CPU if it runs tickless, because scheduler_tick() is not there
to maintain it.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/sched/fair.c |5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eea870..a96f0f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6068,6 +6068,11 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 		se = tg->se[i];
 		/* Propagate contribution to hierarchy */
 		raw_spin_lock_irqsave(&rq->lock, flags);
+		/*
+		 * We may call update_curr() which needs an up-to-date
+		 * version of rq clock if the CPU runs tickless.
+		 */
+		update_nohz_rq_clock(rq);
 		for_each_sched_entity(se)
 			update_cfs_shares(group_cfs_rq(se));
 		raw_spin_unlock_irqrestore(&rq->lock, flags);
-- 
1.7.5.4


[PATCH 11/27] sched: Comment on rq-clock correctness in ttwu_do_wakeup() in nohz

2012-12-29 Thread Frederic Weisbecker
Just to avoid confusion.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/sched/core.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 63b25e2..bfac40f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1302,6 +1302,12 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 	if (p->sched_class->task_woken)
 		p->sched_class->task_woken(rq, p);
 
+	/*
+	 * For adaptive nohz case: We called ttwu_activate()
+	 * which just updated the rq clock. There is an
+	 * exception with p->on_rq != 0 but in this case
+	 * we are not idle and rq->idle_stamp == 0
+	 */
 	if (rq->idle_stamp) {
 		u64 delta = rq->clock - rq->idle_stamp;
 		u64 max = 2*sysctl_sched_migration_cost;
-- 
1.7.5.4


[PATCH 09/27] nohz: Wake up full dynticks CPUs when a timer gets enqueued

2012-12-29 Thread Frederic Weisbecker
Wake up a CPU when a timer list timer is enqueued there and
the CPU is in full dynticks mode. Sending an IPI to it makes
it reconsider the next timer to program on top of recent
updates.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 include/linux/sched.h |4 ++--
 kernel/sched/core.c   |   18 +-
 kernel/timer.c|2 +-
 3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3bca36e..32860ae 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2061,9 +2061,9 @@ static inline void idle_task_exit(void) {}
 #endif
 
 #if defined(CONFIG_NO_HZ)  defined(CONFIG_SMP)
-extern void wake_up_idle_cpu(int cpu);
+extern void wake_up_nohz_cpu(int cpu);
 #else
-static inline void wake_up_idle_cpu(int cpu) { }
+static inline void wake_up_nohz_cpu(int cpu) { }
 #endif
 
 extern unsigned int sysctl_sched_latency;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 257002c..63b25e2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -587,7 +587,7 @@ unlock:
  * account when the CPU goes back to idle and evaluates the timer
  * wheel for the next timer event.
  */
-void wake_up_idle_cpu(int cpu)
+static void wake_up_idle_cpu(int cpu)
 {
struct rq *rq = cpu_rq(cpu);
 
@@ -617,6 +617,22 @@ void wake_up_idle_cpu(int cpu)
smp_send_reschedule(cpu);
 }
 
+static bool wake_up_full_nohz_cpu(int cpu)
+{
+	if (tick_nohz_full_cpu(cpu)) {
+		smp_send_reschedule(cpu);
+		return true;
+	}
+
+	return false;
+}
+
+void wake_up_nohz_cpu(int cpu)
+{
+	if (!wake_up_full_nohz_cpu(cpu))
+		wake_up_idle_cpu(cpu);
+}
+
 static inline bool got_nohz_idle_kick(void)
 {
int cpu = smp_processor_id();
diff --git a/kernel/timer.c b/kernel/timer.c
index ff3b516..970b57d 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -936,7 +936,7 @@ void add_timer_on(struct timer_list *timer, int cpu)
 * makes sure that a CPU on the way to idle can not evaluate
 * the timer wheel.
 */
-   wake_up_idle_cpu(cpu);
+   wake_up_nohz_cpu(cpu);
 	spin_unlock_irqrestore(&base->lock, flags);
 }
 EXPORT_SYMBOL_GPL(add_timer_on);
-- 
1.7.5.4


[PATCH 08/27] nohz: Trace timekeeping update

2012-12-29 Thread Frederic Weisbecker
Not for merge. This may become a real tracepoint.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/time/tick-sched.c |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f19e8bf..ab3aa14 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -118,8 +118,10 @@ static void tick_sched_do_timer(ktime_t now)
 #endif
 
 	/* Check, if the jiffies need an update */
-	if (tick_do_timer_cpu == cpu)
+	if (tick_do_timer_cpu == cpu) {
+		trace_printk("do timekeeping\n");
 		tick_do_update_jiffies64(now);
+	}
 }
 
 static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
-- 
1.7.5.4


[PATCH 06/27] nohz: Basic full dynticks interface

2012-12-29 Thread Frederic Weisbecker
Start with a very simple interface to define the full dynticks CPUs:
a cpumask defined through the full_nohz= boot-time kernel
parameter.

Make sure you keep at least one CPU outside this range to handle
the timekeeping.

Also, the full_nohz= value must match the rcu_nocbs= value.
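
For example, to run CPUs 1-7 tickless while keeping CPU 0 for timekeeping
and RCU callback handling, an illustrative command line (assuming the
rcu_nocbs= boot parameter from the RCU callback-offloading work) would be:

	full_nohz=1-7 rcu_nocbs=1-7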

Suggested-by: Paul E. McKenney paul...@linux.vnet.ibm.com
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 include/linux/tick.h |7 +++
 kernel/time/Kconfig  |9 +
 kernel/time/tick-sched.c |   23 +++
 3 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 553272e..2d4f6f0 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -157,6 +157,13 @@ static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
 static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
 # endif /* !NO_HZ */
 
+#ifdef CONFIG_NO_HZ_FULL
+int tick_nohz_full_cpu(int cpu);
+#else
+static inline int tick_nohz_full_cpu(int cpu) { return 0; }
+#endif
+
+
 # ifdef CONFIG_CPU_IDLE_GOV_MENU
 extern void menu_hrtimer_cancel(void);
 # else
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 8601f0d..dc6381d 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -70,6 +70,15 @@ config NO_HZ
  only trigger on an as-needed basis both when the system is
  busy and when the system is idle.
 
+config NO_HZ_FULL
+	bool "Full tickless system"
+	depends on NO_HZ && RCU_USER_QS && VIRT_CPU_ACCOUNTING_GEN && RCU_NOCB_CPU && SMP
+	select CONTEXT_TRACKING_FORCE
+	help
+	  Try to be tickless everywhere, not just in idle. (You need
+	  to fill up the full_nohz_mask boot parameter).
+
+
 config HIGH_RES_TIMERS
 	bool "High Resolution Timer Support"
 	depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index ad0e6fa..fac9ba4 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -142,6 +142,29 @@ static void tick_sched_handle(struct tick_sched *ts, 
struct pt_regs *regs)
profile_tick(CPU_PROFILING);
 }
 
+#ifdef CONFIG_NO_HZ_FULL
+static cpumask_var_t full_nohz_mask;
+bool have_full_nohz_mask;
+
+int tick_nohz_full_cpu(int cpu)
+{
+   if (!have_full_nohz_mask)
+   return 0;
+
+   return cpumask_test_cpu(cpu, full_nohz_mask);
+}
+
+/* Parse the boot-time nohz CPU list from the kernel parameters. */
+static int __init tick_nohz_full_setup(char *str)
+{
+   alloc_bootmem_cpumask_var(full_nohz_mask);
+   have_full_nohz_mask = true;
+   cpulist_parse(str, full_nohz_mask);
+   return 1;
+}
+__setup("full_nohz=", tick_nohz_full_setup);
+#endif
+
 /*
  * NOHZ - aka dynamic tick functionality
  */
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 05/27] cputime: Safely read cputime of full dynticks CPUs

2012-12-29 Thread Frederic Weisbecker
While remotely reading the cputime of a task running on a
full dynticks CPU, the values stored in the utime/stime fields
of struct task_struct may be stale. Their values may be those
of the last kernel <-> user transition time snapshot, and
we need to add the tickless time spent since this snapshot.

To fix this, flush the cputime of the dynticks CPUs on
kernel <-> user transition and record the time / context
where we did this. Then, on top of this snapshot and the current
time, perform the fixup on the reader side from the task_times()
accessors.

FIXME: do the same for idle and guest time.
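
As a rough, illustrative sketch of that reader-side fixup (it assumes the
vtime_seqlock, prev_jiffies and prev_jiffies_whence fields added below, plus
a hypothetical JIFFIES_USER counterpart to the JIFFIES_SYS marker; the real
accessors differ in detail):

	/* Sketch: add the tickless delta to the stored cputime snapshot */
	static void task_cputime_fixup(struct task_struct *t,
				       cputime_t *utime, cputime_t *stime)
	{
		unsigned int seq;
		cputime_t delta;

		do {
			seq = read_seqbegin(&t->vtime_seqlock);

			*utime = t->utime;
			*stime = t->stime;

			/* Tickless time elapsed since the last snapshot */
			delta = jiffies_to_cputime(jiffies - t->prev_jiffies);

			if (t->prev_jiffies_whence == JIFFIES_USER)
				*utime += delta;
			else if (t->prev_jiffies_whence == JIFFIES_SYS)
				*stime += delta;
		} while (read_seqretry(&t->vtime_seqlock, seq));
	}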

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 arch/s390/kernel/vtime.c  |6 +-
 include/asm-generic/cputime.h |1 +
 include/linux/hardirq.h   |4 +-
 include/linux/init_task.h |   11 
 include/linux/sched.h |   16 +
 include/linux/vtime.h |   40 +++---
 kernel/context_tracking.c |2 +-
 kernel/fork.c |6 ++
 kernel/sched/cputime.c|  123 ++---
 kernel/softirq.c  |6 +-
 10 files changed, 154 insertions(+), 61 deletions(-)

diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index e84b8b6..ce9cc5a 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -127,7 +127,7 @@ void vtime_account_user(struct task_struct *tsk)
  * Update process times based on virtual cpu times stored by entry.S
  * to the lowcore fields user_timer, system_timer  steal_clock.
  */
-void vtime_account(struct task_struct *tsk)
+void vtime_account_irq_enter(struct task_struct *tsk)
 {
struct thread_info *ti = task_thread_info(tsk);
u64 timer, system;
@@ -145,10 +145,10 @@ void vtime_account(struct task_struct *tsk)
 
virt_timer_forward(system);
 }
-EXPORT_SYMBOL_GPL(vtime_account);
+EXPORT_SYMBOL_GPL(vtime_account_irq_enter);
 
 void vtime_account_system(struct task_struct *tsk)
-__attribute__((alias(vtime_account)));
+__attribute__((alias(vtime_account_irq_enter)));
 EXPORT_SYMBOL_GPL(vtime_account_system);
 
 void __kprobes vtime_stop_cpu(void)
diff --git a/include/asm-generic/cputime.h b/include/asm-generic/cputime.h
index 9a62937..3e704d5 100644
--- a/include/asm-generic/cputime.h
+++ b/include/asm-generic/cputime.h
@@ -10,6 +10,7 @@ typedef unsigned long __nocast cputime_t;
 #define cputime_to_jiffies(__ct)   (__force unsigned long)(__ct)
 #define cputime_to_scaled(__ct)(__ct)
 #define jiffies_to_cputime(__hz)   (__force cputime_t)(__hz)
+#define jiffies_to_scaled(__hz)(__force cputime_t)(__hz)
 
 typedef u64 __nocast cputime64_t;
 
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 624ef3f..7105d5c 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -153,7 +153,7 @@ extern void rcu_nmi_exit(void);
  */
 #define __irq_enter()  \
do {\
-   vtime_account_irq_enter(current);   \
+   account_irq_enter_time(current);\
add_preempt_count(HARDIRQ_OFFSET);  \
trace_hardirq_enter();  \
} while (0)
@@ -169,7 +169,7 @@ extern void irq_enter(void);
 #define __irq_exit()   \
do {\
trace_hardirq_exit();   \
-   vtime_account_irq_exit(current);\
+   account_irq_exit_time(current); \
sub_preempt_count(HARDIRQ_OFFSET);  \
} while (0)
 
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6d087c5..a6ef59f 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -10,6 +10,7 @@
 #include linux/pid_namespace.h
 #include linux/user_namespace.h
 #include linux/securebits.h
+#include linux/seqlock.h
 #include net/net_namespace.h
 
 #ifdef CONFIG_SMP
@@ -141,6 +142,15 @@ extern struct task_group root_task_group;
 # define INIT_PERF_EVENTS(tsk)
 #endif
 
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+# define INIT_VTIME(tsk)   \
+   .vtime_seqlock = __SEQLOCK_UNLOCKED(tsk.vtime_seqlock), \
+   .prev_jiffies = INITIAL_JIFFIES, /* CHECKME */  \
+   .prev_jiffies_whence = JIFFIES_SYS,
+#else
+# define INIT_VTIME(tsk)
+#endif
+
 #define

[PATCH 03/27] cputime: Allow dynamic switch between tick/virtual based cputime accounting

2012-12-29 Thread Frederic Weisbecker
Allow dynamic switching between tick based and virtual based cputime
accounting. This way we can provide a kind of on-demand virtual cputime
accounting. In this mode, the kernel will rely on the user hooks
subsystem to dynamically hook on kernel/user boundaries.

This is in preparation for being able to stop the timer tick beyond
idle. Doing so will depend on CONFIG_VIRT_CPU_ACCOUNTING, which makes
it possible to account the cputime without the tick by hooking on
kernel/user boundaries.

Depending on whether the tick is stopped or not, we can switch between
tick and vtime based accounting at any time in order to minimize the
overhead associated with user hooks.
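
As an illustration of that intent (a sketch only, not a hunk of this patch;
it omits the idle and steal time handling of the real tick path), the
tick-based accounting simply yields when vtime accounting is active:

	/* Sketch: the tick path bails out while vtime accounting is running */
	void account_process_tick(struct task_struct *p, int user_tick)
	{
		if (vtime_accounting())
			return;	/* accounted on kernel/user boundaries instead */

		if (user_tick)
			account_user_time(p, cputime_one_jiffy, cputime_one_jiffy);
		else
			account_system_time(p, HARDIRQ_OFFSET, cputime_one_jiffy,
					    cputime_one_jiffy);
	}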

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 include/linux/kernel_stat.h |2 +-
 include/linux/sched.h   |4 +-
 include/linux/vtime.h   |8 ++
 init/Kconfig|6 
 kernel/fork.c   |2 +-
 kernel/sched/cputime.c  |   58 +++---
 kernel/time/tick-sched.c|5 +++-
 7 files changed, 59 insertions(+), 26 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 66b7078..ed5f6ed 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -127,7 +127,7 @@ extern void account_system_time(struct task_struct *, int, 
cputime_t, cputime_t)
 extern void account_steal_time(cputime_t);
 extern void account_idle_time(cputime_t);
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 static inline void account_process_tick(struct task_struct *tsk, int user)
 {
vtime_account_user(tsk);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 206bb08..66b2344 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -605,7 +605,7 @@ struct signal_struct {
cputime_t utime, stime, cutime, cstime;
cputime_t gtime;
cputime_t cgtime;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
struct cputime prev_cputime;
 #endif
unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
@@ -1365,7 +1365,7 @@ struct task_struct {
 
cputime_t utime, stime, utimescaled, stimescaled;
cputime_t gtime;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
struct cputime prev_cputime;
 #endif
unsigned long nvcsw, nivcsw; /* context switch counts */
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 1151960..e57020d 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -10,12 +10,20 @@ extern void vtime_account_system_irqsafe(struct task_struct 
*tsk);
 extern void vtime_account_idle(struct task_struct *tsk);
 extern void vtime_account_user(struct task_struct *tsk);
 extern void vtime_account(struct task_struct *tsk);
+
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+extern bool vtime_accounting(void);
 #else
+static inline bool vtime_accounting(void) { return true; }
+#endif
+
+#else /* !CONFIG_VIRT_CPU_ACCOUNTING */
 static inline void vtime_task_switch(struct task_struct *prev) { }
 static inline void vtime_account_system(struct task_struct *tsk) { }
 static inline void vtime_account_system_irqsafe(struct task_struct *tsk) { }
 static inline void vtime_account_user(struct task_struct *tsk) { }
 static inline void vtime_account(struct task_struct *tsk) { }
+static inline bool vtime_accounting(void) { return false; }
 #endif
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
diff --git a/init/Kconfig b/init/Kconfig
index dad2b88..307bc35 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -342,6 +342,7 @@ config VIRT_CPU_ACCOUNTING
bool Deterministic task and CPU time accounting
depends on HAVE_VIRT_CPU_ACCOUNTING || HAVE_CONTEXT_TRACKING
select VIRT_CPU_ACCOUNTING_GEN if !HAVE_VIRT_CPU_ACCOUNTING
+   select VIRT_CPU_ACCOUNTING_NATIVE if HAVE_VIRT_CPU_ACCOUNTING
help
  Select this option to enable more accurate task and CPU time
  accounting.  This is done by reading a CPU counter on each
@@ -366,11 +367,16 @@ endchoice
 
 config VIRT_CPU_ACCOUNTING_GEN
select CONTEXT_TRACKING
+   depends on VIRT_CPU_ACCOUNTING  HAVE_CONTEXT_TRACKING
bool
help
  Implement a generic virtual based cputime accounting by using
  the context tracking subsystem.
 
+config VIRT_CPU_ACCOUNTING_NATIVE
+   depends on VIRT_CPU_ACCOUNTING  HAVE_VIRT_CPU_ACCOUNTING
+   bool

[PATCH 01/27] context_tracking: Add comments on interface and internals

2012-12-29 Thread Frederic Weisbecker
This subsystem lacks many explanations on its purpose and
design. Add these missing comments.

Reported-by: Andrew Morton a...@linux-foundation.org
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Li Zhong zh...@linux.vnet.ibm.com
---
 kernel/context_tracking.c |   73 ++--
 1 files changed, 63 insertions(+), 10 deletions(-)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index e0e07fd..9f6c38f 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -1,3 +1,19 @@
+/*
+ * Context tracking: Probe on high level context boundaries such as kernel
+ * and userspace. This includes syscalls and exceptions entry/exit.
+ *
+ * This is used by RCU to remove its dependency on the timer tick while a CPU
+ * runs in userspace.
+ *
+ *  Started by Frederic Weisbecker:
+ *
+ * Copyright (C) 2012 Red Hat, Inc., Frederic Weisbecker fweis...@redhat.com
+ *
+ * Many thanks to Gilad Ben-Yossef, Paul McKenney, Ingo Molnar, Andrew Morton,
+ * Steven Rostedt, Peter Zijlstra for suggestions and improvements.
+ *
+ */
+
 #include linux/context_tracking.h
 #include linux/rcupdate.h
 #include linux/sched.h
@@ -6,8 +22,8 @@
 
 struct context_tracking {
/*
-* When active is false, hooks are not set to
-* minimize overhead: TIF flags are cleared
+* When active is false, hooks are unset in order
+* to minimize overhead: TIF flags are cleared
 * and calls to user_enter/exit are ignored. This
 * may be further optimized using static keys.
 */
@@ -24,6 +40,15 @@ static DEFINE_PER_CPU(struct context_tracking, 
context_tracking) = {
 #endif
 };
 
+/**
+ * user_enter - Inform the context tracking that the CPU is going to
+ *  enter userspace mode.
+ *
+ * This function must be called right before we switch from the kernel
+ * to userspace, when it's guaranteed the remaining kernel instructions
+ * to execute won't use any RCU read side critical section because this
+ * function sets RCU in extended quiescent state.
+ */
 void user_enter(void)
 {
unsigned long flags;
@@ -39,40 +64,68 @@ void user_enter(void)
if (in_interrupt())
return;
 
+   /* Kernel threads aren't supposed to go to userspace */
WARN_ON_ONCE(!current-mm);
 
local_irq_save(flags);
if (__this_cpu_read(context_tracking.active) 
__this_cpu_read(context_tracking.state) != IN_USER) {
__this_cpu_write(context_tracking.state, IN_USER);
+   /*
+* At this stage, only low level arch entry code remains and
+* then we'll run in userspace. We can assume there won't be
+* any RCU read-side critical section until the next call to
+* user_exit() or rcu_irq_enter(). Let's remove RCU's dependency
+* on the tick.
+*/
rcu_user_enter();
}
local_irq_restore(flags);
 }
 
+
+/**
+ * user_exit - Inform the context tracking that the CPU is
+ * exiting userspace mode and entering the kernel.
+ *
+ * This function must be called after we entered the kernel from userspace
+ * before any use of RCU read side critical section. This potentially include
+ * any high level kernel code like syscalls, exceptions, signal handling, 
etc...
+ *
+ * This call supports re-entrancy. This way it can be called from any exception
+ * handler without needing to know if we came from userspace or not.
+ */
 void user_exit(void)
 {
unsigned long flags;
 
-   /*
-* Some contexts may involve an exception occuring in an irq,
-* leading to that nesting:
-* rcu_irq_enter() rcu_user_exit() rcu_user_exit() rcu_irq_exit()
-* This would mess up the dyntick_nesting count though. And rcu_irq_*()
-* helpers are enough to protect RCU uses inside the exception. So
-* just return immediately if we detect we are in an IRQ.
-*/
if (in_interrupt())
return;
 
local_irq_save(flags);
if (__this_cpu_read(context_tracking.state) == IN_USER) {
__this_cpu_write(context_tracking.state, IN_KERNEL);
+   /*
+* We are going to run code that may use RCU. Inform
+* RCU core about that (ie: we may need the tick again).
+*/
rcu_user_exit

Re: [PATCH 07/27] nohz: Assign timekeeping duty to a non-full-nohz CPU

2013-01-04 Thread Frederic Weisbecker
2013/1/2 Christoph Lameter c...@linux.com:
 On Sat, 29 Dec 2012, Frederic Weisbecker wrote:

 @@ -163,6 +164,8 @@ static int __init tick_nohz_full_setup(char *str)
   return 1;
  }
  __setup(full_nohz=, tick_nohz_full_setup);
 +#else
 +#define have_full_nohz_mask (0)
  #endif

  /*
 @@ -512,6 +515,10 @@ static bool can_stop_idle_tick(int cpu, struct 
 tick_sched *ts)
   return false;
   }

 + /* If there are full nohz CPUs around, we need to keep the timekeeping 
 duty */
 + if (have_full_nohz_mask  tick_do_timer_cpu == cpu)
 + return false;
 +
   return true;
  }



 Ok so I guess this means that if I setup all cpus as nohz then a random
 one will continue to do timekeeping?

In fact, although the code doesn't check that yet, you're supposed to
have at least one online CPU outside the full_nohz mask to handle
that.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 21/27] nohz: Only stop the tick on RCU nocb CPUs

2013-01-04 Thread Frederic Weisbecker
2013/1/2 Namhyung Kim namhy...@kernel.org:
 You may want to add the following also to shut up the gcc:

   CC  kernel/rcutree.o
 In file included from /home/namhyung/project/linux/kernel/rcutree.c:58:0:
 /home/namhyung/project/linux/kernel/rcutree.h:539:13: warning: 
 ‘is_nocb_cpu’ declared ‘static’ but never defined [-Wunused-function]


 Thanks,
 Namhyung


 diff --git a/kernel/rcutree.h b/kernel/rcutree.h
 index 4b69291b093d..fbbad931c36a 100644
 --- a/kernel/rcutree.h
 +++ b/kernel/rcutree.h
 @@ -536,7 +536,6 @@ static void print_cpu_stall_info(struct rcu_state *rsp, 
 int cpu);
  static void print_cpu_stall_info_end(void);
  static void zero_cpu_stall_ticks(struct rcu_data *rdp);
  static void increment_cpu_stall_ticks(void);
 -static bool is_nocb_cpu(int cpu);
  static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
 bool lazy);
  static bool rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp,

I'm folding that fix for the next version.

Thanks Namhyung!
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v1 15/31] ARC: Process/scheduling/clock/Timers/Delay Management

2013-01-04 Thread Frederic Weisbecker
2013/1/2 Vineet Gupta vineet.gup...@synopsys.com:
 On Tuesday 13 November 2012 01:59 AM, Thomas Gleixner wrote:
 On Wed, 7 Nov 2012, Vineet Gupta wrote:
 +void cpu_idle(void)
 +{
 +/* Since we SLEEP in idle loop, TIF_POLLING_NRFLAG can't be set */
 +
 +/* endless idle loop with no priority at all */
 +while (1) {
 +tick_nohz_idle_enter();
 +
 +while (!need_resched())
 +arch_idle();
 +
 +tick_nohz_idle_exit();
 +
 +preempt_enable_no_resched();
 +schedule();
 +preempt_disable();

   schedule_preempt_disabled() please


 OK ! And it seems I was also missing the calls to rcu_idle_enter()/exit() to 
 track
 commit 1268fbc746ea  nohz: Remove tick_nohz_idle_enter_norcu() / ...

Right!

They must be placed around the code that sets the low power mode, when
you know there is no use of RCU between rcu_idle_enter() /
rcu_idle_exit(). Here this would be likely:

while (1) {
tick_nohz_idle_enter();

+  rcu_idle_enter();
while (!need_resched())
arch_idle();
+  rcu_idle_exit()

tick_nohz_idle_exit();
schedule_preempt_disabled();
}
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 06/27] nohz: Basic full dynticks interface

2013-01-04 Thread Frederic Weisbecker
2012/12/31 Li Zhong zh...@linux.vnet.ibm.com:
 On Sat, 2012-12-29 at 17:42 +0100, Frederic Weisbecker wrote:
 Start with a very simple interface to define full dynticks CPU:
 use a boot time option defined cpumask through the full_nohz=
 kernel parameter.

 Make sure you keep at least one CPU outside this range to handle
 the timekeeping.

 Also full_nohz= must match rcu_nocb= value.

 Suggested-by: Paul E. McKenney paul...@linux.vnet.ibm.com
 Signed-off-by: Frederic Weisbecker fweis...@gmail.com
 Cc: Alessio Igor Bogani abog...@kernel.org
 Cc: Andrew Morton a...@linux-foundation.org
 Cc: Chris Metcalf cmetc...@tilera.com
 Cc: Christoph Lameter c...@linux.com
 Cc: Geoff Levand ge...@infradead.org
 Cc: Gilad Ben Yossef gi...@benyossef.com
 Cc: Hakan Akkan hakanak...@gmail.com
 Cc: Ingo Molnar mi...@kernel.org
 Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
 Cc: Paul Gortmaker paul.gortma...@windriver.com
 Cc: Peter Zijlstra pet...@infradead.org
 Cc: Steven Rostedt rost...@goodmis.org
 Cc: Thomas Gleixner t...@linutronix.de
 ---
  include/linux/tick.h |7 +++
  kernel/time/Kconfig  |9 +
  kernel/time/tick-sched.c |   23 +++
  3 files changed, 39 insertions(+), 0 deletions(-)

 diff --git a/include/linux/tick.h b/include/linux/tick.h
 index 553272e..2d4f6f0 100644
 --- a/include/linux/tick.h
 +++ b/include/linux/tick.h
 @@ -157,6 +157,13 @@ static inline u64 get_cpu_idle_time_us(int cpu, u64 
 *unused) { return -1; }
  static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; 
 }
  # endif /* !NO_HZ */

 +#ifdef CONFIG_NO_HZ_FULL
 +int tick_nohz_full_cpu(int cpu);
 +#else
 +static inline int tick_nohz_full_cpu(int cpu) { return 0; }
 +#endif
 +
 +
  # ifdef CONFIG_CPU_IDLE_GOV_MENU
  extern void menu_hrtimer_cancel(void);
  # else
 diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
 index 8601f0d..dc6381d 100644
 --- a/kernel/time/Kconfig
 +++ b/kernel/time/Kconfig
 @@ -70,6 +70,15 @@ config NO_HZ
 only trigger on an as-needed basis both when the system is
 busy and when the system is idle.

 +config NO_HZ_FULL
 +   bool Full tickless system
 +   depends on NO_HZ  RCU_USER_QS  VIRT_CPU_ACCOUNTING_GEN  
 RCU_NOCB_CPU  SMP

 Does that mean for archs like PPC64, which HAVE_VIRT_CPU_ACCOUNTING, to
 get NO_HZ_FULL supported, we need to use VIRT_CPU_ACCOUTING_GEN instead
 of VIRT_CPU_ACCOUNTING_NATIVE? ( I think the two, *_NATIVE and *_GEN,
 shouldn't be both enabled at the same time? )

Indeed! This sounds silly in the first place, but _GEN does a context
tracking that _NATIVE doesn't perform. And this context tracking must
also be well ordered and serialized against the cputime snapshots.
This is important when we remotely fix up the time from the read side,
i.e. if we read the cputime of a task that has been running tickless
for some time, we need to know where it runs (user or kernel), then
pick either tsk->utime or tsk->stime as a result and add to it the
delta of time it has been running tickless.

This fixup is performed in task_cputime() using a seqlock for
ordering/serializing. And the write side uses seqlocks too, from the
vtime accounting APIs. But this is not handled by _NATIVE.
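
For illustration, the write side of that scheme looks roughly like the sketch
below (reusing the vtime_seqlock/prev_jiffies fields from patch 05/27; the
real code also handles the scaled time and the context marker):

	/* Sketch: flush the pending tickless user time under the seqlock */
	void vtime_account_user(struct task_struct *tsk)
	{
		cputime_t delta;

		write_seqlock(&tsk->vtime_seqlock);
		delta = jiffies_to_cputime(jiffies - tsk->prev_jiffies);
		tsk->prev_jiffies = jiffies;
		account_user_time(tsk, delta, delta);
		write_sequnlock(&tsk->vtime_seqlock);
	}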


 When I tried it on a ppc64 machine, it seems that after I select
 VIRT_CPU_ACCOUNTING, VIRT_CPU_ACCOUNTING_NATIVE is automatically
 selected. And I have no way to enable VIRT_CPU_ACCOUTING_GEN, or disable
 VIRT_CPU_ACCOUNTING_NATIVE. It seems that's because these two don't have
 a configuration name (input prompt).

Yeah I need to fix that. The user should be able to choose between
VIRT_CPU_ACCOUTING_GEN and VIRT_CPU_ACCOUNTING_NATIVE.

I'll fix that for the next release.


 +   select CONTEXT_TRACKING_FORCE
 +   help
 + Try to be tickless everywhere, not just in idle. (You need
 +  to fill up the full_nohz_mask boot parameter).

 Maybe it is better to use the name of the boot parameter full_nohz here
 than the name of the mask variable used in the code?


Right!

Thanks for your reviews!
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 05/27] cputime: Safely read cputime of full dynticks CPUs

2013-01-04 Thread Frederic Weisbecker
2012/12/31 Li Zhong zh...@linux.vnet.ibm.com:
 On Sat, 2012-12-29 at 17:42 +0100, Frederic Weisbecker wrote:
  static inline void vtime_task_switch(struct task_struct *prev) { }
  static inline void vtime_account_system(struct task_struct *tsk) { }
  static inline void vtime_account_system_irqsafe(struct task_struct *tsk) { }
  static inline void vtime_account_user(struct task_struct *tsk) { }
 -static inline void vtime_account(struct task_struct *tsk) { }
 +static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
  static inline bool vtime_accounting(void) { return false; }
  #endif

  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 -static inline void arch_vtime_task_switch(struct task_struct *tsk) { }
 +extern void arch_vtime_task_switch(struct task_struct *tsk);
 +extern void vtime_account_irq_exit(struct task_struct *tsk);
 +extern void vtime_user_enter(struct task_struct *tsk);
 +extern bool vtime_accounting(void);
 +#else
 +static inline void vtime_account_irq_exit(struct task_struct *tsk)
 +{
 + /* On hard|softirq exit we always account to hard|softirq cputime */
 + vtime_account_system(tsk);
 +}
 +static inline void vtime_enter_user(struct task_struct *tsk) { }

 I guess the function name above should be vtime_user_enter to match
 the above extern, and the usage in user_enter()?

Totally! Thanks for pointing this out.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE] 3.7-nohz1

2013-01-04 Thread Frederic Weisbecker
2012/12/30 Paul E. McKenney paul...@linux.vnet.ibm.com:
 On Mon, Dec 24, 2012 at 12:43:25AM +0100, Frederic Weisbecker wrote:
 2012/12/21 Steven Rostedt rost...@goodmis.org:
  On Thu, 2012-12-20 at 19:32 +0100, Frederic Weisbecker wrote:
  Let's imagine you have 4 CPUs. We keep the CPU 0 to offline RCU callbacks 
  there and to
  handle the timekeeping. We set the rest as full dynticks. So you need the 
  following kernel
  parameters:
 
rcu_nocbs=1-3 full_nohz=1-3
 
  (Note rcu_nocbs value must always be the same as full_nohz).
 
  Why? You can't have: rcu_nocbs=1-4 full_nohz=1-3

 That should be allowed.

or: rcu_nocbs=1-3 full_nohz=1-4 ?

 But that not.

 You need to have: rcu_nocbs & full_nohz == full_nohz. This is because
 the tick is not there to maintain the local RCU callbacks anymore. So
 this must be offloaded to the rcu_nocb threads.

 I just have a doubt with rcu_nocb. Do we still need the tick to
 complete the grace period for local rcu callbacks? I need to discuss
 that with Paul.

 The tick is only needed if rcu_needs_cpu() returns false.  Of course,
 this means that if you don't invoke rcu_needs_cpu() before returning to
 adaptive-idle usermode execution, you are correct that a full_nohz CPU
 would also have to be a rcu_nocbs CPU.

 That said, I am getting close to having an rcu_needs_cpu() that only
 returns false if there are callbacks immediately ready to invoke, at
 least if RCU_FAST_NO_HZ=y.

Ok. Also when a CPU enqueues a callback and starts a grace period, the
tick polls on the grace period completion. How is it handled with
rcu_nocbs CPUs? Does rcu_needs_cpu() return false until the grace
period is completed? If so I still need to restart the local tick
whenever a new callback is enqueued.

Thanks.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 1/2] sched: Move idle_balance() to post_schedule

2013-01-05 Thread Frederic Weisbecker
2012/12/22 Steven Rostedt rost...@goodmis.org:
 The idle_balance() code is called to do task load balancing just before
 going to idle. This makes sense as the CPU is about to sleep anyway.
 But currently it's called in the middle of the scheduler and in a place
 that must have interrupts disabled. That means, while the load balancing
 is going on, if a task wakes up on this CPU, it wont get to run while
 the interrupts are disabled. The idle task doing the balancing will be
 clueless about it.

 There's no real reason that the idle_balance() needs to be called in the
 middle of schedule anyway. The only benefit is that if a task is pulled
 to this CPU, it can be scheduled without the need to schedule the idle
 task. But load balancing and migrating the task makes a switch to idle
 and back negligible.

This cleanup looks nice: not only does it let us enable interrupts
there, it also debloats the schedule() path a bit by moving idle
specific code out of it. So it would be a pity if the optimization
that goes away with your cleanup had any measurable impact. Is there
any sensible benchmark that can be run against this patch? Something
that involves a lot of back and forth to idle with a bunch of tasks
running around on the other CPUs?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH tip/core/urgent 1/2] rcu: Prevent soft-lockup complaints about no-CBs CPUs

2013-01-05 Thread Frederic Weisbecker
Hi Paul,

2013/1/5 Paul E. McKenney paul...@linux.vnet.ibm.com:
 From: Paul Gortmaker paul.gortma...@windriver.com

 The wait_event() at the head of the rcu_nocb_kthread() can result in
 soft-lockup complaints if the CPU in question does not register RCU
 callbacks for an extended period.  This commit therefore changes
 the wait_event() to a wait_event_interruptible().

 Reported-by: Frederic Weisbecker fweis...@gmail.com
 Signed-off-by: Paul Gortmaker paul.gortma...@windriver.com
 Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com
 ---
  kernel/rcutree_plugin.h |3 ++-
  1 files changed, 2 insertions(+), 1 deletions(-)

 diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
 index f6e5ec2..43dba2d 100644
 --- a/kernel/rcutree_plugin.h
 +++ b/kernel/rcutree_plugin.h
 @@ -2366,10 +2366,11 @@ static int rcu_nocb_kthread(void *arg)
 for (;;) {
 /* If not polling, wait for next batch of callbacks. */
 if (!rcu_nocb_poll)
 -   wait_event(rdp-nocb_wq, rdp-nocb_head);
 +   wait_event_interruptible(rdp-nocb_wq, 
 rdp-nocb_head);
 list = ACCESS_ONCE(rdp-nocb_head);
 if (!list) {
 schedule_timeout_interruptible(1);
 +   flush_signals(current);

Why is that needed?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] context_tracking: Add comments on interface and internals

2012-12-13 Thread Frederic Weisbecker
This subsystem lacks many explanations on its purpose and
design. Add these missing comments.

Reported-by: Andrew Morton a...@linux-foundation.org
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Gilad Ben-Yossef gi...@benyossef.com
Cc: Thomas Gleixner t...@linutronix.de
Cc: Andrew Morton a...@linux-foundation.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Li Zhong zh...@linux.vnet.ibm.com
---
 kernel/context_tracking.c |   74 ++--
 1 files changed, 64 insertions(+), 10 deletions(-)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index e0e07fd..f146b27 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -1,3 +1,19 @@
+/*
+ * Context tracking: Probe on high level context boundaries such as kernel
+ * and userspace. This includes syscalls and exceptions entry/exit.
+ *
+ * This is used by RCU to remove its dependency to the timer tick while a CPU
+ * runs in userspace.
+ *
+ *  Started by Frederic Weisbecker:
+ *
+ * Copyright (C) 2012 Red Hat, Inc., Frederic Weisbecker fweis...@redhat.com
+ *
+ * Many thanks to Gilad Ben-Yossef, Paul McKenney, Ingo Molnar, Andrew Morton,
+ * Steven Rostedt, Peter Zijlstra for suggestions and improvements.
+ *
+ */
+
 #include linux/context_tracking.h
 #include linux/rcupdate.h
 #include linux/sched.h
@@ -6,8 +22,8 @@
 
 struct context_tracking {
/*
-* When active is false, hooks are not set to
-* minimize overhead: TIF flags are cleared
+* When active is false, hooks are unset in order
+* to minimize overhead: TIF flags are cleared
 * and calls to user_enter/exit are ignored. This
 * may be further optimized using static keys.
 */
@@ -24,6 +40,16 @@ static DEFINE_PER_CPU(struct context_tracking, 
context_tracking) = {
 #endif
 };
 
+/**
+ * user_enter - Inform the context tracking that the CPU is going to
+ *  enter in userspace mode.
+ *
+ * This function must be called right before we switch from the kernel
+ * to the user space, when the last remaining kernel instructions to execute
+ * are low level arch code that perform the resuming to userspace.
+ *
+ * This call supports re-entrancy.
+ */
 void user_enter(void)
 {
unsigned long flags;
@@ -39,40 +65,68 @@ void user_enter(void)
if (in_interrupt())
return;
 
+   /* Kernel thread aren't supposed to go to userspace */
WARN_ON_ONCE(!current-mm);
 
local_irq_save(flags);
if (__this_cpu_read(context_tracking.active) 
__this_cpu_read(context_tracking.state) != IN_USER) {
__this_cpu_write(context_tracking.state, IN_USER);
+   /*
+* At this stage, only low level arch entry code remains and
+* then we'll run in userspace. We can assume there won't we
+* any RCU read-side critical section until the next call to
+* user_exit() or rcu_irq_enter(). Let's remove RCU's dependency
+* on the tick.
+*/
rcu_user_enter();
}
local_irq_restore(flags);
 }
 
+
+/**
+ * user_exit - Inform the context tracking that the CPU is
+ * exiting userspace mode and entering the kernel.
+ *
+ * This function must be called right before we run any high level kernel
+ * code (ie: anything that is not low level arch entry code) after we entered
+ * the kernel from userspace.
+ *
+ * This call supports re-entrancy. This way it can be called from any exception
+ * handler without bothering to know if we come from userspace or not.
+ */
 void user_exit(void)
 {
unsigned long flags;
 
-   /*
-* Some contexts may involve an exception occuring in an irq,
-* leading to that nesting:
-* rcu_irq_enter() rcu_user_exit() rcu_user_exit() rcu_irq_exit()
-* This would mess up the dyntick_nesting count though. And rcu_irq_*()
-* helpers are enough to protect RCU uses inside the exception. So
-* just return immediately if we detect we are in an IRQ.
-*/
if (in_interrupt())
return;
 
local_irq_save(flags);
if (__this_cpu_read(context_tracking.state) == IN_USER) {
__this_cpu_write(context_tracking.state, IN_KERNEL);
+   /*
+* We are going to run code that may use RCU. Inform
+* RCU core about that (ie: we may need the tick again).
+*/
rcu_user_exit();
}
local_irq_restore(flags);
 }
 
+
+/**
+ * context_tracking_task_switch - context switch the syscall hooks
+ *
+ * The context tracking uses the syscall slow path to implement its user-kernel
+ * boundaries hooks on syscalls. This way it doesn't impact the syscall fast
+ * path on CPUs

Re: [PATCH] context_tracking: Add comments on interface and internals

2012-12-13 Thread Frederic Weisbecker
2012/12/13 Andrew Morton a...@linux-foundation.org:
 On Thu, 13 Dec 2012 21:57:05 +0100
 Frederic Weisbecker fweis...@gmail.com wrote:

 This subsystem lacks many explanations on its purpose and
 design. Add these missing comments.

 Thanks, it helps.

 --- a/kernel/context_tracking.c
 +++ b/kernel/context_tracking.c
 @@ -1,3 +1,19 @@
 +/*
 + * Context tracking: Probe on high level context boundaries such as kernel
 + * and userspace. This includes syscalls and exceptions entry/exit.
 + *
 + * This is used by RCU to remove its dependency to the timer tick while a 
 CPU
 + * runs in userspace.

 on the timer tick

Oops, will fix, along with the other spelling issues you reported.


 + *
 + *  Started by Frederic Weisbecker:
 + *
 + * Copyright (C) 2012 Red Hat, Inc., Frederic Weisbecker 
 fweis...@redhat.com
 + *
 + * Many thanks to Gilad Ben-Yossef, Paul McKenney, Ingo Molnar, Andrew 
 Morton,
 + * Steven Rostedt, Peter Zijlstra for suggestions and improvements.
 + *
 + */
 +
  #include linux/context_tracking.h
  #include linux/rcupdate.h
  #include linux/sched.h
 @@ -6,8 +22,8 @@

  struct context_tracking {
   /*
 -  * When active is false, hooks are not set to
 -  * minimize overhead: TIF flags are cleared
 +  * When active is false, hooks are unset in order
 +  * to minimize overhead: TIF flags are cleared
* and calls to user_enter/exit are ignored. This
* may be further optimized using static keys.
*/
 @@ -24,6 +40,16 @@ static DEFINE_PER_CPU(struct context_tracking, 
 context_tracking) = {
  #endif
  };

 +/**
 + * user_enter - Inform the context tracking that the CPU is going to
 + *  enter in userspace mode.

 s/in //

 + *
 + * This function must be called right before we switch from the kernel
 + * to the user space, when the last remaining kernel instructions to execute

 s/the user space/userspace/

 + * are low level arch code that perform the resuming to userspace.

 This is a bit vague - what is right before?  What happens if this is
 done a few instructions early?  I mean, what exactly is the requirement
 here?  Might it be something like after the last rcu_foo operation?

 IOW, if the call to user_enter() were moved earlier and earlier, at
 what point would the kernel gain a bug?  What caused that bug?

That's indeed too vague. So as long as RCU is the only user of this,
the only rule is: call user_enter() when you're about to resume
userspace and you're sure there will be no use of RCU until we return
to the kernel. Here the precision of when to call it wrt. the kernel ->
user transition step doesn't matter much. This is only about RCU usage
correctness.

Now this context tracking will soon be used by the cputime subsystem
in order to implement generic tickless cputime accounting. The
location of the probes around the kernel/user transition will then
affect cputime accounting precision. But even there it shouldn't
matter much because this accounting will have per-jiffies
granularity. It may evolve to nanosecond granularity in the future,
but then it will be up to the archs to place the probes closer to the
real kernel/user boundaries.

Anyway, I'll comment on the RCU requirement for now and extend the
comments to explain the cputime precision issue when I add the
cputime bits.


 + * This call supports re-entrancy.

 Presumably the explanation for user_exit() applies here.

Not sure what you mean here.


 + */
  void user_enter(void)
  {
   unsigned long flags;
 @@ -39,40 +65,68 @@ void user_enter(void)
   if (in_interrupt())
   return;

 + /* Kernel thread aren't supposed to go to userspace */

 s/thread/threads/

   WARN_ON_ONCE(!current-mm);

   local_irq_save(flags);
   if (__this_cpu_read(context_tracking.active) 
   __this_cpu_read(context_tracking.state) != IN_USER) {
   __this_cpu_write(context_tracking.state, IN_USER);
 + /*
 +  * At this stage, only low level arch entry code remains and
 +  * then we'll run in userspace. We can assume there won't we

 s/we/be/

 +  * any RCU read-side critical section until the next call to
 +  * user_exit() or rcu_irq_enter(). Let's remove RCU's 
 dependency
 +  * on the tick.
 +  */
   rcu_user_enter();
   }
   local_irq_restore(flags);
  }

 +
 +/**
 + * user_exit - Inform the context tracking that the CPU is
 + * exiting userspace mode and entering the kernel.
 + *
 + * This function must be called right before we run any high level kernel
 + * code (ie: anything that is not low level arch entry code) after we 
 entered
 + * the kernel from userspace.

 Also a very vague spec.

You're right, as for user_enter(), I'll insist on the RCU and cputime
requirements.

[...]
 +/**
 + * context_tracking_task_switch - context switch the syscall hooks
 + *
 + * The context tracking uses the syscall slow path

Re: [PATCH] context_tracking: Add comments on interface and internals

2012-12-13 Thread Frederic Weisbecker
2012/12/14 Andrew Morton a...@linux-foundation.org:
 On Thu, 13 Dec 2012 23:50:23 +0100
 Frederic Weisbecker fweis...@gmail.com wrote:

 
  + * This call supports re-entrancy.
 
  Presumably the explanation for user_exit() applies here.

 Not sure what you mean here.

 It's unclear what it means to say user_enter() supports reentrancy.
 I mean, zillions of kernel functions are surely reentrant - so what?
 It appears that you had something in mind when pointing this out, but
 what was it?  The comment over user_exit() appears to tell us.

Ah ok. Yeah indeed, the fact user_exit() is reentrant is very
important because I have precise usecases in mind. For user_enter() I
don't, so probably I don't need to inform about it.


  It's mainly this bit which makes me wonder why the code is in lib/.  Is
  there any conceivable prospect that any other subsystem will use this
  code for anything?

 So that's because of that cputime accounting on dynticks CPUs which
 will need to know about user/kernel transitions. I'm preparing that
 for the 3.9 merge window.

 Oh.  That's really the entire reason for the patch and should have been
 in the changelog!

I mentioned it in the changelog:

commit 91d1aa43d30505b0b825db8898ffc80a8eca96c7 context_tracking: New
context tracking susbsystem

We need to pull this up from RCU into this new level of indirection
because this tracking is also going to be used to implement an on
demand generic virtual cputime accounting. A necessary step to
shutdown the tick while still accounting the cputime.


Another reason, more implicit this time, was to avoid having RCU
handle those reentrancy issues and all of the context tracking by itself.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] context_tracking: Add comments on interface and internals

2012-12-14 Thread Frederic Weisbecker
This subsystem lacks many explanations on its purpose and
design. Add these missing comments.

v2: Address comments from Andrew

Reported-by: Andrew Morton a...@linux-foundation.org
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Gilad Ben-Yossef gi...@benyossef.com
Cc: Thomas Gleixner t...@linutronix.de
Cc: Andrew Morton a...@linux-foundation.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Li Zhong zh...@linux.vnet.ibm.com
---
 kernel/context_tracking.c |   73 ++--
 1 files changed, 63 insertions(+), 10 deletions(-)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index e0e07fd..9f6c38f 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -1,3 +1,19 @@
+/*
+ * Context tracking: Probe on high level context boundaries such as kernel
+ * and userspace. This includes syscalls and exceptions entry/exit.
+ *
+ * This is used by RCU to remove its dependency on the timer tick while a CPU
+ * runs in userspace.
+ *
+ *  Started by Frederic Weisbecker:
+ *
+ * Copyright (C) 2012 Red Hat, Inc., Frederic Weisbecker fweis...@redhat.com
+ *
+ * Many thanks to Gilad Ben-Yossef, Paul McKenney, Ingo Molnar, Andrew Morton,
+ * Steven Rostedt, Peter Zijlstra for suggestions and improvements.
+ *
+ */
+
 #include linux/context_tracking.h
 #include linux/rcupdate.h
 #include linux/sched.h
@@ -6,8 +22,8 @@
 
 struct context_tracking {
/*
-* When active is false, hooks are not set to
-* minimize overhead: TIF flags are cleared
+* When active is false, hooks are unset in order
+* to minimize overhead: TIF flags are cleared
 * and calls to user_enter/exit are ignored. This
 * may be further optimized using static keys.
 */
@@ -24,6 +40,15 @@ static DEFINE_PER_CPU(struct context_tracking, 
context_tracking) = {
 #endif
 };
 
+/**
+ * user_enter - Inform the context tracking that the CPU is going to
+ *  enter userspace mode.
+ *
+ * This function must be called right before we switch from the kernel
+ * to userspace, when it's guaranteed the remaining kernel instructions
+ * to execute won't use any RCU read side critical section because this
+ * function sets RCU in extended quiescent state.
+ */
 void user_enter(void)
 {
unsigned long flags;
@@ -39,40 +64,68 @@ void user_enter(void)
if (in_interrupt())
return;
 
+   /* Kernel threads aren't supposed to go to userspace */
WARN_ON_ONCE(!current-mm);
 
local_irq_save(flags);
if (__this_cpu_read(context_tracking.active) 
__this_cpu_read(context_tracking.state) != IN_USER) {
__this_cpu_write(context_tracking.state, IN_USER);
+   /*
+* At this stage, only low level arch entry code remains and
+* then we'll run in userspace. We can assume there won't be
+* any RCU read-side critical section until the next call to
+* user_exit() or rcu_irq_enter(). Let's remove RCU's dependency
+* on the tick.
+*/
rcu_user_enter();
}
local_irq_restore(flags);
 }
 
+
+/**
+ * user_exit - Inform the context tracking that the CPU is
+ * exiting userspace mode and entering the kernel.
+ *
+ * This function must be called after we entered the kernel from userspace
+ * before any use of RCU read side critical section. This potentially include
+ * any high level kernel code like syscalls, exceptions, signal handling, 
etc...
+ *
+ * This call supports re-entrancy. This way it can be called from any exception
+ * handler without needing to know if we came from userspace or not.
+ */
 void user_exit(void)
 {
unsigned long flags;
 
-   /*
-* Some contexts may involve an exception occuring in an irq,
-* leading to that nesting:
-* rcu_irq_enter() rcu_user_exit() rcu_user_exit() rcu_irq_exit()
-* This would mess up the dyntick_nesting count though. And rcu_irq_*()
-* helpers are enough to protect RCU uses inside the exception. So
-* just return immediately if we detect we are in an IRQ.
-*/
if (in_interrupt())
return;
 
local_irq_save(flags);
if (__this_cpu_read(context_tracking.state) == IN_USER) {
__this_cpu_write(context_tracking.state, IN_KERNEL);
+   /*
+* We are going to run code that may use RCU. Inform
+* RCU core about that (ie: we may need the tick again).
+*/
rcu_user_exit();
}
local_irq_restore(flags);
 }
 
+
+/**
+ * context_tracking_task_switch - context switch the syscall hooks
+ *
+ * The context tracking uses the syscall slow path to implement its user-kernel

[RFC GIT PULL] printk: Full dynticks support for 3.8

2012-12-17 Thread Frederic Weisbecker
Linus,

We are currently working on extending the dynticks mode to broader contexts
than just idle. Under some conditions, the tick can be avoided on a busy CPU
(no need for preemption when a single task is running, no need for RCU state
machine maintenance while in userspace, etc...).

The most popular application of this is the implementation of CPU isolation.
On HPC workloads, where people run one task per CPU in order to maximize CPU
performance, the kernel gets in the way with these often unnecessary
interrupts.

The result is a performance loss due to stolen CPU time and cache thrashing
of the userspace working set.

Now CPU isolation is the most famous user. I expect more. For example we should 
be able
to avoid the tick when we run in guest mode. And more generally this may be a 
win
for most CPU-bound workloads.

So in order to implement this full dynticks mode, we need to find alternatives
to the many maintenance operations performed periodically and turn them into
more one-shot, event driven solutions.

printk() is part of the problem. It must be safely callable from most places,
and for that purpose it wakes up its readers asynchronously: printk_tick()
probes for pending messages and waiting readers from the timer tick.

Of course, if we use printk() while the tick is stopped, the pending readers
may not be woken up for a while. So a solution to make printk() work even if
the CPU is in dynticks mode is to use the irq_work subsystem, which is able to
fire self-IPIs. When printk() is called, it now enqueues an irq_work that does
the asynchronous wakeup:

* If the tick is stopped, it raises a self-IPI.
* If the tick is running periodically, don't fire a self-IPI but let the next
tick handle the wakeup instead (irq work probes on the timer tick). This avoids
self-IPI storms in case of frequent printk() calls over short periods of time.
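
A rough sketch of the mechanism (the identifiers are illustrative and close
to, but not necessarily identical with, the ones in the patches below;
IRQ_WORK_LAZY is the flag introduced by the "Make self-IPIs optable" patch):

	#include <linux/irq_work.h>
	#include <linux/wait.h>

	static DECLARE_WAIT_QUEUE_HEAD(log_wait);

	/* Runs either from the self-IPI or from the next timer tick */
	static void wake_up_klogd_work_func(struct irq_work *work)
	{
		wake_up_interruptible(&log_wait);
	}

	static struct irq_work wake_up_klogd_work = {
		.func	= wake_up_klogd_work_func,
		.flags	= IRQ_WORK_LAZY,	/* wait for the tick if it runs */
	};

	void wake_up_klogd(void)
	{
		/* Safe from printk(): just queue the work, no direct wakeup */
		irq_work_queue(&wake_up_klogd_work);
	}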

I know this is a sensitive area. We want printk() to stay minimal and not rely 
too much
on other subsystems that add complications and that may use printk themselves.
That's why we chose irq_work because:

- It's pretty small and self-contained
- It's lockless
- It handles most recursion cases (if it uses printk() itself from the IPI
path, this won't fire another IPI)

But because it's sensitive, I'm proposing it as an RFC pull request.

So if you're ok with that, please pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
tags/printk-dynticks-for-linus

HEAD: 74876a98a87a115254b3a66a14b27320b7f0acaa printk: Wake up klogd using 
irq_work

It has been in linux-next.

Thanks.

 
Support for printk in dynticks mode:

* Fix two races in irq work claiming

* Generalize irq_work support to all archs

* Don't stop tick with irq works pending. This
fix is generally useful and concerns archs that
can't raise self IPIs.

* Flush irq works before CPU offlining.

* Introduce lazy irq works that can wait for the
next tick to be executed, unless it's stopped.

* Implement klogd wake up using irq work. This
removes the ad-hoc printk_tick()/printk_needs_cpu()
hooks and make it working even in dynticks mode.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com


Frederic Weisbecker (7):
  irq_work: Fix racy IRQ_WORK_BUSY flag setting
  irq_work: Fix racy check on work pending flag
  irq_work: Remove CONFIG_HAVE_IRQ_WORK
  nohz: Add API to check tick state
  irq_work: Don't stop the tick with pending works
  irq_work: Make self-IPIs optable
  printk: Wake up klogd using irq_work

Steven Rostedt (2):
  irq_work: Flush work on CPU_DYING
  irq_work: Warn if there's still work on cpu_down

 arch/alpha/Kconfig  |1 -
 arch/arm/Kconfig|1 -
 arch/arm64/Kconfig  |1 -
 arch/blackfin/Kconfig   |1 -
 arch/frv/Kconfig|1 -
 arch/hexagon/Kconfig|1 -
 arch/mips/Kconfig   |1 -
 arch/parisc/Kconfig |1 -
 arch/powerpc/Kconfig|1 -
 arch/s390/Kconfig   |1 -
 arch/sh/Kconfig |1 -
 arch/sparc/Kconfig  |1 -
 arch/x86/Kconfig|1 -
 drivers/staging/iio/trigger/Kconfig |1 -
 include/linux/irq_work.h|   20 +
 include/linux/printk.h  |3 -
 include/linux/tick.h|   17 -
 init/Kconfig|5 +-
 kernel/irq_work.c   |  131 ++
 kernel/printk.c |   36 +
 kernel/time/tick-sched.c|7 +-
 kernel/timer.c  |1 -
 22 files changed, 161 insertions(+), 73 deletions(-)

-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel

[PATCH 1/4] vtime: Remove the underscore prefix invasion

2012-11-14 Thread Frederic Weisbecker
Prepending the irq-unsafe vtime APIs with underscores was actually
a bad idea: the result is a big mess in an API namespace that is
about to be further extended. Also, these helpers are always called
from irq-safe callers, except in kvm. Just provide a
vtime_account_system_irqsafe() for that specific case so that we can
remove the underscore prefix from the other vtime functions.
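
The new helper is presumably just an irq-safe wrapper along these lines
(a sketch, not the exact hunk):

	void vtime_account_system_irqsafe(struct task_struct *tsk)
	{
		unsigned long flags;

		local_irq_save(flags);
		vtime_account_system(tsk);
		local_irq_restore(flags);
	}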

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Tony Luck tony.l...@intel.com
Cc: Fenghua Yu fenghua...@intel.com
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Martin Schwidefsky schwidef...@de.ibm.com
Cc: Heiko Carstens heiko.carst...@de.ibm.com
---
 arch/ia64/kernel/time.c|8 
 arch/powerpc/kernel/time.c |4 ++--
 arch/s390/kernel/vtime.c   |4 ++--
 include/linux/kvm_host.h   |4 ++--
 include/linux/vtime.h  |8 
 kernel/sched/cputime.c |   12 ++--
 6 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 5e48503..f638821 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -106,9 +106,9 @@ void vtime_task_switch(struct task_struct *prev)
struct thread_info *ni = task_thread_info(current);
 
if (idle_task(smp_processor_id()) != prev)
-   __vtime_account_system(prev);
+   vtime_account_system(prev);
else
-   __vtime_account_idle(prev);
+   vtime_account_idle(prev);
 
vtime_account_user(prev);
 
@@ -135,14 +135,14 @@ static cputime_t vtime_delta(struct task_struct *tsk)
return delta_stime;
 }
 
-void __vtime_account_system(struct task_struct *tsk)
+void vtime_account_system(struct task_struct *tsk)
 {
cputime_t delta = vtime_delta(tsk);
 
account_system_time(tsk, 0, delta, delta);
 }
 
-void __vtime_account_idle(struct task_struct *tsk)
+void vtime_account_idle(struct task_struct *tsk)
 {
account_idle_time(vtime_delta(tsk));
 }
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 0db456f..ce4cb77 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -336,7 +336,7 @@ static u64 vtime_delta(struct task_struct *tsk,
return delta;
 }
 
-void __vtime_account_system(struct task_struct *tsk)
+void vtime_account_system(struct task_struct *tsk)
 {
u64 delta, sys_scaled, stolen;
 
@@ -346,7 +346,7 @@ void __vtime_account_system(struct task_struct *tsk)
account_steal_time(stolen);
 }
 
-void __vtime_account_idle(struct task_struct *tsk)
+void vtime_account_idle(struct task_struct *tsk)
 {
u64 delta, sys_scaled, stolen;
 
diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index 783e988..80d1dbc 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -140,9 +140,9 @@ void vtime_account(struct task_struct *tsk)
 }
 EXPORT_SYMBOL_GPL(vtime_account);
 
-void __vtime_account_system(struct task_struct *tsk)
+void vtime_account_system(struct task_struct *tsk)
 __attribute__((alias(vtime_account)));
-EXPORT_SYMBOL_GPL(__vtime_account_system);
+EXPORT_SYMBOL_GPL(vtime_account_system);
 
 void __kprobes vtime_stop_cpu(void)
 {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0e2212f..f17158b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -741,7 +741,7 @@ static inline void kvm_guest_enter(void)
 * This is running in ioctl context so we can avoid
 * the call to vtime_account() with its unnecessary idle check.
 */
-   vtime_account_system(current);
+   vtime_account_system_irqsafe(current);
current-flags |= PF_VCPU;
/* KVM does not hold any references to rcu protected data when it
 * switches CPU into a guest mode. In fact switching to a guest mode
@@ -759,7 +759,7 @@ static inline void kvm_guest_exit(void)
 * This is running in ioctl context so we can avoid
 * the call to vtime_account() with its unnecessary idle check.
 */
-   vtime_account_system(current);
+   vtime_account_system_irqsafe(current);
current-flags = ~PF_VCPU;
 }
 
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 0c2a2d3..5ad13c3 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -5,14 +5,14 @@ struct task_struct;
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING
 extern void vtime_task_switch(struct task_struct *prev);
-extern void __vtime_account_system(struct task_struct *tsk);
 extern void vtime_account_system(struct task_struct *tsk);
-extern void __vtime_account_idle(struct task_struct *tsk);
+extern void vtime_account_system_irqsafe(struct task_struct *tsk);
+extern void vtime_account_idle(struct task_struct *tsk);
 extern

[PATCH 2/4] vtime: Explicitly account pending user time on process tick

2012-11-14 Thread Frederic Weisbecker
All vtime implementations just flush the user time on the process
tick. Consolidate that in generic code by calling a user time
accounting helper. This avoids an indirect call in ia64 and
prepares for also consolidating the vtime context switch code.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Tony Luck tony.l...@intel.com
Cc: Fenghua Yu fenghua...@intel.com
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Martin Schwidefsky schwidef...@de.ibm.com
Cc: Heiko Carstens heiko.carst...@de.ibm.com
---
 arch/ia64/kernel/time.c |   11 +--
 arch/powerpc/kernel/time.c  |   14 +++---
 arch/s390/kernel/vtime.c|7 ++-
 include/linux/kernel_stat.h |8 
 include/linux/vtime.h   |1 +
 5 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index f638821..834c78b 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -83,7 +83,7 @@ static struct clocksource *itc_clocksource;
 
 extern cputime_t cycle_to_cputime(u64 cyc);
 
-static void vtime_account_user(struct task_struct *tsk)
+void vtime_account_user(struct task_struct *tsk)
 {
cputime_t delta_utime;
struct thread_info *ti = task_thread_info(tsk);
@@ -147,15 +147,6 @@ void vtime_account_idle(struct task_struct *tsk)
account_idle_time(vtime_delta(tsk));
 }
 
-/*
- * Called from the timer interrupt handler to charge accumulated user time
- * to the current process.  Must be called with interrupts disabled.
- */
-void account_process_tick(struct task_struct *p, int user_tick)
-{
-   vtime_account_user(p);
-}
-
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING */
 
 static irqreturn_t
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index ce4cb77..a667aaf 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -355,15 +355,15 @@ void vtime_account_idle(struct task_struct *tsk)
 }
 
 /*
- * Transfer the user and system times accumulated in the paca
- * by the exception entry and exit code to the generic process
- * user and system time records.
+ * Transfer the user time accumulated in the paca
+ * by the exception entry and exit code to the generic
+ * process user time records.
  * Must be called with interrupts disabled.
- * Assumes that vtime_account() has been called recently
- * (i.e. since the last entry from usermode) so that
+ * Assumes that vtime_account_system/idle() has been called
+ * recently (i.e. since the last entry from usermode) so that
 * get_paca()->user_time_scaled is up to date.
  */
-void account_process_tick(struct task_struct *tsk, int user_tick)
+void vtime_account_user(struct task_struct *tsk)
 {
cputime_t utime, utimescaled;
 
@@ -378,7 +378,7 @@ void account_process_tick(struct task_struct *tsk, int 
user_tick)
 void vtime_task_switch(struct task_struct *prev)
 {
vtime_account(prev);
-   account_process_tick(prev, 0);
+   vtime_account_user(prev);
 }
 
 #else /* ! CONFIG_VIRT_CPU_ACCOUNTING */
diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index 80d1dbc..7c6d861 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -112,7 +112,12 @@ void vtime_task_switch(struct task_struct *prev)
S390_lowcore.system_timer = ti-system_timer;
 }
 
-void account_process_tick(struct task_struct *tsk, int user_tick)
+/*
+ * In s390, accounting pending user time also implies
+ * accounting system time in order to correctly compute
+ * the stolen time accounting.
+ */
+void vtime_account_user(struct task_struct *tsk)
 {
if (do_account_vtime(tsk, HARDIRQ_OFFSET))
virt_timer_expire();
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 1865b1f..66b7078 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -127,7 +127,15 @@ extern void account_system_time(struct task_struct *, int, 
cputime_t, cputime_t)
 extern void account_steal_time(cputime_t);
 extern void account_idle_time(cputime_t);
 
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+static inline void account_process_tick(struct task_struct *tsk, int user)
+{
+   vtime_account_user(tsk);
+}
+#else
 extern void account_process_tick(struct task_struct *, int user);
+#endif
+
 extern void account_steal_ticks(unsigned long ticks);
 extern void account_idle_ticks(unsigned long ticks);
 
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 5ad13c3..ae30ab5 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -8,6 +8,7 @@ extern void vtime_task_switch(struct task_struct *prev);
 extern void vtime_account_system(struct task_struct *tsk);
 extern void vtime_account_system_irqsafe(struct task_struct *tsk);
 extern void vtime_account_idle(struct

[PATCH 4/4] vtime: No need to disable irqs on vtime_account()

2012-11-14 Thread Frederic Weisbecker
vtime_account() is only called from irq entry, where irqs
are always disabled, so we can safely remove the
irq-disabling guards from that function.

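For clarity, the function after this patch reads as follows (reconstructed
from the hunk below; the comment is an addition):

void vtime_account(struct task_struct *tsk)
{
        /* Callers (irq entry/exit paths) guarantee irqs are already off. */
        if (in_interrupt() || !is_idle_task(tsk))
                vtime_account_system(tsk);
        else
                vtime_account_idle(tsk);
}
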
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Tony Luck tony.l...@intel.com
Cc: Fenghua Yu fenghua...@intel.com
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Martin Schwidefsky schwidef...@de.ibm.com
Cc: Heiko Carstens heiko.carst...@de.ibm.com
---
 kernel/sched/cputime.c |6 --
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 2e8d34a..80b2fd5 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -467,16 +467,10 @@ void vtime_task_switch(struct task_struct *prev)
 #ifndef __ARCH_HAS_VTIME_ACCOUNT
 void vtime_account(struct task_struct *tsk)
 {
-   unsigned long flags;
-
-   local_irq_save(flags);
-
if (in_interrupt() || !is_idle_task(tsk))
vtime_account_system(tsk);
else
vtime_account_idle(tsk);
-
-   local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(vtime_account);
 #endif /* __ARCH_HAS_VTIME_ACCOUNT */
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 3/4] vtime: Consolidate a bit the ctx switch code

2012-11-14 Thread Frederic Weisbecker
On ia64 and powerpc, the vtime context switch only consists
of flushing pending system and user time, plus a little
arch housekeeping.

Consolidate that into a generic implementation. s390 is
a special case because pending user and system time accounting
there is hard to dissociate, so it keeps its own implementation.

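In sketch form, an architecture now has two options (reconstructed from the
hunks below; the function body here is only a placeholder, not real arch
code):

/* Option 1: use the generic vtime_task_switch() and only provide the
 * residual housekeeping hook (ia64 and powerpc do this): */
void arch_vtime_task_switch(struct task_struct *prev)
{
        /* copy/reset arch accounting stamps for the incoming task */
}

/* Option 2: keep a private vtime_task_switch(), opting out in
 * asm/cputime.h (s390 does this): */
#define __ARCH_HAS_VTIME_TASK_SWITCH
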
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Tony Luck tony.l...@intel.com
Cc: Fenghua Yu fenghua...@intel.com
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Martin Schwidefsky schwidef...@de.ibm.com
Cc: Heiko Carstens heiko.carst...@de.ibm.com
---
 arch/ia64/include/asm/cputime.h|2 ++
 arch/ia64/kernel/time.c|9 +
 arch/powerpc/include/asm/cputime.h |2 ++
 arch/powerpc/kernel/time.c |6 --
 arch/s390/include/asm/cputime.h|1 +
 kernel/sched/cputime.c |   13 +
 6 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/arch/ia64/include/asm/cputime.h b/arch/ia64/include/asm/cputime.h
index 3deac95..7fcf7f0 100644
--- a/arch/ia64/include/asm/cputime.h
+++ b/arch/ia64/include/asm/cputime.h
@@ -103,5 +103,7 @@ static inline void cputime_to_timeval(const cputime_t ct, 
struct timeval *val)
 #define cputime64_to_clock_t(__ct) \
cputime_to_clock_t((__force cputime_t)__ct)
 
+extern void arch_vtime_task_switch(struct task_struct *tsk);
+
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING */
 #endif /* __IA64_CPUTIME_H */
diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 834c78b..c9a7d2e 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -100,18 +100,11 @@ void vtime_account_user(struct task_struct *tsk)
  * accumulated times to the current process, and to prepare accounting on
  * the next process.
  */
-void vtime_task_switch(struct task_struct *prev)
+void arch_vtime_task_switch(struct task_struct *prev)
 {
struct thread_info *pi = task_thread_info(prev);
struct thread_info *ni = task_thread_info(current);
 
-   if (idle_task(smp_processor_id()) != prev)
-   vtime_account_system(prev);
-   else
-   vtime_account_idle(prev);
-
-   vtime_account_user(prev);
-
pi-ac_stamp = ni-ac_stamp;
ni-ac_stime = ni-ac_utime = 0;
 }
diff --git a/arch/powerpc/include/asm/cputime.h 
b/arch/powerpc/include/asm/cputime.h
index 487d46f..483733b 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -228,6 +228,8 @@ static inline cputime_t clock_t_to_cputime(const unsigned 
long clk)
 
 #define cputime64_to_clock_t(ct)   cputime_to_clock_t((cputime_t)(ct))
 
+static inline void arch_vtime_task_switch(struct task_struct *tsk) { }
+
 #endif /* __KERNEL__ */
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING */
 #endif /* __POWERPC_CPUTIME_H */
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index a667aaf..3486cfa 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -375,12 +375,6 @@ void vtime_account_user(struct task_struct *tsk)
account_user_time(tsk, utime, utimescaled);
 }
 
-void vtime_task_switch(struct task_struct *prev)
-{
-   vtime_account(prev);
-   vtime_account_user(prev);
-}
-
 #else /* ! CONFIG_VIRT_CPU_ACCOUNTING */
 #define calc_cputime_factors()
 #endif
diff --git a/arch/s390/include/asm/cputime.h b/arch/s390/include/asm/cputime.h
index 023d5ae..d2ff4137 100644
--- a/arch/s390/include/asm/cputime.h
+++ b/arch/s390/include/asm/cputime.h
@@ -14,6 +14,7 @@
 
 
 #define __ARCH_HAS_VTIME_ACCOUNT
+#define __ARCH_HAS_VTIME_TASK_SWITCH
 
 /* We want to use full resolution of the CPU timer: 2**-12 micro-seconds. */
 
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index c0aa1ba..2e8d34a 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -443,6 +443,19 @@ void vtime_account_system_irqsafe(struct task_struct *tsk)
 }
 EXPORT_SYMBOL_GPL(vtime_account_system_irqsafe);
 
+#ifndef __ARCH_HAS_VTIME_TASK_SWITCH
+void vtime_task_switch(struct task_struct *prev)
+{
+   if (is_idle_task(prev))
+   vtime_account_idle(prev);
+   else
+   vtime_account_system(prev);
+
+   vtime_account_user(prev);
+   arch_vtime_task_switch(prev);
+}
+#endif
+
 /*
  * Archs that account the whole time spent in the idle task
  * (outside irq) as idle time can rely on this and just implement
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 0/4] cputime: Even more cleanups

2012-11-14 Thread Frederic Weisbecker
Hi,

While working on full dynticks, I realized some more cleanups needed to be
done. Here they are. If no comments arise, I'll send a pull request to Ingo
in a week.

Thanks.

Frederic Weisbecker (4):
  vtime: Remove the underscore prefix invasion
  vtime: Explicitly account pending user time on process tick
  vtime: Consolidate a bit the ctx switch code
  vtime: No need to disable irqs on vtime_account()

 arch/ia64/include/asm/cputime.h|2 ++
 arch/ia64/kernel/time.c|   24 
 arch/powerpc/include/asm/cputime.h |2 ++
 arch/powerpc/kernel/time.c |   22 --
 arch/s390/include/asm/cputime.h|1 +
 arch/s390/kernel/vtime.c   |   11 ---
 include/linux/kernel_stat.h|8 
 include/linux/kvm_host.h   |4 ++--
 include/linux/vtime.h  |9 +
 kernel/sched/cputime.c |   31 +++
 10 files changed, 59 insertions(+), 55 deletions(-)

-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/4] vtime: Remove the underscore prefix invasion

2012-11-14 Thread Frederic Weisbecker
2012/11/14 Steven Rostedt rost...@goodmis.org:
 On Wed, 2012-11-14 at 17:26 +0100, Frederic Weisbecker wrote:
 Prepending irq-unsafe vtime APIs with underscores was actually
 a bad idea, as the result is a big mess in the API namespace that
 is only waiting to be extended further. Also, these helpers
 are always called from irq-safe callers, except in kvm. Just
 provide a vtime_account_system_irqsafe() for this specific
 case so that we can remove the underscore prefix on the other
 vtime functions.



 -void __vtime_account_system(struct task_struct *tsk)
 +void vtime_account_system(struct task_struct *tsk)
  {
   cputime_t delta = vtime_delta(tsk);

 Should we add a WARN_ON(!irqs_disabled()) check here?

Why not, I'll add one in vtime_delta().
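
Something like the following, as a minimal sketch (the existing delta
computation is unchanged; __vtime_delta() is a hypothetical stand-in for it,
and the placement is only a proposal at this point):

static cputime_t vtime_delta(struct task_struct *tsk)
{
        /* All vtime accounting helpers rely on the caller having irqs off. */
        WARN_ON_ONCE(!irqs_disabled());

        return __vtime_delta(tsk);      /* hypothetical: the existing code */
}
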
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/4] vtime: No need to disable irqs on vtime_account()

2012-11-14 Thread Frederic Weisbecker
2012/11/14 Steven Rostedt rost...@goodmis.org:
 On Wed, 2012-11-14 at 17:26 +0100, Frederic Weisbecker wrote:
 vtime_account() is only called from irq entry. irqs
 are always disabled at this point so we can safely
 remove the irq disabling guards on that function.

 Signed-off-by: Frederic Weisbecker fweis...@gmail.com
 Cc: Peter Zijlstra pet...@infradead.org
 Cc: Ingo Molnar mi...@kernel.org
 Cc: Thomas Gleixner t...@linutronix.de
 Cc: Steven Rostedt rost...@goodmis.org
 Cc: Paul Gortmaker paul.gortma...@windriver.com
 Cc: Tony Luck tony.l...@intel.com
 Cc: Fenghua Yu fenghua...@intel.com
 Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
 Cc: Paul Mackerras pau...@samba.org
 Cc: Martin Schwidefsky schwidef...@de.ibm.com
 Cc: Heiko Carstens heiko.carst...@de.ibm.com
 ---
  kernel/sched/cputime.c |6 --
  1 files changed, 0 insertions(+), 6 deletions(-)

 diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
 index 2e8d34a..80b2fd5 100644
 --- a/kernel/sched/cputime.c
 +++ b/kernel/sched/cputime.c
 @@ -467,16 +467,10 @@ void vtime_task_switch(struct task_struct *prev)
  #ifndef __ARCH_HAS_VTIME_ACCOUNT
  void vtime_account(struct task_struct *tsk)
  {
 - unsigned long flags;
 -
 - local_irq_save(flags);
 -

 I'd add a WARN_ON_ONCE(!irqs_disabled()) again here, or is this also
 covered by the vtime_delta()?

Yeah it's the ending point for both vtime_account_system() and
vtime_account_idle()
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] printk: Make it usable on nohz cpus

2012-11-14 Thread Frederic Weisbecker
Ingo,

Please pull the printk support in dynticks mode patches that can
be found at:

  git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git 
tags/printk-dynticks-for-mingo

This branch is based on top of v3.7-rc4
Head is 2fb933986dcef2db1344712162a1feb8d5736ff8:

  printk: Wake up klogd using irq_work (2012-11-14 17:45:37 +0100)

Since last version, very few things have changed:

* Added acks from Steve on patches 1/7 and 2/7
* Fixed an arch_needs_cpu() redefinition due to misordered headers (reported
by Wu Fengguang).

If you get further acks from Peterz or anybody, feel free to cherry pick
the patches instead. Or I can rebase my patches to add them, either way.

Thanks.


Support for printk in dynticks mode:

* Fix two races in irq work claiming

* Generalize irq_work support to all archs

* Don't stop tick with irq works pending. This
fix is generally useful and concerns archs that
can't raise self IPIs.

* Introduce lazy irq works that can wait for the
next tick to be executed, unless it's stopped.

* Implement klogd wake up using irq work. This
removes the ad-hoc printk_tick()/printk_needs_cpu()
hooks and make it working even in dynticks mode.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com


Frederic Weisbecker (7):
  irq_work: Fix racy IRQ_WORK_BUSY flag setting
  irq_work: Fix racy check on work pending flag
  irq_work: Remove CONFIG_HAVE_IRQ_WORK
  nohz: Add API to check tick state
  irq_work: Don't stop the tick with pending works
  irq_work: Make self-IPIs optable
  printk: Wake up klogd using irq_work

 arch/alpha/Kconfig  |1 -
 arch/arm/Kconfig|1 -
 arch/arm64/Kconfig  |1 -
 arch/blackfin/Kconfig   |1 -
 arch/frv/Kconfig|1 -
 arch/hexagon/Kconfig|1 -
 arch/mips/Kconfig   |1 -
 arch/parisc/Kconfig |1 -
 arch/powerpc/Kconfig|1 -
 arch/s390/Kconfig   |1 -
 arch/sh/Kconfig |1 -
 arch/sparc/Kconfig  |1 -
 arch/x86/Kconfig|1 -
 drivers/staging/iio/trigger/Kconfig |1 -
 include/linux/irq_work.h|   20 +
 include/linux/printk.h  |3 --
 include/linux/tick.h|   17 +++-
 init/Kconfig|5 +--
 kernel/irq_work.c   |   76 +++
 kernel/printk.c |   36 +
 kernel/time/tick-sched.c|7 ++--
 kernel/timer.c  |1 -
 22 files changed, 112 insertions(+), 67 deletions(-)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/7] irq_work: Fix racy IRQ_WORK_BUSY flag setting

2012-11-14 Thread Frederic Weisbecker
The IRQ_WORK_BUSY flag is set right before we execute the
work. Once this flag value is set, the work enters a
claimable state again.

So if we have specific data to compute in our work, we ensure it's
either handled by another CPU or locally by enqueuing the work again.
This state machine is guaranteed by atomic operations on the flags.

So when we set IRQ_WORK_BUSY without using an xchg-like operation,
we break this guarantee as in the following summarized scenario:

CPU 1   CPU 2
-   -
(flags = 0)
old_flags = flags;
(flags = 0)
cmpxchg(flags, old_flags,
old_flags | IRQ_WORK_FLAGS)
(flags = 3)
[...]
flags = IRQ_WORK_BUSY
(flags = 2)
func()
(sees flags = 3)
cmpxchg(flags, old_flags,
old_flags | 
IRQ_WORK_FLAGS)
(give up)

cmpxchg(flags, 2, 0);
(flags = 0)

CPU 1 claims a work and executes it, so it sets IRQ_WORK_BUSY and
the work is again in a claimable state. Now CPU 2 has new data to process
and tries to claim that work, but it may see a stale value of the flags
and think the work is still pending somewhere that will handle its data.
This is because CPU 1 doesn't set IRQ_WORK_BUSY atomically.

As a result, the data CPU 2 expected to be handled won't get handled.

To fix this, use xchg() to set IRQ_WORK_BUSY; this way we ensure that CPU 2
will see the correct value, with cmpxchg() providing the expected ordering.

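The before/after difference being relied on, shown side by side (a generic
illustration of the hunk below, not additional code):

/* Before: a plain store; not atomic, no ordering guarantee, so another
 * CPU's cmpxchg() may still observe the stale PENDING value. */
work->flags = IRQ_WORK_BUSY;

/* After: an atomic read-modify-write with full barrier semantics, so a
 * concurrent cmpxchg() is guaranteed to see IRQ_WORK_BUSY. */
(void)xchg(&work->flags, IRQ_WORK_BUSY);
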
Changelog-heavily-inspired-by: Steven Rostedt rost...@goodmis.org
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Acked-by: Steven Rostedt rost...@goodmis.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Andrew Morton a...@linux-foundation.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Anish Kumar anish198519851...@gmail.com
---
 kernel/irq_work.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 1588e3b..57be1a6 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -119,8 +119,11 @@ void irq_work_run(void)
/*
 * Clear the PENDING bit, after this point the @work
 * can be re-used.
+* Make it immediately visible so that other CPUs trying
+* to claim that work don't rely on us to handle their data
+* while we are in the middle of the func.
 */
-   work->flags = IRQ_WORK_BUSY;
+   xchg(&work->flags, IRQ_WORK_BUSY);
work-func(work);
/*
 * Clear the BUSY bit and return to the free state if
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 4/7] nohz: Add API to check tick state

2012-11-14 Thread Frederic Weisbecker
We need some quick way to check if the CPU has stopped
its tick. This will be useful to implement the printk tick
using the irq work subsystem.

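A minimal usage sketch (the real user is the irq_work self-IPI decision later
in this series; the helper name kick_if_tick_stopped() is made up):

static void kick_if_tick_stopped(void)
{
        /*
         * Defer non-urgent processing to the next tick, but kick the CPU
         * immediately if the tick is stopped and won't fire on its own.
         */
        if (tick_nohz_tick_stopped())
                arch_irq_work_raise();
}
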
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/tick.h |   17 -
 kernel/time/tick-sched.c |2 +-
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index f37fceb..2307dd3 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -8,6 +8,8 @@
 
 #include linux/clockchips.h
 #include linux/irqflags.h
+#include linux/percpu.h
+#include linux/hrtimer.h
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 
@@ -122,13 +124,26 @@ static inline int tick_oneshot_mode_active(void) { return 
0; }
 #endif /* !CONFIG_GENERIC_CLOCKEVENTS */
 
 # ifdef CONFIG_NO_HZ
+DECLARE_PER_CPU(struct tick_sched, tick_cpu_sched);
+
+static inline int tick_nohz_tick_stopped(void)
+{
+   return __this_cpu_read(tick_cpu_sched.tick_stopped);
+}
+
 extern void tick_nohz_idle_enter(void);
 extern void tick_nohz_idle_exit(void);
 extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
-# else
+
+# else /* !CONFIG_NO_HZ */
+static inline int tick_nohz_tick_stopped(void)
+{
+   return 0;
+}
+
 static inline void tick_nohz_idle_enter(void) { }
 static inline void tick_nohz_idle_exit(void) { }
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a402608..9e945aa 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -28,7 +28,7 @@
 /*
  * Per cpu nohz control structure
  */
-static DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched);
+DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched);
 
 /*
  * The time, when the last jiffy update happened. Protected by xtime_lock.
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 5/7] irq_work: Don't stop the tick with pending works

2012-11-14 Thread Frederic Weisbecker
Don't stop the tick if we have pending irq works on the
queue; otherwise, if the arch can't raise self-IPIs, we may not
find an opportunity to execute the pending works for a while.

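For context on "archs that can't raise self-IPIs": arch_irq_work_raise() is a
weak symbol whose default implementation does nothing, roughly as below
(quoted from kernel/irq_work.c of this era, from memory), so those
architectures only get their irq works run from the timer tick:

/*
 * Lame architectures will get the timer tick callback
 */
void __weak arch_irq_work_raise(void)
{
}
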
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/irq_work.h |6 ++
 kernel/irq_work.c|   11 +++
 kernel/time/tick-sched.c |3 ++-
 3 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index 6a9e8f5..a69704f 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -20,4 +20,10 @@ bool irq_work_queue(struct irq_work *work);
 void irq_work_run(void);
 void irq_work_sync(struct irq_work *work);
 
+#ifdef CONFIG_IRQ_WORK
+bool irq_work_needs_cpu(void);
+#else
+static bool irq_work_needs_cpu(void) { return false; }
+#endif
+
 #endif /* _LINUX_IRQ_WORK_H */
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 64eddd5..b3c113a 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -99,6 +99,17 @@ bool irq_work_queue(struct irq_work *work)
 }
 EXPORT_SYMBOL_GPL(irq_work_queue);
 
+bool irq_work_needs_cpu(void)
+{
+   struct llist_head *this_list;
+
+   this_list = __get_cpu_var(irq_work_list);
+   if (llist_empty(this_list))
+   return false;
+
+   return true;
+}
+
 /*
  * Run the irq_work entries on this cpu. Requires to be ran from hardirq
  * context with local IRQs disabled.
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9e945aa..f249e8c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -20,6 +20,7 @@
 #include linux/profile.h
 #include linux/sched.h
 #include linux/module.h
+#include linux/irq_work.h
 
 #include asm/irq_regs.h
 
@@ -289,7 +290,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched 
*ts,
} while (read_seqretry(xtime_lock, seq));
 
if (rcu_needs_cpu(cpu, rcu_delta_jiffies) || printk_needs_cpu(cpu) ||
-   arch_needs_cpu(cpu)) {
+   arch_needs_cpu(cpu) || irq_work_needs_cpu()) {
next_jiffies = last_jiffies + 1;
delta_jiffies = 1;
} else {
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/7] irq_work: Fix racy check on work pending flag

2012-11-14 Thread Frederic Weisbecker
Work claiming wants to be SMP-safe.

And by the time we try to claim a work, if it is already executing
concurrently on another CPU, we want the claim to succeed and the work
to be queued again, because the other CPU may have missed the data we
wanted our work to handle if it's about to complete there.

This scenario is summarized below:

CPU 1   CPU 2
-   -
(flags = 0)
cmpxchg(flags, 0, IRQ_WORK_FLAGS)
(flags = 3)
[...]
xchg(flags, IRQ_WORK_BUSY)
(flags = 2)
func()
if (flags  IRQ_WORK_PENDING)
(not true)
cmpxchg(flags, flags, 
IRQ_WORK_FLAGS)
(flags = 3)
[...]
cmpxchg(flags, IRQ_WORK_BUSY, 0);
(fail, pending on CPU 2)

This state machine is synchronized using [cmp]xchg() on the flags.
As such, the early IRQ_WORK_PENDING check in CPU 2 above is racy.
By the time we check it, we may be dealing with a stale value because
we aren't using an atomic accessor. As a result, CPU 2 may see
that the work is still pending on another CPU while that CPU may
actually already be completing the work function's execution,
leaving our data unprocessed.

To fix this, we start by speculating about the value we wish to see
in work->flags, but we only draw conclusions from the value
returned by the cmpxchg() call, which either claims the work or lets
the current owner handle the pending work for us.

Changelog-heavily-inspired-by: Steven Rostedt rost...@goodmis.org
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Acked-by: Steven Rostedt rost...@goodmis.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Andrew Morton a...@linux-foundation.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Anish Kumar anish198519851...@gmail.com
---
 kernel/irq_work.c |   16 +++-
 1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 57be1a6..64eddd5 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -34,15 +34,21 @@ static DEFINE_PER_CPU(struct llist_head, irq_work_list);
  */
 static bool irq_work_claim(struct irq_work *work)
 {
-   unsigned long flags, nflags;
+   unsigned long flags, oflags, nflags;
 
+   /*
+* Start with our best wish as a premise but only trust any
+* flag value after cmpxchg() result.
+*/
+   flags = work->flags & ~IRQ_WORK_PENDING;
for (;;) {
-   flags = work->flags;
-   if (flags & IRQ_WORK_PENDING)
-   return false;
nflags = flags | IRQ_WORK_FLAGS;
-   if (cmpxchg(&work->flags, flags, nflags) == flags)
+   oflags = cmpxchg(&work->flags, flags, nflags);
+   if (oflags == flags)
break;
+   if (oflags & IRQ_WORK_PENDING)
+   return false;
+   flags = oflags;
cpu_relax();
}
 
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 7/7] printk: Wake up klogd using irq_work

2012-11-14 Thread Frederic Weisbecker
klogd is woken up asynchronously from the tick in order
to do it safely.

However, if printk is called when the tick is stopped, the reader
won't be woken up until the next interrupt, which might not fire
for a while. As a result, the user may miss some messages.

To fix this, let's implement the printk tick using a lazy irq work.
That subsystem takes care of the timer tick state and can
fix things up accordingly.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/printk.h   |3 ---
 init/Kconfig |1 +
 kernel/printk.c  |   36 
 kernel/time/tick-sched.c |2 +-
 kernel/timer.c   |1 -
 5 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/include/linux/printk.h b/include/linux/printk.h
index 9afc01e..86c4b62 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -98,9 +98,6 @@ int no_printk(const char *fmt, ...)
 extern asmlinkage __printf(1, 2)
 void early_printk(const char *fmt, ...);
 
-extern int printk_needs_cpu(int cpu);
-extern void printk_tick(void);
-
 #ifdef CONFIG_PRINTK
 asmlinkage __printf(5, 0)
 int vprintk_emit(int facility, int level,
diff --git a/init/Kconfig b/init/Kconfig
index cdc152c..c575566 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1196,6 +1196,7 @@ config HOTPLUG
 config PRINTK
default y
bool Enable support for printk if EXPERT
+   select IRQ_WORK
help
  This option enables normal printk support. Removing it
  eliminates most of the message strings from the kernel image
diff --git a/kernel/printk.c b/kernel/printk.c
index 2d607f4..c9104fe 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -42,6 +42,7 @@
 #include linux/notifier.h
 #include linux/rculist.h
 #include linux/poll.h
+#include linux/irq_work.h
 
 #include asm/uaccess.h
 
@@ -1955,30 +1956,32 @@ int is_console_locked(void)
 static DEFINE_PER_CPU(int, printk_pending);
 static DEFINE_PER_CPU(char [PRINTK_BUF_SIZE], printk_sched_buf);
 
-void printk_tick(void)
+static void wake_up_klogd_work_func(struct irq_work *irq_work)
 {
-   if (__this_cpu_read(printk_pending)) {
-   int pending = __this_cpu_xchg(printk_pending, 0);
-   if (pending  PRINTK_PENDING_SCHED) {
-   char *buf = __get_cpu_var(printk_sched_buf);
-   printk(KERN_WARNING [sched_delayed] %s, buf);
-   }
-   if (pending  PRINTK_PENDING_WAKEUP)
-   wake_up_interruptible(log_wait);
+   int pending = __this_cpu_xchg(printk_pending, 0);
+
+   if (pending  PRINTK_PENDING_SCHED) {
+   char *buf = __get_cpu_var(printk_sched_buf);
+   printk(KERN_WARNING [sched_delayed] %s, buf);
}
-}
 
-int printk_needs_cpu(int cpu)
-{
-   if (cpu_is_offline(cpu))
-   printk_tick();
-   return __this_cpu_read(printk_pending);
+   if (pending  PRINTK_PENDING_WAKEUP)
+   wake_up_interruptible(log_wait);
 }
 
+static DEFINE_PER_CPU(struct irq_work, wake_up_klogd_work) = {
+   .func = wake_up_klogd_work_func,
+   .flags = IRQ_WORK_LAZY,
+};
+
 void wake_up_klogd(void)
 {
-   if (waitqueue_active(log_wait))
+   preempt_disable();
+   if (waitqueue_active(log_wait)) {
this_cpu_or(printk_pending, PRINTK_PENDING_WAKEUP);
+   irq_work_queue(__get_cpu_var(wake_up_klogd_work));
+   }
+   preempt_enable();
 }
 
 static void console_cont_flush(char *text, size_t size)
@@ -2458,6 +2461,7 @@ int printk_sched(const char *fmt, ...)
va_end(args);
 
__this_cpu_or(printk_pending, PRINTK_PENDING_SCHED);
+   irq_work_queue(__get_cpu_var(wake_up_klogd_work));
local_irq_restore(flags);
 
return r;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f249e8c..822d757 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -289,7 +289,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched 
*ts,
time_delta = timekeeping_max_deferment();
} while (read_seqretry(xtime_lock, seq));
 
-   if (rcu_needs_cpu(cpu, rcu_delta_jiffies) || printk_needs_cpu(cpu) ||
+   if (rcu_needs_cpu(cpu, rcu_delta_jiffies) ||
arch_needs_cpu(cpu) || irq_work_needs_cpu()) {
next_jiffies = last_jiffies + 1;
delta_jiffies = 1;
diff --git a/kernel/timer.c b/kernel/timer.c
index 367d008..ff3b516 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1351,7 +1351,6 @@ void update_process_times(int user_tick)
account_process_tick(p, user_tick);
run_local_timers();
rcu_check_callbacks(cpu, user_tick);
-   printk_tick();
 #ifdef

[PATCH 6/7] irq_work: Make self-IPIs optable

2012-11-14 Thread Frederic Weisbecker
On irq work initialization, let the user choose whether to define it
as lazy or not. Lazy means that we don't want to send
an IPI (provided the arch can send one at all) when we enqueue this
work, but would rather wait for the next timer tick
to execute it if possible.

This is going to benefit non-urgent enqueuers
(like printk in the future) that may prefer not to raise
an IPI storm in case of frequent enqueuing over short periods
of time.

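Usage, in sketch form (mirroring what the printk patch in this series does;
my_work, my_work_func and my_poke are made-up names):

static void my_work_func(struct irq_work *work)
{
        /* non-urgent, deferred processing */
}

static DEFINE_PER_CPU(struct irq_work, my_work) = {
        .func  = my_work_func,
        .flags = IRQ_WORK_LAZY,
};

static void my_poke(void)
{
        /* enqueue from any context, including NMI; no self-IPI is sent
         * unless the tick is stopped on this CPU */
        irq_work_queue(&__get_cpu_var(my_work));
}
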
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/irq_work.h |   14 ++
 kernel/irq_work.c|   46 ++
 2 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index a69704f..b28eb60 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -3,6 +3,20 @@
 
 #include linux/llist.h
 
+/*
+ * An entry can be in one of four states:
+ *
+ * free NULL, 0 - {claimed}   : free to be used
+ * claimed   NULL, 3 - {pending}   : claimed to be enqueued
+ * pending   next, 3 - {busy}  : queued, pending callback
+ * busy  NULL, 2 - {free, claimed} : callback in progress, can be claimed
+ */
+
+#define IRQ_WORK_PENDING   1UL
+#define IRQ_WORK_BUSY  2UL
+#define IRQ_WORK_FLAGS 3UL
+#define IRQ_WORK_LAZY  4UL /* Doesn't want IPI, wait for tick */
+
 struct irq_work {
unsigned long flags;
struct llist_node llnode;
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index b3c113a..65c65dc 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -12,22 +12,13 @@
 #include linux/percpu.h
 #include linux/hardirq.h
 #include linux/irqflags.h
+#include linux/sched.h
+#include linux/tick.h
 #include asm/processor.h
 
-/*
- * An entry can be in one of four states:
- *
- * free NULL, 0 - {claimed}   : free to be used
- * claimed   NULL, 3 - {pending}   : claimed to be enqueued
- * pending   next, 3 - {busy}  : queued, pending callback
- * busy  NULL, 2 - {free, claimed} : callback in progress, can be claimed
- */
-
-#define IRQ_WORK_PENDING   1UL
-#define IRQ_WORK_BUSY  2UL
-#define IRQ_WORK_FLAGS 3UL
 
 static DEFINE_PER_CPU(struct llist_head, irq_work_list);
+static DEFINE_PER_CPU(int, irq_work_raised);
 
 /*
  * Claim the entry so that no one else will poke at it.
@@ -67,14 +58,18 @@ void __weak arch_irq_work_raise(void)
  */
 static void __irq_work_queue(struct irq_work *work)
 {
-   bool empty;
-
preempt_disable();
 
-   empty = llist_add(&work->llnode, &__get_cpu_var(irq_work_list));
-   /* The list was empty, raise self-interrupt to start processing. */
-   if (empty)
-   arch_irq_work_raise();
+   llist_add(&work->llnode, &__get_cpu_var(irq_work_list));
+
+   /*
+* If the work is flagged as lazy, just wait for the next tick
+* to run it. Otherwise, or if the tick is stopped, raise the irq work.
+*/
+   if (!(work->flags & IRQ_WORK_LAZY) || tick_nohz_tick_stopped()) {
+   if (!this_cpu_cmpxchg(irq_work_raised, 0, 1))
+   arch_irq_work_raise();
+   }
 
preempt_enable();
 }
@@ -116,10 +111,19 @@ bool irq_work_needs_cpu(void)
  */
 void irq_work_run(void)
 {
+   unsigned long flags;
struct irq_work *work;
struct llist_head *this_list;
struct llist_node *llnode;
 
+
+   /*
+* Reset the raised state right before we check the list because
+* an NMI may enqueue after we find the list empty from the runner.
+*/
+   __this_cpu_write(irq_work_raised, 0);
+   barrier();
+
this_list = __get_cpu_var(irq_work_list);
if (llist_empty(this_list))
return;
@@ -140,13 +144,15 @@ void irq_work_run(void)
 * to claim that work don't rely on us to handle their data
 * while we are in the middle of the func.
 */
-   xchg(&work->flags, IRQ_WORK_BUSY);
+   flags = work->flags & ~IRQ_WORK_PENDING;
+   xchg(&work->flags, flags);
+
work-func(work);
/*
 * Clear the BUSY bit and return to the free state if
 * no-one else claimed it meanwhile.
 */
-   (void)cmpxchg(&work->flags, IRQ_WORK_BUSY, 0);
+   (void)cmpxchg(&work->flags, flags, flags & ~IRQ_WORK_BUSY);
}
 }
 EXPORT_SYMBOL_GPL(irq_work_run);
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read

[PATCH 3/7] irq_work: Remove CONFIG_HAVE_IRQ_WORK

2012-11-14 Thread Frederic Weisbecker
irq work can run on any arch, even without IPI
support, because of the hook on update_process_times().

So let's remove HAVE_IRQ_WORK, because it doesn't reflect
any backend requirement.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 arch/alpha/Kconfig  |1 -
 arch/arm/Kconfig|1 -
 arch/arm64/Kconfig  |1 -
 arch/blackfin/Kconfig   |1 -
 arch/frv/Kconfig|1 -
 arch/hexagon/Kconfig|1 -
 arch/mips/Kconfig   |1 -
 arch/parisc/Kconfig |1 -
 arch/powerpc/Kconfig|1 -
 arch/s390/Kconfig   |1 -
 arch/sh/Kconfig |1 -
 arch/sparc/Kconfig  |1 -
 arch/x86/Kconfig|1 -
 drivers/staging/iio/trigger/Kconfig |1 -
 init/Kconfig|4 
 15 files changed, 0 insertions(+), 18 deletions(-)

diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 5dd7f5d..e56c2d1 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -5,7 +5,6 @@ config ALPHA
select HAVE_IDE
select HAVE_OPROFILE
select HAVE_SYSCALL_WRAPPERS
-   select HAVE_IRQ_WORK
select HAVE_PCSPKR_PLATFORM
select HAVE_PERF_EVENTS
select HAVE_DMA_ATTRS
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index ade7e92..22d378b 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -36,7 +36,6 @@ config ARM
select HAVE_GENERIC_HARDIRQS
select HAVE_HW_BREAKPOINT if (PERF_EVENTS  (CPU_V6 || CPU_V6K || 
CPU_V7))
select HAVE_IDE if PCI || ISA || PCMCIA
-   select HAVE_IRQ_WORK
select HAVE_KERNEL_GZIP
select HAVE_KERNEL_LZMA
select HAVE_KERNEL_LZO
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index ef54a59..dd50d72 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -17,7 +17,6 @@ config ARM64
select HAVE_GENERIC_DMA_COHERENT
select HAVE_GENERIC_HARDIRQS
select HAVE_HW_BREAKPOINT if PERF_EVENTS
-   select HAVE_IRQ_WORK
select HAVE_MEMBLOCK
select HAVE_PERF_EVENTS
select HAVE_SPARSE_IRQ
diff --git a/arch/blackfin/Kconfig b/arch/blackfin/Kconfig
index b6f3ad5..86f891f 100644
--- a/arch/blackfin/Kconfig
+++ b/arch/blackfin/Kconfig
@@ -24,7 +24,6 @@ config BLACKFIN
select HAVE_FUNCTION_TRACER
select HAVE_FUNCTION_TRACE_MCOUNT_TEST
select HAVE_IDE
-   select HAVE_IRQ_WORK
select HAVE_KERNEL_GZIP if RAMKERNEL
select HAVE_KERNEL_BZIP2 if RAMKERNEL
select HAVE_KERNEL_LZMA if RAMKERNEL
diff --git a/arch/frv/Kconfig b/arch/frv/Kconfig
index df2eb4b..c44fd6e 100644
--- a/arch/frv/Kconfig
+++ b/arch/frv/Kconfig
@@ -3,7 +3,6 @@ config FRV
default y
select HAVE_IDE
select HAVE_ARCH_TRACEHOOK
-   select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select HAVE_UID16
select HAVE_GENERIC_HARDIRQS
diff --git a/arch/hexagon/Kconfig b/arch/hexagon/Kconfig
index 0744f7d..40a3185 100644
--- a/arch/hexagon/Kconfig
+++ b/arch/hexagon/Kconfig
@@ -14,7 +14,6 @@ config HEXAGON
# select HAVE_CLK
# select IRQ_PER_CPU
# select GENERIC_PENDING_IRQ if SMP
-   select HAVE_IRQ_WORK
select GENERIC_ATOMIC64
select HAVE_PERF_EVENTS
select HAVE_GENERIC_HARDIRQS
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index dba9390..3d86d69 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -4,7 +4,6 @@ config MIPS
select HAVE_GENERIC_DMA_COHERENT
select HAVE_IDE
select HAVE_OPROFILE
-   select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select PERF_USE_VMALLOC
select HAVE_ARCH_KGDB
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 11def45..8f0df47 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -9,7 +9,6 @@ config PARISC
select RTC_DRV_GENERIC
select INIT_ALL_POSSIBLE
select BUG
-   select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select GENERIC_ATOMIC64 if !64BIT
select HAVE_GENERIC_HARDIRQS
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index a902a5c..a90f0c9 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -118,7 +118,6 @@ config PPC
select HAVE_SYSCALL_WRAPPERS if PPC64
select GENERIC_ATOMIC64 if PPC32
select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE
-   select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_HW_BREAKPOINT if PERF_EVENTS  PPC_BOOK3S_64
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 5dba755..0816ff0 100644

Re: [PATCH 7/7] printk: Wake up klogd using irq_work

2012-11-15 Thread Frederic Weisbecker
2012/11/15 Steven Rostedt rost...@goodmis.org:
 On Wed, 2012-11-14 at 21:37 +0100, Frederic Weisbecker wrote:
 diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
 index f249e8c..822d757 100644
 --- a/kernel/time/tick-sched.c
 +++ b/kernel/time/tick-sched.c
 @@ -289,7 +289,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct 
 tick_sched *ts,
   time_delta = timekeeping_max_deferment();
   } while (read_seqretry(xtime_lock, seq));

 - if (rcu_needs_cpu(cpu, rcu_delta_jiffies) || printk_needs_cpu(cpu) ||
 + if (rcu_needs_cpu(cpu, rcu_delta_jiffies) ||

 If the CPU is going offline, the printk_tick() would be executed here.
 But now that printk_tick() is done with the irq_work code, it won't be
 executed till the next tick.  Could this cause a missed printk because
 of this, if the cpu is going offline?

 Actually, how does irq_work in general handle cpu offline work?

Good point, and that's not trivial to solve.

The hotplug down sequence does:

-
CPU that offlines   CPU offlining
-
-
cpu_down() {
__stop_machine(take_cpu_down)

take_cpu_down() {

__cpu_disable() {

* disable irqs in hw

* clear from online mask
   }

move all tasks somewhere
   }
while (!idle_cpu(offlining))
cpu_relax()

cpu_die();
-

So the offlining CPU goes idle in the end, once irqs are disabled at
the apic level. Does that include the timer tick? If so, then the last
resort for going offline without irq works in the queue is to make
take_cpu_down() ask for a retry if there are pending irq works during
its execution.

Now if we have printk() calls between __cpu_disable() and the idle
loop, they will be lost until the next onlining, unless we do an
explicit call to printk_tick() from the idle loop when the CPU is
offline.

Note that !CONFIG_NO_HZ doesn't seem to handle that, which makes me
wonder whether the tick is really part of the whole IRQ disablement done
in __cpu_disable().
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 7/7] printk: Wake up klogd using irq_work

2012-11-15 Thread Frederic Weisbecker
2012/11/15 Frederic Weisbecker fweis...@gmail.com:
 -
 CPU that offilines  CPU offlining
 -
 -
 cpu_down() {
 __stop_machine(take_cpu_down)

 take_cpu_down() {

 __cpu_disable() {

 * disable irqs in hw

 * clear from online mask
}

 move all tasks somewhere
}
 while (!idle_cpu(offlining))
 cpu_relax()

 cpu_die();
 -

Oh, thanks gmail for the mess. Sometimes it mangles contents, sometimes
not. Probably depends on whether the moon is odd or even.

Here is a pastebin: http://pastebin.com/aACvyu6p
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 5/7] irq_work: Don't stop the tick with pending works

2012-11-15 Thread Frederic Weisbecker
2012/11/15 Steven Rostedt rost...@goodmis.org:
 On Wed, 2012-11-14 at 21:37 +0100, Frederic Weisbecker wrote:
 diff --git a/kernel/irq_work.c b/kernel/irq_work.c
 index 64eddd5..b3c113a 100644
 --- a/kernel/irq_work.c
 +++ b/kernel/irq_work.c
 @@ -99,6 +99,17 @@ bool irq_work_queue(struct irq_work *work)
  }
  EXPORT_SYMBOL_GPL(irq_work_queue);

 +bool irq_work_needs_cpu(void)
 +{
 + struct llist_head *this_list;
 +
 + this_list = __get_cpu_var(irq_work_list);
 + if (llist_empty(this_list))
 + return false;
 +

 I wonder if this should just be:

 return !llist_empty(this_cpu_read(irq_work_list));

Yeah I'll simplify that way.
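
i.e. roughly (a sketch; the exact per-cpu accessor that ends up being used is
a detail):

bool irq_work_needs_cpu(void)
{
        return !llist_empty(&__get_cpu_var(irq_work_list));
}
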
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] irq_work: Flush work on CPU_DYING (was: Re: [PATCH 7/7] printk: Wake up klogd using irq_work)

2012-11-15 Thread Frederic Weisbecker
2012/11/15 Steven Rostedt rost...@goodmis.org:
 On Thu, 2012-11-15 at 16:25 +0100, Frederic Weisbecker wrote:
 2012/11/15 Steven Rostedt rost...@goodmis.org:
  On Wed, 2012-11-14 at 21:37 +0100, Frederic Weisbecker wrote:
  diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
  index f249e8c..822d757 100644
  --- a/kernel/time/tick-sched.c
  +++ b/kernel/time/tick-sched.c
  @@ -289,7 +289,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct 
  tick_sched *ts,
time_delta = timekeeping_max_deferment();
} while (read_seqretry(xtime_lock, seq));
 
  - if (rcu_needs_cpu(cpu, rcu_delta_jiffies) || printk_needs_cpu(cpu) 
  ||
  + if (rcu_needs_cpu(cpu, rcu_delta_jiffies) ||
 
  If the CPU is going offline, the printk_tick() would be executed here.
  But now that printk_tick() is done with the irq_work code, it wont be
  executed till the next tick.  Could this cause a missed printk because
  of this, if the cpu is going offline?
 
  Actually, how does irq_work in general handle cpu offline work?

 Good point, and that's not trivial to solve.

 The hotplug down sequence does:

 -
 CPU that offilines  CPU offlining
 -
 -
 cpu_down() {
 __stop_machine(take_cpu_down)

 take_cpu_down() {

 __cpu_disable() {

 * disable irqs in hw

 * clear from online mask
}

 move all tasks somewhere
}
 while (!idle_cpu(offlining))
 cpu_relax()

 cpu_die();
 -

 So the offlining CPU goes to idle in the end once irqs are disabled in
 the apic level. Does that include the timer tick? If so then the last
 resort to offline without irq works in the queue is to make
 take_cpu_down() ask for a retry if there are pending irq works during
 its execution.

 Now if we have printk() calls between __cpu_disable() and the idle
 loop, they will be lost until the next onlining. Unless we do an
 explicit call to printk_tick() from the idle loop if the CPU is
 offline.

 Note that !CONFIG_NO_HZ doesn't seem to handle that. Which makes me
 wonder if the tick is really part of the whole IRQ disablement done in
 __cpu_disable().


 How about flushing all irq_work from CPU_DYING. The notifier is called
 by stop_machine on the CPU that is going down. Grant you, the code will
 not be called from irq context (so things like get_irq_regs() won't work)
 but I'm not sure what the requirements are for irq_work in that regard
 (Peter?). But irqs are disabled and the CPU is about to go offline.
 Might as well flush the work.

 I ran this against my stress_cpu_hotplug script (attached) and it seemed
 to work fine. I even did a:

   perf record ./stress-cpu-hotplug

 Signed-off-by: Steven Rostedt rost...@goodmis.org

 Index: linux-rt.git/kernel/irq_work.c
 ===
 --- linux-rt.git.orig/kernel/irq_work.c
 +++ linux-rt.git/kernel/irq_work.c
 @@ -14,6 +14,7 @@
  #include linux/irqflags.h
  #include linux/sched.h
  #include linux/tick.h
 +#include linux/cpu.h
  #include asm/processor.h


 @@ -105,11 +106,7 @@ bool irq_work_needs_cpu(void)
 return true;
  }

 -/*
 - * Run the irq_work entries on this cpu. Requires to be ran from hardirq
 - * context with local IRQs disabled.
 - */
 -void irq_work_run(void)
 +static void __irq_work_run(void)
  {
 unsigned long flags;
 struct irq_work *work;
 @@ -128,7 +125,6 @@ void irq_work_run(void)
 if (llist_empty(this_list))
 return;

 -   BUG_ON(!in_irq());
 BUG_ON(!irqs_disabled());

 llnode = llist_del_all(this_list);
 @@ -155,8 +151,23 @@ void irq_work_run(void)
 (void)cmpxchg(work-flags, flags, flags  ~IRQ_WORK_BUSY);
 }
  }
 +
 +/*
 + * Run the irq_work entries on this cpu. Requires to be ran from hardirq
 + * context with local IRQs disabled.
 + */
 +void irq_work_run(void)
 +{
 +   BUG_ON(!in_irq());
 +   __irq_work_run();
 +}
  EXPORT_SYMBOL_GPL(irq_work_run);

 +static void irq_work_run_cpu_down(void)
 +{
 +   __irq_work_run();
 +}
 +
  /*
   * Synchronize against the irq_work @entry, ensures the entry is not
   * currently in use.
 @@ -169,3 +180,35 @@ void irq_work_sync(struct irq_work *work
 cpu_relax();
  }
  EXPORT_SYMBOL_GPL(irq_work_sync);
 +
 +#ifdef CONFIG_HOTPLUG_CPU
 +static int irq_work_cpu_notify(struct notifier_block *self,
 +  unsigned long action, void *hcpu)
 +{
 +   long cpu = (long)hcpu;
 +
 +   switch (action) {
 +   case CPU_DYING:

Looks good. Perf has already deactivated the cpu wide events on
CPU_DOWN_PREPARE. I suspect it's the only irq work enqueuer from NMI.
At this stage of cpu down hotplug, irqs are deactivated so the last
possible enqueuers before the CPU goes idle/down are from subsequent

Re: [PATCH RFC] irq_work: Warn if there's still work on cpu_down

2012-11-15 Thread Frederic Weisbecker
2012/11/15 Steven Rostedt rost...@goodmis.org:
 If we are in nohz and there's still irq_work to be done when the idle
 task is about to go offline, give a nasty warning.

 Signed-off-by: Steven Rostedt rost...@goodmis.org

 Index: linux-rt.git/kernel/irq_work.c
 ===
 --- linux-rt.git.orig/kernel/irq_work.c
 +++ linux-rt.git/kernel/irq_work.c
 @@ -103,6 +103,9 @@ bool irq_work_needs_cpu(void)
 if (llist_empty(this_list))
 return false;

 +   /* All work should have been flushed before going offline */
 +   WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));

Should we return false in that case? I don't know what can happen if
we wait for one more tick while the CPU is offline and the apic is
deactivated.

 +
 return true;
  }

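The alternative being weighed here would look roughly like this (a sketch of
the option under discussion, not what was merged):

bool irq_work_needs_cpu(void)
{
        if (llist_empty(&__get_cpu_var(irq_work_list)))
                return false;

        /* All work should have been flushed before going offline */
        if (WARN_ON_ONCE(cpu_is_offline(smp_processor_id())))
                return false;   /* don't ask a dead CPU to keep its tick alive */

        return true;
}
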


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/9] irq_work: Fix racy IRQ_WORK_BUSY flag setting

2012-11-15 Thread Frederic Weisbecker
The IRQ_WORK_BUSY flag is set right before we execute the
work. Once this flag value is set, the work enters a
claimable state again.

So if we have specific data to compute in our work, we ensure it's
either handled by another CPU or locally by enqueuing the work again.
This state machine is guaranteed by atomic operations on the flags.

So when we set IRQ_WORK_BUSY without using an xchg-like operation,
we break this guarantee as in the following summarized scenario:

CPU 1   CPU 2
-   -
(flags = 0)
old_flags = flags;
(flags = 0)
cmpxchg(flags, old_flags,
old_flags | IRQ_WORK_FLAGS)
(flags = 3)
[...]
flags = IRQ_WORK_BUSY
(flags = 2)
func()
(sees flags = 3)
cmpxchg(flags, old_flags,
old_flags | 
IRQ_WORK_FLAGS)
(give up)

cmpxchg(flags, 2, 0);
(flags = 0)

CPU 1 claims a work and executes it, so it sets IRQ_WORK_BUSY and
the work is again in a claimable state. Now CPU 2 has new data to process
and tries to claim that work, but it may see a stale value of the flags
and think the work is still pending somewhere that will handle its data.
This is because CPU 1 doesn't set IRQ_WORK_BUSY atomically.

As a result, the data CPU 2 expected to be handled won't get handled.

To fix this, use xchg() to set IRQ_WORK_BUSY; this way we ensure that CPU 2
will see the correct value, with cmpxchg() providing the expected ordering.

Changelog-heavily-inspired-by: Steven Rostedt rost...@goodmis.org
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Acked-by: Steven Rostedt rost...@goodmis.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Andrew Morton a...@linux-foundation.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Anish Kumar anish198519851...@gmail.com
---
 kernel/irq_work.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 1588e3b..57be1a6 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -119,8 +119,11 @@ void irq_work_run(void)
/*
 * Clear the PENDING bit, after this point the @work
 * can be re-used.
+* Make it immediately visible so that other CPUs trying
+* to claim that work don't rely on us to handle their data
+* while we are in the middle of the func.
 */
-   work->flags = IRQ_WORK_BUSY;
+   xchg(&work->flags, IRQ_WORK_BUSY);
work-func(work);
/*
 * Clear the BUSY bit and return to the free state if
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 4/9] nohz: Add API to check tick state

2012-11-15 Thread Frederic Weisbecker
We need some quick way to check if the CPU has stopped
its tick. This will be useful to implement the printk tick
using the irq work subsystem.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/tick.h |   17 -
 kernel/time/tick-sched.c |2 +-
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index f37fceb..2307dd3 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -8,6 +8,8 @@
 
 #include linux/clockchips.h
 #include linux/irqflags.h
+#include linux/percpu.h
+#include linux/hrtimer.h
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 
@@ -122,13 +124,26 @@ static inline int tick_oneshot_mode_active(void) { return 
0; }
 #endif /* !CONFIG_GENERIC_CLOCKEVENTS */
 
 # ifdef CONFIG_NO_HZ
+DECLARE_PER_CPU(struct tick_sched, tick_cpu_sched);
+
+static inline int tick_nohz_tick_stopped(void)
+{
+   return __this_cpu_read(tick_cpu_sched.tick_stopped);
+}
+
 extern void tick_nohz_idle_enter(void);
 extern void tick_nohz_idle_exit(void);
 extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
-# else
+
+# else /* !CONFIG_NO_HZ */
+static inline int tick_nohz_tick_stopped(void)
+{
+   return 0;
+}
+
 static inline void tick_nohz_idle_enter(void) { }
 static inline void tick_nohz_idle_exit(void) { }
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a402608..9e945aa 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -28,7 +28,7 @@
 /*
  * Per cpu nohz control structure
  */
-static DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched);
+DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched);
 
 /*
  * The time, when the last jiffy update happened. Protected by xtime_lock.
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 5/9] irq_work: Don't stop the tick with pending works

2012-11-15 Thread Frederic Weisbecker
Don't stop the tick if we have pending irq works on the
queue; otherwise, if the arch can't raise self-IPIs, we may not
find an opportunity to execute the pending works for a while.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/irq_work.h |6 ++
 kernel/irq_work.c|   11 +++
 kernel/time/tick-sched.c |3 ++-
 3 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index 6a9e8f5..a69704f 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -20,4 +20,10 @@ bool irq_work_queue(struct irq_work *work);
 void irq_work_run(void);
 void irq_work_sync(struct irq_work *work);
 
+#ifdef CONFIG_IRQ_WORK
+bool irq_work_needs_cpu(void);
+#else
+static bool irq_work_needs_cpu(void) { return false; }
+#endif
+
 #endif /* _LINUX_IRQ_WORK_H */
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 64eddd5..b3c113a 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -99,6 +99,17 @@ bool irq_work_queue(struct irq_work *work)
 }
 EXPORT_SYMBOL_GPL(irq_work_queue);
 
+bool irq_work_needs_cpu(void)
+{
+   struct llist_head *this_list;
+
+   this_list = __get_cpu_var(irq_work_list);
+   if (llist_empty(this_list))
+   return false;
+
+   return true;
+}
+
 /*
  * Run the irq_work entries on this cpu. Requires to be ran from hardirq
  * context with local IRQs disabled.
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9e945aa..f249e8c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -20,6 +20,7 @@
 #include linux/profile.h
 #include linux/sched.h
 #include linux/module.h
+#include linux/irq_work.h
 
 #include asm/irq_regs.h
 
@@ -289,7 +290,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched 
*ts,
} while (read_seqretry(xtime_lock, seq));
 
if (rcu_needs_cpu(cpu, rcu_delta_jiffies) || printk_needs_cpu(cpu) ||
-   arch_needs_cpu(cpu)) {
+   arch_needs_cpu(cpu) || irq_work_needs_cpu()) {
next_jiffies = last_jiffies + 1;
delta_jiffies = 1;
} else {
-- 
1.7.5.4



[PATCH 7/9] irq_work: Warn if there's still work on cpu_down

2012-11-15 Thread Frederic Weisbecker
From: Steven Rostedt rost...@goodmis.org

If we are in nohz and there's still irq_work to be done when the idle
task is about to go offline, give a nasty warning. Everything should
have been flushed from the CPU_DYING notifier already. Further attempts
to enqueue an irq_work are buggy because irqs are disabled by
__cpu_disable(). The best we can do is to report the issue to the user.

Signed-off-by: Steven Rostedt rost...@goodmis.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
---
 kernel/irq_work.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index cf8b657..fcaadae 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -108,6 +108,9 @@ bool irq_work_needs_cpu(void)
if (llist_empty(this_list))
return false;
 
+   /* All work should have been flushed before going offline */
+   WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
+
return true;
 }
 
-- 
1.7.5.4



[PATCH 9/9] printk: Wake up klogd using irq_work

2012-11-15 Thread Frederic Weisbecker
klogd is woken up asynchronously from the tick in order
to do it safely.

However if printk is called when the tick is stopped, the reader
won't be woken up until the next interrupt, which might not fire
for a while. As a result, the user may miss some messages.

To fix this, let's implement the printk tick using a lazy irq work.
This subsystem takes care of the timer tick state and can
fix things up accordingly.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/printk.h   |3 ---
 init/Kconfig |1 +
 kernel/printk.c  |   36 
 kernel/time/tick-sched.c |2 +-
 kernel/timer.c   |1 -
 5 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/include/linux/printk.h b/include/linux/printk.h
index 9afc01e..86c4b62 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -98,9 +98,6 @@ int no_printk(const char *fmt, ...)
 extern asmlinkage __printf(1, 2)
 void early_printk(const char *fmt, ...);
 
-extern int printk_needs_cpu(int cpu);
-extern void printk_tick(void);
-
 #ifdef CONFIG_PRINTK
 asmlinkage __printf(5, 0)
 int vprintk_emit(int facility, int level,
diff --git a/init/Kconfig b/init/Kconfig
index cdc152c..c575566 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1196,6 +1196,7 @@ config HOTPLUG
 config PRINTK
default y
bool Enable support for printk if EXPERT
+   select IRQ_WORK
help
  This option enables normal printk support. Removing it
  eliminates most of the message strings from the kernel image
diff --git a/kernel/printk.c b/kernel/printk.c
index 2d607f4..c9104fe 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -42,6 +42,7 @@
 #include linux/notifier.h
 #include linux/rculist.h
 #include linux/poll.h
+#include linux/irq_work.h
 
 #include asm/uaccess.h
 
@@ -1955,30 +1956,32 @@ int is_console_locked(void)
 static DEFINE_PER_CPU(int, printk_pending);
 static DEFINE_PER_CPU(char [PRINTK_BUF_SIZE], printk_sched_buf);
 
-void printk_tick(void)
+static void wake_up_klogd_work_func(struct irq_work *irq_work)
 {
-   if (__this_cpu_read(printk_pending)) {
-   int pending = __this_cpu_xchg(printk_pending, 0);
-   if (pending  PRINTK_PENDING_SCHED) {
-   char *buf = __get_cpu_var(printk_sched_buf);
-   printk(KERN_WARNING [sched_delayed] %s, buf);
-   }
-   if (pending  PRINTK_PENDING_WAKEUP)
-   wake_up_interruptible(log_wait);
+   int pending = __this_cpu_xchg(printk_pending, 0);
+
+   if (pending  PRINTK_PENDING_SCHED) {
+   char *buf = __get_cpu_var(printk_sched_buf);
+   printk(KERN_WARNING [sched_delayed] %s, buf);
}
-}
 
-int printk_needs_cpu(int cpu)
-{
-   if (cpu_is_offline(cpu))
-   printk_tick();
-   return __this_cpu_read(printk_pending);
+   if (pending  PRINTK_PENDING_WAKEUP)
+   wake_up_interruptible(log_wait);
 }
 
+static DEFINE_PER_CPU(struct irq_work, wake_up_klogd_work) = {
+   .func = wake_up_klogd_work_func,
+   .flags = IRQ_WORK_LAZY,
+};
+
 void wake_up_klogd(void)
 {
-   if (waitqueue_active(log_wait))
+   preempt_disable();
+   if (waitqueue_active(log_wait)) {
this_cpu_or(printk_pending, PRINTK_PENDING_WAKEUP);
+   irq_work_queue(__get_cpu_var(wake_up_klogd_work));
+   }
+   preempt_enable();
 }
 
 static void console_cont_flush(char *text, size_t size)
@@ -2458,6 +2461,7 @@ int printk_sched(const char *fmt, ...)
va_end(args);
 
__this_cpu_or(printk_pending, PRINTK_PENDING_SCHED);
+   irq_work_queue(__get_cpu_var(wake_up_klogd_work));
local_irq_restore(flags);
 
return r;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f249e8c..822d757 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -289,7 +289,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched 
*ts,
time_delta = timekeeping_max_deferment();
} while (read_seqretry(xtime_lock, seq));
 
-   if (rcu_needs_cpu(cpu, rcu_delta_jiffies) || printk_needs_cpu(cpu) ||
+   if (rcu_needs_cpu(cpu, rcu_delta_jiffies) ||
arch_needs_cpu(cpu) || irq_work_needs_cpu()) {
next_jiffies = last_jiffies + 1;
delta_jiffies = 1;
diff --git a/kernel/timer.c b/kernel/timer.c
index 367d008..ff3b516 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1351,7 +1351,6 @@ void update_process_times(int user_tick)
account_process_tick(p, user_tick);
run_local_timers();
rcu_check_callbacks(cpu, user_tick);
-   printk_tick();
 #ifdef

[PATCH 8/9] irq_work: Make self-IPIs optable

2012-11-15 Thread Frederic Weisbecker
On irq work initialization, let the user choose to define it
as lazy or not. Lazy means that we don't want to send
an IPI (provided the arch can raise one at all) when we enqueue
this work, but would rather wait for the next timer tick
to execute our work if possible.

This is going to benefit non-urgent enqueuers
(like printk in the future) that may prefer not to raise
an IPI storm in case of frequent enqueuing over short periods
of time.
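
For illustration, a caller could opt in to the lazy behaviour roughly
as below. This is only a sketch: the work name, callback and helper are
made up, while IRQ_WORK_LAZY, the per-cpu definition style and
irq_work_queue() come from this series (the printk patch later in the
thread uses the same pattern).

	#include <linux/kernel.h>
	#include <linux/percpu.h>
	#include <linux/preempt.h>
	#include <linux/irq_work.h>

	/* Hypothetical client of the new flag */
	static void my_lazy_work_func(struct irq_work *work)
	{
		/*
		 * Runs from the next tick, or from the irq work IPI
		 * when this CPU has its tick stopped.
		 */
		pr_info("deferred work ran\n");
	}

	static DEFINE_PER_CPU(struct irq_work, my_lazy_work) = {
		.func  = my_lazy_work_func,
		.flags = IRQ_WORK_LAZY,
	};

	static void kick_deferred_work(void)
	{
		/* Mirror wake_up_klogd(): stay on this CPU while queueing */
		preempt_disable();
		irq_work_queue(&__get_cpu_var(my_lazy_work));
		preempt_enable();
	}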

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/irq_work.h |   14 ++
 kernel/irq_work.c|   46 ++
 2 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index a69704f..b28eb60 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -3,6 +3,20 @@
 
 #include linux/llist.h
 
+/*
+ * An entry can be in one of four states:
+ *
+ * free NULL, 0 - {claimed}   : free to be used
+ * claimed   NULL, 3 - {pending}   : claimed to be enqueued
+ * pending   next, 3 - {busy}  : queued, pending callback
+ * busy  NULL, 2 - {free, claimed} : callback in progress, can be claimed
+ */
+
+#define IRQ_WORK_PENDING   1UL
+#define IRQ_WORK_BUSY  2UL
+#define IRQ_WORK_FLAGS 3UL
+#define IRQ_WORK_LAZY  4UL /* Doesn't want IPI, wait for tick */
+
 struct irq_work {
unsigned long flags;
struct llist_node llnode;
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index fcaadae..cef098d 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -12,23 +12,14 @@
 #include linux/percpu.h
 #include linux/hardirq.h
 #include linux/irqflags.h
+#include linux/sched.h
+#include linux/tick.h
 #include linux/cpu.h
 #include asm/processor.h
 
-/*
- * An entry can be in one of four states:
- *
- * free NULL, 0 - {claimed}   : free to be used
- * claimed   NULL, 3 - {pending}   : claimed to be enqueued
- * pending   next, 3 - {busy}  : queued, pending callback
- * busy  NULL, 2 - {free, claimed} : callback in progress, can be claimed
- */
-
-#define IRQ_WORK_PENDING   1UL
-#define IRQ_WORK_BUSY  2UL
-#define IRQ_WORK_FLAGS 3UL
 
 static DEFINE_PER_CPU(struct llist_head, irq_work_list);
+static DEFINE_PER_CPU(int, irq_work_raised);
 
 /*
  * Claim the entry so that no one else will poke at it.
@@ -68,14 +59,18 @@ void __weak arch_irq_work_raise(void)
  */
 static void __irq_work_queue(struct irq_work *work)
 {
-   bool empty;
-
preempt_disable();
 
-   empty = llist_add(work-llnode, __get_cpu_var(irq_work_list));
-   /* The list was empty, raise self-interrupt to start processing. */
-   if (empty)
-   arch_irq_work_raise();
+   llist_add(work-llnode, __get_cpu_var(irq_work_list));
+
+   /*
+* If the work is flagged as lazy, just wait for the next tick
+* to run it. Otherwise, or if the tick is stopped, raise the irq work.
+*/
+   if (!(work-flags  IRQ_WORK_LAZY) || tick_nohz_tick_stopped()) {
+   if (!this_cpu_cmpxchg(irq_work_raised, 0, 1))
+   arch_irq_work_raise();
+   }
 
preempt_enable();
 }
@@ -116,10 +111,19 @@ bool irq_work_needs_cpu(void)
 
 static void __irq_work_run(void)
 {
+   unsigned long flags;
struct irq_work *work;
struct llist_head *this_list;
struct llist_node *llnode;
 
+
+   /*
+* Reset the raised state right before we check the list because
+* an NMI may enqueue after we find the list empty from the runner.
+*/
+   __this_cpu_write(irq_work_raised, 0);
+   barrier();
+
this_list = __get_cpu_var(irq_work_list);
if (llist_empty(this_list))
return;
@@ -139,13 +143,15 @@ static void __irq_work_run(void)
 * to claim that work don't rely on us to handle their data
 * while we are in the middle of the func.
 */
-   xchg(work-flags, IRQ_WORK_BUSY);
+   flags = work-flags  ~IRQ_WORK_PENDING;
+   xchg(work-flags, flags);
+
work-func(work);
/*
 * Clear the BUSY bit and return to the free state if
 * no-one else claimed it meanwhile.
 */
-   (void)cmpxchg(work-flags, IRQ_WORK_BUSY, 0);
+   (void)cmpxchg(work-flags, flags, flags  ~IRQ_WORK_BUSY);
}
 }
 
-- 
1.7.5.4


[PATCH 0/9] printk: Make it usable on nohz cpus v6

2012-11-15 Thread Frederic Weisbecker
Hi,

Previous patches haven't changed. This pile just adds two patches from
Steven Rostedt to ensure all pending irq works are executed before we
offline a CPU.

The branch can be found at:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
nohz/printk-v6

Thanks.

Frederic Weisbecker (7):
  irq_work: Fix racy IRQ_WORK_BUSY flag setting
  irq_work: Fix racy check on work pending flag
  irq_work: Remove CONFIG_HAVE_IRQ_WORK
  nohz: Add API to check tick state
  irq_work: Don't stop the tick with pending works
  irq_work: Make self-IPIs optable
  printk: Wake up klogd using irq_work

Steven Rostedt (2):
  irq_work: Flush work on CPU_DYING
  irq_work: Warn if there's still work on cpu_down

 arch/alpha/Kconfig  |1 -
 arch/arm/Kconfig|1 -
 arch/arm64/Kconfig  |1 -
 arch/blackfin/Kconfig   |1 -
 arch/frv/Kconfig|1 -
 arch/hexagon/Kconfig|1 -
 arch/mips/Kconfig   |1 -
 arch/parisc/Kconfig |1 -
 arch/powerpc/Kconfig|1 -
 arch/s390/Kconfig   |1 -
 arch/sh/Kconfig |1 -
 arch/sparc/Kconfig  |1 -
 arch/x86/Kconfig|1 -
 drivers/staging/iio/trigger/Kconfig |1 -
 include/linux/irq_work.h|   20 ++
 include/linux/printk.h  |3 -
 include/linux/tick.h|   17 -
 init/Kconfig|5 +-
 kernel/irq_work.c   |  129 ++
 kernel/printk.c |   36 ++
 kernel/time/tick-sched.c|7 +-
 kernel/timer.c  |1 -
 22 files changed, 159 insertions(+), 73 deletions(-)

-- 
1.7.5.4



[PATCH 2/9] irq_work: Fix racy check on work pending flag

2012-11-15 Thread Frederic Weisbecker
Work claiming wants to be SMP-safe.

And by the time we try to claim a work, if it is already executing
concurrently on another CPU, we want to succeed the claiming and queue
the work again because the other CPU may have missed the data we wanted
to handle in our work if it's about to complete there.

This scenario is summarized below:

CPU 1   CPU 2
-   -
(flags = 0)
cmpxchg(flags, 0, IRQ_WORK_FLAGS)
(flags = 3)
[...]
xchg(flags, IRQ_WORK_BUSY)
(flags = 2)
func()
if (flags  IRQ_WORK_PENDING)
(not true)
cmpxchg(flags, flags, 
IRQ_WORK_FLAGS)
(flags = 3)
[...]
cmpxchg(flags, IRQ_WORK_BUSY, 0);
(fail, pending on CPU 2)

This state machine is synchronized using [cmp]xchg() on the flags.
As such, the early IRQ_WORK_PENDING check in CPU 2 above is racy.
By the time we check it, we may be dealing with a stale value because
we aren't using an atomic accessor. As a result, CPU 2 may see
that the work is still pending on another CPU while that CPU may
actually be completing the work function execution already, leaving
our data unprocessed.

To fix this, we start by speculating about the value we wish to find
in work->flags, but we only draw conclusions from the value
returned by the cmpxchg() call, which either claims the work or lets
the current owner handle the pending work for us.

Changelog-heavily-inspired-by: Steven Rostedt rost...@goodmis.org
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Acked-by: Steven Rostedt rost...@goodmis.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Andrew Morton a...@linux-foundation.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Anish Kumar anish198519851...@gmail.com
---
 kernel/irq_work.c |   16 +++-
 1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 57be1a6..64eddd5 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -34,15 +34,21 @@ static DEFINE_PER_CPU(struct llist_head, irq_work_list);
  */
 static bool irq_work_claim(struct irq_work *work)
 {
-   unsigned long flags, nflags;
+   unsigned long flags, oflags, nflags;
 
+   /*
+* Start with our best wish as a premise but only trust any
+* flag value after cmpxchg() result.
+*/
+   flags = work-flags  ~IRQ_WORK_PENDING;
for (;;) {
-   flags = work-flags;
-   if (flags  IRQ_WORK_PENDING)
-   return false;
nflags = flags | IRQ_WORK_FLAGS;
-   if (cmpxchg(work-flags, flags, nflags) == flags)
+   oflags = cmpxchg(work-flags, flags, nflags);
+   if (oflags == flags)
break;
+   if (oflags  IRQ_WORK_PENDING)
+   return false;
+   flags = oflags;
cpu_relax();
}
 
-- 
1.7.5.4



[PATCH 6/9] irq_work: Flush work on CPU_DYING

2012-11-15 Thread Frederic Weisbecker
From: Steven Rostedt rost...@goodmis.org

In order not to offline a CPU with pending irq works, flush the
queue from CPU_DYING. The notifier is called by stop_machine on
the CPU that is going down. The code will not be called from irq context
(so things like get_irq_regs() won't work) but I'm not sure what the
requirements are for irq_work in that regard (Peter?). But irqs are
disabled and the CPU is about to go offline. Might as well flush the work.

Signed-off-by: Steven Rostedt rost...@goodmis.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
---
 kernel/irq_work.c |   50 --
 1 files changed, 44 insertions(+), 6 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index b3c113a..cf8b657 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -12,6 +12,7 @@
 #include linux/percpu.h
 #include linux/hardirq.h
 #include linux/irqflags.h
+#include linux/cpu.h
 #include asm/processor.h
 
 /*
@@ -110,11 +111,7 @@ bool irq_work_needs_cpu(void)
return true;
 }
 
-/*
- * Run the irq_work entries on this cpu. Requires to be ran from hardirq
- * context with local IRQs disabled.
- */
-void irq_work_run(void)
+static void __irq_work_run(void)
 {
struct irq_work *work;
struct llist_head *this_list;
@@ -124,7 +121,6 @@ void irq_work_run(void)
if (llist_empty(this_list))
return;
 
-   BUG_ON(!in_irq());
BUG_ON(!irqs_disabled());
 
llnode = llist_del_all(this_list);
@@ -149,6 +145,16 @@ void irq_work_run(void)
(void)cmpxchg(work-flags, IRQ_WORK_BUSY, 0);
}
 }
+
+/*
+ * Run the irq_work entries on this cpu. Requires to be ran from hardirq
+ * context with local IRQs disabled.
+ */
+void irq_work_run(void)
+{
+   BUG_ON(!in_irq());
+   __irq_work_run();
+}
 EXPORT_SYMBOL_GPL(irq_work_run);
 
 /*
@@ -163,3 +169,35 @@ void irq_work_sync(struct irq_work *work)
cpu_relax();
 }
 EXPORT_SYMBOL_GPL(irq_work_sync);
+
+#ifdef CONFIG_HOTPLUG_CPU
+static int irq_work_cpu_notify(struct notifier_block *self,
+  unsigned long action, void *hcpu)
+{
+   long cpu = (long)hcpu;
+
+   switch (action) {
+   case CPU_DYING:
+   /* Called from stop_machine */
+   if (WARN_ON_ONCE(cpu != smp_processor_id()))
+   break;
+   __irq_work_run();
+   break;
+   default:
+   break;
+   }
+   return NOTIFY_OK;
+}
+
+static struct notifier_block cpu_notify;
+
+static __init int irq_work_init_cpu_notifier(void)
+{
+   cpu_notify.notifier_call = irq_work_cpu_notify;
+   cpu_notify.priority = 0;
+   register_cpu_notifier(cpu_notify);
+   return 0;
+}
+device_initcall(irq_work_init_cpu_notifier);
+
+#endif /* CONFIG_HOTPLUG_CPU */
-- 
1.7.5.4



[PATCH 3/9] irq_work: Remove CONFIG_HAVE_IRQ_WORK

2012-11-15 Thread Frederic Weisbecker
irq work can run on any arch even without IPI
support because of the hook on update_process_times().

So let's remove HAVE_IRQ_WORK because it doesn't reflect
any backend requirement.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 arch/alpha/Kconfig  |1 -
 arch/arm/Kconfig|1 -
 arch/arm64/Kconfig  |1 -
 arch/blackfin/Kconfig   |1 -
 arch/frv/Kconfig|1 -
 arch/hexagon/Kconfig|1 -
 arch/mips/Kconfig   |1 -
 arch/parisc/Kconfig |1 -
 arch/powerpc/Kconfig|1 -
 arch/s390/Kconfig   |1 -
 arch/sh/Kconfig |1 -
 arch/sparc/Kconfig  |1 -
 arch/x86/Kconfig|1 -
 drivers/staging/iio/trigger/Kconfig |1 -
 init/Kconfig|4 
 15 files changed, 0 insertions(+), 18 deletions(-)

diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 5dd7f5d..e56c2d1 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -5,7 +5,6 @@ config ALPHA
select HAVE_IDE
select HAVE_OPROFILE
select HAVE_SYSCALL_WRAPPERS
-   select HAVE_IRQ_WORK
select HAVE_PCSPKR_PLATFORM
select HAVE_PERF_EVENTS
select HAVE_DMA_ATTRS
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index ade7e92..22d378b 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -36,7 +36,6 @@ config ARM
select HAVE_GENERIC_HARDIRQS
select HAVE_HW_BREAKPOINT if (PERF_EVENTS  (CPU_V6 || CPU_V6K || 
CPU_V7))
select HAVE_IDE if PCI || ISA || PCMCIA
-   select HAVE_IRQ_WORK
select HAVE_KERNEL_GZIP
select HAVE_KERNEL_LZMA
select HAVE_KERNEL_LZO
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index ef54a59..dd50d72 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -17,7 +17,6 @@ config ARM64
select HAVE_GENERIC_DMA_COHERENT
select HAVE_GENERIC_HARDIRQS
select HAVE_HW_BREAKPOINT if PERF_EVENTS
-   select HAVE_IRQ_WORK
select HAVE_MEMBLOCK
select HAVE_PERF_EVENTS
select HAVE_SPARSE_IRQ
diff --git a/arch/blackfin/Kconfig b/arch/blackfin/Kconfig
index b6f3ad5..86f891f 100644
--- a/arch/blackfin/Kconfig
+++ b/arch/blackfin/Kconfig
@@ -24,7 +24,6 @@ config BLACKFIN
select HAVE_FUNCTION_TRACER
select HAVE_FUNCTION_TRACE_MCOUNT_TEST
select HAVE_IDE
-   select HAVE_IRQ_WORK
select HAVE_KERNEL_GZIP if RAMKERNEL
select HAVE_KERNEL_BZIP2 if RAMKERNEL
select HAVE_KERNEL_LZMA if RAMKERNEL
diff --git a/arch/frv/Kconfig b/arch/frv/Kconfig
index df2eb4b..c44fd6e 100644
--- a/arch/frv/Kconfig
+++ b/arch/frv/Kconfig
@@ -3,7 +3,6 @@ config FRV
default y
select HAVE_IDE
select HAVE_ARCH_TRACEHOOK
-   select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select HAVE_UID16
select HAVE_GENERIC_HARDIRQS
diff --git a/arch/hexagon/Kconfig b/arch/hexagon/Kconfig
index 0744f7d..40a3185 100644
--- a/arch/hexagon/Kconfig
+++ b/arch/hexagon/Kconfig
@@ -14,7 +14,6 @@ config HEXAGON
# select HAVE_CLK
# select IRQ_PER_CPU
# select GENERIC_PENDING_IRQ if SMP
-   select HAVE_IRQ_WORK
select GENERIC_ATOMIC64
select HAVE_PERF_EVENTS
select HAVE_GENERIC_HARDIRQS
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index dba9390..3d86d69 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -4,7 +4,6 @@ config MIPS
select HAVE_GENERIC_DMA_COHERENT
select HAVE_IDE
select HAVE_OPROFILE
-   select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select PERF_USE_VMALLOC
select HAVE_ARCH_KGDB
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 11def45..8f0df47 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -9,7 +9,6 @@ config PARISC
select RTC_DRV_GENERIC
select INIT_ALL_POSSIBLE
select BUG
-   select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select GENERIC_ATOMIC64 if !64BIT
select HAVE_GENERIC_HARDIRQS
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index a902a5c..a90f0c9 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -118,7 +118,6 @@ config PPC
select HAVE_SYSCALL_WRAPPERS if PPC64
select GENERIC_ATOMIC64 if PPC32
select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE
-   select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_HW_BREAKPOINT if PERF_EVENTS  PPC_BOOK3S_64
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 5dba755..0816ff0 100644

Re: [PATCH 8/9] irq_work: Make self-IPIs optable

2012-11-16 Thread Frederic Weisbecker
2012/11/16 Steven Rostedt rost...@goodmis.org:
 On Fri, 2012-11-16 at 03:21 +0100, Frederic Weisbecker wrote:

  /*
   * Claim the entry so that no one else will poke at it.
 @@ -68,14 +59,18 @@ void __weak arch_irq_work_raise(void)
   */
  static void __irq_work_queue(struct irq_work *work)
  {
 - bool empty;
 -
   preempt_disable();

 - empty = llist_add(work-llnode, __get_cpu_var(irq_work_list));
 - /* The list was empty, raise self-interrupt to start processing. */
 - if (empty)
 - arch_irq_work_raise();
 + llist_add(work-llnode, __get_cpu_var(irq_work_list));
 +
 + /*
 +  * If the work is flagged as lazy, just wait for the next tick
 +  * to run it. Otherwise, or if the tick is stopped, raise the irq work.

 Speaking more Greek? ;-)

 How about:

 If the work is not lazy or the tick is stopped, raise the irq
 work interrupt (if supported by the arch), otherwise, just wait
 for the next tick.

Much better :)


 Other than that, Acked-by: Steven Rostedt rost...@goodmis.org

Thanks!


[GIT PULL v2] printk: Make it usable on nohz cpus

2012-11-17 Thread Frederic Weisbecker
Ingo,

Please pull the printk support in dynticks mode patches that can
be found at:

  git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git 
tags/printk-dynticks-for-mingo-v2

for you to fetch changes up to 74876a98a87a115254b3a66a14b27320b7f0acaa:

  printk: Wake up klogd using irq_work (2012-11-18 01:01:49 +0100)

It is based on v3.7-rc4.

Changes since the previous pull request include support for flushing irq work
on CPU offlining and acks from Steve. The rest hasn't changed except for some
comment fixes.

Thanks.


Support for printk in dynticks mode:

* Fix two races in irq work claiming

* Generalize irq_work support to all archs

* Don't stop tick with irq works pending. This
fix is generally useful and concerns archs that
can't raise self IPIs.

* Flush irq works before CPU offlining.

* Introduce lazy irq works that can wait for the
next tick to be executed, unless the tick is stopped.

* Implement klogd wake up using irq work. This
removes the ad-hoc printk_tick()/printk_needs_cpu()
hooks and makes it work even in dynticks mode.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com


Frederic Weisbecker (7):
  irq_work: Fix racy IRQ_WORK_BUSY flag setting
  irq_work: Fix racy check on work pending flag
  irq_work: Remove CONFIG_HAVE_IRQ_WORK
  nohz: Add API to check tick state
  irq_work: Don't stop the tick with pending works
  irq_work: Make self-IPIs optable
  printk: Wake up klogd using irq_work

Steven Rostedt (2):
  irq_work: Flush work on CPU_DYING
  irq_work: Warn if there's still work on cpu_down

 arch/alpha/Kconfig  |1 -
 arch/arm/Kconfig|1 -
 arch/arm64/Kconfig  |1 -
 arch/blackfin/Kconfig   |1 -
 arch/frv/Kconfig|1 -
 arch/hexagon/Kconfig|1 -
 arch/mips/Kconfig   |1 -
 arch/parisc/Kconfig |1 -
 arch/powerpc/Kconfig|1 -
 arch/s390/Kconfig   |1 -
 arch/sh/Kconfig |1 -
 arch/sparc/Kconfig  |1 -
 arch/x86/Kconfig|1 -
 drivers/staging/iio/trigger/Kconfig |1 -
 include/linux/irq_work.h|   20 ++
 include/linux/printk.h  |3 -
 include/linux/tick.h|   17 -
 init/Kconfig|5 +-
 kernel/irq_work.c   |  131 ++-
 kernel/printk.c |   36 +-
 kernel/time/tick-sched.c|7 +-
 kernel/timer.c  |1 -
 22 files changed, 161 insertions(+), 73 deletions(-)


[PATCH 2/9] irq_work: Fix racy check on work pending flag

2012-11-17 Thread Frederic Weisbecker
Work claiming wants to be SMP-safe.

And by the time we try to claim a work, if it is already executing
concurrently on another CPU, we want to succeed the claiming and queue
the work again because the other CPU may have missed the data we wanted
to handle in our work if it's about to complete there.

This scenario is summarized below:

CPU 1   CPU 2
-   -
(flags = 0)
cmpxchg(flags, 0, IRQ_WORK_FLAGS)
(flags = 3)
[...]
xchg(flags, IRQ_WORK_BUSY)
(flags = 2)
func()
if (flags  IRQ_WORK_PENDING)
(not true)
cmpxchg(flags, flags, 
IRQ_WORK_FLAGS)
(flags = 3)
[...]
cmpxchg(flags, IRQ_WORK_BUSY, 0);
(fail, pending on CPU 2)

This state machine is synchronized using [cmp]xchg() on the flags.
As such, the early IRQ_WORK_PENDING check in CPU 2 above is racy.
By the time we check it, we may be dealing with a stale value because
we aren't using an atomic accessor. As a result, CPU 2 may see
that the work is still pending on another CPU while that CPU may
actually be completing the work function execution already, leaving
our data unprocessed.

To fix this, we start by speculating about the value we wish to find
in work->flags, but we only draw conclusions from the value
returned by the cmpxchg() call, which either claims the work or lets
the current owner handle the pending work for us.

Changelog-heavily-inspired-by: Steven Rostedt rost...@goodmis.org
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Acked-by: Steven Rostedt rost...@goodmis.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Andrew Morton a...@linux-foundation.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Anish Kumar anish198519851...@gmail.com
---
 kernel/irq_work.c |   16 +++-
 1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 57be1a6..64eddd5 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -34,15 +34,21 @@ static DEFINE_PER_CPU(struct llist_head, irq_work_list);
  */
 static bool irq_work_claim(struct irq_work *work)
 {
-   unsigned long flags, nflags;
+   unsigned long flags, oflags, nflags;
 
+   /*
+* Start with our best wish as a premise but only trust any
+* flag value after cmpxchg() result.
+*/
+   flags = work-flags  ~IRQ_WORK_PENDING;
for (;;) {
-   flags = work-flags;
-   if (flags  IRQ_WORK_PENDING)
-   return false;
nflags = flags | IRQ_WORK_FLAGS;
-   if (cmpxchg(work-flags, flags, nflags) == flags)
+   oflags = cmpxchg(work-flags, flags, nflags);
+   if (oflags == flags)
break;
+   if (oflags  IRQ_WORK_PENDING)
+   return false;
+   flags = oflags;
cpu_relax();
}
 
-- 
1.7.5.4



[PATCH 9/9] printk: Wake up klogd using irq_work

2012-11-17 Thread Frederic Weisbecker
klogd is woken up asynchronously from the tick in order
to do it safely.

However if printk is called when the tick is stopped, the reader
won't be woken up until the next interrupt, which might not fire
for a while. As a result, the user may miss some messages.

To fix this, let's implement the printk tick using a lazy irq work.
This subsystem takes care of the timer tick state and can
fix things up accordingly.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Acked-by: Steven Rostedt rost...@goodmis.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/printk.h   |3 ---
 init/Kconfig |1 +
 kernel/printk.c  |   36 
 kernel/time/tick-sched.c |2 +-
 kernel/timer.c   |1 -
 5 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/include/linux/printk.h b/include/linux/printk.h
index 9afc01e..86c4b62 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -98,9 +98,6 @@ int no_printk(const char *fmt, ...)
 extern asmlinkage __printf(1, 2)
 void early_printk(const char *fmt, ...);
 
-extern int printk_needs_cpu(int cpu);
-extern void printk_tick(void);
-
 #ifdef CONFIG_PRINTK
 asmlinkage __printf(5, 0)
 int vprintk_emit(int facility, int level,
diff --git a/init/Kconfig b/init/Kconfig
index cdc152c..c575566 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1196,6 +1196,7 @@ config HOTPLUG
 config PRINTK
default y
bool Enable support for printk if EXPERT
+   select IRQ_WORK
help
  This option enables normal printk support. Removing it
  eliminates most of the message strings from the kernel image
diff --git a/kernel/printk.c b/kernel/printk.c
index 2d607f4..c9104fe 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -42,6 +42,7 @@
 #include linux/notifier.h
 #include linux/rculist.h
 #include linux/poll.h
+#include linux/irq_work.h
 
 #include asm/uaccess.h
 
@@ -1955,30 +1956,32 @@ int is_console_locked(void)
 static DEFINE_PER_CPU(int, printk_pending);
 static DEFINE_PER_CPU(char [PRINTK_BUF_SIZE], printk_sched_buf);
 
-void printk_tick(void)
+static void wake_up_klogd_work_func(struct irq_work *irq_work)
 {
-   if (__this_cpu_read(printk_pending)) {
-   int pending = __this_cpu_xchg(printk_pending, 0);
-   if (pending  PRINTK_PENDING_SCHED) {
-   char *buf = __get_cpu_var(printk_sched_buf);
-   printk(KERN_WARNING [sched_delayed] %s, buf);
-   }
-   if (pending  PRINTK_PENDING_WAKEUP)
-   wake_up_interruptible(log_wait);
+   int pending = __this_cpu_xchg(printk_pending, 0);
+
+   if (pending  PRINTK_PENDING_SCHED) {
+   char *buf = __get_cpu_var(printk_sched_buf);
+   printk(KERN_WARNING [sched_delayed] %s, buf);
}
-}
 
-int printk_needs_cpu(int cpu)
-{
-   if (cpu_is_offline(cpu))
-   printk_tick();
-   return __this_cpu_read(printk_pending);
+   if (pending  PRINTK_PENDING_WAKEUP)
+   wake_up_interruptible(log_wait);
 }
 
+static DEFINE_PER_CPU(struct irq_work, wake_up_klogd_work) = {
+   .func = wake_up_klogd_work_func,
+   .flags = IRQ_WORK_LAZY,
+};
+
 void wake_up_klogd(void)
 {
-   if (waitqueue_active(log_wait))
+   preempt_disable();
+   if (waitqueue_active(log_wait)) {
this_cpu_or(printk_pending, PRINTK_PENDING_WAKEUP);
+   irq_work_queue(__get_cpu_var(wake_up_klogd_work));
+   }
+   preempt_enable();
 }
 
 static void console_cont_flush(char *text, size_t size)
@@ -2458,6 +2461,7 @@ int printk_sched(const char *fmt, ...)
va_end(args);
 
__this_cpu_or(printk_pending, PRINTK_PENDING_SCHED);
+   irq_work_queue(__get_cpu_var(wake_up_klogd_work));
local_irq_restore(flags);
 
return r;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f249e8c..822d757 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -289,7 +289,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched 
*ts,
time_delta = timekeeping_max_deferment();
} while (read_seqretry(xtime_lock, seq));
 
-   if (rcu_needs_cpu(cpu, rcu_delta_jiffies) || printk_needs_cpu(cpu) ||
+   if (rcu_needs_cpu(cpu, rcu_delta_jiffies) ||
arch_needs_cpu(cpu) || irq_work_needs_cpu()) {
next_jiffies = last_jiffies + 1;
delta_jiffies = 1;
diff --git a/kernel/timer.c b/kernel/timer.c
index 367d008..ff3b516 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1351,7 +1351,6 @@ void update_process_times(int user_tick)
account_process_tick(p, user_tick);
run_local_timers();
rcu_check_callbacks(cpu, user_tick);
-   printk_tick

[PATCH 4/9] nohz: Add API to check tick state

2012-11-17 Thread Frederic Weisbecker
We need some quick way to check if the CPU has stopped
its tick. This will be useful to implement the printk tick
using the irq work subsystem.
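
As a sketch of the intended use (this mirrors what the irq_work patch
later in the series does inside kernel/irq_work.c; IRQ_WORK_LAZY and
irq_work_raised come from that later patch, only tick_nohz_tick_stopped()
is added here):

	/*
	 * Inside __irq_work_queue() (sketch): lazy work may simply wait
	 * for the next tick, but if this CPU has stopped its tick we
	 * must raise the self-IPI ourselves.
	 */
	if (!(work->flags & IRQ_WORK_LAZY) || tick_nohz_tick_stopped()) {
		if (!this_cpu_cmpxchg(irq_work_raised, 0, 1))
			arch_irq_work_raise();
	}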

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Acked-by: Steven Rostedt rost...@goodmis.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/tick.h |   17 -
 kernel/time/tick-sched.c |2 +-
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index f37fceb..2307dd3 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -8,6 +8,8 @@
 
 #include linux/clockchips.h
 #include linux/irqflags.h
+#include linux/percpu.h
+#include linux/hrtimer.h
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 
@@ -122,13 +124,26 @@ static inline int tick_oneshot_mode_active(void) { return 
0; }
 #endif /* !CONFIG_GENERIC_CLOCKEVENTS */
 
 # ifdef CONFIG_NO_HZ
+DECLARE_PER_CPU(struct tick_sched, tick_cpu_sched);
+
+static inline int tick_nohz_tick_stopped(void)
+{
+   return __this_cpu_read(tick_cpu_sched.tick_stopped);
+}
+
 extern void tick_nohz_idle_enter(void);
 extern void tick_nohz_idle_exit(void);
 extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
-# else
+
+# else /* !CONFIG_NO_HZ */
+static inline int tick_nohz_tick_stopped(void)
+{
+   return 0;
+}
+
 static inline void tick_nohz_idle_enter(void) { }
 static inline void tick_nohz_idle_exit(void) { }
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a402608..9e945aa 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -28,7 +28,7 @@
 /*
  * Per cpu nohz control structure
  */
-static DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched);
+DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched);
 
 /*
  * The time, when the last jiffy update happened. Protected by xtime_lock.
-- 
1.7.5.4



[PATCH 5/9] irq_work: Don't stop the tick with pending works

2012-11-17 Thread Frederic Weisbecker
Don't stop the tick if we have pending irq works on the
queue; otherwise, if the arch can't raise self-IPIs, we may not
find an opportunity to execute the pending works for a while.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Acked-by: Steven Rostedt rost...@goodmis.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/irq_work.h |6 ++
 kernel/irq_work.c|   11 +++
 kernel/time/tick-sched.c |3 ++-
 3 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index 6a9e8f5..a69704f 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -20,4 +20,10 @@ bool irq_work_queue(struct irq_work *work);
 void irq_work_run(void);
 void irq_work_sync(struct irq_work *work);
 
+#ifdef CONFIG_IRQ_WORK
+bool irq_work_needs_cpu(void);
+#else
+static bool irq_work_needs_cpu(void) { return false; }
+#endif
+
 #endif /* _LINUX_IRQ_WORK_H */
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 64eddd5..b3c113a 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -99,6 +99,17 @@ bool irq_work_queue(struct irq_work *work)
 }
 EXPORT_SYMBOL_GPL(irq_work_queue);
 
+bool irq_work_needs_cpu(void)
+{
+   struct llist_head *this_list;
+
+   this_list = __get_cpu_var(irq_work_list);
+   if (llist_empty(this_list))
+   return false;
+
+   return true;
+}
+
 /*
  * Run the irq_work entries on this cpu. Requires to be ran from hardirq
  * context with local IRQs disabled.
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9e945aa..f249e8c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -20,6 +20,7 @@
 #include linux/profile.h
 #include linux/sched.h
 #include linux/module.h
+#include linux/irq_work.h
 
 #include asm/irq_regs.h
 
@@ -289,7 +290,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched 
*ts,
} while (read_seqretry(xtime_lock, seq));
 
if (rcu_needs_cpu(cpu, rcu_delta_jiffies) || printk_needs_cpu(cpu) ||
-   arch_needs_cpu(cpu)) {
+   arch_needs_cpu(cpu) || irq_work_needs_cpu()) {
next_jiffies = last_jiffies + 1;
delta_jiffies = 1;
} else {
-- 
1.7.5.4



[PATCH 8/9] irq_work: Make self-IPIs optable

2012-11-17 Thread Frederic Weisbecker
On irq work initialization, let the user choose to define it
as lazy or not. Lazy means that we don't want to send
an IPI (provided the arch can raise one at all) when we enqueue
this work, but would rather wait for the next timer tick
to execute our work if possible.

This is going to benefit non-urgent enqueuers
(like printk in the future) that may prefer not to raise
an IPI storm in case of frequent enqueuing over short periods
of time.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Acked-by: Steven Rostedt rost...@goodmis.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 include/linux/irq_work.h |   14 +
 kernel/irq_work.c|   47 ++---
 2 files changed, 41 insertions(+), 20 deletions(-)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index a69704f..b28eb60 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -3,6 +3,20 @@
 
 #include linux/llist.h
 
+/*
+ * An entry can be in one of four states:
+ *
+ * free NULL, 0 - {claimed}   : free to be used
+ * claimed   NULL, 3 - {pending}   : claimed to be enqueued
+ * pending   next, 3 - {busy}  : queued, pending callback
+ * busy  NULL, 2 - {free, claimed} : callback in progress, can be claimed
+ */
+
+#define IRQ_WORK_PENDING   1UL
+#define IRQ_WORK_BUSY  2UL
+#define IRQ_WORK_FLAGS 3UL
+#define IRQ_WORK_LAZY  4UL /* Doesn't want IPI, wait for tick */
+
 struct irq_work {
unsigned long flags;
struct llist_node llnode;
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 480f747..7f3a59b 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -12,24 +12,15 @@
 #include linux/percpu.h
 #include linux/hardirq.h
 #include linux/irqflags.h
+#include linux/sched.h
+#include linux/tick.h
 #include linux/cpu.h
 #include linux/notifier.h
 #include asm/processor.h
 
-/*
- * An entry can be in one of four states:
- *
- * free NULL, 0 - {claimed}   : free to be used
- * claimed   NULL, 3 - {pending}   : claimed to be enqueued
- * pending   next, 3 - {busy}  : queued, pending callback
- * busy  NULL, 2 - {free, claimed} : callback in progress, can be claimed
- */
-
-#define IRQ_WORK_PENDING   1UL
-#define IRQ_WORK_BUSY  2UL
-#define IRQ_WORK_FLAGS 3UL
 
 static DEFINE_PER_CPU(struct llist_head, irq_work_list);
+static DEFINE_PER_CPU(int, irq_work_raised);
 
 /*
  * Claim the entry so that no one else will poke at it.
@@ -69,14 +60,19 @@ void __weak arch_irq_work_raise(void)
  */
 static void __irq_work_queue(struct irq_work *work)
 {
-   bool empty;
-
preempt_disable();
 
-   empty = llist_add(work-llnode, __get_cpu_var(irq_work_list));
-   /* The list was empty, raise self-interrupt to start processing. */
-   if (empty)
-   arch_irq_work_raise();
+   llist_add(work-llnode, __get_cpu_var(irq_work_list));
+
+   /*
+* If the work is not lazy or the tick is stopped, raise the irq
+* work interrupt (if supported by the arch), otherwise, just wait
+* for the next tick.
+*/
+   if (!(work-flags  IRQ_WORK_LAZY) || tick_nohz_tick_stopped()) {
+   if (!this_cpu_cmpxchg(irq_work_raised, 0, 1))
+   arch_irq_work_raise();
+   }
 
preempt_enable();
 }
@@ -117,10 +113,19 @@ bool irq_work_needs_cpu(void)
 
 static void __irq_work_run(void)
 {
+   unsigned long flags;
struct irq_work *work;
struct llist_head *this_list;
struct llist_node *llnode;
 
+
+   /*
+* Reset the raised state right before we check the list because
+* an NMI may enqueue after we find the list empty from the runner.
+*/
+   __this_cpu_write(irq_work_raised, 0);
+   barrier();
+
this_list = __get_cpu_var(irq_work_list);
if (llist_empty(this_list))
return;
@@ -140,13 +145,15 @@ static void __irq_work_run(void)
 * to claim that work don't rely on us to handle their data
 * while we are in the middle of the func.
 */
-   xchg(work-flags, IRQ_WORK_BUSY);
+   flags = work-flags  ~IRQ_WORK_PENDING;
+   xchg(work-flags, flags);
+
work-func(work);
/*
 * Clear the BUSY bit and return to the free state if
 * no-one else claimed it meanwhile.
 */
-   (void)cmpxchg(work-flags, IRQ_WORK_BUSY, 0);
+   (void)cmpxchg(work-flags, flags, flags  ~IRQ_WORK_BUSY);
}
 }
 
-- 
1.7.5.4


[PATCH 1/9] irq_work: Fix racy IRQ_WORK_BUSY flag setting

2012-11-17 Thread Frederic Weisbecker
The IRQ_WORK_BUSY flag is set right before we execute the
work. Once this flag value is set, the work enters a
claimable state again.

So if we have specific data to compute in our work, we ensure it's
either handled by another CPU or locally by enqueuing the work again.
This state machine is guaranteed by atomic operations on the flags.

So when we set IRQ_WORK_BUSY without using an xchg-like operation,
we break this guarantee as in the following summarized scenario:

CPU 1   CPU 2
-   -
(flags = 0)
old_flags = flags;
(flags = 0)
cmpxchg(flags, old_flags,
old_flags | IRQ_WORK_FLAGS)
(flags = 3)
[...]
flags = IRQ_WORK_BUSY
(flags = 2)
func()
(sees flags = 3)
cmpxchg(flags, old_flags,
old_flags | 
IRQ_WORK_FLAGS)
(give up)

cmpxchg(flags, 2, 0);
(flags = 0)

CPU 1 claims a work and executes it, so it sets IRQ_WORK_BUSY and
the work is again in a claimable state. Now CPU 2 has new data to process
and tries to claim that work, but it may see a stale value of the flags
and think the work is still pending somewhere that will handle its data.
This is because CPU 1 doesn't set IRQ_WORK_BUSY atomically.

As a result, the data expected to be handled by CPU 2 won't get handled.

To fix this, use xchg() to set IRQ_WORK_BUSY, this way we ensure the CPU 2
will see the correct value with cmpxchg() using the expected ordering.

Changelog-heavily-inspired-by: Steven Rostedt rost...@goodmis.org
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Acked-by: Steven Rostedt rost...@goodmis.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Andrew Morton a...@linux-foundation.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Anish Kumar anish198519851...@gmail.com
---
 kernel/irq_work.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 1588e3b..57be1a6 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -119,8 +119,11 @@ void irq_work_run(void)
/*
 * Clear the PENDING bit, after this point the @work
 * can be re-used.
+* Make it immediately visible so that other CPUs trying
+* to claim that work don't rely on us to handle their data
+* while we are in the middle of the func.
 */
-   work-flags = IRQ_WORK_BUSY;
+   xchg(work-flags, IRQ_WORK_BUSY);
work-func(work);
/*
 * Clear the BUSY bit and return to the free state if
-- 
1.7.5.4



[PATCH 7/9] irq_work: Warn if there's still work on cpu_down

2012-11-17 Thread Frederic Weisbecker
From: Steven Rostedt rost...@goodmis.org

If we are in nohz and there's still irq_work to be done when the idle
task is about to go offline, give a nasty warning. Everything should
have been flushed from the CPU_DYING notifier already. Further attempts
to enqueue an irq_work are buggy because irqs are disabled by
__cpu_disable(). The best we can do is to report the issue to the user.

Signed-off-by: Steven Rostedt rost...@goodmis.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
---
 kernel/irq_work.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 4ed1749..480f747 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -109,6 +109,9 @@ bool irq_work_needs_cpu(void)
if (llist_empty(this_list))
return false;
 
+   /* All work should have been flushed before going offline */
+   WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
+
return true;
 }
 
-- 
1.7.5.4



[PATCH 6/9] irq_work: Flush work on CPU_DYING

2012-11-17 Thread Frederic Weisbecker
From: Steven Rostedt rost...@goodmis.org

In order not to offline a CPU with pending irq works, flush the
queue from CPU_DYING. The notifier is called by stop_machine on
the CPU that is going down. The code will not be called from irq context
(so things like get_irq_regs() won't work) but I'm not sure what the
requirements are for irq_work in that regard (Peter?). But irqs are
disabled and the CPU is about to go offline. Might as well flush the work.

Signed-off-by: Steven Rostedt rost...@goodmis.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
---
 kernel/irq_work.c |   51 +--
 1 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index b3c113a..4ed1749 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -12,6 +12,8 @@
 #include linux/percpu.h
 #include linux/hardirq.h
 #include linux/irqflags.h
+#include linux/cpu.h
+#include linux/notifier.h
 #include asm/processor.h
 
 /*
@@ -110,11 +112,7 @@ bool irq_work_needs_cpu(void)
return true;
 }
 
-/*
- * Run the irq_work entries on this cpu. Requires to be ran from hardirq
- * context with local IRQs disabled.
- */
-void irq_work_run(void)
+static void __irq_work_run(void)
 {
struct irq_work *work;
struct llist_head *this_list;
@@ -124,7 +122,6 @@ void irq_work_run(void)
if (llist_empty(this_list))
return;
 
-   BUG_ON(!in_irq());
BUG_ON(!irqs_disabled());
 
llnode = llist_del_all(this_list);
@@ -149,6 +146,16 @@ void irq_work_run(void)
(void)cmpxchg(work-flags, IRQ_WORK_BUSY, 0);
}
 }
+
+/*
+ * Run the irq_work entries on this cpu. Requires to be ran from hardirq
+ * context with local IRQs disabled.
+ */
+void irq_work_run(void)
+{
+   BUG_ON(!in_irq());
+   __irq_work_run();
+}
 EXPORT_SYMBOL_GPL(irq_work_run);
 
 /*
@@ -163,3 +170,35 @@ void irq_work_sync(struct irq_work *work)
cpu_relax();
 }
 EXPORT_SYMBOL_GPL(irq_work_sync);
+
+#ifdef CONFIG_HOTPLUG_CPU
+static int irq_work_cpu_notify(struct notifier_block *self,
+  unsigned long action, void *hcpu)
+{
+   long cpu = (long)hcpu;
+
+   switch (action) {
+   case CPU_DYING:
+   /* Called from stop_machine */
+   if (WARN_ON_ONCE(cpu != smp_processor_id()))
+   break;
+   __irq_work_run();
+   break;
+   default:
+   break;
+   }
+   return NOTIFY_OK;
+}
+
+static struct notifier_block cpu_notify;
+
+static __init int irq_work_init_cpu_notifier(void)
+{
+   cpu_notify.notifier_call = irq_work_cpu_notify;
+   cpu_notify.priority = 0;
+   register_cpu_notifier(cpu_notify);
+   return 0;
+}
+device_initcall(irq_work_init_cpu_notifier);
+
+#endif /* CONFIG_HOTPLUG_CPU */
-- 
1.7.5.4



[PATCH 3/9] irq_work: Remove CONFIG_HAVE_IRQ_WORK

2012-11-17 Thread Frederic Weisbecker
irq work can run on any arch even without IPI
support because of the hook on update_process_times().

So let's remove HAVE_IRQ_WORK because it doesn't reflect
any backend requirement.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Acked-by: Steven Rostedt rost...@goodmis.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
---
 arch/alpha/Kconfig  |1 -
 arch/arm/Kconfig|1 -
 arch/arm64/Kconfig  |1 -
 arch/blackfin/Kconfig   |1 -
 arch/frv/Kconfig|1 -
 arch/hexagon/Kconfig|1 -
 arch/mips/Kconfig   |1 -
 arch/parisc/Kconfig |1 -
 arch/powerpc/Kconfig|1 -
 arch/s390/Kconfig   |1 -
 arch/sh/Kconfig |1 -
 arch/sparc/Kconfig  |1 -
 arch/x86/Kconfig|1 -
 drivers/staging/iio/trigger/Kconfig |1 -
 init/Kconfig|4 
 15 files changed, 0 insertions(+), 18 deletions(-)

diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 5dd7f5d..e56c2d1 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -5,7 +5,6 @@ config ALPHA
select HAVE_IDE
select HAVE_OPROFILE
select HAVE_SYSCALL_WRAPPERS
-   select HAVE_IRQ_WORK
select HAVE_PCSPKR_PLATFORM
select HAVE_PERF_EVENTS
select HAVE_DMA_ATTRS
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index ade7e92..22d378b 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -36,7 +36,6 @@ config ARM
select HAVE_GENERIC_HARDIRQS
select HAVE_HW_BREAKPOINT if (PERF_EVENTS  (CPU_V6 || CPU_V6K || 
CPU_V7))
select HAVE_IDE if PCI || ISA || PCMCIA
-   select HAVE_IRQ_WORK
select HAVE_KERNEL_GZIP
select HAVE_KERNEL_LZMA
select HAVE_KERNEL_LZO
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index ef54a59..dd50d72 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -17,7 +17,6 @@ config ARM64
select HAVE_GENERIC_DMA_COHERENT
select HAVE_GENERIC_HARDIRQS
select HAVE_HW_BREAKPOINT if PERF_EVENTS
-   select HAVE_IRQ_WORK
select HAVE_MEMBLOCK
select HAVE_PERF_EVENTS
select HAVE_SPARSE_IRQ
diff --git a/arch/blackfin/Kconfig b/arch/blackfin/Kconfig
index b6f3ad5..86f891f 100644
--- a/arch/blackfin/Kconfig
+++ b/arch/blackfin/Kconfig
@@ -24,7 +24,6 @@ config BLACKFIN
select HAVE_FUNCTION_TRACER
select HAVE_FUNCTION_TRACE_MCOUNT_TEST
select HAVE_IDE
-   select HAVE_IRQ_WORK
select HAVE_KERNEL_GZIP if RAMKERNEL
select HAVE_KERNEL_BZIP2 if RAMKERNEL
select HAVE_KERNEL_LZMA if RAMKERNEL
diff --git a/arch/frv/Kconfig b/arch/frv/Kconfig
index df2eb4b..c44fd6e 100644
--- a/arch/frv/Kconfig
+++ b/arch/frv/Kconfig
@@ -3,7 +3,6 @@ config FRV
default y
select HAVE_IDE
select HAVE_ARCH_TRACEHOOK
-   select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select HAVE_UID16
select HAVE_GENERIC_HARDIRQS
diff --git a/arch/hexagon/Kconfig b/arch/hexagon/Kconfig
index 0744f7d..40a3185 100644
--- a/arch/hexagon/Kconfig
+++ b/arch/hexagon/Kconfig
@@ -14,7 +14,6 @@ config HEXAGON
# select HAVE_CLK
# select IRQ_PER_CPU
# select GENERIC_PENDING_IRQ if SMP
-   select HAVE_IRQ_WORK
select GENERIC_ATOMIC64
select HAVE_PERF_EVENTS
select HAVE_GENERIC_HARDIRQS
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index dba9390..3d86d69 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -4,7 +4,6 @@ config MIPS
select HAVE_GENERIC_DMA_COHERENT
select HAVE_IDE
select HAVE_OPROFILE
-   select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select PERF_USE_VMALLOC
select HAVE_ARCH_KGDB
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 11def45..8f0df47 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -9,7 +9,6 @@ config PARISC
select RTC_DRV_GENERIC
select INIT_ALL_POSSIBLE
select BUG
-   select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select GENERIC_ATOMIC64 if !64BIT
select HAVE_GENERIC_HARDIRQS
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index a902a5c..a90f0c9 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -118,7 +118,6 @@ config PPC
select HAVE_SYSCALL_WRAPPERS if PPC64
select GENERIC_ATOMIC64 if PPC32
select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE
-   select HAVE_IRQ_WORK
select HAVE_PERF_EVENTS
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_HW_BREAKPOINT if PERF_EVENTS  PPC_BOOK3S_64
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 5dba755..0816ff0

Re: [PATCH] nohz/cpuset: Make a CPU stick with do_timer() duty in the presence of nohz cpusets

2012-11-19 Thread Frederic Weisbecker
Hi Hakan,

As I start to focus on timekeeping for full dynticks, I'm looking at
your patch. Sorry I hadn't replied with a serious review until
now. But here it is, finally:

2012/6/17 Hakan Akkan hakanak...@gmail.com:
 An adaptive nohz (AHZ) CPU may not do do_timer() for a while
 despite being non-idle. When all other CPUs are idle, AHZ
 CPUs might be using stale jiffies values. To prevent this
 always keep a CPU with ticks if there is one or more AHZ
 CPUs.

 The patch changes can_stop_{idle,adaptive}_tick functions
 and prevents either the last CPU who did the do_timer() duty
 or the AHZ CPU itself from stopping its sched timer if there
 is one or more AHZ CPUs in the system. This means AHZ CPUs
 might keep the ticks running for short periods until a
 non-AHZ CPU takes the charge away in
 tick_do_timer_check_handler() function. When a non-AHZ CPU
 takes the charge, it never gives it away so that AHZ CPUs
 can run tickless.

 Signed-off-by: Hakan Akkan hakanak...@gmail.com
 CC: Frederic Weisbecker fweis...@gmail.com
 ---
  include/linux/cpuset.h   |3 ++-
  kernel/cpuset.c  |5 +
  kernel/time/tick-sched.c |   31 ++-
  3 files changed, 37 insertions(+), 2 deletions(-)

 diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
 index ccbc2fd..19aa448 100644
 --- a/include/linux/cpuset.h
 +++ b/include/linux/cpuset.h
 @@ -266,11 +266,12 @@ static inline bool cpuset_adaptive_nohz(void)

  extern void cpuset_exit_nohz_interrupt(void *unused);
  extern void cpuset_nohz_flush_cputimes(void);
 +extern bool nohz_cpu_exist(void);
  #else
  static inline bool cpuset_cpu_adaptive_nohz(int cpu) { return false; }
  static inline bool cpuset_adaptive_nohz(void) { return false; }
  static inline void cpuset_nohz_flush_cputimes(void) { }
 -
 +static inline bool nohz_cpu_exist(void) { return false; }
  #endif /* CONFIG_CPUSETS_NO_HZ */

  #endif /* _LINUX_CPUSET_H */
 diff --git a/kernel/cpuset.c b/kernel/cpuset.c
 index 858217b..ccbaac9 100644
 --- a/kernel/cpuset.c
 +++ b/kernel/cpuset.c
 @@ -1231,6 +1231,11 @@ DEFINE_PER_CPU(int, cpu_adaptive_nohz_ref);

  static cpumask_t nohz_cpuset_mask;

 +inline bool nohz_cpu_exist(void)
 +{
 +   return !cpumask_empty(nohz_cpuset_mask);
 +}
 +
  static void flush_cputime_interrupt(void *unused)
  {
  trace_printk("IPI: flush cputime\n");
 diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
 index bdc8aeb..e60d541 100644
 --- a/kernel/time/tick-sched.c
 +++ b/kernel/time/tick-sched.c
 @@ -409,6 +409,25 @@ out:
 return ret;
  }

 +static inline bool must_take_timer_duty(int cpu)
 +{
 +   int handler = tick_do_timer_cpu;
 +   bool ret = false;
 +   bool tick_needed = nohz_cpu_exist();

Note this is racy because it fetches the value of nohz_cpuset_mask
without locking. We may see that there are no nohz cpusets even though
some CPUs have just been set as nohz and may already have shut down
their tick.

 +
 +   /*
 +* A CPU will have to take the timer duty if there is an adaptive
 +* nohz CPU in the system. The last handler == cpu check ensures
 +* that the last cpu that did the do_timer() sticks with the duty.
 +* A normal (non nohz) cpu will take the charge from a nohz cpu in
 +* tick_do_timer_check_handler anyway.
 +*/
 +   if (tick_needed && (handler == TICK_DO_TIMER_NONE || handler == cpu))
 +   ret = true;

This check is also racy due to the lack of locking. The previous
handler may have set TICK_DO_TIMER_NONE and gone to sleep. We have no
guarantee that the CPU can see that new value. It could believe there
is still a handler. This needs at least a cmpxchg() to make the
test-and-set atomic.
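
For illustration, the kind of atomic take-over I mean, as a rough sketch
only (try_take_timer_duty() is a name I just made up, it's not in your
patch):

	static bool try_take_timer_duty(int cpu)
	{
		/* Atomically claim the duty only if nobody holds it */
		int old = cmpxchg(&tick_do_timer_cpu, TICK_DO_TIMER_NONE, cpu);

		/* We won the race, or we already were the handler */
		return old == TICK_DO_TIMER_NONE || old == cpu;
	}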

 +
 +   return ret;
 +}
 +
  static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
  {
 /*
 @@ -421,6 +440,9 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched 
 *ts)
 if (unlikely(!cpu_online(cpu))) {
 if (cpu == tick_do_timer_cpu)
 tick_do_timer_cpu = TICK_DO_TIMER_NONE;
 +   } else if (must_take_timer_duty(cpu)) {
 +   tick_do_timer_cpu = cpu;
 +   return false;
 }

 if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
 @@ -512,6 +534,13 @@ void tick_nohz_idle_enter(void)
  #ifdef CONFIG_CPUSETS_NO_HZ
  static bool can_stop_adaptive_tick(void)
  {
 +   int cpu = smp_processor_id();
 +
 +   if (must_take_timer_duty(cpu)) {
 +   tick_do_timer_cpu = cpu;
 +   return false;

One problem I see here is that you pick the handler at random: it could be
an adaptive nohz CPU or an idle CPU. That's a problem if the user wants
CPU isolation.

I suggest instead defining a tunable timekeeping-duty CPU affinity in
a cpumask file at /sys/devices/system/cpu/timekeeping and a toggle at
/sys/devices/system/cpu/cpuX/timekeeping (like the online file). This
way the user can decide whether adaptive nohz CPU

Re: linux-next: manual merge of the tip tree with the rr tree

2012-09-28 Thread Frederic Weisbecker
On Fri, Sep 28, 2012 at 01:33:41PM +1000, Stephen Rothwell wrote:
 Hi all,
 
 Today's linux-next merge of the tip tree got a conflict in arch/Kconfig
 between commit 9a9d5786a5e7 (Make most arch asm/module.h files use
 asm-generic/module.h) from the rr tree and commits fdf9c356502a
 (cputime: Make finegrained irqtime accounting generally available) and
 2b1d5024e17b (rcu: Settle config for userspace extended quiescent
 state) from the tip tree.
 
 I fixed it up (see below) and can carry the fix as necessary (no action
 is required).

Looks good. Thanks.


Re: rcu: eqs related warnings in linux-next

2012-09-28 Thread Frederic Weisbecker
On Fri, Sep 28, 2012 at 02:51:03PM +0200, Sasha Levin wrote:
 Hi all,
 
 While fuzzing with trinity inside a KVM tools guest with the latest 
 linux-next kernel, I've stumbled on the following during boot:
 
 [  199.224369] WARNING: at kernel/rcutree.c:513 
 rcu_eqs_exit_common+0x4a/0x3a0()
 [  199.225307] Pid: 1, comm: init Tainted: GW
 3.6.0-rc7-next-20120928-sasha-1-g8b2d05d-dirty #13
 [  199.226611] Call Trace:
 [  199.226951]  [811c8d1a] ? rcu_eqs_exit_common+0x4a/0x3a0
 [  199.227773]  [81108e36] warn_slowpath_common+0x86/0xb0
 [  199.228572]  [81108f25] warn_slowpath_null+0x15/0x20
 [  199.229348]  [811c8d1a] rcu_eqs_exit_common+0x4a/0x3a0
 [  199.230037]  [8117f267] ? __lock_acquire+0x1c37/0x1ca0
 [  199.230037]  [811c936c] rcu_eqs_exit+0x9c/0xb0
 [  199.230037]  [811c940c] rcu_user_exit+0x8c/0xf0
 [  199.230037]  [810a98bb] do_page_fault+0x1b/0x40
 [  199.230037]  [810a2a90] do_async_page_fault+0x30/0xa0
 [  199.230037]  [83a3eea8] async_page_fault+0x28/0x30
 [  199.230037]  [819f357b] ? debug_object_activate+0x6b/0x1b0
 [  199.230037]  [819f3586] ? debug_object_activate+0x76/0x1b0
 [  199.230037]  [8111af13] ? lock_timer_base.isra.19+0x33/0x70
 [  199.230037]  [8111d45f] mod_timer_pinned+0x9f/0x260
 [  199.230037]  [811c5ff4] rcu_eqs_enter_common+0x894/0x970
 [  199.230037]  [839dc2ac] ? init_post+0x75/0xc8
 [  199.230037]  [85abfed5] ? kernel_init+0x1e1/0x1e1
 [  199.230037]  [811c63df] rcu_eqs_enter+0xaf/0xc0
 [  199.230037]  [811c64c5] rcu_user_enter+0xd5/0x140
 [  199.230037]  [8107d0fd] syscall_trace_leave+0xfd/0x150
 [  199.230037]  [83a3f7af] int_check_syscall_exit_work+0x34/0x3d
 [  199.230037] ---[ end trace a582c3a264d5bd1a ]---

We are faulting in the middle of rcu_user_enter() and thus we call
rcu_user_exit() while the whole transition state in rcu_user_enter()
is not yet finished (rdtp->dynticks not incremented).

Not sure how to solve this...


Re: rcu: eqs related warnings in linux-next

2012-09-28 Thread Frederic Weisbecker
On Fri, Sep 28, 2012 at 02:51:03PM +0200, Sasha Levin wrote:
 Hi all,
 
 While fuzzing with trinity inside a KVM tools guest with the latest 
 linux-next kernel, I've stumbled on the following during boot:
 
 [  199.224369] WARNING: at kernel/rcutree.c:513 
 rcu_eqs_exit_common+0x4a/0x3a0()
 [  199.225307] Pid: 1, comm: init Tainted: GW
 3.6.0-rc7-next-20120928-sasha-1-g8b2d05d-dirty #13
 [  199.226611] Call Trace:
 [  199.226951]  [811c8d1a] ? rcu_eqs_exit_common+0x4a/0x3a0
 [  199.227773]  [81108e36] warn_slowpath_common+0x86/0xb0
 [  199.228572]  [81108f25] warn_slowpath_null+0x15/0x20
 [  199.229348]  [811c8d1a] rcu_eqs_exit_common+0x4a/0x3a0
 [  199.230037]  [8117f267] ? __lock_acquire+0x1c37/0x1ca0
 [  199.230037]  [811c936c] rcu_eqs_exit+0x9c/0xb0
 [  199.230037]  [811c940c] rcu_user_exit+0x8c/0xf0
 [  199.230037]  [810a98bb] do_page_fault+0x1b/0x40
 [  199.230037]  [810a2a90] do_async_page_fault+0x30/0xa0
 [  199.230037]  [83a3eea8] async_page_fault+0x28/0x30
 [  199.230037]  [819f357b] ? debug_object_activate+0x6b/0x1b0
 [  199.230037]  [819f3586] ? debug_object_activate+0x76/0x1b0
 [  199.230037]  [8111af13] ? lock_timer_base.isra.19+0x33/0x70
 [  199.230037]  [8111d45f] mod_timer_pinned+0x9f/0x260
 [  199.230037]  [811c5ff4] rcu_eqs_enter_common+0x894/0x970
 [  199.230037]  [839dc2ac] ? init_post+0x75/0xc8
 [  199.230037]  [85abfed5] ? kernel_init+0x1e1/0x1e1
 [  199.230037]  [811c63df] rcu_eqs_enter+0xaf/0xc0
 [  199.230037]  [811c64c5] rcu_user_enter+0xd5/0x140
 [  199.230037]  [8107d0fd] syscall_trace_leave+0xfd/0x150
 [  199.230037]  [83a3f7af] int_check_syscall_exit_work+0x34/0x3d
 [  199.230037] ---[ end trace a582c3a264d5bd1a ]---

Ok, we can't decently protect against any kind of exception messing up
everything in the middle of RCU APIs anyway. The only solution is to
find out what caused this page fault in mod_timer_pinned() and work
around that.

Anybody, an idea?


Re: [RFC/PATCHSET 00/15] perf report: Add support to accumulate hist periods

2012-09-28 Thread Frederic Weisbecker
On Fri, Sep 28, 2012 at 09:07:57AM +0200, Stephane Eranian wrote:
 On Fri, Sep 28, 2012 at 7:49 AM, Namhyung Kim namhy...@kernel.org wrote:
  Hi Frederic,
 
  On Fri, 28 Sep 2012 01:01:48 +0200, Frederic Weisbecker wrote:
  When Arun was working on this, I asked him to explore if it could make 
  sense to reuse
  the -b, --branch-stack  perf report option. Because after all, this 
  feature is doing
  about the same than -b except it's using callchains instead of full 
  branch tracing.
  But callchains are branches. Just a limited subset of all branches taken 
  on excecution.
  So you can probably reuse some interface and even ground code there.
 
  What do you think?
 
  Umm.. first of all, I'm not familiar with the branch stack thing.  It's
  intel-specific, right?
 
 The kernel API is NOT specific to Intel. It is abstracted to be portable
 across architecture. The implementation only exists on certain Intel
 X86 processors.
 
  Also I don't understand what exactly you want here.  What kind of
  interface did you say?  Can you elaborate it bit more?
 
 Not clear to me either.
 
  And AFAIK branch stack can collect much more branch information than
  just callstacks.  Can we differentiate which is which easily?  Is there
  any limitation on using it?  What if callstacks are not sync'ed with
  branch stacks - is it possible though?
 
 First of all branch stack is not a branch tracing mechanism. This is a
 branch sampling mechanism. Not all branches are captured. Only the
 last N consecutive branches leading to a PMU interrupt are captured
 in each sample.
 
 Yes, the branch stack mechanism as it exists on Intel processors
 can capture more then call branches. It is HW based and provides
 a branch type filter. Filtering capability is exposed at the API level
 in a generic fashion. The hw filter is based on opcodes. Call branches
 all cover call, syscall instructions. As such, the branch stack mechanism
 cannot be used to capture callstacks to shared libraries, simply because
 there a a non call instruction in the trampoline. To obtain a better quality
 callstack you have instead to sample return branches. So yes, callstacks
 are not sync'ed with branch stack even if limited to call branches.
 

You're right. One doesn't simply sample callchains on top of branch tracing. 
Not easily at least.
But that's not what we want here. We want the other way round: use callchains 
as branch sampling.
And a callchain _is_ a branch sampling. Just a specialized one.

PERF_SAMPLE_BRANCH_STACK either records only calls, only rets, or
everything; you can define the filter with the -j option. Now callchains
can be considered as the result of a specific -j filter option. It's just
high level filtering, i.e. not just based on opcode types but on semantic
post-processing. As if we applied a specific filter on pure branch tracing
that cancelled calls that had matching rets.

But in the end, what we have is just branches. A branch layout that is
biased, that has already passed through a semantic wheel, but still just
_branches_.

Note I'm not arguing about adding a -j callchain option, just trying to
show you that callchains are not really different from any other filtered
source of branch samples.
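
To make the parallel concrete, this is roughly how the existing branch
stack source is consumed today (real perf record/report options; the
callchain-as-branch-source mode is only what's being discussed here, it
doesn't exist yet):

	# sample hardware branch stacks, keeping only calls and returns
	perf record -j any_call,any_ret -e cycles -a -- sleep 10

	# branch view; in principle the same view could be fed from
	# callchain samples
	perf report -b --sort symbol_from,symbol_to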


  But I think it'd be good if the branch stack can be changed to call
  stack in general.  Did you mean this?
 
 That's not going to happen. The mechanism is much more generic than
 that.
 
 Quite frankly, I don't understand Frederic's motivation here. The mechanism
 are not quite the same.

So, considering that callchains are just branches, why can't we use them
as a branch source, just like PERF_SAMPLE_BRANCH_STACK data samples that
we can reuse in perf report -b?

Look at commit b50311dc2ac1c04ad19163c2359910b25e16caf6
("perf report: Add support for taken branch sampling"). It's doing (except
for a few details like the period weight of branch samples) the same as
Namhyung's patch, just with PERF_SAMPLE_BRANCH_STACK instead of callchains.

I don't understand what justifies this duplication.


Re: [RFC/PATCHSET 00/15] perf report: Add support to accumulate hist periods

2012-09-28 Thread Frederic Weisbecker
On Fri, Sep 28, 2012 at 02:49:55PM +0900, Namhyung Kim wrote:
 Hi Frederic,
 
 On Fri, 28 Sep 2012 01:01:48 +0200, Frederic Weisbecker wrote:
  When Arun was working on this, I asked him to explore if it could make 
  sense to reuse
  the -b, --branch-stack  perf report option. Because after all, this 
  feature is doing
  about the same than -b except it's using callchains instead of full 
  branch tracing.
  But callchains are branches. Just a limited subset of all branches taken on 
  excecution.
  So you can probably reuse some interface and even ground code there.
 
  What do you think?
 
 Umm.. first of all, I'm not familiar with the branch stack thing.  It's
 intel-specific, right?
 
 Also I don't understand what exactly you want here.  What kind of
 interface did you say?  Can you elaborate it bit more?

Look at commit b50311dc2ac1c04ad19163c2359910b25e16caf6
("perf report: Add support for taken branch sampling"). It's doing almost
the same as you do, just using PERF_SAMPLE_BRANCH_STACK instead of
callchains.

 And AFAIK branch stack can collect much more branch information than
 just callstacks.

That's not a problem. Callchains are just a high-level filtered source of
branch samples. You don't need full branches to use -b. Just use whatever
flavour of branch samples makes sense for your analysis.

 Can we differentiate which is which easily?

Sure. If you have both sources in your perf.data (PERF_SAMPLE_BRANCH_STACK and
callchains), ask the user which one he wants. Otherwise defaults to what's 
there.

 Is there
 any limitation on using it?  What if callstacks are not sync'ed with
 branch stacks - is it possible though?

It's better to make both sources mutually exclusive. Otherwise it's going
to be over-complicated.

 
 But I think it'd be good if the branch stack can be changed to call
 stack in general.  Did you mean this?

That's a different thing. We might be able to post-process branch tracing
and build a callchain on top of it (following calls and rets). Maybe we
will one day. But they are different issues altogether.

Thanks.


Re: rcu: eqs related warnings in linux-next

2012-09-29 Thread Frederic Weisbecker
2012/9/29 Sasha Levin levinsasha...@gmail.com:
 Maybe I could help here a bit.

 lappy linux # addr2line -i -e vmlinux 8111d45f
 /usr/src/linux/kernel/timer.c:549
 /usr/src/linux/include/linux/jump_label.h:101
 /usr/src/linux/include/trace/events/timer.h:44
 /usr/src/linux/kernel/timer.c:601
 /usr/src/linux/kernel/timer.c:734
 /usr/src/linux/kernel/timer.c:886

 Which means that it was about to:

 debug_object_activate(timer, timer_debug_descr);

I can't find anything in the debug object code that might fault.
I was suspecting some per-cpu allocated memory: per-cpu allocation
sometimes uses vmalloc, which uses lazy paging through faults. But I
can't find such a thing there.

Maybe there is some faulting specific to KVM...


Re: rcu: eqs related warnings in linux-next

2012-09-29 Thread Frederic Weisbecker
On Sat, Sep 29, 2012 at 06:37:37AM -0700, Paul E. McKenney wrote:
 On Sat, Sep 29, 2012 at 02:25:04PM +0200, Frederic Weisbecker wrote:
  2012/9/29 Sasha Levin levinsasha...@gmail.com:
   Maybe I could help here a bit.
  
   lappy linux # addr2line -i -e vmlinux 8111d45f
   /usr/src/linux/kernel/timer.c:549
   /usr/src/linux/include/linux/jump_label.h:101
   /usr/src/linux/include/trace/events/timer.h:44
   /usr/src/linux/kernel/timer.c:601
   /usr/src/linux/kernel/timer.c:734
   /usr/src/linux/kernel/timer.c:886
  
   Which means that it was about to:
  
   debug_object_activate(timer, timer_debug_descr);
 
 Understood and agreed, hence my severe diagnostic patch.
 
  I can't find anything in the debug object code that might fault.
  I was suspecting some per cpu allocated memory: per cpu allocation
  sometimes use vmalloc
  which uses lazy paging using faults. But I can't find such thing there.
  
  May be there is some faulting specific to KVM...
 
 Sasha, is the easily reproducible?  If so, could you please try the
 previous patch?  It will likely give us more information on where
 this bug really lives.  (Yes, it might totally obscure the bug, but
 in that case we will just need to try some other perturbation.)

Isn't your patch actually removing the timer? But if so, we won't fault
anymore. Or maybe you want to check whether we also fault outside the timer?

Just in case, I'm posting a second patch that dumps the regs when we
fault in the middle of an RCU user mode API. This way we can find
the precise rip where we fault:

---
From db4ef9708e606754ac8a3f83b9f293383d263108 Mon Sep 17 00:00:00 2001
From: Frederic Weisbecker fweis...@gmail.com
Date: Sat, 29 Sep 2012 14:16:09 +0200
Subject: [PATCH] rcu: Debug nasty rcu user mode API recursion

Add some debug code to chase down the origin of the fault.

Not-Signed-off-by: Frederic Weisbecker fweis...@gmail.com
---
 arch/x86/mm/fault.c  |1 +
 include/linux/rcupdate.h |1 +
 kernel/rcutree.c |   32 
 kernel/rcutree.h |1 +
 4 files changed, 35 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a530b23..a5f0eb5 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1232,6 +1232,7 @@ good_area:
 dotraplinkage void __kprobes
 do_page_fault(struct pt_regs *regs, unsigned long error_code)
 {
+   rcu_check_user_recursion(regs);
exception_enter(regs);
__do_page_fault(regs, error_code);
exception_exit(regs);
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 7c968e4..14ba908 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -199,6 +199,7 @@ extern void rcu_user_enter_after_irq(void);
 extern void rcu_user_exit_after_irq(void);
 extern void rcu_user_hooks_switch(struct task_struct *prev,
  struct task_struct *next);
+extern void rcu_check_user_recursion(struct pt_regs *regs);
 #else
 static inline void rcu_user_enter(void) { }
 static inline void rcu_user_exit(void) { }
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 4fb2376..63b84f5 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -405,6 +405,20 @@ void rcu_idle_enter(void)
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
 
 #ifdef CONFIG_RCU_USER_QS
+void rcu_check_user_recursion(struct pt_regs *regs)
+{
+   unsigned long flags;
+   static int printed;
+
+   local_irq_save(flags);
+   if (__this_cpu_read(rcu_dynticks.recursion) && !printed) {
+   printed = 1;
+   printk("Found recursion\n");
+   show_regs(regs);
+   }
+   local_irq_restore(flags);
+}
+
 /**
  * rcu_user_enter - inform RCU that we are resuming userspace.
  *
@@ -433,10 +447,20 @@ void rcu_user_enter(void)
 
local_irq_save(flags);
rdtp = __get_cpu_var(rcu_dynticks);
+   if (WARN_ON_ONCE(rdtp->recursion)) {
+   local_irq_restore(flags);
+   return;
+   }
+
+   rdtp->recursion = true;
+   barrier();
+
if (!rdtp->ignore_user_qs && !rdtp->in_user) {
rdtp->in_user = true;
rcu_eqs_enter(true);
}
+   rdtp->recursion = false;
+
local_irq_restore(flags);
 }
 
@@ -590,10 +614,18 @@ void rcu_user_exit(void)
 
local_irq_save(flags);
rdtp = __get_cpu_var(rcu_dynticks);
+   if (WARN_ON_ONCE(rdtp->recursion)) {
+   local_irq_restore(flags);
+   return;
+   }
+
+   rdtp->recursion = true;
+   barrier();
if (rdtp->in_user) {
rdtp->in_user = false;
rcu_eqs_exit(true);
}
+   rdtp->recursion = false;
local_irq_restore(flags);
 }
 
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 5faf05d..1bde9d5 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -103,6 +103,7 @@ struct rcu_dynticks {
int tick_nohz_enabled_snap; /* Previously seen value from sysfs. */
 #endif /* #ifdef

Re: [PATCHv2] perf x86_64: Fix rsp register for system call fast path

2012-10-02 Thread Frederic Weisbecker
On Tue, Oct 02, 2012 at 04:58:15PM +0200, Jiri Olsa wrote:
 diff --git a/arch/x86/kernel/cpu/perf_event.c 
 b/arch/x86/kernel/cpu/perf_event.c
 index 915b876..11d62ff 100644
 --- a/arch/x86/kernel/cpu/perf_event.c
 +++ b/arch/x86/kernel/cpu/perf_event.c
 @@ -34,6 +34,7 @@
 #include <asm/timer.h>
 #include <asm/desc.h>
 #include <asm/ldt.h>
+#include <asm/syscall.h>
 
 #include "perf_event.h"
  
 @@ -1699,6 +1700,52 @@ void arch_perf_update_userpage(struct 
 perf_event_mmap_page *userpg, u64 now)
  userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;
  }
  
 +#ifdef CONFIG_X86_64
 +__weak int arch_sample_regs_user(struct pt_regs *oregs, struct pt_regs *regs)
 +{
 + int kernel = !user_mode(regs);
 +
 + if (kernel) {
 + if (current->mm)
 + regs = task_pt_regs(current);
 + else
 + regs = NULL;
 + }

Shouldn't the above stay in generic code?

 +
 + if (regs) {
 + memcpy(oregs, regs, sizeof(*regs));
 +
 + /*
 +  * If the perf event was triggered within the kernel code
 +  * path, then it was either syscall or interrupt. While
 +  * interrupt stores almost all user registers, the syscall
 +  * fast path does not. At this point we can at least set
 +  * rsp register right, which is crucial for dwarf unwind.
 +  *
 +  * The syscall_get_nr function returns -1 (orig_ax) for
 +  * interrupt, and positive value for syscall.
 +  *
 +  * We have two race windows in here:
 +  *
 +  * 1) Few instructions from syscall entry until old_rsp is
 +  *set.
 +  *
 +  * 2) In syscall/interrupt path from entry until the orig_ax
 +  *is set.
 +  *
 +  * Above described race windows are fractional opposed to
 +  * the syscall fast path, so we get much better results
 +  * fixing rsp this way.

That said, a race is there already: if the syscall is interrupted before
SAVE_ARGS and co.

I'm trying to scratch my head to find a solution to detect the race and
bail out instead of recording erroneous values but I can't find one.

Anyway this is still better than what we have now.

Another solution could be to force the syscall slow path and have some
variable set there that tells us we are in a syscall and all regs have
been saved.

But we probably don't want to force syscall slow path...


Re: [PATCH 1/3] perf tools: Check existence of _get_comp_words_by_ref when bash completing

2012-10-02 Thread Frederic Weisbecker
On Wed, Oct 03, 2012 at 12:21:32AM +0900, Namhyung Kim wrote:
 The '_get_comp_words_by_ref' function is available from the bash
 completion v1.2 so that earlier version emits following warning:
 
    $ perf re<TAB>_get_comp_words_by_ref: command not found
 
 Use older '_get_cword' method when the above function doesn't exist.

Maybe only use _get_cword then, if it works everywhere?
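
Something like the below, assuming _get_cword is available in every
bash-completion version we care about (illustration only, not the actual
patch; the command list is hard-coded here just for the example):

	_perf()
	{
		local cur cmd
		COMPREPLY=()
		cur=$(_get_cword)
		cmd=${COMP_WORDS[0]}
		COMPREPLY=( $( compgen -W "record report stat top annotate" -- "$cur" ) )
	}
	complete -F _perf perf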

 
 Cc: Frederic Weisbecker fweis...@gmail.com
 Cc: David Ahern dsah...@gmail.com
 Signed-off-by: Namhyung Kim namhy...@kernel.org
 ---
  tools/perf/bash_completion |   15 +--
  1 file changed, 13 insertions(+), 2 deletions(-)
 
 diff --git a/tools/perf/bash_completion b/tools/perf/bash_completion
 index 1958fa539d0f..3d48cee1b5e5 100644
 --- a/tools/perf/bash_completion
 +++ b/tools/perf/bash_completion
 @@ -1,12 +1,23 @@
  # perf completion
  
 +function_exists()
 +{
 + declare -F $1 > /dev/null
 + return $?
 +}
 +
 have perf &&
  _perf()
  {
 - local cur cmd
 + local cur prev cmd
  
   COMPREPLY=()
 - _get_comp_words_by_ref cur prev
 + if function_exists _get_comp_words_by_ref; then
 + _get_comp_words_by_ref cur prev
 + else
 + cur=$(_get_cword)
 + prev=${COMP_WORDS[COMP_CWORD-1]}
 + fi
  
   cmd=${COMP_WORDS[0]}
  
 -- 
 1.7.9.2
 


Re: [PATCH 0/3] perf tools: Bash completion update

2012-10-02 Thread Frederic Weisbecker
On Wed, Oct 03, 2012 at 12:21:31AM +0900, Namhyung Kim wrote:
 Hi,
 
 This patchset improves bash completion support for perf tools.  Some
 option names are really painful to type so here comes a support for
 completing those long option names.  But I still think the
 --showcpuutilization option needs to be renamed (at least adding a
 couple of dashes in it).
 
 Thanks,
 Namhyung

Acked-by: Frederic Weisbecker fweis...@gmail.com

Thanks Namhyung!

 
 
 Namhyung Kim (3):
   perf tools: Check existence of _get_comp_words_by_ref when bash completing
   perf tools: Complete long option names of perf command
   perf tools: Long option completion support for each subcommands
 
  tools/perf/bash_completion  |   36 +---
  tools/perf/util/parse-options.c |8 
  tools/perf/util/parse-options.h |1 +
  3 files changed, 38 insertions(+), 7 deletions(-)
 
 -- 
 1.7.9.2
 


Re: [PATCHv2] perf x86_64: Fix rsp register for system call fast path

2012-10-02 Thread Frederic Weisbecker
On Tue, Oct 02, 2012 at 06:06:26PM +0200, Jiri Olsa wrote:
 On Tue, Oct 02, 2012 at 05:49:26PM +0200, Frederic Weisbecker wrote:
  On Tue, Oct 02, 2012 at 04:58:15PM +0200, Jiri Olsa wrote:
   diff --git a/arch/x86/kernel/cpu/perf_event.c 
   b/arch/x86/kernel/cpu/perf_event.c
   index 915b876..11d62ff 100644
   --- a/arch/x86/kernel/cpu/perf_event.c
   +++ b/arch/x86/kernel/cpu/perf_event.c
   @@ -34,6 +34,7 @@
#include <asm/timer.h>
#include <asm/desc.h>
#include <asm/ldt.h>
   +#include <asm/syscall.h>

 #include "perf_event.h"

   @@ -1699,6 +1700,52 @@ void arch_perf_update_userpage(struct 
   perf_event_mmap_page *userpg, u64 now)
 userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;
}

   +#ifdef CONFIG_X86_64
   +__weak int arch_sample_regs_user(struct pt_regs *oregs, struct pt_regs 
   *regs)
   +{
   + int kernel = !user_mode(regs);
   +
   + if (kernel) {
    + if (current->mm)
   + regs = task_pt_regs(current);
   + else
   + regs = NULL;
   + }
  
  Shouldn't the above stay in generic code?
 
 could be.. I guess I thought that having the regs retrieval
 plus the fixup at the same place feels better/compact ;)
 
 but could change that if needed

Yeah please.

  
  I'm trying to scratch my head to find a solution to detect the race and
  bail out instead of recording erroneous values but I can't find one.
  
  Anyway this is still better than what we have now.
  
  Another solution could be to force syscall slow path and have some variable
  set there that tells us we are in a syscall and every regs have been saved.
  
  But we probably don't want to force syscall slow path...
 
 I was trying something like that as well, but the one I sent looks
 far less hacky to me.. :)

Actually it's more hacky because it's less deterministic.
But it's simpler, and doesn't hurt performance.

Ok, let's start with that.


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Frederic Weisbecker
2013/1/10 Russell King - ARM Linux li...@arm.linux.org.uk:
 On Thu, Jan 10, 2013 at 09:02:15AM -0500, Don Zickus wrote:
 On Wed, Jan 09, 2013 at 05:57:39PM -0800, Colin Cross wrote:
  Emulate NMIs on systems where they are not available by using timer
  interrupts on other cpus.  Each cpu will use its softlockup hrtimer
  to check that the next cpu is processing hrtimer interrupts by
  verifying that a counter is increasing.
 
  This patch is useful on systems where the hardlockup detector is not
  available due to a lack of NMIs, for example most ARM SoCs.

 I have seen other cpus, like Sparc I think, create a 'virtual NMI' by
 reserving an IRQ line as 'special' (can not be masked).  Not sure if that
 is something worth looking at here (or even possible).

 No it isn't, because that assumes that things like spin_lock_irqsave()
 won't mask that interrupt.  We don't have the facility to do that.

I believe sparc is doing something like this though. Look at
arch/sparc/include/asm/irqflags_64.h, it seems NMIs are implemented
there using an irq number that is not masked by this function.

Not all archs can do that so easily I guess.
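
Roughly, from my reading of the sparc64 code (simplified, so take the
details with a grain of salt): interrupt priority level 15 is reserved as
a pseudo-NMI, and "disabling" interrupts only raises the priority level
to 14, so the level 15 interrupt still fires under spin_lock_irqsave():

	/* asm/pil.h */
	#define PIL_NORMAL_MAX	14
	#define PIL_NMI		15

	/* irqflags_64.h: irq "disable" only masks levels <= 14 */
	static inline void arch_local_irq_disable(void)
	{
		__asm__ __volatile__("wrpr %0, %%pil"
				     : /* no outputs */
				     : "i" (PIL_NORMAL_MAX)
				     : "memory");
	}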


Re: Request for tree inclusion

2013-01-10 Thread Frederic Weisbecker
2012/12/3 Frederic Weisbecker fweis...@gmail.com:
 2012/12/2 Stephen Rothwell s...@canb.auug.org.au:
 Well, these are a bit late (I expected Linus to release v3.7 today), but
 since Ingo has not piped in over the weekend, I have added them from today
 after the tip tree merge.

 Yeah sorry to submit that so late. Those branches are in pending pull
 requests to the -tip tree and I thought about relying on the
 propagation of -tip into -next as usual. But Ingo has been very busy
 with numa related work during this cycle. So until these branches get
 merged in -tip, I'm short-circuiting a bit the -next step before it
 becomes too late for the next merge window.


 I have called them fw-cputime, fs-sched and fw-nohz respectively and
 listed you as the only contact in case of problems.

 Ok.

 If these are to be
 long term trees included in linux-next, I would prefer that you use
 better branch names - otherwise, if they are just short term, please tell
 me to remove them when they are finished with.

 They are definitely short term. I'll tell you once these can be dropped.

 Thanks a lot!

Hi Stephen!

fw-cputime and fs-sched have been merged so you can now remove these
branches from next.

But fw-nohz remains. In the meantime I have created a branch named nohz/next:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
nohz/next

This branch currently refers to fw-nohz HEAD (aka nohz/printk-v8) and
this is also the place where I'll gather -next materials in the future
instead of the multiple branches you're currently pulling.
So could you please remove fw-nohz (nohz/printk-v8) as well from -next
but include nohz/next instead?

Thanks!


Re: [PATCH v2] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-14 Thread Frederic Weisbecker
2013/1/11 Colin Cross ccr...@android.com:
 Emulate NMIs on systems where they are not available by using timer
 interrupts on other cpus.  Each cpu will use its softlockup hrtimer
 to check that the next cpu is processing hrtimer interrupts by
 verifying that a counter is increasing.

 This patch is useful on systems where the hardlockup detector is not
 available due to a lack of NMIs, for example most ARM SoCs.
 Without this patch any cpu stuck with interrupts disabled can
 cause a hardware watchdog reset with no debugging information,
 but with this patch the kernel can detect the lockup and panic,
 which can result in useful debugging info.

 Signed-off-by: Colin Cross ccr...@android.com

I believe this is pretty much what the RCU stall detector does
already: it checks that other CPUs are responsive. The only difference
is in how it checks that. For RCU it's about checking for CPUs
reporting quiescent states when requested to do so. In your case it's
about ensuring the hrtimer interrupt is handled properly.

One thing you can do is to enqueue an RCU callback (call_rcu()) every
minute so you can force other CPUs to report quiescent states
periodically and thus check for lockups.

Now you'll face the same problem in the end: if you don't have NMIs,
you won't have a very useful report.
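
For the record, the periodic call_rcu() idea would look roughly like the
below (untested sketch, made-up names, arbitrary 10 second period):

	#include <linux/init.h>
	#include <linux/rcupdate.h>
	#include <linux/timer.h>

	static struct rcu_head rcu_poke_head;
	static struct timer_list rcu_poke_timer;

	/*
	 * Queueing a callback is what forces a grace period; the RCU stall
	 * detector then complains about any CPU that never reports a
	 * quiescent state.
	 */
	static void rcu_poke_cb(struct rcu_head *head)
	{
		/* The rcu_head is ours again, safe to rearm */
		mod_timer(&rcu_poke_timer, jiffies + 10 * HZ);
	}

	static void rcu_poke_fn(unsigned long unused)
	{
		call_rcu(&rcu_poke_head, rcu_poke_cb);
	}

	static int __init rcu_poke_init(void)
	{
		setup_timer(&rcu_poke_timer, rcu_poke_fn, 0);
		mod_timer(&rcu_poke_timer, jiffies + 10 * HZ);
		return 0;
	}
	late_initcall(rcu_poke_init);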


Re: [PATCH v2] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-14 Thread Frederic Weisbecker
2013/1/15 Colin Cross ccr...@android.com:
 On Mon, Jan 14, 2013 at 4:13 PM, Frederic Weisbecker fweis...@gmail.com 
 wrote:
 I believe this is pretty much what the RCU stall detector does
 already: checks for other CPUs being responsive. The only difference
 is on how it checks that. For RCU it's about checking for CPUs
 reporting quiescent states when requested to do so. In your case it's
 about ensuring the hrtimer interrupt is well handled.

 One thing you can do is to enqueue an RCU callback (call_rcu()) every
 minute so you can force other CPUs to report quiescent states
 periodically and thus check for lockups.

 That's a good point, I'll take a look at using that.  A minute is too
 long, some SoCs have maximum HW watchdog periods of under 30 seconds,
 but a call_rcu every 10-20 seconds might be sufficient.

Sure. And you can tune CONFIG_RCU_CPU_STALL_TIMEOUT accordingly.


Re: [PATCH v2] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-14 Thread Frederic Weisbecker
2013/1/15 Colin Cross ccr...@android.com:
 On Mon, Jan 14, 2013 at 4:25 PM, Frederic Weisbecker fweis...@gmail.com 
 wrote:
 2013/1/15 Colin Cross ccr...@android.com:
 On Mon, Jan 14, 2013 at 4:13 PM, Frederic Weisbecker fweis...@gmail.com 
 wrote:
 I believe this is pretty much what the RCU stall detector does
 already: checks for other CPUs being responsive. The only difference
 is on how it checks that. For RCU it's about checking for CPUs
 reporting quiescent states when requested to do so. In your case it's
 about ensuring the hrtimer interrupt is well handled.

  One thing you can do is to enqueue an RCU callback (call_rcu()) every
 minute so you can force other CPUs to report quiescent states
 periodically and thus check for lockups.

 That's a good point, I'll take a look at using that.  A minute is too
 long, some SoCs have maximum HW watchdog periods of under 30 seconds,
 but a call_rcu every 10-20 seconds might be sufficient.

 Sure. And you can tune CONFIG_RCU_CPU_STALL_TIMEOUT accordingly.

 After considering this, I think the hrtimer watchdog is more useful.
 RCU stalls are not usually panic events, and I wouldn't want to add a
 panic on every RCU stall.  The lack of stack traces on the affected
 cpu makes a panic important.  I'm planning to add an ARM DBGPCSR panic
 handler, which will be able to dump the PC of a stuck cpu even if it
 is not responding to interrupts.  kexec or kgdb on panic might also
 allow some inspection of the stack on stuck cpu.

 Failing to process interrupts is a much more serious event than an RCU
 stall, and being able to detect them separately may be very valuable
 for debugging.

RCU stalls can happen for different reasons: softlockup (failure to
schedule another task), hardlockup (failure to process interrupts), or
a bug in RCU itself. But if you have a hardlockup, it will report it.

Now why do you need a panic in any case? I don't know DBGPCSR, is this
a breakpoint register? How do you plan to use it remotely from the CPU
that detects the lockup?


Re: [PATCH tip/core/urgent 1/2] rcu: Prevent soft-lockup complaints about no-CBs CPUs

2013-01-05 Thread Frederic Weisbecker
2013/1/5 Paul E. McKenney paul...@linux.vnet.ibm.com:
 On Sat, Jan 05, 2013 at 06:21:01PM +0100, Frederic Weisbecker wrote:
 Hi Paul,

 2013/1/5 Paul E. McKenney paul...@linux.vnet.ibm.com:
  From: Paul Gortmaker paul.gortma...@windriver.com
 
  The wait_event() at the head of the rcu_nocb_kthread() can result in
  soft-lockup complaints if the CPU in question does not register RCU
  callbacks for an extended period.  This commit therefore changes
  the wait_event() to a wait_event_interruptible().
 
  Reported-by: Frederic Weisbecker fweis...@gmail.com
  Signed-off-by: Paul Gortmaker paul.gortma...@windriver.com
  Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com
  ---
   kernel/rcutree_plugin.h |3 ++-
   1 files changed, 2 insertions(+), 1 deletions(-)
 
  diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
  index f6e5ec2..43dba2d 100644
  --- a/kernel/rcutree_plugin.h
  +++ b/kernel/rcutree_plugin.h
  @@ -2366,10 +2366,11 @@ static int rcu_nocb_kthread(void *arg)
  for (;;) {
  /* If not polling, wait for next batch of callbacks. */
  if (!rcu_nocb_poll)
 -   wait_event(rdp->nocb_wq, rdp->nocb_head);
 +   wait_event_interruptible(rdp->nocb_wq, rdp->nocb_head);
 list = ACCESS_ONCE(rdp->nocb_head);
  if (!list) {
  schedule_timeout_interruptible(1);
  +   flush_signals(current);

 Why is that needed?

 To satisfy my paranoia.  ;-)  And in case someone ever figures out some
 way to send a signal to a kthread.

Ok. I don't want to cause any insomnia to anyone, so I won't insist ;)


Re: [PATCH 03/27] cputime: Allow dynamic switch between tick/virtual based cputime accounting

2013-01-07 Thread Frederic Weisbecker
Hey Paul,

2013/1/4 Paul Gortmaker paul.gortma...@windriver.com:
 On 12-12-29 11:42 AM, Frederic Weisbecker wrote:
 Allow to dynamically switch between tick and virtual based cputime 
 accounting.
 This way we can provide a kind of on-demand virtual based cputime
 accounting. In this mode, the kernel will rely on the user hooks
 subsystem to dynamically hook on kernel boundaries.

 This is in preparation for beeing able to stop the timer tick further
 idle. Doing so will depend on CONFIG_VIRT_CPU_ACCOUNTING which makes

 s/beeing/being/  -- also I know what you mean, but it may not be
 100% clear to everyone -- perhaps ...for being able to stop the
 timer tick in more places than just the idle state.

Thanks! Fixed for the next version!

[...]
 +static inline bool vtime_accounting(void) { return false; }

 It wasn't 100% obvious what vtime_accounting() was doing until I
 saw its definition below.  I wonder if it should be something like
 vtime_accounting_on() or vtime_accounting_enabled() instead?

Agreed, I've renamed into vtime_accounting_enabled().


  #endif

  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 diff --git a/init/Kconfig b/init/Kconfig
 index dad2b88..307bc35 100644
 --- a/init/Kconfig
 +++ b/init/Kconfig
 @@ -342,6 +342,7 @@ config VIRT_CPU_ACCOUNTING
   bool "Deterministic task and CPU time accounting"
   depends on HAVE_VIRT_CPU_ACCOUNTING || HAVE_CONTEXT_TRACKING
   select VIRT_CPU_ACCOUNTING_GEN if !HAVE_VIRT_CPU_ACCOUNTING
 + select VIRT_CPU_ACCOUNTING_NATIVE if HAVE_VIRT_CPU_ACCOUNTING
   help
 Select this option to enable more accurate task and CPU time
 accounting.  This is done by reading a CPU counter on each
 @@ -366,11 +367,16 @@ endchoice

  config VIRT_CPU_ACCOUNTING_GEN
   select CONTEXT_TRACKING
 + depends on VIRT_CPU_ACCOUNTING && HAVE_CONTEXT_TRACKING

 Should the 2nd half of this depends been already here, i.e. introduced
 with the prev. patch that created VIRT_CPU_ACCOUNTING_GEN?

Yeah, Li Zhong suggested that I turn *_GEN and *_NATIVE options into
distinct choices for the user. So I moved that part to the previous
patch.
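
For reference, the rough shape of what that gives in init/Kconfig (just a
sketch of the idea, not the actual hunk from the series):

	choice
		prompt "Cputime accounting"
		default TICK_CPU_ACCOUNTING

	config TICK_CPU_ACCOUNTING
		bool "Simple tick based cputime accounting"

	config VIRT_CPU_ACCOUNTING_NATIVE
		bool "Deterministic task and CPU time accounting"
		depends on HAVE_VIRT_CPU_ACCOUNTING

	config VIRT_CPU_ACCOUNTING_GEN
		bool "Full dynticks CPU time accounting"
		depends on HAVE_CONTEXT_TRACKING
		select CONTEXT_TRACKING

	endchoice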

Thanks!


Re: [PATCH rcu] Remove unused code originally used for context tracking

2013-01-07 Thread Frederic Weisbecker
2013/1/7 Paul E. McKenney paul...@linux.vnet.ibm.com:
 On Fri, Nov 30, 2012 at 02:19:22PM +0800, Li Zhong wrote:
 As new context tracking subsystem added, it seems ignore_user_qs and
 in_user defined in struct rcu_dynticks are no longer needed, so remove
 them.

 Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com

 Hearing no objections from  Frederic, I have queued this patch for 3.9

Thanks Paul!

And feel free to add my ack.


Re: [PATCH v2] context_tracking: Add comments on interface and internals

2013-01-07 Thread Frederic Weisbecker
2012/12/16 Ingo Molnar mi...@kernel.org:

 * Frederic Weisbecker fweis...@gmail.com wrote:
 +
 +/**
 + * context_tracking_task_switch - context switch the syscall hooks
 + *
 + * The context tracking uses the syscall slow path to implement its 
 user-kernel
 + * boundaries hooks on syscalls. This way it doesn't impact the syscall fast
 + * path on CPUs that don't do context tracking.
 + *
 + * But we need to clear the flag on the previous task because it may later
 + * migrate to some CPU that doesn't do the context tracking. As such the TIF
 + * flag may not be desired there.

 If possible: s/hooks/callbacks

 'hook' gives me the visual of a box match. YMMV.

Ok, I'm fixing this.


[PATCH] sched: Remove broken check for skip clock update

2013-01-07 Thread Frederic Weisbecker
rq->skip_clock_update shouldn't be negative. Thus the check
in put_prev_task() is useless.

It was probably intended to do the following check:

if (prev->on_rq && !rq->skip_clock_update)

We only want to update the clock if the current task is
not voluntarily sleeping: otherwise deactivate_task()
already did the rq clock update in schedule(). But we want
to ignore that update if a ttwu did it for us, in which case
rq->skip_clock_update is 1.

But update_rq_clock() already takes care of that so we
can just remove the broken condition.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
---
 kernel/sched/core.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 15ba35e..8dfc461 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2886,7 +2886,7 @@ static inline void schedule_debug(struct task_struct 
*prev)
 
 static void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
-   if (prev->on_rq || rq->skip_clock_update < 0)
+   if (prev->on_rq)
update_rq_clock(rq);
prev->sched_class->put_prev_task(rq, prev);
 }
-- 
1.7.5.4


