[PATCH 10/27] rcu: Restart the tick on non-responding full dynticks CPUs
When a CPU in full dynticks mode doesn't respond to complete a grace
period, issue it a specific IPI so that it restarts the tick and chases
a quiescent state.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
Signed-off-by: Steven Rostedt rost...@goodmis.org
---
 kernel/rcutree.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index e441b77..302d360 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -53,6 +53,7 @@
 #include <linux/delay.h>
 #include <linux/stop_machine.h>
 #include <linux/random.h>
+#include <linux/tick.h>
 
 #include "rcutree.h"
 #include <trace/events/rcu.h>
@@ -743,6 +744,12 @@ static int dyntick_save_progress_counter(struct rcu_data *rdp)
 	return (rdp->dynticks_snap & 0x1) == 0;
 }
 
+static void rcu_kick_nohz_cpu(int cpu)
+{
+	if (tick_nohz_full_cpu(cpu))
+		smp_send_reschedule(cpu);
+}
+
 /*
  * Return true if the specified CPU has passed through a quiescent
  * state by virtue of being in or having passed through an dynticks
@@ -790,6 +797,9 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
 		rdp->offline_fqs++;
 		return 1;
 	}
+
+	rcu_kick_nohz_cpu(rdp->cpu);
+
 	return 0;
 }
-- 
1.7.5.4
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
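The kick itself is just a conditional reschedule IPI. A user-space sketch of that shape — the cpumask, the `tick_nohz_full_cpu()` stub and the IPI recorder below are illustrative stand-ins for the kernel primitives, not the real API:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for kernel state: bit i of nohz_full_mask set means CPU i
 * runs in full dynticks mode; ipi_sent[] records reschedule IPIs. */
static unsigned long nohz_full_mask;
static int ipi_sent[64];

static bool tick_nohz_full_cpu(int cpu)
{
	return (nohz_full_mask >> cpu) & 1UL;
}

/* In the kernel this raises a reschedule IPI; here we just record it. */
static void smp_send_reschedule(int cpu)
{
	ipi_sent[cpu] = 1;
}

/* Same shape as the patch: only full-dynticks CPUs need the kick, since
 * ticking CPUs will notice the grace period on their next tick anyway. */
static void rcu_kick_nohz_cpu(int cpu)
{
	if (tick_nohz_full_cpu(cpu))
		smp_send_reschedule(cpu);
}
```

The IPI restarts the tick on the target (see patch 20's `scheduler_ipi()` hook), which in turn lets RCU observe a quiescent state there.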
[PATCH 12/27] sched: Update rq clock on nohz CPU before migrating tasks
The sched_class::put_prev_task() callbacks of the rt and fair classes
refer to the rq clock to update their runtime statistics. A CPU running
in tickless mode may carry a stale value, so update the clock there
before migrating its tasks.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/sched/core.c  |    6 ++++++
 kernel/sched/sched.h |    7 +++++++
 2 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bfac40f..2fcbb03 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4894,6 +4894,12 @@ static void migrate_tasks(unsigned int dead_cpu)
 	 */
 	rq->stop = NULL;
 
+	/*
+	 * ->put_prev_task() needs to have an up-to-date value
+	 * of rq->clock[_task]
+	 */
+	update_nohz_rq_clock(rq);
+
 	for ( ; ; ) {
 		/*
 		 * There's this thread running, bail when that's the only
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fc88644..f24d91e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3,6 +3,7 @@
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
 #include <linux/stop_machine.h>
+#include <linux/tick.h>
 
 #include "cpupri.h"
 
@@ -963,6 +964,12 @@ static inline void dec_nr_running(struct rq *rq)
 
 extern void update_rq_clock(struct rq *rq);
 
+static inline void update_nohz_rq_clock(struct rq *rq)
+{
+	if (tick_nohz_full_cpu(cpu_of(rq)))
+		update_rq_clock(rq);
+}
+
 extern void activate_task(struct rq *rq, struct task_struct *p, int flags);
 extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
-- 
1.7.5.4
[PATCH 14/27] sched: Update rq clock on tickless CPUs before calling check_preempt_curr()
check_preempt_wakeup() of the fair class needs an up-to-date sched clock
value to update the runtime stats of the current task.

When a task is woken up, activate_task() is usually called right before
ttwu_do_wakeup() unless the task is already in the runqueue. In this
case we need to update the rq clock manually, in case the CPU runs
tickless, because ttwu_do_wakeup() calls check_preempt_wakeup().

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/sched/core.c |   17 ++++++++++++++++-
 1 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2fcbb03..3c1a806 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1346,6 +1346,12 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
 
 	rq = __task_rq_lock(p);
 	if (p->on_rq) {
+		/*
+		 * Ensure check_preempt_curr() won't deal with a stale value
+		 * of rq clock if the CPU is tickless. BTW do we actually need
+		 * check_preempt_curr() to be called here?
+		 */
+		update_nohz_rq_clock(rq);
 		ttwu_do_wakeup(rq, p, wake_flags);
 		ret = 1;
 	}
@@ -1523,8 +1529,17 @@ static void try_to_wake_up_local(struct task_struct *p)
 	if (!(p->state & TASK_NORMAL))
 		goto out;
 
-	if (!p->on_rq)
+	if (!p->on_rq) {
 		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
+	} else {
+		/*
+		 * Even if the task is on the runqueue we still
+		 * need to ensure check_preempt_curr() won't
+		 * deal with a stale rq clock value on a tickless
+		 * CPU
+		 */
+		update_nohz_rq_clock(rq);
+	}
 
 	ttwu_do_wakeup(rq, p, 0);
 	ttwu_stat(p, smp_processor_id(), 0);
-- 
1.7.5.4
[PATCH 18/27] sched: Update nohz rq clock before searching busiest group on load balancing
While load balancing an rq target, we look for the busiest group. This
operation may require an up-to-date rq clock if we end up calling
scale_rt_power(). To this end, update it manually if the target is
running tickless.

DOUBT: don't we actually also need this in the vanilla kernel, in case
this_cpu is in dyntick-idle mode?

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/sched/fair.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 698137d..473f50f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5023,6 +5023,19 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 
 	schedstat_inc(sd, lb_count[idle]);
 
+	/*
+	 * find_busiest_group() may need an uptodate cpu clock
+	 * for find_busiest_group() (see scale_rt_power()). If
+	 * the CPU is nohz, it's clock may be stale.
+	 */
+	if (tick_nohz_full_cpu(this_cpu)) {
+		local_irq_save(flags);
+		raw_spin_lock(&this_rq->lock);
+		update_rq_clock(this_rq);
+		raw_spin_unlock(&this_rq->lock);
+		local_irq_restore(flags);
+	}
+
 redo:
 	group = find_busiest_group(&env, &balance);
-- 
1.7.5.4
[PATCH 21/27] nohz: Only stop the tick on RCU nocb CPUs
On a full dynticks CPU, we want the RCU callbacks to be offloaded to
another CPU, otherwise we need to keep the tick to wait for the grace
period completion. Ensure the full dynticks CPU is also an rcu_nocb
one.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 include/linux/rcupdate.h |    7 +++++++
 kernel/rcutree.c         |    6 +++---
 kernel/rcutree_plugin.h  |   13 ++++---------
 kernel/time/tick-sched.c |   20 +++++++++++++++++---
 4 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 275aa3f..829312e 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -992,4 +992,11 @@ static inline notrace void rcu_read_unlock_sched_notrace(void)
 #define kfree_rcu(ptr, rcu_head)					\
 	__kfree_rcu(&((ptr)->rcu_head), offsetof(typeof(*(ptr)), rcu_head))
 
+#ifdef CONFIG_RCU_NOCB_CPU
+bool rcu_is_nocb_cpu(int cpu);
+#else
+static inline bool rcu_is_nocb_cpu(int cpu) { return false; };
+#endif
+
+
 #endif /* __LINUX_RCUPDATE_H */
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 302d360..e9e0ffa 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1589,7 +1589,7 @@
 rcu_send_cbs_to_orphanage(int cpu, struct rcu_state *rsp,
			  struct rcu_node *rnp, struct rcu_data *rdp)
 {
 	/* No-CBs CPUs do not have orphanable callbacks. */
-	if (is_nocb_cpu(rdp->cpu))
+	if (rcu_is_nocb_cpu(rdp->cpu))
 		return;
 
 	/*
@@ -2651,10 +2651,10 @@ static void _rcu_barrier(struct rcu_state *rsp)
 	 * corresponding CPU's preceding callbacks have been invoked.
 	 */
 	for_each_possible_cpu(cpu) {
-		if (!cpu_online(cpu) && !is_nocb_cpu(cpu))
+		if (!cpu_online(cpu) && !rcu_is_nocb_cpu(cpu))
 			continue;
 		rdp = per_cpu_ptr(rsp->rda, cpu);
-		if (is_nocb_cpu(cpu)) {
+		if (rcu_is_nocb_cpu(cpu)) {
 			_rcu_barrier_trace(rsp, "OnlineNoCB", cpu,
					   rsp->n_barrier_done);
 			atomic_inc(&rsp->barrier_cpu_count);
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index f6e5ec2..625b327 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2160,7 +2160,7 @@ static int __init rcu_nocb_setup(char *str)
 __setup("rcu_nocbs=", rcu_nocb_setup);
 
 /* Is the specified CPU a no-CBs CPU? */
-static bool is_nocb_cpu(int cpu)
+bool rcu_is_nocb_cpu(int cpu)
 {
 	if (have_rcu_nocb_mask)
 		return cpumask_test_cpu(cpu, rcu_nocb_mask);
@@ -2218,7 +2218,7 @@ static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
			    bool lazy)
 {
-	if (!is_nocb_cpu(rdp->cpu))
+	if (!rcu_is_nocb_cpu(rdp->cpu))
 		return 0;
 	__call_rcu_nocb_enqueue(rdp, rhp, &rhp->next, 1, lazy);
 	return 1;
@@ -2235,7 +2235,7 @@ static bool __maybe_unused rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp,
 	long qll = rsp->qlen_lazy;
 
 	/* If this is not a no-CBs CPU, tell the caller to do it the old way. */
-	if (!is_nocb_cpu(smp_processor_id()))
+	if (!rcu_is_nocb_cpu(smp_processor_id()))
 		return 0;
 	rsp->qlen = 0;
 	rsp->qlen_lazy = 0;
@@ -2275,7 +2275,7 @@ static bool nocb_cpu_expendable(int cpu)
 	 * If there are no no-CB CPUs or if this CPU is not a no-CB CPU,
 	 * then offlining this CPU is harmless.  Let it happen.
 	 */
-	if (!have_rcu_nocb_mask || is_nocb_cpu(cpu))
+	if (!have_rcu_nocb_mask || rcu_is_nocb_cpu(cpu))
 		return 1;
 
 	/* If no memory, play it safe and keep the CPU around. */
@@ -2456,11 +2456,6 @@ static void __init rcu_init_nocb(void)
 
 #else /* #ifdef CONFIG_RCU_NOCB_CPU */
 
-static bool is_nocb_cpu(int cpu)
-{
-	return false;
-}
-
 static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
			    bool lazy)
 {
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9d31b08..78e5341 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -587,6 +587,19 @@ void tick_nohz_idle_enter(void)
 	local_irq_enable();
 }
 
+#ifdef CONFIG_NO_HZ_FULL
+static bool can_stop_full_tick(int cpu)
+{
+	if (!sched_can_stop_tick())
+		return false;
+
+	if (!rcu_is_nocb_cpu(cpu))
+		return false;
[PATCH 23/27] nohz: Don't stop the tick if posix cpu timers are running
If either a per thread or a per process posix cpu timer is running,
don't stop the tick.

TODO: restart the tick if it is stopped and a posix cpu timer is
enqueued. Check whether we need a memory barrier for the per process
posix timer that can be enqueued from another task of the group.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 include/linux/posix-timers.h |    1 +
 kernel/posix-cpu-timers.c    |   11 +++++++++++
 kernel/time/tick-sched.c     |    4 ++++
 3 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 042058f..97480c2 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -119,6 +119,7 @@ int posix_timer_event(struct k_itimer *timr, int si_private);
 
 void posix_cpu_timer_schedule(struct k_itimer *timer);
 
 void run_posix_cpu_timers(struct task_struct *task);
+bool posix_cpu_timers_running(struct task_struct *tsk);
 void posix_cpu_timers_exit(struct task_struct *task);
 void posix_cpu_timers_exit_group(struct task_struct *task);
 
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 165d476..15f8f4f 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -1269,6 +1269,17 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
 	return 0;
 }
 
+bool posix_cpu_timers_running(struct task_struct *tsk)
+{
+	if (!task_cputime_zero(&tsk->cputime_expires))
+		return true;
+
+	if (tsk->signal->cputimer.running)
+		return true;
+
+	return false;
+}
+
 /*
  * This is called from the timer interrupt handler.  The irq handler has
  * already updated our counts.  We need to check if any timers fire now.
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 04504c4..eb6ad3d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -21,6 +21,7 @@
 #include <linux/sched.h>
 #include <linux/module.h>
 #include <linux/irq_work.h>
+#include <linux/posix-timers.h>
 
 #include <asm/irq_regs.h>
 
@@ -599,6 +600,9 @@ static bool can_stop_full_tick(int cpu)
 	if (rcu_pending(cpu))
 		return false;
 
+	if (posix_cpu_timers_running(current))
+		return false;
+
 	return true;
 }
 #endif
-- 
1.7.5.4
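With this patch applied, can_stop_full_tick() is a chain of subsystem vetoes built up across patches 20-23. A compilable sketch of that predicate, with the four booleans standing in for the real scheduler, RCU, and posix-timer queries (the stubs' signatures are simplified for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Stub inputs; in the kernel these come from the scheduler, RCU and
 * the posix CPU timer code respectively. */
static bool sched_ok, cpu_is_nocb, rcu_waiting, posix_timers_armed;

static bool sched_can_stop_tick(void)      { return sched_ok; }
static bool rcu_is_nocb_cpu(int cpu)       { (void)cpu; return cpu_is_nocb; }
static bool rcu_pending(int cpu)           { (void)cpu; return rcu_waiting; }
static bool posix_cpu_timers_running(void) { return posix_timers_armed; }

/* Any subsystem that still needs the periodic tick vetoes stopping it. */
static bool can_stop_full_tick(int cpu)
{
	if (!sched_can_stop_tick())
		return false;	/* more than one runnable task */

	if (!rcu_is_nocb_cpu(cpu))
		return false;	/* local RCU callbacks need the tick */

	if (rcu_pending(cpu))
		return false;	/* a quiescent state must be reported */

	if (posix_cpu_timers_running())
		return false;	/* cputime expiry relies on the tick */

	return true;
}
```

A single false input anywhere in the chain keeps the tick running; the tick is only stopped when every subsystem agrees.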
[PATCH 25/27] rcu: Don't keep the tick for RCU while in userspace
If we are interrupting userspace, we don't need to keep the tick for
RCU: quiescent states don't need to be reported because we soon run in
userspace, and local callbacks are handled by the nocb threads.

CHECKME: Do the nocb threads actually handle the global grace period
completion for local callbacks?

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/time/tick-sched.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 0e1ebff..76d1b08 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -22,6 +22,7 @@
 #include <linux/module.h>
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
+#include <linux/context_tracking.h>
 
 #include <asm/irq_regs.h>
 
@@ -604,10 +605,9 @@ static bool can_stop_full_tick(int cpu)
 
 	/*
 	 * Keep the tick if we are asked to report a quiescent state.
-	 * This must be further optimized (avoid checks for local callbacks,
-	 * ignore RCU in userspace, etc...)
+	 * This must be further optimized (avoid checks for local callbacks)
 	 */
-	if (rcu_pending(cpu)) {
+	if (!context_tracking_in_user() && rcu_pending(cpu)) {
 		trace_printk("Can't stop: RCU pending\n");
 		return false;
 	}
-- 
1.7.5.4
[PATCH 26/27] profiling: Remove unused timer hook
The last remaining user was oprofile, and its use was removed a while
ago in commit bc078e4eab65f11bbaeed380593ab8151b30d703 ("oprofile:
convert oprofile from timer_hook to hrtimer").

There doesn't seem to have been any upstream user of this hook for
about two years now, and I'm not aware of any out-of-tree user either.
Let's remove it.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 include/linux/profile.h |   13 -------------
 kernel/profile.c        |   24 ------------------------
 2 files changed, 0 insertions(+), 37 deletions(-)

diff --git a/include/linux/profile.h b/include/linux/profile.h
index a0fc322..2112390 100644
--- a/include/linux/profile.h
+++ b/include/linux/profile.h
@@ -82,9 +82,6 @@ int task_handoff_unregister(struct notifier_block * n);
 int profile_event_register(enum profile_type, struct notifier_block * n);
 int profile_event_unregister(enum profile_type, struct notifier_block * n);
 
-int register_timer_hook(int (*hook)(struct pt_regs *));
-void unregister_timer_hook(int (*hook)(struct pt_regs *));
-
 struct pt_regs;
 
 #else
@@ -135,16 +132,6 @@ static inline int profile_event_unregister(enum profile_type t, struct notifier_
 #define profile_handoff_task(a) (0)
 #define profile_munmap(a) do { } while (0)
 
-static inline int register_timer_hook(int (*hook)(struct pt_regs *))
-{
-	return -ENOSYS;
-}
-
-static inline void unregister_timer_hook(int (*hook)(struct pt_regs *))
-{
-	return;
-}
-
 #endif /* CONFIG_PROFILING */
 
 #endif /* _LINUX_PROFILE_H */
diff --git a/kernel/profile.c b/kernel/profile.c
index 1f39181..dc3384e 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -37,9 +37,6 @@ struct profile_hit {
 #define NR_PROFILE_HIT		(PAGE_SIZE/sizeof(struct profile_hit))
 #define NR_PROFILE_GRP		(NR_PROFILE_HIT/PROFILE_GRPSZ)
 
-/* Oprofile timer tick hook */
-static int (*timer_hook)(struct pt_regs *) __read_mostly;
-
 static atomic_t *prof_buffer;
 static unsigned long prof_len, prof_shift;
 
@@ -208,25 +205,6 @@ int profile_event_unregister(enum profile_type type, struct notifier_block *n)
 }
 EXPORT_SYMBOL_GPL(profile_event_unregister);
 
-int register_timer_hook(int (*hook)(struct pt_regs *))
-{
-	if (timer_hook)
-		return -EBUSY;
-	timer_hook = hook;
-	return 0;
-}
-EXPORT_SYMBOL_GPL(register_timer_hook);
-
-void unregister_timer_hook(int (*hook)(struct pt_regs *))
-{
-	WARN_ON(hook != timer_hook);
-	timer_hook = NULL;
-	/* make sure all CPUs see the NULL hook */
-	synchronize_sched();  /* Allow ongoing interrupts to complete. */
-}
-EXPORT_SYMBOL_GPL(unregister_timer_hook);
-
-
 #ifdef CONFIG_SMP
 /*
  * Each cpu has a pair of open-addressed hashtables for pending
@@ -436,8 +414,6 @@ void profile_tick(int type)
 {
 	struct pt_regs *regs = get_irq_regs();
 
-	if (type == CPU_PROFILING && timer_hook)
-		timer_hook(regs);
 	if (!user_mode(regs) && prof_cpu_mask != NULL &&
	    cpumask_test_cpu(smp_processor_id(), prof_cpu_mask))
 		profile_hit(type, (void *)profile_pc(regs));
-- 
1.7.5.4
[PATCH 27/27] timer: Don't run non-pinned timer to full dynticks CPUs
While trying to find a target for a non-pinned timer, use the following
logic:

- Use the closest (from a sched domain POV) busy CPU that is not full
  dynticks
- If none, use the closest idle CPU that is not full dynticks

So this is biased toward isolation over powersaving. This is a quick
hack until we provide a way for the user to tune that policy. A CPU
mask affinity for non-pinned timers could be such a solution.

Original-patch-by: Thomas Gleixner t...@linutronix.de
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/hrtimer.c    |    3 ++-
 kernel/sched/core.c |   26 +++++++++++++++++++++++---
 kernel/timer.c      |    3 ++-
 3 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 6db7a5e..f5da6fb 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -159,7 +159,8 @@ struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
 static int hrtimer_get_target(int this_cpu, int pinned)
 {
 #ifdef CONFIG_NO_HZ
-	if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu))
+	if (!pinned && get_sysctl_timer_migration() &&
+	    (idle_cpu(this_cpu) || tick_nohz_full_cpu(this_cpu)))
 		return get_nohz_timer_target();
 #endif
 	return this_cpu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7b6156a..e2884c5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -560,22 +560,42 @@ void resched_cpu(int cpu)
  */
 int get_nohz_timer_target(void)
 {
-	int cpu = smp_processor_id();
 	int i;
 	struct sched_domain *sd;
+	int cpu = smp_processor_id();
+	int target = -1;
 
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
 		for_each_cpu(i, sched_domain_span(sd)) {
+			/*
+			 * This is biased toward CPU isolation usecase:
+			 * try to migrate the timer to a busy non-full-nohz
+			 * CPU. If there is none, then prefer an idle CPU
+			 * than a full nohz one.
+			 * We shouldn't do policy here (isolation VS powersaving)
+			 * so this is a temporary hack. Being able to affine
+			 * non-pinned timers would be a better thing.
+			 */
+			if (tick_nohz_full_cpu(i))
+				continue;
+
 			if (!idle_cpu(i)) {
-				cpu = i;
+				target = i;
 				goto unlock;
 			}
+
+			if (target == -1)
+				target = i;
 		}
 	}
+
+	/* Fallback in case of NULL domain */
+	if (target == -1)
+		target = cpu;
+
 unlock:
 	rcu_read_unlock();
-	return cpu;
+
+	return target;
 }
 
 /*
  * When add_timer_on() enqueues a timer into the timer wheel of an
diff --git a/kernel/timer.c b/kernel/timer.c
index 970b57d..51dd02b 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -738,7 +738,8 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 	cpu = smp_processor_id();
 
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
-	if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu))
+	if (!pinned && get_sysctl_timer_migration() &&
+	    (idle_cpu(cpu) || tick_nohz_full_cpu(cpu)))
 		cpu = get_nohz_timer_target();
 #endif
 	new_base = per_cpu(tvec_bases, cpu);
-- 
1.7.5.4
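The selection policy in get_nohz_timer_target() can be exercised in isolation. In this sketch a flat CPU array replaces the sched-domain walk (so "closest" becomes simple array order), and both arrays are illustrative stand-ins for kernel state:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

static bool cpu_idle_state[NR_CPUS];	/* true: CPU is idle */
static bool cpu_nohz_full[NR_CPUS];	/* true: CPU is full dynticks */

/* First choice: a busy CPU that is not full dynticks.  Second choice:
 * the first non-full-dynticks idle CPU seen.  Last resort: ourselves. */
static int pick_timer_target(int this_cpu)
{
	int target = -1;

	for (int i = 0; i < NR_CPUS; i++) {
		if (cpu_nohz_full[i])
			continue;		/* never disturb isolated CPUs */

		if (!cpu_idle_state[i])
			return i;		/* busy non-nohz CPU: done */

		if (target == -1)
			target = i;		/* remember an idle fallback */
	}

	return target == -1 ? this_cpu : target;
}
```

This makes the isolation bias concrete: a busy tick-keeping CPU is preferred over an idle one (sacrificing powersaving), and a full-dynticks CPU is never chosen even when everything else is idle.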
[PATCH 24/27] nohz: Add some tracing
Not for merge, just for debugging.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/time/tick-sched.c |   27 ++++++++++++++++++++++-----
 1 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index eb6ad3d..0e1ebff 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -142,6 +142,7 @@ static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
 		ts->idle_jiffies++;
 	}
 #endif
+	trace_printk("tick\n");
 	update_process_times(user_mode(regs));
 	profile_tick(CPU_PROFILING);
 }
@@ -591,17 +592,30 @@ void tick_nohz_idle_enter(void)
 #ifdef CONFIG_NO_HZ_FULL
 static bool can_stop_full_tick(int cpu)
 {
-	if (!sched_can_stop_tick())
+	if (!sched_can_stop_tick()) {
+		trace_printk("Can't stop: sched\n");
 		return false;
+	}
 
-	if (!rcu_is_nocb_cpu(cpu))
+	if (!rcu_is_nocb_cpu(cpu)) {
+		trace_printk("Can't stop: not RCU nocb\n");
 		return false;
+	}
 
-	if (rcu_pending(cpu))
+	/*
+	 * Keep the tick if we are asked to report a quiescent state.
+	 * This must be further optimized (avoid checks for local callbacks,
+	 * ignore RCU in userspace, etc...)
+	 */
+	if (rcu_pending(cpu)) {
+		trace_printk("Can't stop: RCU pending\n");
 		return false;
+	}
 
-	if (posix_cpu_timers_running(current))
+	if (posix_cpu_timers_running(current)) {
+		trace_printk("Can't stop: posix CPU timers running\n");
 		return false;
+	}
 
 	return true;
 }
@@ -615,12 +629,15 @@ static void tick_nohz_full_stop_tick(struct tick_sched *ts)
 	if (!tick_nohz_full_cpu(cpu) || is_idle_task(current))
 		return;
 
-	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
+	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE) {
+		trace_printk("Can't stop: NOHZ_MODE_INACTIVE\n");
 		return;
+	}
 
 	if (!can_stop_full_tick(cpu))
 		return;
 
+	trace_printk("Stop tick\n");
 	tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
 #endif
 }
-- 
1.7.5.4
[PATCH 22/27] nohz: Don't turn off the tick if rcu needs it
If RCU is waiting for the current CPU to complete a grace period, don't
turn off the tick. Unlike dyntick-idle, we are not necessarily going to
enter into an RCU extended quiescent state, so we may need to keep the
tick to note the current CPU's quiescent states.

[added build fix from Zen Lin]

CHECKME: OTOH we don't want to handle a locally started grace period,
this should be offloaded for rcu_nocb CPUs. What we want is to be
kicked if we stay dynticks in the kernel for too long (ie: to report a
quiescent state). rcu_pending() is perhaps an overkill just for that.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
Signed-off-by: Steven Rostedt rost...@goodmis.org
---
 include/linux/rcupdate.h |    1 +
 kernel/rcutree.c         |    3 +--
 kernel/time/tick-sched.c |    3 +++
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 829312e..2ebadac 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -211,6 +211,7 @@ static inline int rcu_preempt_depth(void)
 extern void rcu_sched_qs(int cpu);
 extern void rcu_bh_qs(int cpu);
 extern void rcu_check_callbacks(int cpu, int user);
+extern int rcu_pending(int cpu);
 struct notifier_block;
 extern void rcu_idle_enter(void);
 extern void rcu_idle_exit(void);
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index e9e0ffa..6ba3e02 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -232,7 +232,6 @@ module_param(jiffies_till_next_fqs, ulong, 0644);
 
 static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *));
 static void force_quiescent_state(struct rcu_state *rsp);
-static int rcu_pending(int cpu);
 
 /*
  * Return the number of RCU-sched batches processed thus far for debug stats.
@@ -2521,7 +2520,7 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
 * by the current CPU, returning 1 if so.  This function is part of the
 * RCU implementation; it is -not- an exported member of the RCU API.
 */
-static int rcu_pending(int cpu)
+int rcu_pending(int cpu)
 {
 	struct rcu_state *rsp;
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 78e5341..04504c4 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -596,6 +596,9 @@ static bool can_stop_full_tick(int cpu)
 	if (!rcu_is_nocb_cpu(cpu))
 		return false;
 
+	if (rcu_pending(cpu))
+		return false;
+
 	return true;
 }
 #endif
-- 
1.7.5.4
[PATCH 20/27] nohz: Full dynticks mode
When a CPU is in full dynticks mode, try to switch it to nohz mode from the interrupt exit path if it is running a single non-idle task. Then restart the tick if necessary if we are enqueuing a second task while the timer is stopped, so that the scheduler tick is rearmed. [TODO: Check remaining things to be done from scheduler_tick()] [ Included build fix from Geoff Levand ] Signed-off-by: Frederic Weisbecker fweis...@gmail.com Cc: Alessio Igor Bogani abog...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Chris Metcalf cmetc...@tilera.com Cc: Christoph Lameter c...@linux.com Cc: Geoff Levand ge...@infradead.org Cc: Gilad Ben Yossef gi...@benyossef.com Cc: Hakan Akkan hakanak...@gmail.com Cc: Ingo Molnar mi...@kernel.org Cc: Paul E. McKenney paul...@linux.vnet.ibm.com Cc: Paul Gortmaker paul.gortma...@windriver.com Cc: Peter Zijlstra pet...@infradead.org Cc: Steven Rostedt rost...@goodmis.org Cc: Thomas Gleixner t...@linutronix.de --- include/linux/sched.h|6 + include/linux/tick.h |2 + kernel/sched/core.c | 22 - kernel/sched/sched.h | 10 + kernel/softirq.c |5 ++- kernel/time/tick-sched.c | 47 - 6 files changed, 83 insertions(+), 9 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 32860ae..132897d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2846,6 +2846,12 @@ static inline void inc_syscw(struct task_struct *tsk) #define TASK_SIZE_OF(tsk) TASK_SIZE #endif +#ifdef CONFIG_NO_HZ_FULL +extern bool sched_can_stop_tick(void); +#else +static inline bool sched_can_stop_tick(void) { return false; } +#endif + #ifdef CONFIG_MM_OWNER extern void mm_update_next_owner(struct mm_struct *mm); extern void mm_init_owner(struct mm_struct *mm, struct task_struct *p); diff --git a/include/linux/tick.h b/include/linux/tick.h index 2d4f6f0..dfb90ea 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -159,8 +159,10 @@ static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; } #ifdef 
CONFIG_NO_HZ_FULL int tick_nohz_full_cpu(int cpu); +extern void tick_nohz_full_check(void); #else static inline int tick_nohz_full_cpu(int cpu) { return 0; } +static inline void tick_nohz_full_check(void) { } #endif diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 3c1a806..7b6156a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1238,6 +1238,24 @@ static void update_avg(u64 *avg, u64 sample) } #endif +#ifdef CONFIG_NO_HZ_FULL +bool sched_can_stop_tick(void) +{ + struct rq *rq; + + rq = this_rq(); + + /* Make sure rq-nr_running update is visible after the IPI */ + smp_rmb(); + + /* More than one running task need preemption */ + if (rq-nr_running 1) + return false; + + return true; +} +#endif + static void ttwu_stat(struct task_struct *p, int cpu, int wake_flags) { @@ -1380,7 +1398,8 @@ static void sched_ttwu_pending(void) void scheduler_ipi(void) { - if (llist_empty(this_rq()-wake_list) !got_nohz_idle_kick()) + if (llist_empty(this_rq()-wake_list) !got_nohz_idle_kick() +!tick_nohz_full_cpu(smp_processor_id())) return; /* @@ -1397,6 +1416,7 @@ void scheduler_ipi(void) * somewhat pessimize the simple resched case. 
*/ irq_enter(); + tick_nohz_full_check(); sched_ttwu_pending(); /* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index f24d91e..63915fe 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -955,6 +955,16 @@ static inline u64 steal_ticks(u64 steal) static inline void inc_nr_running(struct rq *rq) { rq-nr_running++; + +#ifdef CONFIG_NO_HZ_FULL + if (rq-nr_running == 2) { + if (tick_nohz_full_cpu(rq-cpu)) { + /* Order rq-nr_running write against the IPI */ + smp_wmb(); + smp_send_reschedule(rq-cpu); + } + } +#endif } static inline void dec_nr_running(struct rq *rq) diff --git a/kernel/softirq.c b/kernel/softirq.c index f5cc25f..6342078 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -307,7 +307,8 @@ void irq_enter(void) int cpu = smp_processor_id(); rcu_irq_enter(); - if (is_idle_task(current) !in_interrupt()) { + + if ((is_idle_task(current) || tick_nohz_full_cpu(cpu)) !in_interrupt()) { /* * Prevent raise_softirq from needlessly waking up ksoftirqd * here, as softirq will be serviced on return from interrupt. @@ -349,7 +350,7 @@ void irq_exit(void) #ifdef CONFIG_NO_HZ /* Make sure that timer wheel updates are propagated */ - if (idle_cpu(smp_processor_id()) !in_interrupt() !need_resched()) + if (!in_interrupt()) tick_nohz_irq_exit(); #endif
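The tick-stop decision above boils down to a check on rq->nr_running, plus a reschedule IPI on the 1 -> 2 transition so the full-nohz CPU re-arms its tick. A minimal userspace sketch of that policy (all `toy_` names are hypothetical illustrations, not kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the patch's policy: the tick may stay stopped only
 * while a single task runs on the CPU (mirrors sched_can_stop_tick()). */
struct toy_rq {
    int nr_running;
};

static bool toy_sched_can_stop_tick(const struct toy_rq *rq)
{
    /* More than one runnable task needs preemption: keep the tick. */
    return rq->nr_running <= 1;
}

/* Mirrors inc_nr_running(): the 1 -> 2 transition is exactly when the
 * reschedule IPI would be sent to restart the tick on a nohz CPU. */
static bool toy_inc_nr_running(struct toy_rq *rq)
{
    rq->nr_running++;
    return rq->nr_running == 2;     /* true => kick the CPU */
}
```

In the real patch the smp_wmb()/smp_rmb() pair orders the nr_running update against the IPI, so the interrupted CPU is guaranteed to see the new count when scheduler_ipi() runs tick_nohz_full_check().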
[PATCH 17/27] sched: Update rq clock before idle balancing
idle_balance() is called from schedule() right before we schedule the idle task. It needs to record the idle timestamp at that time, and for this the rq clock must be accurate. If the CPU is running tickless, we need to update the rq clock manually.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/sched/fair.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e78d81104..698137d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5241,6 +5241,7 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 	int pulled_task = 0;
 	unsigned long next_balance = jiffies + HZ;
 
+	update_nohz_rq_clock(this_rq);
 	this_rq->idle_stamp = this_rq->clock;
 
 	if (this_rq->avg_idle < sysctl_sched_migration_cost)
-- 
1.7.5.4
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
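Why the manual update matters: on a tickless CPU, rq->clock is no longer refreshed by scheduler_tick(), so a timestamp taken from it lags real time. A toy illustration (hypothetical names, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Stands in for the real time source (sched_clock() in the kernel). */
static unsigned long long toy_time_ns;

struct toy_rq_clk {
    unsigned long long clock;       /* normally advanced by the tick */
    unsigned long long idle_stamp;
};

static void toy_idle_balance(struct toy_rq_clk *rq, bool tickless)
{
    if (tickless)
        rq->clock = toy_time_ns;    /* update_nohz_rq_clock() */
    rq->idle_stamp = rq->clock;     /* this timestamp must be accurate */
}
```

Without the refresh, idle_stamp records whatever value the last tick left behind, which skews the avg_idle heuristic that follows.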
[PATCH 19/27] nohz: Move nohz load balancer selection into idle logic
[ ** BUGGY PATCH: I need to put more thinking into this ** ] We want the nohz load balancer to be an idle CPU, thus move that selection to strict dyntick idle logic. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Cc: Alessio Igor Bogani abog...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Chris Metcalf cmetc...@tilera.com Cc: Christoph Lameter c...@linux.com Cc: Geoff Levand ge...@infradead.org Cc: Gilad Ben Yossef gi...@benyossef.com Cc: Hakan Akkan hakanak...@gmail.com Cc: Ingo Molnar mi...@kernel.org Cc: Paul E. McKenney paul...@linux.vnet.ibm.com Cc: Paul Gortmaker paul.gortma...@windriver.com Cc: Peter Zijlstra pet...@infradead.org Cc: Steven Rostedt rost...@goodmis.org Cc: Thomas Gleixner t...@linutronix.de [ added movement of calc_load_exit_idle() ] Signed-off-by: Steven Rostedt rost...@goodmis.org --- kernel/time/tick-sched.c | 11 ++- 1 files changed, 6 insertions(+), 5 deletions(-) diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index ab3aa14..164db94 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -444,9 +444,6 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts, * the scheduler tick in nohz_restart_sched_tick. 
*/ if (!ts-tick_stopped) { - nohz_balance_enter_idle(cpu); - calc_load_enter_idle(); - ts-last_tick = hrtimer_get_expires(ts-sched_timer); ts-tick_stopped = 1; } @@ -542,8 +539,11 @@ static void __tick_nohz_idle_enter(struct tick_sched *ts) ts-idle_expires = expires; } - if (!was_stopped ts-tick_stopped) + if (!was_stopped ts-tick_stopped) { ts-idle_jiffies = ts-last_jiffies; + nohz_balance_enter_idle(cpu); + calc_load_enter_idle(); + } } } @@ -651,7 +651,6 @@ static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now) tick_do_update_jiffies64(now); update_cpu_load_nohz(); - calc_load_exit_idle(); touch_softlockup_watchdog(); /* * Cancel the scheduled timer and restore the tick @@ -711,6 +710,8 @@ void tick_nohz_idle_exit(void) tick_nohz_stop_idle(cpu, now); if (ts-tick_stopped) { + nohz_balance_enter_idle(cpu); + calc_load_exit_idle(); tick_nohz_restart_sched_tick(ts, now); tick_nohz_account_idle_ticks(ts); } -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
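The patch moves nohz_balance_enter_idle()/calc_load_enter_idle() so they run only on the actual stop transition, guarded by the `!was_stopped && ts->tick_stopped` check. A toy model of that edge-triggered bookkeeping (hypothetical names; the real patch is flagged as buggy by its author, this only sketches the guard):

```c
#include <assert.h>
#include <stdbool.h>

static bool toy_tick_stopped;
static int toy_enter_idle_calls;    /* nohz_balance_enter_idle() count */

static void toy_idle_enter(void)
{
    bool was_stopped = toy_tick_stopped;

    toy_tick_stopped = true;        /* tick_nohz_stop_sched_tick() */
    if (!was_stopped && toy_tick_stopped)
        toy_enter_idle_calls++;     /* bookkeeping only on the transition */
}

static void toy_idle_exit(void)
{
    toy_tick_stopped = false;       /* tick_nohz_restart_sched_tick() */
}
```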
[PATCH 16/27] sched: Update clock of nohz busiest rq before balancing
move_tasks() and active_load_balance_cpu_stop() both need the busiest rq clock to be up to date, because they may end up calling can_migrate_task(), which uses rq->clock_task to determine whether the task running on the busiest runqueue is cache hot. Hence, if the busiest runqueue is tickless, update its clock before reading it.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
[ Forward port conflicts ]
Signed-off-by: Steven Rostedt rost...@goodmis.org
---
 kernel/sched/fair.c |   17 +++++++++++++++++
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d65ac7..e78d81104 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5002,6 +5002,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 {
 	int ld_moved, cur_ld_moved, active_balance = 0;
 	int lb_iterations, max_lb_iterations;
+	int clock_updated;
 	struct sched_group *group;
 	struct rq *busiest;
 	unsigned long flags;
@@ -5045,6 +5046,7 @@ redo:
 
 	ld_moved = 0;
 	lb_iterations = 1;
+	clock_updated = 0;
 	if (busiest->nr_running > 1) {
 		/*
 		 * Attempt to move tasks. If find_busiest_group has found
@@ -5068,6 +5070,14 @@ more_balance:
 		 */
 		cur_ld_moved = move_tasks(&env);
 		ld_moved += cur_ld_moved;
+
+		/*
+		 * Move tasks may end up calling can_migrate_task() which
+		 * requires an uptodate value of the rq clock.
+		 */
+		update_nohz_rq_clock(busiest);
+		clock_updated = 1;
+
 		double_rq_unlock(env.dst_rq, busiest);
 		local_irq_restore(flags);
@@ -5163,6 +5173,13 @@ more_balance:
 			busiest->active_balance = 1;
 			busiest->push_cpu = this_cpu;
 			active_balance = 1;
+			/*
+			 * active_load_balance_cpu_stop may end up calling
+			 * can_migrate_task() which requires an uptodate
+			 * value of the rq clock.
+			 */
+			if (!clock_updated)
+				update_nohz_rq_clock(busiest);
 		}
 		raw_spin_unlock_irqrestore(&busiest->lock, flags);
-- 
1.7.5.4
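The clock_updated flag exists so the busiest rq clock is refreshed at most once per balancing pass, even when both the move_tasks() path and the active-balance path run. Sketched with hypothetical `toy_` names:

```c
#include <assert.h>
#include <stdbool.h>

static int toy_clock_refreshes;

static void toy_update_nohz_rq_clock(void)
{
    toy_clock_refreshes++;
}

static void toy_load_balance(bool moved_tasks, bool need_active_balance)
{
    int clock_updated = 0;

    if (moved_tasks) {
        /* can_migrate_task() reads rq->clock_task: refresh first */
        toy_update_nohz_rq_clock();
        clock_updated = 1;
    }
    /* The active-balance path needs the clock too, but must not pay
     * for a redundant second refresh in the same pass. */
    if (need_active_balance && !clock_updated)
        toy_update_nohz_rq_clock();
}
```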
[PATCH 15/27] sched: Update rq clock earlier in unthrottle_cfs_rq
unthrottle_cfs_rq() reads rq->clock right before the rq clock gets updated. Call update_rq_clock() earlier so that we don't use a stale rq clock value.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/sched/fair.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a96f0f2..3d65ac7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2279,14 +2279,15 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	long task_delta;
 
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
-
 	cfs_rq->throttled = 0;
+
+	update_rq_clock(rq);
+
 	raw_spin_lock(&cfs_b->lock);
 	cfs_b->throttled_time += rq->clock - cfs_rq->throttled_clock;
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
 
-	update_rq_clock(rq);
 	/* update hierarchical throttle state */
 	walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
-- 
1.7.5.4
[PATCH 13/27] sched: Update rq clock on nohz CPU before setting fair group shares
We may update the execution time (sched_group_set_shares() -> update_cfs_shares() -> reweight_entity() -> update_curr()) before reweighting the entity after updating the group shares, and this requires an up-to-date version of the runqueue clock. Let's update it on the target CPU if it runs tickless, because scheduler_tick() is not there to maintain it.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/sched/fair.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eea870..a96f0f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6068,6 +6068,11 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 		se = tg->se[i];
 		/* Propagate contribution to hierarchy */
 		raw_spin_lock_irqsave(&rq->lock, flags);
+		/*
+		 * We may call update_curr() which needs an up-to-date
+		 * version of rq clock if the CPU runs tickless.
+		 */
+		update_nohz_rq_clock(rq);
 		for_each_sched_entity(se)
 			update_cfs_shares(group_cfs_rq(se));
 		raw_spin_unlock_irqrestore(&rq->lock, flags);
-- 
1.7.5.4
[PATCH 11/27] sched: Comment on rq-clock correctness in ttwu_do_wakeup() in nohz
Just to avoid confusion.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/sched/core.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 63b25e2..bfac40f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1302,6 +1302,12 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 	if (p->sched_class->task_woken)
 		p->sched_class->task_woken(rq, p);
 
+	/*
+	 * For adaptive nohz case: We called ttwu_activate()
+	 * which just updated the rq clock. There is an
+	 * exception with p->on_rq != 0 but in this case
+	 * we are not idle and rq->idle_stamp == 0
+	 */
 	if (rq->idle_stamp) {
 		u64 delta = rq->clock - rq->idle_stamp;
 		u64 max = 2*sysctl_sched_migration_cost;
-- 
1.7.5.4
[PATCH 09/27] nohz: Wake up full dynticks CPUs when a timer gets enqueued
Wake up a CPU when a timer list timer is enqueued there and the CPU is in full dynticks mode. Sending it an IPI makes it reconsider the next timer to program in light of the recent updates.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 include/linux/sched.h |    4 ++--
 kernel/sched/core.c   |   18 +++++++++++++++++-
 kernel/timer.c        |    2 +-
 3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3bca36e..32860ae 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2061,9 +2061,9 @@ static inline void idle_task_exit(void) {}
 #endif
 
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
-extern void wake_up_idle_cpu(int cpu);
+extern void wake_up_nohz_cpu(int cpu);
 #else
-static inline void wake_up_idle_cpu(int cpu) { }
+static inline void wake_up_nohz_cpu(int cpu) { }
 #endif
 
 extern unsigned int sysctl_sched_latency;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 257002c..63b25e2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -587,7 +587,7 @@ unlock:
  * account when the CPU goes back to idle and evaluates the timer
  * wheel for the next timer event.
  */
-void wake_up_idle_cpu(int cpu)
+static void wake_up_idle_cpu(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 
@@ -617,6 +617,22 @@ void wake_up_idle_cpu(int cpu)
 	smp_send_reschedule(cpu);
 }
 
+static bool wake_up_full_nohz_cpu(int cpu)
+{
+	if (tick_nohz_full_cpu(cpu)) {
+		smp_send_reschedule(cpu);
+		return true;
+	}
+
+	return false;
+}
+
+void wake_up_nohz_cpu(int cpu)
+{
+	if (!wake_up_full_nohz_cpu(cpu))
+		wake_up_idle_cpu(cpu);
+}
+
 static inline bool got_nohz_idle_kick(void)
 {
 	int cpu = smp_processor_id();
diff --git a/kernel/timer.c b/kernel/timer.c
index ff3b516..970b57d 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -936,7 +936,7 @@ void add_timer_on(struct timer_list *timer, int cpu)
 	 * makes sure that a CPU on the way to idle can not evaluate
 	 * the timer wheel.
 	 */
-	wake_up_idle_cpu(cpu);
+	wake_up_nohz_cpu(cpu);
 	spin_unlock_irqrestore(&base->lock, flags);
 }
 EXPORT_SYMBOL_GPL(add_timer_on);
-- 
1.7.5.4
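The new wake_up_nohz_cpu() is a two-way dispatch: a full-dynticks target gets a reschedule IPI, anything else falls back to the pre-existing idle wakeup path. A toy model of the dispatch (hypothetical `toy_` names, not kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Pretend CPU 1 is in the full_nohz set; the others are normal. */
static bool toy_is_full_nohz[4] = { false, true, false, false };

static const char *toy_wake_up_nohz_cpu(int cpu)
{
    if (toy_is_full_nohz[cpu])
        return "reschedule-ipi";    /* wake_up_full_nohz_cpu() path */
    return "idle-kick";             /* wake_up_idle_cpu() fallback */
}
```

This keeps add_timer_on() callers oblivious: they call one helper and the right kick is chosen per target CPU.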
[PATCH 08/27] nohz: Trace timekeeping update
Not for merge. This may become a real tracepoint.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/time/tick-sched.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f19e8bf..ab3aa14 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -118,8 +118,10 @@ static void tick_sched_do_timer(ktime_t now)
 #endif
 
 	/* Check, if the jiffies need an update */
-	if (tick_do_timer_cpu == cpu)
+	if (tick_do_timer_cpu == cpu) {
+		trace_printk("do timekeeping\n");
 		tick_do_update_jiffies64(now);
+	}
 }
 
 static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
-- 
1.7.5.4
[PATCH 06/27] nohz: Basic full dynticks interface
Start with a very simple interface to define full dynticks CPUs: a cpumask defined at boot time through the full_nohz= kernel parameter. Make sure you keep at least one CPU outside this range to handle the timekeeping. Also, full_nohz= must match the rcu_nocb= value.

Suggested-by: Paul E. McKenney paul...@linux.vnet.ibm.com
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
---
 include/linux/tick.h     |    7 +++++++
 kernel/time/Kconfig      |    9 +++++++++
 kernel/time/tick-sched.c |   23 +++++++++++++++++++++++
 3 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 553272e..2d4f6f0 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -157,6 +157,13 @@ static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
 static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
 # endif /* !NO_HZ */
 
+#ifdef CONFIG_NO_HZ_FULL
+int tick_nohz_full_cpu(int cpu);
+#else
+static inline int tick_nohz_full_cpu(int cpu) { return 0; }
+#endif
+
+
 # ifdef CONFIG_CPU_IDLE_GOV_MENU
 extern void menu_hrtimer_cancel(void);
 # else
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 8601f0d..dc6381d 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -70,6 +70,15 @@ config NO_HZ
 	  only trigger on an as-needed basis both when the system is
 	  busy and when the system is idle.
 
+config NO_HZ_FULL
+	bool "Full tickless system"
+	depends on NO_HZ && RCU_USER_QS && VIRT_CPU_ACCOUNTING_GEN && RCU_NOCB_CPU && SMP
+	select CONTEXT_TRACKING_FORCE
+	help
+	  Try to be tickless everywhere, not just in idle. (You need
+	  to fill up the full_nohz_mask boot parameter).
+
+
 config HIGH_RES_TIMERS
 	bool "High Resolution Timer Support"
 	depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index ad0e6fa..fac9ba4 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -142,6 +142,29 @@ static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
 	profile_tick(CPU_PROFILING);
 }
 
+#ifdef CONFIG_NO_HZ_FULL
+static cpumask_var_t full_nohz_mask;
+bool have_full_nohz_mask;
+
+int tick_nohz_full_cpu(int cpu)
+{
+	if (!have_full_nohz_mask)
+		return 0;
+
+	return cpumask_test_cpu(cpu, full_nohz_mask);
+}
+
+/* Parse the boot-time nohz CPU list from the kernel parameters. */
+static int __init tick_nohz_full_setup(char *str)
+{
+	alloc_bootmem_cpumask_var(&full_nohz_mask);
+	have_full_nohz_mask = true;
+	cpulist_parse(str, full_nohz_mask);
+	return 1;
+}
+__setup("full_nohz=", tick_nohz_full_setup);
+#endif
+
 /*
  * NOHZ - aka dynamic tick functionality
  */
-- 
1.7.5.4
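The full_nohz= parameter takes a cpulist such as "1-3,5". A userspace stand-in for what cpulist_parse() does with it (illustrative only; the kernel helper handles cpumasks wider than one machine word):

```c
#include <assert.h>
#include <stdlib.h>

/* Turn a cpulist string like "1-3,5" into a CPU bitmask. */
static unsigned long toy_cpulist_parse(const char *s)
{
    unsigned long mask = 0;

    while (*s) {
        char *end;
        long first = strtol(s, &end, 10);
        long last = first;

        if (*end == '-')                /* range: "a-b" */
            last = strtol(end + 1, &end, 10);
        for (long cpu = first; cpu <= last; cpu++)
            mask |= 1UL << cpu;
        s = (*end == ',') ? end + 1 : end;  /* next comma-separated item */
    }
    return mask;
}
```

For example, "1-3,5" sets bits 1, 2, 3 and 5, leaving CPU 0 out of the mask to act as the timekeeping housekeeper.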
[PATCH 05/27] cputime: Safely read cputime of full dynticks CPUs
While remotely reading the cputime of a task running in a full dynticks CPU, the values stored in utime/stime fields of struct task_struct may be stale. Its values may be those of the last kernel - user transition time snapshot and we need to add the tickless time spent since this snapshot. To fix this, flush the cputime of the dynticks CPUs on kernel - user transition and record the time / context where we did this. Then on top of this snapshot and the current time, perform the fixup on the reader side from task_times() accessors. FIXME: do the same for idle and guest time. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Cc: Alessio Igor Bogani abog...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Chris Metcalf cmetc...@tilera.com Cc: Christoph Lameter c...@linux.com Cc: Geoff Levand ge...@infradead.org Cc: Gilad Ben Yossef gi...@benyossef.com Cc: Hakan Akkan hakanak...@gmail.com Cc: Ingo Molnar mi...@kernel.org Cc: Paul E. McKenney paul...@linux.vnet.ibm.com Cc: Paul Gortmaker paul.gortma...@windriver.com Cc: Peter Zijlstra pet...@infradead.org Cc: Steven Rostedt rost...@goodmis.org Cc: Thomas Gleixner t...@linutronix.de --- arch/s390/kernel/vtime.c |6 +- include/asm-generic/cputime.h |1 + include/linux/hardirq.h |4 +- include/linux/init_task.h | 11 include/linux/sched.h | 16 + include/linux/vtime.h | 40 +++--- kernel/context_tracking.c |2 +- kernel/fork.c |6 ++ kernel/sched/cputime.c| 123 ++--- kernel/softirq.c |6 +- 10 files changed, 154 insertions(+), 61 deletions(-) diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c index e84b8b6..ce9cc5a 100644 --- a/arch/s390/kernel/vtime.c +++ b/arch/s390/kernel/vtime.c @@ -127,7 +127,7 @@ void vtime_account_user(struct task_struct *tsk) * Update process times based on virtual cpu times stored by entry.S * to the lowcore fields user_timer, system_timer steal_clock. 
*/ -void vtime_account(struct task_struct *tsk) +void vtime_account_irq_enter(struct task_struct *tsk) { struct thread_info *ti = task_thread_info(tsk); u64 timer, system; @@ -145,10 +145,10 @@ void vtime_account(struct task_struct *tsk) virt_timer_forward(system); } -EXPORT_SYMBOL_GPL(vtime_account); +EXPORT_SYMBOL_GPL(vtime_account_irq_enter); void vtime_account_system(struct task_struct *tsk) -__attribute__((alias(vtime_account))); +__attribute__((alias(vtime_account_irq_enter))); EXPORT_SYMBOL_GPL(vtime_account_system); void __kprobes vtime_stop_cpu(void) diff --git a/include/asm-generic/cputime.h b/include/asm-generic/cputime.h index 9a62937..3e704d5 100644 --- a/include/asm-generic/cputime.h +++ b/include/asm-generic/cputime.h @@ -10,6 +10,7 @@ typedef unsigned long __nocast cputime_t; #define cputime_to_jiffies(__ct) (__force unsigned long)(__ct) #define cputime_to_scaled(__ct)(__ct) #define jiffies_to_cputime(__hz) (__force cputime_t)(__hz) +#define jiffies_to_scaled(__hz)(__force cputime_t)(__hz) typedef u64 __nocast cputime64_t; diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h index 624ef3f..7105d5c 100644 --- a/include/linux/hardirq.h +++ b/include/linux/hardirq.h @@ -153,7 +153,7 @@ extern void rcu_nmi_exit(void); */ #define __irq_enter() \ do {\ - vtime_account_irq_enter(current); \ + account_irq_enter_time(current);\ add_preempt_count(HARDIRQ_OFFSET); \ trace_hardirq_enter(); \ } while (0) @@ -169,7 +169,7 @@ extern void irq_enter(void); #define __irq_exit() \ do {\ trace_hardirq_exit(); \ - vtime_account_irq_exit(current);\ + account_irq_exit_time(current); \ sub_preempt_count(HARDIRQ_OFFSET); \ } while (0) diff --git a/include/linux/init_task.h b/include/linux/init_task.h index 6d087c5..a6ef59f 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -10,6 +10,7 @@ #include linux/pid_namespace.h #include linux/user_namespace.h #include linux/securebits.h +#include linux/seqlock.h #include net/net_namespace.h #ifdef 
CONFIG_SMP @@ -141,6 +142,15 @@ extern struct task_group root_task_group; # define INIT_PERF_EVENTS(tsk) #endif +#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN +# define INIT_VTIME(tsk) \ + .vtime_seqlock = __SEQLOCK_UNLOCKED(tsk.vtime_seqlock), \ + .prev_jiffies = INITIAL_JIFFIES, /* CHECKME */ \ + .prev_jiffies_whence = JIFFIES_SYS, +#else +# define INIT_VTIME(tsk) +#endif + #define
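The reader-side fixup this patch describes can be sketched as: take a seqlock snapshot, read the flushed cputime plus the tickless delta since the last kernel -> user flush, and retry if a writer raced. A single-threaded toy with hypothetical names (the real code uses the kernel's vtime_seqlock and cputime_t):

```c
#include <assert.h>

struct toy_vtime {
    unsigned int seq;           /* even = stable, odd = write in flight */
    unsigned long long utime;   /* cputime flushed so far */
    unsigned long long snap;    /* timestamp of the last flush */
};

static unsigned long long toy_task_utime(const struct toy_vtime *v,
                                         unsigned long long now)
{
    unsigned int seq;
    unsigned long long t;

    do {
        seq = v->seq;                       /* read_seqbegin() */
        t = v->utime + (now - v->snap);     /* add the tickless delta */
    } while ((seq & 1) || seq != v->seq);   /* read_seqretry() */
    return t;
}
```

The point is that a remote reader never sees a torn utime/snap pair: either both come from before a flush or both from after it.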
[PATCH 03/27] cputime: Allow dynamic switch between tick/virtual based cputime accounting
Allow dynamic switching between tick-based and virtual-based cputime accounting. This way we can provide a kind of on-demand virtual-based cputime accounting. In this mode, the kernel will rely on the user hooks subsystem to dynamically hook on kernel boundaries. This is in preparation for being able to stop the timer tick beyond idle. Doing so will depend on CONFIG_VIRT_CPU_ACCOUNTING, which makes it possible to account the cputime without the tick by hooking on kernel/user boundaries. Depending on whether the tick is stopped or not, we can switch between tick-based and vtime-based accounting at any time in order to minimize the overhead associated with user hooks. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Cc: Alessio Igor Bogani abog...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Chris Metcalf cmetc...@tilera.com Cc: Christoph Lameter c...@linux.com Cc: Geoff Levand ge...@infradead.org Cc: Gilad Ben Yossef gi...@benyossef.com Cc: Hakan Akkan hakanak...@gmail.com Cc: Ingo Molnar mi...@kernel.org Cc: Paul E.
McKenney paul...@linux.vnet.ibm.com Cc: Paul Gortmaker paul.gortma...@windriver.com Cc: Peter Zijlstra pet...@infradead.org Cc: Steven Rostedt rost...@goodmis.org Cc: Thomas Gleixner t...@linutronix.de --- include/linux/kernel_stat.h |2 +- include/linux/sched.h |4 +- include/linux/vtime.h |8 ++ init/Kconfig|6 kernel/fork.c |2 +- kernel/sched/cputime.c | 58 +++--- kernel/time/tick-sched.c|5 +++- 7 files changed, 59 insertions(+), 26 deletions(-) diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index 66b7078..ed5f6ed 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -127,7 +127,7 @@ extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t) extern void account_steal_time(cputime_t); extern void account_idle_time(cputime_t); -#ifdef CONFIG_VIRT_CPU_ACCOUNTING +#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE static inline void account_process_tick(struct task_struct *tsk, int user) { vtime_account_user(tsk); diff --git a/include/linux/sched.h b/include/linux/sched.h index 206bb08..66b2344 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -605,7 +605,7 @@ struct signal_struct { cputime_t utime, stime, cutime, cstime; cputime_t gtime; cputime_t cgtime; -#ifndef CONFIG_VIRT_CPU_ACCOUNTING +#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE struct cputime prev_cputime; #endif unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw; @@ -1365,7 +1365,7 @@ struct task_struct { cputime_t utime, stime, utimescaled, stimescaled; cputime_t gtime; -#ifndef CONFIG_VIRT_CPU_ACCOUNTING +#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE struct cputime prev_cputime; #endif unsigned long nvcsw, nivcsw; /* context switch counts */ diff --git a/include/linux/vtime.h b/include/linux/vtime.h index 1151960..e57020d 100644 --- a/include/linux/vtime.h +++ b/include/linux/vtime.h @@ -10,12 +10,20 @@ extern void vtime_account_system_irqsafe(struct task_struct *tsk); extern void vtime_account_idle(struct task_struct *tsk); extern void 
vtime_account_user(struct task_struct *tsk); extern void vtime_account(struct task_struct *tsk); + +#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN +extern bool vtime_accounting(void); #else +static inline bool vtime_accounting(void) { return true; } +#endif + +#else /* !CONFIG_VIRT_CPU_ACCOUNTING */ static inline void vtime_task_switch(struct task_struct *prev) { } static inline void vtime_account_system(struct task_struct *tsk) { } static inline void vtime_account_system_irqsafe(struct task_struct *tsk) { } static inline void vtime_account_user(struct task_struct *tsk) { } static inline void vtime_account(struct task_struct *tsk) { } +static inline bool vtime_accounting(void) { return false; } #endif #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN diff --git a/init/Kconfig b/init/Kconfig index dad2b88..307bc35 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -342,6 +342,7 @@ config VIRT_CPU_ACCOUNTING bool Deterministic task and CPU time accounting depends on HAVE_VIRT_CPU_ACCOUNTING || HAVE_CONTEXT_TRACKING select VIRT_CPU_ACCOUNTING_GEN if !HAVE_VIRT_CPU_ACCOUNTING + select VIRT_CPU_ACCOUNTING_NATIVE if HAVE_VIRT_CPU_ACCOUNTING help Select this option to enable more accurate task and CPU time accounting. This is done by reading a CPU counter on each @@ -366,11 +367,16 @@ endchoice config VIRT_CPU_ACCOUNTING_GEN select CONTEXT_TRACKING + depends on VIRT_CPU_ACCOUNTING HAVE_CONTEXT_TRACKING bool help Implement a generic virtual based cputime accounting by using the context tracking subsystem. +config VIRT_CPU_ACCOUNTING_NATIVE + depends on VIRT_CPU_ACCOUNTING HAVE_VIRT_CPU_ACCOUNTING + bool
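The vtime_accounting() predicate introduced here is what lets the two accounting schemes coexist: while vtime accounting is active, the tick path must not account the same time again. A toy model of that gate (hypothetical `toy_` names):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Stands in for "is this CPU currently doing vtime accounting?". */
static bool toy_vtime_active;

static bool toy_vtime_accounting(void)
{
    return toy_vtime_active;
}

static const char *toy_account_process_tick(void)
{
    if (toy_vtime_accounting())
        return "skip";          /* context-tracking hooks already did it */
    return "tick-account";      /* classic tick-based accounting */
}
```

Switching toy_vtime_active mirrors the tick being stopped or restarted: the same tick handler transparently becomes a no-op for accounting while the hooks take over.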
[PATCH 01/27] context_tracking: Add comments on interface and internals
This subsystem lacks many explanations on its purpose and design. Add
these missing comments.

Reported-by: Andrew Morton a...@linux-foundation.org
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Alessio Igor Bogani abog...@kernel.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Chris Metcalf cmetc...@tilera.com
Cc: Christoph Lameter c...@linux.com
Cc: Geoff Levand ge...@infradead.org
Cc: Gilad Ben Yossef gi...@benyossef.com
Cc: Hakan Akkan hakanak...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Li Zhong zh...@linux.vnet.ibm.com
---
 kernel/context_tracking.c | 73 ++--
 1 files changed, 63 insertions(+), 10 deletions(-)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index e0e07fd..9f6c38f 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -1,3 +1,19 @@
+/*
+ * Context tracking: Probe on high level context boundaries such as kernel
+ * and userspace. This includes syscalls and exceptions entry/exit.
+ *
+ * This is used by RCU to remove its dependency on the timer tick while a CPU
+ * runs in userspace.
+ *
+ * Started by Frederic Weisbecker:
+ *
+ * Copyright (C) 2012 Red Hat, Inc., Frederic Weisbecker fweis...@redhat.com
+ *
+ * Many thanks to Gilad Ben-Yossef, Paul McKenney, Ingo Molnar, Andrew Morton,
+ * Steven Rostedt, Peter Zijlstra for suggestions and improvements.
+ *
+ */
+
 #include <linux/context_tracking.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
@@ -6,8 +22,8 @@
 struct context_tracking {
 	/*
-	 * When active is false, hooks are not set to
-	 * minimize overhead: TIF flags are cleared
+	 * When active is false, hooks are unset in order
+	 * to minimize overhead: TIF flags are cleared
 	 * and calls to user_enter/exit are ignored. This
 	 * may be further optimized using static keys.
 	 */
@@ -24,6 +40,15 @@ static DEFINE_PER_CPU(struct context_tracking, context_tracking) = {
 #endif
 };

+/**
+ * user_enter - Inform the context tracking that the CPU is going to
+ *              enter userspace mode.
+ *
+ * This function must be called right before we switch from the kernel
+ * to userspace, when it's guaranteed the remaining kernel instructions
+ * to execute won't use any RCU read side critical section because this
+ * function sets RCU in extended quiescent state.
+ */
 void user_enter(void)
 {
 	unsigned long flags;
@@ -39,40 +64,68 @@ void user_enter(void)
 	if (in_interrupt())
 		return;

+	/* Kernel threads aren't supposed to go to userspace */
 	WARN_ON_ONCE(!current->mm);

 	local_irq_save(flags);
 	if (__this_cpu_read(context_tracking.active) &&
 	    __this_cpu_read(context_tracking.state) != IN_USER) {
 		__this_cpu_write(context_tracking.state, IN_USER);
+		/*
+		 * At this stage, only low level arch entry code remains and
+		 * then we'll run in userspace. We can assume there won't be
+		 * any RCU read-side critical section until the next call to
+		 * user_exit() or rcu_irq_enter(). Let's remove RCU's dependency
+		 * on the tick.
+		 */
 		rcu_user_enter();
 	}
 	local_irq_restore(flags);
 }

+/**
+ * user_exit - Inform the context tracking that the CPU is
+ *             exiting userspace mode and entering the kernel.
+ *
+ * This function must be called after we entered the kernel from userspace
+ * before any use of RCU read side critical section. This potentially include
+ * any high level kernel code like syscalls, exceptions, signal handling, etc...
+ *
+ * This call supports re-entrancy. This way it can be called from any exception
+ * handler without needing to know if we came from userspace or not.
+ */
 void user_exit(void)
 {
 	unsigned long flags;

-	/*
-	 * Some contexts may involve an exception occuring in an irq,
-	 * leading to that nesting:
-	 * rcu_irq_enter() rcu_user_exit() rcu_user_exit() rcu_irq_exit()
-	 * This would mess up the dyntick_nesting count though. And rcu_irq_*()
-	 * helpers are enough to protect RCU uses inside the exception. So
-	 * just return immediately if we detect we are in an IRQ.
-	 */
 	if (in_interrupt())
 		return;

 	local_irq_save(flags);
 	if (__this_cpu_read(context_tracking.state) == IN_USER) {
 		__this_cpu_write(context_tracking.state, IN_KERNEL);
+		/*
+		 * We are going to run code that may use RCU. Inform
+		 * RCU core about that (ie: we may need the tick again).
+		 */
 		rcu_user_exit
Re: [PATCH 07/27] nohz: Assign timekeeping duty to a non-full-nohz CPU
2013/1/2 Christoph Lameter c...@linux.com:
> On Sat, 29 Dec 2012, Frederic Weisbecker wrote:
>
>> @@ -163,6 +164,8 @@ static int __init tick_nohz_full_setup(char *str)
>>  	return 1;
>>  }
>>  __setup("full_nohz=", tick_nohz_full_setup);
>> +#else
>> +#define have_full_nohz_mask (0)
>>  #endif
>>
>>  /*
>> @@ -512,6 +515,10 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
>>  		return false;
>>  	}
>>
>> +	/* If there are full nohz CPUs around, we need to keep the timekeeping duty */
>> +	if (have_full_nohz_mask && tick_do_timer_cpu == cpu)
>> +		return false;
>> +
>>  	return true;
>>  }
>
> Ok so I guess this means that if I setup all cpus as nohz then a random
> one will continue to do timekeeping?

In fact, although the code doesn't check that yet, you're supposed to have
at least one online CPU outside the full_nohz mask to handle that.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [PATCH 21/27] nohz: Only stop the tick on RCU nocb CPUs
2013/1/2 Namhyung Kim namhy...@kernel.org:
> You may want to add the following also to shut up the gcc:
>
>   CC      kernel/rcutree.o
> In file included from /home/namhyung/project/linux/kernel/rcutree.c:58:0:
> /home/namhyung/project/linux/kernel/rcutree.h:539:13: warning: ‘is_nocb_cpu’ declared ‘static’ but never defined [-Wunused-function]
>
> Thanks,
> Namhyung
>
> diff --git a/kernel/rcutree.h b/kernel/rcutree.h
> index 4b69291b093d..fbbad931c36a 100644
> --- a/kernel/rcutree.h
> +++ b/kernel/rcutree.h
> @@ -536,7 +536,6 @@ static void print_cpu_stall_info(struct rcu_state *rsp, int cpu);
>  static void print_cpu_stall_info_end(void);
>  static void zero_cpu_stall_ticks(struct rcu_data *rdp);
>  static void increment_cpu_stall_ticks(void);
> -static bool is_nocb_cpu(int cpu);
>  static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
>  			    bool lazy);
>  static bool rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp,

I'm folding that fix in for the next version. Thanks Namhyung!
Re: [RFC PATCH v1 15/31] ARC: Process/scheduling/clock/Timers/Delay Management
2013/1/2 Vineet Gupta vineet.gup...@synopsys.com:
> On Tuesday 13 November 2012 01:59 AM, Thomas Gleixner wrote:
>> On Wed, 7 Nov 2012, Vineet Gupta wrote:
>>> +void cpu_idle(void)
>>> +{
>>> +	/* Since we SLEEP in idle loop, TIF_POLLING_NRFLAG can't be set */
>>> +
>>> +	/* endless idle loop with no priority at all */
>>> +	while (1) {
>>> +		tick_nohz_idle_enter();
>>> +
>>> +		while (!need_resched())
>>> +			arch_idle();
>>> +
>>> +		tick_nohz_idle_exit();
>>> +
>>> +		preempt_enable_no_resched();
>>> +		schedule();
>>> +		preempt_disable();
>>
>> schedule_preempt_disabled() please
>
> OK! And it seems I was also missing the calls to rcu_idle_enter()/exit()
> to track commit 1268fbc746ea "nohz: Remove tick_nohz_idle_enter_norcu() / ..."

Right! They must be placed around the code that sets the low power mode,
when you know there is no use of RCU between rcu_idle_enter() /
rcu_idle_exit(). Here this would be likely:

	while (1) {
		tick_nohz_idle_enter();
+		rcu_idle_enter();

		while (!need_resched())
			arch_idle();

+		rcu_idle_exit()
		tick_nohz_idle_exit();

		schedule_preempt_disabled();
	}
Re: [PATCH 06/27] nohz: Basic full dynticks interface
2012/12/31 Li Zhong zh...@linux.vnet.ibm.com:
> On Sat, 2012-12-29 at 17:42 +0100, Frederic Weisbecker wrote:
>> Start with a very simple interface to define full dynticks CPU: use a
>> boot time option defined cpumask through the full_nohz= kernel
>> parameter.
>>
>> Make sure you keep at least one CPU outside this range to handle the
>> timekeeping.
>>
>> Also full_nohz= must match rcu_nocb= value.
>>
>> Suggested-by: Paul E. McKenney paul...@linux.vnet.ibm.com
>> Signed-off-by: Frederic Weisbecker fweis...@gmail.com
>> Cc: Alessio Igor Bogani abog...@kernel.org
>> Cc: Andrew Morton a...@linux-foundation.org
>> Cc: Chris Metcalf cmetc...@tilera.com
>> Cc: Christoph Lameter c...@linux.com
>> Cc: Geoff Levand ge...@infradead.org
>> Cc: Gilad Ben Yossef gi...@benyossef.com
>> Cc: Hakan Akkan hakanak...@gmail.com
>> Cc: Ingo Molnar mi...@kernel.org
>> Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
>> Cc: Paul Gortmaker paul.gortma...@windriver.com
>> Cc: Peter Zijlstra pet...@infradead.org
>> Cc: Steven Rostedt rost...@goodmis.org
>> Cc: Thomas Gleixner t...@linutronix.de
>> ---
>>  include/linux/tick.h     |  7 +++
>>  kernel/time/Kconfig      |  9 +
>>  kernel/time/tick-sched.c | 23 +++
>>  3 files changed, 39 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/tick.h b/include/linux/tick.h
>> index 553272e..2d4f6f0 100644
>> --- a/include/linux/tick.h
>> +++ b/include/linux/tick.h
>> @@ -157,6 +157,13 @@ static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
>>  static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
>>  # endif /* !NO_HZ */
>>
>> +#ifdef CONFIG_NO_HZ_FULL
>> +int tick_nohz_full_cpu(int cpu);
>> +#else
>> +static inline int tick_nohz_full_cpu(int cpu) { return 0; }
>> +#endif
>> +
>> +
>>  # ifdef CONFIG_CPU_IDLE_GOV_MENU
>>  extern void menu_hrtimer_cancel(void);
>>  # else
>> diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
>> index 8601f0d..dc6381d 100644
>> --- a/kernel/time/Kconfig
>> +++ b/kernel/time/Kconfig
>> @@ -70,6 +70,15 @@ config NO_HZ
>>  	  only trigger on an as-needed basis both when the system is busy and
>>  	  when the system is idle.
>>
>> +config NO_HZ_FULL
>> +	bool "Full tickless system"
>> +	depends on NO_HZ && RCU_USER_QS && VIRT_CPU_ACCOUNTING_GEN && RCU_NOCB_CPU && SMP
>
> Does that mean for archs like PPC64, which HAVE_VIRT_CPU_ACCOUNTING, to
> get NO_HZ_FULL supported, we need to use VIRT_CPU_ACCOUNTING_GEN instead
> of VIRT_CPU_ACCOUNTING_NATIVE? (I think the two, *_NATIVE and *_GEN,
> shouldn't be both enabled at the same time?)

Indeed! This sounds silly in the first place but _GEN does a context
tracking that _NATIVE doesn't perform. And this context tracking must also
be well ordered and serialized against the cputime snapshots. This is
important when we remotely fix up the time from the read side. ie: if we
read the cputime of a task that runs tickless for some time, we need to
know where it runs (user or kernel), then pick either tsk->utime or
tsk->stime as a result and add to it the delta of time it has been running
tickless.

This fixup is performed in task_cputime() using seqlock() for
ordering/serializing. And the write side uses seqlocks too from the vtime
accounting APIs. But this is not handled by _NATIVE.

> When I tried it on a ppc64 machine, it seems that after I select
> VIRT_CPU_ACCOUNTING, VIRT_CPU_ACCOUNTING_NATIVE is automatically
> selected. And I have no way to enable VIRT_CPU_ACCOUNTING_GEN, or
> disable VIRT_CPU_ACCOUNTING_NATIVE. It seems that's because these two
> don't have a configuration name (input prompt).

Yeah I need to fix that. The user should be able to choose between
VIRT_CPU_ACCOUNTING_GEN and VIRT_CPU_ACCOUNTING_NATIVE. I'll fix that for
the next release.

>> +	select CONTEXT_TRACKING_FORCE
>> +	help
>> +	  Try to be tickless everywhere, not just in idle. (You need
>> +	  to fill up the full_nohz_mask boot parameter).
>
> Maybe it is better to use the name of the boot parameter "full_nohz" here
> than the name of the mask variable used in the code?

Right! Thanks for your reviews!
Re: [PATCH 05/27] cputime: Safely read cputime of full dynticks CPUs
2012/12/31 Li Zhong zh...@linux.vnet.ibm.com:
> On Sat, 2012-12-29 at 17:42 +0100, Frederic Weisbecker wrote:
>>  static inline void vtime_task_switch(struct task_struct *prev) { }
>>  static inline void vtime_account_system(struct task_struct *tsk) { }
>>  static inline void vtime_account_system_irqsafe(struct task_struct *tsk) { }
>>  static inline void vtime_account_user(struct task_struct *tsk) { }
>> -static inline void vtime_account(struct task_struct *tsk) { }
>> +static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
>>  static inline bool vtime_accounting(void) { return false; }
>>  #endif
>>
>>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
>> -static inline void arch_vtime_task_switch(struct task_struct *tsk) { }
>> +extern void arch_vtime_task_switch(struct task_struct *tsk);
>> +extern void vtime_account_irq_exit(struct task_struct *tsk);
>> +extern void vtime_user_enter(struct task_struct *tsk);
>> +extern bool vtime_accounting(void);
>> +#else
>> +static inline void vtime_account_irq_exit(struct task_struct *tsk)
>> +{
>> +	/* On hard|softirq exit we always account to hard|softirq cputime */
>> +	vtime_account_system(tsk);
>> +}
>> +static inline void vtime_enter_user(struct task_struct *tsk) { }
>
> I guess the function name above should be vtime_user_enter to match the
> above extern, and the usage in user_enter()?

Totally! Thanks for pointing this out.
Re: [ANNOUNCE] 3.7-nohz1
2012/12/30 Paul E. McKenney paul...@linux.vnet.ibm.com:
> On Mon, Dec 24, 2012 at 12:43:25AM +0100, Frederic Weisbecker wrote:
>> 2012/12/21 Steven Rostedt rost...@goodmis.org:
>>> On Thu, 2012-12-20 at 19:32 +0100, Frederic Weisbecker wrote:
>>>> Let's imagine you have 4 CPUs. We keep the CPU 0 to offline RCU
>>>> callbacks there and to handle the timekeeping. We set the rest as
>>>> full dynticks. So you need the following kernel parameters:
>>>>
>>>> 	rcu_nocbs=1-3 full_nohz=1-3
>>>>
>>>> (Note rcu_nocbs value must always be the same as full_nohz).
>>>
>>> Why? You can't have: rcu_nocbs=1-4 full_nohz=1-3? That should be
>>> allowed. Or: rcu_nocbs=1-3 full_nohz=1-4? But that not.
>>
>> You need to have: rcu_nocbs & full_nohz == full_nohz. This is because
>> the tick is not there to maintain the local RCU callbacks anymore. So
>> this must be offloaded to the rcu_nocb threads.
>>
>> I just have a doubt with rcu_nocb. Do we still need the tick to
>> complete the grace period for local rcu callbacks? I need to discuss
>> that with Paul.
>
> The tick is only needed if rcu_needs_cpu() returns false. Of course,
> this means that if you don't invoke rcu_needs_cpu() before returning to
> adaptive-idle usermode execution, you are correct that a full_nohz CPU
> would also have to be a rcu_nocbs CPU.
>
> That said, I am getting close to having an rcu_needs_cpu() that only
> returns false if there are callbacks immediately ready to invoke, at
> least if RCU_FAST_NO_HZ=y.

Ok. Also when a CPU enqueues a callback and starts a grace period, the
tick polls on the grace period completion. How is it handled with
rcu_nocbs CPUs? Does rcu_needs_cpu() return false until the grace period
is completed? If so I still need to restart the local tick whenever a new
callback is enqueued.

Thanks.
Re: [RFC][PATCH 1/2] sched: Move idle_balance() to post_schedule
2012/12/22 Steven Rostedt rost...@goodmis.org:
> The idle_balance() code is called to do task load balancing just before
> going to idle. This makes sense as the CPU is about to sleep anyway. But
> currently it's called in the middle of the scheduler and in a place that
> must have interrupts disabled. That means, while the load balancing is
> going on, if a task wakes up on this CPU, it won't get to run while the
> interrupts are disabled. The idle task doing the balancing will be
> clueless about it.
>
> There's no real reason that the idle_balance() needs to be called in the
> middle of schedule anyway. The only benefit is that if a task is pulled
> to this CPU, it can be scheduled without the need to schedule the idle
> task. But load balancing and migrating the task makes a switch to idle
> and back negligible.

This cleanup looks nice as it does not only let us enable interrupts there
but also debloats a bit the schedule() code from idle specific code. So it
would be a pity if the optimization that goes away with your cleanup has
any measurable impact. Is there any sensible benchmark that can be run
against this patch? Something that may involve a lot of back and forth to
idle with some bunch of tasks running around on other CPUs?
Re: [PATCH tip/core/urgent 1/2] rcu: Prevent soft-lockup complaints about no-CBs CPUs
Hi Paul,

2013/1/5 Paul E. McKenney paul...@linux.vnet.ibm.com:
> From: Paul Gortmaker paul.gortma...@windriver.com
>
> The wait_event() at the head of the rcu_nocb_kthread() can result in
> soft-lockup complaints if the CPU in question does not register RCU
> callbacks for an extended period. This commit therefore changes the
> wait_event() to a wait_event_interruptible().
>
> Reported-by: Frederic Weisbecker fweis...@gmail.com
> Signed-off-by: Paul Gortmaker paul.gortma...@windriver.com
> Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com
> ---
>  kernel/rcutree_plugin.h | 3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> index f6e5ec2..43dba2d 100644
> --- a/kernel/rcutree_plugin.h
> +++ b/kernel/rcutree_plugin.h
> @@ -2366,10 +2366,11 @@ static int rcu_nocb_kthread(void *arg)
>  	for (;;) {
>  		/* If not polling, wait for next batch of callbacks. */
>  		if (!rcu_nocb_poll)
> -			wait_event(rdp->nocb_wq, rdp->nocb_head);
> +			wait_event_interruptible(rdp->nocb_wq, rdp->nocb_head);
>  		list = ACCESS_ONCE(rdp->nocb_head);
>  		if (!list) {
>  			schedule_timeout_interruptible(1);
> +			flush_signals(current);

Why is that needed?
[PATCH] context_tracking: Add comments on interface and internals
This subsystem lacks many explanations on its purpose and design. Add
these missing comments.

Reported-by: Andrew Morton a...@linux-foundation.org
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Gilad Ben-Yossef gi...@benyossef.com
Cc: Thomas Gleixner t...@linutronix.de
Cc: Andrew Morton a...@linux-foundation.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Li Zhong zh...@linux.vnet.ibm.com
---
 kernel/context_tracking.c | 74 ++--
 1 files changed, 64 insertions(+), 10 deletions(-)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index e0e07fd..f146b27 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -1,3 +1,19 @@
+/*
+ * Context tracking: Probe on high level context boundaries such as kernel
+ * and userspace. This includes syscalls and exceptions entry/exit.
+ *
+ * This is used by RCU to remove its dependency to the timer tick while a CPU
+ * runs in userspace.
+ *
+ * Started by Frederic Weisbecker:
+ *
+ * Copyright (C) 2012 Red Hat, Inc., Frederic Weisbecker fweis...@redhat.com
+ *
+ * Many thanks to Gilad Ben-Yossef, Paul McKenney, Ingo Molnar, Andrew Morton,
+ * Steven Rostedt, Peter Zijlstra for suggestions and improvements.
+ *
+ */
+
 #include <linux/context_tracking.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
@@ -6,8 +22,8 @@
 struct context_tracking {
 	/*
-	 * When active is false, hooks are not set to
-	 * minimize overhead: TIF flags are cleared
+	 * When active is false, hooks are unset in order
+	 * to minimize overhead: TIF flags are cleared
 	 * and calls to user_enter/exit are ignored. This
 	 * may be further optimized using static keys.
 	 */
@@ -24,6 +40,16 @@ static DEFINE_PER_CPU(struct context_tracking, context_tracking) = {
 #endif
 };

+/**
+ * user_enter - Inform the context tracking that the CPU is going to
+ *              enter in userspace mode.
+ *
+ * This function must be called right before we switch from the kernel
+ * to the user space, when the last remaining kernel instructions to execute
+ * are low level arch code that perform the resuming to userspace.
+ *
+ * This call supports re-entrancy.
+ */
 void user_enter(void)
 {
 	unsigned long flags;
@@ -39,40 +65,68 @@ void user_enter(void)
 	if (in_interrupt())
 		return;

+	/* Kernel thread aren't supposed to go to userspace */
 	WARN_ON_ONCE(!current->mm);

 	local_irq_save(flags);
 	if (__this_cpu_read(context_tracking.active) &&
 	    __this_cpu_read(context_tracking.state) != IN_USER) {
 		__this_cpu_write(context_tracking.state, IN_USER);
+		/*
+		 * At this stage, only low level arch entry code remains and
+		 * then we'll run in userspace. We can assume there won't we
+		 * any RCU read-side critical section until the next call to
+		 * user_exit() or rcu_irq_enter(). Let's remove RCU's dependency
+		 * on the tick.
+		 */
 		rcu_user_enter();
 	}
 	local_irq_restore(flags);
 }

+/**
+ * user_exit - Inform the context tracking that the CPU is
+ *             exiting userspace mode and entering the kernel.
+ *
+ * This function must be called right before we run any high level kernel
+ * code (ie: anything that is not low level arch entry code) after we entered
+ * the kernel from userspace.
+ *
+ * This call supports re-entrancy. This way it can be called from any exception
+ * handler without bothering to know if we come from userspace or not.
+ */
 void user_exit(void)
 {
 	unsigned long flags;

-	/*
-	 * Some contexts may involve an exception occuring in an irq,
-	 * leading to that nesting:
-	 * rcu_irq_enter() rcu_user_exit() rcu_user_exit() rcu_irq_exit()
-	 * This would mess up the dyntick_nesting count though. And rcu_irq_*()
-	 * helpers are enough to protect RCU uses inside the exception. So
-	 * just return immediately if we detect we are in an IRQ.
-	 */
 	if (in_interrupt())
 		return;

 	local_irq_save(flags);
 	if (__this_cpu_read(context_tracking.state) == IN_USER) {
 		__this_cpu_write(context_tracking.state, IN_KERNEL);
+		/*
+		 * We are going to run code that may use RCU. Inform
+		 * RCU core about that (ie: we may need the tick again).
+		 */
 		rcu_user_exit();
 	}
 	local_irq_restore(flags);
 }

+/**
+ * context_tracking_task_switch - context switch the syscall hooks
+ *
+ * The context tracking uses the syscall slow path to implement its user-kernel
+ * boundaries hooks on syscalls. This way it doesn't impact the syscall fast
+ * path on CPUs
Re: [PATCH] context_tracking: Add comments on interface and internals
2012/12/13 Andrew Morton a...@linux-foundation.org:
> On Thu, 13 Dec 2012 21:57:05 +0100 Frederic Weisbecker fweis...@gmail.com wrote:
>
>> This subsystem lacks many explanations on its purpose and design. Add
>> these missing comments.
>
> Thanks, it helps.
>
>> --- a/kernel/context_tracking.c
>> +++ b/kernel/context_tracking.c
>> @@ -1,3 +1,19 @@
>> +/*
>> + * Context tracking: Probe on high level context boundaries such as kernel
>> + * and userspace. This includes syscalls and exceptions entry/exit.
>> + *
>> + * This is used by RCU to remove its dependency to the timer tick while a CPU
>> + * runs in userspace.
>
> "on the timer tick"

Oops, will fix, along with the other spelling issues you reported.

>> + *
>> + * Started by Frederic Weisbecker:
>> + *
>> + * Copyright (C) 2012 Red Hat, Inc., Frederic Weisbecker fweis...@redhat.com
>> + *
>> + * Many thanks to Gilad Ben-Yossef, Paul McKenney, Ingo Molnar, Andrew Morton,
>> + * Steven Rostedt, Peter Zijlstra for suggestions and improvements.
>> + *
>> + */
>> +
>>  #include <linux/context_tracking.h>
>>  #include <linux/rcupdate.h>
>>  #include <linux/sched.h>
>> @@ -6,8 +22,8 @@
>>  struct context_tracking {
>>  	/*
>> -	 * When active is false, hooks are not set to
>> -	 * minimize overhead: TIF flags are cleared
>> +	 * When active is false, hooks are unset in order
>> +	 * to minimize overhead: TIF flags are cleared
>>  	 * and calls to user_enter/exit are ignored. This
>>  	 * may be further optimized using static keys.
>>  	 */
>> @@ -24,6 +40,16 @@ static DEFINE_PER_CPU(struct context_tracking, context_tracking) = {
>>  #endif
>>  };
>>
>> +/**
>> + * user_enter - Inform the context tracking that the CPU is going to
>> + *              enter in userspace mode.
>
> s/in //
>
>> + *
>> + * This function must be called right before we switch from the kernel
>> + * to the user space, when the last remaining kernel instructions to execute
>
> s/the user space/userspace/
>
>> + * are low level arch code that perform the resuming to userspace.
>
> This is a bit vague - what is "right before"? What happens if this is
> done a few instructions early? I mean, what exactly is the requirement
> here? Might it be something like "after the last rcu_foo operation"?
> IOW, if the call to user_enter() were moved earlier and earlier, at
> what point would the kernel gain a bug? What caused that bug?

That's indeed too vague. So as long as RCU is the only user of this, the
only rule is: call user_enter() when you're about to resume in userspace
and you're sure there will be no use of RCU until we return to the kernel.
Here the precision on when to call that wrt. kernel -> user transition
step doesn't matter much. This is only about RCU usage correctness.

Now this context tracking will soon be used by the cputime subsystem in
order to implement a generic tickless cputime accounting. The precision
induced by the probe location in kernel/user transition will have an
effect on cputime accounting precision. But even there this shouldn't
matter much because this will have a per-jiffies granularity. This may
evolve in the future with a nanosec granularity but then it will be up to
archs to place the probes closer to the real kernel/user boundaries.

Anyway, I'll comment on the RCU requirement for now and extend the
comments to explain the cputime precision issue when I'll add the cputime
bits.

>> + * This call supports re-entrancy.
>
> Presumably the explanation for user_exit() applies here.

Not sure what you mean here.

>> + */
>>  void user_enter(void)
>>  {
>>  	unsigned long flags;
>> @@ -39,40 +65,68 @@ void user_enter(void)
>>  	if (in_interrupt())
>>  		return;
>>
>> +	/* Kernel thread aren't supposed to go to userspace */
>
> s/thread/threads/
>
>>  	WARN_ON_ONCE(!current->mm);
>>
>>  	local_irq_save(flags);
>>  	if (__this_cpu_read(context_tracking.active) &&
>>  	    __this_cpu_read(context_tracking.state) != IN_USER) {
>>  		__this_cpu_write(context_tracking.state, IN_USER);
>> +		/*
>> +		 * At this stage, only low level arch entry code remains and
>> +		 * then we'll run in userspace. We can assume there won't we
>
> s/we/be/
>
>> +		 * any RCU read-side critical section until the next call to
>> +		 * user_exit() or rcu_irq_enter(). Let's remove RCU's dependency
>> +		 * on the tick.
>> +		 */
>>  		rcu_user_enter();
>>  	}
>>  	local_irq_restore(flags);
>>  }
>>
>> +/**
>> + * user_exit - Inform the context tracking that the CPU is
>> + *             exiting userspace mode and entering the kernel.
>> + *
>> + * This function must be called right before we run any high level kernel
>> + * code (ie: anything that is not low level arch entry code) after we entered
>> + * the kernel from userspace.
>
> Also a very vague spec.

You're right, as for user_enter(), I'll insist on the RCU and cputime
requirements.

[...]

>> +/**
>> + * context_tracking_task_switch - context switch the syscall hooks
>> + *
>> + * The context tracking uses the syscall slow path
Re: [PATCH] context_tracking: Add comments on interface and internals
2012/12/14 Andrew Morton a...@linux-foundation.org:
> On Thu, 13 Dec 2012 23:50:23 +0100 Frederic Weisbecker fweis...@gmail.com wrote:
>
>>>> + * This call supports re-entrancy.
>>>
>>> Presumably the explanation for user_exit() applies here.
>>
>> Not sure what you mean here.
>
> It's unclear what it means to say "user_enter() supports reentrancy". I
> mean, zillions of kernel functions are surely reentrant - so what? It
> appears that you had something in mind when pointing this out, but what
> was it? The comment over user_exit() appears to tell us.

Ah ok. Yeah indeed, the fact user_exit() is reentrant is very important
because I have precise usecases in mind. For user_enter() I don't, so
probably I don't need to inform about it.

>>> It's mainly this bit which makes me wonder why the code is in lib/.
>>> Is there any conceivable prospect that any other subsystem will use
>>> this code for anything?
>>
>> So that's because of that cputime accounting on dynticks CPUs which
>> will need to know about user/kernel transitions. I'm preparing that
>> for the 3.9 merge window.
>
> Oh. That's really the entire reason for the patch and should have been
> in the changelog!

I mentioned it in the changelog:

    commit 91d1aa43d30505b0b825db8898ffc80a8eca96c7
    context_tracking: New context tracking susbsystem

    We need to pull this up from RCU into this new level of indirection
    because this tracking is also going to be used to implement an on
    demand generic virtual cputime accounting. A necessary step to
    shutdown the tick while still accounting the cputime.

Another reason, more implicit this time, was to avoid that RCU handles
those reentrancy things and context tracking all around by itself.
[PATCH v2] context_tracking: Add comments on interface and internals
This subsystem lacks many explanations on its purpose and design. Add
these missing comments.

v2: Address comments from Andrew

Reported-by: Andrew Morton a...@linux-foundation.org
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Gilad Ben-Yossef gi...@benyossef.com
Cc: Thomas Gleixner t...@linutronix.de
Cc: Andrew Morton a...@linux-foundation.org
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Steven Rostedt rost...@goodmis.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Li Zhong zh...@linux.vnet.ibm.com
---
 kernel/context_tracking.c | 73 ++--
 1 files changed, 63 insertions(+), 10 deletions(-)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index e0e07fd..9f6c38f 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -1,3 +1,19 @@
+/*
+ * Context tracking: Probe on high level context boundaries such as kernel
+ * and userspace. This includes syscalls and exceptions entry/exit.
+ *
+ * This is used by RCU to remove its dependency on the timer tick while a CPU
+ * runs in userspace.
+ *
+ * Started by Frederic Weisbecker:
+ *
+ * Copyright (C) 2012 Red Hat, Inc., Frederic Weisbecker fweis...@redhat.com
+ *
+ * Many thanks to Gilad Ben-Yossef, Paul McKenney, Ingo Molnar, Andrew Morton,
+ * Steven Rostedt, Peter Zijlstra for suggestions and improvements.
+ *
+ */
+
 #include <linux/context_tracking.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
@@ -6,8 +22,8 @@
 struct context_tracking {
 	/*
-	 * When active is false, hooks are not set to
-	 * minimize overhead: TIF flags are cleared
+	 * When active is false, hooks are unset in order
+	 * to minimize overhead: TIF flags are cleared
 	 * and calls to user_enter/exit are ignored. This
 	 * may be further optimized using static keys.
 	 */
@@ -24,6 +40,15 @@ static DEFINE_PER_CPU(struct context_tracking, context_tracking) = {
 #endif
 };

+/**
+ * user_enter - Inform the context tracking that the CPU is going to
+ *              enter userspace mode.
+ *
+ * This function must be called right before we switch from the kernel
+ * to userspace, when it's guaranteed the remaining kernel instructions
+ * to execute won't use any RCU read side critical section because this
+ * function sets RCU in extended quiescent state.
+ */
 void user_enter(void)
 {
 	unsigned long flags;
@@ -39,40 +64,68 @@ void user_enter(void)
 	if (in_interrupt())
 		return;

+	/* Kernel threads aren't supposed to go to userspace */
 	WARN_ON_ONCE(!current->mm);

 	local_irq_save(flags);
 	if (__this_cpu_read(context_tracking.active) &&
 	    __this_cpu_read(context_tracking.state) != IN_USER) {
 		__this_cpu_write(context_tracking.state, IN_USER);
+		/*
+		 * At this stage, only low level arch entry code remains and
+		 * then we'll run in userspace. We can assume there won't be
+		 * any RCU read-side critical section until the next call to
+		 * user_exit() or rcu_irq_enter(). Let's remove RCU's dependency
+		 * on the tick.
+		 */
 		rcu_user_enter();
 	}
 	local_irq_restore(flags);
 }

+/**
+ * user_exit - Inform the context tracking that the CPU is
+ *             exiting userspace mode and entering the kernel.
+ *
+ * This function must be called after we entered the kernel from userspace
+ * before any use of RCU read side critical section. This potentially include
+ * any high level kernel code like syscalls, exceptions, signal handling, etc...
+ *
+ * This call supports re-entrancy. This way it can be called from any exception
+ * handler without needing to know if we came from userspace or not.
+ */
 void user_exit(void)
 {
 	unsigned long flags;

-	/*
-	 * Some contexts may involve an exception occuring in an irq,
-	 * leading to that nesting:
-	 * rcu_irq_enter() rcu_user_exit() rcu_user_exit() rcu_irq_exit()
-	 * This would mess up the dyntick_nesting count though. And rcu_irq_*()
-	 * helpers are enough to protect RCU uses inside the exception. So
-	 * just return immediately if we detect we are in an IRQ.
-	 */
 	if (in_interrupt())
 		return;

 	local_irq_save(flags);
 	if (__this_cpu_read(context_tracking.state) == IN_USER) {
 		__this_cpu_write(context_tracking.state, IN_KERNEL);
+		/*
+		 * We are going to run code that may use RCU. Inform
+		 * RCU core about that (ie: we may need the tick again).
+		 */
 		rcu_user_exit();
 	}
 	local_irq_restore(flags);
 }

+/**
+ * context_tracking_task_switch - context switch the syscall hooks
+ *
+ * The context tracking uses the syscall slow path to implement its user-kernel
[RFC GIT PULL] printk: Full dynticks support for 3.8
Linus,

We are currently working on extending the dynticks mode to broader contexts than just idle. Under some conditions on a busy CPU, the tick can be avoided (no need for preemption when a single task is running, no need for RCU state machine maintenance in userspace, etc.).

The most popular application of this is the implementation of CPU isolation. On HPC workloads, where people run one task per CPU in order to maximize CPU performance, the kernel gets in the way too much with these often unnecessary interrupts. The result is a performance loss due to stolen CPU time and cache thrashing of the userspace workset.

CPU isolation is the most famous user today, but I expect more. For example we should be able to avoid the tick when we run in guest mode. And more generally this may be a win for most CPU-bound workloads.

So in order to implement this full dynticks mode, we need to find alternatives to the many maintenance operations performed periodically and turn them into one-shot, event-driven solutions.

printk() is part of the problem. It must be safely callable from most places, and for that purpose it performs an asynchronous wake up of the readers by probing on the tick for pending messages and readers, through printk_tick(). Of course, if we use printk() while the tick is stopped, the pending readers may not be woken up for a while.

So a solution to make printk() work even if the CPU is in dynticks mode is to use the irq_work subsystem. This subsystem is typically able to fire self-IPIs. So when printk() is called, it now enqueues an irq_work that does the asynchronous wakeup:

* If the tick is stopped, it raises a self-IPI.
* If the tick is running periodically, then don't fire a self-IPI but wait for the next tick to handle that instead (irq_work probes on the timer tick). This avoids self-IPI storms in case of frequent printk() calls in short periods of time.

I know this is a sensitive area.
We want printk() to stay minimal and not rely too much on other subsystems that add complications and that may use printk() themselves. That's why we chose irq_work:

- It's pretty small and self-contained.
- It's lockless.
- It handles most recursion cases (if it uses printk() itself from the IPI path, this won't fire another IPI).

But because it's sensitive, I'm proposing it as an RFC pull request. So if you're ok with that, please pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git tags/printk-dynticks-for-linus

HEAD: 74876a98a87a115254b3a66a14b27320b7f0acaa printk: Wake up klogd using irq_work

It has been in linux-next.

Thanks.

Support for printk in dynticks mode:

* Fix two races in irq_work claiming.
* Generalize irq_work support to all archs.
* Don't stop the tick with irq works pending. This fix is generally useful and concerns archs that can't raise self-IPIs.
* Flush irq works before CPU offlining.
* Introduce lazy irq works that can wait for the next tick to be executed, unless it's stopped.
* Implement the klogd wake up using irq_work. This removes the ad-hoc printk_tick()/printk_needs_cpu() hooks and makes it work even in dynticks mode.
Signed-off-by: Frederic Weisbecker fweis...@gmail.com Frederic Weisbecker (7): irq_work: Fix racy IRQ_WORK_BUSY flag setting irq_work: Fix racy check on work pending flag irq_work: Remove CONFIG_HAVE_IRQ_WORK nohz: Add API to check tick state irq_work: Don't stop the tick with pending works irq_work: Make self-IPIs optable printk: Wake up klogd using irq_work Steven Rostedt (2): irq_work: Flush work on CPU_DYING irq_work: Warn if there's still work on cpu_down arch/alpha/Kconfig |1 - arch/arm/Kconfig|1 - arch/arm64/Kconfig |1 - arch/blackfin/Kconfig |1 - arch/frv/Kconfig|1 - arch/hexagon/Kconfig|1 - arch/mips/Kconfig |1 - arch/parisc/Kconfig |1 - arch/powerpc/Kconfig|1 - arch/s390/Kconfig |1 - arch/sh/Kconfig |1 - arch/sparc/Kconfig |1 - arch/x86/Kconfig|1 - drivers/staging/iio/trigger/Kconfig |1 - include/linux/irq_work.h| 20 + include/linux/printk.h |3 - include/linux/tick.h| 17 - init/Kconfig|5 +- kernel/irq_work.c | 131 ++ kernel/printk.c | 36 + kernel/time/tick-sched.c|7 +- kernel/timer.c |1 - 22 files changed, 161 insertions(+), 73 deletions(-) -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel
[PATCH 1/4] vtime: Remove the underscore prefix invasion
Prepending irq-unsafe vtime APIs with underscores was actually a bad idea as the result is a big mess in the API namespace that is even waiting to be further extended. Also these helpers are always called from irq safe callers except kvm. Just provide a vtime_account_system_irqsafe() for this specific case so that we can remove the underscore prefix on other vtime functions. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Cc: Peter Zijlstra pet...@infradead.org Cc: Ingo Molnar mi...@kernel.org Cc: Thomas Gleixner t...@linutronix.de Cc: Steven Rostedt rost...@goodmis.org Cc: Paul Gortmaker paul.gortma...@windriver.com Cc: Tony Luck tony.l...@intel.com Cc: Fenghua Yu fenghua...@intel.com Cc: Benjamin Herrenschmidt b...@kernel.crashing.org Cc: Paul Mackerras pau...@samba.org Cc: Martin Schwidefsky schwidef...@de.ibm.com Cc: Heiko Carstens heiko.carst...@de.ibm.com --- arch/ia64/kernel/time.c|8 arch/powerpc/kernel/time.c |4 ++-- arch/s390/kernel/vtime.c |4 ++-- include/linux/kvm_host.h |4 ++-- include/linux/vtime.h |8 kernel/sched/cputime.c | 12 ++-- 6 files changed, 20 insertions(+), 20 deletions(-) diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c index 5e48503..f638821 100644 --- a/arch/ia64/kernel/time.c +++ b/arch/ia64/kernel/time.c @@ -106,9 +106,9 @@ void vtime_task_switch(struct task_struct *prev) struct thread_info *ni = task_thread_info(current); if (idle_task(smp_processor_id()) != prev) - __vtime_account_system(prev); + vtime_account_system(prev); else - __vtime_account_idle(prev); + vtime_account_idle(prev); vtime_account_user(prev); @@ -135,14 +135,14 @@ static cputime_t vtime_delta(struct task_struct *tsk) return delta_stime; } -void __vtime_account_system(struct task_struct *tsk) +void vtime_account_system(struct task_struct *tsk) { cputime_t delta = vtime_delta(tsk); account_system_time(tsk, 0, delta, delta); } -void __vtime_account_idle(struct task_struct *tsk) +void vtime_account_idle(struct task_struct *tsk) { 
account_idle_time(vtime_delta(tsk)); } diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 0db456f..ce4cb77 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -336,7 +336,7 @@ static u64 vtime_delta(struct task_struct *tsk, return delta; } -void __vtime_account_system(struct task_struct *tsk) +void vtime_account_system(struct task_struct *tsk) { u64 delta, sys_scaled, stolen; @@ -346,7 +346,7 @@ void __vtime_account_system(struct task_struct *tsk) account_steal_time(stolen); } -void __vtime_account_idle(struct task_struct *tsk) +void vtime_account_idle(struct task_struct *tsk) { u64 delta, sys_scaled, stolen; diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c index 783e988..80d1dbc 100644 --- a/arch/s390/kernel/vtime.c +++ b/arch/s390/kernel/vtime.c @@ -140,9 +140,9 @@ void vtime_account(struct task_struct *tsk) } EXPORT_SYMBOL_GPL(vtime_account); -void __vtime_account_system(struct task_struct *tsk) +void vtime_account_system(struct task_struct *tsk) __attribute__((alias(vtime_account))); -EXPORT_SYMBOL_GPL(__vtime_account_system); +EXPORT_SYMBOL_GPL(vtime_account_system); void __kprobes vtime_stop_cpu(void) { diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 0e2212f..f17158b 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -741,7 +741,7 @@ static inline void kvm_guest_enter(void) * This is running in ioctl context so we can avoid * the call to vtime_account() with its unnecessary idle check. */ - vtime_account_system(current); + vtime_account_system_irqsafe(current); current-flags |= PF_VCPU; /* KVM does not hold any references to rcu protected data when it * switches CPU into a guest mode. In fact switching to a guest mode @@ -759,7 +759,7 @@ static inline void kvm_guest_exit(void) * This is running in ioctl context so we can avoid * the call to vtime_account() with its unnecessary idle check. 
*/ - vtime_account_system(current); + vtime_account_system_irqsafe(current); current-flags = ~PF_VCPU; } diff --git a/include/linux/vtime.h b/include/linux/vtime.h index 0c2a2d3..5ad13c3 100644 --- a/include/linux/vtime.h +++ b/include/linux/vtime.h @@ -5,14 +5,14 @@ struct task_struct; #ifdef CONFIG_VIRT_CPU_ACCOUNTING extern void vtime_task_switch(struct task_struct *prev); -extern void __vtime_account_system(struct task_struct *tsk); extern void vtime_account_system(struct task_struct *tsk); -extern void __vtime_account_idle(struct task_struct *tsk); +extern void vtime_account_system_irqsafe(struct task_struct *tsk); +extern void vtime_account_idle(struct task_struct *tsk); extern
[PATCH 2/4] vtime: Explicitly account pending user time on process tick
All vtime implementations just flush the user time on process tick. Consolidate that in generic code by calling a user time accounting helper. This avoids an indirect call in ia64 and prepare to also consolidate vtime context switch code. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Cc: Peter Zijlstra pet...@infradead.org Cc: Ingo Molnar mi...@kernel.org Cc: Thomas Gleixner t...@linutronix.de Cc: Steven Rostedt rost...@goodmis.org Cc: Paul Gortmaker paul.gortma...@windriver.com Cc: Tony Luck tony.l...@intel.com Cc: Fenghua Yu fenghua...@intel.com Cc: Benjamin Herrenschmidt b...@kernel.crashing.org Cc: Paul Mackerras pau...@samba.org Cc: Martin Schwidefsky schwidef...@de.ibm.com Cc: Heiko Carstens heiko.carst...@de.ibm.com --- arch/ia64/kernel/time.c | 11 +-- arch/powerpc/kernel/time.c | 14 +++--- arch/s390/kernel/vtime.c|7 ++- include/linux/kernel_stat.h |8 include/linux/vtime.h |1 + 5 files changed, 23 insertions(+), 18 deletions(-) diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c index f638821..834c78b 100644 --- a/arch/ia64/kernel/time.c +++ b/arch/ia64/kernel/time.c @@ -83,7 +83,7 @@ static struct clocksource *itc_clocksource; extern cputime_t cycle_to_cputime(u64 cyc); -static void vtime_account_user(struct task_struct *tsk) +void vtime_account_user(struct task_struct *tsk) { cputime_t delta_utime; struct thread_info *ti = task_thread_info(tsk); @@ -147,15 +147,6 @@ void vtime_account_idle(struct task_struct *tsk) account_idle_time(vtime_delta(tsk)); } -/* - * Called from the timer interrupt handler to charge accumulated user time - * to the current process. Must be called with interrupts disabled. 
- */ -void account_process_tick(struct task_struct *p, int user_tick) -{ - vtime_account_user(p); -} - #endif /* CONFIG_VIRT_CPU_ACCOUNTING */ static irqreturn_t diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index ce4cb77..a667aaf 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -355,15 +355,15 @@ void vtime_account_idle(struct task_struct *tsk) } /* - * Transfer the user and system times accumulated in the paca - * by the exception entry and exit code to the generic process - * user and system time records. + * Transfer the user time accumulated in the paca + * by the exception entry and exit code to the generic + * process user time records. * Must be called with interrupts disabled. - * Assumes that vtime_account() has been called recently - * (i.e. since the last entry from usermode) so that + * Assumes that vtime_account_system/idle() has been called + * recently (i.e. since the last entry from usermode) so that * get_paca()-user_time_scaled is up to date. */ -void account_process_tick(struct task_struct *tsk, int user_tick) +void vtime_account_user(struct task_struct *tsk) { cputime_t utime, utimescaled; @@ -378,7 +378,7 @@ void account_process_tick(struct task_struct *tsk, int user_tick) void vtime_task_switch(struct task_struct *prev) { vtime_account(prev); - account_process_tick(prev, 0); + vtime_account_user(prev); } #else /* ! CONFIG_VIRT_CPU_ACCOUNTING */ diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c index 80d1dbc..7c6d861 100644 --- a/arch/s390/kernel/vtime.c +++ b/arch/s390/kernel/vtime.c @@ -112,7 +112,12 @@ void vtime_task_switch(struct task_struct *prev) S390_lowcore.system_timer = ti-system_timer; } -void account_process_tick(struct task_struct *tsk, int user_tick) +/* + * In s390, accounting pending user time also implies + * accounting system time in order to correctly compute + * the stolen time accounting. 
+ */ +void vtime_account_user(struct task_struct *tsk) { if (do_account_vtime(tsk, HARDIRQ_OFFSET)) virt_timer_expire(); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index 1865b1f..66b7078 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -127,7 +127,15 @@ extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t) extern void account_steal_time(cputime_t); extern void account_idle_time(cputime_t); +#ifdef CONFIG_VIRT_CPU_ACCOUNTING +static inline void account_process_tick(struct task_struct *tsk, int user) +{ + vtime_account_user(tsk); +} +#else extern void account_process_tick(struct task_struct *, int user); +#endif + extern void account_steal_ticks(unsigned long ticks); extern void account_idle_ticks(unsigned long ticks); diff --git a/include/linux/vtime.h b/include/linux/vtime.h index 5ad13c3..ae30ab5 100644 --- a/include/linux/vtime.h +++ b/include/linux/vtime.h @@ -8,6 +8,7 @@ extern void vtime_task_switch(struct task_struct *prev); extern void vtime_account_system(struct task_struct *tsk); extern void vtime_account_system_irqsafe(struct task_struct *tsk); extern void vtime_account_idle(struct
[PATCH 4/4] vtime: No need to disable irqs on vtime_account()
vtime_account() is only called from irq entry. irqs are always disabled at this point, so we can safely remove the irq disabling guards on that function.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Steven Rostedt rost...@goodmis.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Tony Luck tony.l...@intel.com
Cc: Fenghua Yu fenghua...@intel.com
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Martin Schwidefsky schwidef...@de.ibm.com
Cc: Heiko Carstens heiko.carst...@de.ibm.com
---
 kernel/sched/cputime.c |6 --
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 2e8d34a..80b2fd5 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -467,16 +467,10 @@ void vtime_task_switch(struct task_struct *prev)
 #ifndef __ARCH_HAS_VTIME_ACCOUNT
 void vtime_account(struct task_struct *tsk)
 {
-	unsigned long flags;
-
-	local_irq_save(flags);
-
 	if (in_interrupt() || !is_idle_task(tsk))
 		vtime_account_system(tsk);
 	else
 		vtime_account_idle(tsk);
-
-	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(vtime_account);
 #endif /* __ARCH_HAS_VTIME_ACCOUNT */
-- 
1.7.5.4
[PATCH 3/4] vtime: Consolidate a bit the ctx switch code
On ia64 and powerpc, vtime context switch only consists in flushing system and user pending time, plus a few arch housekeeping. Consolidate that into a generic implementation. s390 is a special case because pending user and system time accounting there is hard to dissociate. So it's keeping its own implementation. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Cc: Peter Zijlstra pet...@infradead.org Cc: Ingo Molnar mi...@kernel.org Cc: Thomas Gleixner t...@linutronix.de Cc: Steven Rostedt rost...@goodmis.org Cc: Paul Gortmaker paul.gortma...@windriver.com Cc: Tony Luck tony.l...@intel.com Cc: Fenghua Yu fenghua...@intel.com Cc: Benjamin Herrenschmidt b...@kernel.crashing.org Cc: Paul Mackerras pau...@samba.org Cc: Martin Schwidefsky schwidef...@de.ibm.com Cc: Heiko Carstens heiko.carst...@de.ibm.com --- arch/ia64/include/asm/cputime.h|2 ++ arch/ia64/kernel/time.c|9 + arch/powerpc/include/asm/cputime.h |2 ++ arch/powerpc/kernel/time.c |6 -- arch/s390/include/asm/cputime.h|1 + kernel/sched/cputime.c | 13 + 6 files changed, 19 insertions(+), 14 deletions(-) diff --git a/arch/ia64/include/asm/cputime.h b/arch/ia64/include/asm/cputime.h index 3deac95..7fcf7f0 100644 --- a/arch/ia64/include/asm/cputime.h +++ b/arch/ia64/include/asm/cputime.h @@ -103,5 +103,7 @@ static inline void cputime_to_timeval(const cputime_t ct, struct timeval *val) #define cputime64_to_clock_t(__ct) \ cputime_to_clock_t((__force cputime_t)__ct) +extern void arch_vtime_task_switch(struct task_struct *tsk); + #endif /* CONFIG_VIRT_CPU_ACCOUNTING */ #endif /* __IA64_CPUTIME_H */ diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c index 834c78b..c9a7d2e 100644 --- a/arch/ia64/kernel/time.c +++ b/arch/ia64/kernel/time.c @@ -100,18 +100,11 @@ void vtime_account_user(struct task_struct *tsk) * accumulated times to the current process, and to prepare accounting on * the next process. 
*/ -void vtime_task_switch(struct task_struct *prev) +void arch_vtime_task_switch(struct task_struct *prev) { struct thread_info *pi = task_thread_info(prev); struct thread_info *ni = task_thread_info(current); - if (idle_task(smp_processor_id()) != prev) - vtime_account_system(prev); - else - vtime_account_idle(prev); - - vtime_account_user(prev); - pi-ac_stamp = ni-ac_stamp; ni-ac_stime = ni-ac_utime = 0; } diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h index 487d46f..483733b 100644 --- a/arch/powerpc/include/asm/cputime.h +++ b/arch/powerpc/include/asm/cputime.h @@ -228,6 +228,8 @@ static inline cputime_t clock_t_to_cputime(const unsigned long clk) #define cputime64_to_clock_t(ct) cputime_to_clock_t((cputime_t)(ct)) +static inline void arch_vtime_task_switch(struct task_struct *tsk) { } + #endif /* __KERNEL__ */ #endif /* CONFIG_VIRT_CPU_ACCOUNTING */ #endif /* __POWERPC_CPUTIME_H */ diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index a667aaf..3486cfa 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -375,12 +375,6 @@ void vtime_account_user(struct task_struct *tsk) account_user_time(tsk, utime, utimescaled); } -void vtime_task_switch(struct task_struct *prev) -{ - vtime_account(prev); - vtime_account_user(prev); -} - #else /* ! CONFIG_VIRT_CPU_ACCOUNTING */ #define calc_cputime_factors() #endif diff --git a/arch/s390/include/asm/cputime.h b/arch/s390/include/asm/cputime.h index 023d5ae..d2ff4137 100644 --- a/arch/s390/include/asm/cputime.h +++ b/arch/s390/include/asm/cputime.h @@ -14,6 +14,7 @@ #define __ARCH_HAS_VTIME_ACCOUNT +#define __ARCH_HAS_VTIME_TASK_SWITCH /* We want to use full resolution of the CPU timer: 2**-12 micro-seconds. 
*/ diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index c0aa1ba..2e8d34a 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -443,6 +443,19 @@ void vtime_account_system_irqsafe(struct task_struct *tsk) } EXPORT_SYMBOL_GPL(vtime_account_system_irqsafe); +#ifndef __ARCH_HAS_VTIME_TASK_SWITCH +void vtime_task_switch(struct task_struct *prev) +{ + if (is_idle_task(prev)) + vtime_account_idle(prev); + else + vtime_account_system(prev); + + vtime_account_user(prev); + arch_vtime_task_switch(prev); +} +#endif + /* * Archs that account the whole time spent in the idle task * (outside irq) as idle time can rely on this and just implement -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/4] cputime: Even more cleanups
Hi,

While working on full dynticks, I realized some more cleanups needed to be done. Here it is. If no comments arise, I'll send a pull request to Ingo in a week.

Thanks.

Frederic Weisbecker (4):
  vtime: Remove the underscore prefix invasion
  vtime: Explicitly account pending user time on process tick
  vtime: Consolidate a bit the ctx switch code
  vtime: No need to disable irqs on vtime_account()

 arch/ia64/include/asm/cputime.h    |  2 ++
 arch/ia64/kernel/time.c            | 24
 arch/powerpc/include/asm/cputime.h |  2 ++
 arch/powerpc/kernel/time.c         | 22 --
 arch/s390/include/asm/cputime.h    |  1 +
 arch/s390/kernel/vtime.c           | 11 ---
 include/linux/kernel_stat.h        |  8
 include/linux/kvm_host.h           |  4 ++--
 include/linux/vtime.h              |  9 +
 kernel/sched/cputime.c             | 31 +++
 10 files changed, 59 insertions(+), 55 deletions(-)

-- 
1.7.5.4
Re: [PATCH 1/4] vtime: Remove the underscore prefix invasion
2012/11/14 Steven Rostedt rost...@goodmis.org:
> On Wed, 2012-11-14 at 17:26 +0100, Frederic Weisbecker wrote:
>> Prepending irq-unsafe vtime APIs with underscores was actually a bad idea as the result is a big mess in the API namespace that is even waiting to be further extended. Also these helpers are always called from irq safe callers except kvm. Just provide a vtime_account_system_irqsafe() for this specific case so that we can remove the underscore prefix on other vtime functions.
>>
>> -void __vtime_account_system(struct task_struct *tsk)
>> +void vtime_account_system(struct task_struct *tsk)
>>  {
>>  	cputime_t delta = vtime_delta(tsk);
>
> Should we add a WARN_ON(!irqs_disabled()) check here?

Why not, I'll add one in vtime_delta().
Re: [PATCH 4/4] vtime: No need to disable irqs on vtime_account()
2012/11/14 Steven Rostedt rost...@goodmis.org:
> On Wed, 2012-11-14 at 17:26 +0100, Frederic Weisbecker wrote:
>> vtime_account() is only called from irq entry. irqs are always disabled at this point so we can safely remove the irq disabling guards on that function.
>>
>> Signed-off-by: Frederic Weisbecker fweis...@gmail.com
>> Cc: Peter Zijlstra pet...@infradead.org
>> Cc: Ingo Molnar mi...@kernel.org
>> Cc: Thomas Gleixner t...@linutronix.de
>> Cc: Steven Rostedt rost...@goodmis.org
>> Cc: Paul Gortmaker paul.gortma...@windriver.com
>> Cc: Tony Luck tony.l...@intel.com
>> Cc: Fenghua Yu fenghua...@intel.com
>> Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
>> Cc: Paul Mackerras pau...@samba.org
>> Cc: Martin Schwidefsky schwidef...@de.ibm.com
>> Cc: Heiko Carstens heiko.carst...@de.ibm.com
>> ---
>>  kernel/sched/cputime.c |6 --
>>  1 files changed, 0 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
>> index 2e8d34a..80b2fd5 100644
>> --- a/kernel/sched/cputime.c
>> +++ b/kernel/sched/cputime.c
>> @@ -467,16 +467,10 @@ void vtime_task_switch(struct task_struct *prev)
>>  #ifndef __ARCH_HAS_VTIME_ACCOUNT
>>  void vtime_account(struct task_struct *tsk)
>>  {
>> -	unsigned long flags;
>> -
>> -	local_irq_save(flags);
>> -
>
> I'd add a WARN_ON_ONCE(!irqs_disabled()) again here, or is this also covered by the vtime_delta()?

Yeah, it's the ending point for both vtime_account_system() and vtime_account_idle().
[GIT PULL] printk: Make it usable on nohz cpus
Ingo,

Please pull the printk-support-in-dynticks-mode patches, which can be found at:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git tags/printk-dynticks-for-mingo

This branch is based on top of v3.7-rc4. Head is 2fb933986dcef2db1344712162a1feb8d5736ff8: printk: Wake up klogd using irq_work (2012-11-14 17:45:37 +0100)

Since the last version, very few things have changed:

* Added acks from Steve on patches 1/7 and 2/7.
* Fixed an arch_needs_cpu() redefinition due to misordered headers (reported by Wu Fengguang).

If you get further acks from Peterz or anybody, feel free to cherry-pick the patches instead. Or I can rebase my patches to add them, either way.

Thanks.

Support for printk in dynticks mode:

* Fix two races in irq_work claiming.
* Generalize irq_work support to all archs.
* Don't stop the tick with irq works pending. This fix is generally useful and concerns archs that can't raise self-IPIs.
* Introduce lazy irq works that can wait for the next tick to be executed, unless it's stopped.
* Implement the klogd wake up using irq_work. This removes the ad-hoc printk_tick()/printk_needs_cpu() hooks and makes it work even in dynticks mode.
Signed-off-by: Frederic Weisbecker fweis...@gmail.com Frederic Weisbecker (7): irq_work: Fix racy IRQ_WORK_BUSY flag setting irq_work: Fix racy check on work pending flag irq_work: Remove CONFIG_HAVE_IRQ_WORK nohz: Add API to check tick state irq_work: Don't stop the tick with pending works irq_work: Make self-IPIs optable printk: Wake up klogd using irq_work arch/alpha/Kconfig |1 - arch/arm/Kconfig|1 - arch/arm64/Kconfig |1 - arch/blackfin/Kconfig |1 - arch/frv/Kconfig|1 - arch/hexagon/Kconfig|1 - arch/mips/Kconfig |1 - arch/parisc/Kconfig |1 - arch/powerpc/Kconfig|1 - arch/s390/Kconfig |1 - arch/sh/Kconfig |1 - arch/sparc/Kconfig |1 - arch/x86/Kconfig|1 - drivers/staging/iio/trigger/Kconfig |1 - include/linux/irq_work.h| 20 + include/linux/printk.h |3 -- include/linux/tick.h| 17 +++- init/Kconfig|5 +-- kernel/irq_work.c | 76 +++ kernel/printk.c | 36 + kernel/time/tick-sched.c|7 ++-- kernel/timer.c |1 - 22 files changed, 112 insertions(+), 67 deletions(-) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/7] irq_work: Fix racy IRQ_WORK_BUSY flag setting
The IRQ_WORK_BUSY flag is set right before we execute the work. Once this flag value is set, the work enters a claimable state again. So if we have specific data to compute in our work, we ensure it's either handled by another CPU or locally by enqueuing the work again. This state machine is guaranteed by atomic operations on the flags. So when we set IRQ_WORK_BUSY without using an xchg-like operation, we break this guarantee, as in the following summarized scenario:

        CPU 1                                   CPU 2
        -----                                   -----
        (flags = 0)
        old_flags = flags;
        (flags = 0)
        cmpxchg(flags, old_flags,
                old_flags | IRQ_WORK_FLAGS)
        (flags = 3)
        [...]
        flags = IRQ_WORK_BUSY
        (flags = 2)
        func()
                                                (sees flags = 3)
                                                cmpxchg(flags, old_flags,
                                                        old_flags | IRQ_WORK_FLAGS)
                                                (give up)
        cmpxchg(flags, 2, 0);
        (flags = 0)

CPU 1 claims a work and executes it, so it sets IRQ_WORK_BUSY and the work is again in a claimable state. Now CPU 2 has new data to process and tries to claim that work, but it may see a stale value of the flags and think the work is still pending somewhere that will handle our data. This is because CPU 1 doesn't set IRQ_WORK_BUSY atomically. As a result, the data expected to be handled by CPU 2 won't get handled.

To fix this, use xchg() to set IRQ_WORK_BUSY; this way we ensure that CPU 2 will see the correct value with cmpxchg() using the expected ordering.
Changelog-heavily-inspired-by: Steven Rostedt rost...@goodmis.org Signed-off-by: Frederic Weisbecker fweis...@gmail.com Acked-by: Steven Rostedt rost...@goodmis.org Cc: Peter Zijlstra pet...@infradead.org Cc: Ingo Molnar mi...@kernel.org Cc: Thomas Gleixner t...@linutronix.de Cc: Andrew Morton a...@linux-foundation.org Cc: Paul Gortmaker paul.gortma...@windriver.com Cc: Anish Kumar anish198519851...@gmail.com --- kernel/irq_work.c |5 - 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/kernel/irq_work.c b/kernel/irq_work.c index 1588e3b..57be1a6 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -119,8 +119,11 @@ void irq_work_run(void) /* * Clear the PENDING bit, after this point the @work * can be re-used. +* Make it immediately visible so that other CPUs trying +* to claim that work don't rely on us to handle their data +* while we are in the middle of the func. */ - work-flags = IRQ_WORK_BUSY; + xchg(work-flags, IRQ_WORK_BUSY); work-func(work); /* * Clear the BUSY bit and return to the free state if -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 4/7] nohz: Add API to check tick state
We need some quick way to check if the CPU has stopped its tick. This will be useful to implement the printk tick using the irq work subsystem. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Cc: Peter Zijlstra pet...@infradead.org Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Steven Rostedt rost...@goodmis.org Cc: Paul Gortmaker paul.gortma...@windriver.com --- include/linux/tick.h | 17 - kernel/time/tick-sched.c |2 +- 2 files changed, 17 insertions(+), 2 deletions(-) diff --git a/include/linux/tick.h b/include/linux/tick.h index f37fceb..2307dd3 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -8,6 +8,8 @@ #include linux/clockchips.h #include linux/irqflags.h +#include linux/percpu.h +#include linux/hrtimer.h #ifdef CONFIG_GENERIC_CLOCKEVENTS @@ -122,13 +124,26 @@ static inline int tick_oneshot_mode_active(void) { return 0; } #endif /* !CONFIG_GENERIC_CLOCKEVENTS */ # ifdef CONFIG_NO_HZ +DECLARE_PER_CPU(struct tick_sched, tick_cpu_sched); + +static inline int tick_nohz_tick_stopped(void) +{ + return __this_cpu_read(tick_cpu_sched.tick_stopped); +} + extern void tick_nohz_idle_enter(void); extern void tick_nohz_idle_exit(void); extern void tick_nohz_irq_exit(void); extern ktime_t tick_nohz_get_sleep_length(void); extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time); extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time); -# else + +# else /* !CONFIG_NO_HZ */ +static inline int tick_nohz_tick_stopped(void) +{ + return 0; +} + static inline void tick_nohz_idle_enter(void) { } static inline void tick_nohz_idle_exit(void) { } diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index a402608..9e945aa 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -28,7 +28,7 @@ /* * Per cpu nohz control structure */ -static DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched); +DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched); /* * The 
time, when the last jiffy update happened. Protected by xtime_lock. -- 1.7.5.4
[PATCH 5/7] irq_work: Don't stop the tick with pending works
Don't stop the tick if we have pending irq works on the queue, otherwise if the arch can't raise self-IPIs, we may not find an opportunity to execute the pending works for a while. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Cc: Peter Zijlstra pet...@infradead.org Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Steven Rostedt rost...@goodmis.org Cc: Paul Gortmaker paul.gortma...@windriver.com --- include/linux/irq_work.h |6 ++ kernel/irq_work.c| 11 +++ kernel/time/tick-sched.c |3 ++- 3 files changed, 19 insertions(+), 1 deletions(-) diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h index 6a9e8f5..a69704f 100644 --- a/include/linux/irq_work.h +++ b/include/linux/irq_work.h @@ -20,4 +20,10 @@ bool irq_work_queue(struct irq_work *work); void irq_work_run(void); void irq_work_sync(struct irq_work *work); +#ifdef CONFIG_IRQ_WORK +bool irq_work_needs_cpu(void); +#else +static bool irq_work_needs_cpu(void) { return false; } +#endif + #endif /* _LINUX_IRQ_WORK_H */ diff --git a/kernel/irq_work.c b/kernel/irq_work.c index 64eddd5..b3c113a 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -99,6 +99,17 @@ bool irq_work_queue(struct irq_work *work) } EXPORT_SYMBOL_GPL(irq_work_queue); +bool irq_work_needs_cpu(void) +{ + struct llist_head *this_list; + + this_list = __get_cpu_var(irq_work_list); + if (llist_empty(this_list)) + return false; + + return true; +} + /* * Run the irq_work entries on this cpu. Requires to be ran from hardirq * context with local IRQs disabled. 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 9e945aa..f249e8c 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -20,6 +20,7 @@ #include linux/profile.h #include linux/sched.h #include linux/module.h +#include linux/irq_work.h #include asm/irq_regs.h @@ -289,7 +290,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts, } while (read_seqretry(xtime_lock, seq)); if (rcu_needs_cpu(cpu, rcu_delta_jiffies) || printk_needs_cpu(cpu) || - arch_needs_cpu(cpu)) { + arch_needs_cpu(cpu) || irq_work_needs_cpu()) { next_jiffies = last_jiffies + 1; delta_jiffies = 1; } else { -- 1.7.5.4
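The gating this patch adds can be condensed into a small userspace sketch (hypothetical `sketch_` names; the kernel's per-CPU lockless llist is reduced to a plain pointer): the tick may only be deferred when nothing, including queued irq work, still needs this CPU.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for the per-CPU lockless list of irq works. */
struct sketch_llnode { struct sketch_llnode *next; };
static struct sketch_llnode *sketch_irq_work_list; /* this CPU's queue */

/* Mirrors irq_work_needs_cpu(): any queued entry keeps the tick alive. */
static bool sketch_irq_work_needs_cpu(void)
{
	return sketch_irq_work_list != NULL;
}

/*
 * The tick-stop condition from tick_nohz_stop_sched_tick() after the
 * patch: any pending requirement forces the next tick in one jiffy
 * instead of letting the CPU sleep indefinitely.
 */
static bool sketch_can_defer_tick(bool rcu_needs, bool arch_needs)
{
	return !(rcu_needs || arch_needs || sketch_irq_work_needs_cpu());
}
```

This matters precisely on archs without self-IPIs: there, the periodic tick is the only guaranteed opportunity to run queued works.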
[PATCH 2/7] irq_work: Fix racy check on work pending flag
Work claiming wants to be SMP-safe. By the time we try to claim a work, if it is already executing concurrently on another CPU, we want the claim to succeed and the work to be queued again, because the other CPU may have missed the data we wanted to handle in our work if it's about to complete there. This scenario is summarized below:

        CPU 1                                   CPU 2
        -----                                   -----
        (flags = 0)
        cmpxchg(flags, 0, IRQ_WORK_FLAGS)
        (flags = 3)
        [...]
        xchg(flags, IRQ_WORK_BUSY)
        (flags = 2)
        func()
                                                if (flags & IRQ_WORK_PENDING)
                                                        (not true)
                                                cmpxchg(flags, flags, IRQ_WORK_FLAGS)
                                                (flags = 3)
        [...]
        cmpxchg(flags, IRQ_WORK_BUSY, 0);
        (fail, pending on CPU 2)

This state machine is synchronized using [cmp]xchg() on the flags. As such, the early IRQ_WORK_PENDING check in CPU 2 above is racy. By the time we check it, we may be dealing with a stale value because we aren't using an atomic accessor. As a result, CPU 2 may see the work as still pending on another CPU while that CPU may actually be completing the work function execution already, leaving our data unprocessed. To fix this, we start by speculating about the value we wish to find in work->flags, but we only draw any conclusion from the value returned by the cmpxchg() call that either claims the work or lets the current owner handle the pending work for us.
Changelog-heavily-inspired-by: Steven Rostedt rost...@goodmis.org Signed-off-by: Frederic Weisbecker fweis...@gmail.com Acked-by: Steven Rostedt rost...@goodmis.org Cc: Peter Zijlstra pet...@infradead.org Cc: Ingo Molnar mi...@kernel.org Cc: Thomas Gleixner t...@linutronix.de Cc: Andrew Morton a...@linux-foundation.org Cc: Paul Gortmaker paul.gortma...@windriver.com Cc: Anish Kumar anish198519851...@gmail.com --- kernel/irq_work.c | 16 +++- 1 files changed, 11 insertions(+), 5 deletions(-) diff --git a/kernel/irq_work.c b/kernel/irq_work.c index 57be1a6..64eddd5 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -34,15 +34,21 @@ static DEFINE_PER_CPU(struct llist_head, irq_work_list); */ static bool irq_work_claim(struct irq_work *work) { - unsigned long flags, nflags; + unsigned long flags, oflags, nflags; + /* +* Start with our best wish as a premise but only trust any +* flag value after cmpxchg() result. +*/ + flags = work-flags ~IRQ_WORK_PENDING; for (;;) { - flags = work-flags; - if (flags IRQ_WORK_PENDING) - return false; nflags = flags | IRQ_WORK_FLAGS; - if (cmpxchg(work-flags, flags, nflags) == flags) + oflags = cmpxchg(work-flags, flags, nflags); + if (oflags == flags) break; + if (oflags IRQ_WORK_PENDING) + return false; + flags = oflags; cpu_relax(); } -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
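To see the fixed loop in isolation, here is a minimal userspace transposition using C11 atomics in place of the kernel's cmpxchg() (the `sketch_` names are invented; this is an illustration of the algorithm, not the kernel code): the claimer starts from a speculative "PENDING clear" premise, but trusts only the value the compare-exchange actually returns.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define IRQ_WORK_PENDING 1UL
#define IRQ_WORK_BUSY    2UL
#define IRQ_WORK_FLAGS   3UL

struct sketch_work {
	_Atomic unsigned long flags;
};

/*
 * The fixed claim loop: a returned value with PENDING set means another
 * owner is queued and will see our data, so we give up; any other
 * mismatch just means our premise was stale, so we retry from the
 * freshly observed flags.
 */
static bool sketch_irq_work_claim(struct sketch_work *work)
{
	unsigned long flags, oflags, nflags;

	flags = atomic_load(&work->flags) & ~IRQ_WORK_PENDING;
	for (;;) {
		nflags = flags | IRQ_WORK_FLAGS;
		oflags = flags;
		if (atomic_compare_exchange_strong(&work->flags, &oflags, nflags))
			return true;	/* claimed: free or busy -> pending|busy */
		if (oflags & IRQ_WORK_PENDING)
			return false;	/* already pending elsewhere */
		flags = oflags;		/* premise was stale: retry */
	}
}
```

Note how a work observed in the busy state (flags = 2) is claimable again, which is exactly the "executing concurrently on another CPU" case the changelog describes.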
[PATCH 7/7] printk: Wake up klogd using irq_work
klogd is woken up asynchronously from the tick in order to do it safely. However if printk is called when the tick is stopped, the reader won't be woken up until the next interrupt, which might not fire for a while. As a result, the user may miss some message. To fix this, lets implement the printk tick using a lazy irq work. This subsystem takes care of the timer tick state and can fix up accordingly. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Cc: Peter Zijlstra pet...@infradead.org Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Steven Rostedt rost...@goodmis.org Cc: Paul Gortmaker paul.gortma...@windriver.com --- include/linux/printk.h |3 --- init/Kconfig |1 + kernel/printk.c | 36 kernel/time/tick-sched.c |2 +- kernel/timer.c |1 - 5 files changed, 22 insertions(+), 21 deletions(-) diff --git a/include/linux/printk.h b/include/linux/printk.h index 9afc01e..86c4b62 100644 --- a/include/linux/printk.h +++ b/include/linux/printk.h @@ -98,9 +98,6 @@ int no_printk(const char *fmt, ...) extern asmlinkage __printf(1, 2) void early_printk(const char *fmt, ...); -extern int printk_needs_cpu(int cpu); -extern void printk_tick(void); - #ifdef CONFIG_PRINTK asmlinkage __printf(5, 0) int vprintk_emit(int facility, int level, diff --git a/init/Kconfig b/init/Kconfig index cdc152c..c575566 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1196,6 +1196,7 @@ config HOTPLUG config PRINTK default y bool Enable support for printk if EXPERT + select IRQ_WORK help This option enables normal printk support. 
Removing it eliminates most of the message strings from the kernel image diff --git a/kernel/printk.c b/kernel/printk.c index 2d607f4..c9104fe 100644 --- a/kernel/printk.c +++ b/kernel/printk.c @@ -42,6 +42,7 @@ #include linux/notifier.h #include linux/rculist.h #include linux/poll.h +#include linux/irq_work.h #include asm/uaccess.h @@ -1955,30 +1956,32 @@ int is_console_locked(void) static DEFINE_PER_CPU(int, printk_pending); static DEFINE_PER_CPU(char [PRINTK_BUF_SIZE], printk_sched_buf); -void printk_tick(void) +static void wake_up_klogd_work_func(struct irq_work *irq_work) { - if (__this_cpu_read(printk_pending)) { - int pending = __this_cpu_xchg(printk_pending, 0); - if (pending PRINTK_PENDING_SCHED) { - char *buf = __get_cpu_var(printk_sched_buf); - printk(KERN_WARNING [sched_delayed] %s, buf); - } - if (pending PRINTK_PENDING_WAKEUP) - wake_up_interruptible(log_wait); + int pending = __this_cpu_xchg(printk_pending, 0); + + if (pending PRINTK_PENDING_SCHED) { + char *buf = __get_cpu_var(printk_sched_buf); + printk(KERN_WARNING [sched_delayed] %s, buf); } -} -int printk_needs_cpu(int cpu) -{ - if (cpu_is_offline(cpu)) - printk_tick(); - return __this_cpu_read(printk_pending); + if (pending PRINTK_PENDING_WAKEUP) + wake_up_interruptible(log_wait); } +static DEFINE_PER_CPU(struct irq_work, wake_up_klogd_work) = { + .func = wake_up_klogd_work_func, + .flags = IRQ_WORK_LAZY, +}; + void wake_up_klogd(void) { - if (waitqueue_active(log_wait)) + preempt_disable(); + if (waitqueue_active(log_wait)) { this_cpu_or(printk_pending, PRINTK_PENDING_WAKEUP); + irq_work_queue(__get_cpu_var(wake_up_klogd_work)); + } + preempt_enable(); } static void console_cont_flush(char *text, size_t size) @@ -2458,6 +2461,7 @@ int printk_sched(const char *fmt, ...) 
va_end(args); __this_cpu_or(printk_pending, PRINTK_PENDING_SCHED); + irq_work_queue(__get_cpu_var(wake_up_klogd_work)); local_irq_restore(flags); return r; diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index f249e8c..822d757 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -289,7 +289,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts, time_delta = timekeeping_max_deferment(); } while (read_seqretry(xtime_lock, seq)); - if (rcu_needs_cpu(cpu, rcu_delta_jiffies) || printk_needs_cpu(cpu) || + if (rcu_needs_cpu(cpu, rcu_delta_jiffies) || arch_needs_cpu(cpu) || irq_work_needs_cpu()) { next_jiffies = last_jiffies + 1; delta_jiffies = 1; diff --git a/kernel/timer.c b/kernel/timer.c index 367d008..ff3b516 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -1351,7 +1351,6 @@ void update_process_times(int user_tick) account_process_tick(p, user_tick); run_local_timers(); rcu_check_callbacks(cpu, user_tick); - printk_tick(); #ifdef
[PATCH 6/7] irq_work: Make self-IPIs optable
On irq work initialization, let the user choose to define it as lazy or not. Lazy means that we don't want to send an IPI (provided the arch can anyway) when we enqueue this work but we rather prefer to wait for the next timer tick to execute our work if possible. This is going to be a benefit for non-urgent enqueuers (like printk in the future) that may prefer not to raise an IPI storm in case of frequent enqueuing on short periods of time. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Cc: Peter Zijlstra pet...@infradead.org Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Steven Rostedt rost...@goodmis.org Cc: Paul Gortmaker paul.gortma...@windriver.com --- include/linux/irq_work.h | 14 ++ kernel/irq_work.c| 46 ++ 2 files changed, 40 insertions(+), 20 deletions(-) diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h index a69704f..b28eb60 100644 --- a/include/linux/irq_work.h +++ b/include/linux/irq_work.h @@ -3,6 +3,20 @@ #include linux/llist.h +/* + * An entry can be in one of four states: + * + * free NULL, 0 - {claimed} : free to be used + * claimed NULL, 3 - {pending} : claimed to be enqueued + * pending next, 3 - {busy} : queued, pending callback + * busy NULL, 2 - {free, claimed} : callback in progress, can be claimed + */ + +#define IRQ_WORK_PENDING 1UL +#define IRQ_WORK_BUSY 2UL +#define IRQ_WORK_FLAGS 3UL +#define IRQ_WORK_LAZY 4UL /* Doesn't want IPI, wait for tick */ + struct irq_work { unsigned long flags; struct llist_node llnode; diff --git a/kernel/irq_work.c b/kernel/irq_work.c index b3c113a..65c65dc 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -12,22 +12,13 @@ #include linux/percpu.h #include linux/hardirq.h #include linux/irqflags.h +#include linux/sched.h +#include linux/tick.h #include asm/processor.h -/* - * An entry can be in one of four states: - * - * free NULL, 0 - {claimed} : free to be used - * claimed NULL, 3 - {pending} : claimed 
to be enqueued - * pending next, 3 - {busy} : queued, pending callback - * busy NULL, 2 - {free, claimed} : callback in progress, can be claimed - */ - -#define IRQ_WORK_PENDING 1UL -#define IRQ_WORK_BUSY 2UL -#define IRQ_WORK_FLAGS 3UL static DEFINE_PER_CPU(struct llist_head, irq_work_list); +static DEFINE_PER_CPU(int, irq_work_raised); /* * Claim the entry so that no one else will poke at it. @@ -67,14 +58,18 @@ void __weak arch_irq_work_raise(void) */ static void __irq_work_queue(struct irq_work *work) { - bool empty; - preempt_disable(); - empty = llist_add(work-llnode, __get_cpu_var(irq_work_list)); - /* The list was empty, raise self-interrupt to start processing. */ - if (empty) - arch_irq_work_raise(); + llist_add(work-llnode, __get_cpu_var(irq_work_list)); + + /* +* If the work is flagged as lazy, just wait for the next tick +* to run it. Otherwise, or if the tick is stopped, raise the irq work. +*/ + if (!(work-flags IRQ_WORK_LAZY) || tick_nohz_tick_stopped()) { + if (!this_cpu_cmpxchg(irq_work_raised, 0, 1)) + arch_irq_work_raise(); + } preempt_enable(); } @@ -116,10 +111,19 @@ bool irq_work_needs_cpu(void) */ void irq_work_run(void) { + unsigned long flags; struct irq_work *work; struct llist_head *this_list; struct llist_node *llnode; + + /* +* Reset the raised state right before we check the list because +* an NMI may enqueue after we find the list empty from the runner. +*/ + __this_cpu_write(irq_work_raised, 0); + barrier(); + this_list = __get_cpu_var(irq_work_list); if (llist_empty(this_list)) return; @@ -140,13 +144,15 @@ void irq_work_run(void) * to claim that work don't rely on us to handle their data * while we are in the middle of the func. */ - xchg(work-flags, IRQ_WORK_BUSY); + flags = work-flags ~IRQ_WORK_PENDING; + xchg(work-flags, flags); + work-func(work); /* * Clear the BUSY bit and return to the free state if * no-one else claimed it meanwhile. 
*/ - (void)cmpxchg(work-flags, IRQ_WORK_BUSY, 0); + (void)cmpxchg(work-flags, flags, flags ~IRQ_WORK_BUSY); } } EXPORT_SYMBOL_GPL(irq_work_run); -- 1.7.5.4
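The queue-side decision this patch introduces can be condensed into a tiny userspace sketch (hypothetical `sketch_` names; the self-IPI is reduced to a counter): an IPI is raised only for non-lazy work, or for lazy work when the tick is stopped and no timer interrupt will come, and the irq_work_raised latch coalesces raises until the works are run.

```c
#include <stdbool.h>

#define IRQ_WORK_LAZY 4UL

static int sketch_ipis_sent;	/* stands in for arch_irq_work_raise() */
static int sketch_raised;	/* stands in for the per-CPU irq_work_raised */

/*
 * Decision from __irq_work_queue(): lazy work rides the next tick,
 * unless the tick is stopped.  The raised latch ensures only one
 * self-IPI is in flight at a time, avoiding IPI storms on frequent
 * enqueuing.
 */
static void sketch_queue(unsigned long work_flags, bool tick_stopped)
{
	if (!(work_flags & IRQ_WORK_LAZY) || tick_stopped) {
		if (!sketch_raised) {
			sketch_raised = 1;
			sketch_ipis_sent++;
		}
	}
}
```

In the kernel the latch is reset at the top of irq_work_run() (before the list check, so an NMI enqueue isn't lost); the sketch models that reset as a plain store.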
[PATCH 3/7] irq_work: Remove CONFIG_HAVE_IRQ_WORK
irq work can run on any arch even without IPI support because of the hook on update_process_times(). So lets remove HAVE_IRQ_WORK because it doesn't reflect any backend requirement. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Cc: Peter Zijlstra pet...@infradead.org Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Steven Rostedt rost...@goodmis.org Cc: Paul Gortmaker paul.gortma...@windriver.com --- arch/alpha/Kconfig |1 - arch/arm/Kconfig|1 - arch/arm64/Kconfig |1 - arch/blackfin/Kconfig |1 - arch/frv/Kconfig|1 - arch/hexagon/Kconfig|1 - arch/mips/Kconfig |1 - arch/parisc/Kconfig |1 - arch/powerpc/Kconfig|1 - arch/s390/Kconfig |1 - arch/sh/Kconfig |1 - arch/sparc/Kconfig |1 - arch/x86/Kconfig|1 - drivers/staging/iio/trigger/Kconfig |1 - init/Kconfig|4 15 files changed, 0 insertions(+), 18 deletions(-) diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig index 5dd7f5d..e56c2d1 100644 --- a/arch/alpha/Kconfig +++ b/arch/alpha/Kconfig @@ -5,7 +5,6 @@ config ALPHA select HAVE_IDE select HAVE_OPROFILE select HAVE_SYSCALL_WRAPPERS - select HAVE_IRQ_WORK select HAVE_PCSPKR_PLATFORM select HAVE_PERF_EVENTS select HAVE_DMA_ATTRS diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index ade7e92..22d378b 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -36,7 +36,6 @@ config ARM select HAVE_GENERIC_HARDIRQS select HAVE_HW_BREAKPOINT if (PERF_EVENTS (CPU_V6 || CPU_V6K || CPU_V7)) select HAVE_IDE if PCI || ISA || PCMCIA - select HAVE_IRQ_WORK select HAVE_KERNEL_GZIP select HAVE_KERNEL_LZMA select HAVE_KERNEL_LZO diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index ef54a59..dd50d72 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -17,7 +17,6 @@ config ARM64 select HAVE_GENERIC_DMA_COHERENT select HAVE_GENERIC_HARDIRQS select HAVE_HW_BREAKPOINT if PERF_EVENTS - select HAVE_IRQ_WORK select HAVE_MEMBLOCK select HAVE_PERF_EVENTS select HAVE_SPARSE_IRQ diff --git 
a/arch/blackfin/Kconfig b/arch/blackfin/Kconfig index b6f3ad5..86f891f 100644 --- a/arch/blackfin/Kconfig +++ b/arch/blackfin/Kconfig @@ -24,7 +24,6 @@ config BLACKFIN select HAVE_FUNCTION_TRACER select HAVE_FUNCTION_TRACE_MCOUNT_TEST select HAVE_IDE - select HAVE_IRQ_WORK select HAVE_KERNEL_GZIP if RAMKERNEL select HAVE_KERNEL_BZIP2 if RAMKERNEL select HAVE_KERNEL_LZMA if RAMKERNEL diff --git a/arch/frv/Kconfig b/arch/frv/Kconfig index df2eb4b..c44fd6e 100644 --- a/arch/frv/Kconfig +++ b/arch/frv/Kconfig @@ -3,7 +3,6 @@ config FRV default y select HAVE_IDE select HAVE_ARCH_TRACEHOOK - select HAVE_IRQ_WORK select HAVE_PERF_EVENTS select HAVE_UID16 select HAVE_GENERIC_HARDIRQS diff --git a/arch/hexagon/Kconfig b/arch/hexagon/Kconfig index 0744f7d..40a3185 100644 --- a/arch/hexagon/Kconfig +++ b/arch/hexagon/Kconfig @@ -14,7 +14,6 @@ config HEXAGON # select HAVE_CLK # select IRQ_PER_CPU # select GENERIC_PENDING_IRQ if SMP - select HAVE_IRQ_WORK select GENERIC_ATOMIC64 select HAVE_PERF_EVENTS select HAVE_GENERIC_HARDIRQS diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index dba9390..3d86d69 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -4,7 +4,6 @@ config MIPS select HAVE_GENERIC_DMA_COHERENT select HAVE_IDE select HAVE_OPROFILE - select HAVE_IRQ_WORK select HAVE_PERF_EVENTS select PERF_USE_VMALLOC select HAVE_ARCH_KGDB diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig index 11def45..8f0df47 100644 --- a/arch/parisc/Kconfig +++ b/arch/parisc/Kconfig @@ -9,7 +9,6 @@ config PARISC select RTC_DRV_GENERIC select INIT_ALL_POSSIBLE select BUG - select HAVE_IRQ_WORK select HAVE_PERF_EVENTS select GENERIC_ATOMIC64 if !64BIT select HAVE_GENERIC_HARDIRQS diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index a902a5c..a90f0c9 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -118,7 +118,6 @@ config PPC select HAVE_SYSCALL_WRAPPERS if PPC64 select GENERIC_ATOMIC64 if PPC32 select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE - select HAVE_IRQ_WORK 
select HAVE_PERF_EVENTS select HAVE_REGS_AND_STACK_ACCESS_API select HAVE_HW_BREAKPOINT if PERF_EVENTS PPC_BOOK3S_64 diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 5dba755..0816ff0 100644
Re: [PATCH 7/7] printk: Wake up klogd using irq_work
2012/11/15 Steven Rostedt rost...@goodmis.org: On Wed, 2012-11-14 at 21:37 +0100, Frederic Weisbecker wrote: diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index f249e8c..822d757 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -289,7 +289,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts, time_delta = timekeeping_max_deferment(); } while (read_seqretry(xtime_lock, seq)); - if (rcu_needs_cpu(cpu, rcu_delta_jiffies) || printk_needs_cpu(cpu) || + if (rcu_needs_cpu(cpu, rcu_delta_jiffies) || If the CPU is going offline, the printk_tick() would be executed here. But now that printk_tick() is done with the irq_work code, it won't be executed till the next tick. Could this cause a missed printk because of this, if the cpu is going offline? Actually, how does irq_work in general handle cpu offline work? Good point, and that's not trivial to solve. The hotplug down sequence does:

        CPU that offlines                       CPU offlining
        -----------------                       -------------
        cpu_down() {
            __stop_machine(take_cpu_down)
                                                take_cpu_down() {
                                                    __cpu_disable() {
                                                        * disable irqs in hw
                                                        * clear from online mask
                                                    }
                                                    move all tasks somewhere
                                                }
            while (!idle_cpu(offlining))
                cpu_relax()
            cpu_die();
        }

So the offlining CPU goes to idle in the end once irqs are disabled at the apic level. Does that include the timer tick? If so, then the last resort to offline without irq works in the queue is to make take_cpu_down() ask for a retry if there are pending irq works during its execution. Now if we have printk() calls between __cpu_disable() and the idle loop, they will be lost until the next onlining. Unless we do an explicit call to printk_tick() from the idle loop if the CPU is offline. Note that !CONFIG_NO_HZ doesn't seem to handle that. Which makes me wonder if the tick is really part of the whole IRQ disablement done in __cpu_disable().
Re: [PATCH 7/7] printk: Wake up klogd using irq_work
2012/11/15 Frederic Weisbecker fweis...@gmail.com: - CPU that offilines CPU offlining - - cpu_down() { __stop_machine(take_cpu_down) take_cpu_down() { __cpu_disable() { * disable irqs in hw * clear from online mask } move all tasks somewhere } while (!idle_cpu(offlining)) cpu_relax() cpu_die(); - Oh thanks gmail for the mess. Sometimes it mangles contents, sometimes not. Probably depend if the moon is odd or even. Here is a pastebin: http://pastebin.com/aACvyu6p
Re: [PATCH 5/7] irq_work: Don't stop the tick with pending works
2012/11/15 Steven Rostedt rost...@goodmis.org: On Wed, 2012-11-14 at 21:37 +0100, Frederic Weisbecker wrote: diff --git a/kernel/irq_work.c b/kernel/irq_work.c index 64eddd5..b3c113a 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -99,6 +99,17 @@ bool irq_work_queue(struct irq_work *work) } EXPORT_SYMBOL_GPL(irq_work_queue); +bool irq_work_needs_cpu(void) +{ + struct llist_head *this_list; + + this_list = __get_cpu_var(irq_work_list); + if (llist_empty(this_list)) + return false; + I wonder if this should just be: return !llist_empty(this_cpu_read(irq_work_list)); Yeah, I'll simplify it that way.
Re: [PATCH RFC] irq_work: Flush work on CPU_DYING (was: Re: [PATCH 7/7] printk: Wake up klogd using irq_work)
2012/11/15 Steven Rostedt rost...@goodmis.org: On Thu, 2012-11-15 at 16:25 +0100, Frederic Weisbecker wrote: 2012/11/15 Steven Rostedt rost...@goodmis.org: On Wed, 2012-11-14 at 21:37 +0100, Frederic Weisbecker wrote: diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index f249e8c..822d757 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -289,7 +289,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts, time_delta = timekeeping_max_deferment(); } while (read_seqretry(xtime_lock, seq)); - if (rcu_needs_cpu(cpu, rcu_delta_jiffies) || printk_needs_cpu(cpu) || + if (rcu_needs_cpu(cpu, rcu_delta_jiffies) || If the CPU is going offline, the printk_tick() would be executed here. But now that printk_tick() is done with the irq_work code, it wont be executed till the next tick. Could this cause a missed printk because of this, if the cpu is going offline? Actually, how does irq_work in general handle cpu offline work? Good point, and that's not trivial to solve. The hotplug down sequence does: - CPU that offilines CPU offlining - - cpu_down() { __stop_machine(take_cpu_down) take_cpu_down() { __cpu_disable() { * disable irqs in hw * clear from online mask } move all tasks somewhere } while (!idle_cpu(offlining)) cpu_relax() cpu_die(); - So the offlining CPU goes to idle in the end once irqs are disabled in the apic level. Does that include the timer tick? If so then the last resort to offline without irq works in the queue is to make take_cpu_down() ask for a retry if there are pending irq works during its execution. Now if we have printk() calls between __cpu_disable() and the idle loop, they will be lost until the next onlining. Unless we do an explicit call to printk_tick() from the idle loop if the CPU is offline. Note that !CONFIG_NO_HZ doesn't seem to handle that. Which makes me wonder if the tick is really part of the whole IRQ disablement done in __cpu_disable(). How about flushing all irq_work from CPU_DYING. 
The notifier is called by stop_machine on the CPU that is going down. Grant you, the code will not be called from irq context (so things like get_irq_regs() wont work) but I'm not sure what the requirements are for irq_work in that regard (Peter?). But irqs are disabled and the CPU is about to go offline. Might as well flush the work. I ran this against my stress_cpu_hotplug script (attached) and it seemed to work fine. I even did a: perf record ./stress-cpu-hotplug Signed-off-by: Steven Rostedt rost...@goodmis.org Index: linux-rt.git/kernel/irq_work.c === --- linux-rt.git.orig/kernel/irq_work.c +++ linux-rt.git/kernel/irq_work.c @@ -14,6 +14,7 @@ #include linux/irqflags.h #include linux/sched.h #include linux/tick.h +#include linux/cpu.h #include asm/processor.h @@ -105,11 +106,7 @@ bool irq_work_needs_cpu(void) return true; } -/* - * Run the irq_work entries on this cpu. Requires to be ran from hardirq - * context with local IRQs disabled. - */ -void irq_work_run(void) +static void __irq_work_run(void) { unsigned long flags; struct irq_work *work; @@ -128,7 +125,6 @@ void irq_work_run(void) if (llist_empty(this_list)) return; - BUG_ON(!in_irq()); BUG_ON(!irqs_disabled()); llnode = llist_del_all(this_list); @@ -155,8 +151,23 @@ void irq_work_run(void) (void)cmpxchg(work-flags, flags, flags ~IRQ_WORK_BUSY); } } + +/* + * Run the irq_work entries on this cpu. Requires to be ran from hardirq + * context with local IRQs disabled. + */ +void irq_work_run(void) +{ + BUG_ON(!in_irq()); + __irq_work_run(); +} EXPORT_SYMBOL_GPL(irq_work_run); +static void irq_work_run_cpu_down(void) +{ + __irq_work_run(); +} + /* * Synchronize against the irq_work @entry, ensures the entry is not * currently in use. 
@@ -169,3 +180,35 @@ void irq_work_sync(struct irq_work *work cpu_relax(); } EXPORT_SYMBOL_GPL(irq_work_sync); + +#ifdef CONFIG_HOTPLUG_CPU +static int irq_work_cpu_notify(struct notifier_block *self, + unsigned long action, void *hcpu) +{ + long cpu = (long)hcpu; + + switch (action) { + case CPU_DYING: Looks good. Perf has already deactivated the cpu wide events on CPU_DOWN_PREPARE. I suspect it's the only irq work enqueuer from NMI. At this stage of cpu down hotplug, irqs are deactivated so the last possible enqueuers before the CPU goes idle/down are from subsequent
Re: [PATCH RFC] irq_work: Warn if there's still work on cpu_down
2012/11/15 Steven Rostedt rost...@goodmis.org: If we are in nohz and there's still irq_work to be done when the idle task is about to go offline. Give a nasty warning. Signed-off-by: Steven Rostedt rost...@goodmis.org Index: linux-rt.git/kernel/irq_work.c === --- linux-rt.git.orig/kernel/irq_work.c +++ linux-rt.git/kernel/irq_work.c @@ -103,6 +103,9 @@ bool irq_work_needs_cpu(void) if (llist_empty(this_list)) return false; + /* All work should have been flushed before going offline */ + WARN_ON_ONCE(cpu_is_offline(smp_processor_id())); Should we return false in that case? I don't know what can happen if we wait for one more tick while the CPU is offline and apic is deactivated. + return true; }
[PATCH 1/9] irq_work: Fix racy IRQ_WORK_BUSY flag setting
The IRQ_WORK_BUSY flag is set right before we execute the work. Once this flag value is set, the work enters a claimable state again. So if we have specific data to compute in our work, we ensure it's either handled by another CPU or locally by enqueuing the work again. This state machine is guaranteed by atomic operations on the flags. So when we set IRQ_WORK_BUSY without using an xchg-like operation, we break this guarantee, as in the following summarized scenario:

        CPU 1                                   CPU 2
        -----                                   -----
        (flags = 0)
        old_flags = flags;
        (flags = 0)
        cmpxchg(flags, old_flags,
                old_flags | IRQ_WORK_FLAGS)
        (flags = 3)
        [...]
        flags = IRQ_WORK_BUSY
        (flags = 2)
        func()
                                                (sees flags = 3)
                                                cmpxchg(flags, old_flags,
                                                        old_flags | IRQ_WORK_FLAGS)
                                                (give up)
        cmpxchg(flags, 2, 0);
        (flags = 0)

CPU 1 claims a work and executes it, so it sets IRQ_WORK_BUSY and the work is again in a claimable state. Now CPU 2 has new data to process and tries to claim that work, but it may see a stale value of the flags and think the work is still pending somewhere that will handle our data. This is because CPU 1 doesn't set IRQ_WORK_BUSY atomically. As a result, the data expected to be handled by CPU 2 won't get handled. To fix this, use xchg() to set IRQ_WORK_BUSY; this way we ensure CPU 2 will see the correct value with cmpxchg(), using the expected ordering.
Changelog-heavily-inspired-by: Steven Rostedt rost...@goodmis.org Signed-off-by: Frederic Weisbecker fweis...@gmail.com Acked-by: Steven Rostedt rost...@goodmis.org Cc: Peter Zijlstra pet...@infradead.org Cc: Ingo Molnar mi...@kernel.org Cc: Thomas Gleixner t...@linutronix.de Cc: Andrew Morton a...@linux-foundation.org Cc: Paul Gortmaker paul.gortma...@windriver.com Cc: Anish Kumar anish198519851...@gmail.com --- kernel/irq_work.c |5 - 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/kernel/irq_work.c b/kernel/irq_work.c index 1588e3b..57be1a6 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -119,8 +119,11 @@ void irq_work_run(void) /* * Clear the PENDING bit, after this point the @work * can be re-used. +* Make it immediately visible so that other CPUs trying +* to claim that work don't rely on us to handle their data +* while we are in the middle of the func. */ - work-flags = IRQ_WORK_BUSY; + xchg(work-flags, IRQ_WORK_BUSY); work-func(work); /* * Clear the BUSY bit and return to the free state if -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
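As a userspace illustration of the fixed run-side transition (C11 atomics standing in for the kernel's xchg()/cmpxchg(); the `sketch_` names are invented): publishing the claimed→busy step with an atomic exchange means a concurrent claimer reads either 3 (pending, give up) or 2 (busy, re-claim), never a stale value, and the final busy→free step only succeeds if nobody re-claimed the work meanwhile.

```c
#include <stdatomic.h>

#define IRQ_WORK_PENDING 1UL
#define IRQ_WORK_BUSY    2UL
#define IRQ_WORK_FLAGS   3UL

/*
 * Run-side transitions after the fix: pending|busy (3) -> busy (2)
 * via an atomic exchange (was a plain "flags = IRQ_WORK_BUSY" store
 * before), then busy (2) -> free (0) via compare-exchange.
 */
static void sketch_run_one(_Atomic unsigned long *flags)
{
	unsigned long busy = IRQ_WORK_BUSY;

	atomic_exchange(flags, IRQ_WORK_BUSY);

	/* work->func() would run here */

	/* busy -> free, unless the work was re-claimed meanwhile */
	atomic_compare_exchange_strong(flags, &busy, 0UL);
}
```

If the work is re-claimed while the callback runs (flags back to 3), the final compare-exchange fails and the work simply stays queued for another pass, which is the intended behavior.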
[PATCH 4/9] nohz: Add API to check tick state
We need some quick way to check if the CPU has stopped its tick. This will be useful to implement the printk tick using the irq work subsystem. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Cc: Peter Zijlstra pet...@infradead.org Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Steven Rostedt rost...@goodmis.org Cc: Paul Gortmaker paul.gortma...@windriver.com --- include/linux/tick.h | 17 - kernel/time/tick-sched.c |2 +- 2 files changed, 17 insertions(+), 2 deletions(-) diff --git a/include/linux/tick.h b/include/linux/tick.h index f37fceb..2307dd3 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -8,6 +8,8 @@ #include linux/clockchips.h #include linux/irqflags.h +#include linux/percpu.h +#include linux/hrtimer.h #ifdef CONFIG_GENERIC_CLOCKEVENTS @@ -122,13 +124,26 @@ static inline int tick_oneshot_mode_active(void) { return 0; } #endif /* !CONFIG_GENERIC_CLOCKEVENTS */ # ifdef CONFIG_NO_HZ +DECLARE_PER_CPU(struct tick_sched, tick_cpu_sched); + +static inline int tick_nohz_tick_stopped(void) +{ + return __this_cpu_read(tick_cpu_sched.tick_stopped); +} + extern void tick_nohz_idle_enter(void); extern void tick_nohz_idle_exit(void); extern void tick_nohz_irq_exit(void); extern ktime_t tick_nohz_get_sleep_length(void); extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time); extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time); -# else + +# else /* !CONFIG_NO_HZ */ +static inline int tick_nohz_tick_stopped(void) +{ + return 0; +} + static inline void tick_nohz_idle_enter(void) { } static inline void tick_nohz_idle_exit(void) { } diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index a402608..9e945aa 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -28,7 +28,7 @@ /* * Per cpu nohz control structure */ -static DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched); +DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched); /* * The 
time, when the last jiffy update happened. Protected by xtime_lock. -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 5/9] irq_work: Don't stop the tick with pending works
Don't stop the tick if we have pending irq works on the queue, otherwise if the arch can't raise self-IPIs, we may not find an opportunity to execute the pending works for a while. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Cc: Peter Zijlstra pet...@infradead.org Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Steven Rostedt rost...@goodmis.org Cc: Paul Gortmaker paul.gortma...@windriver.com --- include/linux/irq_work.h |6 ++ kernel/irq_work.c| 11 +++ kernel/time/tick-sched.c |3 ++- 3 files changed, 19 insertions(+), 1 deletions(-) diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h index 6a9e8f5..a69704f 100644 --- a/include/linux/irq_work.h +++ b/include/linux/irq_work.h @@ -20,4 +20,10 @@ bool irq_work_queue(struct irq_work *work); void irq_work_run(void); void irq_work_sync(struct irq_work *work); +#ifdef CONFIG_IRQ_WORK +bool irq_work_needs_cpu(void); +#else +static bool irq_work_needs_cpu(void) { return false; } +#endif + #endif /* _LINUX_IRQ_WORK_H */ diff --git a/kernel/irq_work.c b/kernel/irq_work.c index 64eddd5..b3c113a 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -99,6 +99,17 @@ bool irq_work_queue(struct irq_work *work) } EXPORT_SYMBOL_GPL(irq_work_queue); +bool irq_work_needs_cpu(void) +{ + struct llist_head *this_list; + + this_list = __get_cpu_var(irq_work_list); + if (llist_empty(this_list)) + return false; + + return true; +} + /* * Run the irq_work entries on this cpu. Requires to be ran from hardirq * context with local IRQs disabled. 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 9e945aa..f249e8c 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -20,6 +20,7 @@ #include linux/profile.h #include linux/sched.h #include linux/module.h +#include linux/irq_work.h #include asm/irq_regs.h @@ -289,7 +290,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts, } while (read_seqretry(xtime_lock, seq)); if (rcu_needs_cpu(cpu, rcu_delta_jiffies) || printk_needs_cpu(cpu) || - arch_needs_cpu(cpu)) { + arch_needs_cpu(cpu) || irq_work_needs_cpu()) { next_jiffies = last_jiffies + 1; delta_jiffies = 1; } else { -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
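To see what `irq_work_needs_cpu()` is actually inspecting, here is a small user-space sketch of the per-CPU work list, assuming a simplified Treiber-style push-only list in place of the kernel's `llist` (the names mirror the patch; the implementation is a stand-in, not the kernel's):

```c
/* Illustrative model of the lock-free per-CPU irq_work list.
 * llist_add() returns true if the list was empty before the add,
 * mirroring the kernel's llist_add() semantics. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct llist_node {
    struct llist_node *next;
};

struct llist_head {
    _Atomic(struct llist_node *) first;
};

static bool llist_add(struct llist_node *node, struct llist_head *head)
{
    struct llist_node *old = atomic_load(&head->first);
    do {
        node->next = old;   /* link ahead of the observed head */
    } while (!atomic_compare_exchange_weak(&head->first, &old, node));
    return old == NULL;
}

static bool llist_empty(struct llist_head *head)
{
    return atomic_load(&head->first) == NULL;
}

/* The check the tick-stop path performs: keep the tick alive
 * while anything is still queued. */
static bool irq_work_needs_cpu(struct llist_head *this_list)
{
    return !llist_empty(this_list);
}
```

With this in place, `tick_nohz_stop_sched_tick()` simply refuses to defer the next tick while the list is non-empty, which is exactly the property the patch needs on archs without self-IPIs.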
[PATCH 7/9] irq_work: Warn if there's still work on cpu_down
From: Steven Rostedt rost...@goodmis.org If we are in nohz and there's still irq_work to be done when the idle task is about to go offline, give a nasty warning. Everything should have been flushed from the CPU_DYING notifier already. Further attempts to enqueue an irq_work are buggy because irqs are disabled by __cpu_disable(). The best we can do is to report the issue to the user. Signed-off-by: Steven Rostedt rost...@goodmis.org Cc: Peter Zijlstra pet...@infradead.org Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Paul Gortmaker paul.gortma...@windriver.com Signed-off-by: Frederic Weisbecker fweis...@gmail.com --- kernel/irq_work.c |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/kernel/irq_work.c b/kernel/irq_work.c index cf8b657..fcaadae 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -108,6 +108,9 @@ bool irq_work_needs_cpu(void) if (llist_empty(this_list)) return false; + /* All work should have been flushed before going offline */ + WARN_ON_ONCE(cpu_is_offline(smp_processor_id())); + return true; } -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 9/9] printk: Wake up klogd using irq_work
klogd is woken up asynchronously from the tick in order to do it safely. However if printk is called when the tick is stopped, the reader won't be woken up until the next interrupt, which might not fire for a while. As a result, the user may miss some message. To fix this, lets implement the printk tick using a lazy irq work. This subsystem takes care of the timer tick state and can fix up accordingly. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Cc: Peter Zijlstra pet...@infradead.org Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Steven Rostedt rost...@goodmis.org Cc: Paul Gortmaker paul.gortma...@windriver.com --- include/linux/printk.h |3 --- init/Kconfig |1 + kernel/printk.c | 36 kernel/time/tick-sched.c |2 +- kernel/timer.c |1 - 5 files changed, 22 insertions(+), 21 deletions(-) diff --git a/include/linux/printk.h b/include/linux/printk.h index 9afc01e..86c4b62 100644 --- a/include/linux/printk.h +++ b/include/linux/printk.h @@ -98,9 +98,6 @@ int no_printk(const char *fmt, ...) extern asmlinkage __printf(1, 2) void early_printk(const char *fmt, ...); -extern int printk_needs_cpu(int cpu); -extern void printk_tick(void); - #ifdef CONFIG_PRINTK asmlinkage __printf(5, 0) int vprintk_emit(int facility, int level, diff --git a/init/Kconfig b/init/Kconfig index cdc152c..c575566 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1196,6 +1196,7 @@ config HOTPLUG config PRINTK default y bool Enable support for printk if EXPERT + select IRQ_WORK help This option enables normal printk support. 
Removing it eliminates most of the message strings from the kernel image diff --git a/kernel/printk.c b/kernel/printk.c index 2d607f4..c9104fe 100644 --- a/kernel/printk.c +++ b/kernel/printk.c @@ -42,6 +42,7 @@ #include linux/notifier.h #include linux/rculist.h #include linux/poll.h +#include linux/irq_work.h #include asm/uaccess.h @@ -1955,30 +1956,32 @@ int is_console_locked(void) static DEFINE_PER_CPU(int, printk_pending); static DEFINE_PER_CPU(char [PRINTK_BUF_SIZE], printk_sched_buf); -void printk_tick(void) +static void wake_up_klogd_work_func(struct irq_work *irq_work) { - if (__this_cpu_read(printk_pending)) { - int pending = __this_cpu_xchg(printk_pending, 0); - if (pending PRINTK_PENDING_SCHED) { - char *buf = __get_cpu_var(printk_sched_buf); - printk(KERN_WARNING [sched_delayed] %s, buf); - } - if (pending PRINTK_PENDING_WAKEUP) - wake_up_interruptible(log_wait); + int pending = __this_cpu_xchg(printk_pending, 0); + + if (pending PRINTK_PENDING_SCHED) { + char *buf = __get_cpu_var(printk_sched_buf); + printk(KERN_WARNING [sched_delayed] %s, buf); } -} -int printk_needs_cpu(int cpu) -{ - if (cpu_is_offline(cpu)) - printk_tick(); - return __this_cpu_read(printk_pending); + if (pending PRINTK_PENDING_WAKEUP) + wake_up_interruptible(log_wait); } +static DEFINE_PER_CPU(struct irq_work, wake_up_klogd_work) = { + .func = wake_up_klogd_work_func, + .flags = IRQ_WORK_LAZY, +}; + void wake_up_klogd(void) { - if (waitqueue_active(log_wait)) + preempt_disable(); + if (waitqueue_active(log_wait)) { this_cpu_or(printk_pending, PRINTK_PENDING_WAKEUP); + irq_work_queue(__get_cpu_var(wake_up_klogd_work)); + } + preempt_enable(); } static void console_cont_flush(char *text, size_t size) @@ -2458,6 +2461,7 @@ int printk_sched(const char *fmt, ...) 
va_end(args); __this_cpu_or(printk_pending, PRINTK_PENDING_SCHED); + irq_work_queue(__get_cpu_var(wake_up_klogd_work)); local_irq_restore(flags); return r; diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index f249e8c..822d757 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -289,7 +289,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts, time_delta = timekeeping_max_deferment(); } while (read_seqretry(xtime_lock, seq)); - if (rcu_needs_cpu(cpu, rcu_delta_jiffies) || printk_needs_cpu(cpu) || + if (rcu_needs_cpu(cpu, rcu_delta_jiffies) || arch_needs_cpu(cpu) || irq_work_needs_cpu()) { next_jiffies = last_jiffies + 1; delta_jiffies = 1; diff --git a/kernel/timer.c b/kernel/timer.c index 367d008..ff3b516 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -1351,7 +1351,6 @@ void update_process_times(int user_tick) account_process_tick(p, user_tick); run_local_timers(); rcu_check_callbacks(cpu, user_tick); - printk_tick(); #ifdef
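The hand-off between printk callers and the irq-work callback can be sketched in user space as follows (the `wakeups`/`sched_flushes` counters are hypothetical stand-ins for waking klogd and flushing the sched buffer; this is a model of the pattern, not the kernel code): producers OR a bit into `printk_pending`, and the callback consumes the whole mask atomically with an exchange so that no request is lost and none runs twice.

```c
/* Sketch of the printk_pending hand-off used by this patch. */
#include <stdatomic.h>

#define PRINTK_PENDING_WAKEUP 1
#define PRINTK_PENDING_SCHED  2

static atomic_int printk_pending;
static int wakeups, sched_flushes;      /* stand-ins for the real actions */

static void wake_up_klogd(void)         /* producer side */
{
    atomic_fetch_or(&printk_pending, PRINTK_PENDING_WAKEUP);
}

static void printk_sched_request(void)  /* producer side, printk_sched() */
{
    atomic_fetch_or(&printk_pending, PRINTK_PENDING_SCHED);
}

static void wake_up_klogd_work_func(void)   /* lazy irq-work callback */
{
    /* Consume every pending bit in one shot. */
    int pending = atomic_exchange(&printk_pending, 0);

    if (pending & PRINTK_PENDING_SCHED)
        sched_flushes++;                /* kernel: print the sched buffer */
    if (pending & PRINTK_PENDING_WAKEUP)
        wakeups++;                      /* kernel: wake_up_interruptible(log_wait) */
}
```

Multiple producer calls before the callback runs coalesce into a single action of each kind, which is why the irq work can safely be lazy.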
[PATCH 8/9] irq_work: Make self-IPIs optable
On irq work initialization, let the user choose to define it as lazy or not. Lazy means that we don't want to send an IPI (provided the arch can anyway) when we enqueue this work but we rather prefer to wait for the next timer tick to execute our work if possible. This is going to be a benefit for non-urgent enqueuers (like printk in the future) that may prefer not to raise an IPI storm in case of frequent enqueuing on short periods of time. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Cc: Peter Zijlstra pet...@infradead.org Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Steven Rostedt rost...@goodmis.org Cc: Paul Gortmaker paul.gortma...@windriver.com --- include/linux/irq_work.h | 14 ++ kernel/irq_work.c| 46 ++ 2 files changed, 40 insertions(+), 20 deletions(-) diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h index a69704f..b28eb60 100644 --- a/include/linux/irq_work.h +++ b/include/linux/irq_work.h @@ -3,6 +3,20 @@ #include linux/llist.h +/* + * An entry can be in one of four states: + * + * free NULL, 0 - {claimed} : free to be used + * claimed NULL, 3 - {pending} : claimed to be enqueued + * pending next, 3 - {busy} : queued, pending callback + * busy NULL, 2 - {free, claimed} : callback in progress, can be claimed + */ + +#define IRQ_WORK_PENDING 1UL +#define IRQ_WORK_BUSY 2UL +#define IRQ_WORK_FLAGS 3UL +#define IRQ_WORK_LAZY 4UL /* Doesn't want IPI, wait for tick */ + struct irq_work { unsigned long flags; struct llist_node llnode; diff --git a/kernel/irq_work.c b/kernel/irq_work.c index fcaadae..cef098d 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -12,23 +12,14 @@ #include linux/percpu.h #include linux/hardirq.h #include linux/irqflags.h +#include linux/sched.h +#include linux/tick.h #include linux/cpu.h #include asm/processor.h -/* - * An entry can be in one of four states: - * - * free NULL, 0 - {claimed} : free to be used - * claimed NULL, 3 - 
{pending} : claimed to be enqueued - * pending next, 3 - {busy} : queued, pending callback - * busy NULL, 2 - {free, claimed} : callback in progress, can be claimed - */ - -#define IRQ_WORK_PENDING 1UL -#define IRQ_WORK_BUSY 2UL -#define IRQ_WORK_FLAGS 3UL static DEFINE_PER_CPU(struct llist_head, irq_work_list); +static DEFINE_PER_CPU(int, irq_work_raised); /* * Claim the entry so that no one else will poke at it. @@ -68,14 +59,18 @@ void __weak arch_irq_work_raise(void) */ static void __irq_work_queue(struct irq_work *work) { - bool empty; - preempt_disable(); - empty = llist_add(work-llnode, __get_cpu_var(irq_work_list)); - /* The list was empty, raise self-interrupt to start processing. */ - if (empty) - arch_irq_work_raise(); + llist_add(work-llnode, __get_cpu_var(irq_work_list)); + + /* +* If the work is flagged as lazy, just wait for the next tick +* to run it. Otherwise, or if the tick is stopped, raise the irq work. +*/ + if (!(work-flags IRQ_WORK_LAZY) || tick_nohz_tick_stopped()) { + if (!this_cpu_cmpxchg(irq_work_raised, 0, 1)) + arch_irq_work_raise(); + } preempt_enable(); } @@ -116,10 +111,19 @@ bool irq_work_needs_cpu(void) static void __irq_work_run(void) { + unsigned long flags; struct irq_work *work; struct llist_head *this_list; struct llist_node *llnode; + + /* +* Reset the raised state right before we check the list because +* an NMI may enqueue after we find the list empty from the runner. +*/ + __this_cpu_write(irq_work_raised, 0); + barrier(); + this_list = __get_cpu_var(irq_work_list); if (llist_empty(this_list)) return; @@ -139,13 +143,15 @@ static void __irq_work_run(void) * to claim that work don't rely on us to handle their data * while we are in the middle of the func. */ - xchg(work-flags, IRQ_WORK_BUSY); + flags = work-flags ~IRQ_WORK_PENDING; + xchg(work-flags, flags); + work-func(work); /* * Clear the BUSY bit and return to the free state if * no-one else claimed it meanwhile. 
*/ - (void)cmpxchg(work-flags, IRQ_WORK_BUSY, 0); + (void)cmpxchg(work-flags, flags, flags ~IRQ_WORK_BUSY); } } -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read
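The enqueue-side decision this patch adds can be condensed into one predicate. In the sketch below, `want_ipi()` is a hypothetical helper (not a kernel function) modeling the logic of `__irq_work_queue()`: lazy work with a running tick waits for the tick; everything else races on the per-CPU `irq_work_raised` latch so only the first enqueuer raises the interrupt.

```c
/* User-space model of the lazy-IPI decision in __irq_work_queue(). */
#include <stdatomic.h>
#include <stdbool.h>

#define IRQ_WORK_LAZY 4UL   /* Doesn't want IPI, wait for tick */

/* Returns true when arch_irq_work_raise() would be called. */
static bool want_ipi(unsigned long work_flags, bool tick_stopped,
                     atomic_int *irq_work_raised)
{
    if ((work_flags & IRQ_WORK_LAZY) && !tick_stopped)
        return false;   /* lazy work, tick running: next tick runs it */

    /* First enqueuer flips the latch and sends the IPI; later ones don't. */
    int expected = 0;
    return atomic_compare_exchange_strong(irq_work_raised, &expected, 1);
}
```

The latch is what keeps frequent non-lazy enqueues from turning into an IPI storm; the runner resets it (with a barrier) right before rescanning the list, as the patch's `__irq_work_run()` hunk shows.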
[PATCH 0/9] printk: Make it usable on nohz cpus v6
Hi, Previous patches haven't changed. This pile just adds two patches from Steven Rostedt to ensure all pending irq works are executed before we offline a CPU. The branch can be found at: git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git nohz/printk-v6 Thanks. Frederic Weisbecker (7): irq_work: Fix racy IRQ_WORK_BUSY flag setting irq_work: Fix racy check on work pending flag irq_work: Remove CONFIG_HAVE_IRQ_WORK nohz: Add API to check tick state irq_work: Don't stop the tick with pending works irq_work: Make self-IPIs optable printk: Wake up klogd using irq_work Steven Rostedt (2): irq_work: Flush work on CPU_DYING irq_work: Warn if there's still work on cpu_down arch/alpha/Kconfig |1 - arch/arm/Kconfig|1 - arch/arm64/Kconfig |1 - arch/blackfin/Kconfig |1 - arch/frv/Kconfig|1 - arch/hexagon/Kconfig|1 - arch/mips/Kconfig |1 - arch/parisc/Kconfig |1 - arch/powerpc/Kconfig|1 - arch/s390/Kconfig |1 - arch/sh/Kconfig |1 - arch/sparc/Kconfig |1 - arch/x86/Kconfig|1 - drivers/staging/iio/trigger/Kconfig |1 - include/linux/irq_work.h| 20 ++ include/linux/printk.h |3 - include/linux/tick.h| 17 - init/Kconfig|5 +- kernel/irq_work.c | 129 ++ kernel/printk.c | 36 ++ kernel/time/tick-sched.c|7 +- kernel/timer.c |1 - 22 files changed, 159 insertions(+), 73 deletions(-) -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/9] irq_work: Fix racy check on work pending flag
Work claiming wants to be SMP-safe. And by the time we try to claim a work, if it is already executing concurrently on another CPU, we want to succeed the claiming and queue the work again because the other CPU may have missed the data we wanted to handle in our work if it's about to complete there. This scenario is summarized below:

        CPU 1                                 CPU 2
        -----                                 -----
        (flags = 0)
        cmpxchg(flags, 0, IRQ_WORK_FLAGS)
        (flags = 3)
        [...]
        xchg(flags, IRQ_WORK_BUSY)
        (flags = 2)
        func()
                                              if (flags & IRQ_WORK_PENDING)
                                                      (not true)
                                              cmpxchg(flags, flags, IRQ_WORK_FLAGS)
                                              (flags = 3)
        [...]
        cmpxchg(flags, IRQ_WORK_BUSY, 0);
        (fail, pending on CPU 2)

This state machine is synchronized using [cmp]xchg() on the flags. As such, the early IRQ_WORK_PENDING check on CPU 2 above is racy: by the time we check it, we may be dealing with a stale value because we aren't using an atomic accessor. As a result, CPU 2 may see that the work is still pending on another CPU while it may actually be completing the work function execution already, leaving our data unprocessed. To fix this, we start by speculating about the value we wish to be in work->flags, but we only draw a conclusion from the value returned by the cmpxchg() call, which either claims the work or lets the current owner handle the pending work for us.
Changelog-heavily-inspired-by: Steven Rostedt rost...@goodmis.org
Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Acked-by: Steven Rostedt rost...@goodmis.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@kernel.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: Andrew Morton a...@linux-foundation.org
Cc: Paul Gortmaker paul.gortma...@windriver.com
Cc: Anish Kumar anish198519851...@gmail.com
---
 kernel/irq_work.c | 16 +++++++++++-----
 1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 57be1a6..64eddd5 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -34,15 +34,21 @@ static DEFINE_PER_CPU(struct llist_head, irq_work_list);
  */
 static bool irq_work_claim(struct irq_work *work)
 {
-	unsigned long flags, nflags;
+	unsigned long flags, oflags, nflags;
 
+	/*
+	 * Start with our best wish as a premise but only trust any
+	 * flag value after cmpxchg() result.
+	 */
+	flags = work->flags & ~IRQ_WORK_PENDING;
 	for (;;) {
-		flags = work->flags;
-		if (flags & IRQ_WORK_PENDING)
-			return false;
 		nflags = flags | IRQ_WORK_FLAGS;
-		if (cmpxchg(&work->flags, flags, nflags) == flags)
+		oflags = cmpxchg(&work->flags, flags, nflags);
+		if (oflags == flags)
 			break;
+		if (oflags & IRQ_WORK_PENDING)
+			return false;
+		flags = oflags;
 		cpu_relax();
 	}
-- 
1.7.5.4
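To make the fixed claim loop concrete, here is a user-space transliteration with C11 atomics standing in for the kernel's cmpxchg() (illustrative only — `atomic_compare_exchange_strong()` conveniently returns the observed value in its `expected` argument on failure, just as cmpxchg() returns the old value):

```c
/* User-space sketch of the fixed irq_work_claim(), not the kernel code. */
#include <stdatomic.h>
#include <stdbool.h>

#define IRQ_WORK_PENDING 1UL
#define IRQ_WORK_BUSY    2UL
#define IRQ_WORK_FLAGS   3UL   /* PENDING | BUSY */

struct irq_work {
    atomic_ulong flags;
};

/* Returns false only when someone else provably holds the work PENDING,
 * as witnessed by the cmpxchg result rather than by a racy plain read. */
static bool irq_work_claim(struct irq_work *work)
{
    unsigned long flags, oflags, nflags;

    /* Speculate that PENDING is clear; trust only cmpxchg results. */
    flags = atomic_load(&work->flags) & ~IRQ_WORK_PENDING;
    for (;;) {
        nflags = flags | IRQ_WORK_FLAGS;
        oflags = flags;
        if (atomic_compare_exchange_strong(&work->flags, &oflags, nflags))
            break;          /* we installed PENDING|BUSY: claimed */
        if (oflags & IRQ_WORK_PENDING)
            return false;   /* concurrently claimed by another CPU */
        flags = oflags;     /* retry from the freshly observed value */
    }
    return true;
}
```

Note how a work observed BUSY (callback in flight) is still claimable, which is exactly the requeue-while-executing behavior the changelog asks for.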
[PATCH 6/9] irq_work: Flush work on CPU_DYING
From: Steven Rostedt rost...@goodmis.org In order not to offline a CPU with pending irq works, flush the queue from CPU_DYING. The notifier is called by stop_machine on the CPU that is going down. The code will not be called from irq context (so things like get_irq_regs() wont work) but I'm not sure what the requirements are for irq_work in that regard (Peter?). But irqs are disabled and the CPU is about to go offline. Might as well flush the work. Signed-off-by: Steven Rostedt rost...@goodmis.org Cc: Peter Zijlstra pet...@infradead.org Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Paul Gortmaker paul.gortma...@windriver.com Signed-off-by: Frederic Weisbecker fweis...@gmail.com --- kernel/irq_work.c | 50 -- 1 files changed, 44 insertions(+), 6 deletions(-) diff --git a/kernel/irq_work.c b/kernel/irq_work.c index b3c113a..cf8b657 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -12,6 +12,7 @@ #include linux/percpu.h #include linux/hardirq.h #include linux/irqflags.h +#include linux/cpu.h #include asm/processor.h /* @@ -110,11 +111,7 @@ bool irq_work_needs_cpu(void) return true; } -/* - * Run the irq_work entries on this cpu. Requires to be ran from hardirq - * context with local IRQs disabled. - */ -void irq_work_run(void) +static void __irq_work_run(void) { struct irq_work *work; struct llist_head *this_list; @@ -124,7 +121,6 @@ void irq_work_run(void) if (llist_empty(this_list)) return; - BUG_ON(!in_irq()); BUG_ON(!irqs_disabled()); llnode = llist_del_all(this_list); @@ -149,6 +145,16 @@ void irq_work_run(void) (void)cmpxchg(work-flags, IRQ_WORK_BUSY, 0); } } + +/* + * Run the irq_work entries on this cpu. Requires to be ran from hardirq + * context with local IRQs disabled. 
+ */ +void irq_work_run(void) +{ + BUG_ON(!in_irq()); + __irq_work_run(); +} EXPORT_SYMBOL_GPL(irq_work_run); /* @@ -163,3 +169,35 @@ void irq_work_sync(struct irq_work *work) cpu_relax(); } EXPORT_SYMBOL_GPL(irq_work_sync); + +#ifdef CONFIG_HOTPLUG_CPU +static int irq_work_cpu_notify(struct notifier_block *self, + unsigned long action, void *hcpu) +{ + long cpu = (long)hcpu; + + switch (action) { + case CPU_DYING: + /* Called from stop_machine */ + if (WARN_ON_ONCE(cpu != smp_processor_id())) + break; + __irq_work_run(); + break; + default: + break; + } + return NOTIFY_OK; +} + +static struct notifier_block cpu_notify; + +static __init int irq_work_init_cpu_notifier(void) +{ + cpu_notify.notifier_call = irq_work_cpu_notify; + cpu_notify.priority = 0; + register_cpu_notifier(cpu_notify); + return 0; +} +device_initcall(irq_work_init_cpu_notifier); + +#endif /* CONFIG_HOTPLUG_CPU */ -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
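A minimal model of the CPU_DYING hook added here (illustrative, user-space: the `pending_works` counter is a stand-in for the per-CPU llist, and only the notifier's dispatch logic is reproduced): the notifier runs on the dying CPU under stop_machine and drains everything so no work is stranded.

```c
/* Sketch of irq_work_cpu_notify(): flush pending work on CPU_DYING. */
#define CPU_DYING  1
#define CPU_ONLINE 2
#define NOTIFY_OK  0

static int pending_works;   /* stand-in for the per-CPU irq_work list */

static void __irq_work_run(void)
{
    pending_works = 0;      /* kernel: pop and run every queued work */
}

static int irq_work_cpu_notify(unsigned long action)
{
    switch (action) {
    case CPU_DYING:
        /* Runs via stop_machine on the CPU going down, irqs disabled:
         * flush now, since later enqueues can no longer be serviced. */
        __irq_work_run();
        break;
    default:
        break;
    }
    return NOTIFY_OK;
}
```

Any other hotplug action passes through untouched, which is why the following patch can WARN if work is still queued once the CPU is actually offline.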
[PATCH 3/9] irq_work: Remove CONFIG_HAVE_IRQ_WORK
irq work can run on any arch even without IPI support because of the hook on update_process_times(). So lets remove HAVE_IRQ_WORK because it doesn't reflect any backend requirement. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Cc: Peter Zijlstra pet...@infradead.org Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Steven Rostedt rost...@goodmis.org Cc: Paul Gortmaker paul.gortma...@windriver.com --- arch/alpha/Kconfig |1 - arch/arm/Kconfig|1 - arch/arm64/Kconfig |1 - arch/blackfin/Kconfig |1 - arch/frv/Kconfig|1 - arch/hexagon/Kconfig|1 - arch/mips/Kconfig |1 - arch/parisc/Kconfig |1 - arch/powerpc/Kconfig|1 - arch/s390/Kconfig |1 - arch/sh/Kconfig |1 - arch/sparc/Kconfig |1 - arch/x86/Kconfig|1 - drivers/staging/iio/trigger/Kconfig |1 - init/Kconfig|4 15 files changed, 0 insertions(+), 18 deletions(-) diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig index 5dd7f5d..e56c2d1 100644 --- a/arch/alpha/Kconfig +++ b/arch/alpha/Kconfig @@ -5,7 +5,6 @@ config ALPHA select HAVE_IDE select HAVE_OPROFILE select HAVE_SYSCALL_WRAPPERS - select HAVE_IRQ_WORK select HAVE_PCSPKR_PLATFORM select HAVE_PERF_EVENTS select HAVE_DMA_ATTRS diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index ade7e92..22d378b 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -36,7 +36,6 @@ config ARM select HAVE_GENERIC_HARDIRQS select HAVE_HW_BREAKPOINT if (PERF_EVENTS (CPU_V6 || CPU_V6K || CPU_V7)) select HAVE_IDE if PCI || ISA || PCMCIA - select HAVE_IRQ_WORK select HAVE_KERNEL_GZIP select HAVE_KERNEL_LZMA select HAVE_KERNEL_LZO diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index ef54a59..dd50d72 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -17,7 +17,6 @@ config ARM64 select HAVE_GENERIC_DMA_COHERENT select HAVE_GENERIC_HARDIRQS select HAVE_HW_BREAKPOINT if PERF_EVENTS - select HAVE_IRQ_WORK select HAVE_MEMBLOCK select HAVE_PERF_EVENTS select HAVE_SPARSE_IRQ diff --git 
a/arch/blackfin/Kconfig b/arch/blackfin/Kconfig index b6f3ad5..86f891f 100644 --- a/arch/blackfin/Kconfig +++ b/arch/blackfin/Kconfig @@ -24,7 +24,6 @@ config BLACKFIN select HAVE_FUNCTION_TRACER select HAVE_FUNCTION_TRACE_MCOUNT_TEST select HAVE_IDE - select HAVE_IRQ_WORK select HAVE_KERNEL_GZIP if RAMKERNEL select HAVE_KERNEL_BZIP2 if RAMKERNEL select HAVE_KERNEL_LZMA if RAMKERNEL diff --git a/arch/frv/Kconfig b/arch/frv/Kconfig index df2eb4b..c44fd6e 100644 --- a/arch/frv/Kconfig +++ b/arch/frv/Kconfig @@ -3,7 +3,6 @@ config FRV default y select HAVE_IDE select HAVE_ARCH_TRACEHOOK - select HAVE_IRQ_WORK select HAVE_PERF_EVENTS select HAVE_UID16 select HAVE_GENERIC_HARDIRQS diff --git a/arch/hexagon/Kconfig b/arch/hexagon/Kconfig index 0744f7d..40a3185 100644 --- a/arch/hexagon/Kconfig +++ b/arch/hexagon/Kconfig @@ -14,7 +14,6 @@ config HEXAGON # select HAVE_CLK # select IRQ_PER_CPU # select GENERIC_PENDING_IRQ if SMP - select HAVE_IRQ_WORK select GENERIC_ATOMIC64 select HAVE_PERF_EVENTS select HAVE_GENERIC_HARDIRQS diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index dba9390..3d86d69 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -4,7 +4,6 @@ config MIPS select HAVE_GENERIC_DMA_COHERENT select HAVE_IDE select HAVE_OPROFILE - select HAVE_IRQ_WORK select HAVE_PERF_EVENTS select PERF_USE_VMALLOC select HAVE_ARCH_KGDB diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig index 11def45..8f0df47 100644 --- a/arch/parisc/Kconfig +++ b/arch/parisc/Kconfig @@ -9,7 +9,6 @@ config PARISC select RTC_DRV_GENERIC select INIT_ALL_POSSIBLE select BUG - select HAVE_IRQ_WORK select HAVE_PERF_EVENTS select GENERIC_ATOMIC64 if !64BIT select HAVE_GENERIC_HARDIRQS diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index a902a5c..a90f0c9 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -118,7 +118,6 @@ config PPC select HAVE_SYSCALL_WRAPPERS if PPC64 select GENERIC_ATOMIC64 if PPC32 select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE - select HAVE_IRQ_WORK 
select HAVE_PERF_EVENTS select HAVE_REGS_AND_STACK_ACCESS_API select HAVE_HW_BREAKPOINT if PERF_EVENTS PPC_BOOK3S_64 diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 5dba755..0816ff0 100644
Re: [PATCH 8/9] irq_work: Make self-IPIs optable
2012/11/16 Steven Rostedt rost...@goodmis.org: On Fri, 2012-11-16 at 03:21 +0100, Frederic Weisbecker wrote: /* * Claim the entry so that no one else will poke at it. @@ -68,14 +59,18 @@ void __weak arch_irq_work_raise(void) */ static void __irq_work_queue(struct irq_work *work) { - bool empty; - preempt_disable(); - empty = llist_add(work-llnode, __get_cpu_var(irq_work_list)); - /* The list was empty, raise self-interrupt to start processing. */ - if (empty) - arch_irq_work_raise(); + llist_add(work-llnode, __get_cpu_var(irq_work_list)); + + /* + * If the work is flagged as lazy, just wait for the next tick + * to run it. Otherwise, or if the tick is stopped, raise the irq work. Speaking more Greek? ;-) How about: If the work is not lazy or the tick is stopped, raise the irq work interrupt (if supported by the arch), otherwise, just wait for the next tick. Much better :) Other than that, Acked-by: Steven Rostedt rost...@goodmis.org Thanks! -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[GIT PULL v2] printk: Make it usable on nohz cpus
Ingo, Please pull the printk support in dynticks mode patches that can be found at: git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git tags/printk-dynticks-for-mingo-v2 for you to fetch changes up to 74876a98a87a115254b3a66a14b27320b7f0acaa: printk: Wake up klogd using irq_work (2012-11-18 01:01:49 +0100) It is based on v3.7-rc4. Changes since previous pull request include support for irq work flush on CPU offlining and acks from Steve. The rest hasn't changed except some comment fix. Thanks. Support for printk in dynticks mode: * Fix two races in irq work claiming * Generalize irq_work support to all archs * Don't stop tick with irq works pending. This fix is generally useful and concerns archs that can't raise self IPIs. * Flush irq works before CPU offlining. * Introduce lazy irq works that can wait for the next tick to be executed, unless it's stopped. * Implement klogd wake up using irq work. This removes the ad-hoc printk_tick()/printk_needs_cpu() hooks and make it working even in dynticks mode. 
Signed-off-by: Frederic Weisbecker fweis...@gmail.com Frederic Weisbecker (7): irq_work: Fix racy IRQ_WORK_BUSY flag setting irq_work: Fix racy check on work pending flag irq_work: Remove CONFIG_HAVE_IRQ_WORK nohz: Add API to check tick state irq_work: Don't stop the tick with pending works irq_work: Make self-IPIs optable printk: Wake up klogd using irq_work Steven Rostedt (2): irq_work: Flush work on CPU_DYING irq_work: Warn if there's still work on cpu_down arch/alpha/Kconfig |1 - arch/arm/Kconfig|1 - arch/arm64/Kconfig |1 - arch/blackfin/Kconfig |1 - arch/frv/Kconfig|1 - arch/hexagon/Kconfig|1 - arch/mips/Kconfig |1 - arch/parisc/Kconfig |1 - arch/powerpc/Kconfig|1 - arch/s390/Kconfig |1 - arch/sh/Kconfig |1 - arch/sparc/Kconfig |1 - arch/x86/Kconfig|1 - drivers/staging/iio/trigger/Kconfig |1 - include/linux/irq_work.h| 20 ++ include/linux/printk.h |3 - include/linux/tick.h| 17 - init/Kconfig|5 +- kernel/irq_work.c | 131 ++- kernel/printk.c | 36 +- kernel/time/tick-sched.c|7 +- kernel/timer.c |1 - 22 files changed, 161 insertions(+), 73 deletions(-) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 4/9] nohz: Add API to check tick state
We need some quick way to check if the CPU has stopped its tick. This will be useful to implement the printk tick using the irq work subsystem. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Acked-by: Steven Rostedt rost...@goodmis.org Cc: Peter Zijlstra pet...@infradead.org Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Paul Gortmaker paul.gortma...@windriver.com --- include/linux/tick.h | 17 - kernel/time/tick-sched.c |2 +- 2 files changed, 17 insertions(+), 2 deletions(-) diff --git a/include/linux/tick.h b/include/linux/tick.h index f37fceb..2307dd3 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -8,6 +8,8 @@ #include linux/clockchips.h #include linux/irqflags.h +#include linux/percpu.h +#include linux/hrtimer.h #ifdef CONFIG_GENERIC_CLOCKEVENTS @@ -122,13 +124,26 @@ static inline int tick_oneshot_mode_active(void) { return 0; } #endif /* !CONFIG_GENERIC_CLOCKEVENTS */ # ifdef CONFIG_NO_HZ +DECLARE_PER_CPU(struct tick_sched, tick_cpu_sched); + +static inline int tick_nohz_tick_stopped(void) +{ + return __this_cpu_read(tick_cpu_sched.tick_stopped); +} + extern void tick_nohz_idle_enter(void); extern void tick_nohz_idle_exit(void); extern void tick_nohz_irq_exit(void); extern ktime_t tick_nohz_get_sleep_length(void); extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time); extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time); -# else + +# else /* !CONFIG_NO_HZ */ +static inline int tick_nohz_tick_stopped(void) +{ + return 0; +} + static inline void tick_nohz_idle_enter(void) { } static inline void tick_nohz_idle_exit(void) { } diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index a402608..9e945aa 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -28,7 +28,7 @@ /* * Per cpu nohz control structure */ -static DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched); +DEFINE_PER_CPU(struct tick_sched, tick_cpu_sched); /* 
* The time, when the last jiffy update happened. Protected by xtime_lock. -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 5/9] irq_work: Don't stop the tick with pending works
Don't stop the tick if we have pending irq works on the queue, otherwise if the arch can't raise self-IPIs, we may not find an opportunity to execute the pending works for a while. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Acked-by: Steven Rostedt rost...@goodmis.org Cc: Peter Zijlstra pet...@infradead.org Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Paul Gortmaker paul.gortma...@windriver.com --- include/linux/irq_work.h |6 ++ kernel/irq_work.c| 11 +++ kernel/time/tick-sched.c |3 ++- 3 files changed, 19 insertions(+), 1 deletions(-) diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h index 6a9e8f5..a69704f 100644 --- a/include/linux/irq_work.h +++ b/include/linux/irq_work.h @@ -20,4 +20,10 @@ bool irq_work_queue(struct irq_work *work); void irq_work_run(void); void irq_work_sync(struct irq_work *work); +#ifdef CONFIG_IRQ_WORK +bool irq_work_needs_cpu(void); +#else +static bool irq_work_needs_cpu(void) { return false; } +#endif + #endif /* _LINUX_IRQ_WORK_H */ diff --git a/kernel/irq_work.c b/kernel/irq_work.c index 64eddd5..b3c113a 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -99,6 +99,17 @@ bool irq_work_queue(struct irq_work *work) } EXPORT_SYMBOL_GPL(irq_work_queue); +bool irq_work_needs_cpu(void) +{ + struct llist_head *this_list; + + this_list = __get_cpu_var(irq_work_list); + if (llist_empty(this_list)) + return false; + + return true; +} + /* * Run the irq_work entries on this cpu. Requires to be ran from hardirq * context with local IRQs disabled. 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 9e945aa..f249e8c 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -20,6 +20,7 @@ #include linux/profile.h #include linux/sched.h #include linux/module.h +#include linux/irq_work.h #include asm/irq_regs.h @@ -289,7 +290,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts, } while (read_seqretry(xtime_lock, seq)); if (rcu_needs_cpu(cpu, rcu_delta_jiffies) || printk_needs_cpu(cpu) || - arch_needs_cpu(cpu)) { + arch_needs_cpu(cpu) || irq_work_needs_cpu()) { next_jiffies = last_jiffies + 1; delta_jiffies = 1; } else { -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 8/9] irq_work: Make self-IPIs optable
On irq work initialization, let the user choose to define it as lazy or not. Lazy means that we don't want to send an IPI (provided the arch can anyway) when we enqueue this work but we rather prefer to wait for the next timer tick to execute our work if possible. This is going to be a benefit for non-urgent enqueuers (like printk in the future) that may prefer not to raise an IPI storm in case of frequent enqueuing on short periods of time. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Acked-by: Steven Rostedt rost...@goodmis.org Cc: Peter Zijlstra pet...@infradead.org Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Paul Gortmaker paul.gortma...@windriver.com --- include/linux/irq_work.h | 14 + kernel/irq_work.c| 47 ++--- 2 files changed, 41 insertions(+), 20 deletions(-) diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h index a69704f..b28eb60 100644 --- a/include/linux/irq_work.h +++ b/include/linux/irq_work.h @@ -3,6 +3,20 @@ #include linux/llist.h +/* + * An entry can be in one of four states: + * + * free NULL, 0 - {claimed} : free to be used + * claimed NULL, 3 - {pending} : claimed to be enqueued + * pending next, 3 - {busy} : queued, pending callback + * busy NULL, 2 - {free, claimed} : callback in progress, can be claimed + */ + +#define IRQ_WORK_PENDING 1UL +#define IRQ_WORK_BUSY 2UL +#define IRQ_WORK_FLAGS 3UL +#define IRQ_WORK_LAZY 4UL /* Doesn't want IPI, wait for tick */ + struct irq_work { unsigned long flags; struct llist_node llnode; diff --git a/kernel/irq_work.c b/kernel/irq_work.c index 480f747..7f3a59b 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -12,24 +12,15 @@ #include linux/percpu.h #include linux/hardirq.h #include linux/irqflags.h +#include linux/sched.h +#include linux/tick.h #include linux/cpu.h #include linux/notifier.h #include asm/processor.h -/* - * An entry can be in one of four states: - * - * free NULL, 0 - {claimed} : free 
to be used - * claimed NULL, 3 - {pending} : claimed to be enqueued - * pending next, 3 - {busy} : queued, pending callback - * busy NULL, 2 - {free, claimed} : callback in progress, can be claimed - */ - -#define IRQ_WORK_PENDING 1UL -#define IRQ_WORK_BUSY 2UL -#define IRQ_WORK_FLAGS 3UL static DEFINE_PER_CPU(struct llist_head, irq_work_list); +static DEFINE_PER_CPU(int, irq_work_raised); /* * Claim the entry so that no one else will poke at it. @@ -69,14 +60,19 @@ void __weak arch_irq_work_raise(void) */ static void __irq_work_queue(struct irq_work *work) { - bool empty; - preempt_disable(); - empty = llist_add(work-llnode, __get_cpu_var(irq_work_list)); - /* The list was empty, raise self-interrupt to start processing. */ - if (empty) - arch_irq_work_raise(); + llist_add(work-llnode, __get_cpu_var(irq_work_list)); + + /* +* If the work is not lazy or the tick is stopped, raise the irq +* work interrupt (if supported by the arch), otherwise, just wait +* for the next tick. +*/ + if (!(work-flags IRQ_WORK_LAZY) || tick_nohz_tick_stopped()) { + if (!this_cpu_cmpxchg(irq_work_raised, 0, 1)) + arch_irq_work_raise(); + } preempt_enable(); } @@ -117,10 +113,19 @@ bool irq_work_needs_cpu(void) static void __irq_work_run(void) { + unsigned long flags; struct irq_work *work; struct llist_head *this_list; struct llist_node *llnode; + + /* +* Reset the raised state right before we check the list because +* an NMI may enqueue after we find the list empty from the runner. +*/ + __this_cpu_write(irq_work_raised, 0); + barrier(); + this_list = __get_cpu_var(irq_work_list); if (llist_empty(this_list)) return; @@ -140,13 +145,15 @@ static void __irq_work_run(void) * to claim that work don't rely on us to handle their data * while we are in the middle of the func. */ - xchg(work-flags, IRQ_WORK_BUSY); + flags = work-flags ~IRQ_WORK_PENDING; + xchg(work-flags, flags); + work-func(work); /* * Clear the BUSY bit and return to the free state if * no-one else claimed it meanwhile. 
 */
-	(void)cmpxchg(&work->flags, IRQ_WORK_BUSY, 0);
+	(void)cmpxchg(&work->flags, flags, flags & ~IRQ_WORK_BUSY);
 	}
 }
--
1.7.5.4
[PATCH 1/9] irq_work: Fix racy IRQ_WORK_BUSY flag setting
The IRQ_WORK_BUSY flag is set right before we execute the work. Once this flag value is set, the work enters a claimable state again. So if we have specific data to compute in our work, we ensure it's either handled by another CPU or locally by enqueuing the work again. This state machine is guaranteed by atomic operations on the flags. So when we set IRQ_WORK_BUSY without using an xchg-like operation, we break this guarantee, as in the following summarized scenario:

            CPU 1                                 CPU 2
            -----                                 -----
(flags = 0)
old_flags = flags;
(flags = 0)
cmpxchg(flags, old_flags,
        old_flags | IRQ_WORK_FLAGS)
(flags = 3)
[...]
flags = IRQ_WORK_BUSY
(flags = 2)
func()
                                      (sees flags = 3)
                                      cmpxchg(flags, old_flags,
                                              old_flags | IRQ_WORK_FLAGS)
                                      (give up)
cmpxchg(flags, 2, 0);
(flags = 0)

CPU 1 claims a work and executes it, so it sets IRQ_WORK_BUSY and the work is again in a claimable state. Now CPU 2 has new data to process and tries to claim that work, but it may see a stale value of the flags and think the work is still pending somewhere that will handle our data. This is because CPU 1 doesn't set IRQ_WORK_BUSY atomically. As a result, the data expected to be handled by CPU 2 won't get handled. To fix this, use xchg() to set IRQ_WORK_BUSY; this way we ensure that CPU 2 will see the correct value with cmpxchg() using the expected ordering.
Changelog-heavily-inspired-by: Steven Rostedt rost...@goodmis.org Signed-off-by: Frederic Weisbecker fweis...@gmail.com Acked-by: Steven Rostedt rost...@goodmis.org Cc: Peter Zijlstra pet...@infradead.org Cc: Ingo Molnar mi...@kernel.org Cc: Thomas Gleixner t...@linutronix.de Cc: Andrew Morton a...@linux-foundation.org Cc: Paul Gortmaker paul.gortma...@windriver.com Cc: Anish Kumar anish198519851...@gmail.com --- kernel/irq_work.c |5 - 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/kernel/irq_work.c b/kernel/irq_work.c index 1588e3b..57be1a6 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -119,8 +119,11 @@ void irq_work_run(void) /* * Clear the PENDING bit, after this point the @work * can be re-used. +* Make it immediately visible so that other CPUs trying +* to claim that work don't rely on us to handle their data +* while we are in the middle of the func. */ - work-flags = IRQ_WORK_BUSY; + xchg(work-flags, IRQ_WORK_BUSY); work-func(work); /* * Clear the BUSY bit and return to the free state if -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 7/9] irq_work: Warn if there's still work on cpu_down
From: Steven Rostedt rost...@goodmis.org If we are in nohz and there's still irq_work to be done when the idle task is about to go offline, give a nasty warning. Everything should have been flushed from the CPU_DYING notifier already. Further attempts to enqueue an irq_work are buggy because irqs are disabled by __cpu_disable(). The best we can do is to report the issue to the user. Signed-off-by: Steven Rostedt rost...@goodmis.org Cc: Peter Zijlstra pet...@infradead.org Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Paul Gortmaker paul.gortma...@windriver.com Signed-off-by: Frederic Weisbecker fweis...@gmail.com --- kernel/irq_work.c |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/kernel/irq_work.c b/kernel/irq_work.c index 4ed1749..480f747 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -109,6 +109,9 @@ bool irq_work_needs_cpu(void) if (llist_empty(this_list)) return false; + /* All work should have been flushed before going offline */ + WARN_ON_ONCE(cpu_is_offline(smp_processor_id())); + return true; } -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 6/9] irq_work: Flush work on CPU_DYING
From: Steven Rostedt rost...@goodmis.org In order not to offline a CPU with pending irq works, flush the queue from CPU_DYING. The notifier is called by stop_machine on the CPU that is going down. The code will not be called from irq context (so things like get_irq_regs() wont work) but I'm not sure what the requirements are for irq_work in that regard (Peter?). But irqs are disabled and the CPU is about to go offline. Might as well flush the work. Signed-off-by: Steven Rostedt rost...@goodmis.org Cc: Peter Zijlstra pet...@infradead.org Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Paul Gortmaker paul.gortma...@windriver.com Signed-off-by: Frederic Weisbecker fweis...@gmail.com --- kernel/irq_work.c | 51 +-- 1 files changed, 45 insertions(+), 6 deletions(-) diff --git a/kernel/irq_work.c b/kernel/irq_work.c index b3c113a..4ed1749 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -12,6 +12,8 @@ #include linux/percpu.h #include linux/hardirq.h #include linux/irqflags.h +#include linux/cpu.h +#include linux/notifier.h #include asm/processor.h /* @@ -110,11 +112,7 @@ bool irq_work_needs_cpu(void) return true; } -/* - * Run the irq_work entries on this cpu. Requires to be ran from hardirq - * context with local IRQs disabled. - */ -void irq_work_run(void) +static void __irq_work_run(void) { struct irq_work *work; struct llist_head *this_list; @@ -124,7 +122,6 @@ void irq_work_run(void) if (llist_empty(this_list)) return; - BUG_ON(!in_irq()); BUG_ON(!irqs_disabled()); llnode = llist_del_all(this_list); @@ -149,6 +146,16 @@ void irq_work_run(void) (void)cmpxchg(work-flags, IRQ_WORK_BUSY, 0); } } + +/* + * Run the irq_work entries on this cpu. Requires to be ran from hardirq + * context with local IRQs disabled. 
+ */ +void irq_work_run(void) +{ + BUG_ON(!in_irq()); + __irq_work_run(); +} EXPORT_SYMBOL_GPL(irq_work_run); /* @@ -163,3 +170,35 @@ void irq_work_sync(struct irq_work *work) cpu_relax(); } EXPORT_SYMBOL_GPL(irq_work_sync); + +#ifdef CONFIG_HOTPLUG_CPU +static int irq_work_cpu_notify(struct notifier_block *self, + unsigned long action, void *hcpu) +{ + long cpu = (long)hcpu; + + switch (action) { + case CPU_DYING: + /* Called from stop_machine */ + if (WARN_ON_ONCE(cpu != smp_processor_id())) + break; + __irq_work_run(); + break; + default: + break; + } + return NOTIFY_OK; +} + +static struct notifier_block cpu_notify; + +static __init int irq_work_init_cpu_notifier(void) +{ + cpu_notify.notifier_call = irq_work_cpu_notify; + cpu_notify.priority = 0; + register_cpu_notifier(cpu_notify); + return 0; +} +device_initcall(irq_work_init_cpu_notifier); + +#endif /* CONFIG_HOTPLUG_CPU */ -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/9] irq_work: Remove CONFIG_HAVE_IRQ_WORK
irq work can run on any arch even without IPI support because of the hook on update_process_times(). So lets remove HAVE_IRQ_WORK because it doesn't reflect any backend requirement. Signed-off-by: Frederic Weisbecker fweis...@gmail.com Acked-by: Steven Rostedt rost...@goodmis.org Cc: Peter Zijlstra pet...@infradead.org Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@kernel.org Cc: Andrew Morton a...@linux-foundation.org Cc: Paul Gortmaker paul.gortma...@windriver.com --- arch/alpha/Kconfig |1 - arch/arm/Kconfig|1 - arch/arm64/Kconfig |1 - arch/blackfin/Kconfig |1 - arch/frv/Kconfig|1 - arch/hexagon/Kconfig|1 - arch/mips/Kconfig |1 - arch/parisc/Kconfig |1 - arch/powerpc/Kconfig|1 - arch/s390/Kconfig |1 - arch/sh/Kconfig |1 - arch/sparc/Kconfig |1 - arch/x86/Kconfig|1 - drivers/staging/iio/trigger/Kconfig |1 - init/Kconfig|4 15 files changed, 0 insertions(+), 18 deletions(-) diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig index 5dd7f5d..e56c2d1 100644 --- a/arch/alpha/Kconfig +++ b/arch/alpha/Kconfig @@ -5,7 +5,6 @@ config ALPHA select HAVE_IDE select HAVE_OPROFILE select HAVE_SYSCALL_WRAPPERS - select HAVE_IRQ_WORK select HAVE_PCSPKR_PLATFORM select HAVE_PERF_EVENTS select HAVE_DMA_ATTRS diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index ade7e92..22d378b 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -36,7 +36,6 @@ config ARM select HAVE_GENERIC_HARDIRQS select HAVE_HW_BREAKPOINT if (PERF_EVENTS (CPU_V6 || CPU_V6K || CPU_V7)) select HAVE_IDE if PCI || ISA || PCMCIA - select HAVE_IRQ_WORK select HAVE_KERNEL_GZIP select HAVE_KERNEL_LZMA select HAVE_KERNEL_LZO diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index ef54a59..dd50d72 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -17,7 +17,6 @@ config ARM64 select HAVE_GENERIC_DMA_COHERENT select HAVE_GENERIC_HARDIRQS select HAVE_HW_BREAKPOINT if PERF_EVENTS - select HAVE_IRQ_WORK select HAVE_MEMBLOCK select HAVE_PERF_EVENTS select HAVE_SPARSE_IRQ diff --git 
a/arch/blackfin/Kconfig b/arch/blackfin/Kconfig index b6f3ad5..86f891f 100644 --- a/arch/blackfin/Kconfig +++ b/arch/blackfin/Kconfig @@ -24,7 +24,6 @@ config BLACKFIN select HAVE_FUNCTION_TRACER select HAVE_FUNCTION_TRACE_MCOUNT_TEST select HAVE_IDE - select HAVE_IRQ_WORK select HAVE_KERNEL_GZIP if RAMKERNEL select HAVE_KERNEL_BZIP2 if RAMKERNEL select HAVE_KERNEL_LZMA if RAMKERNEL diff --git a/arch/frv/Kconfig b/arch/frv/Kconfig index df2eb4b..c44fd6e 100644 --- a/arch/frv/Kconfig +++ b/arch/frv/Kconfig @@ -3,7 +3,6 @@ config FRV default y select HAVE_IDE select HAVE_ARCH_TRACEHOOK - select HAVE_IRQ_WORK select HAVE_PERF_EVENTS select HAVE_UID16 select HAVE_GENERIC_HARDIRQS diff --git a/arch/hexagon/Kconfig b/arch/hexagon/Kconfig index 0744f7d..40a3185 100644 --- a/arch/hexagon/Kconfig +++ b/arch/hexagon/Kconfig @@ -14,7 +14,6 @@ config HEXAGON # select HAVE_CLK # select IRQ_PER_CPU # select GENERIC_PENDING_IRQ if SMP - select HAVE_IRQ_WORK select GENERIC_ATOMIC64 select HAVE_PERF_EVENTS select HAVE_GENERIC_HARDIRQS diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index dba9390..3d86d69 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -4,7 +4,6 @@ config MIPS select HAVE_GENERIC_DMA_COHERENT select HAVE_IDE select HAVE_OPROFILE - select HAVE_IRQ_WORK select HAVE_PERF_EVENTS select PERF_USE_VMALLOC select HAVE_ARCH_KGDB diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig index 11def45..8f0df47 100644 --- a/arch/parisc/Kconfig +++ b/arch/parisc/Kconfig @@ -9,7 +9,6 @@ config PARISC select RTC_DRV_GENERIC select INIT_ALL_POSSIBLE select BUG - select HAVE_IRQ_WORK select HAVE_PERF_EVENTS select GENERIC_ATOMIC64 if !64BIT select HAVE_GENERIC_HARDIRQS diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index a902a5c..a90f0c9 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -118,7 +118,6 @@ config PPC select HAVE_SYSCALL_WRAPPERS if PPC64 select GENERIC_ATOMIC64 if PPC32 select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE - select HAVE_IRQ_WORK 
select HAVE_PERF_EVENTS select HAVE_REGS_AND_STACK_ACCESS_API select HAVE_HW_BREAKPOINT if PERF_EVENTS PPC_BOOK3S_64 diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 5dba755..0816ff0
Re: [PATCH] nohz/cpuset: Make a CPU stick with do_timer() duty in the presence of nohz cpusets
Hi Hakan, As I start to focus on timekeeing for full dynticks, I'm looking at your patch. Sorry I haven't yet replied with a serious review until now. But here is it, finally: 2012/6/17 Hakan Akkan hakanak...@gmail.com: An adaptive nohz (AHZ) CPU may not do do_timer() for a while despite being non-idle. When all other CPUs are idle, AHZ CPUs might be using stale jiffies values. To prevent this always keep a CPU with ticks if there is one or more AHZ CPUs. The patch changes can_stop_{idle,adaptive}_tick functions and prevents either the last CPU who did the do_timer() duty or the AHZ CPU itself from stopping its sched timer if there is one or more AHZ CPUs in the system. This means AHZ CPUs might keep the ticks running for short periods until a non-AHZ CPU takes the charge away in tick_do_timer_check_handler() function. When a non-AHZ CPU takes the charge, it never gives it away so that AHZ CPUs can run tickless. Signed-off-by: Hakan Akkan hakanak...@gmail.com CC: Frederic Weisbecker fweis...@gmail.com --- include/linux/cpuset.h |3 ++- kernel/cpuset.c |5 + kernel/time/tick-sched.c | 31 ++- 3 files changed, 37 insertions(+), 2 deletions(-) diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index ccbc2fd..19aa448 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -266,11 +266,12 @@ static inline bool cpuset_adaptive_nohz(void) extern void cpuset_exit_nohz_interrupt(void *unused); extern void cpuset_nohz_flush_cputimes(void); +extern bool nohz_cpu_exist(void); #else static inline bool cpuset_cpu_adaptive_nohz(int cpu) { return false; } static inline bool cpuset_adaptive_nohz(void) { return false; } static inline void cpuset_nohz_flush_cputimes(void) { } - +static inline bool nohz_cpu_exist(void) { return false; } #endif /* CONFIG_CPUSETS_NO_HZ */ #endif /* _LINUX_CPUSET_H */ diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 858217b..ccbaac9 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -1231,6 +1231,11 @@ DEFINE_PER_CPU(int, 
cpu_adaptive_nohz_ref); static cpumask_t nohz_cpuset_mask; +inline bool nohz_cpu_exist(void) +{ + return !cpumask_empty(nohz_cpuset_mask); +} + static void flush_cputime_interrupt(void *unused) { trace_printk(IPI: flush cputime\n); diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index bdc8aeb..e60d541 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -409,6 +409,25 @@ out: return ret; } +static inline bool must_take_timer_duty(int cpu) +{ + int handler = tick_do_timer_cpu; + bool ret = false; + bool tick_needed = nohz_cpu_exist(); Note this is racy because this fetches the value of the nohz_cpuset_mask without locking. We may see there is no nohz cpusets whereas we just set some of them as nohz and they could even have shut down their tick already. + + /* +* A CPU will have to take the timer duty if there is an adaptive +* nohz CPU in the system. The last handler == cpu check ensures +* that the last cpu that did the do_timer() sticks with the duty. +* A normal (non nohz) cpu will take the charge from a nohz cpu in +* tick_do_timer_check_handler anyway. +*/ + if (tick_needed (handler == TICK_DO_TIMER_NONE || handler == cpu)) + ret = true; This check is also racy due to the lack of locking. The previous handler may have set TICK_DO_TIMER_NONE and gone to sleep. We have no guarantee that the CPU can see that new value. It could believe there is still a handler. This needs at least cmpxchg() to make the test and set atomic. 
+ + return ret; +} + static bool can_stop_idle_tick(int cpu, struct tick_sched *ts) { /* @@ -421,6 +440,9 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts) if (unlikely(!cpu_online(cpu))) { if (cpu == tick_do_timer_cpu) tick_do_timer_cpu = TICK_DO_TIMER_NONE; + } else if (must_take_timer_duty(cpu)) { + tick_do_timer_cpu = cpu; + return false; } if (unlikely(ts-nohz_mode == NOHZ_MODE_INACTIVE)) @@ -512,6 +534,13 @@ void tick_nohz_idle_enter(void) #ifdef CONFIG_CPUSETS_NO_HZ static bool can_stop_adaptive_tick(void) { + int cpu = smp_processor_id(); + + if (must_take_timer_duty(cpu)) { + tick_do_timer_cpu = cpu; + return false; One problem I see here is that you randomize the handler. It could be an adaptive nohz CPU or an idle CPU. It's a problem if the user wants CPU isolation. I suggest to rather define a tunable timekeeping duty CPU affinity in a cpumask file at /sys/devices/system/cpu/timekeeping and a toggle at /sys/devices/system/cpu/cpuX/timekeeping (like the online file). This way the user can decide whether adaptive nohz CPU
Re: linux-next: manual merge of the tip tree with the rr tree
On Fri, Sep 28, 2012 at 01:33:41PM +1000, Stephen Rothwell wrote: Hi all, Today's linux-next merge of the tip tree got a conflict in arch/Kconfig between commit 9a9d5786a5e7 (Make most arch asm/module.h files use asm-generic/module.h) from the rr tree and commits fdf9c356502a (cputime: Make finegrained irqtime accounting generally available) and 2b1d5024e17b (rcu: Settle config for userspace extended quiescent state) from the tip tree. I fixed it up (see below) and can carry the fix as necessary (no action is required). Looks good. Thanks.
Re: rcu: eqs related warnings in linux-next
On Fri, Sep 28, 2012 at 02:51:03PM +0200, Sasha Levin wrote: Hi all, While fuzzing with trinity inside a KVM tools guest with the latest linux-next kernel, I've stumbled on the following during boot: [ 199.224369] WARNING: at kernel/rcutree.c:513 rcu_eqs_exit_common+0x4a/0x3a0() [ 199.225307] Pid: 1, comm: init Tainted: GW 3.6.0-rc7-next-20120928-sasha-1-g8b2d05d-dirty #13 [ 199.226611] Call Trace: [ 199.226951] [811c8d1a] ? rcu_eqs_exit_common+0x4a/0x3a0 [ 199.227773] [81108e36] warn_slowpath_common+0x86/0xb0 [ 199.228572] [81108f25] warn_slowpath_null+0x15/0x20 [ 199.229348] [811c8d1a] rcu_eqs_exit_common+0x4a/0x3a0 [ 199.230037] [8117f267] ? __lock_acquire+0x1c37/0x1ca0 [ 199.230037] [811c936c] rcu_eqs_exit+0x9c/0xb0 [ 199.230037] [811c940c] rcu_user_exit+0x8c/0xf0 [ 199.230037] [810a98bb] do_page_fault+0x1b/0x40 [ 199.230037] [810a2a90] do_async_page_fault+0x30/0xa0 [ 199.230037] [83a3eea8] async_page_fault+0x28/0x30 [ 199.230037] [819f357b] ? debug_object_activate+0x6b/0x1b0 [ 199.230037] [819f3586] ? debug_object_activate+0x76/0x1b0 [ 199.230037] [8111af13] ? lock_timer_base.isra.19+0x33/0x70 [ 199.230037] [8111d45f] mod_timer_pinned+0x9f/0x260 [ 199.230037] [811c5ff4] rcu_eqs_enter_common+0x894/0x970 [ 199.230037] [839dc2ac] ? init_post+0x75/0xc8 [ 199.230037] [85abfed5] ? kernel_init+0x1e1/0x1e1 [ 199.230037] [811c63df] rcu_eqs_enter+0xaf/0xc0 [ 199.230037] [811c64c5] rcu_user_enter+0xd5/0x140 [ 199.230037] [8107d0fd] syscall_trace_leave+0xfd/0x150 [ 199.230037] [83a3f7af] int_check_syscall_exit_work+0x34/0x3d [ 199.230037] ---[ end trace a582c3a264d5bd1a ]--- We are faulting in the middle of rcu_user_enter() and thus we call rcu_user_exit() while the whole transition state in rcu_user_enter() is not yet finished (rdtp-dynticks not incremented). Not sure how to solve this... 
Re: rcu: eqs related warnings in linux-next
On Fri, Sep 28, 2012 at 02:51:03PM +0200, Sasha Levin wrote: Hi all, While fuzzing with trinity inside a KVM tools guest with the latest linux-next kernel, I've stumbled on the following during boot: [ 199.224369] WARNING: at kernel/rcutree.c:513 rcu_eqs_exit_common+0x4a/0x3a0() [ 199.225307] Pid: 1, comm: init Tainted: GW 3.6.0-rc7-next-20120928-sasha-1-g8b2d05d-dirty #13 [ 199.226611] Call Trace: [ 199.226951] [811c8d1a] ? rcu_eqs_exit_common+0x4a/0x3a0 [ 199.227773] [81108e36] warn_slowpath_common+0x86/0xb0 [ 199.228572] [81108f25] warn_slowpath_null+0x15/0x20 [ 199.229348] [811c8d1a] rcu_eqs_exit_common+0x4a/0x3a0 [ 199.230037] [8117f267] ? __lock_acquire+0x1c37/0x1ca0 [ 199.230037] [811c936c] rcu_eqs_exit+0x9c/0xb0 [ 199.230037] [811c940c] rcu_user_exit+0x8c/0xf0 [ 199.230037] [810a98bb] do_page_fault+0x1b/0x40 [ 199.230037] [810a2a90] do_async_page_fault+0x30/0xa0 [ 199.230037] [83a3eea8] async_page_fault+0x28/0x30 [ 199.230037] [819f357b] ? debug_object_activate+0x6b/0x1b0 [ 199.230037] [819f3586] ? debug_object_activate+0x76/0x1b0 [ 199.230037] [8111af13] ? lock_timer_base.isra.19+0x33/0x70 [ 199.230037] [8111d45f] mod_timer_pinned+0x9f/0x260 [ 199.230037] [811c5ff4] rcu_eqs_enter_common+0x894/0x970 [ 199.230037] [839dc2ac] ? init_post+0x75/0xc8 [ 199.230037] [85abfed5] ? kernel_init+0x1e1/0x1e1 [ 199.230037] [811c63df] rcu_eqs_enter+0xaf/0xc0 [ 199.230037] [811c64c5] rcu_user_enter+0xd5/0x140 [ 199.230037] [8107d0fd] syscall_trace_leave+0xfd/0x150 [ 199.230037] [83a3f7af] int_check_syscall_exit_work+0x34/0x3d [ 199.230037] ---[ end trace a582c3a264d5bd1a ]--- Ok, we can't decently protect against any kind of exception messing up everything in the middle of RCU APIs anyway. The only solution is to find out what cause this page fault in mod_timer_pinned() and work around that. Anybody, an idea? 
Re: [RFC/PATCHSET 00/15] perf report: Add support to accumulate hist periods
On Fri, Sep 28, 2012 at 09:07:57AM +0200, Stephane Eranian wrote: On Fri, Sep 28, 2012 at 7:49 AM, Namhyung Kim namhy...@kernel.org wrote: Hi Frederic, On Fri, 28 Sep 2012 01:01:48 +0200, Frederic Weisbecker wrote: When Arun was working on this, I asked him to explore if it could make sense to reuse the -b, --branch-stack perf report option. Because after all, this feature is doing about the same than -b except it's using callchains instead of full branch tracing. But callchains are branches. Just a limited subset of all branches taken on excecution. So you can probably reuse some interface and even ground code there. What do you think? Umm.. first of all, I'm not familiar with the branch stack thing. It's intel-specific, right? The kernel API is NOT specific to Intel. It is abstracted to be portable across architecture. The implementation only exists on certain Intel X86 processors. Also I don't understand what exactly you want here. What kind of interface did you say? Can you elaborate it bit more? Not clear to me either. And AFAIK branch stack can collect much more branch information than just callstacks. Can we differentiate which is which easily? Is there any limitation on using it? What if callstacks are not sync'ed with branch stacks - is it possible though? First of all branch stack is not a branch tracing mechanism. This is a branch sampling mechanism. Not all branches are captured. Only the last N consecutive branches leading to a PMU interrupt are captured in each sample. Yes, the branch stack mechanism as it exists on Intel processors can capture more then call branches. It is HW based and provides a branch type filter. Filtering capability is exposed at the API level in a generic fashion. The hw filter is based on opcodes. Call branches all cover call, syscall instructions. As such, the branch stack mechanism cannot be used to capture callstacks to shared libraries, simply because there a a non call instruction in the trampoline. 
To obtain a better quality callstack you instead have to sample return branches. So yes, callstacks are not sync'ed with branch stack even if limited to call branches.

You're right. One doesn't simply sample callchains on top of branch tracing. Not easily at least. But that's not what we want here. We want the other way round: use callchains as branch sampling. And a callchain _is_ a branch sampling. Just a specialized one. PERF_SAMPLE_BRANCH_STACK either records only calls, only rets, or everything. You can define the filter with the -j option. Now callchains can be considered as the result of a specific -j filter option. It's just high-level filtering, ie: not just based on opcode types but on semantic post-processing. As if we applied a specific filter on a pure branch tracing that cancelled calls that had a matching ret. But in the end, what we have is just branches. Some branch layout that is biased, that already passed through a semantic wheel, still it's just _branches_. Note I'm not arguing about adding a -j callchain option, just trying to show you that callchains are not really different from other filtered sources of branch sampling.

But I think it'd be good if the branch stack can be changed to call stack in general. Did you mean this?

That's not going to happen. The mechanism is much more generic than that. Quite frankly, I don't understand Frederic's motivation here. The mechanisms are not quite the same.

So, considering that callchains are just branches, why can't we use them as a branch source, just like PERF_SAMPLE_BRANCH_STACK data samples, that we can reuse in perf report -b? Look at commit b50311dc2ac1c04ad19163c2359910b25e16caf6 "perf report: Add support for taken branch sampling". It's doing (except for a few details like the period weight of branch samples) the same as Namhyung's patch, just with PERF_SAMPLE_BRANCH_STACK instead of callchains. I don't understand what justifies this duplication.
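The argument above — that a callchain is just a filtered set of branches — can be sketched in plain C. This is a hypothetical userspace model, not perf code: the names `callchain_to_branches` and `struct branch_sample` are invented for illustration; the only assumption taken from perf is that callchains are recorded leaf-first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical branch sample: one taken call branch. */
struct branch_sample {
	uint64_t from;	/* call site (caller frame) */
	uint64_t to;	/* called function (callee frame) */
};

/*
 * Convert a leaf-first callchain into call-branch samples: every
 * adjacent pair of frames is one branch, the caller being the "from"
 * side and the callee the "to" side.  Returns the number of branches
 * produced (len - 1 when len > 0).
 */
static size_t callchain_to_branches(const uint64_t *chain, size_t len,
				    struct branch_sample *out)
{
	size_t i;

	for (i = 0; i + 1 < len; i++) {
		out[i].from = chain[i + 1];	/* caller */
		out[i].to = chain[i];		/* callee */
	}
	return len ? len - 1 : 0;
}
```

In this view, a report that consumes PERF_SAMPLE_BRANCH_STACK entries could consume the converted callchain pairs through the same code path.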
Re: [RFC/PATCHSET 00/15] perf report: Add support to accumulate hist periods
On Fri, Sep 28, 2012 at 02:49:55PM +0900, Namhyung Kim wrote: Hi Frederic, On Fri, 28 Sep 2012 01:01:48 +0200, Frederic Weisbecker wrote:

When Arun was working on this, I asked him to explore if it could make sense to reuse the -b, --branch-stack perf report option. Because after all, this feature is doing about the same as -b except it's using callchains instead of full branch tracing. But callchains are branches. Just a limited subset of all branches taken on execution. So you can probably reuse some interface and even ground code there. What do you think?

Umm.. first of all, I'm not familiar with the branch stack thing. It's intel-specific, right? Also I don't understand what exactly you want here. What kind of interface did you say? Can you elaborate it a bit more?

Look at commit b50311dc2ac1c04ad19163c2359910b25e16caf6 "perf report: Add support for taken branch sampling". It's doing almost the same as you do, just using PERF_SAMPLE_BRANCH_STACK instead of callchains.

And AFAIK branch stack can collect much more branch information than just callstacks.

That's not a problem. Callchains are just a high-level filtered source of branch samples. You don't need full branches to use -b. Just use the flavour of branch samples you want to make the sense you want of your branch sampling.

Can we differentiate which is which easily?

Sure. If you have both sources in your perf.data (PERF_SAMPLE_BRANCH_STACK and callchains), ask the user which one he wants. Otherwise default to what's there.

Is there any limitation on using it? What if callstacks are not sync'ed with branch stacks - is it possible though?

It's better to make both sources mutually exclusive. Otherwise it's going to be over-complicated.

But I think it'd be good if the branch stack can be changed to call stack in general. Did you mean this?

That's a different thing. We might be able to post-process branch tracing and build a callchain on top of it (following calls and rets). Maybe we will one day.
But they are different issues altogether. Thanks.
Re: rcu: eqs related warnings in linux-next
2012/9/29 Sasha Levin levinsasha...@gmail.com: Maybe I could help here a bit.

lappy linux # addr2line -i -e vmlinux 8111d45f
/usr/src/linux/kernel/timer.c:549
/usr/src/linux/include/linux/jump_label.h:101
/usr/src/linux/include/trace/events/timer.h:44
/usr/src/linux/kernel/timer.c:601
/usr/src/linux/kernel/timer.c:734
/usr/src/linux/kernel/timer.c:886

Which means that it was about to: debug_object_activate(timer, &timer_debug_descr);

I can't find anything in the debug object code that might fault. I was suspecting some per cpu allocated memory: per cpu allocation sometimes uses vmalloc, which uses lazy paging via faults. But I can't find such a thing there. Maybe there is some faulting specific to KVM...
Re: rcu: eqs related warnings in linux-next
On Sat, Sep 29, 2012 at 06:37:37AM -0700, Paul E. McKenney wrote: On Sat, Sep 29, 2012 at 02:25:04PM +0200, Frederic Weisbecker wrote: 2012/9/29 Sasha Levin levinsasha...@gmail.com: Maybe I could help here a bit.

lappy linux # addr2line -i -e vmlinux 8111d45f
/usr/src/linux/kernel/timer.c:549
/usr/src/linux/include/linux/jump_label.h:101
/usr/src/linux/include/trace/events/timer.h:44
/usr/src/linux/kernel/timer.c:601
/usr/src/linux/kernel/timer.c:734
/usr/src/linux/kernel/timer.c:886

Which means that it was about to: debug_object_activate(timer, &timer_debug_descr);

Understood and agreed, hence my severe diagnostic patch.

I can't find anything in the debug object code that might fault. I was suspecting some per cpu allocated memory: per cpu allocation sometimes uses vmalloc, which uses lazy paging via faults. But I can't find such a thing there. Maybe there is some faulting specific to KVM...

Sasha, is this easily reproducible? If so, could you please try the previous patch? It will likely give us more information on where this bug really lives. (Yes, it might totally obscure the bug, but in that case we will just need to try some other perturbation.)

Isn't your patch actually removing the timer? But if so, we won't fault anymore, or maybe you want to check if we fault also outside the timer? Just in case, I'm posting a second patch that dumps the regs when we fault in the middle of an RCU user mode API. This way we can find the precise rip where we fault:

---
From db4ef9708e606754ac8a3f83b9f293383d263108 Mon Sep 17 00:00:00 2001
From: Frederic Weisbecker fweis...@gmail.com
Date: Sat, 29 Sep 2012 14:16:09 +0200
Subject: [PATCH] rcu: Debug nasty rcu user mode API recursion

Add some debug code to chase down the origin of the fault.
Not-Signed-off-by: Frederic Weisbecker fweis...@gmail.com
---
 arch/x86/mm/fault.c      |  1 +
 include/linux/rcupdate.h |  1 +
 kernel/rcutree.c         | 32
 kernel/rcutree.h         |  1 +
 4 files changed, 35 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a530b23..a5f0eb5 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1232,6 +1232,7 @@ good_area:
 dotraplinkage void __kprobes
 do_page_fault(struct pt_regs *regs, unsigned long error_code)
 {
+	rcu_check_user_recursion(regs);
 	exception_enter(regs);
 	__do_page_fault(regs, error_code);
 	exception_exit(regs);
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 7c968e4..14ba908 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -199,6 +199,7 @@ extern void rcu_user_enter_after_irq(void);
 extern void rcu_user_exit_after_irq(void);
 extern void rcu_user_hooks_switch(struct task_struct *prev,
 				  struct task_struct *next);
+extern void rcu_check_user_recursion(struct pt_regs *regs);
 #else
 static inline void rcu_user_enter(void) { }
 static inline void rcu_user_exit(void) { }
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 4fb2376..63b84f5 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -405,6 +405,20 @@ void rcu_idle_enter(void)
 EXPORT_SYMBOL_GPL(rcu_idle_enter);

 #ifdef CONFIG_RCU_USER_QS
+void rcu_check_user_recursion(struct pt_regs *regs)
+{
+	unsigned long flags;
+	static int printed;
+
+	local_irq_save(flags);
+	if (__this_cpu_read(rcu_dynticks.recursion) && !printed) {
+		printed = 1;
+		printk("Found recursion\n");
+		show_regs(regs);
+	}
+	local_irq_restore(flags);
+}
+
 /**
  * rcu_user_enter - inform RCU that we are resuming userspace.
  *
@@ -433,10 +447,20 @@ void rcu_user_enter(void)
 	local_irq_save(flags);
 	rdtp = &__get_cpu_var(rcu_dynticks);
+	if (WARN_ON_ONCE(rdtp->recursion)) {
+		local_irq_restore(flags);
+		return;
+	}
+
+	rdtp->recursion = true;
+	barrier();
+
 	if (!rdtp->ignore_user_qs && !rdtp->in_user) {
 		rdtp->in_user = true;
 		rcu_eqs_enter(true);
 	}
+	rdtp->recursion = false;
+
 	local_irq_restore(flags);
 }
@@ -590,10 +614,18 @@ void rcu_user_exit(void)
 	local_irq_save(flags);
 	rdtp = &__get_cpu_var(rcu_dynticks);
+	if (WARN_ON_ONCE(rdtp->recursion)) {
+		local_irq_restore(flags);
+		return;
+	}
+
+	rdtp->recursion = true;
+	barrier();
 	if (rdtp->in_user) {
 		rdtp->in_user = false;
 		rcu_eqs_exit(true);
 	}
+	rdtp->recursion = false;
 	local_irq_restore(flags);
 }
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 5faf05d..1bde9d5 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -103,6 +103,7 @@ struct rcu_dynticks {
 	int tick_nohz_enabled_snap; /* Previously seen value from sysfs. */
 #endif /* #ifdef
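The guard the debug patch adds can be reduced to a small userspace sketch: a flag is raised for the duration of the critical section so that a re-entry (a page fault calling back into the same API) is detected instead of recursing silently. All names below are invented for illustration; no kernel API is involved.

```c
#include <assert.h>
#include <stdbool.h>

static bool rcu_user_recursion;	/* models rdtp->recursion */
static int recursion_found;	/* models the printk()/show_regs() hit */

/* Models rcu_check_user_recursion() as called from the fault handler. */
static void check_user_recursion(void)
{
	if (rcu_user_recursion)
		recursion_found++;
}

/* Models rcu_user_enter(); @fault simulates a fault inside the section. */
static void user_enter(bool fault)
{
	if (rcu_user_recursion)
		return;			/* the WARN_ON_ONCE() bail-out */
	rcu_user_recursion = true;
	if (fault)
		check_user_recursion();	/* fault path sees the flag raised */
	rcu_user_recursion = false;
}
```

The same two-sided structure appears in the patch: the enter/exit paths set and clear the flag under disabled irqs, and the fault path only reads it.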
Re: [PATCHv2] perf x86_64: Fix rsp register for system call fast path
On Tue, Oct 02, 2012 at 04:58:15PM +0200, Jiri Olsa wrote:

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 915b876..11d62ff 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -34,6 +34,7 @@
 #include <asm/timer.h>
 #include <asm/desc.h>
 #include <asm/ldt.h>
+#include <asm/syscall.h>

 #include "perf_event.h"

@@ -1699,6 +1700,52 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
 	userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;
 }

+#ifdef CONFIG_X86_64
+__weak int arch_sample_regs_user(struct pt_regs *oregs, struct pt_regs *regs)
+{
+	int kernel = !user_mode(regs);
+
+	if (kernel) {
+		if (current->mm)
+			regs = task_pt_regs(current);
+		else
+			regs = NULL;
+	}

Shouldn't the above stay in generic code?

+
+	if (regs) {
+		memcpy(oregs, regs, sizeof(*regs));
+
+		/*
+		 * If the perf event was triggered within the kernel code
+		 * path, then it was either syscall or interrupt. While
+		 * interrupt stores almost all user registers, the syscall
+		 * fast path does not. At this point we can at least set
+		 * rsp register right, which is crucial for dwarf unwind.
+		 *
+		 * The syscall_get_nr function returns -1 (orig_ax) for
+		 * interrupt, and positive value for syscall.
+		 *
+		 * We have two race windows in here:
+		 *
+		 * 1) Few instructions from syscall entry until old_rsp is
+		 *    set.
+		 *
+		 * 2) In syscall/interrupt path from entry until the orig_ax
+		 *    is set.
+		 *
+		 * Above described race windows are fractional opposed to
+		 * the syscall fast path, so we get much better results
+		 * fixing rsp this way.

That said, a race is there already: if the syscall is interrupted before SAVE_ARGS and co. I'm trying to scratch my head to find a solution to detect the race and bail out instead of recording erroneous values but I can't find one. Anyway this is still better than what we have now.
Another solution could be to force the syscall slow path and have some variable set there that tells us we are in a syscall and all the regs have been saved. But we probably don't want to force the syscall slow path...
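The orig_ax discrimination the patch comment describes can be sketched as a tiny userspace model. The only assumption carried over is the convention the comment itself states: syscall_get_nr() yields -1 (from orig_ax) for interrupt entry and the syscall number otherwise; the function name below is invented.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Interrupt entry saves (almost) all user registers, so nothing needs
 * fixing; syscall fast-path entry does not save rsp, so only samples
 * taken inside a syscall need the rsp fixup from old_rsp.
 */
static bool syscall_needs_rsp_fixup(long orig_ax)
{
	return orig_ax >= 0;	/* >= 0: syscall nr; -1: interrupt */
}
```

The race windows discussed above sit exactly around this test: a sample landing before orig_ax (or old_rsp) is written sees stale values, which is why the check is probabilistic rather than airtight.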
Re: [PATCH 1/3] perf tools: Check existence of _get_comp_words_by_ref when bash completing
On Wed, Oct 03, 2012 at 12:21:32AM +0900, Namhyung Kim wrote:

The '_get_comp_words_by_ref' function is available from bash completion v1.2, so earlier versions emit the following warning:

$ perf re<TAB>
_get_comp_words_by_ref: command not found

Use the older '_get_cword' method when the above function doesn't exist.

Maybe only use _get_cword then, if it works everywhere?

Cc: Frederic Weisbecker fweis...@gmail.com
Cc: David Ahern dsah...@gmail.com
Signed-off-by: Namhyung Kim namhy...@kernel.org
---
 tools/perf/bash_completion | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/tools/perf/bash_completion b/tools/perf/bash_completion
index 1958fa539d0f..3d48cee1b5e5 100644
--- a/tools/perf/bash_completion
+++ b/tools/perf/bash_completion
@@ -1,12 +1,23 @@
 # perf completion
+function_exists()
+{
+	declare -F $1 > /dev/null
+	return $?
+}
+
 have perf &&
 _perf()
 {
-	local cur cmd
+	local cur prev cmd

 	COMPREPLY=()
-	_get_comp_words_by_ref cur prev
+	if function_exists _get_comp_words_by_ref; then
+		_get_comp_words_by_ref cur prev
+	else
+		cur=$(_get_cword)
+		prev=${COMP_WORDS[COMP_CWORD-1]}
+	fi
 	cmd=${COMP_WORDS[0]}
--
1.7.9.2
Re: [PATCH 0/3] perf tools: Bash completion update
On Wed, Oct 03, 2012 at 12:21:31AM +0900, Namhyung Kim wrote: Hi, This patchset improves bash completion support for perf tools. Some option names are really painful to type so here comes a support for completing those long option names. But I still think the --showcpuutilization option needs to be renamed (at least adding a couple of dashes in it). Thanks, Namhyung Acked-by: Frederic Weisbecker fweis...@gmail.com Thanks Namhyung! Namhyung Kim (3): perf tools: Check existence of _get_comp_words_by_ref when bash completing perf tools: Complete long option names of perf command perf tools: Long option completion support for each subcommands tools/perf/bash_completion | 36 +--- tools/perf/util/parse-options.c |8 tools/perf/util/parse-options.h |1 + 3 files changed, 38 insertions(+), 7 deletions(-) -- 1.7.9.2
Re: [PATCHv2] perf x86_64: Fix rsp register for system call fast path
On Tue, Oct 02, 2012 at 06:06:26PM +0200, Jiri Olsa wrote: On Tue, Oct 02, 2012 at 05:49:26PM +0200, Frederic Weisbecker wrote: On Tue, Oct 02, 2012 at 04:58:15PM +0200, Jiri Olsa wrote: diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index 915b876..11d62ff 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -34,6 +34,7 @@ #include <asm/timer.h> #include <asm/desc.h> #include <asm/ldt.h> +#include <asm/syscall.h> #include "perf_event.h" @@ -1699,6 +1700,52 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now) userpg->time_offset = this_cpu_read(cyc2ns_offset) - now; } +#ifdef CONFIG_X86_64 +__weak int arch_sample_regs_user(struct pt_regs *oregs, struct pt_regs *regs) +{ + int kernel = !user_mode(regs); + + if (kernel) { + if (current->mm) + regs = task_pt_regs(current); + else + regs = NULL; + }

Shouldn't the above stay in generic code?

could be.. I guess I thought that having the regs retrieval plus the fixup at the same place feels better/compact ;) but could change that if needed

Yeah please.

I'm trying to scratch my head to find a solution to detect the race and bail out instead of recording erroneous values but I can't find one. Anyway this is still better than what we have now. Another solution could be to force the syscall slow path and have some variable set there that tells us we are in a syscall and all the regs have been saved. But we probably don't want to force the syscall slow path...

I was trying something like that as well, but the one I sent looks far less hacky to me.. :)

Actually it's more hacky because it's less deterministic. But it's simpler, and doesn't hurt performance. Ok, let's start with that.
Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus
2013/1/10 Russell King - ARM Linux li...@arm.linux.org.uk: On Thu, Jan 10, 2013 at 09:02:15AM -0500, Don Zickus wrote: On Wed, Jan 09, 2013 at 05:57:39PM -0800, Colin Cross wrote: Emulate NMIs on systems where they are not available by using timer interrupts on other cpus. Each cpu will use its softlockup hrtimer to check that the next cpu is processing hrtimer interrupts by verifying that a counter is increasing. This patch is useful on systems where the hardlockup detector is not available due to a lack of NMIs, for example most ARM SoCs. I have seen other cpus, like Sparc I think, create a 'virtual NMI' by reserving an IRQ line as 'special' (can not be masked). Not sure if that is something worth looking at here (or even possible). No it isn't, because that assumes that things like spin_lock_irqsave() won't mask that interrupt. We don't have the facility to do that. I believe sparc is doing something like this though. Look at arch/sparc/include/asm/irqflags_64.h, it seems NMIs are implemented there using an irq number that is not masked by this function. Not all archs can do that so easily I guess.
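The sparc64 trick referenced above can be modeled in userspace C. The PIL_NORMAL_MAX and PIL_NMI values mirror the constants in arch/sparc/include/asm/pil.h; the functions are invented for illustration: irq-disable only raises the processor interrupt level to PIL_NORMAL_MAX, so an interrupt wired above that level is never masked and can serve as a pseudo-NMI.

```c
#include <assert.h>
#include <stdbool.h>

#define PIL_NORMAL_MAX	14
#define PIL_NMI		15

static int pil;	/* current processor interrupt level (0 = all open) */

/* What spin_lock_irqsave()/local_irq_disable() end up doing. */
static void local_irq_disable_model(void)
{
	pil = PIL_NORMAL_MAX;
}

/* An interrupt is delivered only if its level is above %pil. */
static bool irq_delivered(int level)
{
	return level > pil;
}
```

This is exactly why Russell's objection applies on ARM: without a priority level (or equivalent) that ordinary irq masking leaves open, there is no line for the pseudo-NMI to live on.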
Re: Request for tree inclusion
2012/12/3 Frederic Weisbecker fweis...@gmail.com: 2012/12/2 Stephen Rothwell s...@canb.auug.org.au: Well, these are a bit late (I expected Linus to release v3.7 today), but since Ingo has not piped in over the weekend, I have added them from today after the tip tree merge.

Yeah sorry to submit that so late. Those branches are in pending pull requests to the -tip tree and I thought about relying on the propagation of -tip into -next as usual. But Ingo has been very busy with numa related work during this cycle. So until these branches get merged in -tip, I'm short-circuiting a bit the -next step before it becomes too late for the next merge window.

I have called them fw-cputime, fs-sched and fw-nohz respectively and listed you as the only contact in case of problems. Ok. If these are to be long term trees included in linux-next, I would prefer that you use better branch names - otherwise, if they are just short term, please tell me to remove them when they are finished with. They are definitely short term. I'll tell you once these can be dropped. Thanks a lot!

Hi Stephen! fw-cputime and fs-sched have been merged so you can now remove these branches from next. But fw-nohz remains. In the meantime I have created a branch named nohz/next: git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git nohz/next

This branch currently refers to fw-nohz HEAD (aka nohz/printk-v8) and this is also the place where I'll gather -next materials in the future instead of the multiple branches you're currently pulling. So could you please remove fw-nohz (nohz/printk-v8) as well from -next but include nohz/next instead? Thanks!
Re: [PATCH v2] hardlockup: detect hard lockups without NMIs using secondary cpus
2013/1/11 Colin Cross ccr...@android.com: Emulate NMIs on systems where they are not available by using timer interrupts on other cpus. Each cpu will use its softlockup hrtimer to check that the next cpu is processing hrtimer interrupts by verifying that a counter is increasing. This patch is useful on systems where the hardlockup detector is not available due to a lack of NMIs, for example most ARM SoCs. Without this patch any cpu stuck with interrupts disabled can cause a hardware watchdog reset with no debugging information, but with this patch the kernel can detect the lockup and panic, which can result in useful debugging info. Signed-off-by: Colin Cross ccr...@android.com

I believe this is pretty much what the RCU stall detector does already: checks for other CPUs being responsive. The only difference is on how it checks that. For RCU it's about checking for CPUs reporting quiescent states when requested to do so. In your case it's about ensuring the hrtimer interrupt is well handled. One thing you can do is to enqueue an RCU callback (call_rcu()) every minute so you can force other CPUs to report quiescent states periodically and thus check for lockups. Now you'll face the same problem in the end: if you don't have NMIs, you won't have a very useful report.
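The counter-checking scheme from the quoted commit message can be sketched as a userspace model (names invented, no hrtimer or kernel API involved): every CPU's timer bumps its own counter, and on each of its own ticks a CPU also verifies that its neighbour's counter has moved since the last check.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

static unsigned long hrtimer_count[NR_CPUS];	/* ticks seen per CPU */
static unsigned long last_seen[NR_CPUS];	/* neighbour snapshots */

/* Models the per-CPU softlockup hrtimer firing. */
static void hrtimer_tick(int cpu)
{
	hrtimer_count[cpu]++;
}

/* Run by @cpu on its own tick; true means the next CPU looks stuck. */
static bool next_cpu_stuck(int cpu)
{
	int next = (cpu + 1) % NR_CPUS;
	bool stuck = hrtimer_count[next] == last_seen[cpu];

	last_seen[cpu] = hrtimer_count[next];
	return stuck;
}
```

A CPU spinning with interrupts disabled stops servicing its hrtimer, its counter freezes, and the neighbour's next check trips, at which point the real patch panics to capture debugging state.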
Re: [PATCH v2] hardlockup: detect hard lockups without NMIs using secondary cpus
2013/1/15 Colin Cross ccr...@android.com: On Mon, Jan 14, 2013 at 4:13 PM, Frederic Weisbecker fweis...@gmail.com wrote: I believe this is pretty much what the RCU stall detector does already: checks for other CPUs being responsive. The only difference is on how it checks that. For RCU it's about checking for CPUs reporting quiescent states when requested to do so. In your case it's about ensuring the hrtimer interrupt is well handled. One thing you can do is to enqueue an RCU callback (call_rcu()) every minute so you can force other CPUs to report quiescent states periodically and thus check for lockups.

That's a good point, I'll take a look at using that. A minute is too long, some SoCs have maximum HW watchdog periods of under 30 seconds, but a call_rcu every 10-20 seconds might be sufficient.

Sure. And you can tune CONFIG_RCU_CPU_STALL_TIMEOUT accordingly.
Re: [PATCH v2] hardlockup: detect hard lockups without NMIs using secondary cpus
2013/1/15 Colin Cross ccr...@android.com: On Mon, Jan 14, 2013 at 4:25 PM, Frederic Weisbecker fweis...@gmail.com wrote: 2013/1/15 Colin Cross ccr...@android.com: On Mon, Jan 14, 2013 at 4:13 PM, Frederic Weisbecker fweis...@gmail.com wrote: I believe this is pretty much what the RCU stall detector does already: checks for other CPUs being responsive. The only difference is on how it checks that. For RCU it's about checking for CPUs reporting quiescent states when requested to do so. In your case it's about ensuring the hrtimer interrupt is well handled. One thing you can do is to enqueue an RCU callback (call_rcu()) every minute so you can force other CPUs to report quiescent states periodically and thus check for lockups.

That's a good point, I'll take a look at using that. A minute is too long, some SoCs have maximum HW watchdog periods of under 30 seconds, but a call_rcu every 10-20 seconds might be sufficient.

Sure. And you can tune CONFIG_RCU_CPU_STALL_TIMEOUT accordingly.

After considering this, I think the hrtimer watchdog is more useful. RCU stalls are not usually panic events, and I wouldn't want to add a panic on every RCU stall. The lack of stack traces on the affected cpu makes a panic important. I'm planning to add an ARM DBGPCSR panic handler, which will be able to dump the PC of a stuck cpu even if it is not responding to interrupts. kexec or kgdb on panic might also allow some inspection of the stack on a stuck cpu. Failing to process interrupts is a much more serious event than an RCU stall, and being able to detect them separately may be very valuable for debugging.

RCU stalls can happen for different reasons: softlockup (failure to schedule another task), hardlockup (failure to process interrupts), or a bug in RCU itself. But if you have a hardlockup, it will report it. Now why do you need a panic in any case? I don't know DBGPCSR, is this a breakpoint register? How do you plan to use it remotely from the CPU that detects the lockup?
Re: [PATCH tip/core/urgent 1/2] rcu: Prevent soft-lockup complaints about no-CBs CPUs
2013/1/5 Paul E. McKenney paul...@linux.vnet.ibm.com: On Sat, Jan 05, 2013 at 06:21:01PM +0100, Frederic Weisbecker wrote: Hi Paul, 2013/1/5 Paul E. McKenney paul...@linux.vnet.ibm.com: From: Paul Gortmaker paul.gortma...@windriver.com

The wait_event() at the head of the rcu_nocb_kthread() can result in soft-lockup complaints if the CPU in question does not register RCU callbacks for an extended period. This commit therefore changes the wait_event() to a wait_event_interruptible().

Reported-by: Frederic Weisbecker fweis...@gmail.com
Signed-off-by: Paul Gortmaker paul.gortma...@windriver.com
Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com
---
 kernel/rcutree_plugin.h | 3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index f6e5ec2..43dba2d 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2366,10 +2366,11 @@ static int rcu_nocb_kthread(void *arg)
 	for (;;) {
 		/* If not polling, wait for next batch of callbacks. */
 		if (!rcu_nocb_poll)
-			wait_event(rdp->nocb_wq, rdp->nocb_head);
+			wait_event_interruptible(rdp->nocb_wq, rdp->nocb_head);
 		list = ACCESS_ONCE(rdp->nocb_head);
 		if (!list) {
 			schedule_timeout_interruptible(1);
+			flush_signals(current);

Why is that needed?

To satisfy my paranoia. ;-) And in case someone ever figures out some way to send a signal to a kthread.

Ok. I don't want to cause any insomnia to anyone, so I won't insist ;)
Re: [PATCH 03/27] cputime: Allow dynamic switch between tick/virtual based cputime accounting
Hey Paul, 2013/1/4 Paul Gortmaker paul.gortma...@windriver.com: On 12-12-29 11:42 AM, Frederic Weisbecker wrote:

Allow to dynamically switch between tick and virtual based cputime accounting. This way we can provide a kind of on-demand virtual based cputime accounting. In this mode, the kernel will rely on the user hooks subsystem to dynamically hook on kernel boundaries. This is in preparation for beeing able to stop the timer tick further idle. Doing so will depend on CONFIG_VIRT_CPU_ACCOUNTING which makes

s/beeing/being/ -- also I know what you mean, but it may not be 100% clear to everyone -- perhaps ...for being able to stop the timer tick in more places than just the idle state.

Thanks! Fixed for the next version!

[...] +static inline bool vtime_accounting(void) { return false; }

It wasn't 100% obvious what vtime_accounting() was doing until I saw its definition below. I wonder if it should be something like vtime_accounting_on() or vtime_accounting_enabled() instead?

Agreed, I've renamed it into vtime_accounting_enabled().

#endif #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN diff --git a/init/Kconfig b/init/Kconfig index dad2b88..307bc35 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -342,6 +342,7 @@ config VIRT_CPU_ACCOUNTING bool "Deterministic task and CPU time accounting" depends on HAVE_VIRT_CPU_ACCOUNTING || HAVE_CONTEXT_TRACKING select VIRT_CPU_ACCOUNTING_GEN if !HAVE_VIRT_CPU_ACCOUNTING + select VIRT_CPU_ACCOUNTING_NATIVE if HAVE_VIRT_CPU_ACCOUNTING help Select this option to enable more accurate task and CPU time accounting. This is done by reading a CPU counter on each @@ -366,11 +367,16 @@ endchoice config VIRT_CPU_ACCOUNTING_GEN select CONTEXT_TRACKING + depends on VIRT_CPU_ACCOUNTING && HAVE_CONTEXT_TRACKING

Should the 2nd half of this depends have been already here, i.e. introduced with the prev. patch that created VIRT_CPU_ACCOUNTING_GEN?

Yeah, Li Zhong suggested that I turn *_GEN and *_NATIVE options into distinct choices for the user.
So I moved that part to the previous patch. Thanks!
Re: [PATCH rcu] Remove unused code originally used for context tracking
2013/1/7 Paul E. McKenney paul...@linux.vnet.ibm.com: On Fri, Nov 30, 2012 at 02:19:22PM +0800, Li Zhong wrote: With the new context tracking subsystem added, it seems ignore_user_qs and in_user defined in struct rcu_dynticks are no longer needed, so remove them. Signed-off-by: Li Zhong zh...@linux.vnet.ibm.com Hearing no objections from Frederic, I have queued this patch for 3.9. Thanks Paul! And feel free to add my ack.
Re: [PATCH v2] context_tracking: Add comments on interface and internals
2012/12/16 Ingo Molnar mi...@kernel.org: * Frederic Weisbecker fweis...@gmail.com wrote: + +/** + * context_tracking_task_switch - context switch the syscall hooks + * + * The context tracking uses the syscall slow path to implement its user-kernel + * boundaries hooks on syscalls. This way it doesn't impact the syscall fast + * path on CPUs that don't do context tracking. + * + * But we need to clear the flag on the previous task because it may later + * migrate to some CPU that doesn't do the context tracking. As such the TIF + * flag may not be desired there. If possible: s/hooks/callbacks 'hook' gives me the visual of a box match. YMMV. Ok, I'm fixing this.
[PATCH] sched: Remove broken check for skip clock update
rq->skip_clock_update shouldn't be negative. Thus the check in put_prev_task() is useless. It was probably intended to do the following check:

if (prev->on_rq && !rq->skip_clock_update)

We only want to update the clock if the current task is not voluntarily sleeping: otherwise deactivate_task() already did the rq clock update in schedule(). But we want to ignore that update if a ttwu did it for us, in which case rq->skip_clock_update is 1. But update_rq_clock() already takes care of that so we can just remove the broken condition.

Signed-off-by: Frederic Weisbecker fweis...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Steven Rostedt rost...@goodmis.org
---
 kernel/sched/core.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 15ba35e..8dfc461 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2886,7 +2886,7 @@ static inline void schedule_debug(struct task_struct *prev)
 static void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
-	if (prev->on_rq || rq->skip_clock_update < 0)
+	if (prev->on_rq)
 		update_rq_clock(rq);
 	prev->sched_class->put_prev_task(rq, prev);
 }
--
1.7.5.4
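The reasoning in the changelog can be condensed into a toy model (plain C with invented names, not the scheduler code): the rq clock must advance once per schedule(), deactivate_task() covers the voluntary-sleep case, a remote wakeup sets skip_clock_update, and since update_rq_clock() itself honours that flag, put_prev_task() only needs the prev->on_rq test.

```c
#include <assert.h>
#include <stdbool.h>

struct rq_model {
	int clock;
	int skip_clock_update;	/* set when a ttwu already updated it */
};

static void update_rq_clock_model(struct rq_model *rq)
{
	if (rq->skip_clock_update)
		return;		/* clock already fresh, nothing to do */
	rq->clock++;
}

static void put_prev_task_model(struct rq_model *rq, bool prev_on_rq)
{
	if (prev_on_rq)		/* the simplified check from the patch */
		update_rq_clock_model(rq);
}
```

The removed `rq->skip_clock_update < 0` clause never fires in this model because the flag is only ever 0 or 1, which is exactly the changelog's point.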