Re: [PATCH 1/4] workqueue: Create low-level unbound workqueues cpumask

2014-05-01 Thread Frederic Weisbecker
On Thu, May 01, 2014 at 11:02:31AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Thu, May 01, 2014 at 05:01:17PM +0200, Frederic Weisbecker wrote:
> > > Another thing with naming is that I didn't anticipate having
> > > attributes at the top directory so the workqueue directories aren't
> > > namespaced.  Maybe we want to namespace top level knobs?
> > > "system_cpumask" maybe?  Any better ideas?
> > 
> > Not sure why you want that. It makes sense on directories grouping
> > file for different subsystem. But here?
> 
> Worried about possible conflicts with workqueue names if we end up
> with more attributes.

But they are already protected in their own directories. Also, having
the same attribute file names in the root and in the individual directories
suggests that the root attributes take precedence over the children's.


Re: [PATCH] watchdog: print all locks on a softlock

2014-05-01 Thread Frederic Weisbecker
On Thu, May 01, 2014 at 03:17:20PM -0400, Don Zickus wrote:
> On Thu, May 01, 2014 at 02:55:35PM -0400, Eric Paris wrote:
> > If the CPU hits a softlockup this patch will also have it print the
> > information about all locks being held on the system.  This might help
> > determine if a lock is being held too long leading to this problem.
> 
> I am not sure this helps you.  A softlockup is the result of pre-emption
> disabled, ie the scheduler not being called after 60 seconds.  Holding a
> lock does not disable pre-emption usually.  So I don't think this is going
> to add anything.
> 
> Are you trying to debug a hung task?  The the hung_task thread checks to
> see if a task hasn't scheduled in 2 minutes or so.  That could be the
> result of long lock (but that output already dumps the lockdep stuff).

There may be some deadlocks that lockdep doesn't detect yet. Two examples:

1) spinlock <-> IPI dependency


CPU 0                                           CPU 1

spin_lock_irq(A)
smp_send_function_single_async(CPU 1, func)
                                                //IPI
                                                func {
                                                        spin_lock(A)
                                                }

But this should be resolved with a virtual lock on the IPI functions.
I should try that.

2) rwlock <-> IPI

CPU 0                                           CPU 1

read_lock(A)
                                                write_lock_irq(A)
smp_send_function_single(CPU 1, func)
                                                //IPI never happens

This one is much trickier.

Anyway those are the only scenarios I know of but there may be more. When
possible we want to extend lockdep to detect new deadlock scenarios, but we
don't have the guarantee that it can detect everything.

So, could be useful...
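
As an aside, the "virtual lock" idea mentioned under scenario 1 could be
sketched on top of lockdep's lock_map machinery, along these lines (an untested
sketch, not existing kernel code; the map name and helpers are made up):

/* One fake lockdep class representing "waiting for an IPI to complete" */
static struct lockdep_map ipi_wait_map =
	STATIC_LOCKDEP_MAP_INIT("ipi_wait", &ipi_wait_map);

/*
 * Sender side: acquire the fake lock around the send + wait, so lockdep
 * records "locks held here" -> ipi_wait dependencies.
 */
static void ipi_send_and_wait_annotated(int cpu, smp_call_func_t func, void *info)
{
	lock_map_acquire(&ipi_wait_map);
	smp_call_function_single(cpu, func, info, 1);
	lock_map_release(&ipi_wait_map);
}

/*
 * Receiver side: the IPI handler wraps its body the same way, so
 * "ipi_wait -> locks taken in the handler" dependencies are recorded too.
 */
static void ipi_handler_annotated(void *info)
{
	lock_map_acquire(&ipi_wait_map);
	/* real handler body, e.g. spin_lock(&A); ... */
	lock_map_release(&ipi_wait_map);
}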


Re: [PATCH] watchdog: print all locks on a softlock

2014-05-01 Thread Frederic Weisbecker
2014-05-01 22:09 GMT+02:00 Frederic Weisbecker :
> On Thu, May 01, 2014 at 03:17:20PM -0400, Don Zickus wrote:
>> On Thu, May 01, 2014 at 02:55:35PM -0400, Eric Paris wrote:
>> > If the CPU hits a softlockup this patch will also have it print the
>> > information about all locks being held on the system.  This might help
>> > determine if a lock is being held too long leading to this problem.
>>
>> I am not sure this helps you.  A softlockup is the result of pre-emption
>> disabled, ie the scheduler not being called after 60 seconds.  Holding a
>> lock does not disable pre-emption usually.  So I don't think this is going
>> to add anything.
>>
>> Are you trying to debug a hung task?  The the hung_task thread checks to
>> see if a task hasn't scheduled in 2 minutes or so.  That could be the
>> result of long lock (but that output already dumps the lockdep stuff).
>
> There may be some deadlocks that lockdep doesn't detect yet. Two examples:
>
> 1) spinlock <-> IPI dependency
>
>
> CPU 0                                           CPU 1
>
> spin_lock_irq(A)
> smp_send_function_single_async(CPU 1, func)
>                                                 //IPI
>                                                 func {
>                                                         spin_lock(A)
>                                                 }
>
> But this should be resolved with a virtual lock on the IPI functions.
> I should try that.

So actually this one above shouldn't be a problem because the _async
version doesn't wait for the IPI to complete. But the below still
looks possible.

>
> 2) rwlock <-> IPI
>
> CPU 0                                           CPU 1
>
> read_lock(A)
>                                                 write_lock_irq(A)
> smp_send_function_single(CPU 1, func)
>                                                 //IPI never happens
>
> This one is much trickier.
>
> Anyway those are the only scenarios I know of but there may be more. When
> possible we want to extend lockdep to detect new deadlock scenarios, but we
> don't have the guarantee that it can detect everything.
>
> So, could be useful...


[PATCH] tracing: Remove myself as a tracing maintainer

2014-05-03 Thread Frederic Weisbecker
It has been a while since I last sent a tracing patch. I always keep an
eye on tracing evolutions and contributions in general, but given how busy
I am with nohz, isolation and, more generally, core cleanup work, I seldom
have time left to provide deep reviews of tracing patches, or even reviews
to begin with.

I've been very lucky to start kernel development on a very young subsystem
with tons of low-hanging fruit back in 2008. Given that it deals with a lot
of tricky areas all around (sched, timers, irq, preemption, NMIs, SMP,
RCU, ...), I basically learned everything there.

Steve has been doing most of the incredible work these last years.
Thanks a lot!

Of course, consider me always available to help on tracing if hard days
ever happen.

Cc: Arnaldo Carvalho de Melo 
Cc: Ingo Molnar 
Cc: Jiri Olsa 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Cc: Thomas Gleixner 
Signed-off-by: Frederic Weisbecker 
---
 MAINTAINERS | 1 -
 1 file changed, 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 6dc67b1..b69be85 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9041,7 +9041,6 @@ F: drivers/char/tpm/
 
 TRACING
 M: Steven Rostedt 
-M: Frederic Weisbecker 
 M: Ingo Molnar 
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/core
 S: Maintained
-- 
1.8.3.1



Re: [Query]: tick-sched: why don't we stop tick when we are running idle task?

2014-04-11 Thread Frederic Weisbecker
On Fri, Apr 11, 2014 at 03:34:23PM +0530, Viresh Kumar wrote:
> On 10 April 2014 20:09, Frederic Weisbecker  wrote:
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 9f8af69..1e2d6b7 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -202,13 +202,16 @@ static void tick_nohz_restart_sched_tick(struct 
> > tick_sched *ts, ktime_t now);
> >  void __tick_nohz_full_check(void)
> >  {
> > struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
> > +   unsigned long flags;
> >
> > +   local_irq_save(flags);
> 
> As we need to disable interrupts to read this variable, would it be
> better to just remove this completely and use is_idle_task(current)
> instead?

I don't get what you mean. The goal was to get rid of the hammer is_idle_task()
check, wasn't it?

Also irqs are disabled here, but this is fundamentally not required, as this can
only be called from IPIs, which always have irqs disabled.

Hmm, I should add a WARN_ON_(!irqs_disabled()) though, just in case.

> 
> > if (tick_nohz_full_cpu(smp_processor_id())) {
> > -   if (ts->tick_stopped && !is_idle_task(current)) {
> > +   if (ts->tick_stopped && !ts->inidle) {
> > if (!can_stop_full_tick())
> > tick_nohz_restart_sched_tick(ts, 
> > ktime_get());
> > }
> > }
> > +   local_irq_restore(flags);
> >  }
> 
> > If you like it I'll push it to Ingo.
> 
> Yes please. And thanks for the explanations. It was pretty useful.
> 
> I am looking to offload 1 second tick to timekeeping CPUs and so
> going through these frameworks. I don't have a working solution yet
> (even partially :)). Would send a RFC to you as soon as I get anything
> working.

I see. The only solution I can think of right now is to have the timekeeper call
sched_class(current[$CPU])::scheduler_tick() on behalf of all full dynticks CPUs.

This sounds costly but can be done once per second for each CPU. Not sure if
Peterz will like it, but sending mockup RFC patches will tell us more about his
opinion :)

Otherwise (and ideally) we need to make the scheduler code able to handle long
periods without calling scheduler_tick(). But this is a lot more plumbing. And
the scheduler has gazillions of accounting details to handle. Sounds like a big
nightmare to take that direction.
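
A very rough sketch of that first option, assuming access to scheduler
internals (cpu_rq(), rq->lock, sched_class::task_tick()) and ignoring all the
details (untested, just to show the shape):

/*
 * Hypothetical helper run from the timekeeping CPU, roughly once per
 * second: run the scheduler tick of each full dynticks CPU remotely.
 */
static void sched_tick_remote_sketch(void)
{
	int cpu;

	for_each_cpu(cpu, tick_nohz_full_mask) {
		struct rq *rq = cpu_rq(cpu);
		struct task_struct *curr;
		unsigned long flags;

		raw_spin_lock_irqsave(&rq->lock, flags);
		curr = rq->curr;
		/* assumes task_tick() only touches rq-protected state */
		curr->sched_class->task_tick(rq, curr, 0);
		raw_spin_unlock_irqrestore(&rq->lock, flags);
	}
}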


Re: [Query]: tick-sched: can idle_active be false in tick_nohz_idle_exit()?

2014-04-11 Thread Frederic Weisbecker
On Fri, Apr 11, 2014 at 03:24:11PM +0530, Viresh Kumar wrote:
> On 10 April 2014 20:26, Frederic Weisbecker  wrote:
> > When a dynticks idle CPU is woken up (typically with an IPI), 
> > tick_nohz_stop_idle()
> > is called on interrupt entry but, because this is a waking up IPI, 
> > tick_nohz_start_idle()
> > won't be called. The reason is that need_resched() prevents 
> > tick_nohz_irq_exit() to be
> > called in irq_exit().
> >
> > After all if we know that the CPU is going to exit the idle task, we don't 
> > need to account
> > any more idle time. We also don't need to retry to enter in dynticks idle 
> > mode since we
> > are going to restart the tick with tick_nohz_idle_exit().
> >
> > So in case of wake up IPIs, we may end up with !ts->idle_active in 
> > tick_nohz_idle_exit() :)
> >
> > I must confess this is not obvious.
> 
> I agree.. I didn't had a clue that this can happen. Good that I asked
> this question :)
> 
> > It confused me as well when I met that part. A small
> > comment in tick_nohz_idle_exit() would be welcome ;)
> 
> Looks hard. I tried for a small comment and this is the smallest I could get:
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index c2d868d..26cf5bb 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -925,6 +925,22 @@ void tick_nohz_idle_exit(void)
> 
> ts->inidle = 0;
> 
> +   /*
> +* Can idle_active be false here?
> +* Ideally this would be the sequence of calls:
> +* - tick_nohz_idle_enter(), i.e. idle_active = true;
> +* - local_irq_disable()
> +* - IDLE
> +* - wake up due to IPI or other interrupt
> +* - local_irq_enable()
> +* - tick_nohz_irq_enter(), i.e. idle_active = false;
> +* - tick_nohz_irq_exit(), i.e. idle_active = true; This is not called
> +*   in case of IPI's as need_resched() will prevent that in
> +*   tick_irq_exit(), as we don't need to account any more for idle 
> time
> +*   or try to enter dyntics mode (We are going to exit idle state).
> +*
> +* - tick_nohz_idle_exit()
> +*/
> if (ts->idle_active || ts->tick_stopped)
> now = ktime_get();

I'm sure we can leave out a lot of uninteresting details there. Much of what
you describe can be guessed from a simple review of the code. Let's rather
focus on the tricky parts.

How about something like:

	/*
	 * ts->idle_active can be 0 if a wake up IPI pulled us out of idle mode.
	 * When that happens we know we're exiting the idle task. Pending idle
	 * sleep time is flushed on irq entry, then no more is accounted
	 * afterward. The need_resched() check on irq_exit() prevents any
	 * further accounting.
	 */

> 
> 
> I am preparing a cleanup patchset (separate from the timer/hrtimers 36 patch
> set) for kernel/time/ and will add this change to that.. I am waiting
> for the merge
> window to close and Thomas to comment on my timers/hrtimers patches first.
> Only then will I send another 40 :)

Thanks! Note you can post before the merge window closes. Reviews are possible
anytime :)

> 
> --
> viresh


Re: [PATCH] sched/core: fix formatting issues in sched_can_stop_tick()

2014-04-14 Thread Frederic Weisbecker
On Mon, Apr 14, 2014 at 08:38:38PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 14, 2014 at 09:47:41PM +0530, Viresh Kumar wrote:
> > sched_can_stop_tick() was using 7 spaces instead of 8 spaces or a 'tab' at
> > the beginning of each line. Which doesn't align with the Coding Guidelines.
> > 
> > Also it removes the *rq variable as it was used at only one place and hence
> > we can directly use this_rq() instead.
> > 
> > Signed-off-by: Viresh Kumar 
> > ---
> > I don't think rq = this_rq() has to be done before smp_mb(), in case yes
> > sorry for this patch :(
> > 
> >  kernel/sched/core.c | 16 ++--
> >  1 file changed, 6 insertions(+), 10 deletions(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 268a45e..13299c5 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -666,18 +666,14 @@ static inline bool got_nohz_idle_kick(void)
> >  #ifdef CONFIG_NO_HZ_FULL
> >  bool sched_can_stop_tick(void)
> >  {
> > +   /* Make sure rq->nr_running update is visible after the IPI */
> > +   smp_rmb();
> >  
> > +   /* More than one running task need preemption */
> > +   if (this_rq()->nr_running > 1)
> > +   return false;
> >  
> > +   return true;
> >  }
> >  #endif /* CONFIG_NO_HZ_FULL */
> 
> AFAICT the smp_rmb() is entirely spurious, arch interrupts should ensure
> consistency on their own. That is:
> 
>   CPU 0                     CPU 1
> 
> [w] X = 1
> IPI 1
>                             [r] r = X
> 
> Should act as if there was a full memory barrier, making it so that the
> read on CPU1 observes the write on CPU0.

Right, I have a pending patch for that:
https://git.kernel.org/cgit/linux/kernel/git/frederic/linux-dynticks.git/commit/?h=nohz/ipi&id=ca981d9f87fe0f113ad972098cfe181180b3675a


Re: [GIT PULL v2] nohz: Move nohz kick out of scheduler IPI

2014-04-14 Thread Frederic Weisbecker
On Mon, Apr 14, 2014 at 01:22:17PM +0200, Ingo Molnar wrote:
> 
> * Frederic Weisbecker  wrote:
> 
> > On Thu, Apr 03, 2014 at 06:17:10PM +0200, Frederic Weisbecker wrote:
> > > Ingo, Thomas,
> > > 
> > > Please pull the timers/nohz-ipi-for-tip-v3 branch that can be found at:
> > > 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > >   timers/nohz-ipi-for-tip-v3
> > 
> > Ping?
> 
> So this came a bit late - will it be fine for v3.16 as-is?

It was late on purpose due to dependencies on the block tree. But nevermind.

It's still fine to be pulled as is after -rc1 for 3.16.

Thanks!
 
> Thanks,
> 
>   Ingo


Re: [PATCH 24/38] tick-sched: don't check tick_nohz_full_cpu() in __tick_nohz_task_switch()

2014-04-14 Thread Frederic Weisbecker
On Mon, Apr 14, 2014 at 09:53:46PM +0530, Viresh Kumar wrote:
> __tick_nohz_task_switch() calls tick_nohz_full_kick(), which is already 
> checking
> tick_nohz_full_cpu() and so we don't need to repeat the same check here.
> 
> Remove it.
> 
> Signed-off-by: Viresh Kumar 

Ack.

> ---
>  kernel/time/tick-sched.c | 4 
>  1 file changed, 4 deletions(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 256f4a3..5a99859 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -261,13 +261,9 @@ void __tick_nohz_task_switch(struct task_struct *tsk)
>  
>   local_irq_save(flags);
>  
> - if (!tick_nohz_full_cpu(smp_processor_id()))
> - goto out;
> -
>   if (tick_nohz_tick_stopped() && !can_stop_full_tick())
>   tick_nohz_full_kick();
>  
> -out:
>   local_irq_restore(flags);
>  }
>  
> -- 
> 1.7.12.rc2.18.g61b472e
> 


Re: [PATCH 28/38] tick-sched: remove parameters to {__}tick_nohz_task_switch() routines

2014-04-14 Thread Frederic Weisbecker
On Mon, Apr 14, 2014 at 09:53:50PM +0530, Viresh Kumar wrote:
> tick_nohz_task_switch() and __tick_nohz_task_switch() routines get task_struct
> passed to them (always for the 'current' task), but they never use it. Remove
> it.
> 
> Signed-off-by: Viresh Kumar 

Ack.


Re: [PATCH 29/38] tick-sched: remove wrapper around __tick_nohz_task_switch()

2014-04-14 Thread Frederic Weisbecker
On Mon, Apr 14, 2014 at 09:53:51PM +0530, Viresh Kumar wrote:
> __tick_nohz_task_switch() was called only from tick_nohz_task_switch() and 
> there
> is nothing much in tick_nohz_task_switch() as well. IOW, we don't need
> unnecessary wrapper over __tick_nohz_task_switch() to be there. Merge all code
> from __tick_nohz_task_switch() into tick_nohz_task_switch() and move it to
> tick-sched.c.
> 
> This also moves check for tick_nohz_tick_stopped() outside of irq_save()
> context.

No, the wrapper is there on purpose, in order to optimize the full dynticks off
case in the context switch path with the jump-labeled check on
tick_nohz_full_enabled().


Re: [PATCH 30/38] tick-sched: move nohz_full_buf[] inside tick_nohz_init()

2014-04-14 Thread Frederic Weisbecker
On Mon, Apr 14, 2014 at 09:53:52PM +0530, Viresh Kumar wrote:
> nohz_full_buf[] is used at only one place, i.e. inside tick_nohz_init(). Make 
> it
> a local variable. Can move it out in case it is used in some other routines in
> future.

OTOH nohz_full_buf can have a big size, and moving it to a local variable
will result in a big increase in kernel stack consumption.

Keeping it static is safer.

Also it's already __initdata, so it's released after bootup. There is no issue
of concurrent use either, so keeping it as is is fine.

> 
> Signed-off-by: Viresh Kumar 
> ---
>  kernel/time/tick-sched.c | 13 ++---
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index d8b9a69..22b9505 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -321,13 +321,6 @@ static int tick_nohz_cpu_down_callback(struct 
> notifier_block *nfb,
>   return NOTIFY_OK;
>  }
>  
> -/*
> - * Worst case string length in chunks of CPU range seems 2 steps
> - * separations: 0,2,4,6,...
> - * This is NR_CPUS + sizeof('\0')
> - */
> -static char __initdata nohz_full_buf[NR_CPUS + 1];
> -
>  static int tick_nohz_init_all(void)
>  {
>   int err = -1;
> @@ -347,6 +340,12 @@ static int tick_nohz_init_all(void)
>  
>  void __init tick_nohz_init(void)
>  {
> + /*
> +  * Worst case string length in chunks of CPU range seems 2 steps
> +  * separations: 0,2,4,6,...
> +  * This is NR_CPUS + sizeof('\0')
> +  */
> + char nohz_full_buf[NR_CPUS + 1];
>   int cpu;
>  
>   if (!tick_nohz_full_running) {
> -- 
> 1.7.12.rc2.18.g61b472e
> 


Re: [PATCH v2 0/3] tracing: syscall_*regfunc() fixes

2014-04-14 Thread Frederic Weisbecker
On Sun, Apr 13, 2014 at 08:58:28PM +0200, Oleg Nesterov wrote:
> On 04/11, Oleg Nesterov wrote:
> >
> > On 04/11, Steven Rostedt wrote:
> > >
> > > Are you going to send a new series?
> >
> > Yes, will do. I will split 1/2, and I need to update the changelog
> > in 2/2.
> 
> Please see the patches.
> 
> Frederic! I am sorry, I decided to steal your patch to simplify the
> merging and avoid too many changes in the trivial syscall_*regfunc's.
> In case you do not object, could you add your sob into 2/3 ?

Sure, feel free to add it!

Thanks!

> Otherwise
> I can wait for your patch, then send v3 on top of it.
> 
> 
> 
> Hmm. This is off-topic, but CLONE_KERNEL should die, and kernel_init()
> should be spawned without CLONE_SIGHAND. I'll send another patch.
> 
> Oleg.
> 


Re: [PATCH v2 1/3] tracing: fix syscall_*regfunc() vs copy_process() race

2014-04-14 Thread Frederic Weisbecker
On Sun, Apr 13, 2014 at 08:58:54PM +0200, Oleg Nesterov wrote:
> syscall_regfunc() and syscall_unregfunc() should set/clear
> TIF_SYSCALL_TRACEPOINT system-wide, but do_each_thread() can race
> with copy_process() and miss the new child which was not added to
> the process/thread lists yet.
> 
> Change copy_process() to update the child's TIF_SYSCALL_TRACEPOINT
> under tasklist.
> 
> Signed-off-by: Oleg Nesterov 

Acked-by: Frederic Weisbecker 

> ---
>  include/trace/syscall.h |   15 +++
>  kernel/fork.c   |2 ++
>  2 files changed, 17 insertions(+), 0 deletions(-)
> 
> diff --git a/include/trace/syscall.h b/include/trace/syscall.h
> index fed853f..291c282 100644
> --- a/include/trace/syscall.h
> +++ b/include/trace/syscall.h
> @@ -4,6 +4,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> @@ -32,4 +33,18 @@ struct syscall_metadata {
>   struct ftrace_event_call *exit_event;
>  };
>  
> +#ifdef CONFIG_TRACEPOINTS
> +static inline void syscall_tracepoint_update(struct task_struct *p)
> +{
> + if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
> + set_tsk_thread_flag(p, TIF_SYSCALL_TRACEPOINT);
> + else
> + clear_tsk_thread_flag(p, TIF_SYSCALL_TRACEPOINT);
> +}
> +#else
> +static inline void syscall_tracepoint_update(struct task_struct *p)
> +{
> +}
> +#endif
> +
>  #endif /* _TRACE_SYSCALL_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a17621c..2d55a47 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1484,7 +1484,9 @@ static struct task_struct *copy_process(unsigned long 
> clone_flags,
>  
>   total_forks++;
>   spin_unlock(&current->sighand->siglock);
> + syscall_tracepoint_update(p);
>   write_unlock_irq(&tasklist_lock);
> +
>   proc_fork_connector(p);
>   cgroup_post_fork(p);
>   if (clone_flags & CLONE_THREAD)
> -- 
> 1.5.5.1
> 
> 


Re: [PATCH 29/38] tick-sched: remove wrapper around __tick_nohz_task_switch()

2014-04-15 Thread Frederic Weisbecker
On Tue, Apr 15, 2014 at 10:15:24AM +0530, Viresh Kumar wrote:
> On 15 April 2014 04:52, Frederic Weisbecker  wrote:
> > On Mon, Apr 14, 2014 at 09:53:51PM +0530, Viresh Kumar wrote:
> >> __tick_nohz_task_switch() was called only from tick_nohz_task_switch() and 
> >> there
> >> is nothing much in tick_nohz_task_switch() as well. IOW, we don't need
> >> unnecessary wrapper over __tick_nohz_task_switch() to be there. Merge all 
> >> code
> >> from __tick_nohz_task_switch() into tick_nohz_task_switch() and move it to
> >> tick-sched.c.
> >>
> >> This also moves check for tick_nohz_tick_stopped() outside of irq_save()
> >> context.
> >
> > No, the wrapper is there on purpose, in order to optimize the full dynticks
> > off case in the context switch path with the jump-labeled check on
> > tick_nohz_full_enabled().
> 
> Just to clarify, you are saying that:
> 
> Wrapper was there to save an extra function call when tick_nohz_full_enabled()
> returns false, as tick_nohz_task_switch() will be inlined ?

Yeah. But not just that.

Using an inline saves a function call and reduces the offline case to a simple
condition check. But there is also the jump label that reduces the condition
check to an unconditional jump in the off case.

To summarize, here's how calling tick_nohz_task_switch() maps to final C code:

finish_task_switch()
{
	//do things before calling tick_nohz_task_switch()...
	// call tick_nohz_task_switch
	goto offcase;
	if (tick_nohz_full_enabled())
		__tick_nohz_task_switch(tsk);
offcase:
	//end of call to tick_nohz_task_switch
	//do things after calling tick_nohz_task_switch()...
}

In the off case, the code is like the above: we don't even do the check; thanks
to the jump label code we unconditionally jump to what's next in
finish_task_switch() (there is actually nothing afterward, but that's for the
picture).

Now if there is at least one CPU that is full dynticks on boot, it is enabled
with context_tracking_cpu_set(). Then the jump label code patches the code in
finish_task_switch() to turn the goto offcase into a nop. Then the condition is
actually verified on every call to finish_task_switch().

So it goes beyond just saving a function call.
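
In actual code the pattern boils down to something like this (a simplified
sketch: upstream the check goes through the context tracking static key and
the exact definitions in tick.h differ; the key name below is made up):

/*
 * False by default; enabling it (e.g. from context_tracking_cpu_set())
 * patches every static_key_false() site from a nop into a jump.
 */
extern struct static_key nohz_full_key;
extern bool tick_nohz_full_running;

static inline bool tick_nohz_full_enabled(void)
{
	/* compiles to a nop falling through to "return false" while the key is off */
	if (!static_key_false(&nohz_full_key))
		return false;
	return tick_nohz_full_running;
}

static inline void tick_nohz_task_switch(struct task_struct *tsk)
{
	if (tick_nohz_full_enabled())
		__tick_nohz_task_switch(tsk);	/* out of line, in tick-sched.c */
}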

> 
> In this case probably we can move !can_stop_full_tick() as well to the 
> wrapper ?

Do you mean moving all the code of __tick_nohz_task_switch() to 
tick_nohz_task_switch()?
I much prefer we don't do that. This is going to make can_stop_full_tick() a 
publicly
visible nohz internal. And it may uglify tick.h as well.


Re: [Query]: tick-sched: why don't we stop tick when we are running idle task?

2014-04-15 Thread Frederic Weisbecker
On Mon, Apr 14, 2014 at 02:06:00PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 14, 2014 at 05:22:30PM +0530, Viresh Kumar wrote:
> > On 14 April 2014 17:17, Peter Zijlstra  wrote:
> > > What causes this tick? I was under the impression that once there's a
> > > single task (not doing any syscalls) and the above issues are sorted, no
> > > more tick would happen.
> > 
> > This is what Frederic told me earlier:
> > 
> > https://lkml.org/lkml/2014/2/13/238
> 
> That's a bit of a non-answer. I'm fairly sure its not a gazillion
> issues, since the actual scheduler tick doesn't actually do that much.
> 
> So start by enumerating what is actually required.

Ok, I'm a bit busy with a conference right now but I'm going to summarize that
soonish.

> 
> The 2), which I suppose you're now trying to implement is I think
> entirely the wrong way. The tick really assumes it runs local, moving it
> to another CPU is insane.

There are probably a few things that assume local calls, but last time I
checked I had the impression that it was fairly possible to call
sched_class::task_tick() remotely: rq is locked, there is no reference to
"current", it uses rq accessors...

OTOH scheduler_tick() itself definitely requires local calls.


Re: [PATCH] workqueue: allow changing attributions of ordered workqueue

2014-04-15 Thread Frederic Weisbecker
On Tue, Apr 15, 2014 at 05:58:08PM +0800, Lai Jiangshan wrote:
> From 534f1df8a5a03427b0fc382150fbd34e05648a28 Mon Sep 17 00:00:00 2001
> From: Lai Jiangshan 
> Date: Tue, 15 Apr 2014 17:52:19 +0800
> Subject: [PATCH] workqueue: allow changing attributions of ordered workqueue
> 
> Allow changing ordered workqueue's cpumask
> Allow changing ordered workqueue's nice value
> Allow registering ordered workqueue to SYSFS
> 
> Still disallow changing ordered workqueue's max_active which breaks ordered 
> guarantee
> Still disallow changing ordered workqueue's no_numa which breaks ordered 
> guarantee
> 
> Changing attributions will introduce new pwqs in a given workqueue, and
> old implement doesn't have any solution to handle multi-pwqs on ordered 
> workqueues,
> so changing attributions for ordered workqueues is disallowed.
> 
> After I investigated several solutions which try to handle multi-pwqs on 
> ordered workqueues,
> I found the solution which reuse max_active are the simplest.
> In this solution, the new introduced pwq's max_active is kept as *ZERO* until 
> it become
> the oldest pwq. Thus only one (the oldest) pwq is active, and ordered is 
> guarantee.
> 
> This solution forces ordered on higher level while the non-reentrancy is also
> forced. so we don't need to queue any work to its previous pool. And we 
> shouldn't
> do it. we must disallow any work repeatedly requeues itself back-to-back
> which keeps the oldest pwq stay and make newest(if have) pwq 
> starvation(non-active for ever).
> 
> Build test only.
> This patch depends on previous patch:
> workqueue: add __WQ_FREEZING and remove POOL_FREEZING
> 
> Frederic:
>   You can pick this patch to your updated patchset before
>   TJ apply it.
> 
> Signed-off-by: Lai Jiangshan 

So IIUC this is to allow registering ordered workqueues as WQ_SYSFS? But I think
we are going to follow Tejun's suggestion to have a low-level cpumask which
defines the limit for all unbound workqueues. This indeed tremendously
simplifies everything. I'll post the series soon.

But maybe I'm missing other requirements that are fixed by your patch?

Thanks.


Re: [PATCH 29/38] tick-sched: remove wrapper around __tick_nohz_task_switch()

2014-04-15 Thread Frederic Weisbecker
On Tue, Apr 15, 2014 at 03:23:37PM +0530, Viresh Kumar wrote:
> On 15 April 2014 14:43, Frederic Weisbecker  wrote:
> > Yeah. But not just that.
> >
> > Using an inline saves a function call and reduce the offline case to a 
> > simple
> > condition check. But there is also the jump label that reduce the condition 
> > check
> > to an unconditional jump in the off case.
> >
> > To summarize, here's how calling tick_nohz_task_switch() maps to final C 
> > code:
> >
> > finish_task_switch()
> > {
> >//do things before calling tick_nohz_task_switch()...
> >// call tick_nohz_task_switch
> >goto offcase;
> >if (tick_nohz_full_enabled())
> >__tick_nohz_task_switch(tsk);
> > offcase:
> >   //end of call to tick_nohz_task_switch
> >   //do things before calling tick_nohz_task_switch()...
> > }
> >
> > In the offcase, the code is like above. We don't even do the check, thanks 
> > to
> > the jump label code we unconditionally jump to what's next in 
> > finish_task_switch()
> > (there is actually nothing afterward but that's for the picture).
> >
> > Now if there is at least a CPU that is full dynticks on boot, it is enabled
> > with context_tracking_cpu_set(). Then the jump label code patches the code 
> > in
> > finish_task_switch() to turn the goto offcase into a nop. Then the 
> > condition is
> > actually verified on every call to finish_task_switch().
> >
> > So it goes beyond than just saving a function call.
> 
> Sorry, but my poor mind still couldn't understand what you are trying to
> tell me :(

Welcome to the club of the daily confused people.
I'm happy to hear I'm not alone :)

> 
> So lets clarify things one by one :)
> 
> - What do you mean by offcase? CONFIG_NO_HZ_FULL not configured
> into the kernel or it is configured but none of the CPUs is running in that
> mode?

So by offcase I mean CONFIG_NO_HZ_FULL=y but the nohz_full boot parameter
is empty, or simply not passed at all. And of course CONFIG_NO_HZ_FULL_ALL=n.

This config is now likely on some distros because we want to make full
dynticks available for users who want it. But if it's not used (which is 99.999%
of the use cases), we want to minimize its overhead as much as possible.

Let's call that the dynamic off case.

> 
> - Also what does it correspond to in code: goto offcase; ? There is no labels
> or goto statements in code that I can see.. This is how the code looks to me.
> 
> > finish_task_switch()
> > {
> >//do things before calling tick_nohz_task_switch()...
> >// call tick_nohz_task_switch
> >if (tick_nohz_full_enabled())
> >__tick_nohz_task_switch(tsk);
> > }

Sure, but check out the static_key_false() in the implementation of
tick_nohz_full_enabled(). That's where the magic hides.

> 
> __tick_nohz_task_switch() may or maynot be available at all depending
> on CONFIG_NO_HZ_FULL is enabled into the kernel or not. But that
> was the case with tick_nohz_task_switch() as well in my patch. So
> shouldn't make a difference..
> 
> Again, sorry for not understanding what you are trying to explain here.
> I want to understand this once and for all and probably add a comment
> here as well :)

No problem, the jump label/static key code is quite tricky. And its use
can be easily missed, as in here.

Also, its infamous API naming (static_key_true()/static_key_false()) is
anything but intuitive.
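
For the record, the old API pattern being discussed is, in a minimal generic
form (names made up for the example):

/* Default-off feature: every static_key_false() site starts out as a nop */
static struct static_key my_feature_key = STATIC_KEY_INIT_FALSE;

static inline bool my_feature_enabled(void)
{
	/*
	 * "false" means "the key is expected to be false": the enabled
	 * branch is out of line and costs nothing until the key is flipped.
	 */
	return static_key_false(&my_feature_key);
}

static void my_feature_enable(void)
{
	/* patches all the nops into jumps at runtime */
	static_key_slow_inc(&my_feature_key);
}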

> 
> --
> viresh


Re: [RFC PATCH 0/2] kpatch: dynamic kernel patching

2014-05-05 Thread Frederic Weisbecker
On Mon, May 05, 2014 at 08:26:38AM -0500, Josh Poimboeuf wrote:
> On Mon, May 05, 2014 at 10:55:37AM +0200, Ingo Molnar wrote:
> > 
> > * Josh Poimboeuf  wrote:
> > 
> > > [...]
> > > 
> > > kpatch checks the backtraces of all tasks in stop_machine() to 
> > > ensure that no instances of the old function are running when the 
> > > new function is applied.  I think the biggest downside of this 
> > > approach is that stop_machine() has to idle all other CPUs during 
> > > the patching process, so it inserts a small amount of latency (a few 
> > > ms on an idle system).
> > 
> > When live patching the kernel, how about achieving an even 'cleaner' 
> > state for all tasks in the system: to freeze all tasks, as the suspend 
> > and hibernation code (and kexec) does, via freeze_processes()?
> > 
> > That means no tasks in the system have any real kernel execution 
> > state, and there's also no problem with long-sleeping tasks, as 
> > freeze_processes() is supposed to be fast as well.
> > 
> > I.e. go for the most conservative live patching state first, and relax 
> > it only once the initial model is upstream and is working robustly.
> 
> I had considered doing this before, but the problem I found is that many
> kernel threads are unfreezable.  So we wouldn't be able to check whether
> its safe to replace any functions in use by those kernel threads.

OTOH many kernel threads are parkable, which achieves a similar desired
behaviour: the kernel threads then aren't running.

And in fact we could implement freezing on top of park for kthreads.

But unfortunately there are still quite a few of them which don't support
parking.
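
For illustration, a park-aware kthread main loop looks roughly like this (a
minimal sketch; real users typically go through the smpboot helpers instead):

static int my_kthread_fn(void *arg)
{
	while (!kthread_should_stop()) {
		if (kthread_should_park()) {
			/* tear down live state (timers, events, ...) here */
			kthread_parkme();	/* sleeps until kthread_unpark() */
			/* re-initialize that state here before resuming */
			continue;
		}
		/* ... do the actual periodic work ... */
		schedule_timeout_interruptible(HZ);
	}
	return 0;
}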


Re: [tip:timers/nohz] nohz: Move full nohz kick to its own IPI

2014-05-05 Thread Frederic Weisbecker
On Mon, May 05, 2014 at 02:37:06PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 16, 2014 at 12:40:01AM -0700, tip-bot for Frederic Weisbecker 
> wrote:
> > Commit-ID:  72aacf0259bb7d53b7a3b5b2f7bf982acaa52b61
> > Gitweb: 
> > http://git.kernel.org/tip/72aacf0259bb7d53b7a3b5b2f7bf982acaa52b61
> > Author: Frederic Weisbecker 
> > AuthorDate: Tue, 18 Mar 2014 21:12:53 +0100
> > Committer:  Frederic Weisbecker 
> > CommitDate: Thu, 3 Apr 2014 18:05:21 +0200
> > 
> > nohz: Move full nohz kick to its own IPI
> > 
> > Now that we have smp_queue_function_single() which can be used to
> > safely queue IPIs when interrupts are disabled and without worrying
> > about concurrent callers, lets use it for the full dynticks kick to
> > notify a CPU that it's exiting single task mode.
> > 
> > This unbloats a bit the scheduler IPI that the nohz code was abusing
> > for its cool "callable anywhere/anytime" properties.
> > 
> > Reviewed-by: Paul E. McKenney 
> > Cc: Andrew Morton 
> > Cc: Ingo Molnar 
> > Cc: Jens Axboe 
> > Cc: Kevin Hilman 
> > Cc: Paul E. McKenney 
> > Cc: Peter Zijlstra 
> > Cc: Thomas Gleixner 
> > Signed-off-by: Frederic Weisbecker 
> 
> So I suspect this is the patch that makes Ingo's machines unhappy, they
> appear to get stuck thusly:
> 
> [10513.382910] RIP: 0010:[]  [] 
> generic_exec_single+0x9a/0x180
> 
> [10513.481704]  [] smp_queue_function_single+0x42/0xa0
> [10513.488251]  [] tick_nohz_full_kick_cpu+0x50/0x80
> [10513.494661]  [] enqueue_task_fair+0x59e/0x6c0
> [10513.506469]  [] enqueue_task+0x3a/0x60
> [10513.511836]  [] __migrate_task+0x123/0x150
> [10513.523535]  [] migration_cpu_stop+0x1d/0x30
> [10513.529401]  [] cpu_stopper_thread+0x70/0x120
> 
> I'm not entirely sure how yet, but this is by far the most likely
> candidate. Ingo, if you still have the vmlinuz matching this trace (your
> hang2.txt) could you have a peek where that RIP lands?
> 
> If that is indeed the csd_lock() function, then this is it and
> something's buggered.

Aye!

> 
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index c9007f2..4771063 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1225,7 +1225,7 @@ static inline void inc_nr_running(struct rq *rq)
> > if (tick_nohz_full_cpu(rq->cpu)) {
> > /* Order rq->nr_running write against the IPI */
> > smp_wmb();
> 
> FWIW that barrier is complete crap ;-)

Yeah, I'm queing the removal of that now :)

> 
> > -   smp_send_reschedule(rq->cpu);
> > +   tick_nohz_full_kick_cpu(rq->cpu);
> > }
> > }
> >  #endif
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 9f8af69..582d3f6 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -230,6 +230,27 @@ void tick_nohz_full_kick(void)
> > irq_work_queue(&__get_cpu_var(nohz_full_kick_work));
> >  }
> >  
> > +static void nohz_full_kick_queue(struct queue_single_data *qsd)
> > +{
> > +   __tick_nohz_full_check();
> > +}
> > +
> > +static DEFINE_PER_CPU(struct queue_single_data, nohz_full_kick_qsd) = {
> > +   .func = nohz_full_kick_queue,
> > +};
> > +
> > +void tick_nohz_full_kick_cpu(int cpu)
> > +{
> > +   if (!tick_nohz_full_cpu(cpu))
> > +   return;
> > +
> > +   if (cpu == smp_processor_id()) {
> > +   irq_work_queue(&__get_cpu_var(nohz_full_kick_work));
> > +   } else {
> > +   smp_queue_function_single(cpu, &per_cpu(nohz_full_kick_qsd, 
> > cpu));
> > +   }
> > +}
> 
> Should we instead do irq_work_queue_on() ?

I would really much prefer that, yeah. But if we do that, expect some added
overhead on the local irq_work_queue() path though: irq_work_raise can't use
local cmpxchg ops anymore.

Or we can have a different pending raise system for remote irq work.

I can try something.


Re: [tip:timers/nohz] nohz: Move full nohz kick to its own IPI

2014-05-05 Thread Frederic Weisbecker
On Mon, May 05, 2014 at 03:31:13PM +0200, Peter Zijlstra wrote:
> On Mon, May 05, 2014 at 02:37:06PM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 16, 2014 at 12:40:01AM -0700, tip-bot for Frederic Weisbecker 
> > wrote:
> > > Commit-ID:  72aacf0259bb7d53b7a3b5b2f7bf982acaa52b61
> > > Gitweb: 
> > > http://git.kernel.org/tip/72aacf0259bb7d53b7a3b5b2f7bf982acaa52b61
> > > Author: Frederic Weisbecker 
> > > AuthorDate: Tue, 18 Mar 2014 21:12:53 +0100
> > > Committer:  Frederic Weisbecker 
> > > CommitDate: Thu, 3 Apr 2014 18:05:21 +0200
> > > 
> > > nohz: Move full nohz kick to its own IPI
> > > 
> > > Now that we have smp_queue_function_single() which can be used to
> > > safely queue IPIs when interrupts are disabled and without worrying
> > > about concurrent callers, lets use it for the full dynticks kick to
> > > notify a CPU that it's exiting single task mode.
> > > 
> > > This unbloats a bit the scheduler IPI that the nohz code was abusing
> > > for its cool "callable anywhere/anytime" properties.
> > > 
> > > Reviewed-by: Paul E. McKenney 
> > > Cc: Andrew Morton 
> > > Cc: Ingo Molnar 
> > > Cc: Jens Axboe 
> > > Cc: Kevin Hilman 
> > > Cc: Paul E. McKenney 
> > > Cc: Peter Zijlstra 
> > > Cc: Thomas Gleixner 
> > > Signed-off-by: Frederic Weisbecker 
> > 
> > So I suspect this is the patch that makes Ingo's machines unhappy, they
> > appear to get stuck thusly:
> > 
> > [10513.382910] RIP: 0010:[]  [] 
> > generic_exec_single+0x9a/0x180
> > 
> > [10513.481704]  [] smp_queue_function_single+0x42/0xa0
> > [10513.488251]  [] tick_nohz_full_kick_cpu+0x50/0x80
> > [10513.494661]  [] enqueue_task_fair+0x59e/0x6c0
> > [10513.506469]  [] enqueue_task+0x3a/0x60
> > [10513.511836]  [] __migrate_task+0x123/0x150
> > [10513.523535]  [] migration_cpu_stop+0x1d/0x30
> > [10513.529401]  [] cpu_stopper_thread+0x70/0x120
> > 
> > I'm not entirely sure how yet, but this is by far the most likely
> > candidate. Ingo, if you still have the vmlinuz matching this trace (your
> > hang2.txt) could you have a peek where that RIP lands?
> > 
> > If that is indeed the csd_lock() function, then this is it and
> > something's buggered.
> 
> On a kernel build from your .config the +0x9a is indeed very close to
> that wait loop; of course 0x9a isn't even an instruction boundary for me
> so its all a bit of a guess.

Note the current ordering:

cmpxchg(&qsd->pending, 0, 1)            get ipi
csd_lock(qsd->csd)                      xchg(&qsd->pending, 1)
send ipi                                csd_unlock(qsd->csd)


So there shouldn't be racing updaters. Also the IPI sender shouldn't
race with the IPI receiver: the updater should always eventually see
the unlock happening.

OTOH the ordering above is anything but intuitive. In fact csd_lock()
is in the way here: smp_queue_function_single() doesn't need it at
all. So it's just yet another risk for a deadlock.

One more reason why I would much prefer irq_work_queue_on(). But
I'm also not very happy with the possible result, since we are going
to maintain two different API sets doing almost the same thing with
just slightly different constraints or semantics :-(


Re: [tip:timers/nohz] nohz: Move full nohz kick to its own IPI

2014-05-05 Thread Frederic Weisbecker
On Mon, May 05, 2014 at 04:58:15PM +0200, Peter Zijlstra wrote:
> On Mon, May 05, 2014 at 04:52:59PM +0200, Frederic Weisbecker wrote:
> > > Should we instead do irq_work_queue_on() ?
> > 
> > I would really much prefer that yeah. But if we do that, expect some added 
> > overhead on the local
> > irq_work_queue() path though. irq_work_raise can't use local cmpxchg ops 
> > anymore.
> > 
> > Or we can have a different pending raise system for remote irq work.
> > 
> > I can try something.
> 
> Losing that local cmpxchg shouldn't be a problem, I don't think this is
> a particularly hot path.

Then the conversion is easy since most of the irq work code should
already work for remote queuing.

I'll come up with a patch soon.
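
Roughly, the conversion could look like this (a sketch only: irq_work_queue_on()
doesn't exist yet and the handler name is assumed):

static void nohz_full_kick_work_func(struct irq_work *work)
{
	__tick_nohz_full_check();
}

static DEFINE_PER_CPU(struct irq_work, nohz_full_kick_work) = {
	.func = nohz_full_kick_work_func,
};

void tick_nohz_full_kick_cpu(int cpu)
{
	if (!tick_nohz_full_cpu(cpu))
		return;
	/* hypothetical remote-queueing helper discussed above */
	irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
}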

Thanks.


Re: [tip:timers/nohz] nohz: Move full nohz kick to its own IPI

2014-05-05 Thread Frederic Weisbecker
On Mon, May 05, 2014 at 05:12:28PM +0200, Peter Zijlstra wrote:
> > Note the current ordering:
> > 
> > cmpxchg(&qsd->pending, 0, 1)   get ipi
> > csd_lock(qsd->csd) xchg(&qsd->pending, 1)
> > send ipi   csd_unlock(qsd->csd)
> > 
> > 
> > So there shouldn't be racing updaters. Also ipi sender shouldn't
> > race with ipi receiver, the update shouldn't always eventually see
> > the unlock happening.
> 
> Yeah, I've not spotted how this particular train wreck happens either.
> 
> The problem is reproduction, it took me 9 hours to confirm I could
> reproduce the problem on my machine. So how long to I run it with this
> patch reverted to show its gone..

Maybe it is favoured by CPU hotplug. Anyway, converting to irq_work should
fix it.


Re: [RFC PATCH 0/2] kpatch: dynamic kernel patching

2014-05-05 Thread Frederic Weisbecker
On Mon, May 05, 2014 at 08:43:04PM +0200, Ingo Molnar wrote:
> 
> * Frederic Weisbecker  wrote:
> 
> > On Mon, May 05, 2014 at 08:26:38AM -0500, Josh Poimboeuf wrote:
> > > On Mon, May 05, 2014 at 10:55:37AM +0200, Ingo Molnar wrote:
> > > > 
> > > > * Josh Poimboeuf  wrote:
> > > > 
> > > > > [...]
> > > > > 
> > > > > kpatch checks the backtraces of all tasks in stop_machine() to 
> > > > > ensure that no instances of the old function are running when the 
> > > > > new function is applied.  I think the biggest downside of this 
> > > > > approach is that stop_machine() has to idle all other CPUs during 
> > > > > the patching process, so it inserts a small amount of latency (a few 
> > > > > ms on an idle system).
> > > > 
> > > > When live patching the kernel, how about achieving an even 'cleaner' 
> > > > state for all tasks in the system: to freeze all tasks, as the suspend 
> > > > and hibernation code (and kexec) does, via freeze_processes()?
> > > > 
> > > > That means no tasks in the system have any real kernel execution 
> > > > state, and there's also no problem with long-sleeping tasks, as 
> > > > freeze_processes() is supposed to be fast as well.
> > > > 
> > > > I.e. go for the most conservative live patching state first, and relax 
> > > > it only once the initial model is upstream and is working robustly.
> > > 
> > > I had considered doing this before, but the problem I found is 
> > > that many kernel threads are unfreezable.  So we wouldn't be able 
> > > to check whether its safe to replace any functions in use by those 
> > > kernel threads.
> > 
> > OTOH many kernel threads are parkable. Which achieves kind of 
> > similar desired behaviour: the kernel threads then aren't running.
> > 
> > And in fact we could implement freezing on top of park for kthreads.
> > 
> > But unfortunately there are still quite some of them which don't 
> > support parking.
> 
> Well, if distros are moving towards live patching (and they are!), 
> then it looks rather necessary to me that something scary as flipping 
> out live kernel instructions with substantially different code should 
> be as safe as possible, and only then fast.

Sure I won't argue that.

> 
> If a kernel refuses to patch with certain threads running, that will 
> drive those kernel threads being fixed and such. It's a deterministic, 
> recoverable, reportable bug situation, so fixing it should be fast.
> 
> We learned these robustness lessons the hard way with kprobes and 
> ftrace dynamic code patching... which are utterly simple compared to 
> live kernel patching!

Yeah, agreed. More rationale behind it: we want to put the kthreads into
semantic sleeps, not just random sleeping points. This way we lower the
chances of executing new code that messes up live state expecting the old
code after random preemption or sleeping points.

But by semantic sleeps I mean more than just explicit calls to schedule()
as opposed to preemption points. It also implies shutting down the live
state handled by the kthread, such that some sort of re-initialization of
that state is also needed when the kthread gets back to run.

And that's exactly what good implementations of kthread park provide.

Consider kernel/watchdog.c as an example: when we park the lockup
detector kthread, it disables the perf event and the hrtimer before it goes
to actually park and sleep. When the kthread is later unparked, the kthread
restarts the hrtimer and the perf event.

If we live patch code that has obscure relations with perf or hrtimer here,
we greatly lower the chances of a crash while the watchdog kthread is parked.
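
For reference, this is roughly how the watchdog wires its park-time teardown
and unpark-time re-initialization through the smpboot callbacks (paraphrased
from kernel/watchdog.c from memory, so details may differ):

static struct smp_hotplug_thread watchdog_threads = {
	.store			= &softlockup_watchdog,
	.thread_should_run	= watchdog_should_run,
	.thread_fn		= watchdog,
	.thread_comm		= "watchdog/%u",
	.setup			= watchdog_enable,	/* start hrtimer + perf event */
	.park			= watchdog_disable,	/* stop them before parking */
	.unpark			= watchdog_enable,	/* restart them when unparked */
};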

So I'm in favour of parking all possible kthreads before live patching. Freezing
alone doesn't provide the same state shutdown as parking does.

Now since parking looks more widely implemented than kthread freezing, we could
even think about implementing kthread freezing using parking as a backend.


> 
> Thanks,
> 
>   Ingo


Re: [RFC PATCH 0/2] kpatch: dynamic kernel patching

2014-05-06 Thread Frederic Weisbecker
On Tue, May 06, 2014 at 07:12:11AM -0500, Josh Poimboeuf wrote:
> On Mon, May 05, 2014 at 11:49:23PM +0200, Frederic Weisbecker wrote:
> > On Mon, May 05, 2014 at 08:43:04PM +0200, Ingo Molnar wrote:
> > > If a kernel refuses to patch with certain threads running, that will 
> > > drive those kernel threads being fixed and such. It's a deterministic, 
> > > recoverable, reportable bug situation, so fixing it should be fast.
> > > 
> > > We learned these robustness lessons the hard way with kprobes and 
> > > ftrace dynamic code patching... which are utterly simple compared to 
> > > live kernel patching!
> > 
> > Yeah, agreed. More rationale behind: we want to put the kthreads into
> > semantic sleeps, not just random sleeping point. This way we lower the
> > chances to execute new code messing up living state that is expecting old
> > code after random preemption or sleeping points.
> > 
> > But by semantic sleeps I mean more than just explicit calls to schedule()
> > as opposed to preemption points.
> > It also implies shutting down as well the living states handled by the 
> > kthread
> > such that some sort of re-initialization of the state is also needed when
> > the kthread gets back to run.
> > 
> > And that's exactly what good implementations of kthread park provide.
> > 
> > Consider kernel/watchdog.c as an example: when we park the lockup
> > detector kthread, it disables the perf event and the hrtimer before it goes
> > to actually park and sleep. When the kthread is later unparked, the kthread
> > restarts the hrtimer and the perf event.
> > 
> > If we live patch code that has obscure relations with perf or hrtimer here,
> > we lower a lot the chances for a crash when the watchdog kthread is parked.
> > 
> > So I'm in favour of parking all possible kthreads before live patching. 
> > Freezing
> > alone doesn't provide the same state shutdown than parking.
> > 
> > Now since parking looks more widely implemented than kthread freezing, we 
> > could
> > even think about implementing kthread freezing using parking as backend.
> 
> The vast majority of kernel threads on my system don't seem to know
> anything about parking or freezing.  I see one kthread function which
> calls kthread_should_park(), which is smpboot_thread_fn(), used for
> ksoftirqd/*, migration/* and watchdog/*.  But there are many other
> kthread functions which seem to be parking ignorant, including:
> 
>   cpu_idle_loop
>   kthreadd
>   rcu_gp_kthread
>   worker_thread
>   rescuer_thread
>   devtmpfsd
>   hub_thread
>   kswapd
>   ksm_scan_thread
>   khugepaged
>   fsnotify_mark_destroy
>   scsi_error_handler
>   kauditd_thread
>   kjournald2
>   irq_thread
>   rfcomm_run

Yeah, I now realize that only very few of them can park: only infiniband, RCU,
stop_machine and the watchdog...

But that's still a good direction to take if we want to make the kernel more
robust against live patching, step by step.

> 
> Maybe we could modify all these thread functions (and probably more) to
> be park and/or freezer capable.  But really it wouldn't make much of a
> difference IMO.  It would only protect careless users from a tiny
> percentage of all possible havoc that a careless user could create.
> 
> Live patching is a very sensitive and risky operation, and from a kernel
> standpoint we should make it as safe as we reasonably can.  But we can't
> do much about careless users.  Ultimately the risk is in the hands of
> the user and their choice of patches.  They need to absolutely
> understand all the implications of patching a particular function.

Kernel developers will be a tiny minority of the users of live patching.

Very few sysadmins know about kernel internals. You can't really ask them
to judge whether a patch is reasonably live patchable or not.

It's possible to assess a patch by its size, or by the fact that it came
through a stable tree, so that it's at worst mildly invasive.

The only thing that could make live patching safe is a community that
eventually builds around it, carefully inspects patches in a stable release,
and then provides a selection of safely live patchable pieces.

Other than that, expect people to do crazy things.

> If the patch changes the way a function interacts with some external data,
> then they're starting to tempt fate and they need to be extra careful.
> This care needs to be taken for *all* kernel functions, not just for the
> few that are called from kernel threads.

It's actually very hard to tell if a given function isn't called by any
kernel thread.

> 
> Also, the top l

Re: [tip:timers/nohz] nohz: Move full nohz kick to its own IPI

2014-05-07 Thread Frederic Weisbecker
On Wed, May 07, 2014 at 05:17:35PM +0200, Peter Zijlstra wrote:
> On Mon, May 05, 2014 at 05:34:08PM +0200, Frederic Weisbecker wrote:
> > On Mon, May 05, 2014 at 05:12:28PM +0200, Peter Zijlstra wrote:
> > > > Note the current ordering:
> > > > 
> > > > cmpxchg(&qsd->pending, 0, 1)   get ipi
> > > > csd_lock(qsd->csd) xchg(&qsd->pending, 1)
> > > > send ipi   csd_unlock(qsd->csd)
> > > > 
> > > > 
> > > > So there shouldn't be racing updaters. Also ipi sender shouldn't
> > > > race with ipi receiver, the update shouldn't always eventually see
> > > > the unlock happening.
> > > 
> > > Yeah, I've not spotted how this particular train wreck happens either.
> > > 
> > > The problem is reproduction, it took me 9 hours to confirm I could
> > > reproduce the problem on my machine. So how long to I run it with this
> > > patch reverted to show its gone..
> > 
> > Maybe it could be favoured cpu hotplug. Anyway converting to irq_work should
> > fix it.
> 
> Ingo needs a commit msg for the revert of this patch; do you think you
> have time to look into _why_ this patch is broken and write such a
> thing?

I can try but I need to reproduce it. Do you have any clue on how to do so?
Also which HEAD were you guys using?

Thanks.


Re: [tip:timers/nohz] nohz: Move full nohz kick to its own IPI

2014-05-07 Thread Frederic Weisbecker
On Wed, May 07, 2014 at 05:37:36PM +0200, Peter Zijlstra wrote:
> On Wed, May 07, 2014 at 05:29:24PM +0200, Frederic Weisbecker wrote:
> > On Wed, May 07, 2014 at 05:17:35PM +0200, Peter Zijlstra wrote:
> > > On Mon, May 05, 2014 at 05:34:08PM +0200, Frederic Weisbecker wrote:
> > > > On Mon, May 05, 2014 at 05:12:28PM +0200, Peter Zijlstra wrote:
> > > > > > Note the current ordering:
> > > > > > 
> > > > > > cmpxchg(&qsd->pending, 0, 1)   get ipi
> > > > > > csd_lock(qsd->csd) xchg(&qsd->pending, 1)
> > > > > > send ipi   csd_unlock(qsd->csd)
> > > > > > 
> > > > > > 
> > > > > > So there shouldn't be racing updaters. Also the IPI sender shouldn't
> > > > > > race with the IPI receiver: the updater should always eventually see
> > > > > > the unlock happening.
> > > > > 
> > > > > Yeah, I've not spotted how this particular train wreck happens either.
> > > > > 
> > > > > The problem is reproduction, it took me 9 hours to confirm I could
> > > > > reproduce the problem on my machine. So how long to I run it with this
> > > > > patch reverted to show its gone..
> > > > 
> > > > Maybe it could be favoured by cpu hotplug. Anyway converting to irq_work 
> > > > should
> > > > fix it.
> > > 
> > > Ingo needs a commit msg for the revert of this patch; do you think you
> > > have time to look into _why_ this patch is broken and write such a
> > > thing?
> > 
> > I can try but I need to reproduce it. Do you have any clue on how to do so?
> > Also which HEAD were you guys using?
> 
> Ha!, so I was running a tip/master with that commit in -- a few days
> ago, v3.15-rc4-1644-g5c658b0cdf22 might've been it.
> 
> Then I ran it on my dual socket AMD interlagos, with:
> 
> while :; make O=allyesconfig-build/ clean; make O=allyesconfig-build/
> -j96 -s; done
> 
> for 9 hours, and then got empty RCU stall warns and a bricked machine.
> 
> I might still have the .config, but I don't think there was anything
> particularly odd about the config other than having NOHZ_FULL enabled.
> 
> The only way I found this patch was by staring at some RCU stall warns
> Ingo managed to get, sometimes they actually got backtraces in them
> apparently.
> 
> According to Ingo the bigger the machine the faster it reproduces, but
> reproduction times, even for these 32 cpu machines, are in the many
> hours range.

Ok then, I'll try something.

But note that those commits aren't upstream yet and they are in a separate
branch tip:timers/nohz with no other non-upstream commits.

And I work alone on this branch.

So we can as well zap these commits and replace them with the irq_work_on()
conversion (still preparing that).


[PATCH 5/6] workqueue: Allow modifying low level unbound workqueue cpumask

2014-05-07 Thread Frederic Weisbecker
Allow modifying the low-level unbound workqueues cpumask through
sysfs. This is performed by traversing the entire workqueue list
and calling apply_workqueue_attrs() on each unbound workqueue.

Cc: Christoph Lameter 
Cc: Kevin Hilman 
Cc: Lai Jiangshan 
Cc: Mike Galbraith 
Cc: Paul E. McKenney 
Cc: Tejun Heo 
Cc: Viresh Kumar 
Signed-off-by: Frederic Weisbecker 
---
 kernel/workqueue.c | 65 --
 1 file changed, 63 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2aa296d..5978cee 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -293,7 +293,7 @@ static DEFINE_SPINLOCK(wq_mayday_lock); /* protects 
wq->maydays list */
 static LIST_HEAD(workqueues);  /* PL: list of all workqueues */
 static bool workqueue_freezing;/* PL: have wqs started 
freezing? */
 
-static cpumask_var_t wq_unbound_cpumask;
+static cpumask_var_t wq_unbound_cpumask; /* PL: low level cpumask for all 
unbound wqs */
 
 /* the per-cpu worker pools */
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
@@ -4084,19 +4084,80 @@ static struct bus_type wq_subsys = {
.dev_groups = wq_sysfs_groups,
 };
 
+static int unbounds_cpumask_apply(cpumask_var_t cpumask)
+{
+   struct workqueue_struct *wq;
+   int ret;
+
+   lockdep_assert_held(&wq_pool_mutex);
+
+   list_for_each_entry(wq, &workqueues, list) {
+   struct workqueue_attrs *attrs;
+
+   if (!(wq->flags & WQ_UNBOUND))
+   continue;
+
+   attrs = wq_sysfs_prep_attrs(wq);
+   if (!attrs)
+   return -ENOMEM;
+
+   ret = apply_workqueue_attrs_locked(wq, attrs, cpumask);
+   free_workqueue_attrs(attrs);
+   if (ret)
+   break;
+   }
+
+   return 0;
+}
+
+static ssize_t unbounds_cpumask_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+   cpumask_var_t cpumask;
+   int ret = -EINVAL;
+
+   if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
+   return -ENOMEM;
+
+   ret = cpumask_parse(buf, cpumask);
+   if (ret)
+   goto out;
+
+   get_online_cpus();
+   if (cpumask_intersects(cpumask, cpu_online_mask)) {
+   mutex_lock(&wq_pool_mutex);
+   ret = unbounds_cpumask_apply(cpumask);
+   if (ret < 0) {
+   /* Warn if rollback itself fails */
+   
WARN_ON_ONCE(unbounds_cpumask_apply(wq_unbound_cpumask));
+   } else {
+   cpumask_copy(wq_unbound_cpumask, cpumask);
+   }
+   mutex_unlock(&wq_pool_mutex);
+   }
+   put_online_cpus();
+out:
+   free_cpumask_var(cpumask);
+   return ret ? ret : count;
+}
+
 static ssize_t unbounds_cpumask_show(struct device *dev,
 struct device_attribute *attr, char *buf)
 {
int written;
 
+   mutex_lock(&wq_pool_mutex);
written = cpumask_scnprintf(buf, PAGE_SIZE, wq_unbound_cpumask);
+   mutex_unlock(&wq_pool_mutex);
+
written += scnprintf(buf + written, PAGE_SIZE - written, "\n");
 
return written;
 }
 
 static struct device_attribute wq_sysfs_cpumask_attr =
-   __ATTR(cpumask, 0444, unbounds_cpumask_show, NULL);
+   __ATTR(cpumask, 0644, unbounds_cpumask_show, unbounds_cpumask_store);
 
 static int __init wq_sysfs_init(void)
 {
-- 
1.8.3.1



[PATCH 2/6] workqueue: Reorder sysfs code

2014-05-07 Thread Frederic Weisbecker
The sysfs code usually belongs at the bottom of the file since it deals
with high-level objects. In the workqueue code it's misplaced, such that
we would need forward references to let the sysfs code call APIs like
apply_workqueue_attrs().

Let's move that block further down in the file, right above
alloc_workqueue_key() which references it.

Suggested-by: Tejun Heo 
Cc: Christoph Lameter 
Cc: Kevin Hilman 
Cc: Lai Jiangshan 
Cc: Mike Galbraith 
Cc: Paul E. McKenney 
Cc: Tejun Heo 
Cc: Viresh Kumar 
Signed-off-by: Frederic Weisbecker 
---
 kernel/workqueue.c | 626 ++---
 1 file changed, 313 insertions(+), 313 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c68e84f..e5d7719 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3137,319 +3137,6 @@ int execute_in_process_context(work_func_t fn, struct 
execute_work *ew)
 }
 EXPORT_SYMBOL_GPL(execute_in_process_context);
 
-#ifdef CONFIG_SYSFS
-/*
- * Workqueues with WQ_SYSFS flag set is visible to userland via
- * /sys/bus/workqueue/devices/WQ_NAME.  All visible workqueues have the
- * following attributes.
- *
- *  per_cpuRO bool : whether the workqueue is per-cpu or unbound
- *  max_active RW int  : maximum number of in-flight work items
- *
- * Unbound workqueues have the following extra attributes.
- *
- *  id RO int  : the associated pool ID
- *  nice   RW int  : nice value of the workers
- *  cpumaskRW mask : bitmask of allowed CPUs for the workers
- */
-struct wq_device {
-   struct workqueue_struct *wq;
-   struct device   dev;
-};
-
-static struct workqueue_struct *dev_to_wq(struct device *dev)
-{
-   struct wq_device *wq_dev = container_of(dev, struct wq_device, dev);
-
-   return wq_dev->wq;
-}
-
-static ssize_t per_cpu_show(struct device *dev, struct device_attribute *attr,
-   char *buf)
-{
-   struct workqueue_struct *wq = dev_to_wq(dev);
-
-   return scnprintf(buf, PAGE_SIZE, "%d\n", (bool)!(wq->flags & 
WQ_UNBOUND));
-}
-static DEVICE_ATTR_RO(per_cpu);
-
-static ssize_t max_active_show(struct device *dev,
-  struct device_attribute *attr, char *buf)
-{
-   struct workqueue_struct *wq = dev_to_wq(dev);
-
-   return scnprintf(buf, PAGE_SIZE, "%d\n", wq->saved_max_active);
-}
-
-static ssize_t max_active_store(struct device *dev,
-   struct device_attribute *attr, const char *buf,
-   size_t count)
-{
-   struct workqueue_struct *wq = dev_to_wq(dev);
-   int val;
-
-   if (sscanf(buf, "%d", &val) != 1 || val <= 0)
-   return -EINVAL;
-
-   workqueue_set_max_active(wq, val);
-   return count;
-}
-static DEVICE_ATTR_RW(max_active);
-
-static struct attribute *wq_sysfs_attrs[] = {
-   &dev_attr_per_cpu.attr,
-   &dev_attr_max_active.attr,
-   NULL,
-};
-ATTRIBUTE_GROUPS(wq_sysfs);
-
-static ssize_t wq_pool_ids_show(struct device *dev,
-   struct device_attribute *attr, char *buf)
-{
-   struct workqueue_struct *wq = dev_to_wq(dev);
-   const char *delim = "";
-   int node, written = 0;
-
-   rcu_read_lock_sched();
-   for_each_node(node) {
-   written += scnprintf(buf + written, PAGE_SIZE - written,
-"%s%d:%d", delim, node,
-unbound_pwq_by_node(wq, node)->pool->id);
-   delim = " ";
-   }
-   written += scnprintf(buf + written, PAGE_SIZE - written, "\n");
-   rcu_read_unlock_sched();
-
-   return written;
-}
-
-static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr,
-   char *buf)
-{
-   struct workqueue_struct *wq = dev_to_wq(dev);
-   int written;
-
-   mutex_lock(&wq->mutex);
-   written = scnprintf(buf, PAGE_SIZE, "%d\n", wq->unbound_attrs->nice);
-   mutex_unlock(&wq->mutex);
-
-   return written;
-}
-
-/* prepare workqueue_attrs for sysfs store operations */
-static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct *wq)
-{
-   struct workqueue_attrs *attrs;
-
-   attrs = alloc_workqueue_attrs(GFP_KERNEL);
-   if (!attrs)
-   return NULL;
-
-   mutex_lock(&wq->mutex);
-   copy_workqueue_attrs(attrs, wq->unbound_attrs);
-   mutex_unlock(&wq->mutex);
-   return attrs;
-}
-
-static ssize_t wq_nice_store(struct device *dev, struct device_attribute *attr,
-const char *buf, size_t count)
-{
-   struct workqueue_struct *wq = dev_to_wq(dev);
-   struct workqueue_attrs *attrs;
-   int ret;
-
-   attrs = wq_sysfs_prep_attrs(wq);
-  

[PATCH 6/6] workqueue: Record real per-workqueue cpumask

2014-05-07 Thread Frederic Weisbecker
The real cpumask set by the user on WQ_SYSFS workqueues fails to be
recorded as is: What is actually recorded as per workqueue attribute
is the per workqueue cpumask intersected with the global unbounds cpumask.

Eventually, when the user overwrites a WQ_SYSFS cpumask and later reads
this attribute, the value returned is not the last one written.

The other bad side effect is that widening the global unbounds cpumask
doesn't actually widen the unbound workqueues' affinity because their
own cpumask has been shrunk.

In order to fix this, let's record the real per workqueue cpumask on the
workqueue struct. We restore this value when attributes are re-evaluated
later.

FIXME: Maybe I should rather invert that. Have the user set workqueue
cpumask on attributes and the effective one on the workqueue struct instead.
We'll just need some tweaking in order to make the attributes of lower layers
(pools, worker pools, workers, ...) inherit the effective cpumask and not
the user one.

Cc: Christoph Lameter 
Cc: Kevin Hilman 
Cc: Lai Jiangshan 
Cc: Mike Galbraith 
Cc: Paul E. McKenney 
Cc: Tejun Heo 
Cc: Viresh Kumar 
Signed-off-by: Frederic Weisbecker 
---
 kernel/workqueue.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5978cee..504cf0a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -248,6 +248,7 @@ struct workqueue_struct {
int saved_max_active; /* WQ: saved pwq max_active */
 
struct workqueue_attrs  *unbound_attrs; /* WQ: only for unbound wqs */
+   cpumask_var_t   saved_cpumask;  /* WQ: only for unbound wqs */
struct pool_workqueue   *dfl_pwq;   /* WQ: only for unbound wqs */
 
 #ifdef CONFIG_SYSFS
@@ -3694,6 +3695,7 @@ static int apply_workqueue_attrs_locked(struct 
workqueue_struct *wq,
mutex_lock(&wq->mutex);
 
copy_workqueue_attrs(wq->unbound_attrs, new_attrs);
+   cpumask_copy(wq->saved_cpumask, attrs->cpumask);
 
/* save the previous pwq and install the new one */
for_each_node(node)
@@ -4326,6 +4328,11 @@ struct workqueue_struct *__alloc_workqueue_key(const 
char *fmt,
wq->unbound_attrs = alloc_workqueue_attrs(GFP_KERNEL);
if (!wq->unbound_attrs)
goto err_free_wq;
+
+   if (!alloc_cpumask_var(&wq->saved_cpumask, GFP_KERNEL))
+   goto err_free_wq;
+
+   cpumask_copy(wq->saved_cpumask, cpu_possible_mask);
}
 
va_start(args, lock_name);
@@ -4397,6 +4404,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char 
*fmt,
return wq;
 
 err_free_wq:
+   free_cpumask_var(wq->saved_cpumask);
free_workqueue_attrs(wq->unbound_attrs);
kfree(wq);
return NULL;
-- 
1.8.3.1



[PATCH 3/6] workqueue: Create low-level unbound workqueues cpumask

2014-05-07 Thread Frederic Weisbecker
Create a cpumask that limits the affinity of all unbound workqueues.
This cpumask is controlled through a file at the root of the workqueue
sysfs directory.

It works at a lower level than the per-WQ_SYSFS workqueue cpumask files
such that the effective cpumask applied for a given unbound workqueue is
the intersection of /sys/devices/virtual/workqueue/$WORKQUEUE/cpumask and
the new /sys/devices/virtual/workqueue/cpumask_unbounds file.

This patch implements the basic infrastructure and the read interface.
cpumask_unbounds is initially set to cpu_possible_mask.

Cc: Christoph Lameter 
Cc: Kevin Hilman 
Cc: Lai Jiangshan 
Cc: Mike Galbraith 
Cc: Paul E. McKenney 
Cc: Tejun Heo 
Cc: Viresh Kumar 
Signed-off-by: Frederic Weisbecker 
---
 kernel/workqueue.c | 29 +++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index e5d7719..1252a8c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -293,6 +293,8 @@ static DEFINE_SPINLOCK(wq_mayday_lock); /* protects 
wq->maydays list */
 static LIST_HEAD(workqueues);  /* PL: list of all workqueues */
 static bool workqueue_freezing;/* PL: have wqs started 
freezing? */
 
+static cpumask_var_t wq_unbound_cpumask;
+
 /* the per-cpu worker pools */
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
 cpu_worker_pools);
@@ -3674,7 +3676,7 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
 
/* make a copy of @attrs and sanitize it */
copy_workqueue_attrs(new_attrs, attrs);
-   cpumask_and(new_attrs->cpumask, new_attrs->cpumask, cpu_possible_mask);
+   cpumask_and(new_attrs->cpumask, new_attrs->cpumask, wq_unbound_cpumask);
 
/*
 * We may create multiple pwqs with differing cpumasks.  Make a
@@ -4071,9 +4073,29 @@ static struct bus_type wq_subsys = {
.dev_groups = wq_sysfs_groups,
 };
 
+static ssize_t unbounds_cpumask_show(struct device *dev,
+struct device_attribute *attr, char *buf)
+{
+   int written;
+
+   written = cpumask_scnprintf(buf, PAGE_SIZE, wq_unbound_cpumask);
+   written += scnprintf(buf + written, PAGE_SIZE - written, "\n");
+
+   return written;
+}
+
+static struct device_attribute wq_sysfs_cpumask_attr =
+   __ATTR(cpumask, 0444, unbounds_cpumask_show, NULL);
+
 static int __init wq_sysfs_init(void)
 {
-   return subsys_virtual_register(&wq_subsys, NULL);
+   int err;
+
+   err = subsys_virtual_register(&wq_subsys, NULL);
+   if (err)
+   return err;
+
+   return device_create_file(wq_subsys.dev_root, &wq_sysfs_cpumask_attr);
 }
 core_initcall(wq_sysfs_init);
 
@@ -5068,6 +5090,9 @@ static int __init init_workqueues(void)
 
WARN_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));
 
+   BUG_ON(!alloc_cpumask_var(&wq_unbound_cpumask, GFP_KERNEL));
+   cpumask_copy(wq_unbound_cpumask, cpu_possible_mask);
+
pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);
 
cpu_notifier(workqueue_cpu_up_callback, CPU_PRI_WORKQUEUE_UP);
-- 
1.8.3.1



[PATCH 1/6] workqueue: Allow changing attributions of ordered workqueues

2014-05-07 Thread Frederic Weisbecker
From: Lai Jiangshan 

Changing the attributes of a workqueue implies the addition of new pwqs
to replace the old ones. But the current implementation doesn't handle
ordered workqueues because they can't carry multiple pwqs without breaking
ordering. Hence ordered workqueues currently aren't allowed to call
apply_workqueue_attrs().

This results in several special cases in the workqueue code to handle
ordered workqueues. And with the addition of the global workqueue cpumask,
these special cases are going to spread out even further as the number
of callers of apply_workqueue_attrs() keeps increasing.

So we want apply_workqueue_attrs() to be smarter and able to
handle all sorts of unbound workqueues.

This solution proposes to create new pwqs on ordered workqueues with
max_active initialized to zero. Then when the older pwq is finally
released, the new one becomes active and its max_active value is set to 1.
This way we make sure that only a single pwq ever runs at a given time
on an ordered workqueue.

This enforces ordering and non-reentrancy at a higher level.
Note that ordered works then become exceptions and aren't subject to the
previous-pool requeue that usually guarantees non-reentrancy when works
requeue themselves back-to-back. Otherwise it could prevent a pool switch
from ever happening by delaying the release of the old pool forever and
never letting the new one in.

Now that we can change ordered workqueue attributes, let's allow them
to be registered as WQ_SYSFS and allow changing their nice and cpumask
values. Note that in order to preserve the ordering guarantee, we still
disallow changing their max_active and no_numa values.
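
For illustration only (hypothetical driver snippet, not part of this patch),
here is roughly what this enables: an ordered workqueue can now carry
WQ_SYSFS, so its nice value and cpumask become tunable from
/sys/bus/workqueue/devices/<name>/:

	#include <linux/errno.h>
	#include <linux/init.h>
	#include <linux/workqueue.h>

	static struct workqueue_struct *my_ordered_wq;

	static int __init my_example_init(void)
	{
		/* WQ_SYSFS exposes the workqueue under /sys/bus/workqueue/devices/ */
		my_ordered_wq = alloc_ordered_workqueue("my_ordered_wq", WQ_SYSFS);
		if (!my_ordered_wq)
			return -ENOMEM;
		return 0;
	}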

Signed-off-by: Lai Jiangshan 
Cc: Christoph Lameter 
Cc: Kevin Hilman 
Cc: Lai Jiangshan 
Cc: Mike Galbraith 
Cc: Paul E. McKenney 
Cc: Tejun Heo 
Cc: Viresh Kumar 
Signed-off-by: Frederic Weisbecker 
---
 kernel/workqueue.c | 69 +-
 1 file changed, 42 insertions(+), 27 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c3f076f..c68e84f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1355,8 +1355,16 @@ retry:
 * If @work was previously on a different pool, it might still be
 * running there, in which case the work needs to be queued on that
 * pool to guarantee non-reentrancy.
+*
+* We guarantee that only one pwq is active on an ordered workqueue.
+* That alone enforces non-reentrancy for works. So ordered works don't
+* need to be requeued to their previous pool. Not to mention that
+* an ordered work requeing itself over and over on the same pool may
+* prevent a pwq from being released in case of a pool switch. The
+* newest pool in that case couldn't switch in and its pending works
+* would starve.
 */
-   last_pool = get_work_pool(work);
+   last_pool = wq->flags & __WQ_ORDERED ? NULL : get_work_pool(work);
if (last_pool && last_pool != pwq->pool) {
struct worker *worker;
 
@@ -3319,6 +3327,10 @@ static ssize_t wq_numa_store(struct device *dev, struct 
device_attribute *attr,
struct workqueue_attrs *attrs;
int v, ret;
 
+   /* Creating per-node pwqs breaks ordering guarantee. Keep no_numa = 1 */
+   if (WARN_ON(wq->flags & __WQ_ORDERED))
+   return -EINVAL;
+
attrs = wq_sysfs_prep_attrs(wq);
if (!attrs)
return -ENOMEM;
@@ -3379,14 +3391,6 @@ int workqueue_sysfs_register(struct workqueue_struct *wq)
struct wq_device *wq_dev;
int ret;
 
-   /*
-* Adjusting max_active or creating new pwqs by applyting
-* attributes breaks ordering guarantee.  Disallow exposing ordered
-* workqueues.
-*/
-   if (WARN_ON(wq->flags & __WQ_ORDERED))
-   return -EINVAL;
-
wq->wq_dev = wq_dev = kzalloc(sizeof(*wq_dev), GFP_KERNEL);
if (!wq_dev)
return -ENOMEM;
@@ -3708,6 +3712,13 @@ static void rcu_free_pwq(struct rcu_head *rcu)
container_of(rcu, struct pool_workqueue, rcu));
 }
 
+static struct pool_workqueue *oldest_pwq(struct workqueue_struct *wq)
+{
+   return list_last_entry(&wq->pwqs, struct pool_workqueue, pwqs_node);
+}
+
+static void pwq_adjust_max_active(struct pool_workqueue *pwq);
+
 /*
  * Scheduled on system_wq by put_pwq() when an unbound pwq hits zero refcnt
  * and needs to be destroyed.
@@ -3723,14 +3734,12 @@ static void pwq_unbound_release_workfn(struct 
work_struct *work)
if (WARN_ON_ONCE(!(wq->flags & WQ_UNBOUND)))
return;
 
-   /*
-* Unlink @pwq.  Synchronization against wq->mutex isn't strictly
-* necessary on release but do it anyway.  It's easier to verify
-* and consistent with the linking path.
-*/
mutex_loc

[PATCH 4/6] workqueue: Split apply attrs code from its locking

2014-05-07 Thread Frederic Weisbecker
In order to allow overriding the unbound wqs' low-level cpumask, we
need to be able to call apply_workqueue_attrs() on all workqueues in
the pool list.

Now since traversing the pool list requires locking it, we can't currently
call apply_workqueue_attrs() during the traversal.

So let's provide a version of apply_workqueue_attrs() that can be
called when the pool list is already locked.

Suggested-by: Tejun Heo 
Cc: Christoph Lameter 
Cc: Kevin Hilman 
Cc: Lai Jiangshan 
Cc: Mike Galbraith 
Cc: Paul E. McKenney 
Cc: Tejun Heo 
Cc: Viresh Kumar 
Signed-off-by: Frederic Weisbecker 
---
 kernel/workqueue.c | 77 +++---
 1 file changed, 44 insertions(+), 33 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 1252a8c..2aa296d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3637,24 +3637,9 @@ static struct pool_workqueue 
*numa_pwq_tbl_install(struct workqueue_struct *wq,
return old_pwq;
 }
 
-/**
- * apply_workqueue_attrs - apply new workqueue_attrs to an unbound workqueue
- * @wq: the target workqueue
- * @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
- *
- * Apply @attrs to an unbound workqueue @wq.  Unless disabled, on NUMA
- * machines, this function maps a separate pwq to each NUMA node with
- * possibles CPUs in @attrs->cpumask so that work items are affine to the
- * NUMA node it was issued on.  Older pwqs are released as in-flight work
- * items finish.  Note that a work item which repeatedly requeues itself
- * back-to-back will stay on its current pwq.
- *
- * Performs GFP_KERNEL allocations.
- *
- * Return: 0 on success and -errno on failure.
- */
-int apply_workqueue_attrs(struct workqueue_struct *wq,
- const struct workqueue_attrs *attrs)
+static int apply_workqueue_attrs_locked(struct workqueue_struct *wq,
+   const struct workqueue_attrs *attrs,
+   cpumask_var_t unbounds_cpumask)
 {
struct workqueue_attrs *new_attrs, *tmp_attrs;
struct pool_workqueue **pwq_tbl, *dfl_pwq;
@@ -3676,7 +3661,7 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
 
/* make a copy of @attrs and sanitize it */
copy_workqueue_attrs(new_attrs, attrs);
-   cpumask_and(new_attrs->cpumask, new_attrs->cpumask, wq_unbound_cpumask);
+   cpumask_and(new_attrs->cpumask, new_attrs->cpumask, unbounds_cpumask);
 
/*
 * We may create multiple pwqs with differing cpumasks.  Make a
@@ -3686,15 +3671,6 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
copy_workqueue_attrs(tmp_attrs, new_attrs);
 
/*
-* CPUs should stay stable across pwq creations and installations.
-* Pin CPUs, determine the target cpumask for each node and create
-* pwqs accordingly.
-*/
-   get_online_cpus();
-
-   mutex_lock(&wq_pool_mutex);
-
-   /*
 * If something goes wrong during CPU up/down, we'll fall back to
 * the default pwq covering whole @attrs->cpumask.  Always create
 * it even if we don't use it immediately.
@@ -3714,8 +3690,6 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
}
}
 
-   mutex_unlock(&wq_pool_mutex);
-
/* all pwqs have been created successfully, let's install'em */
mutex_lock(&wq->mutex);
 
@@ -3736,7 +3710,6 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
put_pwq_unlocked(pwq_tbl[node]);
put_pwq_unlocked(dfl_pwq);
 
-   put_online_cpus();
ret = 0;
/* fall through */
 out_free:
@@ -3750,14 +3723,52 @@ enomem_pwq:
for_each_node(node)
if (pwq_tbl && pwq_tbl[node] != dfl_pwq)
free_unbound_pwq(pwq_tbl[node]);
-   mutex_unlock(&wq_pool_mutex);
-   put_online_cpus();
 enomem:
ret = -ENOMEM;
goto out_free;
 }
 
 /**
+ * apply_workqueue_attrs - apply new workqueue_attrs to an unbound workqueue
+ * @wq: the target workqueue
+ * @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
+ *
+ * Apply @attrs to an unbound workqueue @wq.  Unless disabled, on NUMA
+ * machines, this function maps a separate pwq to each NUMA node with
+ * possibles CPUs in @attrs->cpumask so that work items are affine to the
+ * NUMA node it was issued on.  Older pwqs are released as in-flight work
+ * items finish.  Note that a work item which repeatedly requeues itself
+ * back-to-back will stay on its current pwq.
+ *
+ * Performs GFP_KERNEL allocations.
+ *
+ * Return: 0 on success and -errno on failure.
+ */
+int apply_workqueue_attrs(struct workqueue_struct *wq,
+ const struct workqueue_attrs *attrs)
+{
+   int ret;
+
+   /*
+* CPUs should stay stable across pwq creations and installations.
+

[RFC PATCH 0/6] workqueue: Introduce low-level unbound wq sysfs cpumask v2

2014-05-07 Thread Frederic Weisbecker
Hi,

This is the 2nd version of https://lkml.org/lkml/2014/4/24/427

Changes in this version include:

* Allow calling apply_workqueue_attrs() on ordered workqueues and remove
the associated special cases (thanks Lai!)

* Improve error handling

* Add some lockdep_assert_held() to check locking requirements

* Rename the unbounds_cpumask file to cpumask (some prefix still in discussion
with Tejun).

* Better handle widening of the cpumask (see 6th patch), but this needs to be
done more properly; see the changelog.

* Rebase on top of Tejun's workqueue next branch

Thanks,
Frederic

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
core/workqueue-v4
---

Frederic Weisbecker (5):
  workqueue: Reorder sysfs code
  workqueue: Create low-level unbound workqueues cpumask
  workqueue: Split apply attrs code from its locking
  workqueue: Allow modifying low level unbound workqueue cpumask
  workqueue: Record real per-workqueue cpumask

Lai Jiangshan (1):
  workqueue: Allow changing attributions of ordered workqueues


 kernel/workqueue.c | 1672 
 1 file changed, 896 insertions(+), 776 deletions(-)


Re: [tip:timers/nohz] nohz: Move full nohz kick to its own IPI

2014-05-09 Thread Frederic Weisbecker
On Wed, May 07, 2014 at 09:07:14PM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra  wrote:
> 
> > On Wed, May 07, 2014 at 06:05:08PM +0200, Frederic Weisbecker wrote:
> > > So we can as well zap these commits and replace them with the 
> > > irq_work_on()
> > > conversion (still preparing that).
> > 
> > I hear you, but I think Ingo doesn't want to rebase the tree because its
> > public, but maybe he can make an exception..
> 
> Since tip:timers/nohz only contains these commits:
> 
>  fc1781cc66b1 Merge branch 'timers/nohz-ipi-for-tip-v3' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into 
> timers/nohz
>  72aacf0259bb nohz: Move full nohz kick to its own IPI
>  55d77c215c74 smp: Non busy-waiting IPI queue
> 
> we can zap them all, or even keep 55d77c215c74.
> 
> Frederic, which one would you prefer?

I think we can zap them all. But since I'm not yet sure how the new solution,
based on irq work, is going to be received, do you mind if we just wait for
the new solution to be acked by Peterz?

If that's ok for you, I'll then do a pull request that includes the solution
and zaps the two old commits.

Thanks!


Re: vmstat: On demand vmstat workers V4

2014-05-09 Thread Frederic Weisbecker
On Sat, May 10, 2014 at 12:57:15AM +0200, Thomas Gleixner wrote:
> On Fri, 9 May 2014, Christoph Lameter wrote:
> > On Fri, 9 May 2014, Thomas Gleixner wrote:
> > > I understand why you want to get this done by a housekeeper, I just
> > > did not understand why we need this whole move it around business is
> > > required.
> > 
> > This came about because of another objection against having it simply
> > fixed to a processor. After all that processor may be disabled etc etc.
> 
> I really regret that I did not pay more attention (though my cycle
> constraints simply do not allow it).
> 
> This is the typical overengineering failure: 
> 
>   Before we even have a working proof that we can solve the massive
>   complex basic problem with the price of a dedicated housekeeper, we
>   try to make the housekeeper itself a moving target with the price of
>   making the problem exponential(unknown) instead of simply unknown.
> 
> I really cannot figure out why a moving housekeeper would be a
> brilliant idea at all, but I'm sure there is some magic use case in
> some other disjunct universe.
> 
> Whoever complained and came up with the NOT SO brilliant idea to make
> the housekeeper a moving target, come please forth and explain:
> 
> - How this can be done without having a working solution with a
>   dedicated housekeeper in the first place
> 
> - How this can be done without knowing what implication it has w/o
>   seing the complexity of a dedicated housekeeper upfront.
> 
> Keep it simple has always been and still is the best engineering
> principle.
> 
> We all know that we can do large scale overhauls in a very controlled
> way if the need arises. But going for the most complex solution while
> not knowing whether the least complex solution is feasible at all is
> outright stupid or beyond.
> 
> Unless someone comes up with a reasonable explantion for all of this I
> put a general NAK on patches which are directed to kernel/time/*
> 
> Correction:
> 
> I'm taking patches right away which undo any damage which has been
> applied w/o me noticing because I trusted the responsible developers /
> maintainers.
> 
> Preferrably those patches arrive before my return from LinuxCon Japan.

Yeah my plan was to have a variable housekeeping CPU. In fact the reason
was more about power optimization: having all non-full-nohz CPUs able
to handle the timekeeping duty (and hence housekeeping) could help
balance the timekeeping load further.

But then I sent a patchset with that in mind
(https://lkml.org/lkml/2013/12/17/708) and it was too complicated.
Doing it correctly is too hard for now.

Anyway I agree that was overengineering at this stage.

Fortunately nothing has been applied. I was too busy with cleanups and
workqueue affinity.

So the "only" damage is the bad directions given to Christoph. But you know
how I use GPS...


Re: vmstat: On demand vmstat workers V4

2014-05-09 Thread Frederic Weisbecker
On Fri, May 09, 2014 at 04:47:45PM -0700, Paul E. McKenney wrote:
> On Sat, May 10, 2014 at 12:57:15AM +0200, Thomas Gleixner wrote:
> If someone decides to make tick_do_timer_cpu non-constant in NO_HZ_FULL
> CPUs, they will break unless/until I make RCU deal with that sort
> of thing, at least for NO_HZ_FULL_SYSIDLE kernels.  ;-)
> 
> > We all know that we can do large scale overhauls in a very controlled
> > way if the need arises. But going for the most complex solution while
> > not knowing whether the least complex solution is feasible at all is
> > outright stupid or beyond.
> > 
> > Unless someone comes up with a reasonable explantion for all of this I
> > put a general NAK on patches which are directed to kernel/time/*
> > 
> > Correction:
> > 
> > I'm taking patches right away which undo any damage which has been
> > applied w/o me noticing because I trusted the responsible developers /
> > maintainers.
> > 
> > Preferrably those patches arrive before my return from LinuxCon Japan.
> 
> I could easily have missed something, but as far as I know, there is
> nothing in the current kernel that allows tick_do_timer_cpu to move in
> NO_HZ_FULL kernels.

Right.

So we agree that housekeeping/timekeeping is going to stay on CPU 0 for now.

But I still have the plan to make the timekeeper use the full sysidle
facility in order to adaptively get to dynticks idle.

Reminder for others: in NO_HZ_FULL, the timekeeper (always CPU 0) stays
completely periodic. It can't enter dynticks idle mode because it
must maintain timekeeping on behalf of full dynticks CPUs. So that's
a power issue.

But Paul has a feature in RCU that lets us know when all CPUs are idle
and the timekeeper can finally sleep. Then when a full nohz CPU wakes
up from idle, it sends an IPI to the timekeeper if needed so the latter
restarts timekeeping maintenance.

It's not complicated to add to the timer code.
Most of the code has already been there, in RCU, for a while.

Are we keeping that direction? 
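
To make that direction a bit more concrete, here is a rough sketch (not
existing code; the function below is made up, and rcu_sys_is_idle() is my
understanding of the query Paul's sysidle feature provides) of the check
the timekeeper's tick-stop path could grow:

	#include <linux/rcupdate.h>
	#include <linux/tick.h>

	/* Hypothetical helper called from the timekeeper's idle path */
	static bool timekeeper_can_stop_tick(void)
	{
		/* Keep the tick while full dynticks CPUs may still need timekeeping */
		if (tick_nohz_full_enabled() && !rcu_sys_is_idle())
			return false;
		return true;
	}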


Re: vmstat: On demand vmstat workers V4

2014-05-10 Thread Frederic Weisbecker
On Sat, May 10, 2014 at 02:31:28PM +0200, Thomas Gleixner wrote:
> On Sat, 10 May 2014, Frederic Weisbecker wrote:
> > But I still have the plan to make the timekeeper use the full sysidle
> > facility in order to adaptively get to dynticks idle.
> > 
> > Reminder for others: in NO_HZ_FULL, the timekeeper (always CPU 0) stays
> > completely periodic. It can't enter in dynticks idle mode because it
> > must maintain timekeeping on behalf of full dynticks CPUs. So that's
> > a power issue.
> > 
> > But Paul has a feature in RCU that lets us know when all CPUs are idle
> > and the timekeeper can finally sleep. Then when a full nohz CPU wakes
> > up from idle, it sends an IPI to the timekeeper if needed so the latter
> > restarts timekeeping maintainance.
> >
> > It's not complicated to add to the timer code.
> > Most of the code is already there, in RCU, for a while already.
> > 
> > Are we keeping that direction? 
> 
> So the idea is that the timekeeper stays on cpu0, but if everything is
> idle it is allowed to take a long nap as well. So if some other cpu
> wakes up it updates timekeeping without taking over the time keeper
> duty and if it has work to do, it kicks cpu0 into gear. If it just
> goes back to sleep, then nothing to do.

Exactly! Except perhaps for the last sentence, "If it just goes back to sleep,
then nothing to do." I hadn't thought about that, although this special case
is indeed quite frequent: an interrupt fires on an idle CPU but no task is
woken up.

Maybe I should move the code that fires the IPI to cpu0, if it is sleeping,
to irq exit (the plan was to do it right away on irq enter) and fire it
only if need_resched().
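
For what it's worth, a rough sketch of that irq exit hook (every name below is
made up except tick_nohz_full_cpu(), need_resched() and irq_work_queue_on(),
the latter coming from the remote irq work series posted separately):

	#include <linux/irq_work.h>
	#include <linux/sched.h>
	#include <linux/smp.h>
	#include <linux/tick.h>

	static void timekeeping_resume_fn(struct irq_work *work)
	{
		/* runs on CPU 0: end the nap and resume periodic timekeeping */
	}
	static DEFINE_IRQ_WORK(timekeeping_kick, timekeeping_resume_fn);

	/* Hypothetical hook on the irq exit path of a full dynticks CPU */
	static void maybe_kick_timekeeper(void)
	{
		if (tick_nohz_full_cpu(smp_processor_id()) && need_resched())
			irq_work_queue_on(&timekeeping_kick, 0 /* the timekeeper */);
	}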

> 
> No objections from my side.

Great! Thanks for checking that!

> 
> Thanks,
> 
>   tglx
> 


Re: vmstat: On demand vmstat workers V4

2014-05-10 Thread Frederic Weisbecker
On Sat, May 10, 2014 at 06:17:08PM -0700, Paul E. McKenney wrote:
> On Sat, May 10, 2014 at 03:14:25PM +0200, Frederic Weisbecker wrote:
> > On Sat, May 10, 2014 at 02:31:28PM +0200, Thomas Gleixner wrote:
> > > On Sat, 10 May 2014, Frederic Weisbecker wrote:
> > > > But I still have the plan to make the timekeeper use the full sysidle
> > > > facility in order to adaptively get to dynticks idle.
> > > > 
> > > > Reminder for others: in NO_HZ_FULL, the timekeeper (always CPU 0) stays
> > > > completely periodic. It can't enter in dynticks idle mode because it
> > > > must maintain timekeeping on behalf of full dynticks CPUs. So that's
> > > > a power issue.
> > > > 
> > > > But Paul has a feature in RCU that lets us know when all CPUs are idle
> > > > and the timekeeper can finally sleep. Then when a full nohz CPU wakes
> > > > up from idle, it sends an IPI to the timekeeper if needed so the latter
> > > > restarts timekeeping maintainance.
> > > >
> > > > It's not complicated to add to the timer code.
> > > > Most of the code is already there, in RCU, for a while already.
> > > > 
> > > > Are we keeping that direction? 
> > > 
> > > So the idea is that the timekeeper stays on cpu0, but if everything is
> > > idle it is allowed to take a long nap as well. So if some other cpu
> > > wakes up it updates timekeeping without taking over the time keeper
> > > duty and if it has work to do, it kicks cpu0 into gear. If it just
> > > goes back to sleep, then nothing to do.
> 
> Hmmm...  If RCU is supposed to ignore the fact that one of the other
> CPUs woke up momentarily, we will need to adjust things a bit.

Maybe not that much actually.

> 
> > Exactly! Except perhaps the last sentence "If it just goes back to sleep,
> > then nothing to do.", I didn't think about that although this special case
> > is quite frequent indeed when an interrupt fires on idle but no task is 
> > woken up.
> > 
> > Maybe I should move the code that fires the IPI to cpu0, if it is sleeping,
> > on irq exit (the plan was to do it right away on irq enter) and fire it
> > only if need_resched().
> 
> And of course if that code path contains any RCU read-side critical
> sections, RCU absolutely cannot ignore that CPU's momentary wakeup.

Sure, core RCU still needs to know that the CPU went out of dynticks for the
duration of the irq, so we keep the rcu_irq_enter()/rcu_irq_exit() calls.

But if the CPU only wakes up to serve an IRQ, it doesn't need to tell the RCU
sysidle detection about it. The irq entry already fixes up jiffies in dynticks
idle mode; this should be enough.


[PATCH 3/5] irq_work: Allow remote queueing

2014-05-11 Thread Frederic Weisbecker
irq work currently only supports local callbacks. However its code
is mostly ready to run remote callbacks and we have some potential users.

The full nohz subsystem currently open codes its own remote irq work
on top of the scheduler IPI when it wants a CPU to re-evaluate its next
tick.

However this ad hoc solution bloats the scheduler IPI.

Let's just extend the irq work subsystem to handle this kind of user.
This shouldn't add noticeable overhead.
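
For illustration (not part of the patch), a caller could use the new interface
roughly like below; the handler and its name are made up:

	#include <linux/irq_work.h>

	static void remote_example_fn(struct irq_work *work)
	{
		/* runs from the irq work IPI, in hardirq context, on the target CPU */
	}
	static DEFINE_IRQ_WORK(remote_example, remote_example_fn);

	static void poke_cpu(int cpu)
	{
		/* no-op if the work is already pending somewhere */
		irq_work_queue_on(&remote_example, cpu);
	}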

Cc: Andrew Morton 
Cc: Ingo Molnar 
Cc: Kevin Hilman 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Viresh Kumar 
Signed-off-by: Frederic Weisbecker 
---
 include/linux/irq_work.h |  2 ++
 kernel/irq_work.c| 29 +++--
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index 429b1ba..511e7f7 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -33,6 +33,8 @@ void init_irq_work(struct irq_work *work, void (*func)(struct 
irq_work *))
 #define DEFINE_IRQ_WORK(name, _f) struct irq_work name = { .func = (_f), }
 
 bool irq_work_queue(struct irq_work *work);
+bool irq_work_queue_on(struct irq_work *work, int cpu);
+
 void irq_work_run(void);
 void irq_work_run_tick(void);
 void irq_work_sync(struct irq_work *work);
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 0a554a6..3a7e163 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -55,12 +55,37 @@ void __weak arch_irq_work_raise(int cpu)
 */
 }
 
+static void irq_work_queue_raise(struct irq_work *work,
+struct llist_head *head, int cpu)
+{
+   if (llist_add(&work->llnode, head))
+   arch_irq_work_raise(cpu);
+}
+
+#ifdef CONFIG_HAVE_IRQ_WORK_IPI
 /*
  * Enqueue the irq_work @entry unless it's already pending
  * somewhere.
  *
  * Can be re-enqueued while the callback is still in progress.
  */
+bool irq_work_queue_on(struct irq_work *work, int cpu)
+{
+   /* Only queue if not already pending */
+   if (!irq_work_claim(work))
+   return false;
+
+   /* All work should have been flushed before going offline */
+   WARN_ON_ONCE(cpu_is_offline(cpu));
+   WARN_ON_ONCE(work->flags & IRQ_WORK_LAZY);
+
+   irq_work_queue_raise(work, &per_cpu(raised_list, cpu), cpu);
+
+   return true;
+}
+EXPORT_SYMBOL_GPL(irq_work_queue_on);
+#endif /* #endif CONFIG_HAVE_IRQ_WORK_IPI */
+
 bool irq_work_queue(struct irq_work *work)
 {
unsigned long flags;
@@ -78,8 +103,8 @@ bool irq_work_queue(struct irq_work *work)
 * for the next tick.
 */
if (!(work->flags & IRQ_WORK_LAZY) || tick_nohz_tick_stopped()) {
-   if (llist_add(&work->llnode, &__get_cpu_var(raised_list)))
-   arch_irq_work_raise(smp_processor_id());
+   irq_work_queue_raise(work, &__get_cpu_var(raised_list),
+smp_processor_id());
} else {
llist_add(&work->llnode, &__get_cpu_var(lazy_list));
}
-- 
1.8.3.1



[PATCH 5/5] nohz: Use IPI implicit full barrier against rq->nr_running r/w

2014-05-11 Thread Frederic Weisbecker
A full dynticks CPU is allowed to stop its tick when a single task runs.
Meanwhile, when a new task gets enqueued, the CPU must be notified so that
it restarts its tick to maintain local fairness and other accounting details.

This notification is performed by way of an IPI. Then when the target
receives the IPI, we expect it to see the new value of rq->nr_running.

Hence the following ordering scenario:

   CPU 0                      CPU 1

   write rq->nr_running       get IPI
   smp_wmb()                  smp_rmb()
   send IPI                   read rq->nr_running

But Paul McKenney says that nowadays IPIs imply a full barrier on
all architectures. So we can safely remove this barrier pair and rely on the
implicit barriers that come with IPI send/receive. Let's
just comment on this new assumption.

Cc: Andrew Morton 
Cc: Ingo Molnar 
Cc: Kevin Hilman 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Viresh Kumar 
Signed-off-by: Frederic Weisbecker 
---
 kernel/sched/core.c  |  9 +
 kernel/sched/sched.h | 10 --
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 00ac248..0d78470 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -670,10 +670,11 @@ bool sched_can_stop_tick(void)
 
rq = this_rq();
 
-   /* Make sure rq->nr_running update is visible after the IPI */
-   smp_rmb();
-
-   /* More than one running task need preemption */
+   /*
+   * More than one running task need preemption.
+   * nr_running update is assumed to be visible
+   * after IPI is sent from wakers.
+   */
if (rq->nr_running > 1)
return false;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6089e00..219bfbd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1223,8 +1223,14 @@ static inline void inc_nr_running(struct rq *rq)
 #ifdef CONFIG_NO_HZ_FULL
if (rq->nr_running == 2) {
if (tick_nohz_full_cpu(rq->cpu)) {
-   /* Order rq->nr_running write against the IPI */
-   smp_wmb();
+   /*
+* Tick is needed if more than one task runs on a CPU.
+* Send the target an IPI to kick it out of nohz mode.
+*
+* We assume that IPI implies full memory barrier and 
the
+* new value of rq->nr_running is visible on reception
+* from the target.
+*/
tick_nohz_full_kick_cpu(rq->cpu);
}
}
-- 
1.8.3.1



[RFC PATCH 0/5] nohz: Move nohz kick out of scheduler IPI, v3

2014-05-11 Thread Frederic Weisbecker
Hi,

So this version gives up with smp_queue_function_single() and extends
irq work to support remote queuing. As suggested by Peterz.

Comments?

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
timers/nohz-irq-work

Thanks,
Frederic
---

Frederic Weisbecker (5):
  irq_work: Architecture support for remote irq work raise
  irq_work: Force non-lazy works on IPI
  irq_work: Allow remote queueing
  nohz: Move full nohz kick to its own IPI
  nohz: Use IPI implicit full barrier against rq->nr_running r/w


 arch/Kconfig   | 12 +++
 arch/alpha/kernel/time.c   |  3 +-
 arch/arm/Kconfig   |  1 +
 arch/arm/kernel/smp.c  |  4 +--
 arch/powerpc/kernel/time.c |  3 +-
 arch/sparc/kernel/pcr.c|  3 +-
 arch/x86/Kconfig   |  1 +
 arch/x86/kernel/irq_work.c | 10 ++
 include/linux/irq_work.h   |  3 ++
 include/linux/tick.h   |  9 -
 kernel/irq_work.c  | 87 +++---
 kernel/sched/core.c| 14 
 kernel/sched/sched.h   | 12 +--
 kernel/time/Kconfig|  2 ++
 kernel/time/tick-sched.c   | 10 +++---
 kernel/timer.c |  2 +-
 16 files changed, 118 insertions(+), 58 deletions(-)


[PATCH 4/5] nohz: Move full nohz kick to its own IPI

2014-05-11 Thread Frederic Weisbecker
Now that the irq work subsystem can queue remote callbacks, it's
a perfect fit to safely queue IPIs when interrupts are disabled
without worrying about concurrent callers.

Let's use it for the full dynticks kick to notify a CPU that it's
exiting single task mode.

This unbloats the scheduler IPI a bit, which the nohz code was abusing
for its cool "callable anywhere/anytime" properties.

Cc: Andrew Morton 
Cc: Ingo Molnar 
Cc: Kevin Hilman 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Viresh Kumar 
Signed-off-by: Frederic Weisbecker 
---
 include/linux/tick.h |  9 -
 kernel/sched/core.c  |  5 +
 kernel/sched/sched.h |  2 +-
 kernel/time/Kconfig  |  2 ++
 kernel/time/tick-sched.c | 10 ++
 5 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index b84773c..8a4987f 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -181,7 +181,13 @@ static inline bool tick_nohz_full_cpu(int cpu)
 
 extern void tick_nohz_init(void);
 extern void __tick_nohz_full_check(void);
-extern void tick_nohz_full_kick(void);
+extern void tick_nohz_full_kick_cpu(int cpu);
+
+static inline void tick_nohz_full_kick(void)
+{
+   tick_nohz_full_kick_cpu(smp_processor_id());
+}
+
 extern void tick_nohz_full_kick_all(void);
 extern void __tick_nohz_task_switch(struct task_struct *tsk);
 #else
@@ -189,6 +195,7 @@ static inline void tick_nohz_init(void) { }
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
 static inline void __tick_nohz_full_check(void) { }
+static inline void tick_nohz_full_kick_cpu(int cpu) { }
 static inline void tick_nohz_full_kick(void) { }
 static inline void tick_nohz_full_kick_all(void) { }
 static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 268a45e..00ac248 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1500,9 +1500,7 @@ void scheduler_ipi(void)
 */
preempt_fold_need_resched();
 
-   if (llist_empty(&this_rq()->wake_list)
-   && !tick_nohz_full_cpu(smp_processor_id())
-   && !got_nohz_idle_kick())
+   if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
return;
 
/*
@@ -1519,7 +1517,6 @@ void scheduler_ipi(void)
 * somewhat pessimize the simple resched case.
 */
irq_enter();
-   tick_nohz_full_check();
sched_ttwu_pending();
 
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 456e492..6089e00 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1225,7 +1225,7 @@ static inline void inc_nr_running(struct rq *rq)
if (tick_nohz_full_cpu(rq->cpu)) {
/* Order rq->nr_running write against the IPI */
smp_wmb();
-   smp_send_reschedule(rq->cpu);
+   tick_nohz_full_kick_cpu(rq->cpu);
}
}
 #endif
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index f448513..27f1f63 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -101,6 +101,8 @@ config NO_HZ_FULL
depends on HAVE_CONTEXT_TRACKING
# VIRT_CPU_ACCOUNTING_GEN dependency
depends on HAVE_VIRT_CPU_ACCOUNTING_GEN
+   # tickless irq work
+   depends on HAVE_IRQ_WORK_IPI
select NO_HZ_COMMON
select RCU_USER_QS
select RCU_NOCB_CPU
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6558b7a..3d63944 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -224,13 +224,15 @@ static DEFINE_PER_CPU(struct irq_work, 
nohz_full_kick_work) = {
 };
 
 /*
- * Kick the current CPU if it's full dynticks in order to force it to
+ * Kick the CPU if it's full dynticks in order to force it to
  * re-evaluate its dependency on the tick and restart it if necessary.
  */
-void tick_nohz_full_kick(void)
+void tick_nohz_full_kick_cpu(int cpu)
 {
-   if (tick_nohz_full_cpu(smp_processor_id()))
-   irq_work_queue(&__get_cpu_var(nohz_full_kick_work));
+   if (!tick_nohz_full_cpu(cpu))
+   return;
+
+   irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
 }
 
 static void nohz_full_kick_ipi(void *info)
-- 
1.8.3.1



[PATCH 1/5] irq_work: Architecture support for remote irq work raise

2014-05-11 Thread Frederic Weisbecker
We are going to extend irq work to support remote queuing.

So let's add a cpu argument to arch_irq_work_raise(). The architectures
willing to support that must then provide the backend to raise irq work
IPIs remotely.

Initial support is provided for x86 and ARM since they are easily
extended. The other archs that override arch_irq_work_raise() seem
to use local clock interrupts and therefore need a deeper rewrite of their
irq work support to implement remote raising.

Cc: Andrew Morton 
Cc: Benjamin Herrenschmidt 
Cc: David S. Miller 
Cc: Ingo Molnar 
Cc: Kevin Hilman 
Cc: Paul E. McKenney 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Russell King 
Cc: Thomas Gleixner 
Cc: Viresh Kumar 
Signed-off-by: Frederic Weisbecker 
---
 arch/Kconfig   | 12 
 arch/alpha/kernel/time.c   |  3 ++-
 arch/arm/Kconfig   |  1 +
 arch/arm/kernel/smp.c  |  4 ++--
 arch/powerpc/kernel/time.c |  3 ++-
 arch/sparc/kernel/pcr.c|  3 ++-
 arch/x86/Kconfig   |  1 +
 arch/x86/kernel/irq_work.c | 10 ++
 kernel/irq_work.c  |  4 ++--
 9 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 97ff872..3a38356 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -472,6 +472,18 @@ config HAVE_IRQ_EXIT_ON_IRQ_STACK
  This spares a stack switch and improves cache usage on softirq
  processing.
 
+config HAVE_IRQ_WORK_IPI
+   bool
+   help
+ Architecture supports raising irq work interrupts both locally and
+ remotely. Without this capability, we can only trigger local irq works
+ loosely handled by the generic timer tick with the bad implications
+ coming along: the irq work is subject to HZ latency and it runs under
+ the tick random locking scenario (possibly holding hrtimer lock).
+
+ This capability is required on configs running with a very minimized
+ tick rate like full dynticks.
+
 #
 # ABI hall of shame
 #
diff --git a/arch/alpha/kernel/time.c b/arch/alpha/kernel/time.c
index ee39cee..2ff0c61 100644
--- a/arch/alpha/kernel/time.c
+++ b/arch/alpha/kernel/time.c
@@ -60,8 +60,9 @@ DEFINE_PER_CPU(u8, irq_work_pending);
 #define test_irq_work_pending()  __get_cpu_var(irq_work_pending)
 #define clear_irq_work_pending() __get_cpu_var(irq_work_pending) = 0
 
-void arch_irq_work_raise(void)
+void arch_irq_work_raise(int cpu)
 {
+   WARN_ON_ONCE(cpu != smp_processor_id());
set_irq_work_pending_flag();
 }
 
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index db3c541..7edce21 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -46,6 +46,7 @@ config ARM
select HAVE_HW_BREAKPOINT if (PERF_EVENTS && (CPU_V6 || CPU_V6K || 
CPU_V7))
select HAVE_IDE if PCI || ISA || PCMCIA
select HAVE_IRQ_TIME_ACCOUNTING
+   select HAVE_IRQ_WORK_IPI
select HAVE_KERNEL_GZIP
select HAVE_KERNEL_LZ4
select HAVE_KERNEL_LZMA
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 7c4fada..042a800 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -454,10 +454,10 @@ void arch_send_call_function_single_ipi(int cpu)
 }
 
 #ifdef CONFIG_IRQ_WORK
-void arch_irq_work_raise(void)
+void arch_irq_work_raise(int cpu)
 {
if (is_smp())
-   smp_cross_call(cpumask_of(smp_processor_id()), IPI_IRQ_WORK);
+   smp_cross_call(cpumask_of(cpu), IPI_IRQ_WORK);
 }
 #endif
 
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 122a580..4de25f4 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -464,9 +464,10 @@ DEFINE_PER_CPU(u8, irq_work_pending);
 
 #endif /* 32 vs 64 bit */
 
-void arch_irq_work_raise(void)
+void arch_irq_work_raise(int cpu)
 {
preempt_disable();
+   WARN_ON_ONCE(cpu != smp_processor_id());
set_irq_work_pending_flag();
set_dec(1);
preempt_enable();
diff --git a/arch/sparc/kernel/pcr.c b/arch/sparc/kernel/pcr.c
index 269af58..0e5bfd9 100644
--- a/arch/sparc/kernel/pcr.c
+++ b/arch/sparc/kernel/pcr.c
@@ -43,8 +43,9 @@ void __irq_entry deferred_pcr_work_irq(int irq, struct 
pt_regs *regs)
set_irq_regs(old_regs);
 }
 
-void arch_irq_work_raise(void)
+void arch_irq_work_raise(int cpu)
 {
+   WARN_ON_ONCE(cpu != smp_processor_id());
set_softint(1 << PIL_DEFERRED_PCR_WORK);
 }
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 25d2c6f..b06f3fd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -130,6 +130,7 @@ config X86
select HAVE_CC_STACKPROTECTOR
select GENERIC_CPU_AUTOPROBE
select HAVE_ARCH_AUDITSYSCALL
+   select HAVE_IRQ_WORK_IPI
 
 config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/kernel/irq_work.c b/arch/x86/kernel/irq_work.c
index 1de84e3..500ec1f 100644
--- a/arch/x86/kernel/irq_work.c
+++ b/arch/x86/kernel/irq_work.c
@@ -38,13 +38,7 @@ __visible void smp_trace_irq_work_interrupt(struct 

[PATCH 2/5] irq_work: Force non-lazy works on IPI

2014-05-11 Thread Frederic Weisbecker
As we plan to handle the full nohz IPI using irq work, we need to
run non-lazy works outside the tick because the tick runs them under the
hrtimer lock. This is not desirable for the nohz callback re-evaluating the
tick, since that callback can take the hrtimer lock itself.
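
To illustrate the resulting split (hypothetical callers, not part of the
patch): a lazy work may be deferred to the next tick, while a non-lazy one
is queued on the raised list and triggers arch_irq_work_raise():

	#include <linux/irq_work.h>

	static void example_fn(struct irq_work *work) { }

	static struct irq_work lazy_example = {
		.flags = IRQ_WORK_LAZY,
		.func  = example_fn,
	};
	static DEFINE_IRQ_WORK(urgent_example, example_fn);	/* non-lazy */

	static void queue_both(void)
	{
		irq_work_queue(&lazy_example);		/* may be deferred to the next tick */
		irq_work_queue(&urgent_example);	/* goes to the raised list, raises the interrupt */
	}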

Cc: Andrew Morton 
Cc: Ingo Molnar 
Cc: Kevin Hilman 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Viresh Kumar 
Signed-off-by: Frederic Weisbecker 
---
 include/linux/irq_work.h |  1 +
 kernel/irq_work.c| 58 ++--
 kernel/timer.c   |  2 +-
 3 files changed, 34 insertions(+), 27 deletions(-)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index 19ae05d..429b1ba 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -34,6 +34,7 @@ void init_irq_work(struct irq_work *work, void (*func)(struct 
irq_work *))
 
 bool irq_work_queue(struct irq_work *work);
 void irq_work_run(void);
+void irq_work_run_tick(void);
 void irq_work_sync(struct irq_work *work);
 
 #ifdef CONFIG_IRQ_WORK
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 2559383..0a554a6 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -19,8 +19,8 @@
 #include 
 
 
-static DEFINE_PER_CPU(struct llist_head, irq_work_list);
-static DEFINE_PER_CPU(int, irq_work_raised);
+static DEFINE_PER_CPU(struct llist_head, lazy_list);
+static DEFINE_PER_CPU(struct llist_head, raised_list);
 
 /*
  * Claim the entry so that no one else will poke at it.
@@ -63,14 +63,14 @@ void __weak arch_irq_work_raise(int cpu)
  */
 bool irq_work_queue(struct irq_work *work)
 {
+   unsigned long flags;
+
/* Only queue if not already pending */
if (!irq_work_claim(work))
return false;
 
/* Queue the entry and raise the IPI if needed. */
-   preempt_disable();
-
-   llist_add(&work->llnode, &__get_cpu_var(irq_work_list));
+   local_irq_save(flags);
 
/*
 * If the work is not "lazy" or the tick is stopped, raise the irq
@@ -78,11 +78,13 @@ bool irq_work_queue(struct irq_work *work)
 * for the next tick.
 */
if (!(work->flags & IRQ_WORK_LAZY) || tick_nohz_tick_stopped()) {
-   if (!this_cpu_cmpxchg(irq_work_raised, 0, 1))
+   if (llist_add(&work->llnode, &__get_cpu_var(raised_list)))
arch_irq_work_raise(smp_processor_id());
+   } else {
+   llist_add(&work->llnode, &__get_cpu_var(lazy_list));
}
 
-   preempt_enable();
+   local_irq_restore(flags);
 
return true;
 }
@@ -90,10 +92,7 @@ EXPORT_SYMBOL_GPL(irq_work_queue);
 
 bool irq_work_needs_cpu(void)
 {
-   struct llist_head *this_list;
-
-   this_list = &__get_cpu_var(irq_work_list);
-   if (llist_empty(this_list))
+   if (llist_empty(&__get_cpu_var(lazy_list)))
return false;
 
/* All work should have been flushed before going offline */
@@ -102,28 +101,18 @@ bool irq_work_needs_cpu(void)
return true;
 }
 
-static void __irq_work_run(void)
+static void __irq_work_run(struct llist_head *list)
 {
unsigned long flags;
struct irq_work *work;
-   struct llist_head *this_list;
struct llist_node *llnode;
 
-
-   /*
-* Reset the "raised" state right before we check the list because
-* an NMI may enqueue after we find the list empty from the runner.
-*/
-   __this_cpu_write(irq_work_raised, 0);
-   barrier();
-
-   this_list = &__get_cpu_var(irq_work_list);
-   if (llist_empty(this_list))
+   if (llist_empty(list))
return;
 
BUG_ON(!irqs_disabled());
 
-   llnode = llist_del_all(this_list);
+   llnode = llist_del_all(list);
while (llnode != NULL) {
work = llist_entry(llnode, struct irq_work, llnode);
 
@@ -155,11 +144,27 @@ static void __irq_work_run(void)
 void irq_work_run(void)
 {
BUG_ON(!in_irq());
-   __irq_work_run();
+   __irq_work_run(&__get_cpu_var(raised_list));
+   __irq_work_run(&__get_cpu_var(lazy_list));
 }
 EXPORT_SYMBOL_GPL(irq_work_run);
 
 /*
+ * Run the lazy irq_work entries on this cpu from the tick. But let
+ * the IPI handle the others. Some works may require to work outside
+ * the tick due to its locking dependencies (hrtimer lock).
+ */
+void irq_work_run_tick(void)
+{
+   BUG_ON(!in_irq());
+#ifndef CONFIG_HAVE_IRQ_WORK_IPI
+   /* No IPI support, we don't have the choice... */
+   __irq_work_run(&__get_cpu_var(raised_list));
+#endif
+   __irq_work_run(&__get_cpu_var(lazy_list));
+}
+
+/*
  * Synchronize against the irq_work @entry, ensures the entry is not
  * currently in use.
  */
@@ -183,7 +188,8 @@ static int irq_work_cpu_notify(struct notifier_block *self,
/* Called from stop_machine */
if (WARN

Re: [PATCH 1/5] irq_work: Architecture support for remote irq work raise

2014-05-12 Thread Frederic Weisbecker
On Mon, May 12, 2014 at 09:56:50AM +0200, Peter Zijlstra wrote:
> On Mon, May 12, 2014 at 01:33:53AM +0200, Frederic Weisbecker wrote:
> > We are going to extend irq work to support remote queuing.
> > 
> > So lets add a cpu argument to arch_irq_work_raise(). The architectures
> > willing to support that must then provide the backend to raise irq work
> > IPIs remotely.
> > 
> > Initial support is provided for x86 and ARM since they are easily
> > extended. The other archs that overwrite arch_irq_work_raise() seem
> > to use local clock interrupts and therefore need deeper rewrite of their
> > irq work support to implement remote raising.
> > 
> 
> > Signed-off-by: Frederic Weisbecker 
> 
> Why not borrow the smp_call_function IPI for the remote bits? We could
> limit the 'safe from NMI' to the local works. And we validate this by
> putting a WARN_ON(in_nmi()) in irq_work_queue_on().

Right, but although I don't need it to be safe from NMI, I need it
to be callable concurrently and when irqs are disabled.

So we can't use smp_call_function_single() for that. But we can use the async
version, in which case we must keep the irq work claim. But that's
about the same as the smp_queue_function_single() we had previously,
and we are back to our csd_lock issue.

> 
> At some later stage we could look at maybe merging the smp_function_call
> and irq_work into a single thing, where we have the irq_work interface
> as async and the smp_function_call() as sync interface.
> 
> But for now a quick 'hack' would be to call __irq_work_run() from
> generic_smp_call_function_single_interrupt().
> 
> That would leave arch_irq_work_raise() as the special NMI safe local IPI
> hook.

I don't know, calling irq_work_run() from there sounds like an overkill.

Or we can encapsulate:

struct remote_irq_work {
        struct irq_work work;
        struct call_single_data csd;
};

bool irq_work_queue_on(struct remote_irq_work *remote_work, int cpu)
{
        if (!irq_work_claim(&remote_work->work))
                return false;
        remote_work->csd.func = irq_work_remote_interrupt;
        remote_work->csd.info = &remote_work->work;
        smp_call_function_single_async(cpu, &remote_work->csd);
        return true;
}

void irq_work_remote_interrupt(void *info)
{
        struct irq_work *work = info;

        work->func(work);
}

And some tweaking to keep csd_lock() out of the way. smp_call_function_single_async()
doesn't need it.

Or the two other solutions:

1) expand irq_work to support remote queuing. And really this doesn't add more
atomic operations nor smp barriers than the local queuing. The only complication
is that we need remote raise support from the arch.

2) expand the smp_call_function stuff with smp_queue_function_single(), as I did
previously. It's about the same as the above; the irq_work_queue_on() above will
need to be conditional though.

The native remote support on irq_work sounds to me the most proper. But unfortunately
it requires support from arch that we already have with smp_function...
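
For reference, a rough sketch of what option 1 (native remote queueing) boils
down to, reusing the per-CPU raised_list and the cpu-aware arch_irq_work_raise()
from the draft patches above; illustrative only, not a tested patch:

bool irq_work_queue_on(struct irq_work *work, int cpu)
{
        /* Only queue if not already pending */
        if (!irq_work_claim(work))
                return false;

        /* All work should have been flushed before going offline */
        WARN_ON_ONCE(cpu_is_offline(cpu));

        /* First queuer on that CPU sends the arch self/remote IPI */
        if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
                arch_irq_work_raise(cpu);

        return true;
}
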
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] irq_work: Architecture support for remote irq work raise

2014-05-12 Thread Frederic Weisbecker
On Mon, May 12, 2014 at 01:11:50PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2014-05-12 at 10:08 +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2014-05-12 at 01:33 +0200, Frederic Weisbecker wrote:
> > > We are going to extend irq work to support remote queuing.
> > > 
> > > So lets add a cpu argument to arch_irq_work_raise(). The architectures
> > > willing to support that must then provide the backend to raise irq work
> > > IPIs remotely.
> > > 
> > > Initial support is provided for x86 and ARM since they are easily
> > > extended. The other archs that overwrite arch_irq_work_raise() seem
> > > to use local clock interrupts and therefore need deeper rewrite of their
> > > irq work support to implement remote raising.
> > 
> > Well, looks like it's time to turn it into an IPI... It gets a bit more
> > tricky because whether whacking the interrupt controller is safe to
> > do from an NMI is safe or not might depend on that irq controller
> > implementation...
> > 
> > It looks like XICS and MPIC should be safe though, so at least we
> > should be able to cover ppc64, but I'll leave ppc32 alone.
> 
> Correction... that's actually a bit more tricky. We might need an MMIO
> to trigger the IPI. That means potentially having to take a hash miss,
> and we certainly can't do that at NMI time at the moment.
> 
> We *could* hard disable interrupts (which blocks our NMIs since they
> arent't real NMIs, they are just a way to bypass our soft-disable state
> for perf interrupts) for hash_page, but that still makes me somewhat
> nervous.
> 
> Another option would be to add an ioremap flag of some description to
> be able to install bolted hash entries. (It already does so if called
> early enough during boot, so it might actually just work by accident but
> that's an undebuggable horror show waiting to happen if we ever change
> that).
> 
> So needs a bit more thinking on our side.

Yeah, well if we ever end up with native remote irq work, only local raise
will need to be NMI-safe. If that ever helps...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] irq_work: Architecture support for remote irq work raise

2014-05-12 Thread Frederic Weisbecker
On Mon, May 12, 2014 at 07:17:29PM +0200, Peter Zijlstra wrote:
> On Mon, May 12, 2014 at 06:26:49PM +0200, Frederic Weisbecker wrote:
> > On Mon, May 12, 2014 at 09:56:50AM +0200, Peter Zijlstra wrote:
> > > On Mon, May 12, 2014 at 01:33:53AM +0200, Frederic Weisbecker wrote:
> > > > We are going to extend irq work to support remote queuing.
> > > > 
> > > > So lets add a cpu argument to arch_irq_work_raise(). The architectures
> > > > willing to support that must then provide the backend to raise irq work
> > > > IPIs remotely.
> > > > 
> > > > Initial support is provided for x86 and ARM since they are easily
> > > > extended. The other archs that overwrite arch_irq_work_raise() seem
> > > > to use local clock interrupts and therefore need deeper rewrite of their
> > > > irq work support to implement remote raising.
> > > > 
> > > 
> > > > Signed-off-by: Frederic Weisbecker 
> > > 
> > > Why not borrow the smp_call_function IPI for the remote bits? We could
> > > limit the 'safe from NMI' to the local works. And we validate this by
> > > putting a WARN_ON(in_nmi()) in irq_work_queue_on().
> > 
> > Right, but although I don't need it to be safe from NMI, I need it
> > to be callable concurrently and when irqs are disabled.
> > 
> > So we can't use smp_call_function_single() for that. But we can use the 
> > async
> > version in which case we must keep the irq work claim. But that's
> > about the same than smp_queue_function_single() we had previously
> > and we are back with our csd_lock issue.
> 
> Who said anything about using smp_call_function_single()?

Ah shortcutting, doesn't look bad indeed.

> 
> 
> ---
> diff --git a/kernel/irq_work.c b/kernel/irq_work.c
> index a82170e2fa78..2fc9d8ece05a 100644
> --- a/kernel/irq_work.c
> +++ b/kernel/irq_work.c
> @@ -61,7 +61,8 @@ void __weak arch_irq_work_raise(void)
>   *
>   * Can be re-enqueued while the callback is still in progress.
>   */
> -bool irq_work_queue(struct irq_work *work)
> +static __always_inline bool
> +__irq_work_queue_on(struct irq_work *work, int cpu)
>  {
>   /* Only queue if not already pending */
>   if (!irq_work_claim(work))
> @@ -78,16 +79,31 @@ bool irq_work_queue(struct irq_work *work)
>* for the next tick.
>*/
>   if (!(work->flags & IRQ_WORK_LAZY) || tick_nohz_tick_stopped()) {
> - if (!this_cpu_cmpxchg(irq_work_raised, 0, 1))
> - arch_irq_work_raise();
> + if (cmpxchg(&__get_cpu_var(irq_work_raised, 0, 1) == 0)) {
> + if (cpu == smp_processor_id() || cpu == -1)
> + arch_irq_work_raise();
> + else
> + arch_send_call_function_single_ipi();
> + }

Ok that needs some more tuning with the raised flag and the destination list
to pick, but I get the idea.

>   }
>  
>   preempt_enable();
>  
>   return true;
>  }
> +
> +bool irq_work_queue(struct irq_work *work)
> +{
> + return __irq_work_queue_on(work, -1);
> +}
>  EXPORT_SYMBOL_GPL(irq_work_queue);
>  
> +bool irq_work_queue_on(struct irq_work *work, int cpu)
> +{
> + WARN_ON_ONCE(in_nmi());
> + return __irq_work_queue_on(work, cpu);
> +}
> +
>  bool irq_work_needs_cpu(void)
>  {
>   struct llist_head *this_list;
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 06d574e42c72..0fd53963c4fb 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -198,6 +198,12 @@ void generic_smp_call_function_single_interrupt(void)
>   csd->func(csd->info);
>   csd_unlock(csd);
>   }
> +
> + /*
> +  * First run the synchronous callbacks, people are waiting on them;
> +  * then run the async ones.
> +  */
> + irq_work_run();
>  }

Alright, I'm reiterating with that.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sched,nohz: Change rq->nr_running always using wrappers

2014-05-12 Thread Frederic Weisbecker
On Fri, May 09, 2014 at 03:00:14AM +0400, Kirill Tkhai wrote:
> Sometimes nr_running may cross 2 but interrupt is not being
> sent to rq's cpu. In this case we don't reenable timer.
> Looks like, this may be a reason of rare unexpected effects,
> if nohz is enabled.
> 
> Patch replaces all places of direct changing of nr_running
> and makes add_nr_running() caring about crossing border.
> 
> Signed-off-by: Kirill Tkhai 
> CC: Frederic Weisbecker 
> CC: Peter Zijlstra 
> CC: Ingo Molnar 

Right I had that issue with throttling in my TODO list.

Thanks for fixing that:

Acked-by: Frederic Weisbecker 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 5/5] nohz: Use IPI implicit full barrier against rq->nr_running r/w

2014-05-13 Thread Frederic Weisbecker
A full dynticks CPU is allowed to stop its tick when a single task runs.
Meanwhile, when a new task gets enqueued, the CPU must be notified so that
it restarts its tick to maintain local fairness and other accounting details.

This notification is performed by way of an IPI. Then when the target
receives the IPI, we expect it to see the new value of rq->nr_running.

Hence the following ordering scenario:

   CPU 0                       CPU 1

   write rq->nr_running        get IPI
   smp_wmb()                   smp_rmb()
   send IPI                    read rq->nr_running

But Paul McKenney says that nowadays IPIs imply a full barrier on
all architectures. So we can safely remove this pair and rely on the
implicit barriers that come along with IPI send/receive. Lets
just comment on this new assumption.

Cc: Andrew Morton 
Cc: Ingo Molnar 
Cc: Kevin Hilman 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Viresh Kumar 
Signed-off-by: Frederic Weisbecker 
---
 kernel/sched/core.c  |  9 +
 kernel/sched/sched.h | 10 --
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fb6dfad..a06cac1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -670,10 +670,11 @@ bool sched_can_stop_tick(void)
 
rq = this_rq();
 
-   /* Make sure rq->nr_running update is visible after the IPI */
-   smp_rmb();
-
-   /* More than one running task need preemption */
+   /*
+   * More than one running task need preemption.
+   * nr_running update is assumed to be visible
+   * after IPI is sent from wakers.
+   */
if (rq->nr_running > 1)
return false;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6089e00..219bfbd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1223,8 +1223,14 @@ static inline void inc_nr_running(struct rq *rq)
 #ifdef CONFIG_NO_HZ_FULL
if (rq->nr_running == 2) {
if (tick_nohz_full_cpu(rq->cpu)) {
-   /* Order rq->nr_running write against the IPI */
-   smp_wmb();
+   /*
+* Tick is needed if more than one task runs on a CPU.
+* Send the target an IPI to kick it out of nohz mode.
+*
+* We assume that IPI implies full memory barrier and the
+* new value of rq->nr_running is visible on reception
+* from the target.
+*/
tick_nohz_full_kick_cpu(rq->cpu);
}
}
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/5] irq_work: Let arch tell us if it can raise irq work

2014-05-13 Thread Frederic Weisbecker
We prepare for executing the full nohz kick through an irq work. But
if we do this as is, we'll run into conflicting tick locking: the tick
holds the hrtimer lock and the nohz kick may do so too.

So we need to be able to force the execution of some irq works (more
precisely the non-lazy ones) to the arch irq work interrupt if any.

As a start, we need to know if the arch supports sending its own self-IPIs
and doesn't rely on the tick to execute the works.

This solution proposes a weak function. Of course it's ugly and meant
only as a draft. The best would be to call a generic
irq_work_set_raisable() only once per arch.
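
For illustration, the "call it once per arch" alternative could look roughly
like the sketch below; irq_work_set_raisable() is the name from the paragraph
above, everything else is a guess:

/* kernel/irq_work.c -- sketch of the boot-time state alternative */
static bool irq_work_raisable __read_mostly;

/* Called once by archs that can send their own irq work self-IPI */
void irq_work_set_raisable(void)
{
        irq_work_raisable = true;
}

void irq_work_run_tick(void)
{
        BUG_ON(!in_irq());

        /* No self-IPI support registered: the tick must run everything */
        if (!irq_work_raisable)
                __irq_work_run(&__get_cpu_var(raised_list));
        __irq_work_run(&__get_cpu_var(lazy_list));
}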

Cc: Andrew Morton 
Cc: Ingo Molnar 
Cc: Kevin Hilman 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Viresh Kumar 
Not-Yet-Signed-off-by: Frederic Weisbecker 
---
 arch/alpha/kernel/time.c   | 5 +
 arch/arm/kernel/smp.c  | 5 +
 arch/powerpc/kernel/time.c | 5 +
 arch/sparc/kernel/pcr.c| 5 +
 arch/x86/kernel/irq_work.c | 7 +++
 kernel/irq_work.c  | 5 +
 6 files changed, 32 insertions(+)

diff --git a/arch/alpha/kernel/time.c b/arch/alpha/kernel/time.c
index ee39cee..b30d7bd 100644
--- a/arch/alpha/kernel/time.c
+++ b/arch/alpha/kernel/time.c
@@ -65,6 +65,11 @@ void arch_irq_work_raise(void)
set_irq_work_pending_flag();
 }
 
+bool arch_irq_work_can_raise(void)
+{
+   return true;
+}
+
 #else  /* CONFIG_IRQ_WORK */
 
 #define test_irq_work_pending()  0
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 7c4fada..89ff3a3 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -459,6 +459,11 @@ void arch_irq_work_raise(void)
if (is_smp())
smp_cross_call(cpumask_of(smp_processor_id()), IPI_IRQ_WORK);
 }
+
+bool arch_irq_work_can_raise(void)
+{
+   return is_smp();
+}
 #endif
 
 static const char *ipi_types[NR_IPI] = {
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 122a580..e5381e8 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -472,6 +472,11 @@ void arch_irq_work_raise(void)
preempt_enable();
 }
 
+bool arch_irq_work_can_raise(void)
+{
+   return true;
+}
+
 #else  /* CONFIG_IRQ_WORK */
 
 #define test_irq_work_pending()0
diff --git a/arch/sparc/kernel/pcr.c b/arch/sparc/kernel/pcr.c
index 269af58..658f4bc 100644
--- a/arch/sparc/kernel/pcr.c
+++ b/arch/sparc/kernel/pcr.c
@@ -48,6 +48,11 @@ void arch_irq_work_raise(void)
set_softint(1 << PIL_DEFERRED_PCR_WORK);
 }
 
+bool arch_irq_work_can_raise(void)
+{
+   return true;
+}
+
 const struct pcr_ops *pcr_ops;
 EXPORT_SYMBOL_GPL(pcr_ops);
 
diff --git a/arch/x86/kernel/irq_work.c b/arch/x86/kernel/irq_work.c
index 1de84e3..03e1ee4 100644
--- a/arch/x86/kernel/irq_work.c
+++ b/arch/x86/kernel/irq_work.c
@@ -48,3 +48,10 @@ void arch_irq_work_raise(void)
apic_wait_icr_idle();
 #endif
 }
+
+#ifdef CONFIG_X86_LOCAL_APIC
+bool arch_irq_work_can_raise(void)
+{
+   return cpu_has_apic;
+}
+#endif
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index a82170e..2a5aad4 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -55,6 +55,11 @@ void __weak arch_irq_work_raise(void)
 */
 }
 
+bool __weak arch_irq_work_can_raise(void)
+{
+   return false;
+}
+
 /*
  * Enqueue the irq_work @entry unless it's already pending
  * somewhere.
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/5] irq_work: Force non-lazy works to the IPI

2014-05-13 Thread Frederic Weisbecker
As we plan to handle the full nohz IPI using irq work, we need to
enforce that non-lazy works run outside the tick, because the tick is called
under the hrtimer lock. This is not desired for the nohz callback re-evaluating
the tick, because that callback can take the hrtimer lock itself.

Cc: Andrew Morton 
Cc: Ingo Molnar 
Cc: Kevin Hilman 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Viresh Kumar 
Signed-off-by: Frederic Weisbecker 
---
 include/linux/irq_work.h |  1 +
 kernel/irq_work.c| 61 +++-
 kernel/timer.c   |  2 +-
 3 files changed, 36 insertions(+), 28 deletions(-)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index 19ae05d..429b1ba 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -34,6 +34,7 @@ void init_irq_work(struct irq_work *work, void (*func)(struct irq_work *))
 
 bool irq_work_queue(struct irq_work *work);
 void irq_work_run(void);
+void irq_work_run_tick(void);
 void irq_work_sync(struct irq_work *work);
 
 #ifdef CONFIG_IRQ_WORK
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 2a5aad4..292a9ac 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -19,8 +19,8 @@
 #include 
 
 
-static DEFINE_PER_CPU(struct llist_head, irq_work_list);
-static DEFINE_PER_CPU(int, irq_work_raised);
+static DEFINE_PER_CPU(struct llist_head, raised_list);
+static DEFINE_PER_CPU(struct llist_head, lazy_list);
 
 /*
  * Claim the entry so that no one else will poke at it.
@@ -68,14 +68,14 @@ bool __weak arch_irq_work_can_raise(void)
  */
 bool irq_work_queue(struct irq_work *work)
 {
+   unsigned long flags;
+
/* Only queue if not already pending */
if (!irq_work_claim(work))
return false;
 
-   /* Queue the entry and raise the IPI if needed. */
-   preempt_disable();
-
-   llist_add(&work->llnode, &__get_cpu_var(irq_work_list));
+   /* Make sure an IRQ doesn't stop the tick concurrently */
+   local_irq_save(flags);
 
/*
 * If the work is not "lazy" or the tick is stopped, raise the irq
@@ -83,11 +83,13 @@ bool irq_work_queue(struct irq_work *work)
 * for the next tick.
 */
if (!(work->flags & IRQ_WORK_LAZY) || tick_nohz_tick_stopped()) {
-   if (!this_cpu_cmpxchg(irq_work_raised, 0, 1))
+   if (llist_add(&work->llnode, &__get_cpu_var(raised_list)))
arch_irq_work_raise();
+   } else {
+   llist_add(&work->llnode, &__get_cpu_var(lazy_list));
}
 
-   preempt_enable();
+   local_irq_restore(flags);
 
return true;
 }
@@ -95,10 +97,7 @@ EXPORT_SYMBOL_GPL(irq_work_queue);
 
 bool irq_work_needs_cpu(void)
 {
-   struct llist_head *this_list;
-
-   this_list = &__get_cpu_var(irq_work_list);
-   if (llist_empty(this_list))
+   if (llist_empty(&__get_cpu_var(lazy_list)))
return false;
 
/* All work should have been flushed before going offline */
@@ -107,28 +106,18 @@ bool irq_work_needs_cpu(void)
return true;
 }
 
-static void __irq_work_run(void)
+static void __irq_work_run(struct llist_head *list)
 {
unsigned long flags;
struct irq_work *work;
-   struct llist_head *this_list;
struct llist_node *llnode;
 
-
-   /*
-* Reset the "raised" state right before we check the list because
-* an NMI may enqueue after we find the list empty from the runner.
-*/
-   __this_cpu_write(irq_work_raised, 0);
-   barrier();
-
-   this_list = &__get_cpu_var(irq_work_list);
-   if (llist_empty(this_list))
+   if (llist_empty(list))
return;
 
BUG_ON(!irqs_disabled());
 
-   llnode = llist_del_all(this_list);
+   llnode = llist_del_all(list);
while (llnode != NULL) {
work = llist_entry(llnode, struct irq_work, llnode);
 
@@ -160,11 +149,28 @@ static void __irq_work_run(void)
 void irq_work_run(void)
 {
BUG_ON(!in_irq());
-   __irq_work_run();
+   __irq_work_run(&__get_cpu_var(raised_list));
+   __irq_work_run(&__get_cpu_var(lazy_list));
 }
 EXPORT_SYMBOL_GPL(irq_work_run);
 
 /*
+ * Run the lazy irq_work entries on this cpu from the tick. But let
+ * the IPI handle the others. Some works may require to work outside
+ * the tick due to its locking dependencies (hrtimer lock).
+ */
+void irq_work_run_tick(void)
+{
+   BUG_ON(!in_irq());
+
+   if (!arch_irq_work_can_raise()) {
+   /* No IPI support, we don't have the choice... */
+   __irq_work_run(&__get_cpu_var(raised_list));
+   }
+   __irq_work_run(&__get_cpu_var(lazy_list));
+}
+
+/*
  * Synchronize against the irq_work @entry, ensures the entry is not
  * currently in use.
  */
@@ -188,7 +194,8 @@ static int irq_work_cpu_notify(struct notifier_block *self,

[PATCH 3/5] irq_work: Allow remote queueing

2014-05-13 Thread Frederic Weisbecker
irq work currently only supports local callbacks. However its code
is mostly ready to run remote callbacks and we have some potential users.

The full nohz subsystem currently open codes its own remote irq work
on top of the scheduler IPI when it wants a CPU to re-evaluate its next
tick. However this ad hoc solution bloats the scheduler IPI.

Lets just extend the irq work subsystem to support remote queuing on top
of the generic SMP IPI to handle this kind of user. This shouldn't add
noticeable overhead.
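
As a usage illustration (the names below are made up, the real user comes in
the next patch), a caller that wants its callback to run on a remote CPU would
do something like:

static void remote_check_fn(struct irq_work *work)
{
        /* Runs on the target CPU, from the generic SMP-call IPI */
}

static DEFINE_PER_CPU(struct irq_work, remote_check_work) = {
        .func = remote_check_fn,
};

static void kick_remote_cpu(int cpu)
{
        /* Callable with irqs disabled, concurrent callers are fine */
        irq_work_queue_on(&per_cpu(remote_check_work, cpu), cpu);
}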

Suggested-by: Peter Zijlstra 
Cc: Andrew Morton 
Cc: Ingo Molnar 
Cc: Kevin Hilman 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Viresh Kumar 
Signed-off-by: Frederic Weisbecker 
---
 include/linux/irq_work.h |  2 ++
 kernel/irq_work.c| 17 +
 kernel/smp.c |  3 +++
 3 files changed, 22 insertions(+)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index 429b1ba..511e7f7 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -33,6 +33,8 @@ void init_irq_work(struct irq_work *work, void (*func)(struct irq_work *))
 #define DEFINE_IRQ_WORK(name, _f) struct irq_work name = { .func = (_f), }
 
 bool irq_work_queue(struct irq_work *work);
+bool irq_work_queue_on(struct irq_work *work, int cpu);
+
 void irq_work_run(void);
 void irq_work_run_tick(void);
 void irq_work_sync(struct irq_work *work);
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 292a9ac..98dab29 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -66,6 +66,23 @@ bool __weak arch_irq_work_can_raise(void)
  *
  * Can be re-enqueued while the callback is still in progress.
  */
+bool irq_work_queue_on(struct irq_work *work, int cpu)
+{
+   /* Only queue if not already pending */
+   if (!irq_work_claim(work))
+   return false;
+
+   /* All work should have been flushed before going offline */
+   WARN_ON_ONCE(cpu_is_offline(cpu));
+   WARN_ON_ONCE(work->flags & IRQ_WORK_LAZY);
+
+   if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+   native_send_call_func_single_ipi(cpu);
+
+   return true;
+}
+EXPORT_SYMBOL_GPL(irq_work_queue_on);
+
 bool irq_work_queue(struct irq_work *work)
 {
unsigned long flags;
diff --git a/kernel/smp.c b/kernel/smp.c
index 06d574e..f5edb96 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -3,6 +3,7 @@
  *
  * (C) Jens Axboe  2008
  */
+#include 
 #include 
 #include 
 #include 
@@ -198,6 +199,8 @@ void generic_smp_call_function_single_interrupt(void)
csd->func(csd->info);
csd_unlock(csd);
}
+
+   irq_work_run();
 }
 
 /*
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH 0/5] nohz: Move nohz kick out of scheduler IPI, v4

2014-05-13 Thread Frederic Weisbecker
So this version now implements remote irq works using the generic IPI
interrupt available in most archs, as suggested by Peterz.

Keep in mind that the first patch is really just a draft to build the
mockup. It needs to be turned into an internal state set on boot or so.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
timers/nohz-irq-work-v2

Thanks,
Frederic
---

Frederic Weisbecker (5):
  irq_work: Let arch tell us if it can raise irq work
  irq_work: Force non-lazy works to the IPI
  irq_work: Allow remote queueing
  nohz: Move full nohz kick to its own IPI
  nohz: Use IPI implicit full barrier against rq->nr_running r/w


 arch/alpha/kernel/time.c   |  5 +++
 arch/arm/kernel/smp.c  |  5 +++
 arch/powerpc/kernel/time.c |  5 +++
 arch/sparc/kernel/pcr.c|  5 +++
 arch/x86/kernel/irq_work.c |  7 
 include/linux/irq_work.h   |  3 ++
 include/linux/tick.h   |  9 -
 kernel/irq_work.c  | 83 +++---
 kernel/sched/core.c| 14 
 kernel/sched/sched.h   | 12 +--
 kernel/smp.c   |  3 ++
 kernel/time/tick-sched.c   | 10 +++---
 kernel/timer.c |  2 +-
 13 files changed, 119 insertions(+), 44 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 4/5] nohz: Move full nohz kick to its own IPI

2014-05-13 Thread Frederic Weisbecker
Now that the irq work subsystem can queue remote callbacks, it's
a perfect fit to safely queue IPIs when interrupts are disabled
without worrying about concurrent callers.

Lets use it for the full dynticks kick to notify a CPU that it's
exiting single task mode.

This unbloats a bit the scheduler IPI that the nohz code was abusing
for its cool "callable anywhere/anytime" properties.

Cc: Andrew Morton 
Cc: Ingo Molnar 
Cc: Kevin Hilman 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Viresh Kumar 
Signed-off-by: Frederic Weisbecker 
---
 include/linux/tick.h |  9 -
 kernel/sched/core.c  |  5 +
 kernel/sched/sched.h |  2 +-
 kernel/time/tick-sched.c | 10 ++
 4 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index b84773c..8a4987f 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -181,7 +181,13 @@ static inline bool tick_nohz_full_cpu(int cpu)
 
 extern void tick_nohz_init(void);
 extern void __tick_nohz_full_check(void);
-extern void tick_nohz_full_kick(void);
+extern void tick_nohz_full_kick_cpu(int cpu);
+
+static inline void tick_nohz_full_kick(void)
+{
+   tick_nohz_full_kick_cpu(smp_processor_id());
+}
+
 extern void tick_nohz_full_kick_all(void);
 extern void __tick_nohz_task_switch(struct task_struct *tsk);
 #else
@@ -189,6 +195,7 @@ static inline void tick_nohz_init(void) { }
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
 static inline void __tick_nohz_full_check(void) { }
+static inline void tick_nohz_full_kick_cpu(int cpu) { }
 static inline void tick_nohz_full_kick(void) { }
 static inline void tick_nohz_full_kick_all(void) { }
 static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d9d8ece..fb6dfad 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1500,9 +1500,7 @@ void scheduler_ipi(void)
 */
preempt_fold_need_resched();
 
-   if (llist_empty(&this_rq()->wake_list)
-   && !tick_nohz_full_cpu(smp_processor_id())
-   && !got_nohz_idle_kick())
+   if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
return;
 
/*
@@ -1519,7 +1517,6 @@ void scheduler_ipi(void)
 * somewhat pessimize the simple resched case.
 */
irq_enter();
-   tick_nohz_full_check();
sched_ttwu_pending();
 
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 456e492..6089e00 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1225,7 +1225,7 @@ static inline void inc_nr_running(struct rq *rq)
if (tick_nohz_full_cpu(rq->cpu)) {
/* Order rq->nr_running write against the IPI */
smp_wmb();
-   smp_send_reschedule(rq->cpu);
+   tick_nohz_full_kick_cpu(rq->cpu);
}
}
 #endif
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6558b7a..3d63944 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -224,13 +224,15 @@ static DEFINE_PER_CPU(struct irq_work, nohz_full_kick_work) = {
 };
 
 /*
- * Kick the current CPU if it's full dynticks in order to force it to
+ * Kick the CPU if it's full dynticks in order to force it to
  * re-evaluate its dependency on the tick and restart it if necessary.
  */
-void tick_nohz_full_kick(void)
+void tick_nohz_full_kick_cpu(int cpu)
 {
-   if (tick_nohz_full_cpu(smp_processor_id()))
-   irq_work_queue(&__get_cpu_var(nohz_full_kick_work));
+   if (!tick_nohz_full_cpu(cpu))
+   return;
+
+   irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
 }
 
 static void nohz_full_kick_ipi(void *info)
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/6] workqueue: Record real per-workqueue cpumask

2014-05-13 Thread Frederic Weisbecker
On Thu, May 08, 2014 at 09:20:45PM +0800, Lai Jiangshan wrote:
> On Thu, May 8, 2014 at 12:37 AM, Frederic Weisbecker  
> wrote:
> > The real cpumask set by the user on WQ_SYSFS workqueues fails to be
> > recorded as is: What is actually recorded as per workqueue attribute
> > is the per workqueue cpumask intersected with the global unbounds cpumask.
> >
> > Eventually when the user overwrites a WQ_SYSFS cpumask and later read
> > this attibute, the value returned is not the last one written.
> >
> > The other bad side effect is that widening the global unbounds cpumask
> > doesn't actually widen the unbound workqueues affinity because their
> > own cpumask has been schrinked.
> >
> > In order to fix this, lets record the real per workqueue cpumask on the
> > workqueue struct. We restore this value when attributes are re-evaluated
> > later.
> >
> > FIXME: Maybe I should rather invert that. Have the user set workqueue
> > cpumask on attributes and the effective one on the workqueue struct instead.
> > We'll just need some tweaking in order to make the attributes of lower 
> > layers
> > (pools, worker pools, worker, ...) to inherit the effective cpumask and not
> > the user one.
> >
> > Cc: Christoph Lameter 
> > Cc: Kevin Hilman 
> > Cc: Lai Jiangshan 
> > Cc: Mike Galbraith 
> > Cc: Paul E. McKenney 
> > Cc: Tejun Heo 
> > Cc: Viresh Kumar 
> > Signed-off-by: Frederic Weisbecker 
> > ---
> >  kernel/workqueue.c | 8 
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 5978cee..504cf0a 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -248,6 +248,7 @@ struct workqueue_struct {
> > int saved_max_active; /* WQ: saved pwq 
> > max_active */
> >
> > struct workqueue_attrs  *unbound_attrs; /* WQ: only for unbound wqs 
> > */
> > +   cpumask_var_t   saved_cpumask;  /* WQ: only for unbound wqs 
> > */
> 
> 
> Forgot to use it? or use it in next patches?

Hmm, no it's used below.

> 
> > struct pool_workqueue   *dfl_pwq;   /* WQ: only for unbound wqs 
> > */
> >
> >  #ifdef CONFIG_SYSFS
> > @@ -3694,6 +3695,7 @@ static int apply_workqueue_attrs_locked(struct 
> > workqueue_struct *wq,
> > mutex_lock(&wq->mutex);
> >
> > copy_workqueue_attrs(wq->unbound_attrs, new_attrs);
> > +   cpumask_copy(wq->saved_cpumask, attrs->cpumask);
> 
> I think you can use ->unbound_attrs directly:
>   copy_workqueue_attrs(wq->unbound_attrs, attrs);
> 
> and update wq_update_unbound_numa():
>  copy_workqueue_attrs(tmp_attrs, wq->unbound_attrs);
>  cpumask_and(&tmp_attrs->cpumask, wq_unbound_cpumask)
> 
> use tmp_attr instead of wq->unbound_attrs in the left code of
> wq_update_unbound_numa()

But wq_update_unbound_numa() is only called on cpu hotplug operations,
right? So this may have no effect after setting a cpumask in sysfs.

How about keeping the sysfs cpumask in wq's unbound_attrs but passing the
effective one to pwq creation in apply_workqueue_attrs_locked()?

And also doing what you suggest in wq_update_unbound_numa() for hotplug
operations.
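
Roughly, that would amount to something like the sketch below (illustrative
only, the helper name is made up):

/* Keep the user-supplied mask in wq->unbound_attrs, hand the
 * effective one (user & wq_unbound_cpumask) to pwq creation. */
static void wq_compute_effective_attrs(struct workqueue_struct *wq,
                                       const struct workqueue_attrs *user,
                                       struct workqueue_attrs *effective)
{
        copy_workqueue_attrs(wq->unbound_attrs, user);
        copy_workqueue_attrs(effective, user);
        cpumask_and(effective->cpumask, user->cpumask, wq_unbound_cpumask);
}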
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 5/6] workqueue: Allow modifying low level unbound workqueue cpumask

2014-05-13 Thread Frederic Weisbecker
On Thu, May 08, 2014 at 09:22:51PM +0800, Lai Jiangshan wrote:
> On Thu, May 8, 2014 at 12:37 AM, Frederic Weisbecker  
> wrote:
> > Allow to modify the low-level unbound workqueues cpumask through
> > sysfs. This is performed by traversing the entire workqueue list
> > and calling apply_workqueue_attrs() on the unbound workqueues.
> >
> > Cc: Christoph Lameter 
> > Cc: Kevin Hilman 
> > Cc: Lai Jiangshan 
> > Cc: Mike Galbraith 
> > Cc: Paul E. McKenney 
> > Cc: Tejun Heo 
> > Cc: Viresh Kumar 
> > Signed-off-by: Frederic Weisbecker 
> > ---
> >  kernel/workqueue.c | 65 
> > --
> >  1 file changed, 63 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 2aa296d..5978cee 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -293,7 +293,7 @@ static DEFINE_SPINLOCK(wq_mayday_lock); /* protects 
> > wq->maydays list */
> >  static LIST_HEAD(workqueues);  /* PL: list of all workqueues */
> >  static bool workqueue_freezing;/* PL: have wqs started 
> > freezing? */
> >
> > -static cpumask_var_t wq_unbound_cpumask;
> > +static cpumask_var_t wq_unbound_cpumask; /* PL: low level cpumask for all 
> > unbound wqs */
> >
> >  /* the per-cpu worker pools */
> >  static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool 
> > [NR_STD_WORKER_POOLS],
> > @@ -4084,19 +4084,80 @@ static struct bus_type wq_subsys = {
> > .dev_groups = wq_sysfs_groups,
> >  };
> >
> > +static int unbounds_cpumask_apply(cpumask_var_t cpumask)
> > +{
> > +   struct workqueue_struct *wq;
> > +   int ret;
> > +
> > +   lockdep_assert_held(&wq_pool_mutex);
> > +
> > +   list_for_each_entry(wq, &workqueues, list) {
> > +   struct workqueue_attrs *attrs;
> > +
> > +   if (!(wq->flags & WQ_UNBOUND))
> > +   continue;
> > +
> > +   attrs = wq_sysfs_prep_attrs(wq);
> > +   if (!attrs)
> > +   return -ENOMEM;
> > +
> > +   ret = apply_workqueue_attrs_locked(wq, attrs, cpumask);
> > +   free_workqueue_attrs(attrs);
> > +   if (ret)
> > +   break;
> > +   }
> > +
> > +   return 0;
> > +}
> > +
> > +static ssize_t unbounds_cpumask_store(struct device *dev,
> > + struct device_attribute *attr,
> > + const char *buf, size_t count)
> > +{
> > +   cpumask_var_t cpumask;
> > +   int ret = -EINVAL;
> > +
> > +   if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
> > +   return -ENOMEM;
> > +
> > +   ret = cpumask_parse(buf, cpumask);
> > +   if (ret)
> > +   goto out;
> 
> 
>  cpumask_and(cpumask, cpumask, cpu_possible_mask);

Is it really useful? I mean in the end we only apply online bits.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Does perf support different length of user-space hw_breakpoint?

2014-05-13 Thread Frederic Weisbecker
On Tue, May 13, 2014 at 02:00:46PM +0200, Jiri Olsa wrote:
> On Mon, May 12, 2014 at 03:52:54PM +0800, xiakaixu wrote:
> > Hi guys,
> > 
> > Does perf support different length of user-space hw_breakpoint,
> > such as  HW_BREAKPOINT_LEN_1/HW_BREAKPOINT_LEN_2/HW_BREAKPOINT_LEN_4/
> > HW_BREAKPOINT_LEN_8?
> > 
> > Seems perf only support HW_BREAKPOINT_LEN_4/sizeof(long)
> > by default from the source code and simple test.
> 
> right..
> 
> /*
>  * We should find a nice way to override the access length
>  * Provide some defaults for now
>  */
> if (attr.bp_type == HW_BREAKPOINT_X)
> attr.bp_len = sizeof(long);
> else
> attr.bp_len = HW_BREAKPOINT_LEN_4;
> 
> > 
> > May I have your opinions if I want to trace different bytes of
> > hw_breakpoint addr?
> 
> I guess that depends on what you want to do ;-)

Ah, I have a patchset from Jacob Shin and Suravee Suthikulpanit that does
that. Also it has been hanging around for too long by my fault. I'm posting
it now.
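
In the meantime, a raw perf_event_open() caller can already ask for a specific
length by hand; a minimal user-space sketch (error handling omitted):

#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_write_bp(void *addr)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_BREAKPOINT;
        attr.size = sizeof(attr);
        attr.bp_type = HW_BREAKPOINT_W;         /* watch writes */
        attr.bp_addr = (unsigned long)addr;
        attr.bp_len  = HW_BREAKPOINT_LEN_1;     /* 1 byte instead of the LEN_4 default */

        /* monitor the current task on any CPU */
        return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}
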
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 1/2] smp: Print more useful debug info upon receiving IPI on an offline CPU

2014-05-13 Thread Frederic Weisbecker
On Mon, May 12, 2014 at 02:06:49AM +0530, Srivatsa S. Bhat wrote:
> Today the smp-call-function code just prints a warning if we get an IPI on
> an offline CPU. This info is sufficient to let us know that something went
> wrong, but often it is very hard to debug exactly who sent the IPI and why,
> from this info alone.
> 
> In most cases, we get the warning about the IPI to an offline CPU, immediately
> after the CPU going offline comes out of the stop-machine phase and reenables
> interrupts. Since all online CPUs participate in stop-machine, the information
> regarding the sender of the IPI is already lost by the time we exit the
> stop-machine loop. So even if we dump the stack on each CPU at this point,
> we won't find anything useful since all of them will show the stack-trace of
> the stopper thread. So we need a better way to figure out who sent the IPI and
> why.
> 
> To achieve this, when we detect an IPI targeted to an offline CPU, loop 
> through
> the call-single-data linked list and print out the payload (i.e., the name
> of the function which was supposed to be executed by the target CPU). This
> would give us an insight as to who might have sent the IPI and help us debug
> this further.
> 
> Signed-off-by: Srivatsa S. Bhat 
> ---
> 
>  kernel/smp.c |   18 ++
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 06d574e..f864921 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -185,14 +185,24 @@ void generic_smp_call_function_single_interrupt(void)
>  {
>   struct llist_node *entry;
>   struct call_single_data *csd, *csd_next;
> + static bool warned;
> +
> + entry = llist_del_all(&__get_cpu_var(call_single_queue));
> + entry = llist_reverse_order(entry);
>  
>   /*
>* Shouldn't receive this interrupt on a cpu that is not yet online.
>*/
> - WARN_ON_ONCE(!cpu_online(smp_processor_id()));
> -
> - entry = llist_del_all(&__get_cpu_var(call_single_queue));
> - entry = llist_reverse_order(entry);
> + if (unlikely(!cpu_online(smp_processor_id()) && !warned)) {
> + warned = true;
> + WARN_ON(1);

More details may be better:

WARN_ONCE(1, "IPI on offline CPU");

> + /*
> +  * We don't have to use the _safe() variant here
> +  * because we are not invoking the IPI handlers yet.
> +  */
> + llist_for_each_entry(csd, entry, llist)
> + pr_warn("SMP IPI Payload: %pS \n", csd->func);

Payload is kind of vague. How about "IPI func %pS sent on offline CPU".

> + }
>  
>   llist_for_each_entry_safe(csd, csd_next, entry, llist) {
>   csd->func(csd->info);
> 
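
Putting the two suggestions together, the check could end up looking like this
(illustrative rewording only):

        if (unlikely(!cpu_online(smp_processor_id()) && !warned)) {
                warned = true;
                WARN_ONCE(1, "IPI on offline CPU");
                /*
                 * No need for the _safe() variant here because we are
                 * not invoking the IPI handlers yet.
                 */
                llist_for_each_entry(csd, entry, llist)
                        pr_warn("IPI func %pS sent on offline CPU\n", csd->func);
        }
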
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v4 2/2] CPU hotplug, stop-machine: Plug race-window that leads to "IPI-to-offline-CPU"

2014-05-13 Thread Frederic Weisbecker
On Tue, May 13, 2014 at 02:32:00PM +0530, Srivatsa S. Bhat wrote:
> 
>  kernel/stop_machine.c |   39 ++-
>  1 file changed, 34 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index 01fbae5..288f7fe 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -130,8 +130,10 @@ enum multi_stop_state {
>   MULTI_STOP_NONE,
>   /* Awaiting everyone to be scheduled. */
>   MULTI_STOP_PREPARE,
> - /* Disable interrupts. */
> - MULTI_STOP_DISABLE_IRQ,
> + /* Disable interrupts on CPUs not in ->active_cpus mask. */
> + MULTI_STOP_DISABLE_IRQ_INACTIVE,
> + /* Disable interrupts on CPUs in ->active_cpus mask. */
> + MULTI_STOP_DISABLE_IRQ_ACTIVE,
>   /* Run the function */
>   MULTI_STOP_RUN,
>   /* Exit */
> @@ -189,12 +191,39 @@ static int multi_cpu_stop(void *data)
>   do {
>   /* Chill out and ensure we re-read multi_stop_state. */
>   cpu_relax();
> +
> + /*
> +  * We use 2 separate stages to disable interrupts, namely
> +  * _INACTIVE and _ACTIVE, to ensure that the inactive CPUs
> +  * disable their interrupts first, followed by the active CPUs.
> +  *
> +  * This is done to avoid a race in the CPU offline path, which
> +  * can lead to receiving IPIs on the outgoing CPU *after* it
> +  * has gone offline.
> +  *
> +  * During CPU offline, we don't want the other CPUs to send
> +  * IPIs to the active_cpu (the outgoing CPU) *after* it has
> +  * disabled interrupts (because, then it will notice the IPIs
> +  * only after it has gone offline). We can prevent this by
> +  * making the other CPUs disable their interrupts first - that
> +  * way, they will run the stop-machine code with interrupts
> +  * disabled, and hence won't send IPIs after that point.
> +  */
> +
>   if (msdata->state != curstate) {
>   curstate = msdata->state;
>   switch (curstate) {
> - case MULTI_STOP_DISABLE_IRQ:
> - local_irq_disable();
> - hard_irq_disable();
> + case MULTI_STOP_DISABLE_IRQ_INACTIVE:
> + if (!is_active) {
> + local_irq_disable();
> + hard_irq_disable();
> + }
> + break;
> + case MULTI_STOP_DISABLE_IRQ_ACTIVE:
> + if (is_active) {
> + local_irq_disable();
> + hard_irq_disable();

I have no idea about possible IPI latencies due to hardware. But are we sure
that a stop machine transition state is enough to make sure we get a pending
IPI? Shouldn't we have some sort of IPI flush in between, like polling on
call_single_queue?
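
To make that concrete, such a flush could look roughly like the sketch below;
smp_ipi_queue_empty() is a made-up helper, since call_single_queue is static
in kernel/smp.c and would need to be exposed somehow:

/* Hypothetical helper, would have to live in kernel/smp.c */
bool smp_ipi_queue_empty(int cpu)
{
        return llist_empty(&per_cpu(call_single_queue, cpu));
}

/* Called by the outgoing (active) CPU in multi_cpu_stop(), before it
 * hard-disables interrupts. The inactive CPUs already run with irqs
 * disabled at this stage, so no new single-function call can be queued;
 * we only wait for the ones already sent to be delivered and handled. */
static void flush_pending_smp_calls(void)
{
        while (!smp_ipi_queue_empty(smp_processor_id()))
                cpu_relax();
}
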
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 14/19] tick-sched: add comment about 'idle_active' in tick_nohz_idle_exit()

2014-04-21 Thread Frederic Weisbecker
On Mon, Apr 21, 2014 at 03:25:10PM +0530, Viresh Kumar wrote:
> The sequence of calls for dynticks CPUs is a bit confusing. Add a comment in
> tick_nohz_idle_exit() to mention it clearly. All information required is in
> commit and this conversation with Frederic.
> 
> https://lkml.org/lkml/2014/4/10/355
> 
> Suggested-by: Frederic Weisbecker 
> Signed-off-by: Viresh Kumar 
> ---
>  kernel/time/tick-sched.c | 16 
>  1 file changed, 16 insertions(+)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 71f64ee..c3aed50 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -922,6 +922,22 @@ void tick_nohz_idle_exit(void)
>  
>   ts->inidle = 0;
>  
> + /*
> +  * Can idle_active be false here?
> +  * Ideally this would be the sequence of calls:
> +  * - tick_nohz_idle_enter(), i.e. idle_active = true;
> +  * - local_irq_disable()
> +  * - IDLE
> +  * - wake up due to IPI or other interrupt
> +  * - local_irq_enable()
> +  * - tick_nohz_irq_enter(), i.e. idle_active = false;
> +  * - tick_nohz_irq_exit(), i.e. idle_active = true; This is not called
> +  *   in case of IPI's as need_resched() will prevent that in
> +  *   tick_irq_exit(), as we don't need to account any more for idle time
> +  *   or try to enter dyntics mode (We are going to exit idle state).
> +  *
> +  * - tick_nohz_idle_exit()
> +  */
>   if (ts->idle_active || ts->tick_stopped)
>   now = ktime_get();

It's still over-detailed. Much of the above is easily deduced after common
review. OTOH I proposed to summarize there: https://lkml.org/lkml/2014/4/11/334
The below disambiguates it a bit further.

Now it's eventually getting as big as your comment ;-)


	/*
	 * ts->idle_active drives the idle time which typically elapses in the idle loop
	 * but breaks on IRQs interrupting idle loop.
	 *
	 * Hence ts->idle_active can be 1 here if we exit the idle loop without the help of
	 * an IRQ. OTOH it can be 0 on idle exit if a wake up IPI pulled the CPU out of
	 * the idle loop. Since we know that we'll be exiting the idle task after the wake
	 * up IPI, all the pending idle sleep time is flushed on irq entry and no more is
	 * accounted further thanks to the need_resched() check on irq_exit().
	 */

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 09/11] watchdog: Simplify a little the IPI call

2014-02-24 Thread Frederic Weisbecker
In order to remotely restart the watchdog hrtimer, update_timers()
allocates a csd on the stack and passes it to __smp_call_function_single().

There is no particular need, however, for a specific csd here. Lets
simplify that a little by calling smp_call_function_single()
which can already take care of the csd allocation by itself.

Acked-by: Don Zickus 
Reviewed-by: Michal Hocko 
Cc: Andrew Morton 
Cc: Christoph Hellwig 
Cc: Don Zickus 
Cc: Ingo Molnar 
Cc: Jan Kara 
Cc: Jens Axboe 
Cc: Michal Hocko 
Cc: Srivatsa S. Bhat 
Signed-off-by: Frederic Weisbecker 
---
 kernel/watchdog.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 4431610..01c6f97 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -505,7 +505,6 @@ static void restart_watchdog_hrtimer(void *info)
 
 static void update_timers(int cpu)
 {
-   struct call_single_data data = {.func = restart_watchdog_hrtimer};
/*
 * Make sure that perf event counter will adopt to a new
 * sampling period. Updating the sampling period directly would
@@ -515,7 +514,7 @@ static void update_timers(int cpu)
 * might be late already so we have to restart the timer as well.
 */
watchdog_nmi_disable(cpu);
-   __smp_call_function_single(cpu, &data, 1);
+   smp_call_function_single(cpu, restart_watchdog_hrtimer, NULL, 1);
watchdog_nmi_enable(cpu);
 }
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 03/11] block: Stop abusing rq->csd.list in blk-softirq

2014-02-24 Thread Frederic Weisbecker
From: Jan Kara 

Abusing rq->csd.list for a list of requests to complete is rather ugly.
We use rq->queuelist instead which is much cleaner. It is safe because
queuelist is used by the block layer only for requests waiting to be
submitted to a device. Thus it is unused when irq reports the request IO
is finished.

Signed-off-by: Jan Kara 
Cc: Andrew Morton 
Cc: Christoph Hellwig 
Cc: Ingo Molnar 
Cc: Jens Axboe 
Signed-off-by: Frederic Weisbecker 
---
 block/blk-softirq.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index 57790c1..b5c37d9 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -30,8 +30,8 @@ static void blk_done_softirq(struct softirq_action *h)
while (!list_empty(&local_list)) {
struct request *rq;
 
-   rq = list_entry(local_list.next, struct request, csd.list);
-   list_del_init(&rq->csd.list);
+   rq = list_entry(local_list.next, struct request, queuelist);
+   list_del_init(&rq->queuelist);
rq->q->softirq_done_fn(rq);
}
 }
@@ -45,9 +45,14 @@ static void trigger_softirq(void *data)
 
local_irq_save(flags);
list = this_cpu_ptr(&blk_cpu_done);
-   list_add_tail(&rq->csd.list, list);
+   /*
+* We reuse queuelist for a list of requests to process. Since the
+* queuelist is used by the block layer only for requests waiting to be
+* submitted to the device it is unused now.
+*/
+   list_add_tail(&rq->queuelist, list);
 
-   if (list->next == &rq->csd.list)
+   if (list->next == &rq->queuelist)
raise_softirq_irqoff(BLOCK_SOFTIRQ);
 
local_irq_restore(flags);
@@ -136,7 +141,7 @@ void __blk_complete_request(struct request *req)
struct list_head *list;
 do_local:
list = this_cpu_ptr(&blk_cpu_done);
-   list_add_tail(&req->csd.list, list);
+   list_add_tail(&req->queuelist, list);
 
/*
 * if the list only contains our just added request,
@@ -144,7 +149,7 @@ do_local:
 * entries there, someone already raised the irq but it
 * hasn't run yet.
 */
-   if (list->next == &req->csd.list)
+   if (list->next == &req->queuelist)
raise_softirq_irqoff(BLOCK_SOFTIRQ);
} else if (raise_blk_irq(ccpu, req))
goto do_local;
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 07/11] smp: Consolidate the various smp_call_function_single() declensions

2014-02-24 Thread Frederic Weisbecker
__smp_call_function_single() and smp_call_function_single() share some
code that can be factorized: execute inline when the target is local,
check if the target is online, lock the csd, call generic_exec_single().

Lets move the common parts to generic_exec_single().

Reviewed-by: Jan Kara 
Cc: Andrew Morton 
Cc: Christoph Hellwig 
Cc: Ingo Molnar 
Cc: Jan Kara 
Cc: Jens Axboe 
Signed-off-by: Frederic Weisbecker 
---
 kernel/smp.c | 80 +---
 1 file changed, 39 insertions(+), 41 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 5ff14e3..64bb0d4 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -117,13 +117,43 @@ static void csd_unlock(struct call_single_data *csd)
csd->flags &= ~CSD_FLAG_LOCK;
 }
 
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
+
 /*
  * Insert a previously allocated call_single_data element
  * for execution on the given CPU. data must already have
  * ->func, ->info, and ->flags set.
  */
-static void generic_exec_single(int cpu, struct call_single_data *csd, int wait)
+static int generic_exec_single(int cpu, struct call_single_data *csd,
+  smp_call_func_t func, void *info, int wait)
 {
+   struct call_single_data csd_stack = { .flags = 0 };
+   unsigned long flags;
+
+
+   if (cpu == smp_processor_id()) {
+   local_irq_save(flags);
+   func(info);
+   local_irq_restore(flags);
+   return 0;
+   }
+
+
+   if ((unsigned)cpu >= nr_cpu_ids || !cpu_online(cpu))
+   return -ENXIO;
+
+
+   if (!csd) {
+   csd = &csd_stack;
+   if (!wait)
+   csd = &__get_cpu_var(csd_data);
+   }
+
+   csd_lock(csd);
+
+   csd->func = func;
+   csd->info = info;
+
if (wait)
csd->flags |= CSD_FLAG_WAIT;
 
@@ -143,6 +173,8 @@ static void generic_exec_single(int cpu, struct call_single_data *csd, int wait)
 
if (wait)
csd_lock_wait(csd);
+
+   return 0;
 }
 
 /*
@@ -168,8 +200,6 @@ void generic_smp_call_function_single_interrupt(void)
}
 }
 
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
-
 /*
  * smp_call_function_single - Run a function on a specific CPU
  * @func: The function to run. This must be fast and non-blocking.
@@ -181,12 +211,8 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
 int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
 int wait)
 {
-   struct call_single_data d = {
-   .flags = 0,
-   };
-   unsigned long flags;
int this_cpu;
-   int err = 0;
+   int err;
 
/*
 * prevent preemption and reschedule on another processor,
@@ -203,26 +229,7 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
 && !oops_in_progress);
 
-   if (cpu == this_cpu) {
-   local_irq_save(flags);
-   func(info);
-   local_irq_restore(flags);
-   } else {
-   if ((unsigned)cpu < nr_cpu_ids && cpu_online(cpu)) {
-   struct call_single_data *csd = &d;
-
-   if (!wait)
-   csd = &__get_cpu_var(csd_data);
-
-   csd_lock(csd);
-
-   csd->func = func;
-   csd->info = info;
-   generic_exec_single(cpu, csd, wait);
-   } else {
-   err = -ENXIO;   /* CPU not online */
-   }
-   }
+   err = generic_exec_single(cpu, NULL, func, info, wait);
 
put_cpu();
 
@@ -285,9 +292,8 @@ EXPORT_SYMBOL_GPL(smp_call_function_any);
  */
 int __smp_call_function_single(int cpu, struct call_single_data *csd, int wait)
 {
-   unsigned int this_cpu;
-   unsigned long flags;
int err = 0;
+   int this_cpu;
 
this_cpu = get_cpu();
/*
@@ -296,20 +302,12 @@ int __smp_call_function_single(int cpu, struct call_single_data *csd, int wait)
 * send smp call function interrupt to this cpu and as such deadlocks
 * can't happen.
 */
-   WARN_ON_ONCE(cpu_online(smp_processor_id()) && wait && irqs_disabled()
+   WARN_ON_ONCE(cpu_online(this_cpu) && wait && irqs_disabled()
 && !oops_in_progress);
 
-   if (cpu == this_cpu) {
-   local_irq_save(flags);
-   csd->func(csd->info);
-   local_irq_restore(flags);
-   } else if ((unsigned)cpu < nr_cpu_ids && cpu_online(cpu)) {
-   csd_lock(csd);
-   generic_exec_single(cpu, csd, wait);

[PATCH 04/11] smp: Iterate functions through llist_for_each_entry_safe()

2014-02-24 Thread Frederic Weisbecker
From: Jan Kara 

The IPI function llist iteration is open-coded. Let's simplify this
by using an llist iterator.

Also we want to keep the iteration safe against possible
csd.llist->next value reuse from the IPI handler. At least the block
subsystem used to do such things, so let's stay careful and use
llist_for_each_entry_safe().

Signed-off-by: Jan Kara 
Cc: Andrew Morton 
Cc: Christoph Hellwig 
Cc: Ingo Molnar 
Cc: Jens Axboe 
Signed-off-by: Frederic Weisbecker 
---
 kernel/smp.c | 12 +++-
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index ffee35b..e3852de 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -151,7 +151,8 @@ static void generic_exec_single(int cpu, struct 
call_single_data *csd, int wait)
  */
 void generic_smp_call_function_single_interrupt(void)
 {
-   struct llist_node *entry, *next;
+   struct llist_node *entry;
+   struct call_single_data *csd, *csd_next;
 
/*
 * Shouldn't receive this interrupt on a cpu that is not yet online.
@@ -161,16 +162,9 @@ void generic_smp_call_function_single_interrupt(void)
entry = llist_del_all(&__get_cpu_var(call_single_queue));
entry = llist_reverse_order(entry);
 
-   while (entry) {
-   struct call_single_data *csd;
-
-   next = entry->next;
-
-   csd = llist_entry(entry, struct call_single_data, llist);
+   llist_for_each_entry_safe(csd, csd_next, entry, llist) {
csd->func(csd->info);
csd_unlock(csd);
-
-   entry = next;
}
 }
 
-- 
1.8.3.1



[PATCH 00/11] smp: Single IPI cleanups v2

2014-02-24 Thread Frederic Weisbecker
Hi,

This version includes:

* Rename __smp_call_function_single() to smp_call_function_single_async() as
  suggested by Christoph.
  
* Acks and reviewed-by added.

* Rebase against -rc4

Thanks.

---
Frederic Weisbecker (6):
  block: Remove useless IPI struct initialization
  smp: Consolidate the various smp_call_function_single() declensions
  smp: Move __smp_call_function_single() below its safe version
  watchdog: Simplify a little the IPI call
  smp: Remove wait argument from __smp_call_function_single()
  smp: Rename __smp_call_function_single() to
smp_call_function_single_async()

Jan Kara (5):
  block: Stop abusing csd.list for fifo_time
  block: Stop abusing rq->csd.list in blk-softirq
  smp: Iterate functions through llist_for_each_entry_safe()
  smp: Remove unused list_head from csd
  smp: Teach __smp_call_function_single() to check for offline cpus

 block/blk-mq.c|   2 +-
 block/blk-softirq.c   |  19 ---
 block/cfq-iosched.c   |   8 +--
 block/deadline-iosched.c  |   8 +--
 drivers/cpuidle/coupled.c |   2 +-
 include/linux/blkdev.h|   1 +
 include/linux/elevator.h  |  11 +---
 include/linux/smp.h   |   8 +--
 kernel/sched/core.c   |   2 +-
 kernel/smp.c  | 139 ++
 kernel/up.c   |   6 +-
 kernel/watchdog.c |   3 +-
 net/core/dev.c|   4 +-
 13 files changed, 98 insertions(+), 115 deletions(-)

-- 
1.8.3.1



[PATCH 01/11] block: Stop abusing csd.list for fifo_time

2014-02-24 Thread Frederic Weisbecker
From: Jan Kara 

The block layer currently abuses rq->csd.list.next for storing fifo_time.
That is a terrible hack and completely unnecessary as well. A union
achieves the same space saving in a cleaner way.

Signed-off-by: Jan Kara 
Cc: Andrew Morton 
Cc: Christoph Hellwig 
Cc: Ingo Molnar 
Cc: Jens Axboe 
Signed-off-by: Frederic Weisbecker 
---
 block/cfq-iosched.c  | 8 
 block/deadline-iosched.c | 8 
 include/linux/blkdev.h   | 1 +
 include/linux/elevator.h | 6 --
 4 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 744833b..5873e4a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2367,10 +2367,10 @@ cfq_merged_requests(struct request_queue *q, struct 
request *rq,
 * reposition in fifo if next is older than rq
 */
if (!list_empty(&rq->queuelist) && !list_empty(&next->queuelist) &&
-   time_before(rq_fifo_time(next), rq_fifo_time(rq)) &&
+   time_before(next->fifo_time, rq->fifo_time) &&
cfqq == RQ_CFQQ(next)) {
list_move(&rq->queuelist, &next->queuelist);
-   rq_set_fifo_time(rq, rq_fifo_time(next));
+   rq->fifo_time = next->fifo_time;
}
 
if (cfqq->next_rq == next)
@@ -2814,7 +2814,7 @@ static struct request *cfq_check_fifo(struct cfq_queue 
*cfqq)
return NULL;
 
rq = rq_entry_fifo(cfqq->fifo.next);
-   if (time_before(jiffies, rq_fifo_time(rq)))
+   if (time_before(jiffies, rq->fifo_time))
rq = NULL;
 
cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
@@ -3927,7 +3927,7 @@ static void cfq_insert_request(struct request_queue *q, 
struct request *rq)
cfq_log_cfqq(cfqd, cfqq, "insert_request");
cfq_init_prio_data(cfqq, RQ_CIC(rq));
 
-   rq_set_fifo_time(rq, jiffies + cfqd->cfq_fifo_expire[rq_is_sync(rq)]);
+   rq->fifo_time = jiffies + cfqd->cfq_fifo_expire[rq_is_sync(rq)];
list_add_tail(&rq->queuelist, &cfqq->fifo);
cfq_add_rq_rb(rq);
cfqg_stats_update_io_add(RQ_CFQG(rq), cfqd->serving_group,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 9ef6640..a753df2 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -106,7 +106,7 @@ deadline_add_request(struct request_queue *q, struct 
request *rq)
/*
 * set expire time and add to fifo list
 */
-   rq_set_fifo_time(rq, jiffies + dd->fifo_expire[data_dir]);
+   rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
 }
 
@@ -174,9 +174,9 @@ deadline_merged_requests(struct request_queue *q, struct 
request *req,
 * and move into next position (next will be deleted) in fifo
 */
if (!list_empty(&req->queuelist) && !list_empty(&next->queuelist)) {
-   if (time_before(rq_fifo_time(next), rq_fifo_time(req))) {
+   if (time_before(next->fifo_time, req->fifo_time)) {
list_move(&req->queuelist, &next->queuelist);
-   rq_set_fifo_time(req, rq_fifo_time(next));
+   req->fifo_time = next->fifo_time;
}
}
 
@@ -230,7 +230,7 @@ static inline int deadline_check_fifo(struct deadline_data 
*dd, int ddir)
/*
 * rq is expired!
 */
-   if (time_after_eq(jiffies, rq_fifo_time(rq)))
+   if (time_after_eq(jiffies, rq->fifo_time))
return 1;
 
return 0;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4afa4f8..1e1fa3f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -99,6 +99,7 @@ struct request {
union {
struct call_single_data csd;
struct work_struct mq_flush_work;
+   unsigned long fifo_time;
};
 
struct request_queue *q;
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 306dd8c..0bdfd46 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -202,12 +202,6 @@ enum {
 #define rq_end_sector(rq)  (blk_rq_pos(rq) + blk_rq_sectors(rq))
 #define rb_entry_rq(node)  rb_entry((node), struct request, rb_node)
 
-/*
- * Hack to reuse the csd.list list_head as the fifo time holder while
- * the request is in the io scheduler. Saves an unsigned long in rq.
- */
-#define rq_fifo_time(rq)   ((unsigned long) (rq)->csd.list.next)
-#define rq_set_fifo_time(rq,exp)   ((rq)->csd.list.next = (void *) (exp))
 #define rq_entry_fifo(ptr) list_entry((ptr), struct request, queuelist)
 #define rq_fifo_clear(rq)  do {\
list_del_init(&(rq)->queuelist);\
-- 
1.8.3.1


[PATCH 02/11] block: Remove useless IPI struct initialization

2014-02-24 Thread Frederic Weisbecker
rq_fifo_clear() resets the csd.list through INIT_LIST_HEAD() for no clear
purpose. The csd.list doesn't need to be initialized as a list head
because it's only ever used as a list node.

Let's remove this useless initialization.
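
To illustrate the head vs node distinction (standalone sketch, made-up names,
not part of this patch):

#include <linux/list.h>

struct item {
	struct list_head node;		/* used purely as a list node */
};

static LIST_HEAD(my_list);		/* only the head needs initializing */

static void add_item(struct item *it)
{
	/*
	 * No INIT_LIST_HEAD(&it->node) needed here: list_add_tail()
	 * overwrites both pointers of the node anyway.
	 */
	list_add_tail(&it->node, &my_list);
}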

Reviewed-by: Jan Kara 
Cc: Andrew Morton 
Cc: Christoph Hellwig 
Cc: Ingo Molnar 
Cc: Jan Kara 
Cc: Jens Axboe 
Signed-off-by: Frederic Weisbecker 
---
 include/linux/elevator.h | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 0bdfd46..df63bd3 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -203,10 +203,7 @@ enum {
 #define rb_entry_rq(node)  rb_entry((node), struct request, rb_node)
 
 #define rq_entry_fifo(ptr) list_entry((ptr), struct request, queuelist)
-#define rq_fifo_clear(rq)  do {\
-   list_del_init(&(rq)->queuelist);\
-   INIT_LIST_HEAD(&(rq)->csd.list);\
-   } while (0)
+#define rq_fifo_clear(rq)  list_del_init(&(rq)->queuelist)
 
 #else /* CONFIG_BLOCK */
 
-- 
1.8.3.1



[PATCH 05/11] smp: Remove unused list_head from csd

2014-02-24 Thread Frederic Weisbecker
From: Jan Kara 

Now that we got rid of all the remaining code which fiddled with csd.list,
let's remove it.

Signed-off-by: Jan Kara 
Cc: Andrew Morton 
Cc: Christoph Hellwig 
Cc: Ingo Molnar 
Cc: Jens Axboe 
Signed-off-by: Frederic Weisbecker 
---
 include/linux/smp.h | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 6ae004e..4991c6b 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -17,10 +17,7 @@ extern void cpu_idle(void);
 
 typedef void (*smp_call_func_t)(void *info);
 struct call_single_data {
-   union {
-   struct list_head list;
-   struct llist_node llist;
-   };
+   struct llist_node llist;
smp_call_func_t func;
void *info;
u16 flags;
-- 
1.8.3.1



[PATCH 08/11] smp: Move __smp_call_function_single() below its safe version

2014-02-24 Thread Frederic Weisbecker
Move __smp_call_function_single() closer to smp_call_function_single(). These
functions have very similar behavior and should be displayed in the same block
for clarity.

Reviewed-by: Jan Kara 
Cc: Andrew Morton 
Cc: Christoph Hellwig 
Cc: Ingo Molnar 
Cc: Jan Kara 
Cc: Jens Axboe 
Signed-off-by: Frederic Weisbecker 
---
 kernel/smp.c | 64 ++--
 1 file changed, 32 insertions(+), 32 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 64bb0d4..fa04ab9 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -237,6 +237,38 @@ int smp_call_function_single(int cpu, smp_call_func_t 
func, void *info,
 }
 EXPORT_SYMBOL(smp_call_function_single);
 
+/**
+ * __smp_call_function_single(): Run a function on a specific CPU
+ * @cpu: The CPU to run on.
+ * @csd: Pre-allocated and setup data structure
+ * @wait: If true, wait until function has completed on specified CPU.
+ *
+ * Like smp_call_function_single(), but allow caller to pass in a
+ * pre-allocated data structure. Useful for embedding @data inside
+ * other structures, for instance.
+ */
+int __smp_call_function_single(int cpu, struct call_single_data *csd, int wait)
+{
+   int err = 0;
+   int this_cpu;
+
+   this_cpu = get_cpu();
+   /*
+* Can deadlock when called with interrupts disabled.
+* We allow cpu's that are not yet online though, as no one else can
+* send smp call function interrupt to this cpu and as such deadlocks
+* can't happen.
+*/
+   WARN_ON_ONCE(cpu_online(this_cpu) && wait && irqs_disabled()
+&& !oops_in_progress);
+
+   err = generic_exec_single(cpu, csd, csd->func, csd->info, wait);
+   put_cpu();
+
+   return err;
+}
+EXPORT_SYMBOL_GPL(__smp_call_function_single);
+
 /*
  * smp_call_function_any - Run a function on any of the given cpus
  * @mask: The mask of cpus it can run on.
@@ -281,38 +313,6 @@ call:
 EXPORT_SYMBOL_GPL(smp_call_function_any);
 
 /**
- * __smp_call_function_single(): Run a function on a specific CPU
- * @cpu: The CPU to run on.
- * @csd: Pre-allocated and setup data structure
- * @wait: If true, wait until function has completed on specified CPU.
- *
- * Like smp_call_function_single(), but allow caller to pass in a
- * pre-allocated data structure. Useful for embedding @data inside
- * other structures, for instance.
- */
-int __smp_call_function_single(int cpu, struct call_single_data *csd, int wait)
-{
-   int err = 0;
-   int this_cpu;
-
-   this_cpu = get_cpu();
-   /*
-* Can deadlock when called with interrupts disabled.
-* We allow cpu's that are not yet online though, as no one else can
-* send smp call function interrupt to this cpu and as such deadlocks
-* can't happen.
-*/
-   WARN_ON_ONCE(cpu_online(this_cpu) && wait && irqs_disabled()
-&& !oops_in_progress);
-
-   err = generic_exec_single(cpu, csd, csd->func, csd->info, wait);
-   put_cpu();
-
-   return err;
-}
-EXPORT_SYMBOL_GPL(__smp_call_function_single);
-
-/**
  * smp_call_function_many(): Run a function on a set of other CPUs.
  * @mask: The set of cpus to run on (only runs on online subset).
  * @func: The function to run. This must be fast and non-blocking.
-- 
1.8.3.1



[PATCH 06/11] smp: Teach __smp_call_function_single() to check for offline cpus

2014-02-24 Thread Frederic Weisbecker
From: Jan Kara 

Align __smp_call_function_single() with smp_call_function_single() so
that it also checks whether requested cpu is still online.

Signed-off-by: Jan Kara 
Cc: Andrew Morton 
Cc: Christoph Hellwig 
Cc: Ingo Molnar 
Cc: Jens Axboe 
Signed-off-by: Frederic Weisbecker 
---
 include/linux/smp.h |  3 +--
 kernel/smp.c| 11 +++
 kernel/up.c |  5 +++--
 3 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 4991c6b..c39074c 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -50,8 +50,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
smp_call_func_t func, void *info, bool wait,
gfp_t gfp_flags);
 
-void __smp_call_function_single(int cpuid, struct call_single_data *data,
-   int wait);
+int __smp_call_function_single(int cpu, struct call_single_data *csd, int wait);
 
 #ifdef CONFIG_SMP
 
diff --git a/kernel/smp.c b/kernel/smp.c
index e3852de..5ff14e3 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -276,18 +276,18 @@ EXPORT_SYMBOL_GPL(smp_call_function_any);
 /**
  * __smp_call_function_single(): Run a function on a specific CPU
  * @cpu: The CPU to run on.
- * @data: Pre-allocated and setup data structure
+ * @csd: Pre-allocated and setup data structure
  * @wait: If true, wait until function has completed on specified CPU.
  *
  * Like smp_call_function_single(), but allow caller to pass in a
  * pre-allocated data structure. Useful for embedding @data inside
  * other structures, for instance.
  */
-void __smp_call_function_single(int cpu, struct call_single_data *csd,
-   int wait)
+int __smp_call_function_single(int cpu, struct call_single_data *csd, int wait)
 {
unsigned int this_cpu;
unsigned long flags;
+   int err = 0;
 
this_cpu = get_cpu();
/*
@@ -303,11 +303,14 @@ void __smp_call_function_single(int cpu, struct 
call_single_data *csd,
local_irq_save(flags);
csd->func(csd->info);
local_irq_restore(flags);
-   } else {
+   } else if ((unsigned)cpu < nr_cpu_ids && cpu_online(cpu)) {
csd_lock(csd);
generic_exec_single(cpu, csd, wait);
+   } else {
+   err = -ENXIO;   /* CPU not online */
}
put_cpu();
+   return err;
 }
 EXPORT_SYMBOL_GPL(__smp_call_function_single);
 
diff --git a/kernel/up.c b/kernel/up.c
index 509403e..cdf03d1 100644
--- a/kernel/up.c
+++ b/kernel/up.c
@@ -22,14 +22,15 @@ int smp_call_function_single(int cpu, void (*func) (void 
*info), void *info,
 }
 EXPORT_SYMBOL(smp_call_function_single);
 
-void __smp_call_function_single(int cpu, struct call_single_data *csd,
-   int wait)
+int __smp_call_function_single(int cpu, struct call_single_data *csd,
+  int wait)
 {
unsigned long flags;
 
local_irq_save(flags);
csd->func(csd->info);
local_irq_restore(flags);
+   return 0;
 }
 EXPORT_SYMBOL(__smp_call_function_single);
 
-- 
1.8.3.1



[PATCH 11/11] smp: Rename __smp_call_function_single() to smp_call_function_single_async()

2014-02-24 Thread Frederic Weisbecker
The name __smp_call_function_single() doesn't tell much about the
properties of this function, especially when compared to
smp_call_function_single().

The comments above the implementation are also misleading. The main
point of this function is actually not to be able to embed the csd
in an object. That is actually a requirement that results from the
purpose of this function, which is to raise an IPI asynchronously.

As such it can be called with interrupts disabled. And this feature
comes at the cost of the caller, who then needs to serialize the
IPIs on this csd.

Let's rename the function and enhance the comments so that they reflect
these properties.
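
For instance, a typical caller looks roughly like the sketch below (made-up
names, not from this patch; the pending flag is the caller's own serialization,
in the spirit of what the scheduler does with hrtick_csd_pending):

#include <linux/smp.h>
#include <linux/atomic.h>

struct my_poke {				/* illustrative type */
	struct call_single_data csd;
	atomic_t pending;			/* caller-side IPI serialization */
};

static void my_poke_func(void *info)
{
	struct my_poke *p = info;

	/* ... work running on the target CPU, in IPI context ... */
	atomic_set(&p->pending, 0);		/* allow the next IPI on this csd */
}

static void my_poke_init(struct my_poke *p)
{
	p->csd.func = my_poke_func;
	p->csd.info = p;
	p->csd.flags = 0;
	atomic_set(&p->pending, 0);
}

static void my_poke_cpu(int cpu, struct my_poke *p)
{
	/* Never fire a new IPI on a csd whose previous one isn't handled yet */
	if (!atomic_xchg(&p->pending, 1))
		smp_call_function_single_async(cpu, &p->csd);
}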

Suggested-by: Christoph Hellwig 
Cc: Andrew Morton 
Cc: Christoph Hellwig 
Cc: Ingo Molnar 
Cc: Jan Kara 
Cc: Jens Axboe 
Signed-off-by: Frederic Weisbecker 
---
 block/blk-mq.c|  2 +-
 block/blk-softirq.c   |  2 +-
 drivers/cpuidle/coupled.c |  2 +-
 include/linux/smp.h   |  2 +-
 kernel/sched/core.c   |  2 +-
 kernel/smp.c  | 19 +--
 kernel/up.c   |  4 ++--
 net/core/dev.c|  2 +-
 8 files changed, 21 insertions(+), 14 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 62154ed..6468a71 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -353,7 +353,7 @@ void __blk_mq_complete_request(struct request *rq)
rq->csd.func = __blk_mq_complete_request_remote;
rq->csd.info = rq;
rq->csd.flags = 0;
-   __smp_call_function_single(ctx->cpu, &rq->csd);
+   smp_call_function_single_async(ctx->cpu, &rq->csd);
} else {
rq->q->softirq_done_fn(rq);
}
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index 6345b7e..ebd6b6f 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -70,7 +70,7 @@ static int raise_blk_irq(int cpu, struct request *rq)
data->info = rq;
data->flags = 0;
 
-   __smp_call_function_single(cpu, data);
+   smp_call_function_single_async(cpu, data);
return 0;
}
 
diff --git a/drivers/cpuidle/coupled.c b/drivers/cpuidle/coupled.c
index 0411594..cb6654b 100644
--- a/drivers/cpuidle/coupled.c
+++ b/drivers/cpuidle/coupled.c
@@ -323,7 +323,7 @@ static void cpuidle_coupled_poke(int cpu)
struct call_single_data *csd = &per_cpu(cpuidle_coupled_poke_cb, cpu);
 
if (!cpumask_test_and_set_cpu(cpu, &cpuidle_coupled_poke_pending))
-   __smp_call_function_single(cpu, csd);
+   smp_call_function_single_async(cpu, csd);
 }
 
 /**
diff --git a/include/linux/smp.h b/include/linux/smp.h
index b410a1f..633f5ed 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -50,7 +50,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
smp_call_func_t func, void *info, bool wait,
gfp_t gfp_flags);
 
-int __smp_call_function_single(int cpu, struct call_single_data *csd);
+int smp_call_function_single_async(int cpu, struct call_single_data *csd);
 
 #ifdef CONFIG_SMP
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 94e51f7..3fbaeba 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -432,7 +432,7 @@ void hrtick_start(struct rq *rq, u64 delay)
if (rq == this_rq()) {
__hrtick_restart(rq);
} else if (!rq->hrtick_csd_pending) {
-   __smp_call_function_single(cpu_of(rq), &rq->hrtick_csd);
+   smp_call_function_single_async(cpu_of(rq), &rq->hrtick_csd);
rq->hrtick_csd_pending = 1;
}
 }
diff --git a/kernel/smp.c b/kernel/smp.c
index b767631..06d574e 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -238,15 +238,22 @@ int smp_call_function_single(int cpu, smp_call_func_t 
func, void *info,
 EXPORT_SYMBOL(smp_call_function_single);
 
 /**
- * __smp_call_function_single(): Run a function on a specific CPU
+ * smp_call_function_single_async(): Run an asynchronous function on a
+ *  specific CPU.
  * @cpu: The CPU to run on.
  * @csd: Pre-allocated and setup data structure
  *
- * Like smp_call_function_single(), but allow caller to pass in a
- * pre-allocated data structure. Useful for embedding @data inside
- * other structures, for instance.
+ * Like smp_call_function_single(), but the call is asynchronous and
+ * can thus be done from contexts with disabled interrupts.
+ *
+ * The caller passes his own pre-allocated data structure
+ * (ie: embedded in an object) and is responsible for synchronizing it
+ * such that the IPIs performed on the @csd are strictly serialized.
+ *
+ * NOTE: Be careful, there is unfortunately no current debugging facility to
+ * validate the correctness of this serialization.
  */
-int __smp_call_function_single(int cpu, struct call_single_data *csd)
+int smp_call_function_single_async(int cpu

[PATCH 10/11] smp: Remove wait argument from __smp_call_function_single()

2014-02-24 Thread Frederic Weisbecker
The main point of calling __smp_call_function_single() is to send
an IPI in a pure asynchronous way. By embedding a csd in an object,
a caller can send the IPI without waiting for a previous one to complete
as is required by smp_call_function_single() for example. As such,
sending this kind of IPI can be safe even when irqs are disabled.

This flexibility comes at the expense of the caller, who then needs to
synchronize the csd lifecycle by itself and make sure that IPIs on a
single csd are serialized.

This is how __smp_call_function_single() works when wait = 0, and this
use case is relevant.

Now there doesn't seem to be any use case with wait = 1 that can't be
covered by smp_call_function_single() instead, which is safer. Let's look
at the two possible scenarios:

1) The user calls __smp_call_function_single(wait = 1) on a csd embedded
   in an object. It looks like a nice and convenient pattern at the first
   sight because we can then retrieve the object from the IPI handler easily.

   But actually it is a waste of memory space in the object since the csd
   can be allocated from the stack by smp_call_function_single(wait = 1)
   and the object can be passed as the IPI argument.

   Besides that, embedding the csd in an object is more error prone
   because the caller must take care of the serialization of the IPIs
   for this csd.

2) The user calls __smp_call_function_single(wait = 1) on a csd that
   is allocated on the stack. It's ok but smp_call_function_single()
   can do it as well and it already takes care of the allocation on the
   stack. Again it's more simple and less error prone.

Therefore, using the underscore-prefixed API version with wait = 1
is a bad pattern and a sign that the caller can do something safer and
simpler.

There was a single user of that, which has just been converted.
So let's remove this option to discourage further users.
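
For example, scenario 1 then simply becomes something like this rough sketch
(made-up names):

#include <linux/smp.h>

struct my_object;			/* illustrative, not a real kernel type */

static void my_handler(void *info)
{
	/* info is the my_object pointer passed below; runs on the target CPU */
}

static int my_sync_kick(int cpu, struct my_object *obj)
{
	/* The csd lives on the stack inside smp_call_function_single() */
	return smp_call_function_single(cpu, my_handler, obj, 1);
}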

Cc: Andrew Morton 
Cc: Christoph Hellwig 
Cc: Ingo Molnar 
Cc: Jan Kara 
Cc: Jens Axboe 
Signed-off-by: Frederic Weisbecker 
---
 block/blk-mq.c|  2 +-
 block/blk-softirq.c   |  2 +-
 drivers/cpuidle/coupled.c |  2 +-
 include/linux/smp.h   |  2 +-
 kernel/sched/core.c   |  2 +-
 kernel/smp.c  | 19 ---
 kernel/up.c   |  3 +--
 net/core/dev.c|  2 +-
 8 files changed, 11 insertions(+), 23 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1fa9dd1..62154ed 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -353,7 +353,7 @@ void __blk_mq_complete_request(struct request *rq)
rq->csd.func = __blk_mq_complete_request_remote;
rq->csd.info = rq;
rq->csd.flags = 0;
-   __smp_call_function_single(ctx->cpu, &rq->csd, 0);
+   __smp_call_function_single(ctx->cpu, &rq->csd);
} else {
rq->q->softirq_done_fn(rq);
}
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index b5c37d9..6345b7e 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -70,7 +70,7 @@ static int raise_blk_irq(int cpu, struct request *rq)
data->info = rq;
data->flags = 0;
 
-   __smp_call_function_single(cpu, data, 0);
+   __smp_call_function_single(cpu, data);
return 0;
}
 
diff --git a/drivers/cpuidle/coupled.c b/drivers/cpuidle/coupled.c
index e952936..0411594 100644
--- a/drivers/cpuidle/coupled.c
+++ b/drivers/cpuidle/coupled.c
@@ -323,7 +323,7 @@ static void cpuidle_coupled_poke(int cpu)
struct call_single_data *csd = &per_cpu(cpuidle_coupled_poke_cb, cpu);
 
if (!cpumask_test_and_set_cpu(cpu, &cpuidle_coupled_poke_pending))
-   __smp_call_function_single(cpu, csd, 0);
+   __smp_call_function_single(cpu, csd);
 }
 
 /**
diff --git a/include/linux/smp.h b/include/linux/smp.h
index c39074c..b410a1f 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -50,7 +50,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
smp_call_func_t func, void *info, bool wait,
gfp_t gfp_flags);
 
-int __smp_call_function_single(int cpu, struct call_single_data *csd, int wait);
+int __smp_call_function_single(int cpu, struct call_single_data *csd);
 
 #ifdef CONFIG_SMP
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6edbef2..94e51f7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -432,7 +432,7 @@ void hrtick_start(struct rq *rq, u64 delay)
if (rq == this_rq()) {
__hrtick_restart(rq);
} else if (!rq->hrtick_csd_pending) {
-   __smp_call_function_single(cpu_of(rq), &rq->hrtick_csd, 0);
+   __smp_call_function_single(cpu_of(rq), &rq->hrtick_csd);
rq->hrtick_csd_pending = 1;
}
 }
diff --git a/kernel/smp.c b/kernel/smp.c
index fa04ab9..b767631 100644
--- a/ker

Re: [PATCH 00/11] smp: Single IPI cleanups v2

2014-02-24 Thread Frederic Weisbecker
On Mon, Feb 24, 2014 at 09:20:53AM -0800, Jens Axboe wrote:
> On 2014-02-24 07:39, Frederic Weisbecker wrote:
> >Hi,
> >
> >This version includes:
> >
> >* Rename __smp_call_function_single to smp_call_function_single() as
> >   suggested by Christoph.
> >
> >* Acks and reviewed-by added.
> >
> >* Rebase against -rc4
> >
> >Thanks.
> 
> I'd be happy to take this in, as it's mostly centered around the blk
> cleanups.

I initially planned to push this tree to Ingo because I have some nohz patches
that will depend on this set.

But there are indeed quite a few block changes there. So I should let these go
through your tree to avoid bad conflicts. If you never rebase your tree I can
work on top of it.

Thanks!


Re: [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace

2014-02-25 Thread Frederic Weisbecker
On Tue, Feb 25, 2014 at 01:33:55PM +0100, Henrik Austad wrote:
> From: Henrik Austad 
> 
> Hi!
> 
> This is a rework of the preiovus patch based on the feedback gathered
> from the last round. I've split it up a bit, mostly to make it easier to
> single out the parts that require more attention (#4 comes to mind).
> 
> Being able to read (and possible force a specific CPU to handle all
> do_timer() updates) can be very handy when debugging a system and tuning
> for performance. It is not always easy to route interrupts to a specific
> core (or away from one, for that matter).

It's a bit vague as a reason for the patchset. Do we really need it?

Concerning the read-only part, if I want to know which CPU is handling the
timekeeping, I'd rather use tracing than a sysfs file. I can correlate
timekeeping update traces with other events. Especially as the timekeeping duty
can change hands and move to any CPU all the time. We really don't want to
poll on a sysfs file to get that information. It's not well suited for that and
doesn't carry any timestamp. It may be useful only if the timekeeping CPU is static.

Now looking at the write part. What kind of usecase do you have in mind?

It's also important to consider that, in the case of NO_HZ_IDLE, if you force
the timekeeping duty to a specific CPU, it won't be able to enter dynticks
idle mode as long as any other CPU is running. That is because those CPUs can
make use of jiffies or gettimeofday() and must have up-to-date values. This
involves quite some complication, like using the full system idle detection
(CONFIG_NO_HZ_FULL_SYSIDLE) to avoid races between the timekeeper entering
dynticks idle mode and other CPUs waking up from idle. But the worst here is
the powersaving issue resulting from the timekeeper that can't sleep.

These issues are being dealt with in NO_HZ_FULL because we want the timekeeping
duty to be affine to the CPUs that are not full dynticks. But in the case of
NO_HZ_IDLE, I fear it's not going to be desirable.

Thanks.


Re: [PATCH V6 0/4] perf/x86/amd: AMD Family 16h Data Breakpoint Extensions

2014-02-25 Thread Frederic Weisbecker
On Tue, Feb 25, 2014 at 09:09:29AM -0600, Suravee Suthikulanit wrote:
> On 1/8/2014 7:52 PM, Frederic Weisbecker wrote:
> >This message has been archived. View the original item
> ><http://ausev2.amd.com/EnterpriseVault/ViewMessage.asp?VaultId=1EBD12133601C6E47868220A36CFE2B20111amdvault.amd.com&SavesetId=201402081892875~20140109015303~Z~D12198340FA0E3A8078F6DDADF1DAC51>
> >Hi Suravee,
> >
> >On Wed, Jan 08, 2014 at 01:00:36PM -0600, Suravee Suthikulanit wrote:
> > > Ping.  Are there any other concerns regarding this patch?
> > >
> > > Thank you,
> > >
> > > Suravee
> >
> >The patches look good. I'll apply the series and propose it to the perf
> >maintainers.
> >
> >Thanks!
> >
> 
> Frederic,
> 
> I am following up with this patch set.

Thanks for the reminder. I'm applying this right now and will propose it for 3.15.

Thanks.


Re: [PATCH 0/6 v2] Expose do_timer CPU as RW to userspace

2014-02-26 Thread Frederic Weisbecker
On Wed, Feb 26, 2014 at 09:16:03AM +0100, Henrik Austad wrote:
> On Tue, Feb 25, 2014 at 03:19:09PM +0100, Frederic Weisbecker wrote:
> > On Tue, Feb 25, 2014 at 01:33:55PM +0100, Henrik Austad wrote:
> > > From: Henrik Austad 
> > > 
> > > Hi!
> > > 
> > > This is a rework of the preiovus patch based on the feedback gathered
> > > from the last round. I've split it up a bit, mostly to make it easier to
> > > single out the parts that require more attention (#4 comes to mind).
> > > 
> > > Being able to read (and possible force a specific CPU to handle all
> > > do_timer() updates) can be very handy when debugging a system and tuning
> > > for performance. It is not always easy to route interrupts to a specific
> > > core (or away from one, for that matter).
> > 
> > It's a bit vague as a reason for the patchset. Do we really need it?
> 
> One case is to move the timekeeping away from cores I know have 
> interrupt-issues (in an embedded setup, it is not always easy to move 
> interrupts away).
> 
> Another is to remove jitter from cores doing either real-time work or heavy 
> workerthreads. The timekeeping update is pretty fast, but I do not see any 
> reason for letting timekeeping interfere with my workers if it does not 
> have to.

Ok. I'll get back to that below.
 
> > Concerning the read-only part, if I want to know which CPU is handling the
> > timekeeping, I'd rather use tracing than a sysfs file. I can correlate
> > timekeeping update traces with other events. Especially as the timekeeping 
> > duty
> > can change hands and move to any CPU all the time. We really don't want to
> > poll on a sysfs file to get that information. It's not adapted and doesn't
> > carry any timestamp. It may be useful only if the timekeeping CPU is static.
> 
> I agree that not having a timestamp will make it useless wrt to tracing, 
> but that was never the intention. By having a sysfs/sysctl value you can 
> quickly determine if the timekeeping is bound to a single core or if it is 
> handled everywhere.
> 
> Tracing will give you the most accurate result, but that's not always what 
> you want as tracing also provides an overhead (both in the kernel as well 
> as in the head of the user) using the sysfs/sysctl interface for grabbing 
> the CPU does not.
> 
> You can also use it to verify that the forced-cpu you just sat, did in fact 
> have the desired effect.
> 
> Another approach I was contemplating, was to let current_cpu return the 
> current mask CPUs where the timer is running, once you set it via 
> forced_cpu, it will narrow down to that particular core. Would that be more 
> useful for the RO approach outisde TICK_PERIODIC?

Ok so this is about checking which CPU the timekeeping is bound to.
But what do you display in the normal case (ie: when timekeeping is globally
affine)?

-1 could be an option but hmm...

Wouldn't it be saner to use a cpumask of the timer affinity instead? This
is the traditional way we affine something in /proc or /sys.

> 
> > Now looking at the write part. What kind of usecase do you have in mind?
> 
> Forcing the timer to run on single core only, and a core of my choosing at 
> that.
> 
> - Get timekeeping away from cores with bad interrupts (no, I cannot move 
>   them).
> - Avoid running timekeeping udpates on worker-cores.

Ok but what you're moving away is not the tick but the timekeeping duty, which
is only a part of the tick. A significant part but still just a part.

Does this all make sense outside the NO_HZ_FULL case?

> 
> > It's also important to consider that, in the case of NO_HZ_IDLE, if you 
> > force
> > the timekeeping duty to a specific CPU, it won't be able to enter in 
> > dynticks
> > idle mode as long as any other CPU is running. 
> 
> Yes, it will in effect be a TICK_PERIODIC core where I can configure which 
> core the timekeeping update will happen.

Ok, I missed that part. So when the timekeeping is affine to a specific CPU,
this CPU is prevented from entering dynticks idle mode?

> 
> > Because those CPUs can make use of jiffies or gettimeofday() and must 
> > have uptodate values. This involve quite some complication like using the 
> > full system idle detection (CONFIG_NO_HZ_FULL_SYSIDLE) to avoid races 
> > between timekeeper entering dynticks idle mode and other CPUs waking up 
> > from idle. But the worst here is the powesaving issues resulting from the 
> > timekeeper who can't sleep.
> 
> Personally, when I force the timer to be bound to a specific CPU, I'm 
> pretty happy with the fact that it won't be allo

[PATCH 2/2] nohz: ensure users are aware boot CPU is not NO_HZ_FULL

2014-02-26 Thread Frederic Weisbecker
From: Paul Gortmaker 

This bit of information is in the Kconfig help text:

  "Note the boot CPU will still be kept outside the range to
  handle the timekeeping duty."

However neither the variable NO_HZ_FULL_ALL, or the prompt
convey this important detail, so lets add it to the prompt
to make it more explicitly obvious to the average user.

Acked-by: Paul E. McKenney 
Signed-off-by: Paul Gortmaker 
Cc: Ingo Molnar 
Cc: Paul Gortmaker 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1391711781-7466-1-git-send-email-paul.gortma...@windriver.com
Signed-off-by: Frederic Weisbecker 
---
 kernel/time/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 3ce6e8c..f448513 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -124,7 +124,7 @@ config NO_HZ_FULL
 endchoice
 
 config NO_HZ_FULL_ALL
-   bool "Full dynticks system on all CPUs by default"
+   bool "Full dynticks system on all CPUs by default (except CPU 0)"
depends on NO_HZ_FULL
help
  If the user doesn't pass the nohz_full boot option to
-- 
1.8.3.1



[PATCH 1/2] timer: Spare IPI when deferrable timer is queued on idle remote targets

2014-02-26 Thread Frederic Weisbecker
From: Viresh Kumar 

When a timer is enqueued or modified on a remote target, the latter is
expected to see and handle this timer on its next tick. However if the
target is idle and CONFIG_NO_HZ_IDLE=y, the CPU may be sleeping tickless
and the timer may be ignored.

wake_up_nohz_cpu() takes care of that by setting TIF_NEED_RESCHED and
sending an IPI to idle targets so that the tick is reevaluated on the
idle loop through the tick_nohz_idle_*() APIs.

Now this is all performed regardless of the power properties of the
timer. If the timer is deferrable, idle targets don't need to be woken
up. Only the next busy tick needs to care about it, and no IPI kick
is needed for that to happen.

So let's spare the IPI on idle targets when the timer is deferrable.

Meanwhile we keep the current behaviour on full dynticks targets. We can
spare IPIs on idle full dynticks targets as well, but some tricky races
against idle_cpu() must be dealt with to make sure that the timer
is properly handled after idle exit. We can deal with that later since
NO_HZ_FULL already has more important powersaving issues.
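
For reference, a deferrable timer targeted at a remote CPU is set up roughly as
below (sketch only, made-up names, using the timer API of this era); with this
patch, add_timer_on() no longer kicks an idle, non full dynticks target:

#include <linux/timer.h>
#include <linux/jiffies.h>

static struct timer_list my_deferrable_timer;	/* illustrative example */

static void my_timer_fn(unsigned long data)
{
	/* Runs from the target CPU's next busy tick, not from an idle wakeup */
}

static void my_arm_on(int cpu)
{
	init_timer_deferrable(&my_deferrable_timer);
	my_deferrable_timer.function = my_timer_fn;
	my_deferrable_timer.data = 0;
	my_deferrable_timer.expires = jiffies + HZ;
	/* With this patch, no IPI is sent if @cpu is idle */
	add_timer_on(&my_deferrable_timer, cpu);
}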

Reported-by: Thomas Gleixner 
Signed-off-by: Viresh Kumar 
Cc: Ingo Molnar 
Cc: Paul Gortmaker 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/cakohpommz0tan2e6n76_g4zrzxd5vz1xfuzfxrp7gmxftni...@mail.gmail.com
Signed-off-by: Frederic Weisbecker 
---
 kernel/timer.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/timer.c b/kernel/timer.c
index accfd24..b75e789 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -939,8 +939,15 @@ void add_timer_on(struct timer_list *timer, int cpu)
 * with the timer by holding the timer base lock. This also
 * makes sure that a CPU on the way to stop its tick can not
 * evaluate the timer wheel.
+*
+* Spare the IPI for deferrable timers on idle targets though.
+* The next busy ticks will take care of it. Except full dynticks
+* require special care against races with idle_cpu(), lets deal
+* with that later.
 */
-   wake_up_nohz_cpu(cpu);
+   if (!tbase_get_deferrable(timer->base) || tick_nohz_full_cpu(cpu))
+   wake_up_nohz_cpu(cpu);
+
spin_unlock_irqrestore(&base->lock, flags);
 }
 EXPORT_SYMBOL_GPL(add_timer_on);
-- 
1.8.3.1



[GIT PULL] timers update for 3.15

2014-02-26 Thread Frederic Weisbecker
Ingo,

Please pull the timers/core branch that can be found at:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
timers/core
---

* The patch from Viresh removes some unnecessary scheduler IPIs that wake
  up the CPUs when deferrable timers are enqueued on remote targets.

  In practice I have seen on boot some of these IPIs from various
  sources:  MCE, vmstat/SLAB, cpufreq. They happen either on initcall
  or cpu hotplug. Since these timers are enqueued on all CPUs, there
  are some potential big rounds of IPIs that are spared with this patch.
  
  But it's just what I've seen on my own machine on boot. I expect some
  more scenarios where a few IPIs will be avoided depending on configs
  and use cases because we have some more users of deferrable timers.

* Kconfig text made clearer for full dynticks by Paul Gortmaker.

Thanks,
Frederic
---

Paul Gortmaker (1):
  nohz: ensure users are aware boot CPU is not NO_HZ_FULL

Viresh Kumar (1):
  timer: Spare IPI when deferrable timer is queued on idle remote targets


 kernel/time/Kconfig | 2 +-
 kernel/timer.c  | 9 -
 2 files changed, 9 insertions(+), 2 deletions(-)


Re: [PATCH 2/9] block: Stop abusing rq->csd.list in blk-softirq

2014-01-30 Thread Frederic Weisbecker
On Thu, Jan 30, 2014 at 04:45:23PM +0100, Jan Kara wrote:
>   Hi,
> 
> On Thu 30-01-14 13:39:18, Frederic Weisbecker wrote:
> > I'm currently working on some cleanups on IPI code too and working on top
> > of these patches, just have a few comments:
>   Great, thanks!
> 
> > On Mon, Dec 23, 2013 at 09:39:23PM +0100, Jan Kara wrote:
> > > Abusing rq->csd.list for a list of requests to complete is rather ugly.
> > > Especially since using queuelist should be safe and much cleaner.
> > 
> > It would be nice to have a few more details that explain why doing so is 
> > safe
> > wrt a block request lifecycle. At least something that tells why 
> > rq->queuelist
> > can't be ever used concurrently by the time we send the IPI and we 
> > trigger/raise
> > the softirq.
>   Sure. Should I send the patch to you with an updated changelog and added
> comment you requested?

Yeah that would be nice!

To be more precise, I would like to apply these:

1) block: Stop abusing csd.list for fifo_time
2) block: Stop abusing rq->csd.list in blk-softirq
3) kernel: use lockless list for smp_call_function_single()
4) smp: Teach __smp_call_function_single() to check for offline cpus

This is because I need to tweak the IPI code a bit for some nohz
functionality. I'm not sure about the result but in any case these
llist-based cleanups look very nice, so let's try to push them. I'm also
working on some consolidation between __smp_call_function_single()
and smp_call_function_single() since they share almost the same code.

Also this shouldn't conflict with Andrew's tree if he applies these as well
since -mm is based on -next. And the printk part should still go through his
tree I think.

> 
> > > Signed-off-by: Jan Kara 
> > > ---
> > >  block/blk-softirq.c | 12 ++--
> > >  1 file changed, 6 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/block/blk-softirq.c b/block/blk-softirq.c
> > > index 57790c1a97eb..7ea5534096d5 100644
> > > --- a/block/blk-softirq.c
> > > +++ b/block/blk-softirq.c
> > > @@ -30,8 +30,8 @@ static void blk_done_softirq(struct softirq_action *h)
> > >   while (!list_empty(&local_list)) {
> > >   struct request *rq;
> > >  
> > > - rq = list_entry(local_list.next, struct request, csd.list);
> > > - list_del_init(&rq->csd.list);
> > > + rq = list_entry(local_list.next, struct request, queuelist);
> > > + list_del_init(&rq->queuelist);
> > >   rq->q->softirq_done_fn(rq);
> > >   }
> > >  }
> > > @@ -45,9 +45,9 @@ static void trigger_softirq(void *data)
> > >  
> > >   local_irq_save(flags);
> > >   list = this_cpu_ptr(&blk_cpu_done);
> > > - list_add_tail(&rq->csd.list, list);
> > > + list_add_tail(&rq->queuelist, list);
> > 
> > And given that's an alternate use of rq->queuelist, perhaps add a comment
> > to unconfuse people.
>   Good idea, will do.

Thanks!


Re: [PATCH 2/9] block: Stop abusing rq->csd.list in blk-softirq

2014-01-31 Thread Frederic Weisbecker
On Thu, Jan 30, 2014 at 11:12:41PM +0100, Jan Kara wrote:
> On Thu 30-01-14 18:01:20, Frederic Weisbecker wrote:
> > On Thu, Jan 30, 2014 at 04:45:23PM +0100, Jan Kara wrote:
> > >   Hi,
> > > 
> > > On Thu 30-01-14 13:39:18, Frederic Weisbecker wrote:
> > > > I'm currently working on some cleanups on IPI code too and working on 
> > > > top
> > > > of these patches, just have a few comments:
> > >   Great, thanks!
> > > 
> > > > On Mon, Dec 23, 2013 at 09:39:23PM +0100, Jan Kara wrote:
> > > > > Abusing rq->csd.list for a list of requests to complete is rather 
> > > > > ugly.
> > > > > Especially since using queuelist should be safe and much cleaner.
> > > > 
> > > > It would be nice to have a few more details that explain why doing so 
> > > > is safe
> > > > wrt a block request lifecycle. At least something that tells why 
> > > > rq->queuelist
> > > > can't be ever used concurrently by the time we send the IPI and we 
> > > > trigger/raise
> > > > the softirq.
> > >   Sure. Should I send the patch to you with an updated changelog and added
> > > comment you requested?
> > 
> > Yeah that would be nice!
>   OK, the updated patch is attached.

Applied, thanks!

Note that Christoph's patch making smp.c use llist has been merged
upstream today. But it keeps list_head in a union, so I applied your changes
that:

1) remove list_head from smp.c
2) use llist_for_each_entry_safe()

in separate delta patches. Anyway, I'll send the series soonish.


Re: Is it ok for deferrable timer wakeup the idle cpu?

2014-01-31 Thread Frederic Weisbecker
On Wed, Jan 29, 2014 at 10:57:59AM +0530, Preeti Murthy wrote:
> Hi,
> 
> On Thu, Jan 23, 2014 at 11:22 AM, Viresh Kumar  
> wrote:
> 
> > Hi Guys,
> >
> > So the first question is why cpufreq needs it and is it really stupid?
> > Yes, it is stupid but that's how its implemented since a long time. It does
> > so to get data about the load on CPUs, so that freq can be scaled up/down.
> >
> > Though there is a solution in discussion currently, which will take
> > inputs from scheduler and so these background timers would go away.
> > But we need to wait until that time.
> >
> > Now, why do we need that for every cpu, while that for a single cpu might
> > be enough? The answer is cpuidle here: What if the cpu responsible for
> > running timer goes to sleep? Who will evaluate the load then? And if we
> > make this timer run on one cpu in non-deferrable mode then that cpu
> > would be waken up again and again from idle. So, it was decided to have
> > a per-cpu deferrable timer. Though to improve efficiency, once it is fired
> > on any cpu, timer for all other CPUs are rescheduled, so that they don't
> > fire before 5ms (sampling time)..
> 
> How about simplifying this design by doing the below?
> 
> 1. Since anyway cpufreq governors monitor load on the cpu once every
> 5ms, *tie it with tick_sched_timer*, which also gets deferred when the cpu
> enters nohz_idle.
> 
> 2. To overcome the problem of running this job of monitoring the load
> on every cpu, have the *time keeping* cpu do it for you.
> 
> The time keeping cpu has the property that if it has to go to idle, it will do
> so and let the next cpu that runs the periodic timer become the time keeper.
> Hence no cpu is prevented from entering nohz_idle and the cpu that is busy
> and first executes periodic timer will take over as the time keeper.
> 
> The result would be:
> 
> 1. One cpu at any point in time will be monitoring cpu load, at every sched 
> tick
> as long as its busy. If it goes to sleep, then it gives up this duty
> and enters idle.
> The next cpu that runs the periodic timer becomes the cpu to monitor the load
> and will continue to do so as long as its busy. Hence we do not miss 
> monitoring
> the cpu load.

Well that's basically what an unbound deferrable timer does. It's deferrable, so
it doesn't prevent the CPU from entering dynticks idle mode, and it's not affine
to any particular CPU, so it's going to be tied to a busy CPU according to the
scheduler (see get_nohz_timer_target()).
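
In code, that looks roughly like this (sketch only, hypothetical names):

#include <linux/timer.h>
#include <linux/jiffies.h>

static struct timer_list load_timer;		/* illustrative */

static void my_load_check(unsigned long data)	/* hypothetical callback */
{
	/* sample the load here, then re-arm */
	mod_timer(&load_timer, jiffies + msecs_to_jiffies(5));
}

static void start_load_timer(void)
{
	init_timer_deferrable(&load_timer);
	load_timer.function = my_load_check;
	load_timer.data = 0;
	/*
	 * Deferrable: doesn't wake an idle CPU. Not pinned: the kernel may
	 * migrate it to a busy CPU when armed from an idle one
	 * (see get_nohz_timer_target()).
	 */
	mod_timer(&load_timer, jiffies + msecs_to_jiffies(5));
}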

> 
> 2. This will avoid an additional timer for cpufreq.

That doesn't look like a problem.

> 
> 3. It avoids sending IPIs each time this timer gets modified since there is 
> just
> one CPU doing the monitoring.

If we fix the initial issue properly, we shouldn't need to send an IPI anymore.

> 
> 4. The downside to this could be that we are stretching the functions of the
>  periodic timer into the power management domain which does not seem like
> the right thing to do.

Indeed, that's what I'm worried about. The tick has grown into a Big Kernel
Timer that any subsystem can hook into for any kind of periodic event. This is
why it was not easy to implement full dynticks, and it's not even complete yet
due to the complicated dependencies involved.

> 
> Having said the above, the fix that Viresh has proposed along with the 
> nohz_full
> condition that Frederick added looks to solve this problem.

In any case I believe we want Viresh's patch since there are other users
of deferrable timers that can benefit from this.

So I'm queueing it.

> 
> But just a thought on if there is scope to improve this part of the
> cpufreq code.
> What do you all think?

I fear I don't know the problem well enough to offer any serious advice.
It depends on what kind of measurement is needed. For example, aren't there
some load statistics that are already available from the scheduler that you
could reuse?

The scheduler alone gathers gazillions of different load and power statistics
in interesting paths such as the tick or sched switches. Aren't there some
read-only metrics that could be interesting?


Re: [PATCH] nohz: ensure users are aware boot CPU is not NO_HZ_FULL

2014-01-31 Thread Frederic Weisbecker
On Tue, Jan 28, 2014 at 04:06:30PM -0800, Paul E. McKenney wrote:
> On Tue, Jan 28, 2014 at 12:10:49PM -0500, Paul Gortmaker wrote:
> > This bit of information is in the Kconfig help text:
> > 
> >   Note the boot CPU will still be kept outside the range to
> >   handle the timekeeping duty.
> > 
> > However neither the variable NO_HZ_FULL_ALL, or the prompt
> > convey this important detail, so lets add it to the prompt
> > to make it more explicitly obvious to the average user.
> > 
> > Cc: Frederic Weisbecker 
> > Cc: "Paul E. McKenney" 
> > Signed-off-by: Paul Gortmaker 
> 
> Acked-by: Paul E. McKenney 
> 
> > ---
> >  kernel/time/Kconfig | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
> > index 3ce6e8c5f3fc..76a1fbef1fd8 100644
> > --- a/kernel/time/Kconfig
> > +++ b/kernel/time/Kconfig
> > @@ -124,7 +124,7 @@ config NO_HZ_FULL
> >  endchoice
> > 
> >  config NO_HZ_FULL_ALL
> > -   bool "Full dynticks system on all CPUs by default"
> > +   bool "Full dynticks system on all (but boot) CPUs by default"

Looks good. Now we are all using "boot CPU" to refer to CPU 0, but it may
not sound that obvious to everyone.

How about: "Full dynticks system on all CPUs by default (except CPU 0)"?

Sure, this is a temporary behaviour until I get the sysidle detection
patchset in shape, but it is indeed worth a big fat warning until we sort it out.

Thanks!

> > depends on NO_HZ_FULL
> > help
> >   If the user doesn't pass the nohz_full boot option to
> > -- 
> > 1.8.5.2
> > 
> 


Re: [PATCH 1/9] block: Stop abusing csd.list for fifo_time

2014-02-01 Thread Frederic Weisbecker
On Mon, Dec 23, 2013 at 09:39:22PM +0100, Jan Kara wrote:
> Block layer currently abuses rq->csd.list.next for storing fifo_time.
> That is a terrible hack and completely unnecessary as well. Union
> achieves the same space saving in a cleaner way.
> 
> Signed-off-by: Jan Kara 

Hi Jan,

Taken as is, the patch is fine and it builds.
But later when I finally get rid of csd->list in a subsequent patch,
rq_fifo_clear() callers break the build.

This is because rq_fifo_clear() initializes the csd->list and I'm not
sure how to fix that leftover because I am not clear about the purpose
of that INIT_LIST_HEAD(): is it to reset fifo_time or to prepare for
an IPI to be queued?

All in all it looks buggy, because if this is to prepare for the IPI,
it's useless as csd.list is not a list head but just a node. Otherwise, if it
is to reset fifo_time, it's wrong because INIT_LIST_HEAD() doesn't initialize
to 0.

Anyway so I did a braindead fix by removing the whole INIT_LIST_HEAD()
from rq_fifo_clear(), see the patch below. Now to be honest I have no idea
what I'm doing.

---
From 112bcbb73076dd1374541eec9b554410dd0e73e0 Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Mon, 23 Dec 2013 21:39:22 +0100
Subject: [PATCH] block: Stop abusing csd.list for fifo_time

The block layer currently abuses rq->csd.list.next for storing fifo_time.
That is a terrible hack and completely unnecessary as well. A union
achieves the same space saving in a cleaner way.

Signed-off-by: Jan Kara 
Signed-off-by: Frederic Weisbecker 
---
 block/cfq-iosched.c  |  8 
 block/deadline-iosched.c |  8 
 include/linux/blkdev.h   |  1 +
 include/linux/elevator.h | 12 ++--
 4 files changed, 11 insertions(+), 18 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 744833b..5873e4a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2367,10 +2367,10 @@ cfq_merged_requests(struct request_queue *q, struct 
request *rq,
 * reposition in fifo if next is older than rq
 */
if (!list_empty(&rq->queuelist) && !list_empty(&next->queuelist) &&
-   time_before(rq_fifo_time(next), rq_fifo_time(rq)) &&
+   time_before(next->fifo_time, rq->fifo_time) &&
cfqq == RQ_CFQQ(next)) {
list_move(&rq->queuelist, &next->queuelist);
-   rq_set_fifo_time(rq, rq_fifo_time(next));
+   rq->fifo_time = next->fifo_time;
}
 
if (cfqq->next_rq == next)
@@ -2814,7 +2814,7 @@ static struct request *cfq_check_fifo(struct cfq_queue 
*cfqq)
return NULL;
 
rq = rq_entry_fifo(cfqq->fifo.next);
-   if (time_before(jiffies, rq_fifo_time(rq)))
+   if (time_before(jiffies, rq->fifo_time))
rq = NULL;
 
cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
@@ -3927,7 +3927,7 @@ static void cfq_insert_request(struct request_queue *q, 
struct request *rq)
cfq_log_cfqq(cfqd, cfqq, "insert_request");
cfq_init_prio_data(cfqq, RQ_CIC(rq));
 
-   rq_set_fifo_time(rq, jiffies + cfqd->cfq_fifo_expire[rq_is_sync(rq)]);
+   rq->fifo_time = jiffies + cfqd->cfq_fifo_expire[rq_is_sync(rq)];
list_add_tail(&rq->queuelist, &cfqq->fifo);
cfq_add_rq_rb(rq);
cfqg_stats_update_io_add(RQ_CFQG(rq), cfqd->serving_group,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 9ef6640..a753df2 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -106,7 +106,7 @@ deadline_add_request(struct request_queue *q, struct 
request *rq)
/*
 * set expire time and add to fifo list
 */
-   rq_set_fifo_time(rq, jiffies + dd->fifo_expire[data_dir]);
+   rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
 }
 
@@ -174,9 +174,9 @@ deadline_merged_requests(struct request_queue *q, struct 
request *req,
 * and move into next position (next will be deleted) in fifo
 */
if (!list_empty(&req->queuelist) && !list_empty(&next->queuelist)) {
-   if (time_before(rq_fifo_time(next), rq_fifo_time(req))) {
+   if (time_before(next->fifo_time, req->fifo_time)) {
list_move(&req->queuelist, &next->queuelist);
-   rq_set_fifo_time(req, rq_fifo_time(next));
+   req->fifo_time = next->fifo_time;
}
}
 
@@ -230,7 +230,7 @@ static inline int deadline_check_fifo(struct deadline_data 
*dd, int ddir)
/*
 * rq is expired!
 */
-   if (time_after_eq(jiffies, rq_fifo_time(rq)))
+   if (time_after_eq(jiffies, rq->fifo_time))
return 1;
 
  

Re: [PATCH 1/9] block: Stop abusing csd.list for fifo_time

2014-02-03 Thread Frederic Weisbecker
On Mon, Feb 03, 2014 at 03:48:29PM +0100, Jan Kara wrote:
> On Sat 01-02-14 17:48:27, Frederic Weisbecker wrote:
> > On Mon, Dec 23, 2013 at 09:39:22PM +0100, Jan Kara wrote:
> > > Block layer currently abuses rq->csd.list.next for storing fifo_time.
> > > That is a terrible hack and completely unnecessary as well. Union
> > > achieves the same space saving in a cleaner way.
> > > 
> > > Signed-off-by: Jan Kara 
> > 
> > Hi Jan,
> > 
> > Taken as is, the patch is fine and it builds.
> > But later when I finally get rid of csd->list in a subsequent patch,
> > rq_fifo_clear() callers break the build.
> > 
> > This is because rq_fifo_clear() initialize the csd->list and I'm not
> > sure how to fix that leftover because I am not clear about the purpose
> > of that INIT_LIST_HEAD(): is it to reset fifo time or to prepare for
> > an IPI to be queued?
>   I'm convinced it is there to prepare IPI to be queued. So just removing
> the initialization as you did should be the right thing to do. You can
> easily verify that it is correct - you boot the kernel, switch to 'deadline'
> IO scheduler by doing
>   echo 'deadline' >/sys/block/sda/queue/scheduler
> and then do some IO. If it doesn't blow up, it is correct.

Ok, that seems to work :)

> 
> > All in one it looks buggy because if this is to prepare for the IPI,
> > it's useless as csd.list is not a list head but just a node. Otherwise if it
> > is to reset fifo_time it's wrong because INIT_LIST_HEAD doesn't initialize
> > to 0.
>   Yup, I think it is useless.

Ok then I'll apply this change. I'm just moving it to a separate patch to lower
the chance of it being missed.

Thanks.
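
For reference, the union mentioned in the quoted changelog amounts to
something like this in struct request (a sketch of the idea rather than the
exact blkdev.h hunk):

struct request {
	...
	union {
		struct call_single_data csd;	/* completion IPI data */
		unsigned long fifo_time;	/* io scheduler fifo expiry */
	};
	...
};

The idea being that csd is only needed once the request completes while
fifo_time is only needed while the request sits in the io scheduler, so the
two never need to be live at the same time.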


Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

2014-02-13 Thread Frederic Weisbecker
On Wed, Feb 12, 2014 at 04:33:11PM -0800, Paul E. McKenney wrote:
> 
> Fair point!  I wordsmithed it into the following.  Seem reasonable?
> 
>   Thanx, Paul
> 
> 
> 
> Documentation/kernel-per-CPU-kthreads.txt: Workqueue affinity
> 
> This commit documents the ability to apply CPU affinity to WQ_SYSFS
> workqueues, thus offloading them from the desired worker CPUs.
> 
> Signed-off-by: Paul E. McKenney 
> Reviewed-by: Tejun Heo 
> Acked-by: Frederic Weisbecker 
> 
> diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
> index 827104fb9364..f3cd299fcc41 100644
> --- a/Documentation/kernel-per-CPU-kthreads.txt
> +++ b/Documentation/kernel-per-CPU-kthreads.txt
> @@ -162,7 +162,18 @@ Purpose: Execute workqueue requests
>  To reduce its OS jitter, do any of the following:
>  1.   Run your workload at a real-time priority, which will allow
>   preempting the kworker daemons.
> -2.   Do any of the following needed to avoid jitter that your
> +2.   A given workqueue can be made visible in the sysfs filesystem
> + by passing the WQ_SYSFS to that workqueue's alloc_workqueue().
> + Such a workqueue can be confined to a given subset of the
> + CPUs using the /sys/devices/virtual/workqueue/*/cpumask sysfs
> + files.  The set of WQ_SYSFS workqueues can be displayed using
> + "ls sys/devices/virtual/workqueue".  That said, the workqueues
> + maintainer would like to caution people against indiscriminately
> + sprinkling WQ_SYSFS across all the workqueues.  The reason for
> + caution is that it is easy to add WQ_SYSFS, but because sysfs is
> + part of the formal user/kernel API, it can be nearly impossible
> + to remove it, even if its addition was a mistake.
> +3.   Do any of the following needed to avoid jitter that your
>   application cannot tolerate:
>   a.  Build your kernel with CONFIG_SLUB=y rather than
>   CONFIG_SLAB=y, thus avoiding the slab allocator's periodic
> 

Perfect!!


Re: [RFC - beta][PATCH] tracing: Introduce TRACE_MARKER() no argument trace event

2014-02-13 Thread Frederic Weisbecker
On Fri, Feb 07, 2014 at 11:23:57AM -0500, Steven Rostedt wrote:
> A while back ago, Wolfgang (and others) have asked about the ability to
> add a trace event that recorder no data other than the fact that the
> trace event was hit.
> 
> I've been reluctant to do so, but I noticed that one already exists in
> the iwlwifi tracepoints (iwlwifi_dev_irq) where it does a wasteful:
> 
> /* TP_printk("") doesn't compile */
> TP_printk("%d", 0)
> 
> The reason this is wasteful, is that there's a lot of code generated by
> the TRACE_EVENT() macros that end up basically being nops.
> 
> I figured I would instead create a TRACE_MARKER(name, print) that would
> be something like:
> 
> Added to tracepoint header:
> 
>  TRACE_MARKER(tpname, "Show this message");
> 
> Then you have:
> 
>   trace_tpname();
> 
> in the code.
> 
> Notice that the tracepoint function (trace_()) has no arguments.
> That's because the message is stored in the tracepoint (in one place)
> and is printed when the tracepoint is read. That is, the message isn't
> even recorded in the ring buffer.
> 
> It still shows up in the tracepoint format file:
> 
> name: tpname
> ID: 281
> format:
>   field:unsigned short common_type;   offset:0;   size:2; 
> signed:0;
>   field:unsigned char common_flags;   offset:2;   size:1; 
> signed:0;
>   field:unsigned char common_preempt_count;   offset:3;   size:1; 
> signed:0;
>   field:int common_pid;   offset:4;   size:4; signed:1;
> 
> 
> print fmt: "Show this message"
> 
> The TRACE_MARKER() is basically an optimized version of TRACE_EVENT()
> with no arguments. It can be enabled and disabled the same way as any
> other event, and stays within the system it was created in.
> 
> Is this worth doing?

It sounds worth it, yeah. I've run into this situation multiple times where I
had to pass 0 instead of nothing to a tracepoint.
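
For illustration, the kind of workaround I mean looks roughly like this
(a made-up event, not one from the tree):

TRACE_EVENT(foo_event,

	/* Hypothetical event: no real payload, but the macros want an argument */
	TP_PROTO(int dummy),

	TP_ARGS(dummy),

	TP_STRUCT__entry(
		__field(int, dummy)
	),

	TP_fast_assign(
		__entry->dummy = dummy;
	),

	/* TP_printk("") doesn't compile, so print the useless value */
	TP_printk("%d", __entry->dummy)
);

and every call site then has to do trace_foo_event(0).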

Now about the name, why not TRACE_EVENT_EMPTY?

Thanks.


Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

2014-02-13 Thread Frederic Weisbecker
On Tue, Feb 11, 2014 at 02:22:43PM +0530, Viresh Kumar wrote:
> On 28 January 2014 18:53, Frederic Weisbecker  wrote:
> > No, when a single task is running on a full dynticks CPU, the tick is 
> > supposed to run
> > every seconds. I'm actually suprised it doesn't happen in your traces, did 
> > you tweak
> > something specific?
> 
> Why do we need this 1 second tick currently? And what will happen if I
> hotunplug that
> CPU and get it back? Would the timer for tick move away from CPU in
> question? I see
> that when I have changed this 1sec stuff to 300 seconds. But what
> would be impact
> of that? Will things still work normally?

So the problem resides in the gazillion accountings maintained in
scheduler_tick() and current->sched_class->task_tick().

The scheduler's correctness depends on these being updated regularly. If you
deactivate that tick or increase the delay to very high values, the result is
unpredictable. Just expect at least some scheduler features to behave
erratically: load balancing for example, or simply local fairness issues.

So we have that 1 Hz max which makes sure that things keep moving forward
while keeping a rate that should still be fine for HPC workloads. But we
certainly want to find a way to remove the need for any tick altogether for
extreme real-time workloads which need guarantees rather than just
optimizations.
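
For reference, that 1 Hz cap is enforced by scheduler_tick_max_deferment(),
which looks roughly like this (sketch from memory of kernel/sched/core.c,
details may differ):

u64 scheduler_tick_max_deferment(void)
{
	struct rq *rq = this_rq();
	unsigned long next, now = ACCESS_ONCE(jiffies);

	/* Never defer the next tick more than one second past the last one */
	next = rq->last_sched_tick + HZ;

	if (time_before_eq(next, now))
		return 0;

	return jiffies_to_usecs(next - now) * NSEC_PER_USEC;
}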

I see two potential solutions for that:

1) Rework the scheduler accounting such that it is safe against full dynticks.
That was the initial plan but it's scary. The scheduler accounting is a huge
maze and I'm not sure it's actually worth the complication.

2) Offload the accounting. For example we could imagine that the timekeeping
CPU could handle the task_tick() calls on behalf of the full dynticks CPUs,
at a small rate like 1 Hz.


Re: [RFC] sched: Add a new lockless wake-from-idle implementation

2014-02-13 Thread Frederic Weisbecker
2014-02-13 2:40 GMT+01:00 Andy Lutomirski :
> This is a strawman proposal to simplify the idle implementation, eliminate
> a race

Please describe the race in question.

>
> Benefits over current code:
>  - ttwu_queue_remote doesn't use an IPI unless needed
>  - The diffstat should speak for itself :)

Actually, referring to the diffstat alone sounds dangerous here. Sure
this simplifies the code, I'm all for a negative diffstat, and it probably
avoids a few IPIs, but this comes at the cost of added non-cheap
atomic_xchg() calls in some critical fastpaths like the resched_task()
path, the timer enqueue path and the inner idle loop.

So it's not like this all comes for free. I'm not saying we don't want
it, but at least some serious measurements are needed.


Re: [RFC - beta][PATCH] tracing: Introduce TRACE_MARKER() no argument trace event

2014-02-13 Thread Frederic Weisbecker
On Thu, Feb 13, 2014 at 10:25:52AM -0500, Steven Rostedt wrote:
> On Thu, 13 Feb 2014 15:26:42 +0100
> Frederic Weisbecker  wrote:
> 
> > > Is this worth doing?
> > 
> > It sounds worth yeah. I've run into this situation multiple times where I 
> > had
> 
> Is it?
> 
> > to pass 0 instead of nothing on a tracepoint.
> 
> What tracepoints?

The context tracking ones. Yeah, TBH that sole user doesn't justify the feature :)

> 
> The problem I have with a tracepoint that does not pass any information
> what's so ever, is that it can be too tempting to use. Tracepoints do
> not have zero overhead when not used, but a negligible one. Too many
> tracepoints add up, and their foot print can start to be noticed.

Yeah, that may be overused. I don't know, actually the problem hasn't arisen
much.
 
> Point being, why have a tracepoint if you are not recording any data?
> Just do a function trace, or add a kprobe. That's easy enough.

Yeah, but it's still easier to use static tracepoints. Especially when you
need to ask somebody to report some traces.

One of the points of static tracepoints is also to offer some key events
to the user. It matters when we trace all events at once, for example.

> 
> But that said, see below.
> 
> > 
> > Now about the name, why not TRACE_EVENT_EMPTY?
> 
> Because that's an ugly name ;-)
> 
> Also a bit misleading, because it sounds like the there's no items
> attached or something. It is too close to "list_empty()". My original
> name was TRACE_EVENT_NOARGS(), which could work too.
> 
> Now another possible solution is to just introduce a trace_marker()
> function that you could place anywhere. And then they could show up in
> trace/events/markers/file-func-line/
> 
> We could do something like:
> 
> struct trace_marker {
>   char *file;
>   char *func;
>   int line;
>   struct static_key key;
> };
> 
> #define trace_marker() __trace_marker(__FILE__,__func__, __LINE__)
> static inline void __trace_marker(const char *file,
>   const char *func, int line)
> {
>   static struct trace_marker marker
>   __attribute__((section("__trace_marker"))) = {
>   .file = file,
>   .func = func,
>   .line = line
>   };
> 
>   if (static_key_false(&marker.key))
>   trace_marker_call(&marker);
> }
> 
> As marker would be a static value, gcc should hard code the first
> parameter to it and make the call. Basically something like:
> 
>   mov $0x, %rdi
>   call trace_marker_call
> 
> If we really want to be efficient, we could extend jump labels for each
> arch, and remove the call completely.
> 
>   asm goto ("1:"
> ".byte NOP\n"
> ".pushsection __trace_marker_struct\n"
> "2:.quad 1b\n"
> ".quad %file\n"
> ".quad %func\n'
> ".word %line\n"
> ".popsection\n"
> ".pushsection __trace_marker_ptrs\n"
> ".quad 2b\n"
> ".popsection\n"
>: : file, func, line);
> 
> [ OK, I didn't follow the true format for asm syntax, but that's
> because I'm just trying to relay an idea, not actually make working
> code ]
> 
> Then the only thing that would be inserted in the code is a nop. We
> could then replace that nop with a call to a trampoline (similar to
> mcount) that can call a C function with the address that it came from
> so that the function could do a look up to find the matching marker to
> trace.
> 
> OK, this is probably a bit too much, but it is feasible. Most likely
> not worth the pain.

Anyway yeah, given the complication and the fact that there are very few
actual users, let's forget that feature.

Thanks.


[PATCH 1/2] timer: Spare IPI when deferrable timer is queued on idle remote targets

2014-02-14 Thread Frederic Weisbecker
From: Viresh Kumar 

When a timer is enqueued or modified on a remote target, the latter is
expected to see and handle this timer on its next tick. However if the
target is idle and CONFIG_NO_HZ_IDLE=y, the CPU may be sleeping tickless
and the timer may be ignored.

wake_up_nohz_cpu() takes care of that by setting TIF_NEED_RESCHED and
sending an IPI to idle targets so that the tick is reevaluated on the
idle loop through the tick_nohz_idle_*() APIs.

Now this is all performed regardless of the power properties of the
timer. If the timer is deferrable, idle targets don't need to be woken
up. Only the next busy tick needs to care about it, and no IPI kick
is needed for that to happen.

So let's spare the IPI on idle targets when the timer is deferrable.

Meanwhile we keep the current behaviour on full dynticks targets. We can
spare IPIs on idle full dynticks targets as well but some tricky races
against idle_cpu() must be dealt with all along to make sure that the timer
is well handled after idle exit. We can deal with that later since
NO_HZ_FULL already has more important powersaving issues.

Reported-by: Thomas Gleixner 
Signed-off-by: Viresh Kumar 
Cc: Ingo Molnar 
Cc: Paul Gortmaker 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/cakohpommz0tan2e6n76_g4zrzxd5vz1xfuzfxrp7gmxftni...@mail.gmail.com
Signed-off-by: Frederic Weisbecker 
---
 kernel/timer.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/timer.c b/kernel/timer.c
index accfd24..881f883 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -939,8 +939,15 @@ void add_timer_on(struct timer_list *timer, int cpu)
 * with the timer by holding the timer base lock. This also
 * makes sure that a CPU on the way to stop its tick can not
 * evaluate the timer wheel.
+*
+* Spare the IPI for deferrable timers on idle targets though.
+* The next buzy ticks will take care of it. Except full dynticks
+* require special care against races with idle_cpu(), lets deal
+* with that later.
 */
-   wake_up_nohz_cpu(cpu);
+   if (!tbase_get_deferrable(timer->base) || tick_nohz_full_cpu(cpu))
+   wake_up_nohz_cpu(cpu);
+
spin_unlock_irqrestore(&base->lock, flags);
 }
 EXPORT_SYMBOL_GPL(add_timer_on);
-- 
1.8.3.1



[PATCH 0/2] dynticks: A few updates

2014-02-14 Thread Frederic Weisbecker
Hi,

A small pair of updates for the dynticks code. The patch from Viresh removes a
few scheduler IPIs that I have seen on boot, which are there to wake up CPUs
when some deferrable timers are enqueued. Those were enqueued on all CPUs, so
a big round of IPIs was possible. These can be avoided with the first patch.
On my machine the main sources of these were MCE, vmstat/SLAB and cpufreq.
They happen either at initcall time or on CPU hotplug.

But deferrable timers can fire anytime, not just at boot. It depends on your
config and your load. So the positive impact of the first patch should be
broader.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/dynticks-testing.git
timers/core

Thanks.

---
Paul Gortmaker (1):
  nohz: ensure users are aware boot CPU is not NO_HZ_FULL

Viresh Kumar (1):
  timer: Spare IPI when deferrable timer is queued on idle remote
targets

 kernel/time/Kconfig | 2 +-
 kernel/timer.c  | 9 -
 2 files changed, 9 insertions(+), 2 deletions(-)

-- 
1.8.3.1



[PATCH 2/2] nohz: ensure users are aware boot CPU is not NO_HZ_FULL

2014-02-14 Thread Frederic Weisbecker
From: Paul Gortmaker 

This bit of information is in the Kconfig help text:

  "Note the boot CPU will still be kept outside the range to
  handle the timekeeping duty."

However neither the variable NO_HZ_FULL_ALL nor the prompt conveys
this important detail, so let's add it to the prompt to make it more
explicitly obvious to the average user.

Acked-by: Paul E. McKenney 
Signed-off-by: Paul Gortmaker 
Cc: Ingo Molnar 
Cc: Paul Gortmaker 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1391711781-7466-1-git-send-email-paul.gortma...@windriver.com
Signed-off-by: Frederic Weisbecker 
---
 kernel/time/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 3ce6e8c..f448513 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -124,7 +124,7 @@ config NO_HZ_FULL
 endchoice
 
 config NO_HZ_FULL_ALL
-   bool "Full dynticks system on all CPUs by default"
+   bool "Full dynticks system on all CPUs by default (except CPU 0)"
depends on NO_HZ_FULL
help
  If the user doesn't pass the nohz_full boot option to
-- 
1.8.3.1



perf tools: Random cleanups

2014-01-14 Thread Frederic Weisbecker
Hi,

Just a bunch of non-critical cleanups for comm and callchains. Based on the
latest acme:perf/core.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
perf/core-cleanups-for-acme

Thanks,
Frederic
---

Frederic Weisbecker (3):
  perf tools: Do proper comm override error handling
  perf tools: Spare double comparison of callchain first entry
  perf tools: Remove unnecessary callchain cursor state restore on unmatch


 tools/perf/util/callchain.c | 23 ++-
 tools/perf/util/comm.c  | 19 ++-
 tools/perf/util/comm.h  |  2 +-
 tools/perf/util/thread.c|  5 -
 4 files changed, 25 insertions(+), 24 deletions(-)


[PATCH 2/3] perf tools: Spare double comparison of callchain first entry

2014-01-14 Thread Frederic Weisbecker
When a new callchain child branch matches an existing one in the rbtree,
the comparison of its first entry is performed twice:

1) From append_chain_children() on branch lookup

2) If 1) reports a match, append_chain() then compares all entries of
the new branch against the matching node in the rbtree, and this
comparison includes the first entry of the new branch again.

Let's shortcut this by performing the whole comparison only from
append_chain(), which then returns the result of the comparison between
the first entry of the new branch and the iterating node in the rbtree.
If the first entry matches, the lookup on the current level of siblings
stops and propagates to the children of the matching node.

This results in fewer comparisons performed by the CPU.

Signed-off-by: Frederic Weisbecker 
Cc: Adrian Hunter 
Cc: David Ahern 
Cc: Ingo Molnar 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
---
 tools/perf/util/callchain.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/tools/perf/util/callchain.c b/tools/perf/util/callchain.c
index e3970e3..e5ed16d 100644
--- a/tools/perf/util/callchain.c
+++ b/tools/perf/util/callchain.c
@@ -15,6 +15,8 @@
 #include 
 #include 
 
+#include "asm/bug.h"
+
 #include "hist.h"
 #include "util.h"
 #include "callchain.h"
@@ -356,19 +358,14 @@ append_chain_children(struct callchain_node *root,
/* lookup in childrens */
while (*p) {
s64 ret;
-   struct callchain_list *cnode;
 
parent = *p;
rnode = rb_entry(parent, struct callchain_node, rb_node_in);
-   cnode = list_first_entry(&rnode->val, struct callchain_list,
-list);
 
-   /* just check first entry */
-   ret = match_chain(node, cnode);
-   if (ret == 0) {
-   append_chain(rnode, cursor, period);
+   /* If at least first entry matches, rely to children */
+   ret = append_chain(rnode, cursor, period);
+   if (ret == 0)
goto inc_children_hit;
-   }
 
if (ret < 0)
p = &parent->rb_left;
@@ -394,6 +391,7 @@ append_chain(struct callchain_node *root,
u64 start = cursor->pos;
bool found = false;
u64 matches;
+   int cmp = 0;
 
/*
 * Lookup in the current node
@@ -408,7 +406,8 @@ append_chain(struct callchain_node *root,
if (!node)
break;
 
-   if (match_chain(node, cnode) != 0)
+   cmp = match_chain(node, cnode);
+   if (cmp)
break;
 
found = true;
@@ -418,9 +417,10 @@ append_chain(struct callchain_node *root,
 
/* matches not, relay no the parent */
if (!found) {
+   WARN_ONCE(!cmp, "Chain comparison error\n");
cursor->curr = curr_snap;
cursor->pos = start;
-   return -1;
+   return cmp;
}
 
matches = cursor->pos - start;
-- 
1.8.3.1



[PATCH 1/3] perf tools: Do proper comm override error handling

2014-01-14 Thread Frederic Weisbecker
The comm overriding API ignores memory allocation failures by silently
keeping the previous, out-of-date comm.

As a result, the user may get buggy events without ever being notified
about the problem and its source.

Let's start fixing this by propagating the error from the API. Not all
callers may be doing proper error handling on comm set yet, but this
is the first step toward it.

Signed-off-by: Frederic Weisbecker 
Cc: Adrian Hunter 
Cc: David Ahern 
Cc: Ingo Molnar 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
---
 tools/perf/util/comm.c   | 19 ++-
 tools/perf/util/comm.h   |  2 +-
 tools/perf/util/thread.c |  5 -
 3 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/tools/perf/util/comm.c b/tools/perf/util/comm.c
index 67d1e40..f9e7776 100644
--- a/tools/perf/util/comm.c
+++ b/tools/perf/util/comm.c
@@ -94,19 +94,20 @@ struct comm *comm__new(const char *str, u64 timestamp)
return comm;
 }
 
-void comm__override(struct comm *comm, const char *str, u64 timestamp)
+int comm__override(struct comm *comm, const char *str, u64 timestamp)
 {
-   struct comm_str *old = comm->comm_str;
+   struct comm_str *new, *old = comm->comm_str;
 
-   comm->comm_str = comm_str__findnew(str, &comm_str_root);
-   if (!comm->comm_str) {
-   comm->comm_str = old;
-   return;
-   }
+   new = comm_str__findnew(str, &comm_str_root);
+   if (!new)
+   return -ENOMEM;
 
-   comm->start = timestamp;
-   comm_str__get(comm->comm_str);
+   comm_str__get(new);
comm_str__put(old);
+   comm->comm_str = new;
+   comm->start = timestamp;
+
+   return 0;
 }
 
 void comm__free(struct comm *comm)
diff --git a/tools/perf/util/comm.h b/tools/perf/util/comm.h
index 7a86e56..fac5bd5 100644
--- a/tools/perf/util/comm.h
+++ b/tools/perf/util/comm.h
@@ -16,6 +16,6 @@ struct comm {
 void comm__free(struct comm *comm);
 struct comm *comm__new(const char *str, u64 timestamp);
 const char *comm__str(const struct comm *comm);
-void comm__override(struct comm *comm, const char *str, u64 timestamp);
+int comm__override(struct comm *comm, const char *str, u64 timestamp);
 
 #endif  /* __PERF_COMM_H */
diff --git a/tools/perf/util/thread.c b/tools/perf/util/thread.c
index e394861..0358882 100644
--- a/tools/perf/util/thread.c
+++ b/tools/perf/util/thread.c
@@ -66,10 +66,13 @@ struct comm *thread__comm(const struct thread *thread)
 int thread__set_comm(struct thread *thread, const char *str, u64 timestamp)
 {
struct comm *new, *curr = thread__comm(thread);
+   int err;
 
/* Override latest entry if it had no specific time coverage */
if (!curr->start) {
-   comm__override(curr, str, timestamp);
+   err = comm__override(curr, str, timestamp);
+   if (err)
+   return err;
} else {
new = comm__new(str, timestamp);
if (!new)
-- 
1.8.3.1



[PATCH 3/3] perf tools: Remove unnecessary callchain cursor state restore on unmatch

2014-01-14 Thread Frederic Weisbecker
If a new callchain branch doesn't match a single entry of the node it is
compared against in append_chain(), then the cursor is expected to be at
the same position as it was before the comparison loop.

As such, there is no need to restore the cursor position on exit in the
case of non-matching branches.

Signed-off-by: Frederic Weisbecker 
Cc: Adrian Hunter 
Cc: David Ahern 
Cc: Ingo Molnar 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
---
 tools/perf/util/callchain.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/tools/perf/util/callchain.c b/tools/perf/util/callchain.c
index e5ed16d..85a54ad 100644
--- a/tools/perf/util/callchain.c
+++ b/tools/perf/util/callchain.c
@@ -386,7 +386,6 @@ append_chain(struct callchain_node *root,
 struct callchain_cursor *cursor,
 u64 period)
 {
-   struct callchain_cursor_node *curr_snap = cursor->curr;
struct callchain_list *cnode;
u64 start = cursor->pos;
bool found = false;
@@ -418,8 +417,6 @@ append_chain(struct callchain_node *root,
/* matches not, relay no the parent */
if (!found) {
WARN_ONCE(!cmp, "Chain comparison error\n");
-   cursor->curr = curr_snap;
-   cursor->pos = start;
return cmp;
}
 
-- 
1.8.3.1



Re: [PATCH RFC v2 0/4] perf: IRQ-bound performance events

2014-01-14 Thread Frederic Weisbecker
On Tue, Jan 14, 2014 at 05:07:52PM +0100, Alexander Gordeev wrote:
> On Mon, Jan 13, 2014 at 04:50:37PM +0100, Frederic Weisbecker wrote:
> > On Sat, Jan 04, 2014 at 07:22:32PM +0100, Alexander Gordeev wrote:
> > > Hello,
> > > 
> > > This is version 2 of RFC "perf: IRQ-bound performance events". That is an
> > > introduction of IRQ-bound performance events - ones that only count in a
> > > context of a hardware interrupt handler. Ingo suggested to extend this
> > > functionality to softirq and threaded handlers as well:
> > 
> > Hi Alexander,
> > 
> > I still strongly think we should use toggle events to achieve that:
> > https://lkml.org/lkml/2013/9/25/227
> 
> Hi Frederic,
> 
> The toggle events would not allow counting per-action in hardware interrupt
> context. The design I propose provisions any possible combination of actions/
> IRQs.

I think we could define one event per handler by using tracepoint filters.
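
The hard irq tracepoints already carry what's needed for that; from memory
they look roughly like this (include/trace/events/irq.h):

TRACE_EVENT(irq_handler_entry,

	TP_PROTO(int irq, struct irqaction *action),

	TP_ARGS(irq, action),

	TP_STRUCT__entry(
		__field(int, irq)
		__string(name, action->name)
	),

	TP_fast_assign(
		__entry->irq = irq;
		__assign_str(name, action->name);
	),

	TP_printk("irq=%d name=%s", __entry->irq, __get_str(name))
);

So a filter on irq and/or name, set through the event's filter file, should
be enough to restrict the toggling to one specific handler on a shared line.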

> 
> I.e. if we had few drivers on IRQn and few drivers on IRQm we could assign
> an event to let's say ISR0, ISR2 on IRQn and ISR1 on IRQm.

Yeah that should be possible with tracepoints as well.

> Moreover, given that hardware context handlers are running with local
> interrupts disabled and therefore an IRQ-bound event could be enabled/
> disabled only from a single handler at a time - we just need to allocate
> a single hardware counter for any possible combination.

Hmm, I don't get what you mean here. Why wouldn't tracepoint-defined events
work in this scenario?

> 
> I think it would be ideal if the two approaches could be converged somehow,
> but I just do not know how at the moment. I believe the next step is to
> measure the overhead Andi mentioned. That well might be a showstopper for
> either or both approaches.
> 
> By contrast with the hardware context, the toggle events seem to able
> monitoring softirq in its current form.
> 
> As of the threaded context handlers, I have not come up with the idea how to
> make it yet, but it does not seem the toggle events are able eigher.

A per-task event should do the trick for profiling threaded IRQs.
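
To show what I mean, here is a minimal sketch that counts cycles spent in a
threaded handler by opening a counting event on the irq thread's TID (the TID
passed on the command line is an assumption, grab it from ps; none of this is
part of your patchset):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
			   int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(int argc, char **argv)
{
	struct perf_event_attr attr;
	uint64_t count;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <tid of irq/NN-name thread>\n", argv[0]);
		return 1;
	}

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;

	/* pid = the irq thread's TID, cpu = -1: follow that task on any CPU */
	fd = perf_event_open(&attr, atoi(argv[1]), -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	sleep(10);

	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("cycles in the irq thread: %llu\n",
		       (unsigned long long)count);

	return 0;
}

The counter only runs while that task runs, which is exactly the threaded
part of the handler.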

Thanks.


Re: [PATCH 2/3] perf tools: Spare double comparison of callchain first entry

2014-01-15 Thread Frederic Weisbecker
On Wed, Jan 15, 2014 at 03:23:46PM +0900, Namhyung Kim wrote:
> On Tue, 14 Jan 2014 16:37:15 +0100, Frederic Weisbecker wrote:
> > When a new callchain child branch matches an existing one in the rbtree,
> > the comparison of its first entry is performed twice:
> >
> > 1) From append_chain_children() on branch lookup
> >
> > 2) If 1) reports a match, append_chain() then compares all entries of
> > the new branch against the matching node in the rbtree, and this
> > comparison includes the first entry of the new branch again.
> 
> Right.
> 
> >
> > Lets shortcut this by performing the whole comparison only from
> > append_chain() which then returns the result of the comparison between
> > the first entry of the new branch and the iterating node in the rbtree.
> > If the first entry matches, the lookup on the current level of siblings
> > stops and propagates to the children of the matching nodes.
> 
> Hmm..  it looks like that I thought directly calling append_chain() has
> some overhead - but it's not.

No, that's a valid concern. I worried as well because I wasn't sure whether
matches or mismatches are more common on the first entry. I'd tend to think
that the first entry mismatches most often, in which case calling
match_chain() first may be more efficient as a fast path (ie: calling
append_chain() involves one more function call and a few other details).

But in the end, measurements haven't shown a significant difference before
and after the patch.

> 
> >
> > This results in less comparisons performed by the CPU.
> 
> Do you have any numbers?  I suspect it'd not be a big change, but just
> curious.

So I compared before/after the patchset (which includes the cursor restore
removal) with:

1) Some big hackbench-like load that generates > 200 MB perf.data

perf record -g -- perf bench sched messaging -l $SOME_BIG_NUMBER

2) Compare before/after with the following reports:

perf stat perf report --stdio > /dev/null
perf stat perf report --stdio -s sym > /dev/null
perf stat perf report --stdio -G > /dev/null
perf stat perf report --stdio -g fractal,0.5,caller,address > /dev/null 

And most of the time I had a < 0.01% difference in completion time in favour
of the patchset (which may eventually be due to the removed cursor restore
patch).

So, all in all, there was no really interesting difference. If you want the
actual numbers I can definitely relaunch the tests.

> >
> > Signed-off-by: Frederic Weisbecker 
> 
> Reviewed-by: Namhyung Kim 

Thanks!

> 
> Thanks,
> Namhyung

