Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-02 Thread Paul E. McKenney
On Wed, Oct 02, 2013 at 04:00:20PM +0200, Oleg Nesterov wrote:
> On 10/02, Peter Zijlstra wrote:
> >
> > On Wed, Oct 02, 2013 at 02:13:56PM +0200, Oleg Nesterov wrote:
> > > In short: unless a gp elapses between _exit() and _enter(), the next
> > > _enter() does nothing and avoids synchronize_sched().
> >
> > That does however make the entire scheme entirely writer biased;
> 
> Well, this makes the scheme "a bit more" writer biased, but this is
> exactly what we want in this case.
> 
> We do not block the readers after xxx_exit() entirely, but we do want
> to keep them in SLOW state and avoid the costly SLOW -> FAST -> SLOW
> transitions.

Yes -- should help -a- -lot- for bulk write-side operations, such as
onlining all CPUs at boot time.  ;-)

Thanx, Paul

> Let's even forget about disable_nonboot_cpus(); let's consider
> percpu_rwsem-like logic "in general".
> 
> Yes, it is heavily optimized for readers. But if the writers come in
> a batch, or the same writer does down_write + up_write twice or more,
> I think state == FAST is pointless in between (if we can avoid it).
> This is the rare case (the writers should be rare), but if it happens
> it makes sense to optimize the writers too. And again, even
> 
>   for (;;) {
>   percpu_down_write();
>   percpu_up_write();
>   }
> 
> should not completely block the readers.
> 
> IOW. "turn sync_sched() into call_rcu_sched() in up_write()" is obviously
> a win. If the next down_write/xxx_enter "knows" that the readers are
> still in SLOW mode because gp was not completed yet, why should we
> add the artificial delay?
> 
> As for disable_nonboot_cpus(). You are going to move cpu_hotplug_begin()
> outside of the loop, this is the same thing.
> 
> Oleg.
> 



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-02 Thread Oleg Nesterov
On 10/02, Peter Zijlstra wrote:
>
> On Wed, Oct 02, 2013 at 04:00:20PM +0200, Oleg Nesterov wrote:
> > And again, even
> >
> > for (;;) {
> > percpu_down_write();
> > percpu_up_write();
> > }
> >
> > should not completely block the readers.
>
> Sure there's a tiny window, but don't forget that a reader will have to
> wait for the gp_state cacheline to transfer to shared state and the
> per-cpu refcount cachelines to be brought back into exclusive mode and
> the above can be aggressive enough that by that time we'll observe
> state == blocked again.

Sure, but don't forget that other callers of cpu_down() do a lot more
work before/after they actually call cpu_hotplug_begin/end().

> So I'll stick to waitcount -- as you can see in the patches I've just
> posted.

I still do not believe we need this waitcount "in practice" ;)

But even if I am right this is minor and we can reconsider this later,
so please forget.

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-02 Thread Peter Zijlstra
On Wed, Oct 02, 2013 at 04:00:20PM +0200, Oleg Nesterov wrote:
> And again, even
> 
>   for (;;) {
>   percpu_down_write();
>   percpu_up_write();
>   }
> 
> should not completely block the readers.

Sure there's a tiny window, but don't forget that a reader will have to
wait for the gp_state cacheline to transfer to shared state and the
per-cpu refcount cachelines to be brought back into exclusive mode, and
the above can be aggressive enough that by that time we'll observe
state == blocked again.

So I don't think that in practice a reader will get in.

Also, since the write side is exposed to userspace, you've got an
effective DoS.

So I'll stick to waitcount -- as you can see in the patches I've just
posted.


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-02 Thread Oleg Nesterov
On 10/02, Peter Zijlstra wrote:
>
> On Wed, Oct 02, 2013 at 02:13:56PM +0200, Oleg Nesterov wrote:
> > In short: unless a gp elapses between _exit() and _enter(), the next
> > _enter() does nothing and avoids synchronize_sched().
>
> That does however make the entire scheme entirely writer biased;

Well, this makes the scheme "a bit more" writer biased, but this is
exactly what we want in this case.

We do not block the readers after xxx_exit() entirely, but we do want
to keep them in SLOW state and avoid the costly SLOW -> FAST -> SLOW
transitions.

Let's even forget about disable_nonboot_cpus(); let's consider
percpu_rwsem-like logic "in general".

Yes, it is heavily optimized for readers. But if the writers come in
a batch, or the same writer does down_write + up_write twice or more,
I think state == FAST is pointless in between (if we can avoid it).
This is the rare case (the writers should be rare), but if it happens
it makes sense to optimize the writers too. And again, even

for (;;) {
percpu_down_write();
percpu_up_write();
}

should not completely block the readers.

IOW, "turn sync_sched() into call_rcu_sched() in up_write()" is obviously
a win. If the next down_write/xxx_enter "knows" that the readers are
still in SLOW mode because gp was not completed yet, why should we
add the artificial delay?

As for disable_nonboot_cpus(). You are going to move cpu_hotplug_begin()
outside of the loop, this is the same thing.

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-02 Thread Peter Zijlstra
On Wed, Oct 02, 2013 at 02:13:56PM +0200, Oleg Nesterov wrote:
> In short: unless a gp elapses between _exit() and _enter(), the next
> _enter() does nothing and avoids synchronize_sched().

That does however make the entire scheme entirely writer biased,
increasing the need for the waitcount thing I have. Otherwise we'll
starve pending readers.
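
Roughly, the waitcount idea looks like this (illustrative names only, not
the exact code from the posted patches; the real reader slow path also
re-takes the per-cpu refcount, which is omitted here):

static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
static atomic_t cpuhp_waitcount;
extern int __cpuhp_state;		/* nonzero while a writer blocks readers */

void __get_online_cpus(void)		/* reader slow path, sketch */
{
	atomic_inc(&cpuhp_waitcount);

	/* sleep until the current writer stops blocking readers */
	wait_event(cpuhp_readers, !__cpuhp_state);

	/* the last waiter lets the next writer proceed */
	if (atomic_dec_and_test(&cpuhp_waitcount))
		wake_up(&cpuhp_writer);
}

void cpu_hotplug_wait_readers(void)	/* hypothetical writer-side gate */
{
	/* don't block new readers until the previous batch got through */
	wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
}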


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-02 Thread Peter Zijlstra
On Wed, Oct 02, 2013 at 02:13:56PM +0200, Oleg Nesterov wrote:
> On 10/02, Peter Zijlstra wrote:
> > And given the construct, I'm not entirely sure you can do away with the
> > sync_sched() in between. While it's clear to me you can merge the two
> > into one, leaving it out entirely doesn't seem right.
> 
> Could you explain?

Somehow I thought the fastpath got enabled; it doesn't, since we never
hit GP_IDLE, so we don't actually need that.

You're right.


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-02 Thread Oleg Nesterov
On 10/01, Paul E. McKenney wrote:
>
> On Tue, Oct 01, 2013 at 08:07:50PM +0200, Oleg Nesterov wrote:
> > On 10/01, Peter Zijlstra wrote:
> > >
> > > On Tue, Oct 01, 2013 at 07:45:08PM +0200, Oleg Nesterov wrote:
> > > >
> > > > I tend to agree with Srivatsa... Without a strong reason it would be 
> > > > better
> > > > to preserve the current logic: "some time after" should not be after the
> > > > next CPU_DOWN/UP*. But I won't argue too much.
> > >
> > > Nah, I think breaking it is the right thing :-)
> >
> > I don't really agree but I won't argue ;)
>
> The authors of arch/x86/kernel/cpu/mcheck/mce.c would seem to be the
> guys who would need to complain, given that they seem to have the only
> use in 3.11.

mce_cpu_callback() is fine; it ignores POST_DEAD if CPU_TASKS_FROZEN.

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-02 Thread Oleg Nesterov
On 10/02, Peter Zijlstra wrote:
>
> On Tue, Oct 01, 2013 at 08:07:50PM +0200, Oleg Nesterov wrote:
> > > > But note that you do not strictly need this change. Just kill 
> > > > cpuhp_waitcount,
> > > > then we can change cpu_hotplug_begin/end to use xxx_enter/exit we 
> > > > discuss in
> > > > another thread, this should likely "join" all synchronize_sched's.
> > >
> > > That would still be 4k * sync_sched() == terribly long.
> >
> > No? the next xxx_enter() avoids sync_sched() if rcu callback is still
> > pending. Unless __cpufreq_remove_dev_finish() is "too slow" of course.
>
> Hmm,. not in the version you posted; there xxx_enter() would only not do
> the sync_sched if there's a concurrent 'writer', in which case it will
> wait for it.

No, please see below.

> You only avoid the sync_sched in xxx_exit() and potentially join in the
> sync_sched() of a next xxx_begin().
>
> So with that scheme:
>
>   for (i = 0; i < 4096; i++) {
> xxx_begin();
> xxx_exit();
>   }
>
> Will get 4096 sync_sched() calls from the xxx_begin() and all but the
> last xxx_exit() will 'drop' the rcu callback.

No, the code above should call sync_sched() only once, no matter what
this code does between _enter and _exit. This was one of the points.

To clarify, of course I mean the "likely" case. Say, a long preemption
after _exit can lead to another sync_sched().

void xxx_enter(struct xxx_struct *xxx)
{
	bool need_wait, need_sync;

	spin_lock_irq(&xxx->xxx_lock);
	need_wait = xxx->gp_count++;
	need_sync = xxx->gp_state == GP_IDLE;
	if (need_sync)
		xxx->gp_state = GP_PENDING;
	spin_unlock_irq(&xxx->xxx_lock);

	BUG_ON(need_wait && need_sync);

	if (need_sync) {
		synchronize_sched();
		xxx->gp_state = GP_PASSED;
		wake_up_all(&xxx->gp_waitq);
	} else if (need_wait) {
		wait_event(xxx->gp_waitq, xxx->gp_state == GP_PASSED);
	} else {
		BUG_ON(xxx->gp_state != GP_PASSED);
	}
}

The 1st iteration:

xxx_enter() does synchronize_sched() and sets gp_state = GP_PASSED.

xxx_exit() starts the rcu callback, but gp_state is still PASSED.

all other iterations in the "likely" case:

xxx_enter() should likely come before the pending callback fires
and clears gp_state. In this case we only increment ->gp_count
(this "disables" the rcu callback) and do nothing more, gp_state
is still GP_PASSED.

xxx_exit() does another call_rcu_sched(), or does the
CB_PENDING -> CB_REPLAY change. The latter is the same as "start
another callback".

In short: unless a gp elapses between _exit() and _enter(), the next
_enter() does nothing and avoids synchronize_sched().
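
To make the _exit() side concrete, here is a sketch of what I mean, assuming
xxx_struct also has a cb_state (CB_IDLE/CB_PENDING/CB_REPLAY) and an rcu_head
cb_head; this only illustrates the "start another callback" vs
CB_PENDING -> CB_REPLAY choice above, it is not the exact code:

static void xxx_rcu_cb(struct rcu_head *rcu)
{
	struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);

	spin_lock_irq(&xxx->xxx_lock);
	if (xxx->cb_state == CB_REPLAY) {
		/* another _exit() happened meanwhile, re-arm the callback */
		xxx->cb_state = CB_PENDING;
		call_rcu_sched(&xxx->cb_head, xxx_rcu_cb);
	} else {
		/* a full gp passed; if no writer is inside, re-enable the fast path */
		if (!xxx->gp_count)
			xxx->gp_state = GP_IDLE;
		xxx->cb_state = CB_IDLE;
	}
	spin_unlock_irq(&xxx->xxx_lock);
}

void xxx_exit(struct xxx_struct *xxx)
{
	spin_lock_irq(&xxx->xxx_lock);
	BUG_ON(xxx->gp_state == GP_IDLE);
	xxx->gp_count--;
	if (xxx->cb_state == CB_IDLE) {
		xxx->cb_state = CB_PENDING;
		call_rcu_sched(&xxx->cb_head, xxx_rcu_cb);
	} else if (xxx->cb_state == CB_PENDING) {
		xxx->cb_state = CB_REPLAY;
	}
	spin_unlock_irq(&xxx->xxx_lock);
}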

> And given the construct, I'm not entirely sure you can do away with the
> sync_sched() in between. While it's clear to me you can merge the two
> into one, leaving it out entirely doesn't seem right.

Could you explain?

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-02 Thread Srivatsa S. Bhat
On 10/01/2013 11:44 PM, Srivatsa S. Bhat wrote:
> On 10/01/2013 11:06 PM, Peter Zijlstra wrote:
>> On Tue, Oct 01, 2013 at 10:41:15PM +0530, Srivatsa S. Bhat wrote:
>>> However, as Oleg said, its definitely worth considering whether this 
>>> proposed
>>> change in semantics is going to hurt us in the future. CPU_POST_DEAD has 
>>> certainly
>>> proved to be very useful in certain challenging situations (commit 
>>> 1aee40ac9c
>>> explains one such example), so IMHO we should be very careful not to 
>>> undermine
>>> its utility.
>>
>> Urgh.. crazy things. I've always understood POST_DEAD to mean 'will be
>> called at some time after the unplug' with no further guarantees. And my
>> patch preserves that.
>>
>> Its not at all clear to me why cpufreq needs more; 1aee40ac9c certainly
>> doesn't explain it.
>>
> 
> Sorry if I was unclear - I didn't mean to say that cpufreq needs more 
> guarantees
> than that. I was just saying that the cpufreq code would need certain 
> additional
> changes/restructuring to accommodate the change in the semantics brought about
> by this patch. IOW, it won't work as it is, but it can certainly be fixed.
> 


Ok, so I thought a bit more about the changes you are proposing, and I agree
that they would be beneficial in the long run, especially given that it can
eventually lead to a more streamlined hotplug process where different CPUs
can be hotplugged independently without waiting on each other, like you
mentioned in your other mail. So I'm fine with the new POST_DEAD guarantees
you are proposing - that they are run after unplug, and will be completed
before UP_PREPARE of the same CPU. And it's also very convenient that we need
to fix only cpufreq to accommodate this change.

So below is a quick untested patch that modifies the cpufreq hotplug
callbacks appropriately. With this, cpufreq should be able to handle the
POST_DEAD changes, irrespective of whether we do that in the regular path
or in the suspend/resume path. (Because I've restructured it in such a way
that the races that I had mentioned earlier are totally avoided. That is,
the POST_DEAD handler now performs only the bare minimum of final cleanup,
which doesn't race with or depend on anything else.)



diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 04548f7..0a33c1a 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1165,7 +1165,7 @@ static int __cpufreq_remove_dev_prepare(struct device *dev,
bool frozen)
 {
unsigned int cpu = dev->id, cpus;
-   int new_cpu, ret;
+   int new_cpu, ret = 0;
unsigned long flags;
struct cpufreq_policy *policy;
 
@@ -1200,9 +1200,10 @@ static int __cpufreq_remove_dev_prepare(struct device *dev,
policy->governor->name, CPUFREQ_NAME_LEN);
 #endif
 
-   lock_policy_rwsem_read(cpu);
+   lock_policy_rwsem_write(cpu);
cpus = cpumask_weight(policy->cpus);
-   unlock_policy_rwsem_read(cpu);
+   cpumask_clear_cpu(cpu, policy->cpus);
+   unlock_policy_rwsem_write(cpu);
 
if (cpu != policy->cpu) {
if (!frozen)
@@ -1220,7 +1221,23 @@ static int __cpufreq_remove_dev_prepare(struct device *dev,
}
}
 
-   return 0;
+   /* If no target, nothing more to do */
+   if (!cpufreq_driver->target)
+   return 0;
+
+   /* If cpu is last user of policy, cleanup the policy governor */
+   if (cpus == 1) {
+   ret = __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
+   if (ret)
+   pr_err("%s: Failed to exit governor\n", __func__);
+   } else {
+   if ((ret = __cpufreq_governor(policy, CPUFREQ_GOV_START)) ||
+   (ret = __cpufreq_governor(policy, CPUFREQ_GOV_LIMITS))) {
+   pr_err("%s: Failed to start governor\n", __func__);
+   }
+   }
+
+   return ret;
 }
 
 static int __cpufreq_remove_dev_finish(struct device *dev,
@@ -1243,25 +1260,12 @@ static int __cpufreq_remove_dev_finish(struct device *dev,
return -EINVAL;
}
 
-   WARN_ON(lock_policy_rwsem_write(cpu));
+   WARN_ON(lock_policy_rwsem_read(cpu));
cpus = cpumask_weight(policy->cpus);
-
-   if (cpus > 1)
-   cpumask_clear_cpu(cpu, policy->cpus);
-   unlock_policy_rwsem_write(cpu);
+   unlock_policy_rwsem_read(cpu);
 
/* If cpu is last user of policy, free policy */
-   if (cpus == 1) {
-   if (cpufreq_driver->target) {
-   ret = __cpufreq_governor(policy,
-   CPUFREQ_GOV_POLICY_EXIT);
-   if (ret) {
-   pr_err("%s: Failed to exit governor\n",
-   __func__);
-   return ret;
-   }
-   }
-
+   

Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-02 Thread Peter Zijlstra
On Tue, Oct 01, 2013 at 08:07:50PM +0200, Oleg Nesterov wrote:
> > > But note that you do not strictly need this change. Just kill 
> > > cpuhp_waitcount,
> > > then we can change cpu_hotplug_begin/end to use xxx_enter/exit we discuss 
> > > in
> > > another thread, this should likely "join" all synchronize_sched's.
> >
> > That would still be 4k * sync_sched() == terribly long.
> 
> No? the next xxx_enter() avoids sync_sched() if rcu callback is still
> pending. Unless __cpufreq_remove_dev_finish() is "too slow" of course.

Hmm,. not in the version you posted; there xxx_enter() would only not do
the sync_sched if there's a concurrent 'writer', in which case it will
wait for it.

You only avoid the sync_sched in xxx_exit() and potentially join in the
sync_sched() of a next xxx_begin().

So with that scheme:

  for (i = 0; i < 4096; i++) {
xxx_begin();
xxx_exit();
  }

Will get 4096 sync_sched() calls from the xxx_begin() and all but the
last xxx_exit() will 'drop' the rcu callback.

And given the construct, I'm not entirely sure you can do away with the
sync_sched() in between. While it's clear to me you can merge the two
into one, leaving it out entirely doesn't seem right.


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-01 Thread Paul E. McKenney
On Thu, Sep 26, 2013 at 01:10:42PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 25, 2013 at 02:22:00PM -0700, Paul E. McKenney wrote:
> > A couple of nits and some commentary, but if there are races, they are
> > quite subtle.  ;-)
> 
> *whee*..
> 
> I made one little change in the logic; I moved the waitcount increment
> to before the __put_online_cpus() call, such that the writer will have
> to wait for us to wake up before trying again -- not for us to actually
> have acquired the read lock, for that we'd need to mess up
> __get_online_cpus() a bit more.
> 
> Complete patch below.

OK, looks like Oleg is correct: the cpuhp_seq can be dispensed with.

I still don't see anything wrong with it, so time for a serious stress
test on a large system.  ;-)

Additional commentary interspersed.

Thanx, Paul

> ---
> Subject: hotplug: Optimize {get,put}_online_cpus()
> From: Peter Zijlstra 
> Date: Tue Sep 17 16:17:11 CEST 2013
> 
> The current implementation of get_online_cpus() is global of nature
> and thus not suited for any kind of common usage.
> 
> Re-implement the current recursive r/w cpu hotplug lock such that the
> read side locks are as light as possible.
> 
> The current cpu hotplug lock is entirely reader biased; but since
> readers are expensive there aren't a lot of them about and writer
> starvation isn't a particular problem.
> 
> However by making the reader side more usable there is a fair chance
> it will get used more and thus the starvation issue becomes a real
> possibility.
> 
> Therefore this new implementation is fair, alternating readers and
> writers; this however requires per-task state to allow the reader
> recursion.
> 
> Many comments are contributed by Paul McKenney, and many previous
> attempts were shown to be inadequate by both Paul and Oleg; many
> thanks to them for persisting to poke holes in my attempts.
> 
> Cc: Oleg Nesterov 
> Cc: Paul McKenney 
> Cc: Thomas Gleixner 
> Cc: Steven Rostedt 
> Signed-off-by: Peter Zijlstra 
> ---
>  include/linux/cpu.h   |   58 +
>  include/linux/sched.h |3 
>  kernel/cpu.c  |  209 
> +++---
>  kernel/sched/core.c   |2 
>  4 files changed, 208 insertions(+), 64 deletions(-)

I stripped the removed lines to keep my eyes from going buggy.

> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  struct device;
> 
> @@ -173,10 +174,61 @@ extern struct bus_type cpu_subsys;
>  #ifdef CONFIG_HOTPLUG_CPU
>  /* Stop CPUs going up and down. */
> 
> +extern void cpu_hotplug_init_task(struct task_struct *p);
> +
>  extern void cpu_hotplug_begin(void);
>  extern void cpu_hotplug_done(void);
> +
> +extern int __cpuhp_state;
> +DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
> +
> +extern void __get_online_cpus(void);
> +
> +static inline void get_online_cpus(void)
> +{
> + might_sleep();
> +
> + /* Support reader recursion */
> + /* The value was >= 1 and remains so, reordering causes no harm. */
> + if (current->cpuhp_ref++)
> + return;
> +
> + preempt_disable();
> + if (likely(!__cpuhp_state)) {
> + /* The barrier here is supplied by synchronize_sched(). */

I guess I shouldn't complain about the comment given where it came
from, but...

A more accurate comment would say that we are in an RCU-sched read-side
critical section, so the writer cannot both change __cpuhp_state from
readers_fast and start checking counters while we are here.  So if we see
!__cpuhp_state, we know that the writer won't be checking until we are past
the preempt_enable() and that once the synchronize_sched() is done,
the writer will see anything we did within this RCU-sched read-side
critical section.

(The writer -can- change __cpuhp_state from readers_slow to readers_block
while we are in this read-side critical section and then start summing
counters, but that corresponds to a different "if" statement.)

> + __this_cpu_inc(__cpuhp_refcount);
> + } else {
> + __get_online_cpus(); /* Unconditional memory barrier. */
> + }
> + preempt_enable();
> + /*
> +  * The barrier() from preempt_enable() prevents the compiler from
> +  * bleeding the critical section out.
> +  */
> +}
> +
> +extern void __put_online_cpus(void);
> +
> +static inline void put_online_cpus(void)
> +{
> + /* The value was >= 1 and remains so, reordering causes no harm. */
> + if (--current->cpuhp_ref)
> + return;
> +
> + /*
> +  * The barrier() in preempt_disable() prevents the compiler from
> +  * bleeding the critical section out.
> +  */
> + preempt_disable();
> + if (likely(!__cpuhp_state)) {
> + /* The barrier here is supplied by synchronize_sched().  */

Same here, both for the implied self-criticism and the more complete story.

Due to the basic RCU gua

Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-01 Thread Paul E. McKenney
On Tue, Oct 01, 2013 at 08:07:50PM +0200, Oleg Nesterov wrote:
> On 10/01, Peter Zijlstra wrote:
> >
> > On Tue, Oct 01, 2013 at 07:45:08PM +0200, Oleg Nesterov wrote:
> > >
> > > I tend to agree with Srivatsa... Without a strong reason it would be 
> > > better
> > > to preserve the current logic: "some time after" should not be after the
> > > next CPU_DOWN/UP*. But I won't argue too much.
> >
> > Nah, I think breaking it is the right thing :-)
> 
> I don't really agree but I won't argue ;)

The authors of arch/x86/kernel/cpu/mcheck/mce.c would seem to be the
guys who would need to complain, given that they seem to have the only
use in 3.11.

Thanx, Paul

> > > But note that you do not strictly need this change. Just kill 
> > > cpuhp_waitcount,
> > > then we can change cpu_hotplug_begin/end to use xxx_enter/exit we discuss 
> > > in
> > > another thread, this should likely "join" all synchronize_sched's.
> >
> > That would still be 4k * sync_sched() == terribly long.
> 
> No? the next xxx_enter() avoids sync_sched() if rcu callback is still
> pending. Unless __cpufreq_remove_dev_finish() is "too slow" of course.
> 
> > > Or split cpu_hotplug_begin() into 2 helpers which handle FAST -> SLOW and
> > > SLOW -> BLOCK transitions, then move the first "FAST -> SLOW" handler 
> > > outside
> > > of for_each_online_cpu().
> >
> > Right, that's more messy but would work if we cannot teach cpufreq (and
> > possibly others) to not rely on state you shouldn't rely on anyway.
> 
> Yes,
> 
> > I think the only guarantee POST_DEAD should have is that it should be
> > called before UP_PREPARE of the same cpu ;-) Nothing more, nothing less.
> 
> See above... This makes POST_DEAD really "special" compared to other
> CPU_* events.
> 
> And again. Something like a global lock taken by CPU_DOWN_PREPARE and
> released by POST_DEAD or DOWN_FAILED does not look "too wrong" to me.
> 
> But I leave this to you and Srivatsa.
> 
> Oleg.
> 



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-01 Thread Srivatsa S. Bhat
On 10/01/2013 11:26 PM, Peter Zijlstra wrote:
> On Tue, Oct 01, 2013 at 07:45:08PM +0200, Oleg Nesterov wrote:
>> On 10/01, Peter Zijlstra wrote:
>>>
>>> On Tue, Oct 01, 2013 at 10:41:15PM +0530, Srivatsa S. Bhat wrote:
 However, as Oleg said, its definitely worth considering whether this 
 proposed
 change in semantics is going to hurt us in the future. CPU_POST_DEAD has 
 certainly
 proved to be very useful in certain challenging situations (commit 
 1aee40ac9c
 explains one such example), so IMHO we should be very careful not to 
 undermine
 its utility.
>>>
>>> Urgh.. crazy things. I've always understood POST_DEAD to mean 'will be
>>> called at some time after the unplug' with no further guarantees. And my
>>> patch preserves that.
>>
>> I tend to agree with Srivatsa... Without a strong reason it would be better
>> to preserve the current logic: "some time after" should not be after the
>> next CPU_DOWN/UP*. But I won't argue too much.
> 
> Nah, I think breaking it is the right thing :-)
> 
>> But note that you do not strictly need this change. Just kill 
>> cpuhp_waitcount,
>> then we can change cpu_hotplug_begin/end to use xxx_enter/exit we discuss in
>> another thread, this should likely "join" all synchronize_sched's.
> 
> That would still be 4k * sync_sched() == terribly long.
> 
>> Or split cpu_hotplug_begin() into 2 helpers which handle FAST -> SLOW and
>> SLOW -> BLOCK transitions, then move the first "FAST -> SLOW" handler outside
>> of for_each_online_cpu().
> 
> Right, that's more messy but would work if we cannot teach cpufreq (and
> possibly others) to not rely on state you shouldn't rely on anyway.
> 
> I think the only guarantee POST_DEAD should have is that it should be
> called before UP_PREPARE of the same cpu ;-) Nothing more, nothing less.
> 

Conceptually, that hints at a totally per-cpu implementation of CPU hotplug,
in which what happens to one CPU doesn't affect the others in the hotplug
path... and yeah, that sounds very tempting! ;-) But I guess that will
need to be preceded by a massive rework of many of the existing hotplug
callbacks ;-)

Regards,
Srivatsa S. Bhat



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-01 Thread Srivatsa S. Bhat
On 10/01/2013 11:44 PM, Srivatsa S. Bhat wrote:
> On 10/01/2013 11:06 PM, Peter Zijlstra wrote:
>> On Tue, Oct 01, 2013 at 10:41:15PM +0530, Srivatsa S. Bhat wrote:
>>> However, as Oleg said, its definitely worth considering whether this 
>>> proposed
>>> change in semantics is going to hurt us in the future. CPU_POST_DEAD has 
>>> certainly
>>> proved to be very useful in certain challenging situations (commit 
>>> 1aee40ac9c
>>> explains one such example), so IMHO we should be very careful not to 
>>> undermine
>>> its utility.
>>
>> Urgh.. crazy things. I've always understood POST_DEAD to mean 'will be
>> called at some time after the unplug' with no further guarantees. And my
>> patch preserves that.
>>
>> Its not at all clear to me why cpufreq needs more; 1aee40ac9c certainly
>> doesn't explain it.
>>
> 
> Sorry if I was unclear - I didn't mean to say that cpufreq needs more 
> guarantees
> than that. I was just saying that the cpufreq code would need certain 
> additional
> changes/restructuring to accommodate the change in the semantics brought about
> by this patch. IOW, it won't work as it is, but it can certainly be fixed.
> 

And an important reason why this change can be accommodated without much
trouble is that you are changing it only in the suspend/resume path, where
userspace has already been frozen, so all hotplug operations are initiated by
the suspend path and that path *alone* (and so we enjoy certain "simplifiers"
that we know beforehand, e.g. all of them are CPU offline operations,
happening one at a time, in sequence), and we don't expect any "interference"
with this routine ;-). As a result the number and variety of races that we
need to take care of is far smaller. (For example, we don't have to worry
about the deadlock caused by sysfs-writes that 1aee40ac9c was talking about.)

On the other hand, if the proposal were to change the regular hotplug path as
well, along the same lines, then I guess it would have been a little more
difficult to adjust to it. For example, in cpufreq, _dev_prepare() sends a
STOP to the governor, whereas a part of _dev_finish() sends a START to it; so
we might have races there, due to which we might proceed with CPU offline
with a running governor, depending on the exact timing of the events. Of
course, this problem doesn't occur in the suspend/resume case, and hence I
didn't bring it up in my previous mail.

So this is another reason why I'm a little concerned about POST_DEAD: since
this is a change in semantics, it might be worth asking ourselves whether
we'd still want to go with that change if we happened to be changing regular
hotplug as well, rather than just the more controlled environment of
suspend/resume. Yes, I know that's not what you proposed, but I feel it might
be worth considering its implications while deciding how to solve the
POST_DEAD issue.

Regards,
Srivatsa S. Bhat



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-01 Thread Srivatsa S. Bhat
On 10/01/2013 11:06 PM, Peter Zijlstra wrote:
> On Tue, Oct 01, 2013 at 10:41:15PM +0530, Srivatsa S. Bhat wrote:
>> However, as Oleg said, its definitely worth considering whether this proposed
>> change in semantics is going to hurt us in the future. CPU_POST_DEAD has 
>> certainly
>> proved to be very useful in certain challenging situations (commit 1aee40ac9c
>> explains one such example), so IMHO we should be very careful not to 
>> undermine
>> its utility.
> 
> Urgh.. crazy things. I've always understood POST_DEAD to mean 'will be
> called at some time after the unplug' with no further guarantees. And my
> patch preserves that.
> 
> Its not at all clear to me why cpufreq needs more; 1aee40ac9c certainly
> doesn't explain it.
>

Sorry if I was unclear - I didn't mean to say that cpufreq needs more guarantees
than that. I was just saying that the cpufreq code would need certain additional
changes/restructuring to accommodate the change in the semantics brought about
by this patch. IOW, it won't work as it is, but it can certainly be fixed.

My other point (unrelated to cpufreq) was this: POST_DEAD of course means
that it will be called after unplug, with hotplug lock dropped. But it also
provides the guarantee (in the existing code) that a *new* hotplug operation
won't start until the POST_DEAD stage is also completed. This patch doesn't seem
to honor that part. The concern I have is in cases like those mentioned by
Oleg - say you take a lock at DOWN_PREPARE and want to drop it at POST_DEAD;
or some other requirement that makes it important to finish a full hotplug cycle
before moving on to the next one. I don't really have such a requirement in mind
at present, but I was just trying to think what we would be losing with this
change...

But to reiterate, I believe cpufreq can be reworked so that it doesn't depend
on things such as the above. But I wonder if dropping that latter guarantee
is going to be OK, going forward.

Regards,
Srivatsa S. Bhat
 
> What's wrong with leaving a cleanup handle in percpu storage and
> effectively doing:
> 
> struct cpu_destroy {
>   void (*destroy)(void *);
>   void *args;
> };
> 
> DEFINE_PER_CPU(struct cpu_destroy, cpu_destroy);
> 
>   POST_DEAD:
>   {
>   struct cpu_destroy x = per_cpu(cpu_destroy, cpu);
>   if (x.destroy)
>   x.destroy(x.arg);
>   }
> 
> POST_DEAD cannot fail; so CPU_DEAD/CPU_DOWN_PREPARE can simply assume it
> will succeed; it has to.
> 
> The cpufreq situation simply doesn't make any kind of sense to me.
> 
> 



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-01 Thread Oleg Nesterov
On 10/01, Peter Zijlstra wrote:
>
> On Tue, Oct 01, 2013 at 07:45:08PM +0200, Oleg Nesterov wrote:
> >
> > I tend to agree with Srivatsa... Without a strong reason it would be better
> > to preserve the current logic: "some time after" should not be after the
> > next CPU_DOWN/UP*. But I won't argue too much.
>
> Nah, I think breaking it is the right thing :-)

I don't really agree but I won't argue ;)

> > But note that you do not strictly need this change. Just kill 
> > cpuhp_waitcount,
> > then we can change cpu_hotplug_begin/end to use xxx_enter/exit we discuss in
> > another thread, this should likely "join" all synchronize_sched's.
>
> That would still be 4k * sync_sched() == terribly long.

No? the next xxx_enter() avoids sync_sched() if rcu callback is still
pending. Unless __cpufreq_remove_dev_finish() is "too slow" of course.

> > Or split cpu_hotplug_begin() into 2 helpers which handle FAST -> SLOW and
> > SLOW -> BLOCK transitions, then move the first "FAST -> SLOW" handler 
> > outside
> > of for_each_online_cpu().
>
> Right, that's more messy but would work if we cannot teach cpufreq (and
> possibly others) to not rely on state you shouldn't rely on anyway.

Yes,

> I think the only guarantee POST_DEAD should have is that it should be
> called before UP_PREPARE of the same cpu ;-) Nothing more, nothing less.

See above... This makes POST_DEAD really "special" compared to other
CPU_* events.

And again. Something like a global lock taken by CPU_DOWN_PREPARE and
released by POST_DEAD or DOWN_FAILED does not look "too wrong" to me.
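
To illustrate what I mean (a purely hypothetical notifier, not existing
code), something like this is only correct if POST_DEAD/DOWN_FAILED of one
cycle runs before the next DOWN_PREPARE:

static DEFINE_MUTEX(my_teardown_lock);

static int my_notifier(struct notifier_block *nb, unsigned long action,
		       void *hcpu)
{
	switch (action & ~CPU_TASKS_FROZEN) {
	case CPU_DOWN_PREPARE:
		mutex_lock(&my_teardown_lock);
		break;
	case CPU_DOWN_FAILED:
	case CPU_POST_DEAD:
		mutex_unlock(&my_teardown_lock);
		break;
	}
	return NOTIFY_OK;
}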

But I leave this to you and Srivatsa.

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-01 Thread Peter Zijlstra
On Tue, Oct 01, 2013 at 07:45:08PM +0200, Oleg Nesterov wrote:
> On 10/01, Peter Zijlstra wrote:
> >
> > On Tue, Oct 01, 2013 at 10:41:15PM +0530, Srivatsa S. Bhat wrote:
> > > However, as Oleg said, its definitely worth considering whether this 
> > > proposed
> > > change in semantics is going to hurt us in the future. CPU_POST_DEAD has 
> > > certainly
> > > proved to be very useful in certain challenging situations (commit 
> > > 1aee40ac9c
> > > explains one such example), so IMHO we should be very careful not to 
> > > undermine
> > > its utility.
> >
> > Urgh.. crazy things. I've always understood POST_DEAD to mean 'will be
> > called at some time after the unplug' with no further guarantees. And my
> > patch preserves that.
> 
> I tend to agree with Srivatsa... Without a strong reason it would be better
> to preserve the current logic: "some time after" should not be after the
> next CPU_DOWN/UP*. But I won't argue too much.

Nah, I think breaking it is the right thing :-)

> But note that you do not strictly need this change. Just kill cpuhp_waitcount,
> then we can change cpu_hotplug_begin/end to use xxx_enter/exit we discuss in
> another thread, this should likely "join" all synchronize_sched's.

That would still be 4k * sync_sched() == terribly long.

> Or split cpu_hotplug_begin() into 2 helpers which handle FAST -> SLOW and
> SLOW -> BLOCK transitions, then move the first "FAST -> SLOW" handler outside
> of for_each_online_cpu().

Right, that's more messy but would work if we cannot teach cpufreq (and
possibly others) to not rely on state you shouldn't rely on anyway.

I think the only guarantee POST_DEAD should have is that it should be
called before UP_PREPARE of the same cpu ;-) Nothing more, nothing less.


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-01 Thread Oleg Nesterov
On 10/01, Peter Zijlstra wrote:
>
> On Tue, Oct 01, 2013 at 10:41:15PM +0530, Srivatsa S. Bhat wrote:
> > However, as Oleg said, its definitely worth considering whether this 
> > proposed
> > change in semantics is going to hurt us in the future. CPU_POST_DEAD has 
> > certainly
> > proved to be very useful in certain challenging situations (commit 
> > 1aee40ac9c
> > explains one such example), so IMHO we should be very careful not to 
> > undermine
> > its utility.
>
> Urgh.. crazy things. I've always understood POST_DEAD to mean 'will be
> called at some time after the unplug' with no further guarantees. And my
> patch preserves that.

I tend to agree with Srivatsa... Without a strong reason it would be better
to preserve the current logic: "some time after" should not be after the
next CPU_DOWN/UP*. But I won't argue too much.

But note that you do not strictly need this change. Just kill cpuhp_waitcount,
then we can change cpu_hotplug_begin/end to use xxx_enter/exit we discuss in
another thread, this should likely "join" all synchronize_sched's.

Or split cpu_hotplug_begin() into 2 helpers which handle FAST -> SLOW and
SLOW -> BLOCK transitions, then move the first "FAST -> SLOW" handler outside
of for_each_online_cpu().
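
Roughly, with such helpers (the names are invented here just for
illustration, and the frozen_cpus bookkeeping is trimmed),
disable_nonboot_cpus() could become:

static int disable_nonboot_cpus_sketch(void)
{
	int cpu, first_cpu = cpumask_first(cpu_online_mask);
	int error = 0;

	cpu_hotplug_fast_to_slow();	/* one synchronize_sched() for the whole batch */
	for_each_online_cpu(cpu) {
		if (cpu == first_cpu)
			continue;
		cpu_hotplug_slow_to_block();	/* wait for the current readers */
		error = _cpu_down(cpu, 1);
		cpu_hotplug_unblock();		/* back to SLOW; readers can run again */
		if (error)
			break;
	}
	cpu_hotplug_slow_to_fast();	/* SLOW -> FAST deferred via call_rcu_sched() */
	return error;
}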

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-01 Thread Peter Zijlstra
On Tue, Oct 01, 2013 at 10:41:15PM +0530, Srivatsa S. Bhat wrote:
> However, as Oleg said, its definitely worth considering whether this proposed
> change in semantics is going to hurt us in the future. CPU_POST_DEAD has 
> certainly
> proved to be very useful in certain challenging situations (commit 1aee40ac9c
> explains one such example), so IMHO we should be very careful not to undermine
> its utility.

Urgh.. crazy things. I've always understood POST_DEAD to mean 'will be
called at some time after the unplug' with no further guarantees. And my
patch preserves that.

It's not at all clear to me why cpufreq needs more; 1aee40ac9c certainly
doesn't explain it.

What's wrong with leaving a cleanup handle in percpu storage and
effectively doing:

struct cpu_destroy {
	void (*destroy)(void *);
	void *arg;
};

DEFINE_PER_CPU(struct cpu_destroy, cpu_destroy);

	POST_DEAD:
	{
		struct cpu_destroy x = per_cpu(cpu_destroy, cpu);
		if (x.destroy)
			x.destroy(x.arg);
	}

POST_DEAD cannot fail; so CPU_DEAD/CPU_DOWN_PREPARE can simply assume it
will succeed; it has to.
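
For instance (a hypothetical user; my_cleanup() and my_state are made-up
names), the DEAD notifier would just stash the work for POST_DEAD to run:

static DEFINE_PER_CPU(void *, my_state);

static void my_cleanup(void *arg)
{
	kfree(arg);		/* whatever must wait until after the unplug */
}

static int my_cpu_callback(struct notifier_block *nb, unsigned long action,
			   void *hcpu)
{
	int cpu = (long)hcpu;

	switch (action & ~CPU_TASKS_FROZEN) {
	case CPU_DEAD:
		/* defer the final teardown; POST_DEAD cannot fail */
		per_cpu(cpu_destroy, cpu) = (struct cpu_destroy){
			.destroy = my_cleanup,
			.arg	 = per_cpu(my_state, cpu),
		};
		break;
	}
	return NOTIFY_OK;
}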

The cpufreq situation simply doesn't make any kind of sense to me.




Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-01 Thread Srivatsa S. Bhat
On 10/01/2013 01:41 AM, Rafael J. Wysocki wrote:
> On Saturday, September 28, 2013 06:31:04 PM Oleg Nesterov wrote:
>> On 09/28, Peter Zijlstra wrote:
>>>
>>> On Sat, Sep 28, 2013 at 02:48:59PM +0200, Oleg Nesterov wrote:
>>>
 Please note that this wait_event() adds a problem... it doesn't allow
 to "offload" the final synchronize_sched(). Suppose a 4k cpu machine
 does disable_nonboot_cpus(), we do not want 2 * 4k * synchronize_sched's
 in this case. We can solve this, but this wait_event() complicates
 the problem.
>>>
>>> That seems like a particularly easy fix; something like so?
>>
>> Yes, but...
>>
>>> @@ -586,6 +603,11 @@ int disable_nonboot_cpus(void)
>>>
>>> +   cpu_hotplug_done();
>>> +
>>> +   for_each_cpu(cpu, frozen_cpus)
>>> +   cpu_notify_nofail(CPU_POST_DEAD_FROZEN, (void*)(long)cpu);
>>
>> This changes the protocol, I simply do not know if it is fine in general
>> to do __cpu_down(another_cpu) without CPU_POST_DEAD(previous_cpu). Say,
>> currently it is possible that CPU_DOWN_PREPARE takes some global lock
>> released by CPU_DOWN_FAILED or CPU_POST_DEAD.
>>
>> Hmm. Now that workqueues do not use CPU_POST_DEAD, it has only 2 users,
>> mce_cpu_callback() and cpufreq_cpu_callback() and the 1st one even ignores
>> this notification if FROZEN. So yes, probably this is fine, but needs an
>> ack from cpufreq maintainers (cc'ed), for example to ensure that it is
>> fine to call __cpufreq_remove_dev_prepare() twice without _finish().
> 
> To my eyes it will return -EBUSY when it tries to stop an already stopped
> governor, which will cause the entire chain to fail I guess.
>
> Srivatsa has touched that code most recently, so he should know better, 
> though.
> 

Yes it will return -EBUSY, but unfortunately it gets scarier from that
point onwards. When it gets an -EBUSY, __cpufreq_remove_dev_prepare() aborts
its work mid-way and returns, but doesn't bubble up the error to the CPU-hotplug
core. So the CPU hotplug code will continue to take that CPU down, with
further notifications such as CPU_DEAD, and chaos will ensue.

And we can't exactly "fix" this by simply returning the error code to the
CPU-hotplug core (since that would mean that suspend/resume would _always_
fail). Perhaps we can teach cpufreq to ignore the error in this particular
case (since the governor has already been stopped, and that's precisely what
this function wanted to do as well), but the problems don't seem to end
there.

The other issue is that the CPUs in the policy->cpus mask are removed in the
_dev_finish() stage. So if that stage is postponed like this, then
_dev_prepare() will get thoroughly confused, since it also depends on seeing
an updated policy->cpus mask to decide when to nominate a new policy->cpu,
etc. (And the CPU nomination code itself might start ping-ponging between
CPUs, since none of the CPUs would have been removed from the policy->cpus
mask.)

So, to summarize, this change to the CPU hotplug code will break cpufreq (and
suspend/resume) as things stand today, but I don't think these problems are
insurmountable.

However, as Oleg said, it's definitely worth considering whether this proposed
change in semantics is going to hurt us in the future. CPU_POST_DEAD has
certainly proved to be very useful in certain challenging situations (commit
1aee40ac9c explains one such example), so IMHO we should be very careful not
to undermine its utility.

Regards,
Srivatsa S. Bhat



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-01 Thread Oleg Nesterov
On 10/01, Paul E. McKenney wrote:
>
> On Sun, Sep 29, 2013 at 03:56:46PM +0200, Oleg Nesterov wrote:
> > On 09/27, Oleg Nesterov wrote:
> > >
> > > I tried hard to find any hole in this version but failed, I believe it
> > > is correct.
> >
> > And I still believe it is. But now I am starting to think that we
> > don't need cpuhp_seq. (and imo cpuhp_waitcount, but this is minor).
>
> Here is one scenario that I believe requires cpuhp_seq:
>
> 1.Task 0 on CPU 0 increments its counter on entry.
>
> 2.Task 1 on CPU 1 starts summing the counters and gets to
>   CPU 4.  The sum thus far is 1 (Task 0).
>
> 3.Task 2 on CPU 2 increments its counter on entry.
>   Upon completing its entry code, it re-enables preemption.

afaics at this stage it should notice state = BLOCK and decrement
the same counter on the same CPU before it does preempt_enable().

Because:

> > 2. It is the reader which tries to take this lock and
> >noticed state == BLOCK. We could miss the result of
> >its inc(), but we do not care, this reader is going
> >to block.
> >
> >_If_ the reader could migrate between inc/dec, then
> >yes, we have a problem. Because that dec() could make
> >the result of per_cpu_sum() = 0. IOW, we could miss
> >inc() but notice dec(). But given that it does this
> >on the same CPU this is not possible.
> >
> > So why do we need cpuhp_seq?
>
> Good question, I will look again.

Thanks! much appreciated.

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-01 Thread Oleg Nesterov
On 10/01, Paul E. McKenney wrote:
>
> On Tue, Oct 01, 2013 at 04:48:20PM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 01, 2013 at 07:45:37AM -0700, Paul E. McKenney wrote:
> > > If you don't have cpuhp_seq, you need some other way to avoid
> > > counter overflow.  Which might be provided by limited number of
> > > tasks, or, on 64-bit systems, 64-bit counters.
> >
> > How so? PID space is basically limited to 30 bits, so how could we
> > overflow a 32bit reference counter?
>
> Nesting.

Still it seems that UINT_MAX / PID_MAX_LIMIT has enough room.

But again, OK, let's make it ulong. The question is how cpuhp_seq can
help and why we can't kill it.

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-01 Thread Paul E. McKenney
On Sun, Sep 29, 2013 at 03:56:46PM +0200, Oleg Nesterov wrote:
> On 09/27, Oleg Nesterov wrote:
> >
> > I tried hard to find any hole in this version but failed, I believe it
> > is correct.
> 
> And I still believe it is. But now I am starting to think that we
> don't need cpuhp_seq. (and imo cpuhp_waitcount, but this is minor).

Here is one scenario that I believe requires cpuhp_seq:

1.  Task 0 on CPU 0 increments its counter on entry.

2.  Task 1 on CPU 1 starts summing the counters and gets to
CPU 4.  The sum thus far is 1 (Task 0).

3.  Task 2 on CPU 2 increments its counter on entry.
Upon completing its entry code, it re-enables preemption.

4.  Task 2 is preempted, and starts running on CPU 5.

5.  Task 2 decrements its counter on exit.

6.  Task 1 continues summing.  Due to the fact that it saw Task 2's
exit but not its entry, the sum is zero.

One of cpuhp_seq's jobs is to prevent this scenario.

That said, bozo here still hasn't gotten to look at Peter's newest patch,
so perhaps it prevents this scenario some other way, perhaps by your
argument below.
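
For reference, the check being discussed is essentially the following
(reconstructed from the patch under review; the final sequence re-check is
the part the scenario above motivates, so treat it as an illustration rather
than the exact code):

#define per_cpu_sum(var)						\
({									\
	typeof(var) __sum = 0;						\
	int cpu;							\
	for_each_possible_cpu(cpu)					\
		__sum += per_cpu(var, cpu);				\
	__sum;								\
})

static bool cpuhp_readers_active_check(void)
{
	unsigned int seq = per_cpu_sum(cpuhp_seq);

	smp_mb(); /* B matches A */

	if (per_cpu_sum(__cpuhp_refcount) != 0)
		return false;

	smp_mb(); /* D matches C */

	/*
	 * A reader that migrated between its inc and dec (the scenario
	 * above) would have bumped cpuhp_seq, so an unchanged sequence
	 * is what lets us trust the zero sum.
	 */
	return per_cpu_sum(cpuhp_seq) == seq;
}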

> > We need to ensure 2 things:
> >
> > 1. The reader should notice state = BLOCK or the writer should see
> >inc(__cpuhp_refcount). This is guaranteed by 2 mb's in
> >__get_online_cpus() and in cpu_hotplug_begin().
> >
> >We do not care if the writer misses some inc(__cpuhp_refcount)
> >in per_cpu_sum(__cpuhp_refcount), that reader(s) should notice
> >state = readers_block (and inc(cpuhp_seq) can't help anyway).
> 
> Yes!

OK, I will look over the patch with this in mind.

> > 2. If the writer sees the result of this_cpu_dec(__cpuhp_refcount)
> >from __put_online_cpus() (note that the writer can miss the
> >corresponding inc() if it was done on another CPU, so this dec()
> >can lead to sum() == 0),
> 
> But this can't happen in this version? Somehow I forgot that
> __get_online_cpus() does inc/get under preempt_disable(), always on
> the same CPU. And thanks to mb's the writer should not miss the
> reader which has already passed the "state != BLOCK" check.
> 
> To simplify the discussion, lets ignore the "readers_fast" state,
> synchronize_sched() logic looks obviously correct. IOW, lets discuss
> only the SLOW -> BLOCK transition.
> 
>   cpu_hotplug_begin()
>   {
>   state = BLOCK;
> 
>   mb();
> 
>   wait_event(cpuhp_writer,
>   per_cpu_sum(__cpuhp_refcount) == 0);
>   }
> 
> should work just fine? Ignoring all details, we have
> 
>   get_online_cpus()
>   {
>   again:
>   preempt_disable();
> 
>   __this_cpu_inc(__cpuhp_refcount);
> 
>   mb();
> 
>   if (state == BLOCK) {
> 
>   mb();
> 
>   __this_cpu_dec(__cpuhp_refcount);
>   wake_up_all(cpuhp_writer);
> 
>   preempt_enable();
>   wait_event(state != BLOCK);
>   goto again;
>   }
> 
>   preempt_enable();
>   }
> 
> It seems to me that these mb's guarantee all we need, no?
> 
> It looks really simple. The reader can only succeed if it doesn't see
> BLOCK; in this case per_cpu_sum() should see the change.
> 
> We have
> 
>   WRITER  READER on CPU X
> 
>   state = BLOCK;  __cpuhp_refcount[X]++;
> 
>   mb();   mb();
> 
>   ...
>   count += __cpuhp_refcount[X];   if (state != BLOCK)
>   ... return;
> 
>   mb();
>   __cpuhp_refcount[X]--;
> 
> Either reader or writer should notice the STORE we care about.
> 
> If a reader can decrement __cpuhp_refcount, we have 2 cases:
> 
>   1. It is the reader holding this lock. In this case we
>  can't miss the corresponding inc() done by this reader,
>  because this reader didn't see BLOCK in the past.
> 
>  It is just the
> 
>   A == B == 0
>   CPU_0   CPU_1
>   -   -
>   A = 1;  B = 1;
>   mb();   mb();
>   b = B;  a = A;
> 
>  pattern, at least one CPU should see 1 in its a/b.
> 
>   2. It is the reader which tries to take this lock and
>  noticed state == BLOCK. We could miss the result of
>  its inc(), but we do not care, this reader is going
>  to block.
> 
>  _If_ the reader could migrate between inc/dec, then
>  yes, we have a problem. Because that dec() could make
>  the result of per_cpu_sum() = 0. IOW, we could miss
>  inc() but notice dec(). But given that it does this
>  on the same CPU this is not possible.

Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-01 Thread Paul E. McKenney
On Tue, Oct 01, 2013 at 04:48:20PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 01, 2013 at 07:45:37AM -0700, Paul E. McKenney wrote:
> > If you don't have cpuhp_seq, you need some other way to avoid
> > counter overflow.  Which might be provided by limited number of
> > tasks, or, on 64-bit systems, 64-bit counters.
> 
> How so? PID space is basically limited to 30 bits, so how could we
> overflow a 32bit reference counter?

Nesting.

Thanx, Paul



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-01 Thread Oleg Nesterov
On 10/01, Paul E. McKenney wrote:
>
> On Tue, Oct 01, 2013 at 04:14:29PM +0200, Oleg Nesterov wrote:
> >
> > But please note another email, it seems to me we can simply kill
> > cpuhp_seq and all the barriers in cpuhp_readers_active_check().
>
> If you don't have cpuhp_seq, you need some other way to avoid
> counter overflow.

I don't think so. Overflows (especially "unsigned") should be fine, and
in fact we can't avoid them.

Say, a task does get() on CPU_0 and put() on CPU_1; after that we have

	CTR[0] == 1, CTR[1] == (unsigned)-1

IOW, the counter has already overflowed (underflowed). But this is fine;
all we care about is CTR[0] + CTR[1] == 0, and this is only true because
of another overflow.
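
Just to spell out the arithmetic (plain userspace C, nothing kernel-specific):

#include <stdio.h>

int main(void)
{
	unsigned int ctr[2] = { 0, 0 };

	ctr[0]++;	/* get() ran on CPU_0 */
	ctr[1]--;	/* put() ran on CPU_1: wraps to 0xffffffff */

	/* unsigned addition wraps too, so the total is 0 again */
	printf("%u\n", ctr[0] + ctr[1]);
	return 0;
}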

But probably you meant another thing,

> Which might be provided by limited number of
> tasks, or, on 64-bit systems, 64-bit counters.

perhaps you meant that max_threads * max_depth can overflow the counter?
I don't think so... but OK, perhaps this counter should be u_long.

But how cpuhp_seq can help?

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-01 Thread Peter Zijlstra
On Tue, Oct 01, 2013 at 07:45:37AM -0700, Paul E. McKenney wrote:
> If you don't have cpuhp_seq, you need some other way to avoid
> counter overflow.  Which might be provided by limited number of
> tasks, or, on 64-bit systems, 64-bit counters.

How so? PID space is basically limited to 30 bits, so how could we
overflow a 32-bit reference counter?


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-01 Thread Paul E. McKenney
On Tue, Oct 01, 2013 at 04:14:29PM +0200, Oleg Nesterov wrote:
> On 09/30, Paul E. McKenney wrote:
> >
> > On Fri, Sep 27, 2013 at 10:41:16PM +0200, Peter Zijlstra wrote:
> > > On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> > > > On 09/26, Peter Zijlstra wrote:
> >
> > [ . . . ]
> >
> > > > > +static bool cpuhp_readers_active_check(void)
> > > > >  {
> > > > > + unsigned int seq = per_cpu_sum(cpuhp_seq);
> > > > > +
> > > > > + smp_mb(); /* B matches A */
> > > > > +
> > > > > + /*
> > > > > +  * In other words, if we see __get_online_cpus() cpuhp_seq 
> > > > > increment,
> > > > > +  * we are guaranteed to also see its __cpuhp_refcount increment.
> > > > > +  */
> > > > >
> > > > > + if (per_cpu_sum(__cpuhp_refcount) != 0)
> > > > > + return false;
> > > > >
> > > > > + smp_mb(); /* D matches C */
> > > >
> > > > It seems that both barriers could be smp_rmb()? I am not sure the comments
> > > > from srcu_readers_active_idx_check() can explain mb(), note that
> > > > __srcu_read_lock() always succeeds unlike get_cpus_online().
> > >
> > > I see what you mean; cpuhp_readers_active_check() is all purely reads;
> > > there are no writes to order.
> > >
> > > Paul; is there any argument for the MB here as opposed to RMB; and if
> > > not should we change both these and SRCU?
> >
> > Given that these memory barriers execute only on the semi-slow path,
> > why add the complexity of moving from smp_mb() to either smp_rmb()
> > or smp_wmb()?  Straight smp_mb() is easier to reason about and more
> > robust against future changes.
> 
> But otoh this looks misleading, and the comments add more confusion.
> 
> But please note another email, it seems to me we can simply kill
> cpuhp_seq and all the barriers in cpuhp_readers_active_check().

If you don't have cpuhp_seq, you need some other way to avoid
counter overflow.  Which might be provided by limited number of
tasks, or, on 64-bit systems, 64-bit counters.

Thanx, Paul



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-10-01 Thread Oleg Nesterov
On 09/30, Paul E. McKenney wrote:
>
> On Fri, Sep 27, 2013 at 10:41:16PM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> > > On 09/26, Peter Zijlstra wrote:
>
> [ . . . ]
>
> > > > +static bool cpuhp_readers_active_check(void)
> > > >  {
> > > > +   unsigned int seq = per_cpu_sum(cpuhp_seq);
> > > > +
> > > > +   smp_mb(); /* B matches A */
> > > > +
> > > > +   /*
> > > > +* In other words, if we see __get_online_cpus() cpuhp_seq 
> > > > increment,
> > > > +* we are guaranteed to also see its __cpuhp_refcount increment.
> > > > +*/
> > > >
> > > > +   if (per_cpu_sum(__cpuhp_refcount) != 0)
> > > > +   return false;
> > > >
> > > > +   smp_mb(); /* D matches C */
> > >
> > > It seems that both barriers could be smp_rmb()? I am not sure the comments
> > > from srcu_readers_active_idx_check() can explain mb(), note that
> > > __srcu_read_lock() always succeeds unlike get_online_cpus().
> >
> > I see what you mean; cpuhp_readers_active_check() is all purely reads;
> > there are no writes to order.
> >
> > Paul; is there any argument for the MB here as opposed to RMB; and if
> > not should we change both these and SRCU?
>
> Given that these memory barriers execute only on the semi-slow path,
> why add the complexity of moving from smp_mb() to either smp_rmb()
> or smp_wmb()?  Straight smp_mb() is easier to reason about and more
> robust against future changes.

But otoh this looks misleading, and the comments add more confusion.

But please note another email, it seems to me we can simply kill
cpuhp_seq and all the barriers in cpuhp_readers_active_check().

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-30 Thread Paul E. McKenney
On Fri, Sep 27, 2013 at 10:41:16PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> > On 09/26, Peter Zijlstra wrote:

[ . . . ]

> > > +static bool cpuhp_readers_active_check(void)
> > >  {
> > > + unsigned int seq = per_cpu_sum(cpuhp_seq);
> > > +
> > > + smp_mb(); /* B matches A */
> > > +
> > > + /*
> > > +  * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > > +  * we are guaranteed to also see its __cpuhp_refcount increment.
> > > +  */
> > >  
> > > + if (per_cpu_sum(__cpuhp_refcount) != 0)
> > > + return false;
> > >  
> > > + smp_mb(); /* D matches C */
> > 
> > It seems that both barriers could be smp_rmb()? I am not sure the comments
> > from srcu_readers_active_idx_check() can explain mb(), note that
> > __srcu_read_lock() always succeeds unlike get_online_cpus().
> 
> I see what you mean; cpuhp_readers_active_check() is all purely reads;
> there are no writes to order.
> 
> Paul; is there any argument for the MB here as opposed to RMB; and if
> not should we change both these and SRCU?

Given that these memory barriers execute only on the semi-slow path,
why add the complexity of moving from smp_mb() to either smp_rmb()
or smp_wmb()?  Straight smp_mb() is easier to reason about and more
robust against future changes.

Thanx, Paul



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-30 Thread Rafael J. Wysocki
On Saturday, September 28, 2013 06:31:04 PM Oleg Nesterov wrote:
> On 09/28, Peter Zijlstra wrote:
> >
> > On Sat, Sep 28, 2013 at 02:48:59PM +0200, Oleg Nesterov wrote:
> >
> > > Please note that this wait_event() adds a problem... it doesn't allow
> > > to "offload" the final synchronize_sched(). Suppose a 4k cpu machine
> > > does disable_nonboot_cpus(), we do not want 2 * 4k * synchronize_sched's
> > > in this case. We can solve this, but this wait_event() complicates
> > > the problem.
> >
> > That seems like a particularly easy fix; something like so?
> 
> Yes, but...
> 
> > @@ -586,6 +603,11 @@ int disable_nonboot_cpus(void)
> >
> > +   cpu_hotplug_done();
> > +
> > +   for_each_cpu(cpu, frozen_cpus)
> > +   cpu_notify_nofail(CPU_POST_DEAD_FROZEN, (void*)(long)cpu);
> 
> This changes the protocol, I simply do not know if it is fine in general
> to do __cpu_down(another_cpu) without CPU_POST_DEAD(previous_cpu). Say,
> currently it is possible that CPU_DOWN_PREPARE takes some global lock
> released by CPU_DOWN_FAILED or CPU_POST_DEAD.
> 
> Hmm. Now that workqueues do not use CPU_POST_DEAD, it has only 2 users,
> mce_cpu_callback() and cpufreq_cpu_callback() and the 1st one even ignores
> this notification if FROZEN. So yes, probably this is fine, but needs an
> ack from cpufreq maintainers (cc'ed), for example to ensure that it is
> fine to call __cpufreq_remove_dev_prepare() twice without _finish().

To my eyes it will return -EBUSY when it tries to stop an already stopped
governor, which will cause the entire chain to fail I guess.

Srivatsa has touched that code most recently, so he should know better, though.

Thanks,
Rafael



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-29 Thread Oleg Nesterov
On 09/27, Oleg Nesterov wrote:
>
> I tried hard to find any hole in this version but failed, I believe it
> is correct.

And I still believe it is. But now I am starting to think that we
don't need cpuhp_seq. (and imo cpuhp_waitcount, but this is minor).

> We need to ensure 2 things:
>
> 1. The reader should notice state = BLOCK or the writer should see
>inc(__cpuhp_refcount). This is guaranteed by 2 mb's in
>__get_online_cpus() and in cpu_hotplug_begin().
>
>We do not care if the writer misses some inc(__cpuhp_refcount)
>in per_cpu_sum(__cpuhp_refcount), that reader(s) should notice
>state = readers_block (and inc(cpuhp_seq) can't help anyway).

Yes!

> 2. If the writer sees the result of this_cpu_dec(__cpuhp_refcount)
>from __put_online_cpus() (note that the writer can miss the
>corresponding inc() if it was done on another CPU, so this dec()
>can lead to sum() == 0),

But this can't happen in this version? Somehow I forgot that
__get_online_cpus() does inc/dec under preempt_disable(), always on
the same CPU. And thanks to mb's the writer should not miss the
reader which has already passed the "state != BLOCK" check.

To simplify the discussion, let's ignore the "readers_fast" state; the
synchronize_sched() logic looks obviously correct. IOW, let's discuss
only the SLOW -> BLOCK transition.

cpu_hotplug_begin()
{
	state = BLOCK;

	mb();

	wait_event(cpuhp_writer,
		   per_cpu_sum(__cpuhp_refcount) == 0);
}

should work just fine? Ignoring all details, we have

get_online_cpus()
{
again:
	preempt_disable();

	__this_cpu_inc(__cpuhp_refcount);

	mb();

	if (state == BLOCK) {

		mb();

		__this_cpu_dec(__cpuhp_refcount);
		wake_up_all(cpuhp_writer);

		preempt_enable();
		wait_event(state != BLOCK);
		goto again;
	}

	preempt_enable();
}

It seems to me that these mb's guarantee all we need, no?

It looks really simple. The reader can only succeed if it doesn't see
BLOCK; in this case per_cpu_sum() should see the change.

We have

WRITER                                  READER on CPU X

state = BLOCK;                          __cpuhp_refcount[X]++;

mb();                                   mb();

...
count += __cpuhp_refcount[X];           if (state != BLOCK)
...                                             return;

                                        mb();
                                        __cpuhp_refcount[X]--;

Either reader or writer should notice the STORE we care about.
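
(Editorial aside: the two-column pattern above is the classic
store-buffering shape. A minimal user-space sketch of the same invariant,
a hypothetical illustration only with C11 seq_cst fences standing in for
smp_mb(), could look like this; the assert can never fire:)

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int state;	/* stands in for "state = BLOCK" */
static atomic_int refcount;	/* stands in for "__cpuhp_refcount[X]++" */
static int writer_saw_ref, reader_saw_block;

static void *writer_side(void *arg)
{
	atomic_store_explicit(&state, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* mb() */
	writer_saw_ref = atomic_load_explicit(&refcount, memory_order_relaxed);
	return NULL;
}

static void *reader_side(void *arg)
{
	atomic_store_explicit(&refcount, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* mb() */
	reader_saw_block = atomic_load_explicit(&state, memory_order_relaxed);
	return NULL;
}

int main(void)
{
	pthread_t w, r;

	pthread_create(&w, NULL, writer_side, NULL);
	pthread_create(&r, NULL, reader_side, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);

	/* At least one side must have observed the other's store. */
	assert(writer_saw_ref == 1 || reader_saw_block == 1);
	return 0;
}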

If a reader can decrement __cpuhp_refcount, we have 2 cases:

1. It is the reader holding this lock. In this case we
   can't miss the corresponding inc() done by this reader,
   because this reader didn't see BLOCK in the past.

   It is just the

A == B == 0
CPU_0   CPU_1
-   -
A = 1;  B = 1;
mb();   mb();
b = B;  a = A;

   pattern, at least one CPU should see 1 in its a/b.

2. It is the reader which tries to take this lock and
   noticed state == BLOCK. We could miss the result of
   its inc(), but we do not care, this reader is going
   to block.

   _If_ the reader could migrate between inc/dec, then
   yes, we have a problem. Because that dec() could make
   the result of per_cpu_sum() = 0. IOW, we could miss
   inc() but notice dec(). But given that it does this
   on the same CPU this is not possible.

So why do we need cpuhp_seq?

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-28 Thread Paul E. McKenney
On Sat, Sep 28, 2013 at 02:48:59PM +0200, Oleg Nesterov wrote:
> On 09/27, Peter Zijlstra wrote:
> >
> > On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> >
> > > > +static bool cpuhp_readers_active_check(void)
> > > >  {
> > > > +   unsigned int seq = per_cpu_sum(cpuhp_seq);
> > > > +
> > > > +   smp_mb(); /* B matches A */
> > > > +
> > > > +   /*
> > > > +* In other words, if we see __get_online_cpus() cpuhp_seq 
> > > > increment,
> > > > +* we are guaranteed to also see its __cpuhp_refcount increment.
> > > > +*/
> > > >
> > > > +   if (per_cpu_sum(__cpuhp_refcount) != 0)
> > > > +   return false;
> > > >
> > > > +   smp_mb(); /* D matches C */
> > >
> > > It seems that both barriers could be smp_rmb()? I am not sure the
> > > from srcu_readers_active_idx_check() can explain mb(),
> 
> To avoid the confusion, I meant "those comments can't explain mb()s here,
> in cpuhp_readers_active_check()".
> 
> > > note that
> > > __srcu_read_lock() always succeeds unlike get_online_cpus().
> 
> And this cpu_hotplug_* code and synchronize_srcu() differ, see below.
> 
> > I see what you mean; cpuhp_readers_active_check() is all purely reads;
> > there are no writes to order.
> >
> > Paul; is there any argument for the MB here as opposed to RMB;
> 
> Yes, Paul, please ;)

Sorry to be slow -- I will reply by end of Monday Pacific time at the
latest.  I need to allow myself enough time so that it seems new...

Also I might try some mechanical proofs of parts of it.

Thanx, Paul

> > and if
> > not should we change both these and SRCU?
> 
> I guess that SRCU is more "complex" in this respect. IIUC,
> cpuhp_readers_active_check() needs "more" barriers because if
> synchronize_srcu() succeeds it needs to synchronize with the new readers
> which call srcu_read_lock/unlock() "right now". Again, unlike cpu-hotplug
> srcu never blocks the readers, srcu_read_*() always succeeds.
> 
> 
> 
> Hmm. I am wondering why __srcu_read_lock() needs ACCESS_ONCE() to increment
> ->c and ->seq. A plain this_cpu_inc() should be fine?
> 
> And since it disables preemption, why it can't use __this_cpu_inc() to inc
> ->c[idx]. OK, in general __this_cpu_inc() is not irq-safe (rmw) so we can't
> do __this_cpu_inc(seq[idx]), c[idx] should be fine? If irq does 
> srcu_read_lock()
> it should also do _unlock.
> 
> But this is minor/offtopic.
> 
> > > >  void cpu_hotplug_done(void)
> > > >  {
> ...
> > > > +   /*
> > > > +* Wait for any pending readers to be running. This ensures 
> > > > readers
> > > > +* after writer and avoids writers starving readers.
> > > > +*/
> > > > +   wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
> > > >  }
> > >
> > > OK, to some degree I can understand "avoids writers starving readers"
> > > part (although the next writer should do synchronize_sched() first),
> > > but could you explain "ensures readers after writer" ?
> >
> > Suppose reader A sees state == BLOCK and goes to sleep; our writer B
> > does cpu_hotplug_done() and wakes all pending readers. If for some
> > reason A doesn't schedule to inc ref until B again executes
> > cpu_hotplug_begin() and state is once again BLOCK, A will not have made
> > any progress.
> 
> Yes, yes, thanks, this is clear. But this explains "writers starving readers".
> And let me repeat, if B again executes cpu_hotplug_begin() it will do
> another synchronize_sched() before it sets BLOCK, so I am not sure we
> need this "in practice".
> 
> I was confused by "ensures readers after writer", I thought this means
> we need the additional synchronization with the readers which are going
> to increment cpuhp_waitcount, say, some sort of barriers.
> 
> Please note that this wait_event() adds a problem... it doesn't allow
> to "offload" the final synchronize_sched(). Suppose a 4k cpu machine
> does disable_nonboot_cpus(), we do not want 2 * 4k * synchronize_sched's
> in this case. We can solve this, but this wait_event() complicates
> the problem.
> 
> Oleg.
> 



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-28 Thread Oleg Nesterov
On 09/28, Peter Zijlstra wrote:
>
> On Sat, Sep 28, 2013 at 02:48:59PM +0200, Oleg Nesterov wrote:
>
> > Please note that this wait_event() adds a problem... it doesn't allow
> > to "offload" the final synchronize_sched(). Suppose a 4k cpu machine
> > does disable_nonboot_cpus(), we do not want 2 * 4k * synchronize_sched's
> > in this case. We can solve this, but this wait_event() complicates
> > the problem.
>
> That seems like a particularly easy fix; something like so?

Yes, but...

> @@ -586,6 +603,11 @@ int disable_nonboot_cpus(void)
>
> + cpu_hotplug_done();
> +
> + for_each_cpu(cpu, frozen_cpus)
> + cpu_notify_nofail(CPU_POST_DEAD_FROZEN, (void*)(long)cpu);

This changes the protocol, I simply do not know if it is fine in general
to do __cpu_down(another_cpu) without CPU_POST_DEAD(previous_cpu). Say,
currently it is possible that CPU_DOWN_PREPARE takes some global lock
released by CPU_DOWN_FAILED or CPU_POST_DEAD.

Hmm. Now that workqueues do not use CPU_POST_DEAD, it has only 2 users,
mce_cpu_callback() and cpufreq_cpu_callback() and the 1st one even ignores
this notification if FROZEN. So yes, probably this is fine, but needs an
ack from cpufreq maintainers (cc'ed), for example to ensure that it is
fine to call __cpufreq_remove_dev_prepare() twice without _finish().

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-28 Thread Peter Zijlstra
On Sat, Sep 28, 2013 at 02:48:59PM +0200, Oleg Nesterov wrote:
> > > >  void cpu_hotplug_done(void)
> > > >  {
> ...
> > > > +   /*
> > > > +* Wait for any pending readers to be running. This ensures 
> > > > readers
> > > > +* after writer and avoids writers starving readers.
> > > > +*/
> > > > +   wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
> > > >  }
> > >
> > > OK, to some degree I can understand "avoids writers starving readers"
> > > part (although the next writer should do synchronize_sched() first),
> > > but could you explain "ensures readers after writer" ?
> >
> > Suppose reader A sees state == BLOCK and goes to sleep; our writer B
> > does cpu_hotplug_done() and wakes all pending readers. If for some
> > reason A doesn't schedule to inc ref until B again executes
> > cpu_hotplug_begin() and state is once again BLOCK, A will not have made
> > any progress.
> 
> Yes, yes, thanks, this is clear. But this explains "writers starving readers".
> And let me repeat, if B again executes cpu_hotplug_begin() it will do
> another synchronize_sched() before it sets BLOCK, so I am not sure we
> need this "in practice".
> 
> I was confused by "ensures readers after writer", I thought this means
> we need the additional synchronization with the readers which are going
> to increment cpuhp_waitcount, say, some sort of barriers.

Ah no; I just wanted to guarantee that any pending readers did get a
chance to run. And yes due to the two sync_sched() calls it seems
somewhat unlikely in practise.

> Please note that this wait_event() adds a problem... it doesn't allow
> to "offload" the final synchronize_sched(). Suppose a 4k cpu machine
> does disable_nonboot_cpus(), we do not want 2 * 4k * synchronize_sched's
> in this case. We can solve this, but this wait_event() complicates
> the problem.

That seems like a particularly easy fix; something like so?

---
 include/linux/cpu.h |1 
 kernel/cpu.c|   84 ++--
 2 files changed, 56 insertions(+), 29 deletions(-)

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -109,6 +109,7 @@ enum {
 #define CPU_DOWN_FAILED_FROZEN (CPU_DOWN_FAILED | CPU_TASKS_FROZEN)
 #define CPU_DEAD_FROZEN(CPU_DEAD | CPU_TASKS_FROZEN)
 #define CPU_DYING_FROZEN   (CPU_DYING | CPU_TASKS_FROZEN)
+#define CPU_POST_DEAD_FROZEN   (CPU_POST_DEAD | CPU_TASKS_FROZEN)
 #define CPU_STARTING_FROZEN(CPU_STARTING | CPU_TASKS_FROZEN)
 
 
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -364,8 +364,7 @@ static int __ref take_cpu_down(void *_pa
return 0;
 }
 
-/* Requires cpu_add_remove_lock to be held */
-static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
+static int __ref __cpu_down(unsigned int cpu, int tasks_frozen)
 {
int err, nr_calls = 0;
void *hcpu = (void *)(long)cpu;
@@ -375,21 +374,13 @@ static int __ref _cpu_down(unsigned int
.hcpu = hcpu,
};
 
-   if (num_online_cpus() == 1)
-   return -EBUSY;
-
-   if (!cpu_online(cpu))
-   return -EINVAL;
-
-   cpu_hotplug_begin();
-
err = __cpu_notify(CPU_DOWN_PREPARE | mod, hcpu, -1, &nr_calls);
if (err) {
nr_calls--;
__cpu_notify(CPU_DOWN_FAILED | mod, hcpu, nr_calls, NULL);
printk("%s: attempt to take down CPU %u failed\n",
__func__, cpu);
-   goto out_release;
+   return err;
}
smpboot_park_threads(cpu);
 
@@ -398,7 +389,7 @@ static int __ref _cpu_down(unsigned int
/* CPU didn't die: tell everyone.  Can't complain. */
smpboot_unpark_threads(cpu);
cpu_notify_nofail(CPU_DOWN_FAILED | mod, hcpu);
-   goto out_release;
+   return err;
}
BUG_ON(cpu_online(cpu));
 
@@ -420,10 +411,27 @@ static int __ref _cpu_down(unsigned int
 
check_for_tasks(cpu);
 
-out_release:
+   return err;
+}
+
+/* Requires cpu_add_remove_lock to be held */
+static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
+{
+   unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
+   int err;
+
+   if (num_online_cpus() == 1)
+   return -EBUSY;
+
+   if (!cpu_online(cpu))
+   return -EINVAL;
+
+   cpu_hotplug_begin();
+   err = __cpu_down(cpu, tasks_frozen);
cpu_hotplug_done();
+
if (!err)
-   cpu_notify_nofail(CPU_POST_DEAD | mod, hcpu);
+   cpu_notify_nofail(CPU_POST_DEAD | mod, (void *)(long)cpu);
return err;
 }
 
@@ -447,30 +455,22 @@ int __ref cpu_down(unsigned int cpu)
 EXPORT_SYMBOL(cpu_down);
 #endif /*CONFIG_HOTPLUG_CPU*/
 
-/* Requires cpu_add_remove_lock to be held */
-static int _cpu_up(unsigned int cpu, int tasks_frozen)
+static int ___cpu_up(unsigned int cpu, int tasks_frozen)
 {
int ret, nr_calls = 0;
void *hcpu = (

Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-28 Thread Oleg Nesterov
On 09/27, Peter Zijlstra wrote:
>
> On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
>
> > > +static bool cpuhp_readers_active_check(void)
> > >  {
> > > + unsigned int seq = per_cpu_sum(cpuhp_seq);
> > > +
> > > + smp_mb(); /* B matches A */
> > > +
> > > + /*
> > > +  * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > > +  * we are guaranteed to also see its __cpuhp_refcount increment.
> > > +  */
> > >
> > > + if (per_cpu_sum(__cpuhp_refcount) != 0)
> > > + return false;
> > >
> > > + smp_mb(); /* D matches C */
> >
> > It seems that both barriers could be smp_rmb()? I am not sure the comments
> > from srcu_readers_active_idx_check() can explain mb(),

To avoid the confusion, I meant "those comments can't explain mb()s here,
in cpuhp_readers_active_check()".

> > note that
> > __srcu_read_lock() always succeeds unlike get_online_cpus().

And this cpu_hotplug_* code and synchronize_srcu() differ, see below.

> I see what you mean; cpuhp_readers_active_check() is all purely reads;
> there are no writes to order.
>
> Paul; is there any argument for the MB here as opposed to RMB;

Yes, Paul, please ;)

> and if
> not should we change both these and SRCU?

I guess that SRCU is more "complex" in this respect. IIUC,
cpuhp_readers_active_check() needs "more" barriers because if
synchronize_srcu() succeeds it needs to synchronize with the new readers
which call srcu_read_lock/unlock() "right now". Again, unlike cpu-hotplug
srcu never blocks the readers, srcu_read_*() always succeeds.



Hmm. I am wondering why __srcu_read_lock() needs ACCESS_ONCE() to increment
->c and ->seq. A plain this_cpu_inc() should be fine?

And since it disables preemption, why it can't use __this_cpu_inc() to inc
->c[idx]. OK, in general __this_cpu_inc() is not irq-safe (rmw) so we can't
do __this_cpu_inc(seq[idx]), c[idx] should be fine? If irq does srcu_read_lock()
it should also do _unlock.

But this is minor/offtopic.

> > >  void cpu_hotplug_done(void)
> > >  {
...
> > > + /*
> > > +  * Wait for any pending readers to be running. This ensures readers
> > > +  * after writer and avoids writers starving readers.
> > > +  */
> > > + wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
> > >  }
> >
> > OK, to some degree I can understand "avoids writers starving readers"
> > part (although the next writer should do synchronize_sched() first),
> > but could you explain "ensures readers after writer" ?
>
> Suppose reader A sees state == BLOCK and goes to sleep; our writer B
> does cpu_hotplug_done() and wakes all pending readers. If for some
> reason A doesn't schedule to inc ref until B again executes
> cpu_hotplug_begin() and state is once again BLOCK, A will not have made
> any progress.

Yes, yes, thanks, this is clear. But this explains "writers starving readers".
And let me repeat, if B again executes cpu_hotplug_begin() it will do
another synchronize_sched() before it sets BLOCK, so I am not sure we
need this "in practice".

I was confused by "ensures readers after writer", I thought this means
we need the additional synchronization with the readers which are going
to increment cpuhp_waitcount, say, some sort of barriers.

Please note that this wait_event() adds a problem... it doesn't allow
to "offload" the final synchronize_sched(). Suppose a 4k cpu machine
does disable_nonboot_cpus(), we do not want 2 * 4k * synchronize_sched's
in this case. We can solve this, but this wait_event() complicates
the problem.
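
(For scale, an editorial aside with illustrative numbers only: if one
synchronize_sched() costs on the order of 10 ms, then 2 * 4096 of them
add roughly 2 * 4096 * 0.01 s, i.e. about 80 seconds, to a single
disable_nonboot_cpus() call, which is why being able to batch or offload
them matters.)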

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-27 Thread Peter Zijlstra
On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> On 09/26, Peter Zijlstra wrote:
> >
> > But if the reader does see BLOCK it will not be an active reader no
> > more; and thus the writer doesn't need to observe and wait for it.
> 
> I meant they both can block, but please ignore. Today I simply can't
> understand what I was thinking about yesterday.

I think we all know that state all too well ;-)

> I tried hard to find any hole in this version but failed, I believe it
> is correct.

Yay!

> But, could you help me to understand some details?

I'll try, but I'm not too bright atm myself :-)

> > +void __get_online_cpus(void)
> > +{
> > +again:
> > +   /* See __srcu_read_lock() */
> > +   __this_cpu_inc(__cpuhp_refcount);
> > +   smp_mb(); /* A matches B, E */
> > +   __this_cpu_inc(cpuhp_seq);
> > +
> > +   if (unlikely(__cpuhp_state == readers_block)) {
> 
> Note that there is no barrier() after inc(seq) and __cpuhp_state
> check, this inc() can be "postponed" till ...
> 
> > +void __put_online_cpus(void)
> >  {
> > +   /* See __srcu_read_unlock() */
> > +   smp_mb(); /* C matches D */
> 
> ... this mb() in __put_online_cpus().
> 
> And this is fine! The question is, perhaps it would be more "natural"
> and understandable to shift this_cpu_inc(cpuhp_seq) into
> __put_online_cpus().

Possibly; I never got further than that the required order is:

  ref++
  MB
  seq++
  MB
  ref--

It doesn't matter if the seq++ is in the lock or unlock primitive. I
never considered one place more natural than the other.

> We need to ensure 2 things:
> 
> 1. The reader should notice state = BLOCK or the writer should see
>inc(__cpuhp_refcount). This is guaranteed by 2 mb's in
>__get_online_cpus() and in cpu_hotplug_begin().
> 
>We do not care if the writer misses some inc(__cpuhp_refcount)
>in per_cpu_sum(__cpuhp_refcount), that reader(s) should notice
>state = readers_block (and inc(cpuhp_seq) can't help anyway).

Agreed.

> 2. If the writer sees the result of this_cpu_dec(__cpuhp_refcount)
>from __put_online_cpus() (note that the writer can miss the
>corresponding inc() if it was done on another CPU, so this dec()
>can lead to sum() == 0), it should also notice the change in cpuhp_seq.
> 
>Fortunately, this can only happen if the reader migrates, in
>this case schedule() provides a barrier, the writer can't miss
>the change in cpuhp_seq.

Again, agreed; this is also the message of the second comment in
cpuhp_readers_active_check() by Paul.

> IOW. Unless I missed something, cpuhp_seq is actually needed to
> serialize __put_online_cpus()->this_cpu_dec(__cpuhp_refcount) and
> and /* D matches C */ in cpuhp_readers_active_check(), and this
> is not immediately clear if you look at __get_online_cpus().
> 
> I do not suggest to change this code, but please tell me if my
> understanding is not correct.

I think you're entirely right.

> > +static bool cpuhp_readers_active_check(void)
> >  {
> > +   unsigned int seq = per_cpu_sum(cpuhp_seq);
> > +
> > +   smp_mb(); /* B matches A */
> > +
> > +   /*
> > +* In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > +* we are guaranteed to also see its __cpuhp_refcount increment.
> > +*/
> >  
> > +   if (per_cpu_sum(__cpuhp_refcount) != 0)
> > +   return false;
> >  
> > +   smp_mb(); /* D matches C */
> 
> It seems that both barriers could be smp_rmb()? I am not sure the comments
> from srcu_readers_active_idx_check() can explain mb(), note that
> __srcu_read_lock() always succeeds unlike get_online_cpus().

I see what you mean; cpuhp_readers_active_check() is all purely reads;
there are no writes to order.

Paul; is there any argument for the MB here as opposed to RMB; and if
not should we change both these and SRCU?

> >  void cpu_hotplug_done(void)
> >  {
> > +   /* Signal the writer is done, no fast path yet. */
> > +   __cpuhp_state = readers_slow;
> > +   wake_up_all(&cpuhp_readers);
> > +
> > +   /*
> > +* The wait_event()/wake_up_all() prevents the race where the readers
> > +* are delayed between fetching __cpuhp_state and blocking.
> > +*/
> > +
> > +   /* See percpu_up_write(); readers will no longer attempt to block. */
> > +   synchronize_sched();
> > +
> > +   /* Let 'em rip */
> > +   __cpuhp_state = readers_fast;
> > +   current->cpuhp_ref--;
> > +
> > +   /*
> > +* Wait for any pending readers to be running. This ensures readers
> > +* after writer and avoids writers starving readers.
> > +*/
> > +   wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
> >  }
> 
> OK, to some degree I can understand "avoids writers starving readers"
> part (although the next writer should do synchronize_sched() first),
> but could you explain "ensures readers after writer" ?

Suppose reader A sees state == BLOCK and goes to sleep; our writer B
does cpu_hotplug_done() and wakes all pending readers. If for some
reason A doesn't schedule to inc ref until B again executes
cpu_hotplug_begin() and state is once again BLOCK, A will not have made
any progress.

Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-27 Thread Oleg Nesterov
On 09/26, Peter Zijlstra wrote:
>
> But if the reader does see BLOCK it will not be an active reader no
> more; and thus the writer doesn't need to observe and wait for it.

I meant they both can block, but please ignore. Today I simply can't
understand what I was thinking about yesterday.


I tried hard to find any hole in this version but failed, I believe it
is correct.

But, could you help me to understand some details?

> +void __get_online_cpus(void)
> +{
> +again:
> + /* See __srcu_read_lock() */
> + __this_cpu_inc(__cpuhp_refcount);
> + smp_mb(); /* A matches B, E */
> + __this_cpu_inc(cpuhp_seq);
> +
> + if (unlikely(__cpuhp_state == readers_block)) {

Note that there is no barrier() after inc(seq) and __cpuhp_state
check, this inc() can be "postponed" till ...

> +void __put_online_cpus(void)
>  {
> - might_sleep();
> - if (cpu_hotplug.active_writer == current)
> - return;
> - mutex_lock(&cpu_hotplug.lock);
> - cpu_hotplug.refcount++;
> - mutex_unlock(&cpu_hotplug.lock);
> + /* See __srcu_read_unlock() */
> + smp_mb(); /* C matches D */

... this mb() in __put_online_cpus().

And this is fine! The question is, perhaps it would be more "natural"
and understandable to shift this_cpu_inc(cpuhp_seq) into
__put_online_cpus().

We need to ensure 2 things:

1. The reader should notice state = BLOCK or the writer should see
   inc(__cpuhp_refcount). This is guaranteed by 2 mb's in
   __get_online_cpus() and in cpu_hotplug_begin().

   We do not care if the writer misses some inc(__cpuhp_refcount)
   in per_cpu_sum(__cpuhp_refcount), that reader(s) should notice
   state = readers_block (and inc(cpuhp_seq) can't help anyway).

2. If the writer sees the result of this_cpu_dec(__cpuhp_refcount)
   from __put_online_cpus() (note that the writer can miss the
   corresponding inc() if it was done on another CPU, so this dec()
   can lead to sum() == 0), it should also notice the change in cpuhp_seq.

   Fortunately, this can only happen if the reader migrates, in
   this case schedule() provides a barrier, the writer can't miss
   the change in cpuhp_seq.

IOW. Unless I missed something, cpuhp_seq is actually needed to
serialize __put_online_cpus()->this_cpu_dec(__cpuhp_refcount) and
and /* D matches C */ in cpuhp_readers_active_check(), and this
is not immediately clear if you look at __get_online_cpus().

I do not suggest to change this code, but please tell me if my
understanding is not correct.

> +static bool cpuhp_readers_active_check(void)
>  {
> - if (cpu_hotplug.active_writer == current)
> - return;
> - mutex_lock(&cpu_hotplug.lock);
> + unsigned int seq = per_cpu_sum(cpuhp_seq);
> +
> + smp_mb(); /* B matches A */
> +
> + /*
> +  * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> +  * we are guaranteed to also see its __cpuhp_refcount increment.
> +  */
>  
> - if (WARN_ON(!cpu_hotplug.refcount))
> - cpu_hotplug.refcount++; /* try to fix things up */
> + if (per_cpu_sum(__cpuhp_refcount) != 0)
> + return false;
>  
> - if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
> - wake_up_process(cpu_hotplug.active_writer);
> - mutex_unlock(&cpu_hotplug.lock);
> + smp_mb(); /* D matches C */

It seems that both barriers could be smp_rmb()? I am not sure the comments
from srcu_readers_active_idx_check() can explain mb(), note that
__srcu_read_lock() always succeeds unlike get_online_cpus().

>  void cpu_hotplug_done(void)
>  {
> - cpu_hotplug.active_writer = NULL;
> - mutex_unlock(&cpu_hotplug.lock);
> + /* Signal the writer is done, no fast path yet. */
> + __cpuhp_state = readers_slow;
> + wake_up_all(&cpuhp_readers);
> +
> + /*
> +  * The wait_event()/wake_up_all() prevents the race where the readers
> +  * are delayed between fetching __cpuhp_state and blocking.
> +  */
> +
> + /* See percpu_up_write(); readers will no longer attempt to block. */
> + synchronize_sched();
> +
> + /* Let 'em rip */
> + __cpuhp_state = readers_fast;
> + current->cpuhp_ref--;
> +
> + /*
> +  * Wait for any pending readers to be running. This ensures readers
> +  * after writer and avoids writers starving readers.
> +  */
> + wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
>  }

OK, to some degree I can understand "avoids writers starving readers"
part (although the next writer should do synchronize_sched() first),
but could you explain "ensures readers after writer" ?

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-26 Thread Peter Zijlstra
On Thu, Sep 26, 2013 at 06:58:40PM +0200, Oleg Nesterov wrote:
> Peter,
> 
> Sorry. Unlikely I will be able to read this patch today. So let me
> ask another potentially wrong question without any thinking.
> 
> On 09/26, Peter Zijlstra wrote:
> >
> > +void __get_online_cpus(void)
> > +{
> > +again:
> > +   /* See __srcu_read_lock() */
> > +   __this_cpu_inc(__cpuhp_refcount);
> > +   smp_mb(); /* A matches B, E */
> > +   __this_cpu_inc(cpuhp_seq);
> > +
> > +   if (unlikely(__cpuhp_state == readers_block)) {
> 
> OK. Either we should see state = BLOCK or the writer should notice the
> change in __cpuhp_refcount/seq. (although I'd like to recheck this
> cpuhp_seq logic ;)
> 
> > +   atomic_inc(&cpuhp_waitcount);
> > +   __put_online_cpus();
> 
> OK, this does wake(cpuhp_writer).
> 
> >  void cpu_hotplug_begin(void)
> >  {
> > ...
> > +   /*
> > +* Notify new readers to block; up until now, and thus throughout the
> > +* longish synchronize_sched() above, new readers could still come in.
> > +*/
> > +   __cpuhp_state = readers_block;
> > +
> > +   smp_mb(); /* E matches A */
> > +
> > +   /*
> > +* If they don't see our writer of readers_block to __cpuhp_state,
> > +* then we are guaranteed to see their __cpuhp_refcount increment, and
> > +* therefore will wait for them.
> > +*/
> > +
> > +   /* Wait for all now active readers to complete. */
> > +   wait_event(cpuhp_writer, cpuhp_readers_active_check());
> 
> But. doesn't this mean that we need __wait_event() here as well?
> 
> Isn't it possible that the reader sees BLOCK but the writer does _not_
> see the change in __cpuhp_refcount/cpuhp_seq? Those mb's guarantee
> "either", not "both".

But if the reader does see BLOCK it will not be an active reader no
more; and thus the writer doesn't need to observe and wait for it.

> Don't we need to ensure that we can't check cpuhp_readers_active_check()
> after wake(cpuhp_writer) was already called by the reader and before we
> take the same lock?

I'm too tired to fully grasp what you're asking here; but given the
previous answer I think not.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-26 Thread Oleg Nesterov
Peter,

Sorry. Unlikely I will be able to read this patch today. So let me
ask another potentially wrong question without any thinking.

On 09/26, Peter Zijlstra wrote:
>
> +void __get_online_cpus(void)
> +{
> +again:
> + /* See __srcu_read_lock() */
> + __this_cpu_inc(__cpuhp_refcount);
> + smp_mb(); /* A matches B, E */
> + __this_cpu_inc(cpuhp_seq);
> +
> + if (unlikely(__cpuhp_state == readers_block)) {

OK. Either we should see state = BLOCK or the writer should notice the
change in __cpuhp_refcount/seq. (although I'd like to recheck this
cpuhp_seq logic ;)

> + atomic_inc(&cpuhp_waitcount);
> + __put_online_cpus();

OK, this does wake(cpuhp_writer).

>  void cpu_hotplug_begin(void)
>  {
> ...
> + /*
> +  * Notify new readers to block; up until now, and thus throughout the
> +  * longish synchronize_sched() above, new readers could still come in.
> +  */
> + __cpuhp_state = readers_block;
> +
> + smp_mb(); /* E matches A */
> +
> + /*
> +  * If they don't see our writer of readers_block to __cpuhp_state,
> +  * then we are guaranteed to see their __cpuhp_refcount increment, and
> +  * therefore will wait for them.
> +  */
> +
> + /* Wait for all now active readers to complete. */
> + wait_event(cpuhp_writer, cpuhp_readers_active_check());

But. doesn't this mean that we need __wait_event() here as well?

Isn't it possible that the reader sees BLOCK but the writer does _not_
see the change in __cpuhp_refcount/cpuhp_seq? Those mb's guarantee
"either", not "both".

Don't we need to ensure that we can't check cpuhp_readers_active_check()
after wake(cpuhp_writer) was already called by the reader and before we
take the same lock?

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-26 Thread Peter Zijlstra
On Thu, Sep 26, 2013 at 06:14:26PM +0200, Oleg Nesterov wrote:
> On 09/26, Peter Zijlstra wrote:
> >
> > On Thu, Sep 26, 2013 at 05:53:21PM +0200, Oleg Nesterov wrote:
> > > On 09/26, Peter Zijlstra wrote:
> > > >  void cpu_hotplug_done(void)
> > > >  {
> > > > -   cpu_hotplug.active_writer = NULL;
> > > > -   mutex_unlock(&cpu_hotplug.lock);
> > > > +   /* Signal the writer is done, no fast path yet. */
> > > > +   __cpuhp_state = readers_slow;
> > > > +   wake_up_all(&cpuhp_readers);
> > > > +
> > > > +   /*
> > > > +* The wait_event()/wake_up_all() prevents the race where the 
> > > > readers
> > > > +* are delayed between fetching __cpuhp_state and blocking.
> > > > +*/
> > > > +
> > > > +   /* See percpu_up_write(); readers will no longer attempt to 
> > > > block. */
> > > > +   synchronize_sched();
> > >
> > > Shouldn't you move wake_up_all(&cpuhp_readers) down after
> > > synchronize_sched() (or add another one) ? To ensure that a reader can't
> > > see state = BLOCK after wakeup().
> >
> > Well, if they are blocked, the wake_up_all() will do an actual
> > try_to_wake_up() which issues a MB as per smp_mb__before_spinlock().
> 
> Yes. Everything is fine with the already blocked readers.
> 
> I meant the new reader which still can see state = BLOCK after we
> do wakeup(), but I didn't notice it should do __wait_event() which
> takes the lock unconditionally, so it must see the change after that.

Ah, because both __wake_up() and __wait_event()->prepare_to_wait() take
q->lock. Thereby matching the __wake_up() RELEASE to the __wait_event()
ACQUIRE, creating the full barrier.
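
(Editorial aside: a minimal user-space analogue of that RELEASE/ACQUIRE
publication through a lock, hypothetical illustration only, with a
pthread mutex standing in for q->lock; once the waiter observes the flag
under the lock, it must also observe the state written before the
unlock:)

#include <assert.h>
#include <pthread.h>

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static int state;	/* stands in for __cpuhp_state */
static int woken;	/* stands in for "a wake_up() happened" */

static void *waker(void *arg)
{
	state = 1;			/* readers_slow */
	pthread_mutex_lock(&q_lock);	/* __wake_up() takes q->lock */
	woken = 1;
	pthread_mutex_unlock(&q_lock);	/* RELEASE */
	return NULL;
}

static void *waiter(void *arg)
{
	int seen = 0;

	while (!seen) {
		pthread_mutex_lock(&q_lock);	/* prepare_to_wait(): ACQUIRE */
		seen = woken;
		pthread_mutex_unlock(&q_lock);
	}
	/* The re-test after the ACQUIRE must see the earlier store. */
	assert(state == 1);
	return NULL;
}

int main(void)
{
	pthread_t w, s;

	pthread_create(&w, NULL, waker, NULL);
	pthread_create(&s, NULL, waiter, NULL);
	pthread_join(w, NULL);
	pthread_join(s, NULL);
	return 0;
}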



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-26 Thread Oleg Nesterov
On 09/26, Peter Zijlstra wrote:
>
> On Thu, Sep 26, 2013 at 05:53:21PM +0200, Oleg Nesterov wrote:
> > On 09/26, Peter Zijlstra wrote:
> > >  void cpu_hotplug_done(void)
> > >  {
> > > - cpu_hotplug.active_writer = NULL;
> > > - mutex_unlock(&cpu_hotplug.lock);
> > > + /* Signal the writer is done, no fast path yet. */
> > > + __cpuhp_state = readers_slow;
> > > + wake_up_all(&cpuhp_readers);
> > > +
> > > + /*
> > > +  * The wait_event()/wake_up_all() prevents the race where the readers
> > > +  * are delayed between fetching __cpuhp_state and blocking.
> > > +  */
> > > +
> > > + /* See percpu_up_write(); readers will no longer attempt to block. */
> > > + synchronize_sched();
> >
> > Shouldn't you move wake_up_all(&cpuhp_readers) down after
> > synchronize_sched() (or add another one) ? To ensure that a reader can't
> > see state = BLOCK after wakeup().
>
> Well, if they are blocked, the wake_up_all() will do an actual
> try_to_wake_up() which issues a MB as per smp_mb__before_spinlock().

Yes. Everything is fine with the already blocked readers.

I meant the new reader which still can see state = BLOCK after we
do wakeup(), but I didn't notice it should do __wait_event() which
takes the lock unconditionally, so it must see the change after that.

> Right?

Yes, I was wrong, thanks.

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-26 Thread Peter Zijlstra
On Thu, Sep 26, 2013 at 05:53:21PM +0200, Oleg Nesterov wrote:
> On 09/26, Peter Zijlstra wrote:
> >  void cpu_hotplug_done(void)
> >  {
> > -   cpu_hotplug.active_writer = NULL;
> > -   mutex_unlock(&cpu_hotplug.lock);
> > +   /* Signal the writer is done, no fast path yet. */
> > +   __cpuhp_state = readers_slow;
> > +   wake_up_all(&cpuhp_readers);
> > +
> > +   /*
> > +* The wait_event()/wake_up_all() prevents the race where the readers
> > +* are delayed between fetching __cpuhp_state and blocking.
> > +*/
> > +
> > +   /* See percpu_up_write(); readers will no longer attempt to block. */
> > +   synchronize_sched();
> 
> Shouldn't you move wake_up_all(&cpuhp_readers) down after
> synchronize_sched() (or add another one) ? To ensure that a reader can't
> see state = BLOCK after wakeup().

Well, if they are blocked, the wake_up_all() will do an actual
try_to_wake_up() which issues a MB as per smp_mb__before_spinlock().

The woken task will get a MB from passing through the context switch to
make it actually run. And therefore, like Paul's comment says, it cannot
observe the previous BLOCK state but must indeed see the just issued
SLOW state.

Right?


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-26 Thread Peter Zijlstra
On Wed, Sep 25, 2013 at 02:22:00PM -0700, Paul E. McKenney wrote:
> A couple of nits and some commentary, but if there are races, they are
> quite subtle.  ;-)

*whee*..

I made one little change in the logic; I moved the waitcount increment
to before the __put_online_cpus() call, such that the writer will have
to wait for us to wake up before trying again -- not for us to actually
have acquired the read lock, for that we'd need to mess up
__get_online_cpus() a bit more.

Complete patch below.

---
Subject: hotplug: Optimize {get,put}_online_cpus()
From: Peter Zijlstra 
Date: Tue Sep 17 16:17:11 CEST 2013

The current implementation of get_online_cpus() is global of nature
and thus not suited for any kind of common usage.

Re-implement the current recursive r/w cpu hotplug lock such that the
read side locks are as light as possible.

The current cpu hotplug lock is entirely reader biased; but since
readers are expensive there aren't a lot of them about and writer
starvation isn't a particular problem.

However by making the reader side more usable there is a fair chance
it will get used more and thus the starvation issue becomes a real
possibility.

Therefore this new implementation is fair, alternating readers and
writers; this however requires per-task state to allow the reader
recursion.

Many comments are contributed by Paul McKenney, and many previous
attempts were shown to be inadequate by both Paul and Oleg; many
thanks to them for persisting to poke holes in my attempts.

Cc: Oleg Nesterov 
Cc: Paul McKenney 
Cc: Thomas Gleixner 
Cc: Steven Rostedt 
Signed-off-by: Peter Zijlstra 
---
 include/linux/cpu.h   |   58 +
 include/linux/sched.h |3 
 kernel/cpu.c  |  209 +++---
 kernel/sched/core.c   |2 
 4 files changed, 208 insertions(+), 64 deletions(-)

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct device;
 
@@ -173,10 +174,61 @@ extern struct bus_type cpu_subsys;
 #ifdef CONFIG_HOTPLUG_CPU
 /* Stop CPUs going up and down. */
 
+extern void cpu_hotplug_init_task(struct task_struct *p);
+
 extern void cpu_hotplug_begin(void);
 extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern int __cpuhp_state;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+   might_sleep();
+
+   /* Support reader recursion */
+   /* The value was >= 1 and remains so, reordering causes no harm. */
+   if (current->cpuhp_ref++)
+   return;
+
+   preempt_disable();
+   if (likely(!__cpuhp_state)) {
+   /* The barrier here is supplied by synchronize_sched(). */
+   __this_cpu_inc(__cpuhp_refcount);
+   } else {
+   __get_online_cpus(); /* Unconditional memory barrier. */
+   }
+   preempt_enable();
+   /*
+* The barrier() from preempt_enable() prevents the compiler from
+* bleeding the critical section out.
+*/
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+   /* The value was >= 1 and remains so, reordering causes no harm. */
+   if (--current->cpuhp_ref)
+   return;
+
+   /*
+* The barrier() in preempt_disable() prevents the compiler from
+* bleeding the critical section out.
+*/
+   preempt_disable();
+   if (likely(!__cpuhp_state)) {
+   /* The barrier here is supplied by synchronize_sched().  */
+   __this_cpu_dec(__cpuhp_refcount);
+   } else {
+   __put_online_cpus(); /* Unconditional memory barrier. */
+   }
+   preempt_enable();
+}
+
 extern void cpu_hotplug_disable(void);
 extern void cpu_hotplug_enable(void);
 #define hotcpu_notifier(fn, pri)   cpu_notifier(fn, pri)
@@ -200,6 +252,8 @@ static inline void cpu_hotplug_driver_un
 
 #else  /* CONFIG_HOTPLUG_CPU */
 
+static inline void cpu_hotplug_init_task(struct task_struct *p) {}
+
 static inline void cpu_hotplug_begin(void) {}
 static inline void cpu_hotplug_done(void) {}
 #define get_online_cpus()  do { } while (0)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1454,6 +1454,9 @@ struct task_struct {
unsigned intsequential_io;
unsigned intsequential_io_avg;
 #endif
+#ifdef CONFIG_HOTPLUG_CPU
+   int cpuhp_ref;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,173 @@ static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
-static struct {
-   struct task_struct *active_writer;
-   struct mutex lock; /* Synchronizes accesses to refcount, */
-   /*
-* Also blocks the new readers during
-* an ongoing cpu hotplug operation.
-*/
-   

Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-25 Thread Paul E. McKenney
On Wed, Sep 25, 2013 at 08:40:15PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 25, 2013 at 07:50:55PM +0200, Oleg Nesterov wrote:
> > No. Too tired too ;) damn LSB test failures...
> 
> 
> ok; I cobbled this together.. I might think better of it tomorrow, but
> for now I think I closed the hole before wait_event(readers_active())
> you pointed out -- of course I might have created new holes :/
> 
> For easy reading the + only version.

A couple of nits and some commentary, but if there are races, they are
quite subtle.  ;-)

Thanx, Paul

> ---
> +++ b/include/linux/cpu.h
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  struct device;
> 
> @@ -173,10 +174,50 @@ extern struct bus_type cpu_subsys;
>  #ifdef CONFIG_HOTPLUG_CPU
>  /* Stop CPUs going up and down. */
> 
> +extern void cpu_hotplug_init_task(struct task_struct *p);
> +
>  extern void cpu_hotplug_begin(void);
>  extern void cpu_hotplug_done(void);
> +
> +extern int __cpuhp_writer;
> +DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
> +
> +extern void __get_online_cpus(void);
> +
> +static inline void get_online_cpus(void)
> +{
> + might_sleep();
> +
> + /* Support reader-in-reader recursion */
> + if (current->cpuhp_ref++) {
> + barrier();

Oleg was right, this barrier() can go.  The value was >=1 and remains
>=1, so reordering causes no harm.  (See below.)

> + return;
> + }
> +
> + preempt_disable();
> + if (likely(!__cpuhp_writer))
> + __this_cpu_inc(__cpuhp_refcount);

The barrier required here is provided by synchronize_sched(), and
all the code is contained by the barrier()s in preempt_disable() and
preempt_enable().

> + else
> + __get_online_cpus();

And a memory barrier is unconditionally executed by __get_online_cpus().

> + preempt_enable();

The barrier() in preempt_enable() prevents the compiler from bleeding
the critical section out.

> +}
> +
> +extern void __put_online_cpus(void);
> +
> +static inline void put_online_cpus(void)
> +{
> + barrier();

This barrier() can also be dispensed with.

> + if (--current->cpuhp_ref)

If we leave here, the value was >=1 and remains >=1, so reordering does
no harm.

> + return;
> +
> + preempt_disable();

The barrier() in preempt_disable() prevents the compiler from bleeding
the critical section out.

> + if (likely(!__cpuhp_writer))
> + __this_cpu_dec(__cpuhp_refcount);

The barrier here is supplied by synchronize_sched().

> + else
> + __put_online_cpus();

And a memory barrier is unconditionally executed by __put_online_cpus().

> + preempt_enable();
> +}
> +
>  extern void cpu_hotplug_disable(void);
>  extern void cpu_hotplug_enable(void);
>  #define hotcpu_notifier(fn, pri) cpu_notifier(fn, pri)
> @@ -200,6 +241,8 @@ static inline void cpu_hotplug_driver_un
> 
>  #else/* CONFIG_HOTPLUG_CPU */
> 
> +static inline void cpu_hotplug_init_task(struct task_struct *p) {}
> +
>  static inline void cpu_hotplug_begin(void) {}
>  static inline void cpu_hotplug_done(void) {}
>  #define get_online_cpus()do { } while (0)
> +++ b/include/linux/sched.h
> @@ -1454,6 +1454,9 @@ struct task_struct {
>   unsigned intsequential_io;
>   unsigned intsequential_io_avg;
>  #endif
> +#ifdef CONFIG_HOTPLUG_CPU
> + int cpuhp_ref;
> +#endif
>  };
> 
>  /* Future-safe accessor for struct task_struct's cpus_allowed. */
> +++ b/kernel/cpu.c
> @@ -49,88 +49,140 @@ static int cpu_hotplug_disabled;
> 
>  #ifdef CONFIG_HOTPLUG_CPU
> 
> +enum { readers_fast = 0, readers_slow, readers_block };
> +
> +int __cpuhp_writer;
> +EXPORT_SYMBOL_GPL(__cpuhp_writer);
> +
> +DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
> +EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
> +
> +static DEFINE_PER_CPU(unsigned int, cpuhp_seq);
> +static atomic_t cpuhp_waitcount;
> +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
> +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
> +
> +void cpu_hotplug_init_task(struct task_struct *p)
> +{
> + p->cpuhp_ref = 0;
> +}
> 
> +void __get_online_cpus(void)
>  {
> +again:
> + /* See __srcu_read_lock() */
> + __this_cpu_inc(__cpuhp_refcount);
> + smp_mb(); /* A matches B, E */
> + __this_cpu_inc(cpuhp_seq);
> +
> + if (unlikely(__cpuhp_writer == readers_block)) {
> + __put_online_cpus();

Suppose we got delayed here for some time.  The writer might complete,
and be awakened by the blocked readers (we have not incremented our
counter yet).  We would then drop through, do the atomic_dec_and_test()
and deliver a spurious wake_up_all() at some random time in the future.

Which should be OK because __wait_event() looks to handle spurious
wake_up()s.

> + atomic_inc(&cpuhp_waitcount);
> +
> + /*
> +  * We either call schedule() in the wait, or we'll fall through
> +	 * and reschedule on the preempt_enable() in get_online_cpus().

Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-25 Thread Peter Zijlstra
On Wed, Sep 25, 2013 at 07:50:55PM +0200, Oleg Nesterov wrote:
> No. Too tired too ;) damn LSB test failures...


ok; I cobbled this together.. I might think better of it tomorrow, but
for now I think I closed the hole before wait_event(readers_active())
you pointed out -- of course I might have created new holes :/

For easy reading the + only version.

---
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct device;
 
@@ -173,10 +174,50 @@ extern struct bus_type cpu_subsys;
 #ifdef CONFIG_HOTPLUG_CPU
 /* Stop CPUs going up and down. */
 
+extern void cpu_hotplug_init_task(struct task_struct *p);
+
 extern void cpu_hotplug_begin(void);
 extern void cpu_hotplug_done(void);
+
+extern int __cpuhp_writer;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+   might_sleep();
+
+   /* Support reader-in-reader recursion */
+   if (current->cpuhp_ref++) {
+   barrier();
+   return;
+   }
+
+   preempt_disable();
+   if (likely(!__cpuhp_writer))
+   __this_cpu_inc(__cpuhp_refcount);
+   else
+   __get_online_cpus();
+   preempt_enable();
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+   barrier();
+   if (--current->cpuhp_ref)
+   return;
+
+   preempt_disable();
+   if (likely(!__cpuhp_writer))
+   __this_cpu_dec(__cpuhp_refcount);
+   else
+   __put_online_cpus();
+   preempt_enable();
+}
+
 extern void cpu_hotplug_disable(void);
 extern void cpu_hotplug_enable(void);
 #define hotcpu_notifier(fn, pri)   cpu_notifier(fn, pri)
@@ -200,6 +241,8 @@ static inline void cpu_hotplug_driver_un
 
 #else  /* CONFIG_HOTPLUG_CPU */
 
+static inline void cpu_hotplug_init_task(struct task_struct *p) {}
+
 static inline void cpu_hotplug_begin(void) {}
 static inline void cpu_hotplug_done(void) {}
 #define get_online_cpus()  do { } while (0)
+++ b/include/linux/sched.h
@@ -1454,6 +1454,9 @@ struct task_struct {
unsigned intsequential_io;
unsigned intsequential_io_avg;
 #endif
+#ifdef CONFIG_HOTPLUG_CPU
+   int cpuhp_ref;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
+++ b/kernel/cpu.c
@@ -49,88 +49,140 @@ static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
+enum { readers_fast = 0, readers_slow, readers_block };
+
+int __cpuhp_writer;
+EXPORT_SYMBOL_GPL(__cpuhp_writer);
+
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static DEFINE_PER_CPU(unsigned int, cpuhp_seq);
+static atomic_t cpuhp_waitcount;
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
+
+void cpu_hotplug_init_task(struct task_struct *p)
+{
+   p->cpuhp_ref = 0;
+}
 
+void __get_online_cpus(void)
 {
+again:
+   /* See __srcu_read_lock() */
+   __this_cpu_inc(__cpuhp_refcount);
+   smp_mb(); /* A matches B, E */
+   __this_cpu_inc(cpuhp_seq);
+
+   if (unlikely(__cpuhp_writer == readers_block)) {
+   __put_online_cpus();
+
+   atomic_inc(&cpuhp_waitcount);
+
+   /*
+* We either call schedule() in the wait, or we'll fall through
+* and reschedule on the preempt_enable() in get_online_cpus().
+*/
+   preempt_enable_no_resched();
+   __wait_event(cpuhp_readers, __cpuhp_writer != readers_block);
+   preempt_disable();
 
+   if (atomic_dec_and_test(&cpuhp_waitcount))
+   wake_up_all(&cpuhp_writer);
+
+   goto again;
+   }
+}
+EXPORT_SYMBOL_GPL(__get_online_cpus);
+
+void __put_online_cpus(void)
+{
+   /* See __srcu_read_unlock() */
+   smp_mb(); /* C matches D */
+   this_cpu_dec(__cpuhp_refcount);
+
+   /* Prod writer to recheck readers_active */
+   wake_up_all(&cpuhp_writer);
 }
+EXPORT_SYMBOL_GPL(__put_online_cpus);
 
+#define per_cpu_sum(var)   \
+({ \
+   typeof(var) __sum = 0;  \
+   int cpu;\
+   for_each_possible_cpu(cpu)  \
+   __sum += per_cpu(var, cpu); \
+   __sum;  \
+})
+
+/*
+ * See srcu_readers_active_idx_check()
+ */
+static bool cpuhp_readers_active_check(void)
 {
+   unsigned int seq = per_cpu_sum(cpuhp_seq);
+
+   smp_mb(); /* B matches A */
 
+   if (per_cpu_sum(__cpuhp_refcount) != 0)
+   return false;
 
+   smp_mb(); /* D matches C */
 
+   return per_cpu_sum(cpuhp_seq) == seq;

Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-25 Thread Oleg Nesterov
On 09/25, Peter Zijlstra wrote:
>
> On Wed, Sep 25, 2013 at 05:55:15PM +0200, Oleg Nesterov wrote:
>
> > > +static inline void get_online_cpus(void)
> > > +{
> > > + might_sleep();
> > > +
> > > + /* Support reader-in-reader recursion */
> > > + if (current->cpuhp_ref++) {
> > > + barrier();
> > > + return;
> > > + }
> > > +
> > > + preempt_disable();
> > > + if (likely(!__cpuhp_writer))
> > > + __this_cpu_inc(__cpuhp_refcount);
> >
> > mb() to ensure the reader can't miss, say, a STORE done inside
> > the cpu_hotplug_begin/end section.
> >
> > put_online_cpus() needs mb() as well.
>
> OK, I'm not getting this; why isn't the sync_sched sufficient to get out
> of this fast path without barriers?

Aah, sorry, I didn't notice this version has another synchronize_sched()
in cpu_hotplug_done().

Then I need to recheck again...

No. Too tired too ;) damn LSB test failures...

> > > + if (atomic_dec_and_test(&cpuhp_waitcount))
> > > + wake_up_all(&cpuhp_writer);
> >
> > Same problem as in previous version. __get_online_cpus() succeeds
> > without incrementing __cpuhp_refcount. "goto start" can't help
> > afaics.
>
> I added a goto into the cond-block, not before the cond; but see the
> version below.

"into the cond-block" doesn't look right too, at first glance. This
always succeeds, but by this time another writer can already hold
the lock.

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-25 Thread Peter Zijlstra
On Wed, Sep 25, 2013 at 05:55:15PM +0200, Oleg Nesterov wrote:
> On 09/24, Peter Zijlstra wrote:
> >
> > So now we drop from a no memory barriers fast path, into a memory
> > barrier 'slow' path into blocking.
> 
> Cough... can't understand the above ;) In fact I can't understand
> the patch... see below. But in any case, afaics the fast path
> needs mb() unless you add another synchronize_sched() into
> cpu_hotplug_done().

Sure we can add more ;-) But I went with percpu_up_write(), it too does
the sync_sched() before clearing the fast path state.

> > +static inline void get_online_cpus(void)
> > +{
> > +   might_sleep();
> > +
> > +   /* Support reader-in-reader recursion */
> > +   if (current->cpuhp_ref++) {
> > +   barrier();
> > +   return;
> > +   }
> > +
> > +   preempt_disable();
> > +   if (likely(!__cpuhp_writer))
> > +   __this_cpu_inc(__cpuhp_refcount);
> 
> mb() to ensure the reader can't miss, say, a STORE done inside
> the cpu_hotplug_begin/end section.
> 
> put_online_cpus() needs mb() as well.

OK, I'm not getting this; why isn't the sync_sched sufficient to get out
of this fast path without barriers?

> > +void __get_online_cpus(void)
> > +{
> > +   if (__cpuhp_writer == 1) {
> > +   /* See __srcu_read_lock() */
> > +   __this_cpu_inc(__cpuhp_refcount);
> > +   smp_mb();
> > +   __this_cpu_inc(cpuhp_seq);
> > +   return;
> > +   }
> 
> OK, cpuhp_seq should guarantee cpuhp_readers_active_check() gets
> the "stable" numbers. Looks suspicious... but lets assume this
> works.

I 'borrowed' it from SRCU, so if it's broken here it's broken there too I
suppose.
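
To make the borrowed pattern easier to follow, here is a stand-alone sketch
of that check in plain C. The arrays stand in for the per-cpu variables, the
names are made up for the example, and comments mark where the real code
relies on smp_mb(); this is an illustration, not the posted code.

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

static unsigned int cpuhp_refcount[NR_CPUS];    /* stands in for per-cpu __cpuhp_refcount */
static unsigned int cpuhp_seq[NR_CPUS];         /* stands in for per-cpu cpuhp_seq */

static unsigned int sum(const unsigned int *v)
{
        unsigned int s = 0;
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                s += v[cpu];
        return s;
}

/* Returns true when the writer may proceed (no readers hold a reference). */
static bool readers_active_check(void)
{
        unsigned int seq = sum(cpuhp_seq);

        /* the real code has smp_mb() here: B matches A */

        if (sum(cpuhp_refcount) != 0)
                return false;           /* some reader is definitely inside */

        /* the real code has smp_mb() here: D matches C */

        /*
         * Any reader that entered while we were summing also bumped
         * cpuhp_seq, so the two seq sums differ and the writer keeps
         * waiting instead of trusting a stale zero.
         */
        return sum(cpuhp_seq) == seq;
}

int main(void)
{
        /* model a reader taking the slow-path reference on CPU 1 */
        cpuhp_refcount[1]++;
        cpuhp_seq[1]++;
        printf("writer may proceed: %d\n", readers_active_check());    /* 0 */

        /* model the reader dropping its reference */
        cpuhp_refcount[1]--;
        printf("writer may proceed: %d\n", readers_active_check());    /* 1 */
        return 0;
}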

> However, I do not see how "__cpuhp_writer == 1" can work, please
> see below.
> 
> > +   if (atomic_dec_and_test(&cpuhp_waitcount))
> > +   wake_up_all(&cpuhp_writer);
> 
> Same problem as in previous version. __get_online_cpus() succeeds
> without incrementing __cpuhp_refcount. "goto start" can't help
> afaics.

I added a goto into the cond-block, not before the cond; but see the
version below.

> >  void cpu_hotplug_begin(void)
> >  {
> > +   unsigned int count = 0;
> > +   int cpu;
> >  
> > +   lockdep_assert_held(&cpu_add_remove_lock);
> > +
> > +   /* allow reader-in-writer recursion */
> > +   current->cpuhp_ref++;
> > +
> > +   /* make readers take the slow path */
> > +   __cpuhp_writer = 1;
> > +
> > +   /* See percpu_down_write() */
> > +   synchronize_sched();
> 
> Suppose there are no readers at this point,
> 
> > +
> > +   /* make readers block */
> > +   __cpuhp_writer = 2;
> > +
> > +   /* Wait for all readers to go away */
> > +   wait_event(cpuhp_writer, cpuhp_readers_active_check());
> 
> So wait_event() "quickly" returns.
> 
> Now. Why the new reader should see __cpuhp_writer = 2 ? It can
> still see it == 1, and take that "if (__cpuhp_writer == 1)" path
> above.

OK, .. I see the hole, no immediate way to fix it -- too tired atm.


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-25 Thread Paul E. McKenney
On Wed, Sep 25, 2013 at 05:55:15PM +0200, Oleg Nesterov wrote:
> On 09/24, Peter Zijlstra wrote:
> >
> > So now we drop from a no memory barriers fast path, into a memory
> > barrier 'slow' path into blocking.
> 
> Cough... can't understand the above ;) In fact I can't understand
> the patch... see below. But in any case, afaics the fast path
> needs mb() unless you add another synchronize_sched() into
> cpu_hotplug_done().

For whatever it is worth, I too don't see how it works without read-side
memory barriers.

Thanx, Paul

> > +static inline void get_online_cpus(void)
> > +{
> > +   might_sleep();
> > +
> > +   /* Support reader-in-reader recursion */
> > +   if (current->cpuhp_ref++) {
> > +   barrier();
> > +   return;
> > +   }
> > +
> > +   preempt_disable();
> > +   if (likely(!__cpuhp_writer))
> > +   __this_cpu_inc(__cpuhp_refcount);
> 
> mb() to ensure the reader can't miss, say, a STORE done inside
> the cpu_hotplug_begin/end section.
> 
> put_online_cpus() needs mb() as well.
> 
> > +void __get_online_cpus(void)
> > +{
> > +   if (__cpuhp_writer == 1) {
> > +   /* See __srcu_read_lock() */
> > +   __this_cpu_inc(__cpuhp_refcount);
> > +   smp_mb();
> > +   __this_cpu_inc(cpuhp_seq);
> > +   return;
> > +   }
> 
> OK, cpuhp_seq should guarantee cpuhp_readers_active_check() gets
> the "stable" numbers. Looks suspicious... but lets assume this
> works.
> 
> However, I do not see how "__cpuhp_writer == 1" can work, please
> see below.
> 
> > +   /*
> > +* XXX list_empty_careful(&cpuhp_readers.task_list) ?
> > +*/
> > +   if (atomic_dec_and_test(&cpuhp_waitcount))
> > +   wake_up_all(&cpuhp_writer);
> 
> Same problem as in previous version. __get_online_cpus() succeeds
> without incrementing __cpuhp_refcount. "goto start" can't help
> afaics.
> 
> >  void cpu_hotplug_begin(void)
> >  {
> > -   cpu_hotplug.active_writer = current;
> > +   unsigned int count = 0;
> > +   int cpu;
> >  
> > -   for (;;) {
> > -   mutex_lock(&cpu_hotplug.lock);
> > -   if (likely(!cpu_hotplug.refcount))
> > -   break;
> > -   __set_current_state(TASK_UNINTERRUPTIBLE);
> > -   mutex_unlock(&cpu_hotplug.lock);
> > -   schedule();
> > -   }
> > +   lockdep_assert_held(&cpu_add_remove_lock);
> > +
> > +   /* allow reader-in-writer recursion */
> > +   current->cpuhp_ref++;
> > +
> > +   /* make readers take the slow path */
> > +   __cpuhp_writer = 1;
> > +
> > +   /* See percpu_down_write() */
> > +   synchronize_sched();
> 
> Suppose there are no readers at this point,
> 
> > +
> > +   /* make readers block */
> > +   __cpuhp_writer = 2;
> > +
> > +   /* Wait for all readers to go away */
> > +   wait_event(cpuhp_writer, cpuhp_readers_active_check());
> 
> So wait_event() "quickly" returns.
> 
> Now. Why the new reader should see __cpuhp_writer = 2 ? It can
> still see it == 1, and take that "if (__cpuhp_writer == 1)" path
> above.
> 
> Oleg.
> 



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-25 Thread Oleg Nesterov
On 09/25, Peter Zijlstra wrote:
>
> On Wed, Sep 25, 2013 at 05:16:42PM +0200, Oleg Nesterov wrote:
>
> > And in this case (I think) we do not care, we are already in the critical
> > section.
>
> I tend to agree, however paranoia..

Ah, in this case I tend to agree. better be paranoid ;)

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-25 Thread Oleg Nesterov
On 09/24, Peter Zijlstra wrote:
>
> So now we drop from a no memory barriers fast path, into a memory
> barrier 'slow' path into blocking.

Cough... can't understand the above ;) In fact I can't understand
the patch... see below. But in any case, afaics the fast path
needs mb() unless you add another synchronize_sched() into
cpu_hotplug_done().

> +static inline void get_online_cpus(void)
> +{
> + might_sleep();
> +
> + /* Support reader-in-reader recursion */
> + if (current->cpuhp_ref++) {
> + barrier();
> + return;
> + }
> +
> + preempt_disable();
> + if (likely(!__cpuhp_writer))
> + __this_cpu_inc(__cpuhp_refcount);

mb() to ensure the reader can't miss, say, a STORE done inside
the cpu_hotplug_begin/end section.

put_online_cpus() needs mb() as well.

> +void __get_online_cpus(void)
> +{
> + if (__cpuhp_writer == 1) {
> + /* See __srcu_read_lock() */
> + __this_cpu_inc(__cpuhp_refcount);
> + smp_mb();
> + __this_cpu_inc(cpuhp_seq);
> + return;
> + }

OK, cpuhp_seq should guarantee cpuhp_readers_active_check() gets
the "stable" numbers. Looks suspicious... but lets assume this
works.

However, I do not see how "__cpuhp_writer == 1" can work, please
see below.

> + /*
> +  * XXX list_empty_careful(&cpuhp_readers.task_list) ?
> +  */
> + if (atomic_dec_and_test(&cpuhp_waitcount))
> + wake_up_all(&cpuhp_writer);

Same problem as in previous version. __get_online_cpus() succeeds
without incrementing __cpuhp_refcount. "goto start" can't help
afaics.

>  void cpu_hotplug_begin(void)
>  {
> - cpu_hotplug.active_writer = current;
> + unsigned int count = 0;
> + int cpu;
>  
> - for (;;) {
> - mutex_lock(&cpu_hotplug.lock);
> - if (likely(!cpu_hotplug.refcount))
> - break;
> - __set_current_state(TASK_UNINTERRUPTIBLE);
> - mutex_unlock(&cpu_hotplug.lock);
> - schedule();
> - }
> + lockdep_assert_held(&cpu_add_remove_lock);
> +
> + /* allow reader-in-writer recursion */
> + current->cpuhp_ref++;
> +
> + /* make readers take the slow path */
> + __cpuhp_writer = 1;
> +
> + /* See percpu_down_write() */
> + synchronize_sched();

Suppose there are no readers at this point,

> +
> + /* make readers block */
> + __cpuhp_writer = 2;
> +
> + /* Wait for all readers to go away */
> + wait_event(cpuhp_writer, cpuhp_readers_active_check());

So wait_event() "quickly" returns.

Now. Why the new reader should see __cpuhp_writer = 2 ? It can
still see it == 1, and take that "if (__cpuhp_writer == 1)" path
above.

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-25 Thread Peter Zijlstra
On Wed, Sep 25, 2013 at 05:16:42PM +0200, Oleg Nesterov wrote:
> Yes, but my point was, this can only happen in recursive fast path.

Right, I understood.

> And in this case (I think) we do not care, we are already in the critical
> section.

I tend to agree, however paranoia..

> OK, please forget. I guess I will never understand this ;)

It might just be I'm less certain about there not being any avenue of
mischief.


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-25 Thread Oleg Nesterov
On 09/24, Peter Zijlstra wrote:
>
> On Tue, Sep 24, 2013 at 08:00:05PM +0200, Oleg Nesterov wrote:
> >
> > Yes, we need to ensure gcc doesn't reorder this code so that
> > do_something() comes before get_online_cpus(). But it can't? At least
> > it should check current->cpuhp_ref != 0 first? And if it is non-zero
> > we do not really care, we are already in the critical section and
> > this ->cpuhp_ref has only meaning in put_online_cpus().
> >
> > Confused...
>
>
> So the reason I put it in was because of the inline; it could possibly
> make it do:

[...snip...]

> In which case the recursive fast path doesn't have a barrier() between
> taking the ref and starting do_something().

Yes, but my point was, this can only happen in recursive fast path.
And in this case (I think) we do not care, we are already in the critical
section.

current->cpuhp_ref doesn't matter at all until we call put_online_cpus().

Suppose that gcc knows for sure that current->cpuhp_ref != 0. Then I
think, for example,

get_online_cpus();
do_something();
put_online_cpus();

converted to

do_something();
current->cpuhp_ref++;
current->cpuhp_ref--;

is fine. do_something() should not depend on ->cpuhp_ref.

OK, please forget. I guess I will never understand this ;)

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-24 Thread Paul E. McKenney
On Tue, Sep 24, 2013 at 06:09:59PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 24, 2013 at 07:42:36AM -0700, Paul E. McKenney wrote:
> > > +#define cpuhp_writer_wake()  
> > > \
> > > + wake_up_process(cpuhp_writer_task)
> > > +
> > > +#define cpuhp_writer_wait(cond)  
> > > \
> > > +do { 
> > > \
> > > + for (;;) {  \
> > > + set_current_state(TASK_UNINTERRUPTIBLE);\
> > > + if (cond)   \
> > > + break;  \
> > > + schedule(); \
> > > + }   \
> > > + __set_current_state(TASK_RUNNING);  \
> > > +} while (0)
> > 
> > Why not wait_event()?  Presumably the above is a bit lighter weight,
> > but is that even something that can be measured?
> 
> I didn't want to mix readers and writers on cpuhp_wq, and I suppose I
> could create a second waitqueue; that might also be a better solution
> for the NULL thing below.

That would have the advantage of being a bit less racy.

> > > + atomic_inc(&cpuhp_waitcount);
> > > +
> > > + /*
> > > +  * We either call schedule() in the wait, or we'll fall through
> > > +  * and reschedule on the preempt_enable() in get_online_cpus().
> > > +  */
> > > + preempt_enable_no_resched();
> > > + wait_event(cpuhp_wq, !__cpuhp_writer);
> > 
> > Finally!  A good use for preempt_enable_no_resched().  ;-)
> 
> Hehe, there were a few others, but tglx removed most with the
> schedule_preempt_disabled() primitive.

;-)

> In fact, I considered a wait_event_preempt_disabled() but was too lazy.
> That whole wait_event macro fest looks like it could use an iteration or
> two of collapse anyhow.

There are some serious layers there, aren't there?

> > > + preempt_disable();
> > > +
> > > + /*
> > > +  * It would be possible for cpu_hotplug_done() to complete before
> > > +  * the atomic_inc() above; in which case there is no writer waiting
> > > +  * and doing a wakeup would be BAD (tm).
> > > +  *
> > > +  * If however we still observe cpuhp_writer_task here we know
> > > +  * cpu_hotplug_done() is currently stuck waiting for cpuhp_waitcount.
> > > +  */
> > > + if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)
> > 
> > OK, I'll bite...  What sequence of events results in the
> > atomic_dec_and_test() returning true but there being no
> > cpuhp_writer_task?
> > 
> > Ah, I see it...
> 
> 
> 
> Indeed, and
> 
> > But what prevents the following sequence of events?
> 
> 
> 
> > o   Task B's call to cpuhp_writer_wake() sees a NULL pointer.
> 
> Quite so.. nothing. See there was a reason I kept being confused about
> it.
> 
> > >  void cpu_hotplug_begin(void)
> > >  {
> > > + unsigned int count = 0;
> > > + int cpu;
> > > +
> > > + lockdep_assert_held(&cpu_add_remove_lock);
> > > 
> > > + __cpuhp_writer = 1;
> > > + cpuhp_writer_task = current;
> > 
> > At this point, the value of cpuhp_slowcount can go negative.  Can't see
> > that this causes a problem, given the atomic_add() below.
> 
> Agreed.
> 
> > > +
> > > + /* After this everybody will observe writer and take the slow path. */
> > > + synchronize_sched();
> > > +
> > > + /* Collapse the per-cpu refcount into slowcount */
> > > + for_each_possible_cpu(cpu) {
> > > + count += per_cpu(__cpuhp_refcount, cpu);
> > > + per_cpu(__cpuhp_refcount, cpu) = 0;
> > >   }
> > 
> > The above is safe because the readers are no longer changing their
> > __cpuhp_refcount values.
> 
> Yes, I'll expand the comment.
> 
> So how about something like this?

A few memory barriers required, if I am reading the code correctly.
Some of them, perhaps all of them, called out by Oleg.

Thanx, Paul

> ---
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  struct device;
> 
> @@ -173,10 +174,50 @@ extern struct bus_type cpu_subsys;
>  #ifdef CONFIG_HOTPLUG_CPU
>  /* Stop CPUs going up and down. */
> 
> +extern void cpu_hotplug_init_task(struct task_struct *p);
> +
>  extern void cpu_hotplug_begin(void);
>  extern void cpu_hotplug_done(void);
> -extern void get_online_cpus(void);
> -extern void put_online_cpus(void);
> +
> +extern struct task_struct *__cpuhp_writer;
> +DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
> +
> +extern void __get_online_cpus(void);
> +
> +static inline void get_online_cpus(void)
> +{
> + might_sleep();
> +
> + /* Support reader-in-reader recursion */
> + if (current->cpuhp_ref++) {
> + barrier();
> + return;
> + }
> +
> + preempt_disable();
> + if (likely(!__cpuhp_wri

Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-24 Thread Peter Zijlstra
On Tue, Sep 24, 2013 at 10:24:23PM +0200, Peter Zijlstra wrote:
> +void __get_online_cpus(void)
> +{
> + if (__cpuhp_writer == 1) {
take_ref:
> + /* See __srcu_read_lock() */
> + __this_cpu_inc(__cpuhp_refcount);
> + smp_mb();
> + __this_cpu_inc(cpuhp_seq);
> + return;
> + }
> +
> + atomic_inc(&cpuhp_waitcount);
> +
>   /*
> +  * We either call schedule() in the wait, or we'll fall through
> +  * and reschedule on the preempt_enable() in get_online_cpus().
>*/
> + preempt_enable_no_resched();
> + wait_event(cpuhp_readers, !__cpuhp_writer);
> + preempt_disable();
>  
> + /*
> +  * XXX list_empty_careful(&cpuhp_readers.task_list) ?
> +  */
> + if (atomic_dec_and_test(&cpuhp_waitcount))
> + wake_up_all(&cpuhp_writer);
goto take_ref;
> +}
> +EXPORT_SYMBOL_GPL(__get_online_cpus);

It would probably be a good idea to increment __cpuhp_refcount after the
wait_event.


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-24 Thread Peter Zijlstra
On Tue, Sep 24, 2013 at 08:00:05PM +0200, Oleg Nesterov wrote:
> On 09/24, Paul E. McKenney wrote:
> >
> > On Tue, Sep 24, 2013 at 07:06:31PM +0200, Oleg Nesterov wrote:
> > >
> > > If gcc can actually do something wrong, then I suspect this barrier()
> > > should be unconditional.
> >
> > If you are saying that there should be a barrier() on all return paths
> > from get_online_cpus(), I agree.
> 
> Paul, Peter, could you provide any (even completely artificial) example
> to explain me why do we need this barrier() ? I am puzzled. And
> preempt_enable() already has barrier...
> 
>   get_online_cpus();
>   do_something();
> 
> Yes, we need to ensure gcc doesn't reorder this code so that
> do_something() comes before get_online_cpus(). But it can't? At least
> it should check current->cpuhp_ref != 0 first? And if it is non-zero
> we do not really care, we are already in the critical section and
> this ->cpuhp_ref has only meaning in put_online_cpus().
> 
> Confused...


So the reason I put it in was because of the inline; it could possibly
make it do:

  test  0, current->cpuhp_ref
  je    label1:
  inc   current->cpuhp_ref

label2:
  do_something();

label1:
  inc   %gs:__preempt_count
  test  0, __cpuhp_writer
  jne   label3
  inc   %gs:__cpuhp_refcount
label5:
  dec   %gs:__preempt_count
  je    label4
  jmp   label2
label3:
  call  __get_online_cpus();
  jmp   label5
label4:
  call  preempt_schedule();
  jmp   label2

In which case the recursive fast path doesn't have a barrier() between
taking the ref and starting do_something().

I wanted to make absolutely sure nothing of do_something leaked before
the label2 thing. The other labels all have barrier() from the
preempt_count ops.


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-24 Thread Peter Zijlstra
On Mon, Sep 23, 2013 at 07:32:03PM +0200, Oleg Nesterov wrote:
> > static void cpuhp_wait_refcount(void)
> > {
> > for (;;) {
> > unsigned int rc1, rc2;
> >
> > rc1 = cpuhp_refcount();
> > set_current_state(TASK_UNINTERRUPTIBLE); /* MB */
> > rc2 = cpuhp_refcount();
> >
> > if (rc1 == rc2 && !rc1)
> 
> But this only makes the race above "theoretical ** 2". Both
> cpuhp_refcount()'s can be equally fooled.
> 
> Looks like, cpuhp_refcount() should take all per-cpu cpuhp_lock's
> before it reads __cpuhp_refcount.

Ah, so SRCU has a solution for this using a sequence count.

So now we drop from a no memory barriers fast path, into a memory
barrier 'slow' path into blocking.

Only once we block do we hit global state..

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct device;
 
@@ -173,10 +174,50 @@ extern struct bus_type cpu_subsys;
 #ifdef CONFIG_HOTPLUG_CPU
 /* Stop CPUs going up and down. */
 
+extern void cpu_hotplug_init_task(struct task_struct *p);
+
 extern void cpu_hotplug_begin(void);
 extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern int __cpuhp_writer;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+   might_sleep();
+
+   /* Support reader-in-reader recursion */
+   if (current->cpuhp_ref++) {
+   barrier();
+   return;
+   }
+
+   preempt_disable();
+   if (likely(!__cpuhp_writer))
+   __this_cpu_inc(__cpuhp_refcount);
+   else
+   __get_online_cpus();
+   preempt_enable();
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+   barrier();
+   if (--current->cpuhp_ref)
+   return;
+
+   preempt_disable();
+   if (likely(!__cpuhp_writer))
+   __this_cpu_dec(__cpuhp_refcount);
+   else
+   __put_online_cpus();
+   preempt_enable();
+}
+
 extern void cpu_hotplug_disable(void);
 extern void cpu_hotplug_enable(void);
 #define hotcpu_notifier(fn, pri)   cpu_notifier(fn, pri)
@@ -200,6 +241,8 @@ static inline void cpu_hotplug_driver_un
 
 #else  /* CONFIG_HOTPLUG_CPU */
 
+static inline void cpu_hotplug_init_task(struct task_struct *p) {}
+
 static inline void cpu_hotplug_begin(void) {}
 static inline void cpu_hotplug_done(void) {}
 #define get_online_cpus()  do { } while (0)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1454,6 +1454,9 @@ struct task_struct {
unsigned intsequential_io;
unsigned intsequential_io_avg;
 #endif
+#ifdef CONFIG_HOTPLUG_CPU
+   int cpuhp_ref;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,148 @@ static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
-static struct {
-   struct task_struct *active_writer;
-   struct mutex lock; /* Synchronizes accesses to refcount, */
+int __cpuhp_writer;
+EXPORT_SYMBOL_GPL(__cpuhp_writer);
+
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static DEFINE_PER_CPU(unsigned int, cpuhp_seq);
+static atomic_t cpuhp_waitcount;
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
+
+void cpu_hotplug_init_task(struct task_struct *p)
+{
+   p->cpuhp_ref = 0;
+}
+
+void __get_online_cpus(void)
+{
+   if (__cpuhp_writer == 1) {
+   /* See __srcu_read_lock() */
+   __this_cpu_inc(__cpuhp_refcount);
+   smp_mb();
+   __this_cpu_inc(cpuhp_seq);
+   return;
+   }
+
+   atomic_inc(&cpuhp_waitcount);
+
/*
-* Also blocks the new readers during
-* an ongoing cpu hotplug operation.
+* We either call schedule() in the wait, or we'll fall through
+* and reschedule on the preempt_enable() in get_online_cpus().
 */
-   int refcount;
-} cpu_hotplug = {
-   .active_writer = NULL,
-   .lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
-   .refcount = 0,
-};
+   preempt_enable_no_resched();
+   wait_event(cpuhp_readers, !__cpuhp_writer);
+   preempt_disable();
 
-void get_online_cpus(void)
+   /*
+* XXX list_empty_careful(&cpuhp_readers.task_list) ?
+*/
+   if (atomic_dec_and_test(&cpuhp_waitcount))
+   wake_up_all(&cpuhp_writer);
+}
+EXPORT_SYMBOL_GPL(__get_online_cpus);
+
+void __put_online_cpus(void)
 {
-   might_sleep();
-   if (cpu_hotplug.active_writer == current)
-   return;
-   mutex_lock(&cpu_hotplug.lock);
-   cpu_hotplug.refcount++;
-   mutex_unlock(&cpu_hotplug.lock);
+   /* See __srcu_read_unlock() */
+   smp_mb();
+

Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-24 Thread Oleg Nesterov
On 09/24, Paul E. McKenney wrote:
>
> On Tue, Sep 24, 2013 at 07:06:31PM +0200, Oleg Nesterov wrote:
> >
> > If gcc can actually do something wrong, then I suspect this barrier()
> > should be unconditional.
>
> If you are saying that there should be a barrier() on all return paths
> from get_online_cpus(), I agree.

Paul, Peter, could you provide any (even completely artificial) example
to explain me why do we need this barrier() ? I am puzzled. And
preempt_enable() already has barrier...

get_online_cpus();
do_something();

Yes, we need to ensure gcc doesn't reorder this code so that
do_something() comes before get_online_cpus(). But it can't? At least
it should check current->cpuhp_ref != 0 first? And if it is non-zero
we do not really care, we are already in the critical section and
this ->cpuhp_ref has only meaning in put_online_cpus().

Confused...

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-24 Thread Paul E. McKenney
On Tue, Sep 24, 2013 at 07:06:31PM +0200, Oleg Nesterov wrote:
> On 09/24, Steven Rostedt wrote:
> >
> > On Tue, 24 Sep 2013 18:03:59 +0200
> > Oleg Nesterov  wrote:
> >
> > > On 09/24, Peter Zijlstra wrote:
> > > >
> > > > +static inline void get_online_cpus(void)
> > > > +{
> > > > +   might_sleep();
> > > > +
> > > > +   if (current->cpuhp_ref++) {
> > > > +   barrier();
> > > > +   return;
> > >
> > > I don't understand this barrier()... we are going to return if we already
> > > hold the lock, do we really need it?
> >
> > I'm confused too. Unless gcc moves this after the release, but the
> > release uses preempt_disable() which is its own barrier.
> >
> > If anything, it requires a comment.
> 
> And I am still confused even after emails from Paul and Peter...
> 
> If gcc can actually do something wrong, then I suspect this barrier()
> should be unconditional.

If you are saying that there should be a barrier() on all return paths
from get_online_cpus(), I agree.

Thanx, Paul



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-24 Thread Oleg Nesterov
On 09/24, Steven Rostedt wrote:
>
> On Tue, 24 Sep 2013 18:03:59 +0200
> Oleg Nesterov  wrote:
>
> > On 09/24, Peter Zijlstra wrote:
> > >
> > > +static inline void get_online_cpus(void)
> > > +{
> > > + might_sleep();
> > > +
> > > + if (current->cpuhp_ref++) {
> > > + barrier();
> > > + return;
> >
> > I don't understand this barrier()... we are going to return if we already
> > hold the lock, do we really need it?
>
> I'm confused too. Unless gcc moves this after the release, but the
> release uses preempt_disable() which is its own barrier.
>
> If anything, it requires a comment.

And I am still confused even after emails from Paul and Peter...

If gcc can actually do something wrong, then I suspect this barrier()
should be unconditional.

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-24 Thread Oleg Nesterov
On 09/24, Peter Zijlstra wrote:
>
> On Tue, Sep 24, 2013 at 09:49:00AM -0700, Paul E. McKenney wrote:
> > > >  void cpu_hotplug_done(void)
> > > >  {
> > > > +   /* Signal the writer is done */
> > > > +   cpuhp_writer = 0;
> > > > +   wake_up_all(&cpuhp_wq);
> > > > +
> > > > +   /* Wait for any pending readers to be running */
> > > > +   cpuhp_writer_wait(!atomic_read(&cpuhp_waitcount));
> > > > +   cpuhp_writer_task = NULL;
> > >
> > > We also need to ensure that the next reader should see all changes
> > > done by the writer, iow this lacks "release" semantics.
> >
> > Good point -- I was expecting wake_up_all() to provide the release
> > semantics, but code could be reordered into __wake_up()'s critical
> > section, especially in the case where there was nothing to wake
> > up, but where there were new readers starting concurrently with
> > cpu_hotplug_done().
>
> Doh, indeed. I missed this in Oleg's email, but yes I made that same
> assumption about wake_up_all().

Well, I think this is even worse... No matter what the writer does,
the new reader needs mb() after it checks !__cpuhp_writer. Or we
need another synchronize_sched() in cpu_hotplug_done(). This is
what percpu_rw_semaphore() does (to remind, this can be turned into
call_rcu).

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-24 Thread Peter Zijlstra
On Tue, Sep 24, 2013 at 09:49:00AM -0700, Paul E. McKenney wrote:
> > >  void cpu_hotplug_done(void)
> > >  {
> > > + /* Signal the writer is done */
> > > + cpuhp_writer = 0;
> > > + wake_up_all(&cpuhp_wq);
> > > +
> > > + /* Wait for any pending readers to be running */
> > > + cpuhp_writer_wait(!atomic_read(&cpuhp_waitcount));
> > > + cpuhp_writer_task = NULL;
> > 
> > We also need to ensure that the next reader should see all changes
> > done by the writer, iow this lacks "release" semantics.
> 
> Good point -- I was expecting wake_up_all() to provide the release
> semantics, but code could be reordered into __wake_up()'s critical
> section, especially in the case where there was nothing to wake
> up, but where there were new readers starting concurrently with
> cpu_hotplug_done().

Doh, indeed. I missed this in Oleg's email, but yes I made that same
assumption about wake_up_all().


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-24 Thread Peter Zijlstra
On Tue, Sep 24, 2013 at 06:03:59PM +0200, Oleg Nesterov wrote:
> On 09/24, Peter Zijlstra wrote:
> >
> > +static inline void get_online_cpus(void)
> > +{
> > +   might_sleep();
> > +
> > +   if (current->cpuhp_ref++) {
> > +   barrier();
> > +   return;
> 
> I don't understand this barrier()... we are going to return if we already
> hold the lock, do we really need it?
> 
> The same for put_online_cpus().

to make {get,put}_online_cpus() always behave like per-cpu lock
sections.

I don't think its ever 'correct' for loads/stores to escape the section,
even if not strictly harmful.
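
For reference, the barrier() in question is the usual gcc compiler barrier,
roughly as defined in include/linux/compiler-gcc.h; it constrains only the
compiler's ordering, not the CPU's:

#define barrier() __asm__ __volatile__("" : : : "memory")

So it is enough to keep loads/stores from escaping the section at compile
time; CPU-level ordering is what the smp_mb()/synchronize_sched() pairing
discussed elsewhere in the thread is for.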

> > +void __get_online_cpus(void)
> >  {
> > -   if (cpu_hotplug.active_writer == current)
> > +   if (cpuhp_writer_task == current)
> > return;
> 
> Probably it would be better to simply inc/dec ->cpuhp_ref in
> cpu_hotplug_begin/end and remove this check here and in
> __put_online_cpus().

Oh indeed!

> > + if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)
> > + cpuhp_writer_wake();
> 
> cpuhp_writer_wake() here and in __put_online_cpus() looks racy...

Yeah it is. Paul already said.

> But, Peter, the main question is, why this is better than
> percpu_rw_semaphore performance-wise? (Assuming we add
> task_struct->cpuhp_ref).
> 
> If the writer is pending, percpu_down_read() does
> 
>   down_read(&brw->rw_sem);
>   atomic_inc(&brw->slow_read_ctr);
>   __up_read(&brw->rw_sem);
> 
> is it really much worse than wait_event + atomic_dec_and_test?
> 
> And! please note that with your implementation the new readers will
> be likely blocked while the writer sleeps in synchronize_sched().
> This doesn't happen with percpu_rw_semaphore.

Good points both, no I don't think there's a significant performance gap
there.

I'm still hoping we can come up with something better though :/ I don't
particularly like either.




Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-24 Thread Paul E. McKenney
On Tue, Sep 24, 2013 at 06:03:59PM +0200, Oleg Nesterov wrote:
> On 09/24, Peter Zijlstra wrote:
> >
> > +static inline void get_online_cpus(void)
> > +{
> > +   might_sleep();
> > +
> > +   if (current->cpuhp_ref++) {
> > +   barrier();
> > +   return;
> 
> I don't understand this barrier()... we are going to return if we already
> hold the lock, do we really need it?
> 
> The same for put_online_cpus().

The barrier() is needed because of the possibility of inlining, right?

> > +void __get_online_cpus(void)
> >  {
> > -   if (cpu_hotplug.active_writer == current)
> > +   if (cpuhp_writer_task == current)
> > return;
> 
> Probably it would be better to simply inc/dec ->cpuhp_ref in
> cpu_hotplug_begin/end and remove this check here and in
> __put_online_cpus().
> 
> This also means that the writer doing get/put_online_cpus() will
> always use the fast path, and __cpuhp_writer can go away,
> cpuhp_writer_task != NULL can be used instead.

I would need to see the code for this change to be sure.  ;-)

> > + atomic_inc(&cpuhp_waitcount);
> > +
> > + /*
> > +  * We either call schedule() in the wait, or we'll fall through
> > +  * and reschedule on the preempt_enable() in get_online_cpus().
> > +  */
> > + preempt_enable_no_resched();
> > + wait_event(cpuhp_wq, !__cpuhp_writer);
> > + preempt_disable();
> > +
> > + /*
> > +  * It would be possible for cpu_hotplug_done() to complete before
> > +  * the atomic_inc() above; in which case there is no writer waiting
> > +  * and doing a wakeup would be BAD (tm).
> > +  *
> > +  * If however we still observe cpuhp_writer_task here we know
> > +  * cpu_hotplug_done() is currently stuck waiting for cpuhp_waitcount.
> > +  */
> > + if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)
> > + cpuhp_writer_wake();
> 
> cpuhp_writer_wake() here and in __put_online_cpus() looks racy...
> Not only cpuhp_writer_wake() can hit cpuhp_writer_task == NULL (we need
> something like ACCESS_ONCE()), its task_struct can be already freed/reused
> if the writer exits.
> 
> And I don't really understand the logic... This slow path succeeds without
> incrementing any counter (except current->cpuhp_ref)? How can the next writer
> notice that it should wait for this reader?
> 
> >  void cpu_hotplug_done(void)
> >  {
> > -   cpu_hotplug.active_writer = NULL;
> > -   mutex_unlock(&cpu_hotplug.lock);
> > +   /* Signal the writer is done */
> > +   cpuhp_writer = 0;
> > +   wake_up_all(&cpuhp_wq);
> > +
> > +   /* Wait for any pending readers to be running */
> > +   cpuhp_writer_wait(!atomic_read(&cpuhp_waitcount));
> > +   cpuhp_writer_task = NULL;
> 
> We also need to ensure that the next reader should see all changes
> done by the writer, iow this lacks "release" semantics.

Good point -- I was expecting wake_up_all() to provide the release
semantics, but code could be reordered into __wake_up()'s critical
section, especially in the case where there was nothing to wake
up, but where there were new readers starting concurrently with
cpu_hotplug_done().

> But, Peter, the main question is, why this is better than
> percpu_rw_semaphore performance-wise? (Assuming we add
> task_struct->cpuhp_ref).
> 
> If the writer is pending, percpu_down_read() does
> 
>   down_read(&brw->rw_sem);
>   atomic_inc(&brw->slow_read_ctr);
>   __up_read(&brw->rw_sem);
> 
> is it really much worse than wait_event + atomic_dec_and_test?
> 
> And! please note that with your implementation the new readers will
> be likely blocked while the writer sleeps in synchronize_sched().
> This doesn't happen with percpu_rw_semaphore.

Thanx, Paul



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-24 Thread Steven Rostedt
On Tue, 24 Sep 2013 18:03:59 +0200
Oleg Nesterov  wrote:

> On 09/24, Peter Zijlstra wrote:
> >
> > +static inline void get_online_cpus(void)
> > +{
> > +   might_sleep();
> > +
> > +   if (current->cpuhp_ref++) {
> > +   barrier();
> > +   return;
> 
> > I don't understand this barrier()... we are going to return if we already
> hold the lock, do we really need it?

I'm confused too. Unless gcc moves this after the release, but the
release uses preempt_disable() which is its own barrier.

If anything, it requires a comment.

-- Steve

> 
> The same for put_online_cpus().
> 
> > +void __get_online_cpus(void)
> >  {
> > -   if (cpu_hotplug.active_writer == current)
> > +   if (cpuhp_writer_task == current)
> > return;
> 


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-24 Thread Steven Rostedt
On Tue, 24 Sep 2013 14:38:21 +0200
Peter Zijlstra  wrote:

> +#define cpuhp_writer_wait(cond)  
> \
> +do { \
> + for (;;) {  \
> + set_current_state(TASK_UNINTERRUPTIBLE);\
> + if (cond)   \
> + break;  \
> + schedule(); \
> + }   \
> + __set_current_state(TASK_RUNNING);  \
> +} while (0)
> +
> +void __get_online_cpus(void)

The above really needs a comment about how it is used. Otherwise, I can
envision someone calling this as "oh I can use this when I'm in a
preempt disable section", and the comment below for the
preempt_enable_no_resched() will no longer be true.

-- Steve


>  {
> - if (cpu_hotplug.active_writer == current)
> + if (cpuhp_writer_task == current)
>   return;
> - mutex_lock(&cpu_hotplug.lock);
>  
> - if (WARN_ON(!cpu_hotplug.refcount))
> - cpu_hotplug.refcount++; /* try to fix things up */
> + atomic_inc(&cpuhp_waitcount);
> +
> + /*
> +  * We either call schedule() in the wait, or we'll fall through
> +  * and reschedule on the preempt_enable() in get_online_cpus().
> +  */
> + preempt_enable_no_resched();
> + wait_event(cpuhp_wq, !__cpuhp_writer);
> + preempt_disable();
> +
> + /*
> +  * It would be possible for cpu_hotplug_done() to complete before
> +  * the atomic_inc() above; in which case there is no writer waiting
> +  * and doing a wakeup would be BAD (tm).
> +  *
> +  * If however we still observe cpuhp_writer_task here we know
> +  * cpu_hotplug_done() is currently stuck waiting for cpuhp_waitcount.
> +  */
> + if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)
> + cpuhp_writer_wake();
> +}
> +EXPORT_SYMBOL_GPL(__get_online_cpus);
>  


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-24 Thread Oleg Nesterov
On 09/24, Peter Zijlstra wrote:
>
> +void __get_online_cpus(void)
>  {
> - if (cpu_hotplug.active_writer == current)
> + /* Support reader-in-writer recursion */
> + if (__cpuhp_writer == current)
>   return;
> - mutex_lock(&cpu_hotplug.lock);
>  
> - if (WARN_ON(!cpu_hotplug.refcount))
> - cpu_hotplug.refcount++; /* try to fix things up */
> + atomic_inc(&cpuhp_waitcount);
>  
> - if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
> - wake_up_process(cpu_hotplug.active_writer);
> - mutex_unlock(&cpu_hotplug.lock);
> + /*
> +  * We either call schedule() in the wait, or we'll fall through
> +  * and reschedule on the preempt_enable() in get_online_cpus().
> +  */
> + preempt_enable_no_resched();
> + wait_event(cpuhp_readers, !__cpuhp_writer);
> + preempt_disable();
> +
> + if (atomic_dec_and_test(&cpuhp_waitcount))
> + wake_up_all(&cpuhp_writer);

Yes, this should fix the races with the exiting writer, but still this
doesn't look right afaics.

In particular let me repeat,

>  void cpu_hotplug_begin(void)
>  {
> - cpu_hotplug.active_writer = current;
> + unsigned int count = 0;
> + int cpu;
> +
> + lockdep_assert_held(&cpu_add_remove_lock);
>  
> - for (;;) {
> - mutex_lock(&cpu_hotplug.lock);
> - if (likely(!cpu_hotplug.refcount))
> - break;
> - __set_current_state(TASK_UNINTERRUPTIBLE);
> - mutex_unlock(&cpu_hotplug.lock);
> - schedule();
> + __cpuhp_writer = current;
> +
> + /* 
> +  * After this everybody will observe writer and take the slow path.
> +  */
> + synchronize_sched();

synchronize_sched() is slow. The new readers will likely notice
__cpuhp_writer != NULL much earlier and they will be blocked in
__get_online_cpus() while the writer sleeps before it actually
enters the critical section.

Or I completely misunderstood this all?

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-24 Thread Oleg Nesterov
On 09/24, Peter Zijlstra wrote:
>
> +static inline void get_online_cpus(void)
> +{
> + might_sleep();
> +
> + if (current->cpuhp_ref++) {
> + barrier();
> + return;

I don't understand this barrier()... we are going to return if we already
hold the lock, do we really need it?

The same for put_online_cpus().

> +void __get_online_cpus(void)
>  {
> - if (cpu_hotplug.active_writer == current)
> + if (cpuhp_writer_task == current)
>   return;

Probably it would be better to simply inc/dec ->cpuhp_ref in
cpu_hotplug_begin/end and remove this check here and in
__put_online_cpus().

This also means that the writer doing get/put_online_cpus() will
always use the fast path, and __cpuhp_writer can go away,
cpuhp_writer_task != NULL can be used instead.

> + atomic_inc(&cpuhp_waitcount);
> +
> + /*
> +  * We either call schedule() in the wait, or we'll fall through
> +  * and reschedule on the preempt_enable() in get_online_cpus().
> +  */
> + preempt_enable_no_resched();
> + wait_event(cpuhp_wq, !__cpuhp_writer);
> + preempt_disable();
> +
> + /*
> +  * It would be possible for cpu_hotplug_done() to complete before
> +  * the atomic_inc() above; in which case there is no writer waiting
> +  * and doing a wakeup would be BAD (tm).
> +  *
> +  * If however we still observe cpuhp_writer_task here we know
> +  * cpu_hotplug_done() is currently stuck waiting for cpuhp_waitcount.
> +  */
> + if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)
> + cpuhp_writer_wake();

cpuhp_writer_wake() here and in __put_online_cpus() looks racy...
Not only cpuhp_writer_wake() can hit cpuhp_writer_task == NULL (we need
something like ACCESS_ONCE()), its task_struct can be already freed/reused
if the writer exits.

And I don't really understand the logic... This slow path succeeds without
incrementing any counter (except current->cpuhp_ref)? How can the next writer
notice that it should wait for this reader?

>  void cpu_hotplug_done(void)
>  {
> - cpu_hotplug.active_writer = NULL;
> - mutex_unlock(&cpu_hotplug.lock);
> + /* Signal the writer is done */
> + cpuhp_writer = 0;
> + wake_up_all(&cpuhp_wq);
> +
> + /* Wait for any pending readers to be running */
> + cpuhp_writer_wait(!atomic_read(&cpuhp_waitcount));
> + cpuhp_writer_task = NULL;

We also need to ensure that the next reader should see all changes
done by the writer, iow this lacks "release" semantics.




But, Peter, the main question is, why this is better than
percpu_rw_semaphore performance-wise? (Assuming we add
task_struct->cpuhp_ref).

If the writer is pending, percpu_down_read() does

down_read(&brw->rw_sem);
atomic_inc(&brw->slow_read_ctr);
__up_read(&brw->rw_sem);

is it really much worse than wait_event + atomic_dec_and_test?

And! please note that with your implementation the new readers will
be likely blocked while the writer sleeps in synchronize_sched().
This doesn't happen with percpu_rw_semaphore.

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-24 Thread Peter Zijlstra
On Tue, Sep 24, 2013 at 07:42:36AM -0700, Paul E. McKenney wrote:
> > +#define cpuhp_writer_wake()
> > \
> > +   wake_up_process(cpuhp_writer_task)
> > +
> > +#define cpuhp_writer_wait(cond)
> > \
> > +do {   
> > \
> > +   for (;;) {  \
> > +   set_current_state(TASK_UNINTERRUPTIBLE);\
> > +   if (cond)   \
> > +   break;  \
> > +   schedule(); \
> > +   }   \
> > +   __set_current_state(TASK_RUNNING);  \
> > +} while (0)
> 
> Why not wait_event()?  Presumably the above is a bit lighter weight,
> but is that even something that can be measured?

I didn't want to mix readers and writers on cpuhp_wq, and I suppose I
could create a second waitqueue; that might also be a better solution
for the NULL thing below.

> > +   atomic_inc(&cpuhp_waitcount);
> > +
> > +   /*
> > +* We either call schedule() in the wait, or we'll fall through
> > +* and reschedule on the preempt_enable() in get_online_cpus().
> > +*/
> > +   preempt_enable_no_resched();
> > +   wait_event(cpuhp_wq, !__cpuhp_writer);
> 
> Finally!  A good use for preempt_enable_no_resched().  ;-)

Hehe, there were a few others, but tglx removed most with the
schedule_preempt_disabled() primitive.

In fact, I considered a wait_event_preempt_disabled() but was too lazy.
That whole wait_event macro fest looks like it could use an iteration or
two of collapse anyhow.
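
A rough sketch of what such a helper could look like, purely as an
illustration (this is not an existing kernel macro; it just reuses the
standard DEFINE_WAIT()/prepare_to_wait()/finish_wait() helpers around a
preempt-toggled schedule()):

#define wait_event_preempt_disabled(wq, cond)                           \
do {                                                                    \
        DEFINE_WAIT(__wait);                                            \
        for (;;) {                                                      \
                prepare_to_wait(&(wq), &__wait, TASK_UNINTERRUPTIBLE);  \
                if (cond)                                               \
                        break;                                          \
                preempt_enable_no_resched();                            \
                schedule();                                             \
                preempt_disable();                                      \
        }                                                               \
        finish_wait(&(wq), &__wait);                                    \
} while (0)

The wq and cond arguments mirror wait_event(); the only difference is where
preemption is dropped and retaken around schedule().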

> > +   preempt_disable();
> > +
> > +   /*
> > +* It would be possible for cpu_hotplug_done() to complete before
> > +* the atomic_inc() above; in which case there is no writer waiting
> > +* and doing a wakeup would be BAD (tm).
> > +*
> > +* If however we still observe cpuhp_writer_task here we know
> > +* cpu_hotplug_done() is currently stuck waiting for cpuhp_waitcount.
> > +*/
> > +   if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)
> 
> OK, I'll bite...  What sequence of events results in the
> atomic_dec_and_test() returning true but there being no
> cpuhp_writer_task?
> 
> Ah, I see it...



Indeed, and

> But what prevents the following sequence of events?



> o Task B's call to cpuhp_writer_wake() sees a NULL pointer.

Quite so.. nothing. See there was a reason I kept being confused about
it.

> >  void cpu_hotplug_begin(void)
> >  {
> > +   unsigned int count = 0;
> > +   int cpu;
> > +
> > +   lockdep_assert_held(&cpu_add_remove_lock);
> > 
> > +   __cpuhp_writer = 1;
> > +   cpuhp_writer_task = current;
> 
> At this point, the value of cpuhp_slowcount can go negative.  Can't see
> that this causes a problem, given the atomic_add() below.

Agreed.

> > +
> > +   /* After this everybody will observe writer and take the slow path. */
> > +   synchronize_sched();
> > +
> > +   /* Collapse the per-cpu refcount into slowcount */
> > +   for_each_possible_cpu(cpu) {
> > +   count += per_cpu(__cpuhp_refcount, cpu);
> > +   per_cpu(__cpuhp_refcount, cpu) = 0;
> > }
> 
> The above is safe because the readers are no longer changing their
> __cpuhp_refcount values.

Yes, I'll expand the comment.

So how about something like this?

---
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct device;
 
@@ -173,10 +174,50 @@ extern struct bus_type cpu_subsys;
 #ifdef CONFIG_HOTPLUG_CPU
 /* Stop CPUs going up and down. */
 
+extern void cpu_hotplug_init_task(struct task_struct *p);
+
 extern void cpu_hotplug_begin(void);
 extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern struct task_struct *__cpuhp_writer;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+   might_sleep();
+
+   /* Support reader-in-reader recursion */
+   if (current->cpuhp_ref++) {
+   barrier();
+   return;
+   }
+
+   preempt_disable();
+   if (likely(!__cpuhp_writer))
+   __this_cpu_inc(__cpuhp_refcount);
+   else
+   __get_online_cpus();
+   preempt_enable();
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+   barrier();
+   if (--current->cpuhp_ref)
+   return;
+
+   preempt_disable();
+   if (likely(!__cpuhp_writer))
+   __this_cpu_dec(__cpuhp_refcount);
+   else
+   __put_online_cpus();
+   preempt_enable();
+}
+
 extern void cpu_hotplug_disable(void);
 extern void cpu

Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-24 Thread Paul E. McKenney
On Tue, Sep 24, 2013 at 02:38:21PM +0200, Peter Zijlstra wrote:
> 
> OK, so another attempt.
> 
> This one is actually fair in that it immediately forces a reader
> quiescent state by explicitly implementing reader-reader recursion.
> 
> This does away with the potentially long pending writer case and can
> thus use the simpler global state.
> 
> I don't really like this lock being fair, but alas.
> 
> Also, please have a look at the atomic_dec_and_test(cpuhp_waitcount) and
> cpu_hotplug_done(). I think its ok, but I keep confusing myself.

Cute!

Some commentary below.  Also one question about how a race leading to
a NULL-pointer dereference is avoided.

Thanx, Paul

> ---
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  struct device;
> 
> @@ -173,10 +174,49 @@ extern struct bus_type cpu_subsys;
>  #ifdef CONFIG_HOTPLUG_CPU
>  /* Stop CPUs going up and down. */
> 
> +extern void cpu_hotplug_init_task(struct task_struct *p);
> +
>  extern void cpu_hotplug_begin(void);
>  extern void cpu_hotplug_done(void);
> -extern void get_online_cpus(void);
> -extern void put_online_cpus(void);
> +
> +extern int __cpuhp_writer;
> +DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
> +
> +extern void __get_online_cpus(void);
> +
> +static inline void get_online_cpus(void)
> +{
> + might_sleep();
> +
> + if (current->cpuhp_ref++) {
> + barrier();
> + return;
> + }
> +
> + preempt_disable();
> + if (likely(!__cpuhp_writer))
> + __this_cpu_inc(__cpuhp_refcount);
> + else
> + __get_online_cpus();
> + preempt_enable();
> +}
> +
> +extern void __put_online_cpus(void);
> +
> +static inline void put_online_cpus(void)
> +{
> + barrier();
> + if (--current->cpuhp_ref)
> + return;
> +
> + preempt_disable();
> + if (likely(!__cpuhp_writer))
> + __this_cpu_dec(__cpuhp_refcount);
> + else
> + __put_online_cpus();
> + preempt_enable();
> +}
> +
>  extern void cpu_hotplug_disable(void);
>  extern void cpu_hotplug_enable(void);
>  #define hotcpu_notifier(fn, pri) cpu_notifier(fn, pri)
> @@ -200,6 +240,8 @@ static inline void cpu_hotplug_driver_un
> 
>  #else/* CONFIG_HOTPLUG_CPU */
> 
> +static inline void cpu_hotplug_init_task(struct task_struct *p) {}
> +
>  static inline void cpu_hotplug_begin(void) {}
>  static inline void cpu_hotplug_done(void) {}
>  #define get_online_cpus()do { } while (0)
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1454,6 +1454,9 @@ struct task_struct {
>   unsigned intsequential_io;
>   unsigned intsequential_io_avg;
>  #endif
> +#ifdef CONFIG_HOTPLUG_CPU
> + int cpuhp_ref;
> +#endif
>  };
> 
>  /* Future-safe accessor for struct task_struct's cpus_allowed. */
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -49,88 +49,115 @@ static int cpu_hotplug_disabled;
> 
>  #ifdef CONFIG_HOTPLUG_CPU
> 
> -static struct {
> - struct task_struct *active_writer;
> - struct mutex lock; /* Synchronizes accesses to refcount, */
> - /*
> -  * Also blocks the new readers during
> -  * an ongoing cpu hotplug operation.
> -  */
> - int refcount;
> -} cpu_hotplug = {
> - .active_writer = NULL,
> - .lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
> - .refcount = 0,
> -};
> +static struct task_struct *cpuhp_writer_task = NULL;
> 
> -void get_online_cpus(void)
> -{
> - might_sleep();
> - if (cpu_hotplug.active_writer == current)
> - return;
> - mutex_lock(&cpu_hotplug.lock);
> - cpu_hotplug.refcount++;
> - mutex_unlock(&cpu_hotplug.lock);
> +int __cpuhp_writer;
> +EXPORT_SYMBOL_GPL(__cpuhp_writer);
> 
> +DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
> +EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
> +
> +static atomic_t cpuhp_waitcount;
> +static atomic_t cpuhp_slowcount;
> +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
> +
> +void cpu_hotplug_init_task(struct task_struct *p)
> +{
> + p->cpuhp_ref = 0;
>  }
> -EXPORT_SYMBOL_GPL(get_online_cpus);
> 
> -void put_online_cpus(void)
> +#define cpuhp_writer_wake()  \
> + wake_up_process(cpuhp_writer_task)
> +
> +#define cpuhp_writer_wait(cond)  
> \
> +do { \
> + for (;;) {  \
> + set_current_state(TASK_UNINTERRUPTIBLE);\
> + if (cond)   \
> + break;  \
> + schedule(); \
> + }   \
> + __set_current_state(TASK_

Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-24 Thread Peter Zijlstra

OK, so another attempt.

This one is actually fair in that it immediately forces a reader
quiescent state by explicitly implementing reader-reader recursion.

This does away with the potentially long pending writer case and can
thus use the simpler global state.

I don't really like this lock being fair, but alas.

Also, please have a look at the atomic_dec_and_test(cpuhp_waitcount) and
cpu_hotplug_done(). I think its ok, but I keep confusing myself.
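
As a stand-alone illustration of the reader-in-reader recursion, here is a
user-space model where a thread-local counter stands in for
current->cpuhp_ref; the names and the printf are only for the example, not
part of the patch below. A nested get_online_cpus() touches nothing but the
task-local counter, so it can never block on the writer it is nested under.

#include <stdio.h>

static __thread int cpuhp_ref;          /* stands in for current->cpuhp_ref */

static void get_online_cpus_model(void)
{
        if (cpuhp_ref++)                /* nested call: only the task-local count moves */
                return;
        /* outermost call: this is where the real code takes the per-cpu reference */
}

static void put_online_cpus_model(void)
{
        if (--cpuhp_ref)                /* still nested: nothing to release yet */
                return;
        /* outermost call: this is where the real code drops the per-cpu reference */
}

int main(void)
{
        get_online_cpus_model();
        get_online_cpus_model();        /* recursion: no shared state touched, no deadlock */
        put_online_cpus_model();
        put_online_cpus_model();
        printf("ref after balanced calls: %d\n", cpuhp_ref);    /* prints 0 */
        return 0;
}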

---
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct device;
 
@@ -173,10 +174,49 @@ extern struct bus_type cpu_subsys;
 #ifdef CONFIG_HOTPLUG_CPU
 /* Stop CPUs going up and down. */
 
+extern void cpu_hotplug_init_task(struct task_struct *p);
+
 extern void cpu_hotplug_begin(void);
 extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern int __cpuhp_writer;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+   might_sleep();
+
+   if (current->cpuhp_ref++) {
+   barrier();
+   return;
+   }
+
+   preempt_disable();
+   if (likely(!__cpuhp_writer))
+   __this_cpu_inc(__cpuhp_refcount);
+   else
+   __get_online_cpus();
+   preempt_enable();
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+   barrier();
+   if (--current->cpuhp_ref)
+   return;
+
+   preempt_disable();
+   if (likely(!__cpuhp_writer))
+   __this_cpu_dec(__cpuhp_refcount);
+   else
+   __put_online_cpus();
+   preempt_enable();
+}
+
 extern void cpu_hotplug_disable(void);
 extern void cpu_hotplug_enable(void);
 #define hotcpu_notifier(fn, pri)   cpu_notifier(fn, pri)
@@ -200,6 +240,8 @@ static inline void cpu_hotplug_driver_un
 
 #else  /* CONFIG_HOTPLUG_CPU */
 
+static inline void cpu_hotplug_init_task(struct task_struct *p) {}
+
 static inline void cpu_hotplug_begin(void) {}
 static inline void cpu_hotplug_done(void) {}
 #define get_online_cpus()  do { } while (0)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1454,6 +1454,9 @@ struct task_struct {
unsigned intsequential_io;
unsigned intsequential_io_avg;
 #endif
+#ifdef CONFIG_HOTPLUG_CPU
+   int cpuhp_ref;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,115 @@ static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
-static struct {
-   struct task_struct *active_writer;
-   struct mutex lock; /* Synchronizes accesses to refcount, */
-   /*
-* Also blocks the new readers during
-* an ongoing cpu hotplug operation.
-*/
-   int refcount;
-} cpu_hotplug = {
-   .active_writer = NULL,
-   .lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
-   .refcount = 0,
-};
+static struct task_struct *cpuhp_writer_task = NULL;
 
-void get_online_cpus(void)
-{
-   might_sleep();
-   if (cpu_hotplug.active_writer == current)
-   return;
-   mutex_lock(&cpu_hotplug.lock);
-   cpu_hotplug.refcount++;
-   mutex_unlock(&cpu_hotplug.lock);
+int __cpuhp_writer;
+EXPORT_SYMBOL_GPL(__cpuhp_writer);
 
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static atomic_t cpuhp_waitcount;
+static atomic_t cpuhp_slowcount;
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
+
+void cpu_hotplug_init_task(struct task_struct *p)
+{
+   p->cpuhp_ref = 0;
 }
-EXPORT_SYMBOL_GPL(get_online_cpus);
 
-void put_online_cpus(void)
+#define cpuhp_writer_wake()\
+   wake_up_process(cpuhp_writer_task)
+
+#define cpuhp_writer_wait(cond)
\
+do {   \
+   for (;;) {  \
+   set_current_state(TASK_UNINTERRUPTIBLE);\
+   if (cond)   \
+   break;  \
+   schedule(); \
+   }   \
+   __set_current_state(TASK_RUNNING);  \
+} while (0)
+
+void __get_online_cpus(void)
 {
-   if (cpu_hotplug.active_writer == current)
+   if (cpuhp_writer_task == current)
return;
-   mutex_lock(&cpu_hotplug.lock);
 
-   if (WARN_ON(!cpu_hotplug.refcount))
-   cpu_hotplug.refcount++; /* try to fix things up */
+   atomic_inc(&cpuhp_waitcount);
+
+   /*
+* We either call schedule() in the

Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-23 Thread Paul E. McKenney
On Mon, Sep 23, 2013 at 06:01:30PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 23, 2013 at 08:50:59AM -0700, Paul E. McKenney wrote:
> > Not a problem, just stuff the idx into some per-task thing.  Either
> > task_struct or taskinfo will work fine.
> 
> Still not seeing the point of using srcu though..
> 
> srcu_read_lock() vs synchronize_srcu() is the same but far more
> expensive than preempt_disable() vs synchronize_sched().

Heh!  You want the old-style SRCU.  ;-)

> > Or to put it another way, if the underlying slow-path mutex is
> > reader-preference, then the whole thing will be reader-preference.
> 
> Right, so 1) we have no such mutex so we're going to have to open-code
> that anyway, and 2) like I just explained in the other email, I want the
> pending writer case to be _fast_ as well.

At some point I suspect that we will want some form of fairness, but in
the meantime, good point.

Thanx, Paul



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-23 Thread Oleg Nesterov
And somehow I didn't notice that cpuhp_set_state() doesn't look right,

On 09/19, Peter Zijlstra wrote:
>  void cpu_hotplug_begin(void)
>  {
> - cpu_hotplug.active_writer = current;
> + lockdep_assert_held(&cpu_add_remove_lock);
>  
> - for (;;) {
> - mutex_lock(&cpu_hotplug.lock);
> - if (likely(!cpu_hotplug.refcount))
> - break;
> - __set_current_state(TASK_UNINTERRUPTIBLE);
> - mutex_unlock(&cpu_hotplug.lock);
> - schedule();
> - }
> + __cpuhp_writer = current;
> +
> + /* After this everybody will observe _writer and take the slow path. */
> + synchronize_sched();
> +
> + /* Wait for no readers -- reader preference */
> + cpuhp_wait_refcount();
> +
> + /* Stop new readers. */
> + cpuhp_set_state(1);

But this stops all readers, not only new. Even if cpuhp_wait_refcount()
was correct, a new reader can come right before cpuhp_set_state(1) and
then it can call another recursive get_online_cpus() right after.

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-23 Thread Peter Zijlstra
On Mon, Sep 23, 2013 at 10:04:00AM -0700, Paul E. McKenney wrote:
> At some point I suspect that we will want some form of fairness, but in
> the meantime, good point.

I figured we could start a timer on hotplug to force quiesce the readers
after about 10 minutes or so ;-)

Should be a proper discouragement from (ab)using this hotplug stuff...

Muwhahaha


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-23 Thread Oleg Nesterov
On 09/23, Peter Zijlstra wrote:
>
> On Sat, Sep 21, 2013 at 06:34:04PM +0200, Oleg Nesterov wrote:
> > > So the slow path is still per-cpu and mostly uncontended even in the
> > > pending writer case.
> >
> > Is it really important? I mean, per-cpu/uncontended even if the writer
> > is pending?
>
> I think so, once we make {get,put}_online_cpus() really cheap they'll
> get in more and more places, and the global count with pending writer
> will make things crawl on bigger machines.

Hmm. But the writers should be rare.

> > But. We already have percpu_rw_semaphore,
>
> Oh urgh, forgot about that one. /me goes read.
>
> /me curses loudly.. that thing has an _expedited() call in it, those
> should die.

Probably yes, the original reason for _expedited() has gone away.

> I'd dread to think what would happen if a 4k cpu machine were to land in
> the slow path on that global mutex. Readers would never go-away and
> progress would make a glacier seem fast.

Another problem is that write-lock can never succeed unless it
prevents the new readers, but this needs the per-task counter.
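
Very roughly, with a per-task counter (like the cpuhp_ref field in Peter's
per-task-counter patch elsewhere in this thread) the read-side slow path can
tell the two kinds of readers apart -- sketch only:

	if (current->cpuhp_ref++)	/* recursive reader: must not block, the */
		return;			/* writer is waiting for our refcount */

	/* a genuinely new reader is the only one the writer may block */
	wait_event(cpuhp_wq, !__cpuhp_writer);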

> > Note also that percpu_down_write/percpu_up_write can be improved wrt
> > synchronize_sched(). We can turn the 2nd one into call_rcu(), and the
> > 1st one can be avoided if another percpu_down_write() comes "soon after"
> > percpu_up_write().
>
> Write side be damned ;-)

Suppose that a 4k cpu machine does disable_nonboot_cpus(), every
_cpu_down() does synchronize_sched()... OK, perhaps the locking can be
changed so that cpu_hotplug_begin/end is called only once in this case.
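
A rough sketch of that restructuring (the _cpu_down_locked() helper is
hypothetical, standing for _cpu_down() with the begin/done calls hoisted out):

	cpu_maps_update_begin();
	cpu_hotplug_begin();		/* one writer transition for the whole batch */
	for_each_online_cpu(cpu) {
		if (cpu == first_cpu)
			continue;
		error = _cpu_down_locked(cpu, 1);	/* hypothetical helper */
		if (error)
			break;
	}
	cpu_hotplug_done();
	cpu_maps_update_done();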

> > - The writer calls cpuph_wait_refcount()
> >
> > - cpuph_wait_refcount() does refcnt += __cpuhp_refcount[0].
> >   refcnt == 0.
> >
> > - another reader comes on CPU_0, increments __cpuhp_refcount[0].
> >
> > - this reader migrates to CPU_1 and does put_online_cpus(),
> >   this decrements __cpuhp_refcount[1] which becomes zero.
> >
> > - cpuph_wait_refcount() continues and reads __cpuhp_refcount[1]
> >   which is zero. refcnt == 0, return.
>
> Ah indeed..
>
> The best I can come up with is something like:
>
> static unsigned int cpuhp_refcount(void)
> {
>   unsigned int refcount = 0;
>   int cpu;
>
>   for_each_possible_cpu(cpu)
>   refcount += per_cpu(__cpuhp_refcount, cpu);
>   return refcount;
> }
>
> static void cpuhp_wait_refcount(void)
> {
>   for (;;) {
>   unsigned int rc1, rc2;
>
>   rc1 = cpuhp_refcount();
>   set_current_state(TASK_UNINTERRUPTIBLE); /* MB */
>   rc2 = cpuhp_refcount();
>
>   if (rc1 == rc2 && !rc1)

But this only makes the race above "theoretical ** 2". Both
cpuhp_refcount()'s can be equally fooled.

Looks like, cpuhp_refcount() should take all per-cpu cpuhp_lock's
before it reads __cpuhp_refcount.
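
I.e. something like this, perhaps (sketch only, cpuhp_lock being the per-cpu
spinlock from the patch):

static unsigned int cpuhp_refcount(void)
{
	unsigned int refcount = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		spin_lock(&per_cpu(cpuhp_lock, cpu));

	for_each_possible_cpu(cpu)
		refcount += per_cpu(__cpuhp_refcount, cpu);

	for_each_possible_cpu(cpu)
		spin_unlock(&per_cpu(cpuhp_lock, cpu));

	return refcount;
}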

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-23 Thread Peter Zijlstra
On Mon, Sep 23, 2013 at 11:59:08AM -0400, Steven Rostedt wrote:
> On Mon, 23 Sep 2013 17:22:23 +0200
> Peter Zijlstra  wrote:
> 
> > Still no point in using srcu for this; preempt_disable +
> > synchronize_sched() is similar and much faster -- its the rcu_sched
> > equivalent of what you propose.
> 
> To be honest, I sent this out last week and it somehow got trashed by
> my laptop and connecting to my smtp server. Where the last version of
> your patch still had the memory barrier ;-)

Ah, ok, yes in that case things start to make sense again.


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-23 Thread Steven Rostedt
On Mon, 23 Sep 2013 17:22:23 +0200
Peter Zijlstra  wrote:

> Still no point in using srcu for this; preempt_disable +
> synchronize_sched() is similar and much faster -- its the rcu_sched
> equivalent of what you propose.

To be honest, I sent this out last week and it somehow got trashed by
my laptop and connecting to my smtp server. Where the last version of
your patch still had the memory barrier ;-)

So yeah, a true synchronize_sched() is better.

-- Steve


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-23 Thread Peter Zijlstra
On Mon, Sep 23, 2013 at 08:50:59AM -0700, Paul E. McKenney wrote:
> Not a problem, just stuff the idx into some per-task thing.  Either
> task_struct or taskinfo will work fine.

Still not seeing the point of using srcu though..

srcu_read_lock() vs synchronize_srcu() is the same but far more
expensive than preempt_disable() vs synchronize_sched().

> Or to put it another way, if the underlying slow-path mutex is
> reader-preference, then the whole thing will be reader-preference.

Right, so 1) we have no such mutex so we're going to have to open-code
that anyway, and 2) like I just explained in the other email, I want the
pending writer case to be _fast_ as well.
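
Side by side, the two read-side fast paths being compared look roughly like
this (sketch, not patch code):

	int idx;

	/* srcu flavour: full barrier in the lock/unlock, plus an index to carry */
	idx = srcu_read_lock(&cpuhp_srcu);
	/* ... */
	srcu_read_unlock(&cpuhp_srcu, idx);

	/* rcu_sched flavour: barrier-free reader, the writer pays with
	 * synchronize_sched() instead */
	preempt_disable();
	__this_cpu_inc(__cpuhp_refcount);
	preempt_enable();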


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-23 Thread Paul E. McKenney
On Mon, Sep 23, 2013 at 11:13:03AM -0400, Steven Rostedt wrote:
> On Mon, 23 Sep 2013 16:54:46 +0200
> Peter Zijlstra  wrote:
> 
> > On Mon, Sep 23, 2013 at 10:50:17AM -0400, Steven Rostedt wrote:

[ . . . ]

> ?? I'm not sure I understand this. The online_cpus_held++ was there for
> recursion. Can't get_online_cpus() nest? I was thinking it can. If so,
> once the "__cpuhp_writer" is set, we need to do __put_online_cpus() as
> many times as we did a __get_online_cpus(). I don't know where the
> O(nr_tasks) comes from. The ref here was just to account for doing the
> old "get_online_cpus" instead of a srcu_read_lock().
> 
> > 
> > > static inline void put_online_cpus(void)
> > > {
> > >   if (unlikely(current->online_cpus_held)) {
> > >   current->online_cpus_held--;
> > >   __put_online_cpus();
> > >   return;
> > >   }
> > > 
> > >   srcu_read_unlock(&cpuhp_srcu);
> > > }
> > 
> > Also, you might not have noticed but, srcu_read_{,un}lock() have an
> > extra idx thing to pass about. That doesn't fit with the hotplug api.
> 
> I'll have to look at that, as I'm not exactly sure about the idx thing.

Not a problem, just stuff the idx into some per-task thing.  Either
task_struct or taskinfo will work fine.
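
A minimal sketch of that (the task_struct field names are hypothetical):

	/* in struct task_struct: */
	int			cpuhp_nest;
	int			cpuhp_idx;

static inline void get_online_cpus(void)
{
	might_sleep();
	if (!current->cpuhp_nest++)
		current->cpuhp_idx = srcu_read_lock(&cpuhp_srcu);
}

static inline void put_online_cpus(void)
{
	if (!--current->cpuhp_nest)
		srcu_read_unlock(&cpuhp_srcu, current->cpuhp_idx);
}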

> > > 
> > > Then have the writer simply do:
> > > 
> > >   __cpuhp_write = current;
> > >   synchronize_srcu(&cpuhp_srcu);
> > > 
> > >   
> > 
> > How does that do reader preference?
> 
> Well, the point I was trying to do was to let readers go very fast
> (well, with a mb instead of a mutex), and then when the CPU hotplug
> happens, it goes back to the current method.
> 
> That is, once we set __cpuhp_write, and then run synchronize_srcu(),
> the system will be in a state that does what it does today (grabbing
> mutexes, and upping refcounts).
> 
> I thought the whole point was to speed up the get_online_cpus() when no
> hotplug is happening. This does that, and is rather simple. It only
> gets slow when hotplug is in effect.

Or to put it another way, if the underlying slow-path mutex is
reader-preference, then the whole thing will be reader-preference.

Thanx, Paul



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-23 Thread Peter Zijlstra
On Mon, Sep 23, 2013 at 11:13:03AM -0400, Steven Rostedt wrote:
> Well, the point I was trying to do was to let readers go very fast
> (well, with a mb instead of a mutex), and then when the CPU hotplug
> happens, it goes back to the current method.

Well, for that the thing Oleg proposed works just fine and the
preempt_disable() section vs synchronize_sched() is hardly magic.

But I'd really like to get the writer pending case fast too.

> That is, once we set __cpuhp_write, and then run synchronize_srcu(),
> the system will be in a state that does what it does today (grabbing
> mutexes, and upping refcounts).

Still no point in using srcu for this; preempt_disable +
synchronize_sched() is similar and much faster -- its the rcu_sched
equivalent of what you propose.

> I thought the whole point was to speed up the get_online_cpus() when no
> hotplug is happening. This does that, and is rather simple. It only
> gets slow when hotplug is in effect.

No, well, it also gets slow when a hotplug is pending, which can be
quite a while if we go sprinkle get_online_cpus() all over the place and
the machine is busy.

One we start a hotplug attempt we must wait for all readers to quiesce
-- since the lock is full reader preference this can take an infinite
amount of time -- while we're waiting for this all 4k+ CPUs will be
bouncing the one mutex around on every get_online_cpus(); of which we'll
have many since that's the entire point of making them cheap, to use
more of them.


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-23 Thread Steven Rostedt
On Mon, 23 Sep 2013 16:54:46 +0200
Peter Zijlstra  wrote:

> On Mon, Sep 23, 2013 at 10:50:17AM -0400, Steven Rostedt wrote:
> > On Thu, 19 Sep 2013 16:32:41 +0200
> > Peter Zijlstra  wrote:
> > 
> > 
> > > +extern void __get_online_cpus(void);
> > > +
> > > +static inline void get_online_cpus(void)
> > > +{
> > > + might_sleep();
> > > +
> > > + preempt_disable();
> > > + if (likely(!__cpuhp_writer || __cpuhp_writer == current))
> > > + this_cpu_inc(__cpuhp_refcount);
> > > + else
> > > + __get_online_cpus();
> > > + preempt_enable();
> > > +}
> > 
> > 
> > This isn't much different than srcu_read_lock(). What about doing
> > something like this:
> > 
> > static inline void get_online_cpus(void)
> > {
> > might_sleep();
> > 
> > srcu_read_lock(&cpuhp_srcu);
> > if (unlikely(__cpuhp_writer || __cpuhp_writer != current)) {
> > srcu_read_unlock(&cpuhp_srcu);
> > __get_online_cpus();
> > current->online_cpus_held++;
> > }
> > }
> 
> There's a full memory barrier in srcu_read_lock(), while there was no
> such thing in the previous fast path.

Yeah, I mentioned this to Paul, and we talked about making
srcu_read_lock() work with no mb's. But currently, doesn't
get_online_cpus() just take a mutex? What's wrong with a mb() as it
still kicks ass over what is currently there today?

> 
> Also, why current->online_cpus_held()? That would make the write side
> O(nr_tasks) instead of O(nr_cpus).

?? I'm not sure I understand this. The online_cpus_held++ was there for
recursion. Can't get_online_cpus() nest? I was thinking it can. If so,
once the "__cpuhp_writer" is set, we need to do __put_online_cpus() as
many times as we did a __get_online_cpus(). I don't know where the
O(nr_tasks) comes from. The ref here was just to account for doing the
old "get_online_cpus" instead of a srcu_read_lock().

> 
> > static inline void put_online_cpus(void)
> > {
> > if (unlikely(current->online_cpus_held)) {
> > current->online_cpus_held--;
> > __put_online_cpus();
> > return;
> > }
> > 
> > srcu_read_unlock(&cpuhp_srcu);
> > }
> 
> Also, you might not have noticed but, srcu_read_{,un}lock() have an
> extra idx thing to pass about. That doesn't fit with the hotplug api.

I'll have to look at that, as I'm not exactly sure about the idx thing.

> 
> > 
> > Then have the writer simply do:
> > 
> > __cpuhp_write = current;
> > synchronize_srcu(&cpuhp_srcu);
> > 
> > 
> 
> How does that do reader preference?

Well, the point I was trying to do was to let readers go very fast
(well, with a mb instead of a mutex), and then when the CPU hotplug
happens, it goes back to the current method.

That is, once we set __cpuhp_write, and then run synchronize_srcu(),
the system will be in a state that does what it does today (grabbing
mutexes, and upping refcounts).

I thought the whole point was to speed up the get_online_cpus() when no
hotplug is happening. This does that, and is rather simple. It only
gets slow when hotplug is in effect.

-- Steve




Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-23 Thread Peter Zijlstra
On Mon, Sep 23, 2013 at 10:50:17AM -0400, Steven Rostedt wrote:
> On Thu, 19 Sep 2013 16:32:41 +0200
> Peter Zijlstra  wrote:
> 
> 
> > +extern void __get_online_cpus(void);
> > +
> > +static inline void get_online_cpus(void)
> > +{
> > +   might_sleep();
> > +
> > +   preempt_disable();
> > +   if (likely(!__cpuhp_writer || __cpuhp_writer == current))
> > +   this_cpu_inc(__cpuhp_refcount);
> > +   else
> > +   __get_online_cpus();
> > +   preempt_enable();
> > +}
> 
> 
> This isn't much different than srcu_read_lock(). What about doing
> something like this:
> 
> static inline void get_online_cpus(void)
> {
>   might_sleep();
> 
>   srcu_read_lock(&cpuhp_srcu);
>   if (unlikely(__cpuhp_writer || __cpuhp_writer != current)) {
>   srcu_read_unlock(&cpuhp_srcu);
>   __get_online_cpus();
>   current->online_cpus_held++;
>   }
> }

There's a full memory barrier in srcu_read_lock(), while there was no
such thing in the previous fast path.

Also, why current->online_cpus_held()? That would make the write side
O(nr_tasks) instead of O(nr_cpus).

> static inline void put_online_cpus(void)
> {
>   if (unlikely(current->online_cpus_held)) {
>   current->online_cpus_held--;
>   __put_online_cpus();
>   return;
>   }
> 
>   srcu_read_unlock(&cpuhp_srcu);
> }

Also, you might not have noticed but, srcu_read_{,un}lock() have an
extra idx thing to pass about. That doesn't fit with the hotplug api.

> 
> Then have the writer simply do:
> 
>   __cpuhp_write = current;
>   synchronize_srcu(&cpuhp_srcu);
> 
>   

How does that do reader preference?


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-23 Thread Steven Rostedt
On Thu, 19 Sep 2013 16:32:41 +0200
Peter Zijlstra  wrote:


> +extern void __get_online_cpus(void);
> +
> +static inline void get_online_cpus(void)
> +{
> + might_sleep();
> +
> + preempt_disable();
> + if (likely(!__cpuhp_writer || __cpuhp_writer == current))
> + this_cpu_inc(__cpuhp_refcount);
> + else
> + __get_online_cpus();
> + preempt_enable();
> +}


This isn't much different than srcu_read_lock(). What about doing
something like this:

static inline void get_online_cpus(void)
{
might_sleep();

srcu_read_lock(&cpuhp_srcu);
if (unlikely(__cpuhp_writer || __cpuhp_writer != current)) {
srcu_read_unlock(&cpuhp_srcu);
__get_online_cpus();
current->online_cpus_held++;
}
}

static inline void put_online_cpus(void)
{
if (unlikely(current->online_cpus_held)) {
current->online_cpus_held--;
__put_online_cpus();
return;
}

srcu_read_unlock(&cpuhp_srcu);
}

Then have the writer simply do:

__cpuhp_write = current;
synchronize_srcu(&cpuhp_srcu);



-- Steve


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-23 Thread Peter Zijlstra
On Sat, Sep 21, 2013 at 06:34:04PM +0200, Oleg Nesterov wrote:
> > So the slow path is still per-cpu and mostly uncontended even in the
> > pending writer case.
> 
> Is it really important? I mean, per-cpu/uncontended even if the writer
> is pending?

I think so, once we make {get,put}_online_cpus() really cheap they'll
get in more and more places, and the global count with pending writer
will make things crawl on bigger machines.

> Otherwise we could do



> I already sent this code in 2010, it needs some trivial updates.

Yeah, I found that a few days ago.. but per the above I didn't like the
pending writer case.

> But. We already have percpu_rw_semaphore,

Oh urgh, forgot about that one. /me goes read.

/me curses loudly.. that thing has an _expedited() call in it, those
should die.

Also, it suffers the same problem. I think esp. for hotplug we should be
100% geared towards readers and pretty much damn writers.

I'd dread to think what would happen if a 4k cpu machine were to land in
the slow path on that global mutex. Readers would never go-away and
progress would make a glacier seem fast.

> Note also that percpu_down_write/percpu_up_write can be improved wrt
> synchronize_sched(). We can turn the 2nd one into call_rcu(), and the
> 1nd one can be avoided if another percpu_down_write() comes "soon after"
> percpu_down_up().

Write side be damned ;-)

It is anyway with a pure read bias and a large machine..

> As for the patch itself, I am not sure.
> 
> > +static void cpuph_wait_refcount(void)
> 
> It seems, this can succeed while it should not, see below.
> 
> >  void cpu_hotplug_begin(void)
> >  {
> > +   lockdep_assert_held(&cpu_add_remove_lock);
> >
> > +   __cpuhp_writer = current;
> > +
> > +   /* After this everybody will observe _writer and take the slow path. */
> > +   synchronize_sched();
> 
> Yes, the reader should see _writer, but:
> 
> > +   /* Wait for no readers -- reader preference */
> > +   cpuhp_wait_refcount();
> 
> but how we can ensure the writer sees the results of the reader's updates?
> 
> Suppose that we have 2 CPU's, __cpuhp_refcount[0] = 0, __cpuhp_refcount[1] = 1.
> IOW, we have a single R reader which takes this lock on CPU_1 and sleeps.
> 
> Now,
> 
>   - The writer calls cpuph_wait_refcount()
> 
>   - cpuph_wait_refcount() does refcnt += __cpuhp_refcount[0].
> refcnt == 0.
> 
>   - another reader comes on CPU_0, increments __cpuhp_refcount[0].
> 
>   - this reader migrates to CPU_1 and does put_online_cpus(),
> this decrements __cpuhp_refcount[1] which becomes zero.
> 
>   - cpuph_wait_refcount() continues and reads __cpuhp_refcount[1]
> which is zero. refcnt == 0, return.
> 
>   - The writer does cpuhp_set_state(1).
> 
>   - The reader R (original reader) wakes up, calls get_online_cpus()
> recursively, and sleeps in wait_event(!__cpuhp_writer).

Ah indeed.. 

The best I can come up with is something like:

static unsigned int cpuhp_refcount(void)
{
unsigned int refcount = 0;
int cpu;

for_each_possible_cpu(cpu)
refcount += per_cpu(__cpuhp_refcount, cpu);
	return refcount;
}

static void cpuhp_wait_refcount(void)
{
for (;;) {
unsigned int rc1, rc2;

rc1 = cpuhp_refcount();
set_current_state(TASK_UNINTERRUPTIBLE); /* MB */
rc2 = cpuhp_refcount();

if (rc1 == rc2 && !rc1)
break;

schedule();
}
__set_current_state(TASK_RUNNING);
}


Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-21 Thread Oleg Nesterov
On 09/21, Oleg Nesterov wrote:
>
> As for the patch itself, I am not sure.

Forgot to mention... and with this patch cpu_hotplug_done() loses the
"release" semantics, not sure this is fine.

Oleg.



Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-21 Thread Oleg Nesterov
Sorry for delay, I was sick...

On 09/19, Peter Zijlstra wrote:
>
> I used a per-cpu spinlock to keep the state check and refcount inc
> atomic vs the setting of state.

I think this could be simpler, see below.

> So the slow path is still per-cpu and mostly uncontended even in the
> pending writer case.

Is it really important? I mean, per-cpu/uncontended even if the writer
is pending?

Otherwise we could do

static DEFINE_PER_CPU(long, cpuhp_fast_ctr);
static struct task_struct *cpuhp_writer;
static DEFINE_MUTEX(cpuhp_slow_lock);
static long cpuhp_slow_ctr;

static bool update_fast_ctr(int inc)
{
bool success = true;

preempt_disable();
if (likely(!cpuhp_writer))
__get_cpu_var(cpuhp_fast_ctr) += inc;
else if (cpuhp_writer != current)
success = false;
preempt_enable();

return success;
}

void get_online_cpus(void)
{
	if (likely(update_fast_ctr(+1)))
return;

mutex_lock(&cpuhp_slow_lock);
cpuhp_slow_ctr++;
mutex_unlock(&cpuhp_slow_lock);
}

void put_online_cpus(void)
{
	if (likely(update_fast_ctr(-1)))
return;

mutex_lock(&cpuhp_slow_lock);
if (!--cpuhp_slow_ctr && cpuhp_writer)
wake_up_process(cpuhp_writer);
mutex_unlock(&cpuhp_slow_lock);
}

static long clear_fast_ctr(void)
{
long total = 0;
int cpu;

for_each_possible_cpu(cpu) {
total += per_cpu(cpuhp_fast_ctr, cpu);
per_cpu(cpuhp_fast_ctr, cpu) = 0;
}

return total;
}

static void cpu_hotplug_begin(void)
{
cpuhp_writer = current;
synchronize_sched();

	/* Nobody except us can use cpuhp_fast_ctr */

mutex_lock(&cpuhp_slow_lock);
cpuhp_slow_ctr += clear_fast_ctr();

while (cpuhp_slow_ctr) {
__set_current_state(TASK_UNINTERRUPTIBLE);
		mutex_unlock(&cpuhp_slow_lock);
schedule();
mutex_lock(&cpuhp_slow_lock);
}
}

static void cpu_hotplug_done(void)
{
cpuhp_writer = NULL;
mutex_unlock(&cpuhp_slow_lock);
}

I already sent this code in 2010, it needs some trivial updates.

But. We already have percpu_rw_semaphore, can't we reuse it? In fact
I thought about this from the very beginning. Just we need
percpu_down_write_recursive_readers() which does

bool xxx(brw)
{
if (down_trylock(&brw->rw_sem))
return false;
if (!atomic_read(&brw->slow_read_ctr))
return true;
up_write(&brw->rw_sem);
return false;
}

	wait_event(brw->write_waitq, xxx(brw));

instead of down_write() + wait_event(!atomic_read(&brw->slow_read_ctr)).

The only problem is the lockdep annotations in percpu_down_read(), but
this looks simple, just we need down_read_no_lockdep() (like __up_read).

Note also that percpu_down_write/percpu_up_write can be improved wrt
synchronize_sched(). We can turn the 2nd one into call_rcu(), and the
1st one can be avoided if another percpu_down_write() comes "soon after"
percpu_up_write().


As for the patch itself, I am not sure.

> +static void cpuph_wait_refcount(void)
> +{
> + for (;;) {
> + unsigned int refcnt = 0;
> + int cpu;
> +
> + set_current_state(TASK_UNINTERRUPTIBLE);
> +
> + for_each_possible_cpu(cpu)
> + refcnt += per_cpu(__cpuhp_refcount, cpu);
> +
> + if (!refcnt)
> + break;
> +
> + schedule();
> + }
> + __set_current_state(TASK_RUNNING);
> +}

It seems, this can succeed while it should not, see below.

>  void cpu_hotplug_begin(void)
>  {
> - cpu_hotplug.active_writer = current;
> + lockdep_assert_held(&cpu_add_remove_lock);
>
> - for (;;) {
> - mutex_lock(&cpu_hotplug.lock);
> - if (likely(!cpu_hotplug.refcount))
> - break;
> - __set_current_state(TASK_UNINTERRUPTIBLE);
> - mutex_unlock(&cpu_hotplug.lock);
> - schedule();
> - }
> + __cpuhp_writer = current;
> +
> + /* After this everybody will observe _writer and take the slow path. */
> + synchronize_sched();

Yes, the reader should see _writer, but:

> + /* Wait for no readers -- reader preference */
> + cpuhp_wait_refcount();

but how we can ensure the writer sees the results of the reader's updates?

Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-19 Thread Peter Zijlstra



Meh, I should stop poking at this..

This one lost all the comments again :/

It uses preempt_disable/preempt_enable vs synchronize_sched() to remove
the barriers from the fast path.

After that it waits for !refcount before setting state, which stops new
readers.

I used a per-cpu spinlock to keep the state check and refcount inc
atomic vs the setting of state.

So the slow path is still per-cpu and mostly uncontended even in the
pending writer case.

After setting state it again waits for !refcount -- someone could have
sneaked in between the last !refcount and setting state. But this time
we know refcount will stay 0.

The only thing I don't really like is the unconditional writer wake in
the read-unlock slowpath, but I couldn't come up with anything better.
Here at least we guarantee that there is a wakeup after the last dec --
although there might be far too many wakes.
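
Condensed, the writer-side sequence being described is (function names as
used in the patch and in the replies quoting it):

	__cpuhp_writer = current;
	synchronize_sched();	/* every reader now sees the writer and takes the slow path */

	cpuhp_wait_refcount();	/* wait for the per-cpu refcounts to drain -- reader preference */
	cpuhp_set_state(1);	/* per-cpu state, set under the per-cpu lock: stops new readers */
	cpuhp_wait_refcount();	/* catch readers that sneaked in before the state was set */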

---
Subject: hotplug: Optimize {get,put}_online_cpus()
From: Peter Zijlstra 
Date: Tue Sep 17 16:17:11 CEST 2013

Cc: Oleg Nesterov 
Cc: Paul McKenney 
Cc: Thomas Gleixner 
Cc: Steven Rostedt 
Signed-off-by: Peter Zijlstra 
---
 include/linux/cpu.h |   32 ++-
 kernel/cpu.c|  151 +---
 2 files changed, 116 insertions(+), 67 deletions(-)

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct device;
 
@@ -175,8 +176,35 @@ extern struct bus_type cpu_subsys;
 
 extern void cpu_hotplug_begin(void);
 extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern struct task_struct *__cpuhp_writer;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+   might_sleep();
+
+   preempt_disable();
+   if (likely(!__cpuhp_writer || __cpuhp_writer == current))
+   this_cpu_inc(__cpuhp_refcount);
+   else
+   __get_online_cpus();
+   preempt_enable();
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+   preempt_disable();
+   this_cpu_dec(__cpuhp_refcount);
+   if (unlikely(__cpuhp_writer && __cpuhp_writer != current))
+   __put_online_cpus();
+   preempt_enable();
+}
+
 extern void cpu_hotplug_disable(void);
 extern void cpu_hotplug_enable(void);
 #define hotcpu_notifier(fn, pri)   cpu_notifier(fn, pri)
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,109 @@ static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
-static struct {
-   struct task_struct *active_writer;
-   struct mutex lock; /* Synchronizes accesses to refcount, */
-   /*
-* Also blocks the new readers during
-* an ongoing cpu hotplug operation.
-*/
-   int refcount;
-} cpu_hotplug = {
-   .active_writer = NULL,
-   .lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
-   .refcount = 0,
-};
-
-void get_online_cpus(void)
-{
-   might_sleep();
-   if (cpu_hotplug.active_writer == current)
-   return;
-   mutex_lock(&cpu_hotplug.lock);
-   cpu_hotplug.refcount++;
-   mutex_unlock(&cpu_hotplug.lock);
-
-}
-EXPORT_SYMBOL_GPL(get_online_cpus);
-
-void put_online_cpus(void)
-{
-   if (cpu_hotplug.active_writer == current)
-   return;
-   mutex_lock(&cpu_hotplug.lock);
-
-   if (WARN_ON(!cpu_hotplug.refcount))
-   cpu_hotplug.refcount++; /* try to fix things up */
-
-   if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
-   wake_up_process(cpu_hotplug.active_writer);
-   mutex_unlock(&cpu_hotplug.lock);
+struct task_struct *__cpuhp_writer = NULL;
+EXPORT_SYMBOL_GPL(__cpuhp_writer);
 
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static DEFINE_PER_CPU(int, cpuhp_state);
+static DEFINE_PER_CPU(spinlock_t, cpuhp_lock);
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
+
+void __get_online_cpus(void)
+{
+   spin_lock(__this_cpu_ptr(&cpuhp_lock));
+   for (;;) {
+   if (!__this_cpu_read(cpuhp_state)) {
+   __this_cpu_inc(__cpuhp_refcount);
+   break;
+   }
+
+   spin_unlock(__this_cpu_ptr(&cpuhp_lock));
+   preempt_enable();
+
+   wait_event(cpuhp_wq, !__cpuhp_writer);
+
+   preempt_disable();
+   spin_lock(__this_cpu_ptr(&cpuhp_lock));
+   }
+   spin_unlock(__this_cpu_ptr(&cpuhp_lock));
+}
+EXPORT_SYMBOL_GPL(__get_online_cpus);
+
+void __put_online_cpus(void)
+{
+   wake_up_process(__cpuhp_writer);
+}
+EXPORT_SYMBOL_GPL(__put_online_cpus);
+
+static void cpuph_wait_refcount(void)
+{
+   for (;;) {
+   unsigned int refcnt = 0;
+   int cpu;
+
+   set_current_state(TASK_UNINTERRUPTIBLE);
+
+   for_each_possible_cpu(

Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-18 Thread Peter Zijlstra
New version, now with excessive comments.

I found a deadlock (where both reader and writer would go to sleep);
identified below as case 1b.

The implementation without patch is reader biased, this implementation,
as Mel pointed out, is writer biased. I should try and fix this but I'm
stepping away from the computer now as I have the feeling I'll only
wreck stuff from now on.

---
Subject: hotplug: Optimize {get,put}_online_cpus()
From: Peter Zijlstra 
Date: Tue Sep 17 16:17:11 CEST 2013

The current implementation uses global state, change it so the reader
side uses per-cpu state in the contended fast path.

Cc: Oleg Nesterov 
Cc: Paul McKenney 
Cc: Thomas Gleixner 
Cc: Steven Rostedt 
Signed-off-by: Peter Zijlstra 
---
 include/linux/cpu.h |   29 -
 kernel/cpu.c|  159 ++--
 2 files changed, 134 insertions(+), 54 deletions(-)

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct device;
 
@@ -175,8 +176,32 @@ extern struct bus_type cpu_subsys;
 
 extern void cpu_hotplug_begin(void);
 extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern struct task_struct *__cpuhp_writer;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+   might_sleep();
+
+   this_cpu_inc(__cpuhp_refcount);
+   smp_mb(); /* see comment near __get_online_cpus() */
+   if (unlikely(__cpuhp_writer))
+   __get_online_cpus();
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+   this_cpu_dec(__cpuhp_refcount);
+   smp_mb(); /* see comment near __get_online_cpus() */
+   if (unlikely(__cpuhp_writer))
+   __put_online_cpus();
+}
+
 extern void cpu_hotplug_disable(void);
 extern void cpu_hotplug_enable(void);
 #define hotcpu_notifier(fn, pri)   cpu_notifier(fn, pri)
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,143 @@ static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
-static struct {
-   struct task_struct *active_writer;
-   struct mutex lock; /* Synchronizes accesses to refcount, */
-   /*
-* Also blocks the new readers during
-* an ongoing cpu hotplug operation.
-*/
-   int refcount;
-} cpu_hotplug = {
-   .active_writer = NULL,
-   .lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
-   .refcount = 0,
-};
+struct task_struct *__cpuhp_writer = NULL;
+EXPORT_SYMBOL_GPL(__cpuhp_writer);
+
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
+
+/*
+ * We must order things like:
+ *
+ *  CPU0 -- read-lock              CPU1 -- write-lock
+ *
+ *  STORE __cpuhp_refcount         STORE __cpuhp_writer
+ *  MB                             MB
+ *  LOAD __cpuhp_writer            LOAD __cpuhp_refcount
+ *
+ *
+ * This gives rise to the following permutations:
+ *
+ * a) all of R happened before W
+ * b) R starts but sees the W store -- therefore W must see the R store
+ *W starts but sees the R store -- therefore R must see the W store
+ * c) all of W happens before R
+ *
+ * 1) RL vs WL:
+ *
+ * 1a) RL proceeds; WL observes refcount and goes wait for !refcount.
+ * 1b) RL drops into the slow path; WL waits for !refcount.
+ * 1c) WL proceeds; RL drops into the slow path.
+ *
+ * 2) RL vs WU:
+ *
+ * 2a) RL drops into the slow path; WU clears writer and wakes RL
+ * 2b) RL proceeds; WU continues to wake others
+ * 2d) RL proceeds.
+ *
+ * 3) RU vs WL:
+ *
+ * 3a) RU proceeds; WL proceeds.
+ * 3b) RU drops to slow path; WL proceeds
+ * 3c) WL waits for !refcount; RL drops to slow path
+ *
+ * 4) RU vs WU:
+ *
+ * Impossible since R and W state are mutually exclusive.
+ *
+ * This leaves us to consider the R slow paths:
+ *
+ * RL
+ *
+ * 1b) we must wake W
+ * 2a) nothing of importance
+ *
+ * RU
+ *
+ * 3b) nothing of importance
+ * 3c) we must wake W
+ *
+ */
 
-void get_online_cpus(void)
+void __get_online_cpus(void)
 {
-   might_sleep();
-   if (cpu_hotplug.active_writer == current)
+   if (__cpuhp_writer == current)
return;
-   mutex_lock(&cpu_hotplug.lock);
-   cpu_hotplug.refcount++;
-   mutex_unlock(&cpu_hotplug.lock);
 
+again:
+   /*
+* Case 1b; we must decrement our refcount again otherwise WL will
+* never observe !refcount and stay blocked forever. Not good since
+* we're going to sleep too. Someone must be awake and do something.
+*
+* Skip recomputing the refcount, just wake the pending writer and
+* have him check it -- writers are rare.
+*/
+   this_cpu_dec(__cpuhp_refcount);
+   wake_up_process(__cpuhp_writer); /* implies MB */
+
+   wait_event(cpuhp_wq, !__cpuhp_writer);
+
+   /* Basically re-do the fas

Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-17 Thread Peter Zijlstra
On Tue, Sep 17, 2013 at 05:20:50PM +0100, Mel Gorman wrote:
> > +extern struct task_struct *__cpuhp_writer;
> > +DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
> > +
> > +extern void __get_online_cpus(void);
> > +
> > +static inline void get_online_cpus(void)
> > +{
> > +   might_sleep();
> > +
> > +   this_cpu_inc(__cpuhp_refcount);
> > +   /*
> > +* Order the refcount inc against the writer read; pairs with the full
> > +* barrier in cpu_hotplug_begin().
> > +*/
> > +   smp_mb();
> > +   if (unlikely(__cpuhp_writer))
> > +   __get_online_cpus();
> > +}
> > +
> 
> If the problem with get_online_cpus() is the shared global state then a
> full barrier in the fast path is still going to hurt. Granted, it will hurt
> a lot less and there should be no lock contention.

I went for a lot less, I wasn't smart enough to get rid of it. Also,
since it's a lock op we should at least provide an ACQUIRE barrier.

> However, what barrier in cpu_hotplug_begin is the comment referring to? 

set_current_state() implies a full barrier and nicely separates the
write to __cpuhp_writer and the read of __cpuph_refcount.

> The
> other barrier is in the slowpath __get_online_cpus. Did you mean to do
> a rmb here and a wmb after __cpuhp_writer is set in cpu_hotplug_begin?

No, since we're ordering LOADs and STORES (see below) we must use full
barriers.

> I'm assuming you are currently using a full barrier to guarantee that an
> update if cpuhp_writer will be visible so get_online_cpus blocks but I'm
> not 100% sure because of the comments.

I'm ordering:

  CPU0 -- get_online_cpus()         CPU1 -- cpu_hotplug_begin()

  STORE __cpuhp_refcount            STORE __cpuhp_writer

  MB                                MB

  LOAD __cpuhp_writer               LOAD __cpuhp_refcount

Such that neither can miss the state of the other and we get proper
mutual exclusion.

> > +extern void __put_online_cpus(void);
> > +
> > +static inline void put_online_cpus(void)
> > +{
> > +   barrier();
> 
> Why is this barrier necessary? 

To ensure the compiler keeps all loads/stores done before the
read-unlock before it.

Arguably it should be a complete RELEASE barrier. I should've put an XXX
comment here but the brain gave out completely for the day.

> I could not find anything that stated if an
> inline function is an implicit compiler barrier but whether it is or not,
> it's not clear why it's necessary at all.

It is not, only actual function calls are an implied sync point for the
compiler.
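
For instance (hypothetical example -- "shared" stands for any data the
read-side section protects), around the inlined put_online_cpus() the
compiler could in principle turn

	shared->field = 1;		/* still inside the read-side section */
	this_cpu_dec(__cpuhp_refcount);

into

	this_cpu_dec(__cpuhp_refcount);
	shared->field = 1;		/* now effectively after the unlock */

which is exactly the reordering the barrier() is there to forbid.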

> > +   this_cpu_dec(__cpuhp_refcount);
> > +   if (unlikely(__cpuhp_writer))
> > +   __put_online_cpus();
> > +}
> > +

> > +struct task_struct *__cpuhp_writer = NULL;
> > +EXPORT_SYMBOL_GPL(__cpuhp_writer);
> > +
> > +DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
> > +EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
> >  
> > +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
> > +
> > +void __get_online_cpus(void)
> >  {
> > +   if (__cpuhp_writer == current)
> > return;
> >  
> > +again:
> > +   /*
> > +* Ensure a pending reading has a 0 refcount.
> > +*
> > +* Without this a new reader that comes in before cpu_hotplug_begin()
> > +* reads the refcount will deadlock.
> > +*/
> > +   this_cpu_dec(__cpuhp_refcount);
> > +   wait_event(cpuhp_wq, !__cpuhp_writer);
> > +
> > +   this_cpu_inc(__cpuhp_refcount);
> > +   /*
> > +* See get_online_cpu().
> > +*/
> > +   smp_mb();
> > +   if (unlikely(__cpuhp_writer))
> > +   goto again;
> >  }
> 
> If CPU hotplug operations are very frequent (or a stupid stress test) then
> it's possible for a new hotplug operation to start (updating __cpuhp_writer)
> before a caller to __get_online_cpus can update the refcount. Potentially
> a caller to __get_online_cpus gets starved although as it only affects a
> CPU hotplug stress test it may not be a serious issue.

Right.. If that ever becomes a problem we should fix it, but aside from
stress tests hotplug should be extremely rare.

Initially I kept the reference over the wait_event() but realized (as
per the comment) that that would deadlock cpu_hotplug_begin() for it
would never observe !refcount.

One solution for this problem is having refcount as an array of 2 and
flipping the index at the appropriate times.
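
A sketch of that flip-index idea (names hypothetical; it is essentially what
SRCU does internally):

static DEFINE_PER_CPU(unsigned int, cpuhp_refcount2[2]);
static int cpuhp_idx;			/* only the writer flips this */

	/* reader */
	int idx;

	preempt_disable();
	idx = ACCESS_ONCE(cpuhp_idx);
	__this_cpu_inc(cpuhp_refcount2[idx]);
	preempt_enable();
	/* ... remember idx and decrement the same slot on unlock ... */

	/* writer */
	cpuhp_idx ^= 1;
	synchronize_sched();		/* no new increments can hit the old slot */
	/* the old slot's sum can now only go down; wait for it to reach zero */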

> > +EXPORT_SYMBOL_GPL(__get_online_cpus);
> >  
> > +void __put_online_cpus(void)
> >  {
> > +   unsigned int refcnt = 0;
> > +   int cpu;
> >  
> > +   if (__cpuhp_writer == current)
> > +   return;
> >  
> > +   for_each_possible_cpu(cpu)
> > +   refcnt += per_cpu(__cpuhp_refcount, cpu);
> >  
> 
> This can result in spurious wakeups if CPU N calls get_online_cpus after
> its refcnt has been checked but I could not think of a case where it
> matters.

Right and right.. too many wakeups aren't a correctness issue. One
should try and minimize them for performance reasons though :-)

> > +   if (!refcnt)
> > +   wake_up_process(__cpuhp_writer);
> >  }


> >  /*
> >   * This ensures that the h

Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-17 Thread Mel Gorman
On Tue, Sep 17, 2013 at 04:30:03PM +0200, Peter Zijlstra wrote:
> Subject: hotplug: Optimize {get,put}_online_cpus()
> From: Peter Zijlstra 
> Date: Tue Sep 17 16:17:11 CEST 2013
> 
> The cpu hotplug lock is a purely reader biased read-write lock.
> 
> The current implementation uses global state, change it so the reader
> side uses per-cpu state in the uncontended fast-path.
> 
> Cc: Oleg Nesterov 
> Cc: Paul McKenney 
> Cc: Thomas Gleixner 
> Cc: Steven Rostedt 
> Signed-off-by: Peter Zijlstra 
> ---
>  include/linux/cpu.h |   33 ++-
>  kernel/cpu.c|  108 
> ++--
>  2 files changed, 87 insertions(+), 54 deletions(-)
> 
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  struct device;
>  
> @@ -175,8 +176,36 @@ extern struct bus_type cpu_subsys;
>  
>  extern void cpu_hotplug_begin(void);
>  extern void cpu_hotplug_done(void);
> -extern void get_online_cpus(void);
> -extern void put_online_cpus(void);
> +
> +extern struct task_struct *__cpuhp_writer;
> +DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
> +
> +extern void __get_online_cpus(void);
> +
> +static inline void get_online_cpus(void)
> +{
> + might_sleep();
> +
> + this_cpu_inc(__cpuhp_refcount);
> + /*
> +  * Order the refcount inc against the writer read; pairs with the full
> +  * barrier in cpu_hotplug_begin().
> +  */
> + smp_mb();
> + if (unlikely(__cpuhp_writer))
> + __get_online_cpus();
> +}
> +

If the problem with get_online_cpus() is the shared global state then a
full barrier in the fast path is still going to hurt. Granted, it will hurt
a lot less and there should be no lock contention.

However, what barrier in cpu_hotplug_begin is the comment referring to? The
other barrier is in the slowpath __get_online_cpus. Did you mean to do
a rmb here and a wmb after __cpuhp_writer is set in cpu_hotplug_begin?
I'm assuming you are currently using a full barrier to guarantee that an
update if cpuhp_writer will be visible so get_online_cpus blocks but I'm
not 100% sure because of the comments.

> +extern void __put_online_cpus(void);
> +
> +static inline void put_online_cpus(void)
> +{
> + barrier();

Why is this barrier necessary? I could not find anything that stated if an
inline function is an implicit compiler barrier but whether it is or not,
it's not clear why it's necessary at all.

> + this_cpu_dec(__cpuhp_refcount);
> + if (unlikely(__cpuhp_writer))
> + __put_online_cpus();
> +}
> +
>  extern void cpu_hotplug_disable(void);
>  extern void cpu_hotplug_enable(void);
>  #define hotcpu_notifier(fn, pri) cpu_notifier(fn, pri)
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -49,88 +49,92 @@ static int cpu_hotplug_disabled;
>  
>  #ifdef CONFIG_HOTPLUG_CPU
>  
> -static struct {
> - struct task_struct *active_writer;
> - struct mutex lock; /* Synchronizes accesses to refcount, */
> - /*
> -  * Also blocks the new readers during
> -  * an ongoing cpu hotplug operation.
> -  */
> - int refcount;
> -} cpu_hotplug = {
> - .active_writer = NULL,
> - .lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
> - .refcount = 0,
> -};
> +struct task_struct *__cpuhp_writer = NULL;
> +EXPORT_SYMBOL_GPL(__cpuhp_writer);
> +
> +DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
> +EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
>  
> -void get_online_cpus(void)
> +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
> +
> +void __get_online_cpus(void)
>  {
> - might_sleep();
> - if (cpu_hotplug.active_writer == current)
> + if (__cpuhp_writer == current)
>   return;
> - mutex_lock(&cpu_hotplug.lock);
> - cpu_hotplug.refcount++;
> - mutex_unlock(&cpu_hotplug.lock);
>  
> +again:
> + /*
> +  * Ensure a pending reading has a 0 refcount.
> +  *
> +  * Without this a new reader that comes in before cpu_hotplug_begin()
> +  * reads the refcount will deadlock.
> +  */
> + this_cpu_dec(__cpuhp_refcount);
> + wait_event(cpuhp_wq, !__cpuhp_writer);
> +
> + this_cpu_inc(__cpuhp_refcount);
> + /*
> +  * See get_online_cpu().
> +  */
> + smp_mb();
> + if (unlikely(__cpuhp_writer))
> + goto again;
>  }

If CPU hotplug operations are very frequent (or a stupid stress test) then
it's possible for a new hotplug operation to start (updating __cpuhp_writer)
before a caller to __get_online_cpus can update the refcount. Potentially
a caller to __get_online_cpus gets starved although as it only affects a
CPU hotplug stress test it may not be a serious issue.

> -EXPORT_SYMBOL_GPL(get_online_cpus);
> +EXPORT_SYMBOL_GPL(__get_online_cpus);
>  
> -void put_online_cpus(void)
> +void __put_online_cpus(void)
>  {
> - if (cpu_hotplug.active_writer == current)
> - return;
> - mutex_lock(&cpu_hotplug.lock);
> + unsigned int refcnt

[PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-17 Thread Peter Zijlstra
Subject: hotplug: Optimize {get,put}_online_cpus()
From: Peter Zijlstra 
Date: Tue Sep 17 16:17:11 CEST 2013

The cpu hotplug lock is a purely reader biased read-write lock.

The current implementation uses global state, change it so the reader
side uses per-cpu state in the uncontended fast-path.

Cc: Oleg Nesterov 
Cc: Paul McKenney 
Cc: Thomas Gleixner 
Cc: Steven Rostedt 
Signed-off-by: Peter Zijlstra 
---
 include/linux/cpu.h |   33 ++-
 kernel/cpu.c|  108 ++--
 2 files changed, 87 insertions(+), 54 deletions(-)

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct device;
 
@@ -175,8 +176,36 @@ extern struct bus_type cpu_subsys;
 
 extern void cpu_hotplug_begin(void);
 extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern struct task_struct *__cpuhp_writer;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+   might_sleep();
+
+   this_cpu_inc(__cpuhp_refcount);
+   /*
+* Order the refcount inc against the writer read; pairs with the full
+* barrier in cpu_hotplug_begin().
+*/
+   smp_mb();
+   if (unlikely(__cpuhp_writer))
+   __get_online_cpus();
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+   barrier();
+   this_cpu_dec(__cpuhp_refcount);
+   if (unlikely(__cpuhp_writer))
+   __put_online_cpus();
+}
+
 extern void cpu_hotplug_disable(void);
 extern void cpu_hotplug_enable(void);
 #define hotcpu_notifier(fn, pri)   cpu_notifier(fn, pri)
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,92 @@ static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
-static struct {
-   struct task_struct *active_writer;
-   struct mutex lock; /* Synchronizes accesses to refcount, */
-   /*
-* Also blocks the new readers during
-* an ongoing cpu hotplug operation.
-*/
-   int refcount;
-} cpu_hotplug = {
-   .active_writer = NULL,
-   .lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
-   .refcount = 0,
-};
+struct task_struct *__cpuhp_writer = NULL;
+EXPORT_SYMBOL_GPL(__cpuhp_writer);
+
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
 
-void get_online_cpus(void)
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
+
+void __get_online_cpus(void)
 {
-   might_sleep();
-   if (cpu_hotplug.active_writer == current)
+   if (__cpuhp_writer == current)
return;
-   mutex_lock(&cpu_hotplug.lock);
-   cpu_hotplug.refcount++;
-   mutex_unlock(&cpu_hotplug.lock);
 
+again:
+   /*
+* Ensure a pending reading has a 0 refcount.
+*
+* Without this a new reader that comes in before cpu_hotplug_begin()
+* reads the refcount will deadlock.
+*/
+   this_cpu_dec(__cpuhp_refcount);
+   wait_event(cpuhp_wq, !__cpuhp_writer);
+
+   this_cpu_inc(__cpuhp_refcount);
+   /*
+* See get_online_cpu().
+*/
+   smp_mb();
+   if (unlikely(__cpuhp_writer))
+   goto again;
 }
-EXPORT_SYMBOL_GPL(get_online_cpus);
+EXPORT_SYMBOL_GPL(__get_online_cpus);
 
-void put_online_cpus(void)
+void __put_online_cpus(void)
 {
-   if (cpu_hotplug.active_writer == current)
-   return;
-   mutex_lock(&cpu_hotplug.lock);
+   unsigned int refcnt = 0;
+   int cpu;
 
-   if (WARN_ON(!cpu_hotplug.refcount))
-   cpu_hotplug.refcount++; /* try to fix things up */
+   if (__cpuhp_writer == current)
+   return;
 
-   if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
-   wake_up_process(cpu_hotplug.active_writer);
-   mutex_unlock(&cpu_hotplug.lock);
+   for_each_possible_cpu(cpu)
+   refcnt += per_cpu(__cpuhp_refcount, cpu);
 
+   if (!refcnt)
+   wake_up_process(__cpuhp_writer);
 }
-EXPORT_SYMBOL_GPL(put_online_cpus);
+EXPORT_SYMBOL_GPL(__put_online_cpus);
 
 /*
  * This ensures that the hotplug operation can begin only when the
  * refcount goes to zero.
  *
- * Note that during a cpu-hotplug operation, the new readers, if any,
- * will be blocked by the cpu_hotplug.lock
- *
  * Since cpu_hotplug_begin() is always called after invoking
  * cpu_maps_update_begin(), we can be sure that only one writer is active.
- *
- * Note that theoretically, there is a possibility of a livelock:
- * - Refcount goes to zero, last reader wakes up the sleeping
- *   writer.
- * - Last reader unlocks the cpu_hotplug.lock.
- * - A new reader arrives at this moment, bumps up the refcount.
- * - The writer acquires the cpu_hotplug.lock finds the refcount
- *   non zero and goes to sleep again.
- *
- * However, this is very difficu