Re: [PATCH v3] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

2021-04-14 Thread Alexey Klimov
On Sun, Apr 4, 2021 at 3:32 AM Alexey Klimov  wrote:
>
> On Sat, Mar 27, 2021 at 9:01 PM Thomas Gleixner  wrote:

[...]

Now, the patch:

>> Subject: cpu/hotplug: Cure the cpusets trainwreck
>> From: Thomas Gleixner 
>> Date: Sat, 27 Mar 2021 15:57:29 +0100
>>
>> Alexey and Joshua tried to solve a cpusets related hotplug problem which is
>> user space visible and results in unexpected behaviour for some time after
>> a CPU has been plugged in and the corresponding uevent was delivered.
>>
>> cpusets delegate the hotplug work (rebuilding cpumasks etc.) to a
>> workqueue. This is done because the cpusets code has already a lock
>> nesting of cgroups_mutex -> cpu_hotplug_lock. A synchronous callback or
>> waiting for the work to finish with cpu_hotplug_lock held can and will
>> deadlock because that results in the reverse lock order.
>>
>> As a consequence the uevent can be delivered before cpusets have consistent
>> state which means that a user space invocation of sched_setaffinity() to
>> move a task to the plugged CPU fails up to the point where the scheduled
>> work has been processed.
>>
>> The same is true for CPU unplug, but that does not create user observable
>> failure (yet).
>>
>> It's still inconsistent to claim that an operation is finished before it
>> actually is, and that's the real issue at hand. uevents just make it
>> reliably observable.
>>
>> Obviously the problem should be fixed in cpusets/cgroups, but untangling
>> that is pretty much impossible because according to the changelog of the
>> commit which introduced this 8 years ago:
>>
>>  3a5a6d0c2b03 ("cpuset: don't nest cgroup_mutex inside get_online_cpus()")
>>
>> the lock order cgroups_mutex -> cpu_hotplug_lock is a design decision and
>> the whole code is built around that.
>>
>> So bite the bullet and invoke the relevant cpuset function, which waits for
>> the work to finish, in _cpu_up/down() after dropping cpu_hotplug_lock and
>> only when tasks are not frozen by suspend/hibernate because that would
>> obviously wait forever.
>>
>> Waiting there with cpu_add_remove_lock, which is protecting the present
>> and possible CPU maps, held is not a problem at all because neither work
>> queues nor cpusets/cgroups have any lockchains related to that lock.
>>
>> Waiting in the hotplug machinery is not problematic either because there
>> are already state callbacks which wait for hardware queues to drain. It
>> makes the operations slightly slower, but hotplug is slow anyway.
>>
>> This ensures that state is consistent before returning from a hotplug
>> up/down operation. It's still inconsistent during the operation, but that's
>> a different story.
>>
>> Add a large comment which explains why this is done and why this is not a
>> dumping ground for the hack of the day to work around half thought out locking
>> schemes. Also document the implications vs. hotplug operations and
>> serialization, or the lack of it.
>>
>> Thanks to Alexey and Joshua for analyzing why this temporary
>> sched_setaffinity() failure happened.
>>
>> Reported-by: Alexey Klimov 
>> Reported-by: Joshua Baker 
>> Signed-off-by: Thomas Gleixner 
>> Cc: Daniel Jordan 
>> Cc: Qais Yousef 
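
For reference, a minimal sketch of the approach described above -- wait for
the cpuset work after cpu_hotplug_lock has been dropped, and skip the wait
when tasks are frozen by suspend/hibernate since the work could never
complete. Helper name and call site are illustrative, modeled on
kernel/cpu.c, not necessarily the final code:

  static void cpu_up_down_serialize_trainwrecks(bool tasks_frozen)
  {
          /*
           * cpusets delegate hotplug operations to a worker to "solve"
           * the lock order problem. Wait for the worker, but only if
           * tasks are _not_ frozen (suspend, hibernate) as that would
           * never complete.
           */
          if (!tasks_frozen)
                  cpuset_wait_for_hotplug();
  }

This would be called from _cpu_up()/_cpu_down() right after
cpus_write_unlock(), i.e. with cpu_hotplug_lock dropped but
cpu_add_remove_lock still held, which per the changelog is fine because
neither workqueues nor cpusets have lock chains involving that lock.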

Feel free to use:
Tested-by: Alexey Klimov 

The bug doesn't reproduce with this change; I had the testcase running
for ~25 hrs without failing under different workloads.

Are you going to submit the patch? Or I can submit it on your behalf if you like.

[...]

Best regards,
Alexey



Re: [PATCH v3] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

2021-04-03 Thread Alexey Klimov
On Sat, Mar 27, 2021 at 9:01 PM Thomas Gleixner  wrote:

Lovely that you eventually found time to take a look at this since the
first RFC patch was sent.

> Alexey,
>
> On Wed, Mar 17 2021 at 00:36, Alexey Klimov wrote:
> > When a CPU is offlined and onlined via device_offline() and device_online(),
> > userspace gets a uevent notification. If, after receiving the "online" uevent,
> > userspace executes sched_setaffinity() on some task trying to move it
> > to the recently onlined CPU, it sometimes fails with -EINVAL. Userspace
> > needs to wait around 5..30 ms after receiving the uevent before
> > sched_setaffinity() succeeds for the recently onlined CPU.
> >
> > Cpusets used in guarantee_online_cpus() are updated using a workqueue from
> > cpuset_update_active_cpus(), which in turn is called from the cpu hotplug
> > callback sched_cpu_activate(), hence the update may not be observable by
> > sched_setaffinity() if it is called immediately after the uevent.
>
> And because cpusets are using a workqueue just to deal with their
> backwards lock order we need to cure the symptom in the CPU hotplug
> code, right?

Feel free to suggest a better place.

> > The premature uevent can be avoided if we ensure that cpuset_hotplug_work
> > has run to completion, using cpuset_wait_for_hotplug(), after onlining the
> > cpu in cpu_device_up() and in cpuhp_smt_enable().
>
> It can also be avoided by fixing the root cause which is _NOT_ in the
> CPU hotplug code at all.
>
> The fundamental assumption of CPU hotplug is that if the state machine
> reaches a given state, which might have user space visible effects or
> even just kernel visible effects, the overall state of the system has to
> be consistent.
>
> cpusets violate this assumption. And they do so since 2013 due to commit
> 3a5a6d0c2b03 ("cpuset: don't nest cgroup_mutex inside get_online_cpus()").
>
> If that cannot be fixed in cgroups/cpusets without rewriting the whole
> cpusets/cgroups muck, then this wants to be explained and justified in the
> changelog.
>
> Looking at the changelog of 3a5a6d0c2b03 it's entirely clear that this
> is non-trivial because that changelog clearly states that the lock order
> is a design decision, and that design decision required the workqueue
> workaround.
>
> See? Now we suddenly have a proper root cause and not just a description
> of the symptom with some hidden hint about workqueues. And we have an
> argument why fixing the root cause is close to impossible.

Thank you for this educational scolding here and below. I see that the
problem is more fundamental than I thought, and my commit message
standards are too low for you.
Good to see that a bug that may have existed since 2013 could finally be fixed.

> >  int cpu_device_up(struct device *dev)
> >  {
> > - return cpu_up(dev->id, CPUHP_ONLINE);
> > + int err;
> > +
> > + err = cpu_up(dev->id, CPUHP_ONLINE);
> > + /*
> > +  * Wait for cpuset updates to cpumasks to finish.  Later on this path
> > +  * may generate uevents whose consumers rely on the updates.
> > +  */
> > + if (!err)
> > + cpuset_wait_for_hotplug();
>
> No. Quite some people wasted^Wspent a considerable amount of time to get
> the hotplug trainwreck cleaned up and we are not sprinkling random
> workarounds all over the place again.
>
> >  int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
> >  {
> > - int cpu, ret = 0;
> > + cpumask_var_t mask;
> > + int cpu, ret;
> >
> > + if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
> > + return -ENOMEM;
> > +
> > + ret = 0;
> >   cpu_maps_update_begin();
> >   for_each_online_cpu(cpu) {
> >   if (topology_is_primary_thread(cpu))
> > @@ -2093,31 +2109,42 @@ int cpuhp_smt_disable(enum cpuhp_smt_control 
> > ctrlval)
> >   ret = cpu_down_maps_locked(cpu, CPUHP_OFFLINE);
> >   if (ret)
> >   break;
> > - /*
> > -  * As this needs to hold the cpu maps lock it's impossible
> > -  * to call device_offline() because that ends up calling
> > -  * cpu_down() which takes cpu maps lock. cpu maps lock
> > -  * needs to be held as this might race against in kernel
> > -  * abusers of the hotplug machinery (thermal management).
> > -  *
> > -  * So nothing would update device:offline state. That would
> > -  * leave the sysfs entry stale and prevent onlining after
> > -  * smt control has been changed to 'off' again. This is
> > -  * called under the sysfs hotplug lock, so it is properly
> > -  * serialized against the regular offline usage.
> > -  */
> > - cpuhp_offline_cpu_device(cpu);
> > +
> > + cpumask_set_cpu(cpu, mask);
> >   }
> >   if (!ret)
> >   cpu_smt_control = ctrlval;
> >   cpu_maps_update_done();
> > +
> > + /*
> > +  * When the cpu maps lock 

Re: [PATCH v3] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

2021-04-01 Thread Qais Yousef
On 03/27/21 22:01, Thomas Gleixner wrote:
> And while you carefully reworded the comment, did you actually read what
> it said and what it says now?
> 
> > -* cpu_down() which takes cpu maps lock. cpu maps lock
> > -* needs to be held as this might race against in kernel
> > -* abusers of the hotplug machinery (thermal management).
> 
> vs.
> 
> > +* cpu_down() which takes cpu maps lock. cpu maps lock
> > +* needed to be held as this might race against in-kernel
> > +* abusers of the hotplug machinery (thermal management).
> 
> The big fat hint is: "cpu maps lock needs to be held as this " and
> it still needs to be held for the above loop to work correctly at
> all. See also below.
> 
> So just moving comments blindly around and making them past tense is not
> cutting it. Quite the contrary, the comments make no sense anymore. They
> became incomprehensible word salad.
> 
> Now for the second part of that comment:
> 
> > +  *    This is
> > +* called under the sysfs hotplug lock, so it is properly
> > +* serialized against the regular offline usage.
> 
> So there are two layers of protection:
> 
>cpu_maps_lock and sysfs hotplug lock
> 
> One protects surprisingly against concurrent sysfs writes and the other
> is required to serialize in-kernel usage.
> 
> Now lets look at the original protection:
> 
>lock(sysfs)
>  lock(cpu_maps)
>hotplug
> dev->offline = new_state
> uevent()
>  unlock(cpu_maps)
>unlock(sysfs)
> 
> and the one you created:
> 
>lock(sysfs)
>  lock(cpu_maps)
>hotplug
>  unlock(cpu_maps)
>  dev->offline = new_state
>  uevent()
>unlock(sysfs)
> 
> Where is that protection scope change mentioned in the change log and
> where is the analysis that it is correct?

The comment does still need updating though. Its reference to thermal management
is cryptic. I couldn't see who performs hotplug operations in thermal code.

I also think that comment is generally no longer valid after the refactor I did
to 'prevent' in-kernel users from calling cpu_up/down() directly and force them
all to go via device_offline/online(), which is wrapped by add/remove_cpu():

33c3736ec888 cpu/hotplug: Hide cpu_up/down()

So I think that, except for suspend/resume/hibernate/kexec, all in-kernel users
should be serialized by lock(hotplug) now. Since uevents don't matter for
suspend/resume/hibernate/kexec, I think moving it outside of lock(cpu_maps) is
fine.

So IIUC in the past we had the race

userspace                         in-kernel users

lock(hotplug)
cpuhp_smt_disable()
   lock(cpu_maps)                 cpu_down()
                                     lock(cpu_maps)

So they were serialized by cpu_maps lock. But that should not happen now since
in-kernel (except for the aforementioned) users should use remove_cpu().

userspace                         in-kernel users

lock(hotplug)
cpuhp_smt_disable()
   lock(cpu_maps)                 remove_cpu()
                                     lock(hotplug)
                                     device_offline()
                                        cpu_down()
                                           lock(cpu_maps)

Which forces the serialization at lock(hotplug), which is what
lock_device_hotplug_sysfs() actually tries to take.

So I think that race condition should not happen now. Or did I miss something?

The only loophole is that cpu_device_up() could be abused if not caught by
reviewers. I didn't find a way to warn/error if someone other than
cpu_subsys_online() uses it. We rely on a comment explaining it.

I think cpuhp_smt_disable/enable can safely call device_offline/online now.
Although it might still be more efficient to hold lock(cpu_maps) once than
repeatedly in a loop.
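
As a rough sketch of that idea (hypothetical, not merged code; it follows the
second diagram above and assumes holding the device hotplug lock around the
loop is acceptable):

  int cpuhp_smt_disable_sketch(enum cpuhp_smt_control ctrlval)
  {
          int cpu, ret = 0;

          lock_device_hotplug();
          for_each_online_cpu(cpu) {
                  if (topology_is_primary_thread(cpu))
                          continue;
                  /* Goes through cpu_down() and updates dev->offline,
                   * so the sysfs state cannot go stale. */
                  ret = device_offline(get_cpu_device(cpu));
                  if (ret)
                          break;
          }
          if (!ret)
                  cpu_smt_control = ctrlval;
          unlock_device_hotplug();
          return ret;
  }

The cost is taking the cpu maps lock once per CPU inside device_offline()
rather than once around the whole loop, which is the efficiency point above.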

If we do that, then cpu_up_down_serialize_trainwrecks() can be called from
cpu_device_up/down(), which implies !tasks_frozen.

Can't remember now if Alexey moved the uevent() handling out of the loop for
efficiency reasons or was seeing something else. I doubt it was the latter.


Thanks

--
Qais Yousef


Re: [PATCH v3] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

2021-03-27 Thread Thomas Gleixner
Alexey,

On Wed, Mar 17 2021 at 00:36, Alexey Klimov wrote:
> When a CPU is offlined and onlined via device_offline() and device_online(),
> userspace gets a uevent notification. If, after receiving the "online" uevent,
> userspace executes sched_setaffinity() on some task trying to move it
> to the recently onlined CPU, it sometimes fails with -EINVAL. Userspace
> needs to wait around 5..30 ms after receiving the uevent before
> sched_setaffinity() succeeds for the recently onlined CPU.
>
> Cpusets used in guarantee_online_cpus() are updated using a workqueue from
> cpuset_update_active_cpus(), which in turn is called from the cpu hotplug
> callback sched_cpu_activate(), hence the update may not be observable by
> sched_setaffinity() if it is called immediately after the uevent.

And because cpusets are using a workqueue just to deal with their
backwards lock order we need to cure the symptom in the CPU hotplug
code, right?
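
(For context, the indirection in question looks roughly like this in
kernel/cgroup/cpuset.c of that era -- paraphrased, not an exact quote:

  void cpuset_update_active_cpus(void)
  {
          /*
           * We're inside the cpu hotplug critical region, which nests
           * inside cgroup synchronization. Bounce the actual hotplug
           * processing to a work item to avoid the reverse lock order.
           */
          schedule_work(&cpuset_hotplug_work);
  }

  void cpuset_wait_for_hotplug(void)
  {
          flush_work(&cpuset_hotplug_work);
  }

Anything that looks at cpuset state right after the hotplug callbacks return
may therefore still see stale cpumasks until the work item has run.)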

> The premature uevent can be avoided if we ensure that cpuset_hotplug_work
> has run to completion, using cpuset_wait_for_hotplug(), after onlining the
> cpu in cpu_device_up() and in cpuhp_smt_enable().

It can also be avoided by fixing the root cause which is _NOT_ in the
CPU hotplug code at all.

The fundamental assumption of CPU hotplug is that if the state machine
reaches a given state, which might have user space visible effects or
even just kernel visible effects, the overall state of the system has to
be consistent.

cpusets violate this assumption. And they do so since 2013 due to commit
3a5a6d0c2b03 ("cpuset: don't nest cgroup_mutex inside get_online_cpus()").

If that cannot be fixed in cgroups/cpusets without rewriting the whole
cpusets/cgroups muck, then this wants to be explained and justified in the
changelog.

Looking at the changelog of 3a5a6d0c2b03 it's entirely clear that this
is non-trivial because that changelog clearly states that the lock order
is a design decision, and that design decision required the workqueue
workaround.

See? Now we suddenly have a proper root cause and not just a description
of the symptom with some hidden hint about workqueues. And we have an
argument why fixing the root cause is close to impossible.

>  int cpu_device_up(struct device *dev)
>  {
> - return cpu_up(dev->id, CPUHP_ONLINE);
> + int err;
> +
> + err = cpu_up(dev->id, CPUHP_ONLINE);
> + /*
> +  * Wait for cpuset updates to cpumasks to finish.  Later on this path
> +  * may generate uevents whose consumers rely on the updates.
> +  */
> + if (!err)
> + cpuset_wait_for_hotplug();

No. Quite some people wasted^Wspent a considerable amount of time to get
the hotplug trainwreck cleaned up and we are not sprinkling random
workarounds all over the place again.

>  int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
>  {
> - int cpu, ret = 0;
> + cpumask_var_t mask;
> + int cpu, ret;
>  
> + if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
> + return -ENOMEM;
> +
> + ret = 0;
>   cpu_maps_update_begin();
>   for_each_online_cpu(cpu) {
>   if (topology_is_primary_thread(cpu))
> @@ -2093,31 +2109,42 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
>   ret = cpu_down_maps_locked(cpu, CPUHP_OFFLINE);
>   if (ret)
>   break;
> - /*
> -  * As this needs to hold the cpu maps lock it's impossible
> -  * to call device_offline() because that ends up calling
> -  * cpu_down() which takes cpu maps lock. cpu maps lock
> -  * needs to be held as this might race against in kernel
> -  * abusers of the hotplug machinery (thermal management).
> -  *
> -  * So nothing would update device:offline state. That would
> -  * leave the sysfs entry stale and prevent onlining after
> -  * smt control has been changed to 'off' again. This is
> -  * called under the sysfs hotplug lock, so it is properly
> -  * serialized against the regular offline usage.
> -  */
> - cpuhp_offline_cpu_device(cpu);
> +
> + cpumask_set_cpu(cpu, mask);
>   }
>   if (!ret)
>   cpu_smt_control = ctrlval;
>   cpu_maps_update_done();
> +
> + /*
> +  * When the cpu maps lock was taken above it was impossible
> +  * to call device_offline() because that ends up calling
> +  * cpu_down() which takes cpu maps lock. cpu maps lock
> +  * needed to be held as this might race against in-kernel
> +  * abusers of the hotplug machinery (thermal management).
> +  *
> +  * So nothing would update device:offline state. That would
> +  * leave the sysfs entry stale and prevent onlining after
> +  * smt control has been changed to 'off' again. This is
> +  * called under the sysfs hotplug lock, so it is properly
> +  * serialized against the regular offline usage.
> +  */
> 

Re: [PATCH v3] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

2021-03-18 Thread Daniel Jordan
Alexey Klimov  writes:
> When a CPU is offlined and onlined via device_offline() and device_online(),
> userspace gets a uevent notification. If, after receiving the "online" uevent,
> userspace executes sched_setaffinity() on some task trying to move it
> to the recently onlined CPU, it sometimes fails with -EINVAL. Userspace
> needs to wait around 5..30 ms after receiving the uevent before
> sched_setaffinity() succeeds for the recently onlined CPU.
>
> If the in_mask argument for sched_setaffinity() contains only the recently
> onlined CPU, it can fail with the following flow:
>
>   sched_setaffinity()
> cpuset_cpus_allowed()
>   guarantee_online_cpus()   <-- cs->effective_cpus mask does not
> contain recently onlined cpu
> cpumask_and()   <-- final new_mask is empty
> __set_cpus_allowed_ptr()
>   cpumask_any_and_distribute() <-- returns dest_cpu equal to nr_cpu_ids
>   returns -EINVAL
>
> Cpusets used in guarantee_online_cpus() are updated using a workqueue from
> cpuset_update_active_cpus(), which in turn is called from the cpu hotplug
> callback sched_cpu_activate(), hence the update may not be observable by
> sched_setaffinity() if it is called immediately after the uevent.
>
> The premature uevent can be avoided if we ensure that cpuset_hotplug_work
> has run to completion, using cpuset_wait_for_hotplug(), after onlining the
> cpu in cpu_device_up() and in cpuhp_smt_enable().
>
> Cc: Daniel Jordan 
> Reviewed-by: Qais Yousef 
> Co-analyzed-by: Joshua Baker 
> Signed-off-by: Alexey Klimov 

Looks good to me.

Reviewed-by: Daniel Jordan 


[PATCH v3] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

2021-03-16 Thread Alexey Klimov
When a CPU is offlined and onlined via device_offline() and device_online(),
userspace gets a uevent notification. If, after receiving the "online" uevent,
userspace executes sched_setaffinity() on some task trying to move it
to the recently onlined CPU, it sometimes fails with -EINVAL. Userspace
needs to wait around 5..30 ms after receiving the uevent before
sched_setaffinity() succeeds for the recently onlined CPU.

If the in_mask argument for sched_setaffinity() contains only the recently
onlined CPU, it can fail with the following flow:

  sched_setaffinity()
cpuset_cpus_allowed()
  guarantee_online_cpus()   <-- cs->effective_cpus mask does not
contain recently onlined cpu
cpumask_and()   <-- final new_mask is empty
__set_cpus_allowed_ptr()
  cpumask_any_and_distribute() <-- returns dest_cpu equal to nr_cpu_ids
  returns -EINVAL

Cpusets used in guarantee_online_cpus() are updated using a workqueue from
cpuset_update_active_cpus(), which in turn is called from the cpu hotplug
callback sched_cpu_activate(), hence the update may not be observable by
sched_setaffinity() if it is called immediately after the uevent.

The premature uevent can be avoided if we ensure that cpuset_hotplug_work
has run to completion, using cpuset_wait_for_hotplug(), after onlining the
cpu in cpu_device_up() and in cpuhp_smt_enable().

Cc: Daniel Jordan 
Reviewed-by: Qais Yousef 
Co-analyzed-by: Joshua Baker 
Signed-off-by: Alexey Klimov 
---

Changes since v2:
- restore cpuhp_{online,offline}_cpu_device back and move it out
of cpu maps lock;
- use Reviewed-by from Qais;
- minor corrections in commit message and in comment in code.

Changes since v1:
- cpuset_wait_for_hotplug() moved to cpu_device_up();
- corrections in comments;
- removed cpuhp_{online,offline}_cpu_device.

Changes since RFC:
- cpuset_wait_for_hotplug() used in cpuhp_smt_enable().

Previous patches and discussion are:
RFC patch: 
https://lore.kernel.org/lkml/20201203171431.256675-1-akli...@redhat.com/
v1 patch:  
https://lore.kernel.org/lkml/20210204010157.1823669-1-akli...@redhat.com/
v2 patch: 
https://lore.kernel.org/lkml/20210212003032.2037750-1-akli...@redhat.com/

The commit a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous")
would also get rid of the early uevent, but it was reverted (deadlocks).

The nature of this bug is also described here (with different consequences):
https://lore.kernel.org/lkml/20200211141554.24181-1-qais.you...@arm.com/

Reproducer: https://gitlab.com/0xeafe/xlam

Currently, with these changes, the reproducer code continues to work without
issues. The idea is to avoid the situation where userspace receives the uevent
about an onlined CPU while that CPU is, for a short while, not yet ready to
take tasks.
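
For illustration, a minimal userspace sketch of the failing sequence
(hypothetical and heavily simplified -- the reproducer linked above waits for
the actual uevent; here the return of the sysfs write stands in for it, and
cpu1 is assumed to exist and be hot-pluggable):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>

  static void cpu1_online_write(const char *val)
  {
          int fd = open("/sys/devices/system/cpu/cpu1/online", O_WRONLY);

          if (fd < 0 || write(fd, val, 1) != 1)
                  perror("cpu1 online sysfs write");
          if (fd >= 0)
                  close(fd);
  }

  int main(void)
  {
          cpu_set_t set;

          cpu1_online_write("0");
          cpu1_online_write("1");   /* "online" uevent fires around here */

          CPU_ZERO(&set);
          CPU_SET(1, &set);
          /* Intermittently fails with -EINVAL before the fix because
           * cpusets have not caught up with the onlined CPU yet. */
          if (sched_setaffinity(0, sizeof(set), &set))
                  perror("sched_setaffinity");
          return 0;
  }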

 kernel/cpu.c | 74 +++-
 1 file changed, 56 insertions(+), 18 deletions(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 1b6302ecbabe..9b091d8a8811 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1301,7 +1302,17 @@ static int cpu_up(unsigned int cpu, enum cpuhp_state 
target)
  */
 int cpu_device_up(struct device *dev)
 {
-   return cpu_up(dev->id, CPUHP_ONLINE);
+   int err;
+
+   err = cpu_up(dev->id, CPUHP_ONLINE);
+   /*
+* Wait for cpuset updates to cpumasks to finish.  Later on this path
+* may generate uevents whose consumers rely on the updates.
+*/
+   if (!err)
+   cpuset_wait_for_hotplug();
+
+   return err;
 }
 
 int add_cpu(unsigned int cpu)
@@ -2084,8 +2095,13 @@ static void cpuhp_online_cpu_device(unsigned int cpu)
 
 int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
 {
-   int cpu, ret = 0;
+   cpumask_var_t mask;
+   int cpu, ret;
 
+   if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+   return -ENOMEM;
+
+   ret = 0;
cpu_maps_update_begin();
for_each_online_cpu(cpu) {
if (topology_is_primary_thread(cpu))
@@ -2093,31 +2109,42 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
ret = cpu_down_maps_locked(cpu, CPUHP_OFFLINE);
if (ret)
break;
-   /*
-* As this needs to hold the cpu maps lock it's impossible
-* to call device_offline() because that ends up calling
-* cpu_down() which takes cpu maps lock. cpu maps lock
-* needs to be held as this might race against in kernel
-* abusers of the hotplug machinery (thermal management).
-*
-* So nothing would update device:offline state. That would
-* leave the sysfs entry stale and prevent onlining after
-* smt control has been changed to 'off' again. This is
-* called under the sysfs