On Fri, Dec 11, 2020 at 11:39:21AM +0000, Vincent Donnefort wrote:
> Hi Valentin,
>
> On Thu, Dec 10, 2020 at 04:38:30PM +0000, Valentin Schneider wrote:
> > Per-CPU kworkers forcefully migrated away by hotplug via
> > workqueue_offline_cpu() can end up spawning more kworkers via
> >
> >   manage_workers() -> maybe_create_worker()
> >
> > Workers created at this point will be bound using
> >
> >   pool->attrs->cpumask
> >
> > which in this case is wrong, as the hotplug state machine already
> > migrated all pinned kworkers away from this CPU. This ends up
> > triggering the BUG_ON condition in sched_cpu_dying() (i.e. there's a
> > kworker enqueued on the dying rq).
> >
> > Special-case workers being attached to DISASSOCIATED pools and bind
> > them to cpu_active_mask, mimicking them being present when
> > workqueue_offline_cpu() was invoked.
> >
> > Link: https://lore.kernel.org/r/ff62e3ee994efb3620177bf7b19fab16f4866845.ca...@redhat.com
> > Fixes: 06249738a41a ("workqueue: Manually break affinity on hotplug")
>
> Isn't the problem introduced by 1cf12e0 ("sched/hotplug: Consolidate
> task migration on CPU unplug")?
>
> Previously we had:
>
>   AP_WORKQUEUE_ONLINE -> set POOL_DISASSOCIATED
>   ...
>   TEARDOWN_CPU -> clear CPU in cpu_online_mask
>   |
>   |-AP_SCHED_STARTING -> migrate_tasks()
>   |
>   AP_OFFLINE
>
> worker_attach_to_pool() is "protected" by the cpu_online_mask in
> set_cpus_allowed_ptr(). IIUC, now that tasks are migrated before the
> cpu_online_mask is actually flipped, there's a window between
> CPUHP_AP_SCHED_WAIT_EMPTY and CPUHP_TEARDOWN_CPU where a kworker can
> wake up a new one for the hotunplugged pool that wouldn't be caught by
> the hotunplug migration.
Yes, very much so. However, the commit Valentin picked was supposed to preemptively fix this, so we can consider this a fix for the fix. But I don't mind an alternative, or perhaps even a second Fixes tag on this.