Re: BUG: workqueue lockup - SRCU schedules work on not-online CPUs during size transition

Vasily Gorbik Wed, 29 Apr 2026 10:12:44 -0700

On Wed, Apr 29, 2026 at 08:30:38PM +0530, Srikar Dronamraju wrote:
> * Tejun Heo <[email protected]> [2026-04-10 08:53:30]:
> > Hello,
> > 
> > > Seems that we (mostly Paul) have our own trick to track whether a CPU
> > > has ever been onlined in RCU, see rcu_cpu_beenfullyonline(). Paul also
> > > used it in his fix [1]. And I think it won't be that hard to copy it
> > > into workqueue and let queue_work_on() use it so that if the user queues
> > > a work on a never-onlined CPU, it can detect it (with a warning?) and do
> > > something?
> > 
> > The easiest way to do this is just creating the initial workers for all
> > possible pools. Please see below. However, the downside is that it's going
> > to create all workers for all possible cpus. This isn't a problem for
> > anybody else but these IBM mainframes often come up with a lot of possible
> > but not-yet-or-ever-online CPUs for capacity management, so the cost may not
> > be negligible on some configurations.
> > 
> > IBM folks, is that okay?
> 
> Even on PowerPC LPARs, it's not uncommon to have possible cpus != online cpus
> at boot.  However, your approach will work.
> 
> And Samir has already tested the same too and reported here
> https://lkml.kernel.org/r/[email protected]
> 
> > From: Tejun Heo <[email protected]>
> > Subject: workqueue: Create workers for all possible CPUs on init
> > 
> > Per-CPU worker pools are initialized for every possible CPU during early
> > boot, but workqueue_init() only creates initial workers for online CPUs.
> > On systems where possible CPUs outnumber online CPUs (e.g. s390 LPARs with
> > 76 online and 400 possible CPUs), the pools for never-onlined CPUs have
> > POOL_DISASSOCIATED set but no workers. Any work item queued on such a CPU
> > hangs indefinitely.
> > 
> > This was exposed by 61bbcfb50514 ("srcu: Push srcu_node allocation to GP
> > when non-preemptible") which made SRCU schedule callbacks on all possible
> > CPUs during size transitions, triggering workqueue lockup warnings for all
> > never-onlined CPUs.
> > 
> > Create workers for all possible CPUs during init, not just online ones.
> > For online CPUs, the behavior is unchanged - POOL_DISASSOCIATED is cleared
> > and the worker is bound to the CPU. For not-yet-online CPUs,
> > POOL_DISASSOCIATED remains set, so worker_attach_to_pool() marks the worker
> > UNBOUND and it can execute on any CPU. When the CPU later comes online,
> > rebind_workers() handles the transition to associated operation as usual.
> > 
> 
> With this patch, if a CPU has been onlined once, it should be ok to queue
> the work on that CPU even if it's offline now.


That already seems to hold without this patch. What this patch newly
covers is queueing on CPUs that have never been online.
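
To spell out the distinction, here is a small userspace model, not kernel
code. It assumes the claim above holds, i.e. that a per-CPU pool keeps usable
workers once its CPU has been online at least once. All names
(model_cpu_up(), model_cpu_down(), model_pool_has_workers(), MODEL_NR_CPUS)
are made up for illustration and only mirror the "has this CPU ever been
online" idea behind rcu_cpu_beenfullyonline().

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 400	/* possible CPUs, as on the s390 LPAR in question */

/* Hypothetical model state, not kernel data structures. */
static bool model_online_now[MODEL_NR_CPUS];
static bool model_ever_online[MODEL_NR_CPUS];

/* Bring a CPU online and remember that it has been online at least once. */
void model_cpu_up(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	model_online_now[cpu] = true;
	model_ever_online[cpu] = true;
}

/* Take a CPU offline; the "ever online" bit is deliberately kept. */
void model_cpu_down(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	model_online_now[cpu] = false;
}

/*
 * Assumption of this model: without the patch a per-CPU pool has workers
 * only if its CPU has ever been online; with the patch every possible
 * CPU's pool gets workers at init.
 */
bool model_pool_has_workers(int cpu, bool patched)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	return patched || model_ever_online[cpu];
}
```

In this model a CPU that went up and down again still has workers without
the patch, while a never-onlined CPU only gets them with the patch applied.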

Do we actually need to create workers for every possible CPU at boot?
On the s390 LPAR in question (76 online / 400 possible) that's a few
hundred extra kthreads kept around for the life of the system.
That's probably the same on PowerPC.
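
As a rough sanity check on that number, the arithmetic can be sketched as
below. This is only an estimate: it assumes one initial worker per pool and
two standard per-CPU pools (normal and highpri, NR_STD_WORKER_POOLS in
kernel/workqueue.c), and it ignores any workers created or reaped later. The
helper name extra_initial_workers() and MODEL_POOLS_PER_CPU are made up for
this example.

```c
#include <assert.h>

/* Assumed: two standard per-CPU pools (normal + highpri). */
#define MODEL_POOLS_PER_CPU 2

/*
 * Estimate how many additional initial workers get created when every
 * possible CPU, not only every online CPU, gets one worker per pool.
 */
unsigned int extra_initial_workers(unsigned int possible,
				   unsigned int online,
				   unsigned int pools_per_cpu)
{
	assert(possible >= online);
	return (possible - online) * pools_per_cpu;
}
```

Under those assumptions, extra_initial_workers(400, 76, MODEL_POOLS_PER_CPU)
gives 648 for the 76/400 LPAR, i.e. 324 never-onlined CPUs times two pools.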

Wouldn't Paul's SRCU-side fix [1] alone be enough here for PowerPC
as well? I retested it on s390 (76/400) and on x86 KVM with
--smp 16,maxcpus=255 and the lockup didn't reproduce in either case.

[1] https://lore.kernel.org/rcu/ed1fa6cd-7343-4ca3-8b9d-d699ca496f83@paulmck-laptop/
