On Wed, Apr 29, 2026 at 07:08:23PM +0200, Vasily Gorbik wrote:
> On Wed, Apr 29, 2026 at 08:30:38PM +0530, Srikar Dronamraju wrote:
> > * Tejun Heo <[email protected]> [2026-04-10 08:53:30]:
> > > Hello,
> > > 
> > > > Seems that we (mostly Paul) have our own trick to track whether a CPU
> > > > has ever been onlined in RCU, see rcu_cpu_beenfullyonline(). Paul also
> > > > used it in his fix [1]. And I think it won't be that hard to copy it
> > > > into workqueue and let queue_work_on() use it so that if the user queues
> > > > a work on a never-onlined CPU, it can detect it (with a warning?) and do
> > > > something?
> > > 
> > > The easiest way to do this is just creating the initial workers for all
> > > possible pools. Please see below. However, the downside is that it's going
> > > to create all workers for all possible cpus. This isn't a problem for
> > > anybody else but these IBM mainframes often come up with a lot of possible
> > > but not-yet-or-ever-online CPUs for capacity management, so the cost may not
> > > be negligible on some configurations.
> > > 
> > > IBM folks, is that okay?
> > 
> > Even on PowerPC LPARs, it's not uncommon to have possible cpus != online cpus
> > at boot.  However your approach will work.
> > 
> > And Samir has already tested the same too and reported here
> > https://lkml.kernel.org/r/[email protected]
> > 
> > > From: Tejun Heo <[email protected]>
> > > Subject: workqueue: Create workers for all possible CPUs on init
> > > 
> > > Per-CPU worker pools are initialized for every possible CPU during early boot,
> > > but workqueue_init() only creates initial workers for online CPUs. On systems
> > > where possible CPUs outnumber online CPUs (e.g. s390 LPARs with 76 online and
> > > 400 possible CPUs), the pools for never-onlined CPUs have POOL_DISASSOCIATED
> > > set but no workers. Any work item queued on such a CPU hangs indefinitely.
> > > 
> > > This was exposed by 61bbcfb50514 ("srcu: Push srcu_node allocation to GP when
> > > non-preemptible") which made SRCU schedule callbacks on all possible CPUs
> > > during size transitions, triggering workqueue lockup warnings for all
> > > never-onlined CPUs.
> > > 
> > > Create workers for all possible CPUs during init, not just online ones. For
> > > online CPUs, the behavior is unchanged - POOL_DISASSOCIATED is cleared and the
> > > worker is bound to the CPU. For not-yet-online CPUs, POOL_DISASSOCIATED
> > > remains set, so worker_attach_to_pool() marks the worker UNBOUND and it can
> > > execute on any CPU. When the CPU later comes online, rebind_workers() handles
> > > the transition to associated operation as usual.
> > > 
> > 
> > With this patch, if a CPU has been onlined once, it should be ok to queue
> > the work on that CPU even if it's offline now.
> 
> That already seems to hold without this patch; what this patch newly
> covers is queueing on CPUs that have never been online.
> 
> Do we actually need to create workers for every possible CPU at boot?
> On the s390 LPAR in question (76 online / 400 possible) that's a few
> hundred extra kthreads kept around for the life of the system.
> That's probably the same on PowerPC.
> 
> Wouldn't Paul's SRCU-side fix [1] alone be enough here for PowerPC
> as well? I retested it on s390 (76/400) and on x86 KVM with
> --smp 16,maxcpus=255 and the lockup didn't reproduce in either case.
> 
> [1] https://lore.kernel.org/rcu/ed1fa6cd-7343-4ca3-8b9d-d699ca496f83@paulmck-laptop/

Just to emphasize that SRCU really was buggy before my fix.  The
queue_work_on() kernel-doc header clearly states the rules.  The bug
is even more embarrassing given just who it was that wrote those two
sentences.  ;-)

                                                        Thanx, Paul

/**
 * queue_work_on - queue work on specific cpu
 * @cpu: CPU number to execute work on
 * @wq: workqueue to use
 * @work: work to queue
 *
 * We queue the work to a specific CPU, the caller must ensure it
 * can't go away.  Callers that fail to ensure that the specified
 * CPU cannot go away will execute on a randomly chosen CPU.
 * But note well that callers specifying a CPU that never has been
 * online will get a splat.
 *
 * Return: %false if @work was already on a queue, %true otherwise.
 */
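
For illustration only, the rules in that header can be modeled in plain
userspace C.  This is a sketch under stated assumptions, not kernel code:
model_cpu_up(), model_cpu_down(), model_queue_work_on(), the enum, and
NR_POSSIBLE are all hypothetical names invented for this example, and the
"ever online" flag only loosely mirrors the idea behind
rcu_cpu_beenfullyonline() mentioned earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_POSSIBLE 8	/* hypothetical number of possible CPUs */

/* Hypothetical per-CPU state: online right now, and online at least once. */
static bool cpu_online_now[NR_POSSIBLE];
static bool cpu_ever_online[NR_POSSIBLE];

/* Outcomes described by the kernel-doc header, as modeled here. */
enum queue_result {
	QUEUE_ON_CPU,	/* work runs on the requested CPU */
	QUEUE_ANY_CPU,	/* CPU went away: work runs on some other CPU */
	QUEUE_SPLAT,	/* CPU was never online: caller bug, warn */
};

/* Model of a CPU coming online: the "ever online" flag is sticky. */
static void model_cpu_up(int cpu)
{
	assert(cpu >= 0 && cpu < NR_POSSIBLE);
	cpu_online_now[cpu] = true;
	cpu_ever_online[cpu] = true;
}

/* Model of a CPU going offline: only the "online now" flag is cleared. */
static void model_cpu_down(int cpu)
{
	assert(cpu >= 0 && cpu < NR_POSSIBLE);
	cpu_online_now[cpu] = false;
}

/* Classify a request to queue work on @cpu according to the header's rules. */
static enum queue_result model_queue_work_on(int cpu)
{
	if (cpu < 0 || cpu >= NR_POSSIBLE || !cpu_ever_online[cpu])
		return QUEUE_SPLAT;
	if (!cpu_online_now[cpu])
		return QUEUE_ANY_CPU;
	return QUEUE_ON_CPU;
}
```

In this model, a caller that only ever passes CPUs it has seen come online
gets QUEUE_ON_CPU or, if the CPU has since gone away, QUEUE_ANY_CPU.  A
caller that walks every possible CPU can hit QUEUE_SPLAT, which is the
case this thread is about.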
