On 11/04/26 12:23 am, Tejun Heo wrote:
Hello,
On Thu, Apr 09, 2026 at 11:10:04AM -0700, Boqun Feng wrote:
On Thu, Apr 09, 2026 at 07:47:09AM -1000, Tejun Heo wrote:
On Thu, Apr 09, 2026 at 10:40:05AM -0700, Boqun Feng wrote:
On Thu, Apr 09, 2026 at 10:26:49AM -0700, Boqun Feng wrote:
On Thu, Apr 09, 2026 at 03:08:45PM +0200, Vasily Gorbik wrote:
Commit 61bbcfb50514 ("srcu: Push srcu_node allocation to GP when
non-preemptible") defers srcu_node tree allocation when called under
raw spinlock, putting SRCU through ~6 transitional grace periods
(SRCU_SIZE_ALLOC to SRCU_SIZE_BIG). During this transition srcu_gp_end()
uses mask = ~0, which makes srcu_schedule_cbs_snp() call queue_work_on()
for every possible CPU. Since rcu_gp_wq is WQ_PERCPU, work targets
per-CPU pools directly - pools for not-online CPUs have no workers,
[Cc workqueue]
Hmm.. I thought for offline CPUs the corresponding worker pools become
unbound ones, hence there are still workers?
Ah, as Paul replied in another email, the problem was because these CPUs
had never been onlined, so they don't even have unbound workers?
Hahaha, we do initialize worker pools for every possible CPU but the
transition to unbound operation happens in the hot unplug callback. We
probably need to do some of the hot unplug operation during init if the CPU
;-) ;-) ;-)
Seems that we (mostly Paul) have our own trick to track whether a CPU
has ever been onlined in RCU, see rcu_cpu_beenfullyonline(). Paul also
used it in his fix [1]. And I think it won't be that hard to copy it
into workqueue and let queue_work_on() use it so that if the user queues
a work on a never-onlined CPU, it can detect it (with a warning?) and do
something?
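The idea above (remember which CPUs have ever been onlined, and redirect
work aimed at the others) can be modelled in userspace roughly as follows.
This is only a sketch under stated assumptions: ever_online_mask,
model_cpu_online(), model_cpu_ever_online(), model_pick_target() and
MODEL_CPU_UNBOUND are hypothetical names made up for illustration. They are
not workqueue or RCU APIs, and the real queue_work_on() path is considerably
more involved.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_CPUS     64    /* hypothetical limit for this model */
#define MODEL_CPU_UNBOUND (-1)  /* hypothetical "run on any CPU" target */

/* Bit n is set once CPU n has come online; it is never cleared again. */
static uint64_t ever_online_mask;

static bool model_cpu_valid(int cpu)
{
	return cpu >= 0 && cpu < MODEL_NR_CPUS;
}

/* Would be called from the CPU online path in a real implementation. */
void model_cpu_online(int cpu)
{
	if (model_cpu_valid(cpu))
		ever_online_mask |= UINT64_C(1) << cpu;
}

bool model_cpu_ever_online(int cpu)
{
	return model_cpu_valid(cpu) && ((ever_online_mask >> cpu) & 1);
}

/*
 * Pick where a work item requested for @cpu should actually run:
 * the CPU itself if it has ever been online, otherwise the unbound
 * fallback (a real implementation might also emit a warning here).
 */
int model_pick_target(int cpu)
{
	return model_cpu_ever_online(cpu) ? cpu : MODEL_CPU_UNBOUND;
}
```

In this model a CPU that was online once and later went offline keeps its
bit, which matches the "ever been onlined" semantics discussed above; only
CPUs that never came up are redirected.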
The easiest way to do this is just creating the initial workers for all
possible pools. Please see below. However, the downside is that it's going
to create all workers for all possible cpus. This isn't a problem for
anybody else but these IBM mainframes often come up with a lot of possible
but not-yet-or-ever-online CPUs for capacity management, so the cost may not
be negligible on some configurations.
IBM folks, is that okay?
Also, why do you need to queue work items on an offline CPU? Do they
actually have to be per-cpu? Can you get away with using an unbound
workqueue?
Thanks.
Hi Tejun,
Thank you for the patch addressing the workqueue lockup issue.
Workqueue lockup issue (PowerPC):
https://lore.kernel.org/lkml/[email protected]/
Regarding the approach of creating workers for all possible CPUs: On IBM
PowerPC, we commonly see configurations with a large number of possible
CPUs for capacity management. For example, systems with 384 possible
CPUs but only 80 online. This is by design - the additional capacity
exists for dynamic activation based on licensing and workload requirements.
Creating workers for all 384 possible CPUs upfront would mean allocating
resources for 304 workers that may never be used. While I understand
this is the simplest solution to the race condition, I'm concerned about
the memory overhead on such configurations.
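To make the arithmetic concrete, here is a small userspace sketch of the
estimate. Two things in it are assumptions rather than measurements. First,
it counts two standard pools (normal and high priority) per CPU, which is
how kernel/workqueue.c lays out per-CPU pools as far as I can tell, so the
patch's inner loop would create two workers per CPU and 304 offline CPUs
would mean 608 initial workers rather than 304. Second, the per-worker
footprint is a caller-supplied placeholder; the real figure (kernel stack,
task_struct, struct worker) has to be measured on the target system.

```c
#include <stddef.h>

/* ASSUMPTION: two standard per-CPU pools (normal and high priority). */
#define MODEL_STD_POOLS_PER_CPU 2

/* Initial workers added for CPUs that are possible but not online. */
size_t model_extra_workers(size_t possible, size_t online)
{
	if (possible < online)
		return 0;
	return (possible - online) * MODEL_STD_POOLS_PER_CPU;
}

/*
 * Rough extra memory in bytes. @per_worker_bytes is an ASSUMED
 * footprint supplied by the caller, not a value taken from the kernel.
 */
size_t model_extra_bytes(size_t possible, size_t online,
			 size_t per_worker_bytes)
{
	return model_extra_workers(possible, online) * per_worker_bytes;
}
```

For example, model_extra_bytes(384, 80, 32 * 1024) gives 608 workers times
a purely illustrative 32 KiB each, i.e. 19 MiB. The point of the sketch is
the shape of the calculation, not the specific per-worker number.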
Two questions:
1. What is the per-worker memory footprint? Can we quantify the overhead
for systems with large possible-but-offline CPU counts?
2. Would an alternative approach be feasible - such as lazy worker
creation during CPU hotplug, or deferring worker creation until a CPU
actually comes online?
I can test this patch on our IBM PowerPC systems to measure the actual
memory impact and verify the POOL_DISASSOCIATED handling works correctly
with large offline CPU counts. Would that be helpful?
Please let me know your thoughts.
Thanks,
Samir
From: Tejun Heo <[email protected]>
Subject: workqueue: Create workers for all possible CPUs on init
Per-CPU worker pools are initialized for every possible CPU during early boot,
but workqueue_init() only creates initial workers for online CPUs. On systems
where possible CPUs outnumber online CPUs (e.g. s390 LPARs with 76 online and
400 possible CPUs), the pools for never-onlined CPUs have POOL_DISASSOCIATED
set but no workers. Any work item queued on such a CPU hangs indefinitely.
This was exposed by 61bbcfb50514 ("srcu: Push srcu_node allocation to GP when
non-preemptible") which made SRCU schedule callbacks on all possible CPUs
during size transitions, triggering workqueue lockup warnings for all
never-onlined CPUs.
Create workers for all possible CPUs during init, not just online ones. For
online CPUs, the behavior is unchanged - POOL_DISASSOCIATED is cleared and the
worker is bound to the CPU. For not-yet-online CPUs, POOL_DISASSOCIATED
remains set, so worker_attach_to_pool() marks the worker UNBOUND and it can
execute on any CPU. When the CPU later comes online, rebind_workers() handles
the transition to associated operation as usual.
Reported-by: Vasily Gorbik <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Paul E. McKenney <[email protected]>
---
kernel/workqueue.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -8068,9 +8068,10 @@ void __init workqueue_init(void)
 		for_each_bh_worker_pool(pool, cpu)
 			BUG_ON(!create_worker(pool));
 
-	for_each_online_cpu(cpu) {
+	for_each_possible_cpu(cpu) {
 		for_each_cpu_worker_pool(pool, cpu) {
-			pool->flags &= ~POOL_DISASSOCIATED;
+			if (cpu_online(cpu))
+				pool->flags &= ~POOL_DISASSOCIATED;
 			BUG_ON(!create_worker(pool));
 		}
 	}
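
The behavioural difference described in the changelog can be illustrated
with a toy userspace model. This is a simplified sketch, not kernel code:
struct model_pool, model_workqueue_init() and model_stuck_after_boot() are
hypothetical names, there is one pool per CPU instead of the kernel's set of
standard pools, and "running" a work item just means clearing a counter. The
disassociated field only mirrors the idea of POOL_DISASSOCIATED.

```c
#include <stdbool.h>

#define MODEL_POSSIBLE_CPUS 8	/* hypothetical number of possible CPUs */

struct model_pool {
	bool disassociated;	/* mirrors the idea of POOL_DISASSOCIATED */
	int nr_workers;		/* workers able to service this pool */
	int nr_pending;		/* queued but not yet executed items */
};

static struct model_pool model_pools[MODEL_POSSIBLE_CPUS];

/*
 * CPUs 0..nr_online-1 are online at boot. With @all_possible false only
 * online CPUs get an initial worker (old behaviour); with it true every
 * possible CPU gets one, and offline pools stay disassociated so their
 * worker may run anywhere (behaviour after the patch).
 */
void model_workqueue_init(int nr_online, bool all_possible)
{
	for (int cpu = 0; cpu < MODEL_POSSIBLE_CPUS; cpu++) {
		struct model_pool *pool = &model_pools[cpu];
		bool online = cpu < nr_online;

		pool->disassociated = !online;
		pool->nr_workers = (online || all_possible) ? 1 : 0;
		pool->nr_pending = 0;
	}
}

/*
 * Queue one item on every possible CPU, as SRCU does during its size
 * transition, then let each pool's workers drain it. Returns how many
 * items are left stuck because their pool has no worker.
 */
int model_stuck_after_boot(int nr_online, bool all_possible)
{
	int stuck = 0;

	model_workqueue_init(nr_online, all_possible);

	for (int cpu = 0; cpu < MODEL_POSSIBLE_CPUS; cpu++)
		model_pools[cpu].nr_pending++;

	for (int cpu = 0; cpu < MODEL_POSSIBLE_CPUS; cpu++) {
		struct model_pool *pool = &model_pools[cpu];

		if (pool->nr_workers > 0)
			pool->nr_pending = 0;		/* item executes */
		else
			stuck += pool->nr_pending;	/* item hangs */
	}
	return stuck;
}
```

With 2 of 8 CPUs online, model_stuck_after_boot(2, false) leaves 6 items
stuck (one per never-onlined CPU), while model_stuck_after_boot(2, true)
leaves none, at the cost of a worker per possible CPU.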