To confirm, you want the WARN_ON(cpumask_any(pool->attrs->cpumask) >= NR_CPUS) at the point where I place my current patch?
On 05/23/2017 02:49 PM, Tejun Heo wrote: > Hello, Michael. > > On Tue, May 23, 2017 at 02:44:23PM -0500, Michael Bringmann wrote: >> On 05/16/2017 10:55 AM, Tejun Heo wrote: >>> Hello, Michael. >>> >>> On Mon, May 15, 2017 at 10:48:04AM -0500, Michael Bringmann wrote: >>>>>> --- a/kernel/workqueue.c >>>>>> +++ b/kernel/workqueue.c >>>>>> @@ -3366,6 +3366,8 @@ static struct worker_pool *get_unbound_pool(const >>>>>> struct workqueue_attrs *attrs) >>>>>> copy_workqueue_attrs(pool->attrs, attrs); >>>>>> pool->node = target_node; >>>>>> >>>>>> + cpumask_copy(pool->attrs->cpumask, >>>>>> cpumask_of(smp_processor_id())); >>>>> >>>>> What prevents a cpu getting added right here tho? >>>> >>>> PowerPC has only one control path to add/remove CPUs via DLPAR operations. >>>> Even so, the underlying code is protected through multiple locks. >>> >>> The more I look at the patch, the less sense it seems to make. So, >>> whenever we create a new pool, we ignore the requested cpumask and >>> override it with the cpumask of the current thread? >> >> No. As I mentioned previously, the operation/problem occurs within a DLPAR >> hotplug add/remove operation. This is happening to a node which previously > > But that's what the code is doing. Whenever it creates a new unbound > pool, it ends up ignoring the requested cpumask and overwrites it with > the cpumask containing self. > >> did not have any CPUs associated to it -- we are trying to add more resources >> to an LPAR / partition. At this point, the cpumask for the node is empty / >> zero. >> Sorry for not being more clear on this point earlier. > ... >>> A new unbound workqueue and thus unbound pool can also be created from >>> paths outside cpu hotplug, so get_unbound_pool() can race against >>> hotplug. Can you please explain the failures that you see in more >>> detail? I'm sure your patch works around the issue somehow but it >>> doesn't look like the right fix. >> >> We fill in an empty cpumask field with a guaranteed non-empty value. >> I verified that the incoming cpumask in the attrs was zero at this point >> preceding the failure. If we proceed without putting in a useful value, >> we go to 'wake_up_process()' (kernel/sched/core.c) next to wakeup the new >> worker for the new unbound pool. While there, the code runs through >> 'select_task_rq()' and invokes cpumask_any() on a copy of the cpumask. >> Unfortunately, running that function over an empty/non-initialized cpumask >> returns an index beyond the end of the list, resulting shortly thereafter >> in an instruction/data fetch exception. >> >> If you have a suggestion for an alternate non-empty value to use, I would >> be happy to try it. > > Can you please post the backtrace of the problematic worker pool being > created (WARN_ON empty cpumask while creating a new pool)? > > Thanks. > -- Michael W. Bringmann Linux Technology Center IBM Corporation Tie-Line 363-5196 External: (512) 286-5196 Cell: (512) 466-0650 m...@linux.vnet.ibm.com