Hi Nathan, On Tue, Sep 18, 2018 at 1:05 AM Nathan Fontenot <nf...@linux.vnet.ibm.com> wrote: > > When performing partition migrations all present CPUs must be online > as all present CPUs must make the H_JOIN call as part of the migration > process. Once all present CPUs make the H_JOIN call, one CPU is returned > to make the rtas call to perform the migration to the destination system. > > During testing of migration and changing the SMT state we have found > instances where CPUs are offlined, as part of the SMT state change, > before they make the H_JOIN call. This results in a hung system where > every CPU is either in H_JOIN or offline. > > To prevent this this patch disables CPU hotplug during the migration > process. > > Signed-off-by: Nathan Fontenot <nf...@linux.vnet.ibm.com> > --- > arch/powerpc/kernel/rtas.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c > index 8afd146bc9c7..2c7ed31c736e 100644 > --- a/arch/powerpc/kernel/rtas.c > +++ b/arch/powerpc/kernel/rtas.c > @@ -981,6 +981,7 @@ int rtas_ibm_suspend_me(u64 handle) > goto out; > } > > + cpu_hotplug_disable();
So, some of the onlined CPUs ( via rtas_online_cpus_mask(offline_mask);) can go still offline, if the userspace issues an offline command, just before we execute cpu_hotplug_disable(). So we are narrowing down the race, but it still exists. Am I missing something ? > stop_topology_update(); > > /* Call function on all CPUs. One of us will make the > @@ -995,6 +996,7 @@ int rtas_ibm_suspend_me(u64 handle) > printk(KERN_ERR "Error doing global join\n"); > > start_topology_update(); > + cpu_hotplug_enable(); > > /* Take down CPUs not online prior to suspend */ > cpuret = rtas_offline_cpus_mask(offline_mask); > -- Thanks and Regards gautham.