Nathan Fontenot <nf...@linux.vnet.ibm.com> writes: > On 09/18/2018 05:32 AM, Gautham R Shenoy wrote: >> Hi Nathan, >> On Tue, Sep 18, 2018 at 1:05 AM Nathan Fontenot >> <nf...@linux.vnet.ibm.com> wrote: >>> >>> When performing partition migrations all present CPUs must be online >>> as all present CPUs must make the H_JOIN call as part of the migration >>> process. Once all present CPUs make the H_JOIN call, one CPU is returned >>> to make the rtas call to perform the migration to the destination system. >>> >>> During testing of migration and changing the SMT state we have found >>> instances where CPUs are offlined, as part of the SMT state change, >>> before they make the H_JOIN call. This results in a hung system where >>> every CPU is either in H_JOIN or offline. >>> >>> To prevent this this patch disables CPU hotplug during the migration >>> process. >>> >>> Signed-off-by: Nathan Fontenot <nf...@linux.vnet.ibm.com> >>> --- >>> arch/powerpc/kernel/rtas.c | 2 ++ >>> 1 file changed, 2 insertions(+) >>> >>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c >>> index 8afd146bc9c7..2c7ed31c736e 100644 >>> --- a/arch/powerpc/kernel/rtas.c >>> +++ b/arch/powerpc/kernel/rtas.c >>> @@ -981,6 +981,7 @@ int rtas_ibm_suspend_me(u64 handle) >>> goto out; >>> } >>> >>> + cpu_hotplug_disable(); >> >> So, some of the onlined CPUs ( via >> rtas_online_cpus_mask(offline_mask);) can go still offline, >> if the userspace issues an offline command, just before we execute >> cpu_hotplug_disable(). >> >> So we are narrowing down the race, but it still exists. Am I missing >> something ? > > You're correct, this narrows the window in which a CPU can go offline. > > In testing with this patch we have not been able to re-create the failure but > there is still a small window.
Well let's close it. We just need to check that all present CPUs are online after we've called cpu_hotplug_disable() don't we? cheers