On 03/18/2014 02:49 PM, Igor Mammedov wrote:
> On Tue, 18 Mar 2014 08:21:19 -0400
> Prarit Bhargava <pra...@redhat.com> wrote:
>
>>
>>
>> On 03/13/2014 10:25 AM, Igor Mammedov wrote:
>>> Hang is observed on virtual machines during CPU hotplug,
>>> especially in big guests with many CPUs. (It happens more
>>> often if host is over-committed).
>>>
>>
>> Hey Igor, I like this better than the previous version.  Thanks for
>> taking into account the possible races in this code.
>>
>> A quick question on system behaviour.  As you know I've been more
>> concerned lately with error handling, etc., through the cpu hotplug code
>> as we've seen several customer reports of silent failures or cascading
>> failures in the cpu hotplug code when users have been attempting to
>> perform physical hotplug.
>>
>> After your patches have been applied, in theory the following can happen:
>>
>> The master CPU is completing the AP cpu's bring up.  The AP cpu is doing
>> (sorry for the cut-and-paste),
>>
>> void cpu_init(void)
>> {
>>         int cpu = smp_processor_id();
>>         struct task_struct *curr = current;
>>         struct tss_struct *t = &per_cpu(init_tss, cpu);
>>         struct thread_struct *thread = &curr->thread;
>>
>>         /*
>>          * wait till the master CPU completes it's STARTUP sequence,
>>          * and decides to wait till this AP boots
>>          */
>>         while (!cpumask_test_cpu(cpu, cpu_callout_mask)) {
>>                 cpu_relax();
>>                 if (per_cpu(x86_cpu_to_apicid, cpu) == BAD_APICID)
>>                         halt();
>>         }
>>
>> and is spinning on cpu_relax().  Suppose something goes wrong and the
>> softlockup watchdog fires on the AP cpu:
>>
>> 1.  Can it? :)  ie) will the softlockup fire at this point of the AP
>> init?  Okay, I'm being really lazy and not looking at the code ;)
> It shouldn't, CPU is in pristine state and just came from boot trampoline at
> this point without interrupts configured yet.

Okay, not a big problem.

>
>>
>> 2.  Is there anything we can do in this code to notify the user of a
>> problem?  Even a pr_crit() here I think would help to indicate what went
>> wrong; it might be useful for future debugging in this area to have some
>> sort of output.  I think a WARN() or BUG() is necessary here as there are
>> several calls to cpu_init().
> Do you mean something like this:
>
> +               if (per_cpu(x86_cpu_to_apicid, cpu) == BAD_APICID) {
> +                       WARN_ON(1);
> +                       halt();
> +               }

Yeah, maybe WARN(1, "some comment") though.

>
>>
>> 3.  Change this comment:
>>
>>          * wait till the master CPU completes it's STARTUP sequence,
>>          * and decides to wait till this AP boots
>>
>> to
>>
>>         /* wait for the master CPU to complete this cpu's STARTUP. */ ?
> well, that is not quite the same as above, comment should underline that
> AP waits for ACK from master CPU before continuing with this AP
> initialization.
>
> How about:
>  /* wait for ACK from master CPU before continuing with AP initialization */

Awesome :)

P.

>
>>
>> Apologies for the late review,
>>
>> P.
>
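
For anyone following the thread, here is a rough userspace sketch of the handshake discussed above: the AP spins until the master sets its bit in the callout mask, and if its APIC ID turns out to be bad it prints a message before giving up, which is the behaviour point 2 asks for. This is only a model under stated assumptions, not the kernel code. `struct ap_state`, `ap_wait_for_master()`, `MAX_CPUS`, `max_spins` and the numeric value of `BAD_APICID` are all hypothetical stand-ins; the real code uses `cpu_callout_mask`, the per-cpu `x86_cpu_to_apicid`, `cpu_relax()`, `halt()` and `WARN()`, and has no spin limit.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_CPUS   8        /* hypothetical limit for this sketch */
#define BAD_APICID 0xFFFFu  /* stand-in value, not the kernel's definition */

enum ap_wait_result { AP_ACKED, AP_HALTED, AP_TIMED_OUT };

/* Hypothetical stand-ins for cpu_callout_mask and x86_cpu_to_apicid. */
struct ap_state {
    unsigned long callout_mask;      /* bit n set: master has ACKed CPU n */
    unsigned int  apicid[MAX_CPUS];  /* APIC ID recorded for each CPU */
};

/*
 * Model of the AP-side wait: check for the master's ACK first, then check
 * whether this CPU's APIC ID has been marked bad. The bounded loop stands
 * in for the cpu_relax() spin so that the sketch always terminates.
 */
enum ap_wait_result ap_wait_for_master(const struct ap_state *s, int cpu,
                                       unsigned long max_spins)
{
    unsigned long i;

    assert(cpu >= 0 && cpu < MAX_CPUS);

    for (i = 0; i < max_spins; i++) {
        if (s->callout_mask & (1UL << cpu))
            return AP_ACKED;
        if (s->apicid[cpu] == BAD_APICID) {
            /* In the kernel this would be a WARN() followed by halt(). */
            fprintf(stderr,
                    "CPU%d: bad APIC ID while waiting for master ACK\n", cpu);
            return AP_HALTED;
        }
    }
    return AP_TIMED_OUT;
}
```

A caller would fill in a `struct ap_state`, call `ap_wait_for_master(&state, cpu, limit)`, and branch on the result. Returning a result instead of halting is purely a convenience for exercising the three outcomes in userspace.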