On 10/29/15 19:39, Jordan Justen wrote:
> On 2015-10-29 04:45:37, Laszlo Ersek wrote:
>> On 10/29/15 02:32, Jordan Justen wrote:
>>> +    ASSERT (MaxProcessors > 0);
>>> +    PcdSet32 (PcdCpuMaxLogicalProcessorNumber, MaxProcessors);
>>
>> I think that when this branch is active, then
>> PcdCpuApInitTimeOutInMicroSeconds should *also* be set, namely to
>> MAX_UINT32 (~71 minutes, the closest we can get to "infinity"). When
>> this hint is available from QEMU, then we should practically disable the
>> timeout option in CpuDxe's AP counting.
>
> I think this is a good idea, but I don't think 71 minutes is useful.
> Perhaps 30 seconds? This seems more than adequate for hundreds of
> processors to startup. Or perhaps some timeout based on the number of
> processors?

No, my suggestion with the 71 minutes didn't aim at a "useful" timeout.
Instead, when QEMU provides the number of VCPUs via fw_cfg, I'd like to
take the timeout *completely* out of the picture. Wait until the
advertised number of VCPUs come up, period. If they don't all appear,
then hang forever. Well, at least for 71 minutes, which is the same for
interactive users.

If 30 seconds elapse and we boot with 1 or 2 VCPUs missing, then things
will break hard. I don't actually *expect* this to occur against a 30
second timeout, but 30 seconds still sends the wrong message to the
programmer and the user. It looks like a real, reasonable timeout. While
in this case, the loop should never exit on a timeout, and 0xFFFFFFFF
communicates that.

> Janusz and I were discussing
> https://github.com/tianocore/edk2/issues/21 on irc. We increased the
> timeout to 10 seconds, and with only 8 processors it was still timing
> out.

Ugh.

> Obviously we are somehow failing to start the processors correctly, or
> QEMU/KVM is doing something wrong.

I think the actual issue we're fighting here is described in
<http://thread.gmane.org/gmane.comp.bios.edk2.devel/3260>. Due to the
kernel commit named there, and due to a physical device being assigned
to the guest, guest memory becomes uncacheable for each AP, until the AP
clears CR0.CD. I guess... And that should slow it down extremely.

> Have you been able to reproduce this issue?

I think I have, although I didn't try. :) My current host kernel is
based on v4.3-rc3 (upon which kvm/master is based, upon which I have a
fix), and the commit in question (b18d5431acc7) is part of v4.2-rc1. If
you have a host kernel at least as fresh as v4.2-rc1 (and I do, see
above), then you run into the issue automatically. For which reason I've
been carrying my patch referenced above in my development branches --
I've been focusing on the SMM issues, and solving (or working around)
the MP startup problem is a prerequisite for that.

So, yes, saw it, worked around it immediately, forgot about it. :)

> It seems like we need to
> set the timeout to 71 minutes, and then debug QEMU/KVM to see what
> state the APs are in...

I'm a bit overloaded to tackle this right now, but...

> Unfortunately I haven't yet been able to reproduce the bug on my
> system. :(

if you install a host kernel at least as recent as v4.2-rc1, then the
bug should pop up at once.

Thanks
Laszlo

>
> -Jordan
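
P.S. To make the timeout reasoning above concrete, here is a minimal
sketch in plain C (not EDK2 code, and not the actual CpuDxe logic). It
only illustrates two points from the discussion: a 32-bit microsecond
counter tops out at roughly 71 minutes, and the "effectively infinite"
value is chosen only when QEMU advertises a VCPU count. All names here
(ChooseApInitTimeoutUs, MicrosecondsToMinutes, AllProcessorsUp,
DEFAULT_AP_TIMEOUT_US) and the fallback value are hypothetical, picked
purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fallback used when QEMU provides no VCPU count. The value
   is an assumption for illustration, not one taken from OVMF. */
#define DEFAULT_AP_TIMEOUT_US  50000u

/* Largest value a 32-bit microsecond timeout can hold (0xFFFFFFFF). */
#define NO_TIMEOUT_US  UINT32_MAX

/* Pick the AP init timeout in microseconds.
   VcpuCountFromFwCfg == 0 stands for "QEMU did not advertise a count". */
uint32_t
ChooseApInitTimeoutUs (uint32_t VcpuCountFromFwCfg)
{
  if (VcpuCountFromFwCfg > 0) {
    /* Count is known: wait for every VCPU, never give up early. */
    return NO_TIMEOUT_US;
  }
  return DEFAULT_AP_TIMEOUT_US;
}

/* Convert microseconds to whole minutes, rounding down. */
uint32_t
MicrosecondsToMinutes (uint32_t Microseconds)
{
  return Microseconds / (60u * 1000000u);
}

/* Return 1 once every advertised VCPU has checked in, 0 otherwise. */
int
AllProcessorsUp (uint32_t Expected, uint32_t Arrived)
{
  assert (Expected > 0);  /* mirrors ASSERT (MaxProcessors > 0) in the patch */
  return Arrived >= Expected;
}
```

With these definitions, MicrosecondsToMinutes (NO_TIMEOUT_US) evaluates
to 71, which is where the "~71 minutes" figure comes from. A wait loop
built on AllProcessorsUp would then exit on the VCPU count and, in
practice, never on the timeout.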