PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg

Laszlo Ersek Thu, 29 Oct 2015 14:13:30 -0700

On 10/29/15 19:39, Jordan Justen wrote:
> On 2015-10-29 04:45:37, Laszlo Ersek wrote:
>> On 10/29/15 02:32, Jordan Justen wrote:
>>> +    ASSERT (MaxProcessors > 0);
>>> +    PcdSet32 (PcdCpuMaxLogicalProcessorNumber, MaxProcessors);
>>
>> I think that when this branch is active, then
>> PcdCpuApInitTimeOutInMicroSeconds should *also* be set, namely to
>> MAX_UINT32 (~71 minutes, the closest we can get to "infinity"). When
>> this hint is available from QEMU, then we should practically disable the
>> timeout option in CpuDxe's AP counting.
> 
> I think this is a good idea, but I don't think 71 minutes is useful.
> Perhaps 30 seconds? This seems more than adequate for hundreds of
> processors to startup. Or perhaps some timeout based on the number of
> processors?


No, my suggestion with the 71 minutes didn't aim at a "useful" timeout.
Instead, when QEMU provides the number of VCPUs via fw_cfg, I'd like to
take the timeout *completely* out of the picture. Wait until the
advertised number of VCPUs comes up, period. If they don't all appear,
then hang forever. Well, at least for 71 minutes, which amounts to the
same thing for interactive users.

If 30 seconds elapse and we boot with 1 or 2 VCPUs missing, then things
will break hard. I don't actually *expect* this to occur against a 30
second timeout, but 30 seconds still sends the wrong message to the
programmer and the user. It looks like a real, reasonable timeout,
whereas in this case the loop should never exit on a timeout, and
0xFFFFFFFF communicates that.
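
To illustrate the difference, here is a minimal standalone sketch in
plain C. It is not the CpuDxe code: wait_for_processors(), sim_count()
and sim_t are hypothetical names made up for this example, and time is
only simulated by adding the poll interval to a counter. It shows how a
short timeout lets the wait give up with VCPUs missing, while a timeout
of 0xFFFFFFFF microseconds (about 71 minutes) in practice waits until
the advertised count is reached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callback: returns how many processors are up so far. */
typedef uint32_t (*ap_count_fn)(void *ctx);

/* Convert a timeout in microseconds to whole minutes (rounded down). */
static uint32_t timeout_us_to_minutes(uint32_t timeout_us)
{
    return (uint32_t)((uint64_t)timeout_us / (60u * 1000000u));
}

/*
 * Poll until `expected` processors are up or `timeout_us` is used up.
 * Returns true if all processors arrived, false on timeout.
 * Time is simulated: a real implementation would stall for poll_us.
 */
static bool wait_for_processors(uint32_t expected, uint32_t timeout_us,
                                uint32_t poll_us, ap_count_fn count,
                                void *ctx)
{
    uint64_t elapsed_us = 0;

    assert(expected > 0);   /* mirrors ASSERT (MaxProcessors > 0) */
    assert(poll_us > 0);

    while (count(ctx) < expected) {
        if (elapsed_us >= timeout_us) {
            return false;   /* gave up with processors still missing */
        }
        elapsed_us += poll_us;
    }
    return true;
}

/* Simulated guest: one more processor appears every polls_per_cpu polls. */
typedef struct {
    uint32_t polls;
    uint32_t polls_per_cpu;
    uint32_t max_cpus;
} sim_t;

static uint32_t sim_count(void *ctx)
{
    sim_t *s = (sim_t *)ctx;
    uint32_t up;

    s->polls++;
    up = 1 + s->polls / s->polls_per_cpu;   /* the 1 is the BSP */
    return up < s->max_cpus ? up : s->max_cpus;
}
```

With a simulated guest whose APs start slowly, calling
wait_for_processors() with a 10000 microsecond timeout returns false
before all 8 processors are counted, while the same call with
UINT32_MAX returns true once the eighth one shows up.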

> Janusz and I were discussing
> https://github.com/tianocore/edk2/issues/21 on irc. We increased the
> timeout to 10 seconds, and with only 8 processors it was still timing
> out.

Ugh.

> Obviously we are somehow failing to start the processors correctly, or
> QEMU/KVM is doing something wrong.

I think the actual issue we're fighting here is described in
<http://thread.gmane.org/gmane.comp.bios.edk2.devel/3260>. Due to the
kernel commit named there, and due to a physical device being assigned
to the guest, guest memory becomes uncacheable for each AP, until the AP
clears CR0.CD. I guess... And that should slow it down extremely.

> Have you been able to reproduce this issue?

I think I have, although I didn't try. :) My current host kernel is
based on v4.3-rc3 (upon which kvm/master is based, upon which I have a
fix), and the commit in question (b18d5431acc7) is part of v4.2-rc1.

If you have a host kernel at least as fresh as v4.2-rc1 (and I do, see
above), then you run into the issue automatically. For which reason I've
been carrying my patch referenced above in my development branches --
I've been focusing on the SMM issues, and solving (or working around)
the MP startup problem is a prerequisite for that.

So, yes, saw it, worked around it immediately, forgot about it. :)

> It seems like we need to
> set the timeout to 71 minutes, and then debug QEMU/KVM to see what
> state the APs are in...

I'm a bit overloaded to tackle this right now, but...

> Unfortunately I haven't yet been able to reproduce the bug on my
> system. :(

if you install a host kernel at least as recent as v4.2-rc1, then the
bug should pop up at once.

Thanks
Laszlo

> 
> -Jordan
> 


Re: [edk2] [PATCH 6/6] OvmfPkg/PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg
