On 11/10/16 12:17, Yao, Jiewen wrote:
> Hi Laszlo
> 
> Thanks for testing this for us.
> 
> Are you saying Jeff’s patch introduces a new issue?
> 
> Or is this a previous issue but just not fixed by Jeff’s patch?

With your v2 series applied, Jeff's patches replace the crash /
emulation failure symptoms during S3 resume with less intrusive
symptoms, namely that some of the APs cannot be brought up by the OS,
occasionally.

Without your v2 series applied, Jeff's patches seem to present the same
symptoms (OS cannot bring up some APs), although much less frequently.
However, I cannot say definitively whether or not this exact same issue
exists, on Ia32X64, with none of the patch sets applied. I haven't seen
it before (on Ia32X64), but maybe I just haven't tried hard enough.

I guess I should try harder and see if the "lost AP" issue exists
without either patch set applied.

Thanks
Laszlo


> 
> Thank you
> 
> Yao Jiewen
> 
>  
> 
> *From:*Laszlo Ersek [mailto:ler...@redhat.com]
> *Sent:* Thursday, November 10, 2016 6:41 PM
> *To:* Fan, Jeff <jeff....@intel.com>
> *Cc:* edk2-de...@ml01.01.org; Yao, Jiewen <jiewen....@intel.com>; Paolo
> Bonzini <pbonz...@redhat.com>
> *Subject:* Re: [edk2] [PATCH 0/2] Put AP into safe hlt-loop code on S3 path
> 
>  
> 
> On 11/10/16 07:07, Jeff Fan wrote:
>> On the S3 path, we wake up the APs to restore CPU context in the
>> PiSmmCpuDxeSmm driver. If an NMI or SMI occurs, the APs may exit the
>> HLT state and execute the instruction after the HLT instruction.
>> 
>> But the APs are not running safe code at that point, which makes OVMF
>> S3 boot unstable.
>> 
>> https://bugzilla.tianocore.org/show_bug.cgi?id=216
>> 
>> I tested real platform with 64bit DXE.
>> 
>> Jeff Fan (2):
>>   UefiCpuPkg/PiSmmCpuDxeSmm: Put AP into safe hlt-loop code on S3 path
>>   UefiCpuPkg/PiSmmCpuDxeSmm: Place AP to 32bit protected mode on S3 path
>> 
>>  UefiCpuPkg/PiSmmCpuDxeSmm/CpuS3.c             | 31 ++++++++++++++
>>  UefiCpuPkg/PiSmmCpuDxeSmm/Ia32/SmmFuncsArch.c | 25 ++++++++++++
>>  UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.h    | 13 ++++++
>>  UefiCpuPkg/PiSmmCpuDxeSmm/X64/SmmFuncsArch.c  | 59 +++++++++++++++++++++++++++
>>  4 files changed, 128 insertions(+)
>> 
> 
> I applied this on top of Jiewen's v2, for testing.
> 
> This series (with my addition for patch #1) doesn't fix the boot failure in 
> case 8. (See "case 8" in 
> <https://lists.01.org/pipermail/edk2-devel/2016-November/004316.html>.) I 
> don't think the series aims to do that at all, but since it modifies the 
> Ia32/SmmFuncsArch.c file, I thought I'd give it a shot.
> 
> The series (with my addition for patch #1) changed the behavior of S3 resume, 
> in case 13. There seem to be no crashes / emulation failures now. However, in 
> some of the tries, the resume seems to include a several second long busy 
> loop, and after that -- although the guest OS does come back up --, I cannot 
> access *some* of the APs from within the OS:
> 
> # this works, quickly
> taskset -c 0 efibootmgr 
> 
> # this fails
> taskset -c 1 efibootmgr
> taskset: failed to set pid 0's affinity: Invalid argument
> 
> # these work again, albeit more slowly (as expected)
> taskset -c 2 efibootmgr
> taskset -c 3 efibootmgr
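
The per-CPU check above could be automated roughly as follows. This is only a sketch: it pins the current process to each CPU in turn, which is what "taskset -c N" does, and treats an OSError (EINVAL, "Invalid argument") as "AP not schedulable". The function name probe_cpus is made up for illustration, and os.sched_setaffinity is only available on Linux.

```python
import os


def probe_cpus(cpus):
    """Try to pin the current process to each CPU in turn.

    Returns {cpu: True/False}. False means the kernel refused the
    affinity mask, which is what taskset reports as "Invalid argument".
    """
    original = os.sched_getaffinity(0)  # remember the mask we started with
    usable = {}
    try:
        for cpu in cpus:
            try:
                os.sched_setaffinity(0, {cpu})
                usable[cpu] = True
            except OSError:
                usable[cpu] = False
    finally:
        # Always restore the original affinity mask.
        os.sched_setaffinity(0, original)
    return usable
```

In the 4-VCPU guest above, probe_cpus(range(4)) would be expected to report CPU 1 as unusable when the symptom is present.
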
> 
> I've seen this symptom ("AP goes lost during S3 resume") with the Ia32 SMM 
> build before (without Jiewen's v2 series applied).
> 
> If I run the "info cpus" QEMU command, I get:
> 
> * CPU #0: pc=0xffffffff8105eb26 (halted) thread_id=22745
>   CPU #1: pc=0x00000000fffffff0 thread_id=22746
>   CPU #2: pc=0xffffffff8105eb26 (halted) thread_id=22747
>   CPU #3: pc=0xffffffff8105eb26 (halted) thread_id=22748
> 
> The halted status for #0, #2 and #3 is fine; that's just Linux at work. CPU#1 
> is strange -- not halted, but somehow stuck in the reset vector (0xfffffff0)?
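
Spotting this state in the monitor output can be scripted. The sketch below assumes the "info cpus" format shown above (other QEMU versions may print it differently) and flags any CPU that is not halted and has its pc at 0xfffffff0. The names CPU_LINE, RESET_VECTOR and stuck_at_reset are illustrative only.

```python
import re

# Matches lines such as "  CPU #1: pc=0x00000000fffffff0 thread_id=22746";
# the " (halted)" marker is optional.
CPU_LINE = re.compile(
    r"CPU #(?P<index>\d+): pc=(?P<pc>0x[0-9a-fA-F]+)(?P<halted> \(halted\))?"
)
RESET_VECTOR = 0xFFFFFFF0


def stuck_at_reset(info_cpus_output):
    """Return indices of CPUs that are not halted and sit at the reset vector."""
    stuck = []
    for match in CPU_LINE.finditer(info_cpus_output):
        pc = int(match.group("pc"), 16)
        if pc == RESET_VECTOR and match.group("halted") is None:
            stuck.append(int(match.group("index")))
    return stuck
```
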
> 
> The guest kernel dmesg contains the following messages:
> 
>> [   55.805153] PM: Restoring platform NVS memory
>> [   55.805153] Enabling non-boot CPUs ...
>> [   55.805153] x86: Booting SMP configuration:
>> [   55.805516] smpboot: Booting Node 0 Processor 1 APIC 0x1
>> [   65.816049] smpboot: do_boot_cpu failed(-1) to wakeup CPU#1 <- HERE
>> [   65.816738] Error taking CPU1 up: -5
>> [   65.817050] smpboot: Booting Node 0 Processor 2 APIC 0x2
>> [   65.817029] kvm-clock: cpu 2, msr 1:7ffd6081, secondary cpu clock
>> [   65.817029] kvm: enabling virtualization on CPU2
>> [   65.832296] KVM setup async PF for cpu 2
>> [   65.832607] kvm-stealtime: cpu 2, msr 17fd0e100
>> [   65.833031] CPU2 is up
>> [   65.833242] smpboot: Booting Node 0 Processor 3 APIC 0x3
>> [   65.833229] kvm-clock: cpu 3, msr 1:7ffd60c1, secondary cpu clock
>> [   65.833229] kvm: enabling virtualization on CPU3
>> [   65.848594] KVM setup async PF for cpu 3
>> [   65.848940] kvm-stealtime: cpu 3, msr 17fd8e100
>> [   65.849393] CPU3 is up
>> [   65.849722] ACPI: Waking up from system sleep state S3
> 
> Note the 10 second gap where I put the marker (and the error message itself, 
> too).
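
For repeated resume tests, the failed wakeup and the length of the gap can be pulled out of the dmesg text automatically. This is a rough sketch that relies only on the two smpboot message formats visible in the log above; the function name failed_wakeups is hypothetical.

```python
import re

STAMPED = re.compile(r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)")
BOOTING = re.compile(r"smpboot: Booting Node \d+ Processor (?P<cpu>\d+) ")
FAILED = re.compile(
    r"smpboot: do_boot_cpu failed\(-?\d+\) to wakeup CPU#(?P<cpu>\d+)"
)


def failed_wakeups(dmesg_text):
    """Map each CPU whose wakeup failed to the seconds the kernel waited for it."""
    boot_started = {}
    failures = {}
    for line in dmesg_text.splitlines():
        stamped = STAMPED.search(line)
        if not stamped:
            continue
        ts = float(stamped.group("ts"))
        msg = stamped.group("msg")
        booting = BOOTING.search(msg)
        if booting:
            # Remember when the kernel started waking this CPU.
            boot_started[int(booting.group("cpu"))] = ts
            continue
        failed = FAILED.search(msg)
        if failed:
            cpu = int(failed.group("cpu"))
            if cpu in boot_started:
                failures[cpu] = round(ts - boot_started[cpu], 6)
    return failures
```

An empty result after a resume would mean that no AP was reported as failing to come up.
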
> 
> Here's an excerpt from the KVM trace:
> 
>>  CPU-23509 [002]  8406.908787: kvm_enter_smm:        vcpu 1: entering SMM, smbase 0x30000
>>  CPU-23509 [002]  8406.908836: kvm_enter_smm:        vcpu 1: leaving SMM, smbase 0x7ffb3000
>>  CPU-23510 [003]  8406.908850: kvm_enter_smm:        vcpu 2: entering SMM, smbase 0x30000
>>  CPU-23510 [003]  8406.908881: kvm_enter_smm:        vcpu 2: leaving SMM, smbase 0x7ffb5000
>>  CPU-23511 [001]  8406.908908: kvm_enter_smm:        vcpu 3: entering SMM, smbase 0x30000
>>  CPU-23511 [001]  8406.908941: kvm_enter_smm:        vcpu 3: leaving SMM, smbase 0x7ffb7000
>>  CPU-23508 [005]  8406.908951: kvm_enter_smm:        vcpu 0: entering SMM, smbase 0x30000
>>  CPU-23508 [005]  8406.908989: kvm_enter_smm:        vcpu 0: leaving SMM, smbase 0x7ffb1000
>>  CPU-23511 [001]  8406.920215: kvm_enter_smm:        vcpu 3: entering SMM, smbase 0x7ffb7000
>>  CPU-23509 [002]  8406.920225: kvm_enter_smm:        vcpu 1: entering SMM, smbase 0x7ffb3000
>>  CPU-23510 [003]  8406.920225: kvm_enter_smm:        vcpu 2: entering SMM, smbase 0x7ffb5000
>>  CPU-23508 [005]  8406.920227: kvm_enter_smm:        vcpu 0: entering SMM, smbase 0x7ffb1000
>>  CPU-23508 [005]  8406.920262: kvm_enter_smm:        vcpu 0: leaving SMM, smbase 0x7ffb1000
>>  CPU-23511 [001]  8406.920263: kvm_enter_smm:        vcpu 3: leaving SMM, smbase 0x7ffb7000
>>  CPU-23508 [005]  8407.020292: kvm_enter_smm:        vcpu 0: entering SMM, smbase 0x7ffb1000
>>  CPU-23509 [006]  8407.020338: kvm_enter_smm:        vcpu 1: leaving SMM, smbase 0x7ffb3000
>>  CPU-23510 [003]  8407.020338: kvm_enter_smm:        vcpu 2: leaving SMM, smbase 0x7ffb5000
>>  CPU-23508 [005]  8407.020338: kvm_enter_smm:        vcpu 0: leaving SMM, smbase 0x7ffb1000
> 
> It seems that VCPU#0 still leaves (and then re-enters) SMM while VCPU#1 and 
> VCPU#2 are firmly in SMM.
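
The overlap is easier to see if the enter/leave events are paired up per VCPU. The following is a sketch under the assumption that the trace uses exactly the kvm_enter_smm line format shown above; smm_intervals is a made-up helper name. It matches on the timestamp and the "entering"/"leaving" words only, so it does not matter whether the smbase part has been wrapped onto the next line.

```python
import re

EVENT = re.compile(
    r"(?P<ts>\d+\.\d+): kvm_enter_smm:\s+vcpu (?P<vcpu>\d+): "
    r"(?P<what>entering|leaving) SMM"
)


def smm_intervals(trace_text):
    """Pair up SMM enter/leave events per VCPU.

    Returns (intervals, still_inside):
      intervals    -- {vcpu: [(enter_ts, leave_ts), ...]} for completed visits
      still_inside -- {vcpu: enter_ts} for VCPUs with no matching "leaving"
    """
    intervals = {}
    inside = {}
    for m in EVENT.finditer(trace_text):
        ts = float(m.group("ts"))
        vcpu = int(m.group("vcpu"))
        if m.group("what") == "entering":
            inside[vcpu] = ts
        elif vcpu in inside:
            intervals.setdefault(vcpu, []).append((inside.pop(vcpu), ts))
    return intervals, inside
```

On the excerpt above, this would show VCPU#1 and VCPU#2 each with one long interval (from 8406.920225 to 8407.020338), while VCPU#0 has two short intervals inside that same window.
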
> 
> So this series is a clear improvement, but something else remains amiss.
> 
> If I remove Jiewen's v2 series, and apply only this one, then the symptom 
> shows up much less frequently, but it does exist:
> - With (Jiewen's v2 + this one), testing case 13, I hit the symptom on the 
> second resume,
> - With just this set applied, I hit the symptom (= one AP disappearing from 
> Linux after resume) only on the 24th resume.
> 
> Thanks
> Laszlo
> 
