Re: [edk2] [PATCH V2 0/6] Enable SMM page level protection.

Paolo Bonzini Thu, 10 Nov 2016 04:29:04 -0800


On 10/11/2016 11:41, Yao, Jiewen wrote:
> I also did not quite understand below log.
> 
> * CPU #0: pc=0x00000000c1844555 (halted) thread_id=7835
>   CPU #1: pc=0x00000000c1844555 (halted) thread_id=7836
>   CPU #2: pc=0x00000000c1844555 (halted) thread_id=7837
>   CPU #3: pc=0x000000007ffd17ca thread_id=7838
> 
> As I recall, writing to B2 only cause BSP get SMI on OVMF. AP does not enter 
> SMM mode.


It's not BSP that enters SMM, it's the currently executing processor.

So this means that CPU #3 has written to B2.

Thanks,

Paolo


> So why #3 can enter SMM mode? Is that expected behavior? Or unexpected 
> behavior?
> If this is expected, how this happened? Does OS send 
> SendSmiIpiAllExcludingSelf, after ExitBootServices()?
> 
> I will see if I can finish QEMU/KVM installation tomorrow.
> 
> If you have some idea on why and how #3 enter SMM, please let us know.
>
> 
> Thank you
> Yao Jiewen
> 
> 
> From: Laszlo Ersek [mailto:ler...@redhat.com]
> Sent: Thursday, November 10, 2016 4:46 AM
> To: Yao, Jiewen <jiewen....@intel.com>
> Cc: Tian, Feng <feng.t...@intel.com>; edk2-de...@ml01.01.org; Kinney, Michael 
> D <michael.d.kin...@intel.com>; Paolo Bonzini <pbonz...@redhat.com>; Fan, 
> Jeff <jeff....@intel.com>; Zeng, Star <star.z...@intel.com>
> Subject: Re: [edk2] [PATCH V2 0/6] Enable SMM page level protection.
> 
> On 11/09/16 07:25, Yao, Jiewen wrote:
>> Hi Laszlo
>> I will fix DEBUG message issue in V3 patch.
>>
>> Below is rest issues:
>>
>>
>> l  Case 13: S3 fails randomly.
>> A good news: I worked with Jeff Fan to root-cause the S3 resume issue. Here 
>> is detail.
>>
>>
>> 1)      We believe the dead CPU is AP. Not BSP.
>> The reason is that:
>>
>> 1.1)   The BSP already transfer control to OS waking vector. The GDT/IDT/CR3 
>> should be set by OS.
>>
>> 1.2)   The current dead CPU still has GDT/IDT point to a BIOS reserved 
>> memory. The CS/DS/SS is typical BIOS X64 mode setting.
>>
>> 1.3)   The current dead CPU still has CR3 in SMM. (Which is obvious wrong)
>>
>>
>> 2)      Based upon the 1), we reviewed S3 resume AP flow.
>> Current BSP will wake up AP in SMRAM, for security consideration. At that 
>> time, we are using SMM mode CR3. It is OK for BSP because BSP is NOT in SMM 
>> mode yet. Even after SMM rebase, we can still use it because SMRR is not set 
>> in first SMM rebase.
>> Current BSP just uses its own context to initialize AP. So that AP takes BSP 
>> CR3, which is SMM CR3, unfortunately.
>> After BSP initialized APs, the AP is put to HALT-LOOP in X64 mode. It is the 
>> last straw, because X64 mode halt still need paging.
>>
>>
>> 3)      The error happen, once the AP receives an interrupt (for whatever 
>> reason), AP starts executing code. However, that that time the AP might not 
>> be in SMM mode. It means SMM CR3 is not available. And then we see this.
>>
>>
>> 4)      I guess we did not see the error, or this is RANDOM issue, because 
>> it depends on if AP receives an interrupt before BSP send INIT-SIPI-SIPI.
>>
>>
>> 5)      The fix, I think, should be below:
>> We should always put AP to protected mode, so that no paging is needed.
>> We should put AP in above 1M reserved memory, instead of <1M memory, because 
>> <1M memory is restored.
>>
>>
>> Would you please file a bugzillar? I think we need assign CPU owner to fix 
>> that critical issue.
>>
>> There is no need to do more investigation. Thanks for your great help on 
>> that. :)
> 
> Thank you for your help!
> 
> I filed <https://bugzilla.tianocore.org/show_bug.cgi?id=216>. The title is
> 
>     BSP exits SMM and closes SMRAM on the S3 resume path before
>     meeting with AP(s)
> 
> I hope the title is mostly right. I didn't add any other details (I
> haven't gone through the thread in detail yet, and without that I can't
> even write up a semi-reasonable report myself). Instead, I referenced
> this message of yours in the report, and I also linked Paolo's analysis
> from elsewhere in the thread. I hope this will do for the report.
> 
> (Also, thank you Paolo, from the amazing analysis -- I haven't digested
> it yet, but I can already tell it's amazing! :))
> 
>> l  Case 17 - I do not think it is a real issue, because SMM is out of 
>> resource.
>>
>>
>> l  Case 8 - that is a very weird issue. I talk with Jeff again. I do not 
>> have a clear clue yet.
>>> ASSERT MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(73): 
>>> SpinLock != ((void *) 0)
>> Here is code. We do not know why there is some code need InitializeSpinLock 
>> after ExitBootServices.
>> SPIN_LOCK *
>> EFIAPI
>> InitializeSpinLock (
>>   OUT      SPIN_LOCK                 *SpinLock
>>   )
>> {
>>   ASSERT (SpinLock != NULL);
>>
>>   _ReadWriteBarrier();
>>   *SpinLock = SPIN_LOCK_RELEASED;
>>   _ReadWriteBarrier();
>>
>>   return SpinLock;
>> }
>>
>> If you can have a quick check on below, that would be great.
>>
>> 1)      Which processor triggers this ASSERT? BSP or AP.
>>
>> 2)      Which module triggers this ASSERT? Which module contains current RIP 
>> value?
> 
> First, one additional piece of info I have learned is that the issue
> does not always present itself. Sometimes the boot just works fine,
> other times the assert fires.
> 
> Using the QEMU monitor, I managed to get the following information with
> the "info cpus" command:
> 
> * CPU #0: pc=0x00000000c1844555 (halted) thread_id=7835
>   CPU #1: pc=0x00000000c1844555 (halted) thread_id=7836
>   CPU #2: pc=0x00000000c1844555 (halted) thread_id=7837
>   CPU #3: pc=0x000000007ffd17ca thread_id=7838
> 
> VCPU#3 is an AP (the last AP), I think. The instruction pointer seems to
> point into SMRAM again.
> 
> In the OVMF log, I see
> 
> Loading SMM driver at 0x0007FFC9000 EntryPoint=0x0007FFC9253
> PiSmmCpuDxeSmm.efi
> 
> So the offset into PiSmmCpuDxeSmm.efi is 0x87CA (or, relative to the
> entry point, 0x8577, 0x253 bytes less).
> 
> Running
> 
>   objdump -x -S Build/OvmfIa32/DEBUG_GCC48/IA32/PiSmmCpuDxeSmm.debug
> 
> first I see confirmation that
> 
>   start address 0x00000253
> 
> and then
> 
> 000087bd <CpuDeadLoop>:
> VOID
> EFIAPI
> CpuDeadLoop (
>   VOID
>   )
> {
>     87bd:       55                      push   %ebp
>     87be:       89 e5                   mov    %esp,%ebp
>     87c0:       83 ec 10                sub    $0x10,%esp
>   volatile UINTN  Index;
> 
>   for (Index = 0; Index == 0;);
>     87c3:       c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%ebp)
>     87ca:       8b 45 fc                mov    -0x4(%ebp),%eax  <-- HERE
>     87cd:       85 c0                   test   %eax,%eax
>     87cf:       74 f9                   je     87ca <CpuDeadLoop+0xd>
> }
>     87d1:       c9                      leave
>     87d2:       c3                      ret
> 
> This seems consistent with an assertion failure.
> 
> I searched UefiCpuPkg/PiSmmCpuDxeSmm/ for InitializeSpinLock(), and the
> SmiRendezvous() function [UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c] looks
> like a possible caller:
> 
>       //
>       // The BUSY lock is initialized to Released state. This needs to
>       // be done early enough to be ready for BSP's SmmStartupThisAp()
>       // call. E.g., with Relaxed AP flow, SmmStartupThisAp() may be
>       // called immediately after AP's present flag is detected.
>       //
>       InitializeSpinLock (mSmmMpSyncData->CpuData[CpuIndex].Busy);
> 
> Just a guess, of course.
> 
>> At same time, all my OS test is on real platform. I have not setup OVMF env 
>> to run an OS yet.
>> If you can share a step by step to me, that would be great.
> 
> (1) Grab a host computer with a CPU that supports VMX and EPT.
> 
> (2) Download and install Fedora 24 (for example):
> 
> https://getfedora.org/en/workstation/download/
> http://docs.fedoraproject.org/install-guide
> 
> (3) Install the "qemu-system-x86" package with DNF
> 
> dnf install qemu-system-x86
> 
> (4) clone edk2 with git
> 
> (5) embed OpenSSL optionally (for secure boot); see
> "CryptoPkg/Library/OpensslLib/Patch-HOWTO.txt"
> 
> (6) build OVMF:
> 
> source edksetup.sh
> make -C "$EDK_TOOLS_PATH"
> 
> # Ia32
> build \
>   -a IA32 \
>   -p OvmfPkg/OvmfPkgIa32.dsc \
>   -D SMM_REQUIRE -D SECURE_BOOT_ENABLE \
>   -t GCC5 -b DEBUG
> 
> # Ia32X64
> build \
>   -a IA32 -a X64 \
>   -p OvmfPkg/OvmfPkgIa32X64.dsc \
>   -D SMM_REQUIRE -D SECURE_BOOT_ENABLE \
>   -t GCC5 -b DEBUG
> 
> (7) Create disk images:
> 
> qemu-img create -f qcow2 -o compat=1.1 -o cluster_size=65536 \
>   -o preallocation=metadata -o lazy_refcounts=on disk-ia32.img 100G
> 
> qemu-img create -f qcow2 -o compat=1.1 -o cluster_size=65536 \
>   -o preallocation=metadata -o lazy_refcounts=on disk-ia32x64.img 100G
> 
> (8) For a 64-bit guest OS, you can again use the Fedora 24 Workstation
> that you downloaded already (the ISO image).
> 
> For 32-bit guest OS, this one used to work:
> 
> https://www.happyassassin.net/fedlet-a-fedora-remix-for-bay-trail-tablets/
> 
> minimally the 20141209 release. Hm... actually, I think the maintainer
> of that image has discontinued the downloadable files :(
> 
> So, I don't know what 32-bit UEFI OS to recommend for testing.
> 
> 32-bit Windows doesn't boot on OVMF (I looked into that earlier, several
> times, with some help from a Microsoft developer, but we couldn't solve
> it), so I can't recommend Windows as an alternative.
> 
> Perhaps you can use
> 
> https://linuxiumcomau.blogspot.com/2016/10/running-ubuntu-on-intel-bay-trail-and.html
> 
> as a 32-bit guest OS, I never tried.
> 
> (9) Anyway, once you have an installer ISO, set the "ISO" environment
> variable to the ISO image's full pathname, and then run QEMU like this:
> 
> # Settings for Ia32 only:
> 
> ISO=...
> DISK=.../disk-ia32.img
> FW=.../Build/OvmfIa32/DEBUG_GCC5/FV/OVMF_CODE.fd
> TEMPLATE=.../Build/OvmfIa32/DEBUG_GCC5/FV/OVMF_VARS.fd
> VARS=vars-32.fd
> QEMU_COMMAND="qemu-system-i386 -cpu coreduo,-nx"
> DEBUG=debug-32.log
> 
> # Settings for Ia32X64 only:
> 
> ISO=...
> DISK=.../disk-ia32x64.img
> FW=.../Build/Ovmf3264/DEBUG_GCC5/FV/OVMF_CODE.fd
> TEMPLATE=.../Build/Ovmf3264/DEBUG_GCC5/FV/OVMF_VARS.fd
> VARS=vars-3264.fd
> QEMU_COMMAND=qemu-system-x86_64
> DEBUG=debug-3264.log
> 
> # Common commands for both target arches:
> 
> # create variable store from varstore template
> # if the former doesn't exist yet
> if ! [ -e "$VARS" ]; then
>   cp -- "$TEMPLATE" "$VARS"
> fi
> 
> $QEMU_COMMAND \
>   -machine q35,smm=on,accel=kvm \
>   -m 4096 \
>   -smp sockets=1,cores=2,threads=2 \
>   -global driver=cfi.pflash01,property=secure,value=on \
>   -drive if=pflash,format=raw,unit=0,file=${FW},readonly=on \
>   -drive if=pflash,format=raw,unit=1,file=${VARS} \
>   \
>   -chardev file,id=debugfile,path=$DEBUG \
>   -device isa-debugcon,iobase=0x402,chardev=debugfile \
>   \
>   -chardev stdio,id=char0,signal=off,mux=on \
>   -mon chardev=char0,mode=readline,default \
>   -serial chardev:char0 \
>   \
>   -drive id=iso,if=none,format=raw,readonly,file=$ISO \
>   -drive id=disk,if=none,format=qcow2,file=$DISK \
>   \
>   -device virtio-scsi-pci,id=scsi0 \
>   -device scsi-cd,drive=iso,bus=scsi0.0,bootindex=2 \
>   -device scsi-hd,drive=disk,bus=scsi0.0,bootindex=1 \
>   \
>   -device VGA
> 
> This will capture the OVMF debug output in the $DEBUG file. Also, the
> terminal where you run the command can be switched between the guest's
> serial console and the QEMU monitor with [Ctrl-A C].
> 
> Thanks
> Laszlo
> 
>>
>> Thank you
>> Yao Jiewen
>>
>> From: edk2-devel [mailto:edk2-devel-boun...@lists.01.org] On Behalf Of 
>> Laszlo Ersek
>> Sent: Tuesday, November 8, 2016 9:22 AM
>> To: Yao, Jiewen <jiewen....@intel.com<mailto:jiewen....@intel.com>>
>> Cc: Tian, Feng <feng.t...@intel.com<mailto:feng.t...@intel.com>>; 
>> edk2-de...@ml01.01.org<mailto:edk2-de...@ml01.01.org>; Kinney, Michael D 
>> <michael.d.kin...@intel.com<mailto:michael.d.kin...@intel.com>>; Paolo 
>> Bonzini <pbonz...@redhat.com<mailto:pbonz...@redhat.com>>; Fan, Jeff 
>> <jeff....@intel.com<mailto:jeff....@intel.com>>; Zeng, Star 
>> <star.z...@intel.com<mailto:star.z...@intel.com>>
>> Subject: Re: [edk2] [PATCH V2 0/6] Enable SMM page level protection.
>>
>> On 11/04/16 10:30, Jiewen Yao wrote:
>>> ==== below is V2 description ====
>>> 1) PiSmmCpu: resolve OVMF multiple processors boot hang issue.
>>> 2) PiSmmCpu: Add debug info on StartupAp() fails.
>>> 3) PiSmmCpu: Add ASSERT for AllocatePages().
>>> 4) PiSmmCpu: Add protection detail in commit message.
>>> 5) UefiCpuPkg.dsc: Add page table footprint info in commit message.
>>>
>>> ==== below is V1 description ====
>>> This series patch enables SMM page level protection.
>>> Features are:
>>> 1) PiSmmCore reports SMM PE image code/data information
>>> in EdkiiPiSmmMemoryAttributeTable, if the SMM image is page aligned.
>>> 2) PiSmmCpu consumes EdkiiPiSmmMemoryAttributeTable
>>> and set XD for data page and RO for code page.
>>> 3) PiSmmCpu enables Static Paging for X64 according to
>>> PcdCpuSmmStaticPageTable. If it is true, 1G paging for above 4G
>>> is used as long as it is supported.
>>> 4) PiSmmCpu sets importance data structure to be read only,
>>> such as Gdt, Idt, SmmEntrypoint, and PageTable itself.
>>>
>>> tested platform:
>>> 1) Intel internal platform (X64).
>>> 2) EDKII Quark IA32
>>> 3) EDKII Vlv2  X64
>>> 4) EDKII OVMF IA32 and IA32X64. (with -smp 8)
>>>
>>> Cc: Jeff Fan 
>>> <jeff....@intel.com<mailto:jeff....@intel.com<mailto:jeff....@intel.com%3cmailto:jeff....@intel.com>>>
>>> Cc: Feng Tian 
>>> <feng.t...@intel.com<mailto:feng.t...@intel.com<mailto:feng.t...@intel.com%3cmailto:feng.t...@intel.com>>>
>>> Cc: Star Zeng 
>>> <star.z...@intel.com<mailto:star.z...@intel.com<mailto:star.z...@intel.com%3cmailto:star.z...@intel.com>>>
>>> Cc: Michael D Kinney 
>>> <michael.d.kin...@intel.com<mailto:michael.d.kin...@intel.com<mailto:michael.d.kin...@intel.com%3cmailto:michael.d.kin...@intel.com>>>
>>> Cc: Laszlo Ersek 
>>> <ler...@redhat.com<mailto:ler...@redhat.com<mailto:ler...@redhat.com%3cmailto:ler...@redhat.com>>>
>>> Contributed-under: TianoCore Contribution Agreement 1.0
>>> Signed-off-by: Jiewen Yao 
>>> <jiewen....@intel.com<mailto:jiewen....@intel.com<mailto:jiewen....@intel.com%3cmailto:jiewen....@intel.com>>>
>>
>> I have new test results. Let's start with the table again:
>>
>> Legend:
>>
>> - "untested" means the test was not executed because the same test
>>   failed or proved unreliable in a less demanding configuration already,
>>
>> - "n/a" means a setting or test case was impossible,
>>
>> - "fail" and "unreliable" (lower case) are outside the scope of this
>>   series; they either capture the pre-series status, or are expected
>>   even with the series applied due to the pre-series status,
>>
>> - "FAIL" and "UNRELIABLE" mean regressions caused (or exposed) by the
>>   series.
>>
>> In all cases, 36 bits were used as address width in the CPU HOB (--> up
>> to 64GB guest-phys address space).
>>
>>    series  OVMF                                                              
>> VCPU     boot       S3 resume
>>  # applied platform PcdCpuMaxLogicalProcessorNumber PcdCpuSmmStaticPageTable 
>> topology result     result
>> -- ------- -------- ------------------------------- ------------------------ 
>> -------- ------     ---------
>>  1 no      Ia32      64                             n/a                      
>> 1x2x2    pass       unreliable
>>  2 no      Ia32     255                             n/a                      
>> 52x2x2   pass       untested
>>  3 no      Ia32     255                             n/a                      
>> 53x2x2   unreliable untested
>>  4 no      Ia32X64   64                             n/a                      
>> 1x2x2    pass       unreliable
>>  5 no      Ia32X64  255                             n/a                      
>> 52x2x2   pass       untested
>>  6 no      Ia32X64  255                             n/a                      
>> 54x2x2   fail       n/a
>>  7 v2      Ia32      64                             FALSE                    
>> 1x2x2    pass       untested
>>  8 v2      Ia32      64                             TRUE                     
>> 1x2x2    FAIL       untested
>>  9 v2      Ia32     255                             FALSE                    
>> 52x2x2   pass       untested
>> 10 v2      Ia32     255                             FALSE                    
>> 53x2x2   untested   untested
>> 11 v2      Ia32     255                             TRUE                     
>> 52x2x2   untested   untested
>> 12 v2      Ia32     255                             TRUE                     
>> 53x2x2   untested   untested
>> 13 v2      Ia32X64   64                             FALSE                    
>> 1x2x2    pass       unreliable
>> 14 v2      Ia32X64   64                             TRUE                     
>> 1x2x2    pass       untested
>> 15 v2      Ia32X64  255                             FALSE                    
>> 52x2x2   pass       untested
>> 16 v2      Ia32X64  255                             FALSE                    
>> 54x2x2   untested   untested
>> 17 v2      Ia32X64  255                             TRUE                     
>> 52x2x2   FAIL       untested
>> 18 v2      Ia32X64  255                             TRUE                     
>> 54x2x2   untested   untested
>>
>> * Case 8: this test case failed with v2 as well, but this time with
>>   different symptoms:
>>
>>> FSOpen: Open '\EFI\fedora\grubia32.efi' Success
>>> InstallProtocolInterface: [EfiLoadedImageProtocol] 7E4037A8
>>> Loading driver at 0x0007DA7F000 EntryPoint=0x0007DA7F400
>>> InstallProtocolInterface: [EfiLoadedImageDevicePathProtocol] 7E403A90
>>> PixelBlueGreenRedReserved8BitPerColor
>>> ConvertPages: Incompatible memory types
>>> PixelBlueGreenRedReserved8BitPerColor
>>> ConvertPages: Incompatible memory types
>>> SmmInstallProtocolInterface: [EdkiiSmmExitBootServicesProtocol] 0
>>> MpInitExitBootServicesCallback() done!
>>> ASSERT MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(73): 
>>> SpinLock != ((void *) 0)
>>
>>   I didn't try to narrow this down.
>>
>> * Case 13 (the "unreliable S3 resume" case): Here the news are both bad
>>   and good. The good news is for Jiewen: this patch series does not
>>   cause the unreliability, it "only" amplifies it severely. The bad news
>>   is correspondingly for everyone else: S3 resume is actually unreliable
>>   even in case 4, that is, without this series applied, it's just the
>>   failure rate is much-much lower.
>>
>>   Namely, in my new testing, in case 13, S3 resume failed 8 times out of
>>   21 tries. (I stopped testing at the 8th failure.)
>>
>>   Whereas in case 4, S3 resume failed with *identical symptoms* (e.g.,
>>   exact same RIP=000000000009f0fd), 1 time out of 55 tries. (It was try
>>   #12 that failed; I continued testing and aborted the test after the
>>   55th try.)
>>
>>   So, while the series hugely amplifies the failure rate, the failure
>>   does exist without the series. Which is why I modified the case 4
>>   results in the table, and also lower-cased the word "unreliable" in
>>   case 13.
>>
>>   Below I will return to this problem separately; let's go over the rest
>>   of the table first.
>>
>> * Case 17: I guess this is not a real failure, I'm just including it for
>>   completeness, as PcdCpuSmmStaticPageTable==TRUE is known to present
>>   additional SMRAM demand (see the commit message on patch V2 4/6). This
>>   case fails with
>>
>>> SmmLockBox Command - 4
>>> SmmLockBoxSmmLib SetLockBoxAttributes - Enter
>>> SmmLockBoxSmmLib SetLockBoxAttributes - Exit (Success)
>>> SmmLockBox SmmLockBoxHandler Exit
>>> SmmLockBoxDxeLib SetLockBoxAttributes - Exit (Success)
>>> SmmInstallProtocolInterface: [EfiSmmReadyToLockProtocol] 0
>>> ASSERT UefiCpuPkg/PiSmmCpuDxeSmm/CpuS3.c(892): mGdtForAp != ((void *) 0)
>>
>>   which is an SMRAM allocation failure. If I lower the VCPU count to
>>   50x2x2, then the guest boots fine.
>>
>> ----*----
>>
>> Before I get to the S3 resume problem (which, again, reproduces without
>> this series, although much less frequently), I'd like to comment on the
>> removal of the ASSERT(), from v1 to v2, in the FlushTlbForAll()
>> function, on the return value of SmmBlockingStartupThisAp(). This change
>> allows v2 to proceed past that point; however, I'm seeing a whole lot of
>>
>>> !mSmmMpSyncData->CpuData[1].Present
>>> !mSmmMpSyncData->CpuData[2].Present
>>> !mSmmMpSyncData->CpuData[3].Present
>>> ...
>>
>> messages in the OVMF boot log, interspersed with
>>
>>> ConvertPageEntryAttribute 0x7F92B067->0x7F92B065
>>
>> style messages. (That is, one error message for each AP, per
>> ConvertPageEntryAttribute() message.)
>>
>> Is this okay / intentional? The number of these messages can go up to
>> several thousands and that sort of drowns out everything else in the
>> log.
>>
>> It's also not easy to mask the message, because it's logged on the
>> DEBUG_ERROR level.
>>
>> ----*----
>>
>> * Okay, so the S3 problem. Last time I suspected that the failure point
>>   (RIP=9f0fd) was in the Linux guest's S3 wakeup vector (which starts at
>>   9A1D0, according to the OVMF log). In order to test this idea, I
>>   exercised this series with S3 against a Windows 8.1 guest (--> case 13
>>   again). The failure reproduced on the second S3 resume, with identical
>>   RIP, despite the Windows wakeup vector being located elsewhere (at
>>   0x1000).
>>
>>   Quoting the OVMF log leading up to the resume:
>>
>>> Call AsmDisablePaging64() to return to S3 Resume in PEI Phase
>>> Install PPI: [PeiPostScriptTablePpi]
>>> Install PPI: [EfiEndOfPeiSignalPpi]
>>> Notify: PPI Guid: [EfiEndOfPeiSignalPpi], Peim notify entry point: 857895
>>> PeiMpInitLib: CpuMpEndOfPeiCallback () invoked
>>> Transfer to 16bit OS waking vector - 1000
>>
>>   QEMU log (same as before):
>>
>>> KVM internal error. Suberror: 1
>>> KVM internal error. Suberror: 1
>>> emulation failure
>>> emulation failure
>>> RAX=0000000000000002 RBX=0000000000000000 RCX=000000007ffdb168 
>>> RDX=000000007fedd058
>>> RSI=0000000000000004 RDI=000000007fedd040 RBP=0000000000000000 
>>> RSP=000000007e1a7000
>>> R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 
>>> R11=0000000000000000
>>> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 
>>> R15=0000000000000000
>>> RIP=000000000009f0fd RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
>>> ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>>> CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
>>> SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>>> DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>>> FS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>>> GS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>>> LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
>>> TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
>>> GDT=     000000007f294000 00000047
>>> IDT=     000000007f294048 00000fff
>>> CR0=e0000011 CR2=0000000000000000 CR3=000000007ff7f000 CR4=00000220
>>> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 
>>> DR3=0000000000000000
>>> DR6=00000000ffff0ff0 DR7=0000000000000400
>>> EFER=0000000000000500
>>> Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? 
>>> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? 
>>> ?? ??
>>> RAX=0000000000000001 RBX=0000000000000000 RCX=000000007ffdb168 
>>> RDX=000000007fedd070
>>> RSI=0000000000000004 RDI=000000007fedd058 RBP=0000000000000000 
>>> RSP=000000007e19f000
>>> R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 
>>> R11=0000000000000000
>>> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 
>>> R15=0000000000000000
>>> RIP=000000000009f0fd RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
>>> ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>>> CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
>>> SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>>> DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>>> FS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>>> GS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>>> LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
>>> TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
>>> GDT=     000000007f294000 00000047
>>> IDT=     000000007f294048 00000fff
>>> CR0=e0000011 CR2=0000000000000000 CR3=000000007ff7f000 CR4=00000220
>>> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 
>>> DR3=0000000000000000
>>> DR6=00000000ffff0ff0 DR7=0000000000000400
>>> EFER=0000000000000500
>>> Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? 
>>> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? 
>>> ?? ??
>>
>>   So, we can exclude the suspicion that the problem is guest OS
>>   dependent.
>>
>> * Then I looked for the base address of the page containing the
>>   RIP=9f0fd address, in earlier parts of the OVMF log, on the hunch that
>>   some firmware component might have allocated that area actually. Here
>>   we go:
>>
>>> Loading PEIM at 0x000008552C0 EntryPoint=0x000008554E0 CpuMpPei.efi
>>> AP Loop Mode is 1
>>> WakeupBufferStart = 9F000, WakeupBufferSize = 1000
>>
>>   That is, the failure hits (when it hits -- not always) in the area
>>   where the CpuMpPei driver *borrows* memory for the startup vector of
>>   the APs, for the purposes of the MP service PPI. ("Wakeup" is an
>>   overloaded word here; the "wakeup buffer" has nothing to do with S3
>>   resume, it just serves for booting the APs temporarily in PEI, for
>>   implementing the MP service PPI.)
>>
>>   When exiting the PEI phase (on the S3 resume path), CpuMpPei restores
>>   the original contents of this area. This occurs just before
>>   transfering control to the guest OS wakeup vector: see the
>>   "EfiEndOfPeiSignalPpi" and "CpuMpEndOfPeiCallback" strings just above,
>>   quoted from the OVMF log.
>>
>>   I documented (parts of) this logic in OVMF commit
>>
>>     https://github.com/tianocore/edk2/commit/e3e3090a959a0
>>
>>   (see the code comments as well).
>>
>> * At that time, I thought to have identified a memory management bug in
>>   CpuMpPei; see the following discussion and bug report for details:
>>
>>     https://www.mail-archive.com/edk2-devel@lists.01.org/msg13892.html
>>     https://bugzilla.tianocore.org/show_bug.cgi?id=67
>>
>>   However, with the extraction / introduction of MpInitLib, this issue
>>   has been fixed: GetWakeupBuffer() now calls
>>   CheckOverlapWithAllocatedBuffer(), so that "memory management bug" is
>>   no longer; we shouldn't be looking there for the root cause.
>>
>> * Either way, I don't understand why anything would want to execute code
>>   in the one page that happens to host the MP services PPI startup
>>   buffer for APs during PEI.
>>
>>   Not understanding the "why", I looked at the "what", and resorted to
>>   tracing KVM. Because the problem readily reproduces with this series
>>   applied (case 13), it wasn't hard to start the tracing while the guest
>>   was suspended, and capture just the actions that led from the
>>   KVM-level wakeup to the failure.
>>
>>   The QEMU state dumps are visible above in the email. I've also
>>   uploaded the compressed OVMF log and the textual KVM trace here:
>>
>>     
>> http://people.redhat.com/lersek/s3-crash-8d1dfed7-ca92-4e25-8d2b-b1c9ac2a53db/
>>
>>   I sincerely hope that Paolo will have a field day with the KVM trace
>>   :) I managed to identify the following curiosities (remember this is
>>   all on the S3 resume path):
>>
>>   * First, the VCPUs (there are four of them) enter and leave SMM in a
>>     really funky pattern:
>>
>>       vcpu#0  vcpu#1  vcpu#2  vcpu#3
>>       ------  ------  ------  ------
>>               enter
>>                |
>>               leave
>>
>>                       enter
>>                         |
>>                       leave
>>
>>                               enter
>>                                 |
>>                               leave
>>
>>       enter
>>         |
>>       leave
>>
>>               enter           enter
>>        enter    |     enter     |
>>          |      |       |       |
>>        leave    |       |       |
>>                 |       |       |
>>        enter    |       |       |
>>          |      |       |       |
>>        leave  leave   leave   leave
>>
>>     That is, first we have each VCPU enter and leave SMM in complete
>>     isolation (1, 2, 3, 0). Then VCPUs 1 and 3 enter SMM together, soon
>>     followed by VPCUS 0 and 2, also together. VCPU#0 drops out of SMM
>>     temporarily (it comes back in later), while the other three remain
>>     in SMM. Finally all four of them leave SMM together.
>>
>>     After which the problem occurs.
>>
>>   * Second, the instruction that causes things to blow up is <0f aa>,
>>     i.e., RSM. I have absolutely no clue why RSM is executed:
>>
>>     (a) in the area that used to host the AP startup routine for the MP
>>     services PPI -- note that we also have "Transfer to 16bit OS waking
>>     vector" in the log, so CpuMpEndOfPeiCallback() restores the borrowed
>>     area completeley! --,
>>
>>     (b) and why *after* all four VCPUs have just left SMM, together.
>>
>>   * The RSM instruction is handled successfully elsewhere, for example
>>     when all four VCPUs leave SMM, at the bottom of the diagram above:
>>
>>> CPU-24447 [002] 39841.982810: kvm_emulate_insn:     0:7ffbf179: 0f aa
>>> CPU-24446 [000] 39841.982810: kvm_emulate_insn:     0:7ffbd179: 0f aa
>>> CPU-24445 [005] 39841.982810: kvm_emulate_insn:     0:7ffbb179: 0f aa
>>> CPU-24444 [006] 39841.982811: kvm_emulate_insn:     0:7ffb9179: 0f aa
>>
>>   * The guest-phys address 7ff7f000 that we see just before the error:
>>
>>> CPU-24447 [002] 39841.982825: kvm_page_fault:       address 7ff7f000 
>>> error_code 83
>>> CPU-24446 [000] 39841.982825: kvm_page_fault:       address 7ff7f000 
>>> error_code 83
>>> CPU-24447 [002] 39841.982826: kvm_emulate_insn:     0:9f0fd: 0f aa
>>> CPU-24444 [006] 39841.982827: kvm_exit:             reason 
>>> EXTERNAL_INTERRUPT rip 0xffffffff813a954f info 0 800000fc
>>> CPU-24447 [002] 39841.982827: kvm_emulate_insn:     0:9f0fd: 0f aa FAIL
>>> CPU-24447 [002] 39841.982827: kvm_userspace_exit:   reason 
>>> KVM_EXIT_INTERNAL_ERROR (17)
>>
>>     can be found higher up in the trace; namely, it is written to CR3
>>     several times. It's the root of the page tables.
>>
>>   * The 7F80_1000..7FFF_FFFF guest-phys addresses are all in SMRAM.
>>
>> * I also tried the "info tlb" monitor command, via "virsh
>>   qemu-monitor-command --hmp", while the guest was auto-paused after the
>>   crash.
>>
>>   I cannot provide results: QEMU appeared to return a message that would
>>   be longer than 16MB after encoding by libvirt, and libvirt rejected
>>   that ("Unable to encode message payload", see VIR_NET_MESSAGE_MAX).
>>
>>   Anyway, the KVM trace, and the QEMU register dump, look consistent
>>   with what Paolo said about "Code=?? ?? ??...":
>>
>>     The question marks usually mean that the page tables do not map a
>>     page at that address.
>>
>>   CR3=000000007ff7f000 points into SMRAM, but we are outside of SMM
>>   (SMM=0). We can't translate *any* guest-virtual address, as we can't
>>   even begin walking the page tables.
>>
>> Thanks
>> Laszlo
>> _______________________________________________
>> edk2-devel mailing list
>> edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org%3cmailto:edk2-devel@lists.01.org>>
>> https://lists.01.org/mailman/listinfo/edk2-devel
>> _______________________________________________
>> edk2-devel mailing list
>> edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>
>> https://lists.01.org/mailman/listinfo/edk2-devel
>>
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel

Re: [edk2] [PATCH V2 0/6] Enable SMM page level protection.

Reply via email to