On 10/11/2016 11:41, Yao, Jiewen wrote: > I also did not quite understand below log. > > * CPU #0: pc=0x00000000c1844555 (halted) thread_id=7835 > CPU #1: pc=0x00000000c1844555 (halted) thread_id=7836 > CPU #2: pc=0x00000000c1844555 (halted) thread_id=7837 > CPU #3: pc=0x000000007ffd17ca thread_id=7838 > > As I recall, writing to B2 only cause BSP get SMI on OVMF. AP does not enter > SMM mode.
It's not BSP that enters SMM, it's the currently executing processor. So this means that CPU #3 has written to B2. Thanks, Paolo > So why #3 can enter SMM mode? Is that expected behavior? Or unexpected > behavior? > If this is expected, how this happened? Does OS send > SendSmiIpiAllExcludingSelf, after ExitBootServices()? > > I will see if I can finish QEMU/KVM installation tomorrow. > > If you have some idea on why and how #3 enter SMM, please let us know. > > > Thank you > Yao Jiewen > > > From: Laszlo Ersek [mailto:ler...@redhat.com] > Sent: Thursday, November 10, 2016 4:46 AM > To: Yao, Jiewen <jiewen....@intel.com> > Cc: Tian, Feng <feng.t...@intel.com>; edk2-de...@ml01.01.org; Kinney, Michael > D <michael.d.kin...@intel.com>; Paolo Bonzini <pbonz...@redhat.com>; Fan, > Jeff <jeff....@intel.com>; Zeng, Star <star.z...@intel.com> > Subject: Re: [edk2] [PATCH V2 0/6] Enable SMM page level protection. > > On 11/09/16 07:25, Yao, Jiewen wrote: >> Hi Laszlo >> I will fix DEBUG message issue in V3 patch. >> >> Below is rest issues: >> >> >> l Case 13: S3 fails randomly. >> A good news: I worked with Jeff Fan to root-cause the S3 resume issue. Here >> is detail. >> >> >> 1) We believe the dead CPU is AP. Not BSP. >> The reason is that: >> >> 1.1) The BSP already transfer control to OS waking vector. The GDT/IDT/CR3 >> should be set by OS. >> >> 1.2) The current dead CPU still has GDT/IDT point to a BIOS reserved >> memory. The CS/DS/SS is typical BIOS X64 mode setting. >> >> 1.3) The current dead CPU still has CR3 in SMM. (Which is obvious wrong) >> >> >> 2) Based upon the 1), we reviewed S3 resume AP flow. >> Current BSP will wake up AP in SMRAM, for security consideration. At that >> time, we are using SMM mode CR3. It is OK for BSP because BSP is NOT in SMM >> mode yet. Even after SMM rebase, we can still use it because SMRR is not set >> in first SMM rebase. >> Current BSP just uses its own context to initialize AP. So that AP takes BSP >> CR3, which is SMM CR3, unfortunately. >> After BSP initialized APs, the AP is put to HALT-LOOP in X64 mode. It is the >> last straw, because X64 mode halt still need paging. >> >> >> 3) The error happen, once the AP receives an interrupt (for whatever >> reason), AP starts executing code. However, that that time the AP might not >> be in SMM mode. It means SMM CR3 is not available. And then we see this. >> >> >> 4) I guess we did not see the error, or this is RANDOM issue, because >> it depends on if AP receives an interrupt before BSP send INIT-SIPI-SIPI. >> >> >> 5) The fix, I think, should be below: >> We should always put AP to protected mode, so that no paging is needed. >> We should put AP in above 1M reserved memory, instead of <1M memory, because >> <1M memory is restored. >> >> >> Would you please file a bugzillar? I think we need assign CPU owner to fix >> that critical issue. >> >> There is no need to do more investigation. Thanks for your great help on >> that. :) > > Thank you for your help! > > I filed <https://bugzilla.tianocore.org/show_bug.cgi?id=216>. The title is > > BSP exits SMM and closes SMRAM on the S3 resume path before > meeting with AP(s) > > I hope the title is mostly right. I didn't add any other details (I > haven't gone through the thread in detail yet, and without that I can't > even write up a semi-reasonable report myself). Instead, I referenced > this message of yours in the report, and I also linked Paolo's analysis > from elsewhere in the thread. I hope this will do for the report. > > (Also, thank you Paolo, from the amazing analysis -- I haven't digested > it yet, but I can already tell it's amazing! :)) > >> l Case 17 - I do not think it is a real issue, because SMM is out of >> resource. >> >> >> l Case 8 - that is a very weird issue. I talk with Jeff again. I do not >> have a clear clue yet. >>> ASSERT MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(73): >>> SpinLock != ((void *) 0) >> Here is code. We do not know why there is some code need InitializeSpinLock >> after ExitBootServices. >> SPIN_LOCK * >> EFIAPI >> InitializeSpinLock ( >> OUT SPIN_LOCK *SpinLock >> ) >> { >> ASSERT (SpinLock != NULL); >> >> _ReadWriteBarrier(); >> *SpinLock = SPIN_LOCK_RELEASED; >> _ReadWriteBarrier(); >> >> return SpinLock; >> } >> >> If you can have a quick check on below, that would be great. >> >> 1) Which processor triggers this ASSERT? BSP or AP. >> >> 2) Which module triggers this ASSERT? Which module contains current RIP >> value? > > First, one additional piece of info I have learned is that the issue > does not always present itself. Sometimes the boot just works fine, > other times the assert fires. > > Using the QEMU monitor, I managed to get the following information with > the "info cpus" command: > > * CPU #0: pc=0x00000000c1844555 (halted) thread_id=7835 > CPU #1: pc=0x00000000c1844555 (halted) thread_id=7836 > CPU #2: pc=0x00000000c1844555 (halted) thread_id=7837 > CPU #3: pc=0x000000007ffd17ca thread_id=7838 > > VCPU#3 is an AP (the last AP), I think. The instruction pointer seems to > point into SMRAM again. > > In the OVMF log, I see > > Loading SMM driver at 0x0007FFC9000 EntryPoint=0x0007FFC9253 > PiSmmCpuDxeSmm.efi > > So the offset into PiSmmCpuDxeSmm.efi is 0x87CA (or, relative to the > entry point, 0x8577, 0x253 bytes less). > > Running > > objdump -x -S Build/OvmfIa32/DEBUG_GCC48/IA32/PiSmmCpuDxeSmm.debug > > first I see confirmation that > > start address 0x00000253 > > and then > > 000087bd <CpuDeadLoop>: > VOID > EFIAPI > CpuDeadLoop ( > VOID > ) > { > 87bd: 55 push %ebp > 87be: 89 e5 mov %esp,%ebp > 87c0: 83 ec 10 sub $0x10,%esp > volatile UINTN Index; > > for (Index = 0; Index == 0;); > 87c3: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp) > 87ca: 8b 45 fc mov -0x4(%ebp),%eax <-- HERE > 87cd: 85 c0 test %eax,%eax > 87cf: 74 f9 je 87ca <CpuDeadLoop+0xd> > } > 87d1: c9 leave > 87d2: c3 ret > > This seems consistent with an assertion failure. > > I searched UefiCpuPkg/PiSmmCpuDxeSmm/ for InitializeSpinLock(), and the > SmiRendezvous() function [UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c] looks > like a possible caller: > > // > // The BUSY lock is initialized to Released state. This needs to > // be done early enough to be ready for BSP's SmmStartupThisAp() > // call. E.g., with Relaxed AP flow, SmmStartupThisAp() may be > // called immediately after AP's present flag is detected. > // > InitializeSpinLock (mSmmMpSyncData->CpuData[CpuIndex].Busy); > > Just a guess, of course. > >> At same time, all my OS test is on real platform. I have not setup OVMF env >> to run an OS yet. >> If you can share a step by step to me, that would be great. > > (1) Grab a host computer with a CPU that supports VMX and EPT. > > (2) Download and install Fedora 24 (for example): > > https://getfedora.org/en/workstation/download/ > http://docs.fedoraproject.org/install-guide > > (3) Install the "qemu-system-x86" package with DNF > > dnf install qemu-system-x86 > > (4) clone edk2 with git > > (5) embed OpenSSL optionally (for secure boot); see > "CryptoPkg/Library/OpensslLib/Patch-HOWTO.txt" > > (6) build OVMF: > > source edksetup.sh > make -C "$EDK_TOOLS_PATH" > > # Ia32 > build \ > -a IA32 \ > -p OvmfPkg/OvmfPkgIa32.dsc \ > -D SMM_REQUIRE -D SECURE_BOOT_ENABLE \ > -t GCC5 -b DEBUG > > # Ia32X64 > build \ > -a IA32 -a X64 \ > -p OvmfPkg/OvmfPkgIa32X64.dsc \ > -D SMM_REQUIRE -D SECURE_BOOT_ENABLE \ > -t GCC5 -b DEBUG > > (7) Create disk images: > > qemu-img create -f qcow2 -o compat=1.1 -o cluster_size=65536 \ > -o preallocation=metadata -o lazy_refcounts=on disk-ia32.img 100G > > qemu-img create -f qcow2 -o compat=1.1 -o cluster_size=65536 \ > -o preallocation=metadata -o lazy_refcounts=on disk-ia32x64.img 100G > > (8) For a 64-bit guest OS, you can again use the Fedora 24 Workstation > that you downloaded already (the ISO image). > > For 32-bit guest OS, this one used to work: > > https://www.happyassassin.net/fedlet-a-fedora-remix-for-bay-trail-tablets/ > > minimally the 20141209 release. Hm... actually, I think the maintainer > of that image has discontinued the downloadable files :( > > So, I don't know what 32-bit UEFI OS to recommend for testing. > > 32-bit Windows doesn't boot on OVMF (I looked into that earlier, several > times, with some help from a Microsoft developer, but we couldn't solve > it), so I can't recommend Windows as an alternative. > > Perhaps you can use > > https://linuxiumcomau.blogspot.com/2016/10/running-ubuntu-on-intel-bay-trail-and.html > > as a 32-bit guest OS, I never tried. > > (9) Anyway, once you have an installer ISO, set the "ISO" environment > variable to the ISO image's full pathname, and then run QEMU like this: > > # Settings for Ia32 only: > > ISO=... > DISK=.../disk-ia32.img > FW=.../Build/OvmfIa32/DEBUG_GCC5/FV/OVMF_CODE.fd > TEMPLATE=.../Build/OvmfIa32/DEBUG_GCC5/FV/OVMF_VARS.fd > VARS=vars-32.fd > QEMU_COMMAND="qemu-system-i386 -cpu coreduo,-nx" > DEBUG=debug-32.log > > # Settings for Ia32X64 only: > > ISO=... > DISK=.../disk-ia32x64.img > FW=.../Build/Ovmf3264/DEBUG_GCC5/FV/OVMF_CODE.fd > TEMPLATE=.../Build/Ovmf3264/DEBUG_GCC5/FV/OVMF_VARS.fd > VARS=vars-3264.fd > QEMU_COMMAND=qemu-system-x86_64 > DEBUG=debug-3264.log > > # Common commands for both target arches: > > # create variable store from varstore template > # if the former doesn't exist yet > if ! [ -e "$VARS" ]; then > cp -- "$TEMPLATE" "$VARS" > fi > > $QEMU_COMMAND \ > -machine q35,smm=on,accel=kvm \ > -m 4096 \ > -smp sockets=1,cores=2,threads=2 \ > -global driver=cfi.pflash01,property=secure,value=on \ > -drive if=pflash,format=raw,unit=0,file=${FW},readonly=on \ > -drive if=pflash,format=raw,unit=1,file=${VARS} \ > \ > -chardev file,id=debugfile,path=$DEBUG \ > -device isa-debugcon,iobase=0x402,chardev=debugfile \ > \ > -chardev stdio,id=char0,signal=off,mux=on \ > -mon chardev=char0,mode=readline,default \ > -serial chardev:char0 \ > \ > -drive id=iso,if=none,format=raw,readonly,file=$ISO \ > -drive id=disk,if=none,format=qcow2,file=$DISK \ > \ > -device virtio-scsi-pci,id=scsi0 \ > -device scsi-cd,drive=iso,bus=scsi0.0,bootindex=2 \ > -device scsi-hd,drive=disk,bus=scsi0.0,bootindex=1 \ > \ > -device VGA > > This will capture the OVMF debug output in the $DEBUG file. Also, the > terminal where you run the command can be switched between the guest's > serial console and the QEMU monitor with [Ctrl-A C]. > > Thanks > Laszlo > >> >> Thank you >> Yao Jiewen >> >> From: edk2-devel [mailto:edk2-devel-boun...@lists.01.org] On Behalf Of >> Laszlo Ersek >> Sent: Tuesday, November 8, 2016 9:22 AM >> To: Yao, Jiewen <jiewen....@intel.com<mailto:jiewen....@intel.com>> >> Cc: Tian, Feng <feng.t...@intel.com<mailto:feng.t...@intel.com>>; >> edk2-de...@ml01.01.org<mailto:edk2-de...@ml01.01.org>; Kinney, Michael D >> <michael.d.kin...@intel.com<mailto:michael.d.kin...@intel.com>>; Paolo >> Bonzini <pbonz...@redhat.com<mailto:pbonz...@redhat.com>>; Fan, Jeff >> <jeff....@intel.com<mailto:jeff....@intel.com>>; Zeng, Star >> <star.z...@intel.com<mailto:star.z...@intel.com>> >> Subject: Re: [edk2] [PATCH V2 0/6] Enable SMM page level protection. >> >> On 11/04/16 10:30, Jiewen Yao wrote: >>> ==== below is V2 description ==== >>> 1) PiSmmCpu: resolve OVMF multiple processors boot hang issue. >>> 2) PiSmmCpu: Add debug info on StartupAp() fails. >>> 3) PiSmmCpu: Add ASSERT for AllocatePages(). >>> 4) PiSmmCpu: Add protection detail in commit message. >>> 5) UefiCpuPkg.dsc: Add page table footprint info in commit message. >>> >>> ==== below is V1 description ==== >>> This series patch enables SMM page level protection. >>> Features are: >>> 1) PiSmmCore reports SMM PE image code/data information >>> in EdkiiPiSmmMemoryAttributeTable, if the SMM image is page aligned. >>> 2) PiSmmCpu consumes EdkiiPiSmmMemoryAttributeTable >>> and set XD for data page and RO for code page. >>> 3) PiSmmCpu enables Static Paging for X64 according to >>> PcdCpuSmmStaticPageTable. If it is true, 1G paging for above 4G >>> is used as long as it is supported. >>> 4) PiSmmCpu sets importance data structure to be read only, >>> such as Gdt, Idt, SmmEntrypoint, and PageTable itself. >>> >>> tested platform: >>> 1) Intel internal platform (X64). >>> 2) EDKII Quark IA32 >>> 3) EDKII Vlv2 X64 >>> 4) EDKII OVMF IA32 and IA32X64. (with -smp 8) >>> >>> Cc: Jeff Fan >>> <jeff....@intel.com<mailto:jeff....@intel.com<mailto:jeff....@intel.com%3cmailto:jeff....@intel.com>>> >>> Cc: Feng Tian >>> <feng.t...@intel.com<mailto:feng.t...@intel.com<mailto:feng.t...@intel.com%3cmailto:feng.t...@intel.com>>> >>> Cc: Star Zeng >>> <star.z...@intel.com<mailto:star.z...@intel.com<mailto:star.z...@intel.com%3cmailto:star.z...@intel.com>>> >>> Cc: Michael D Kinney >>> <michael.d.kin...@intel.com<mailto:michael.d.kin...@intel.com<mailto:michael.d.kin...@intel.com%3cmailto:michael.d.kin...@intel.com>>> >>> Cc: Laszlo Ersek >>> <ler...@redhat.com<mailto:ler...@redhat.com<mailto:ler...@redhat.com%3cmailto:ler...@redhat.com>>> >>> Contributed-under: TianoCore Contribution Agreement 1.0 >>> Signed-off-by: Jiewen Yao >>> <jiewen....@intel.com<mailto:jiewen....@intel.com<mailto:jiewen....@intel.com%3cmailto:jiewen....@intel.com>>> >> >> I have new test results. Let's start with the table again: >> >> Legend: >> >> - "untested" means the test was not executed because the same test >> failed or proved unreliable in a less demanding configuration already, >> >> - "n/a" means a setting or test case was impossible, >> >> - "fail" and "unreliable" (lower case) are outside the scope of this >> series; they either capture the pre-series status, or are expected >> even with the series applied due to the pre-series status, >> >> - "FAIL" and "UNRELIABLE" mean regressions caused (or exposed) by the >> series. >> >> In all cases, 36 bits were used as address width in the CPU HOB (--> up >> to 64GB guest-phys address space). >> >> series OVMF >> VCPU boot S3 resume >> # applied platform PcdCpuMaxLogicalProcessorNumber PcdCpuSmmStaticPageTable >> topology result result >> -- ------- -------- ------------------------------- ------------------------ >> -------- ------ --------- >> 1 no Ia32 64 n/a >> 1x2x2 pass unreliable >> 2 no Ia32 255 n/a >> 52x2x2 pass untested >> 3 no Ia32 255 n/a >> 53x2x2 unreliable untested >> 4 no Ia32X64 64 n/a >> 1x2x2 pass unreliable >> 5 no Ia32X64 255 n/a >> 52x2x2 pass untested >> 6 no Ia32X64 255 n/a >> 54x2x2 fail n/a >> 7 v2 Ia32 64 FALSE >> 1x2x2 pass untested >> 8 v2 Ia32 64 TRUE >> 1x2x2 FAIL untested >> 9 v2 Ia32 255 FALSE >> 52x2x2 pass untested >> 10 v2 Ia32 255 FALSE >> 53x2x2 untested untested >> 11 v2 Ia32 255 TRUE >> 52x2x2 untested untested >> 12 v2 Ia32 255 TRUE >> 53x2x2 untested untested >> 13 v2 Ia32X64 64 FALSE >> 1x2x2 pass unreliable >> 14 v2 Ia32X64 64 TRUE >> 1x2x2 pass untested >> 15 v2 Ia32X64 255 FALSE >> 52x2x2 pass untested >> 16 v2 Ia32X64 255 FALSE >> 54x2x2 untested untested >> 17 v2 Ia32X64 255 TRUE >> 52x2x2 FAIL untested >> 18 v2 Ia32X64 255 TRUE >> 54x2x2 untested untested >> >> * Case 8: this test case failed with v2 as well, but this time with >> different symptoms: >> >>> FSOpen: Open '\EFI\fedora\grubia32.efi' Success >>> InstallProtocolInterface: [EfiLoadedImageProtocol] 7E4037A8 >>> Loading driver at 0x0007DA7F000 EntryPoint=0x0007DA7F400 >>> InstallProtocolInterface: [EfiLoadedImageDevicePathProtocol] 7E403A90 >>> PixelBlueGreenRedReserved8BitPerColor >>> ConvertPages: Incompatible memory types >>> PixelBlueGreenRedReserved8BitPerColor >>> ConvertPages: Incompatible memory types >>> SmmInstallProtocolInterface: [EdkiiSmmExitBootServicesProtocol] 0 >>> MpInitExitBootServicesCallback() done! >>> ASSERT MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(73): >>> SpinLock != ((void *) 0) >> >> I didn't try to narrow this down. >> >> * Case 13 (the "unreliable S3 resume" case): Here the news are both bad >> and good. The good news is for Jiewen: this patch series does not >> cause the unreliability, it "only" amplifies it severely. The bad news >> is correspondingly for everyone else: S3 resume is actually unreliable >> even in case 4, that is, without this series applied, it's just the >> failure rate is much-much lower. >> >> Namely, in my new testing, in case 13, S3 resume failed 8 times out of >> 21 tries. (I stopped testing at the 8th failure.) >> >> Whereas in case 4, S3 resume failed with *identical symptoms* (e.g., >> exact same RIP=000000000009f0fd), 1 time out of 55 tries. (It was try >> #12 that failed; I continued testing and aborted the test after the >> 55th try.) >> >> So, while the series hugely amplifies the failure rate, the failure >> does exist without the series. Which is why I modified the case 4 >> results in the table, and also lower-cased the word "unreliable" in >> case 13. >> >> Below I will return to this problem separately; let's go over the rest >> of the table first. >> >> * Case 17: I guess this is not a real failure, I'm just including it for >> completeness, as PcdCpuSmmStaticPageTable==TRUE is known to present >> additional SMRAM demand (see the commit message on patch V2 4/6). This >> case fails with >> >>> SmmLockBox Command - 4 >>> SmmLockBoxSmmLib SetLockBoxAttributes - Enter >>> SmmLockBoxSmmLib SetLockBoxAttributes - Exit (Success) >>> SmmLockBox SmmLockBoxHandler Exit >>> SmmLockBoxDxeLib SetLockBoxAttributes - Exit (Success) >>> SmmInstallProtocolInterface: [EfiSmmReadyToLockProtocol] 0 >>> ASSERT UefiCpuPkg/PiSmmCpuDxeSmm/CpuS3.c(892): mGdtForAp != ((void *) 0) >> >> which is an SMRAM allocation failure. If I lower the VCPU count to >> 50x2x2, then the guest boots fine. >> >> ----*---- >> >> Before I get to the S3 resume problem (which, again, reproduces without >> this series, although much less frequently), I'd like to comment on the >> removal of the ASSERT(), from v1 to v2, in the FlushTlbForAll() >> function, on the return value of SmmBlockingStartupThisAp(). This change >> allows v2 to proceed past that point; however, I'm seeing a whole lot of >> >>> !mSmmMpSyncData->CpuData[1].Present >>> !mSmmMpSyncData->CpuData[2].Present >>> !mSmmMpSyncData->CpuData[3].Present >>> ... >> >> messages in the OVMF boot log, interspersed with >> >>> ConvertPageEntryAttribute 0x7F92B067->0x7F92B065 >> >> style messages. (That is, one error message for each AP, per >> ConvertPageEntryAttribute() message.) >> >> Is this okay / intentional? The number of these messages can go up to >> several thousands and that sort of drowns out everything else in the >> log. >> >> It's also not easy to mask the message, because it's logged on the >> DEBUG_ERROR level. >> >> ----*---- >> >> * Okay, so the S3 problem. Last time I suspected that the failure point >> (RIP=9f0fd) was in the Linux guest's S3 wakeup vector (which starts at >> 9A1D0, according to the OVMF log). In order to test this idea, I >> exercised this series with S3 against a Windows 8.1 guest (--> case 13 >> again). The failure reproduced on the second S3 resume, with identical >> RIP, despite the Windows wakeup vector being located elsewhere (at >> 0x1000). >> >> Quoting the OVMF log leading up to the resume: >> >>> Call AsmDisablePaging64() to return to S3 Resume in PEI Phase >>> Install PPI: [PeiPostScriptTablePpi] >>> Install PPI: [EfiEndOfPeiSignalPpi] >>> Notify: PPI Guid: [EfiEndOfPeiSignalPpi], Peim notify entry point: 857895 >>> PeiMpInitLib: CpuMpEndOfPeiCallback () invoked >>> Transfer to 16bit OS waking vector - 1000 >> >> QEMU log (same as before): >> >>> KVM internal error. Suberror: 1 >>> KVM internal error. Suberror: 1 >>> emulation failure >>> emulation failure >>> RAX=0000000000000002 RBX=0000000000000000 RCX=000000007ffdb168 >>> RDX=000000007fedd058 >>> RSI=0000000000000004 RDI=000000007fedd040 RBP=0000000000000000 >>> RSP=000000007e1a7000 >>> R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 >>> R11=0000000000000000 >>> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 >>> R15=0000000000000000 >>> RIP=000000000009f0fd RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0 >>> ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA] >>> CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA] >>> SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA] >>> DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA] >>> FS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA] >>> GS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA] >>> LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT >>> TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy >>> GDT= 000000007f294000 00000047 >>> IDT= 000000007f294048 00000fff >>> CR0=e0000011 CR2=0000000000000000 CR3=000000007ff7f000 CR4=00000220 >>> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 >>> DR3=0000000000000000 >>> DR6=00000000ffff0ff0 DR7=0000000000000400 >>> EFER=0000000000000500 >>> Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? >>> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? >>> ?? ?? >>> RAX=0000000000000001 RBX=0000000000000000 RCX=000000007ffdb168 >>> RDX=000000007fedd070 >>> RSI=0000000000000004 RDI=000000007fedd058 RBP=0000000000000000 >>> RSP=000000007e19f000 >>> R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 >>> R11=0000000000000000 >>> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 >>> R15=0000000000000000 >>> RIP=000000000009f0fd RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0 >>> ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA] >>> CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA] >>> SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA] >>> DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA] >>> FS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA] >>> GS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA] >>> LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT >>> TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy >>> GDT= 000000007f294000 00000047 >>> IDT= 000000007f294048 00000fff >>> CR0=e0000011 CR2=0000000000000000 CR3=000000007ff7f000 CR4=00000220 >>> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 >>> DR3=0000000000000000 >>> DR6=00000000ffff0ff0 DR7=0000000000000400 >>> EFER=0000000000000500 >>> Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? >>> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? >>> ?? ?? >> >> So, we can exclude the suspicion that the problem is guest OS >> dependent. >> >> * Then I looked for the base address of the page containing the >> RIP=9f0fd address, in earlier parts of the OVMF log, on the hunch that >> some firmware component might have allocated that area actually. Here >> we go: >> >>> Loading PEIM at 0x000008552C0 EntryPoint=0x000008554E0 CpuMpPei.efi >>> AP Loop Mode is 1 >>> WakeupBufferStart = 9F000, WakeupBufferSize = 1000 >> >> That is, the failure hits (when it hits -- not always) in the area >> where the CpuMpPei driver *borrows* memory for the startup vector of >> the APs, for the purposes of the MP service PPI. ("Wakeup" is an >> overloaded word here; the "wakeup buffer" has nothing to do with S3 >> resume, it just serves for booting the APs temporarily in PEI, for >> implementing the MP service PPI.) >> >> When exiting the PEI phase (on the S3 resume path), CpuMpPei restores >> the original contents of this area. This occurs just before >> transfering control to the guest OS wakeup vector: see the >> "EfiEndOfPeiSignalPpi" and "CpuMpEndOfPeiCallback" strings just above, >> quoted from the OVMF log. >> >> I documented (parts of) this logic in OVMF commit >> >> https://github.com/tianocore/edk2/commit/e3e3090a959a0 >> >> (see the code comments as well). >> >> * At that time, I thought to have identified a memory management bug in >> CpuMpPei; see the following discussion and bug report for details: >> >> https://www.mail-archive.com/edk2-devel@lists.01.org/msg13892.html >> https://bugzilla.tianocore.org/show_bug.cgi?id=67 >> >> However, with the extraction / introduction of MpInitLib, this issue >> has been fixed: GetWakeupBuffer() now calls >> CheckOverlapWithAllocatedBuffer(), so that "memory management bug" is >> no longer; we shouldn't be looking there for the root cause. >> >> * Either way, I don't understand why anything would want to execute code >> in the one page that happens to host the MP services PPI startup >> buffer for APs during PEI. >> >> Not understanding the "why", I looked at the "what", and resorted to >> tracing KVM. Because the problem readily reproduces with this series >> applied (case 13), it wasn't hard to start the tracing while the guest >> was suspended, and capture just the actions that led from the >> KVM-level wakeup to the failure. >> >> The QEMU state dumps are visible above in the email. I've also >> uploaded the compressed OVMF log and the textual KVM trace here: >> >> >> http://people.redhat.com/lersek/s3-crash-8d1dfed7-ca92-4e25-8d2b-b1c9ac2a53db/ >> >> I sincerely hope that Paolo will have a field day with the KVM trace >> :) I managed to identify the following curiosities (remember this is >> all on the S3 resume path): >> >> * First, the VCPUs (there are four of them) enter and leave SMM in a >> really funky pattern: >> >> vcpu#0 vcpu#1 vcpu#2 vcpu#3 >> ------ ------ ------ ------ >> enter >> | >> leave >> >> enter >> | >> leave >> >> enter >> | >> leave >> >> enter >> | >> leave >> >> enter enter >> enter | enter | >> | | | | >> leave | | | >> | | | >> enter | | | >> | | | | >> leave leave leave leave >> >> That is, first we have each VCPU enter and leave SMM in complete >> isolation (1, 2, 3, 0). Then VCPUs 1 and 3 enter SMM together, soon >> followed by VPCUS 0 and 2, also together. VCPU#0 drops out of SMM >> temporarily (it comes back in later), while the other three remain >> in SMM. Finally all four of them leave SMM together. >> >> After which the problem occurs. >> >> * Second, the instruction that causes things to blow up is <0f aa>, >> i.e., RSM. I have absolutely no clue why RSM is executed: >> >> (a) in the area that used to host the AP startup routine for the MP >> services PPI -- note that we also have "Transfer to 16bit OS waking >> vector" in the log, so CpuMpEndOfPeiCallback() restores the borrowed >> area completeley! --, >> >> (b) and why *after* all four VCPUs have just left SMM, together. >> >> * The RSM instruction is handled successfully elsewhere, for example >> when all four VCPUs leave SMM, at the bottom of the diagram above: >> >>> CPU-24447 [002] 39841.982810: kvm_emulate_insn: 0:7ffbf179: 0f aa >>> CPU-24446 [000] 39841.982810: kvm_emulate_insn: 0:7ffbd179: 0f aa >>> CPU-24445 [005] 39841.982810: kvm_emulate_insn: 0:7ffbb179: 0f aa >>> CPU-24444 [006] 39841.982811: kvm_emulate_insn: 0:7ffb9179: 0f aa >> >> * The guest-phys address 7ff7f000 that we see just before the error: >> >>> CPU-24447 [002] 39841.982825: kvm_page_fault: address 7ff7f000 >>> error_code 83 >>> CPU-24446 [000] 39841.982825: kvm_page_fault: address 7ff7f000 >>> error_code 83 >>> CPU-24447 [002] 39841.982826: kvm_emulate_insn: 0:9f0fd: 0f aa >>> CPU-24444 [006] 39841.982827: kvm_exit: reason >>> EXTERNAL_INTERRUPT rip 0xffffffff813a954f info 0 800000fc >>> CPU-24447 [002] 39841.982827: kvm_emulate_insn: 0:9f0fd: 0f aa FAIL >>> CPU-24447 [002] 39841.982827: kvm_userspace_exit: reason >>> KVM_EXIT_INTERNAL_ERROR (17) >> >> can be found higher up in the trace; namely, it is written to CR3 >> several times. It's the root of the page tables. >> >> * The 7F80_1000..7FFF_FFFF guest-phys addresses are all in SMRAM. >> >> * I also tried the "info tlb" monitor command, via "virsh >> qemu-monitor-command --hmp", while the guest was auto-paused after the >> crash. >> >> I cannot provide results: QEMU appeared to return a message that would >> be longer than 16MB after encoding by libvirt, and libvirt rejected >> that ("Unable to encode message payload", see VIR_NET_MESSAGE_MAX). >> >> Anyway, the KVM trace, and the QEMU register dump, look consistent >> with what Paolo said about "Code=?? ?? ??...": >> >> The question marks usually mean that the page tables do not map a >> page at that address. >> >> CR3=000000007ff7f000 points into SMRAM, but we are outside of SMM >> (SMM=0). We can't translate *any* guest-virtual address, as we can't >> even begin walking the page tables. >> >> Thanks >> Laszlo >> _______________________________________________ >> edk2-devel mailing list >> edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org%3cmailto:edk2-devel@lists.01.org>> >> https://lists.01.org/mailman/listinfo/edk2-devel >> _______________________________________________ >> edk2-devel mailing list >> edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org> >> https://lists.01.org/mailman/listinfo/edk2-devel >> _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel