Thanks to report case 3 issue on bugzillar. Let's focus on Case 8. It seems another random failure issue.
I did more test. 1) I tested some other our internal real platform for UEFI32 OS boot. I cannot reproduce the ASSERT. 2) I wrote a small test app to call ExitBootServices and send SMI. I run it on current my windows QEMU but I still cannot reproduce the ASSERT. It seem your env is the only way to repo the issue. I am trying to follow your step by step to install OS on QEMU/KVM. I haven't finish all thing yet, because of some network proxy issue. :( Your information and analysis is great. It gives us some clue. I guess the same thing as you and checked: InitializeSpinLock (mSmmMpSyncData->CpuData[CpuIndex].Busy); This address is initialized in InitializeMpSyncData(), with gSmmCpuPrivate->SmmCoreEntryContext.NumberOfCpus which is got from MpServices->GetNumberOfProcessors(). I do not know why this address is zero. I also did not quite understand below log. * CPU #0: pc=0x00000000c1844555 (halted) thread_id=7835 CPU #1: pc=0x00000000c1844555 (halted) thread_id=7836 CPU #2: pc=0x00000000c1844555 (halted) thread_id=7837 CPU #3: pc=0x000000007ffd17ca thread_id=7838 As I recall, writing to B2 only cause BSP get SMI on OVMF. AP does not enter SMM mode. So why #3 can enter SMM mode? Is that expected behavior? Or unexpected behavior? If this is expected, how this happened? Does OS send SendSmiIpiAllExcludingSelf, after ExitBootServices()? I will see if I can finish QEMU/KVM installation tomorrow. If you have some idea on why and how #3 enter SMM, please let us know. Thank you Yao Jiewen From: Laszlo Ersek [mailto:ler...@redhat.com] Sent: Thursday, November 10, 2016 4:46 AM To: Yao, Jiewen <jiewen....@intel.com> Cc: Tian, Feng <feng.t...@intel.com>; edk2-de...@ml01.01.org; Kinney, Michael D <michael.d.kin...@intel.com>; Paolo Bonzini <pbonz...@redhat.com>; Fan, Jeff <jeff....@intel.com>; Zeng, Star <star.z...@intel.com> Subject: Re: [edk2] [PATCH V2 0/6] Enable SMM page level protection. On 11/09/16 07:25, Yao, Jiewen wrote: > Hi Laszlo > I will fix DEBUG message issue in V3 patch. > > Below is rest issues: > > > l Case 13: S3 fails randomly. > A good news: I worked with Jeff Fan to root-cause the S3 resume issue. Here > is detail. > > > 1) We believe the dead CPU is AP. Not BSP. > The reason is that: > > 1.1) The BSP already transfer control to OS waking vector. The GDT/IDT/CR3 > should be set by OS. > > 1.2) The current dead CPU still has GDT/IDT point to a BIOS reserved > memory. The CS/DS/SS is typical BIOS X64 mode setting. > > 1.3) The current dead CPU still has CR3 in SMM. (Which is obvious wrong) > > > 2) Based upon the 1), we reviewed S3 resume AP flow. > Current BSP will wake up AP in SMRAM, for security consideration. At that > time, we are using SMM mode CR3. It is OK for BSP because BSP is NOT in SMM > mode yet. Even after SMM rebase, we can still use it because SMRR is not set > in first SMM rebase. > Current BSP just uses its own context to initialize AP. So that AP takes BSP > CR3, which is SMM CR3, unfortunately. > After BSP initialized APs, the AP is put to HALT-LOOP in X64 mode. It is the > last straw, because X64 mode halt still need paging. > > > 3) The error happen, once the AP receives an interrupt (for whatever > reason), AP starts executing code. However, that that time the AP might not > be in SMM mode. It means SMM CR3 is not available. And then we see this. > > > 4) I guess we did not see the error, or this is RANDOM issue, because it > depends on if AP receives an interrupt before BSP send INIT-SIPI-SIPI. > > > 5) The fix, I think, should be below: > We should always put AP to protected mode, so that no paging is needed. > We should put AP in above 1M reserved memory, instead of <1M memory, because > <1M memory is restored. > > > Would you please file a bugzillar? I think we need assign CPU owner to fix > that critical issue. > > There is no need to do more investigation. Thanks for your great help on > that. :) Thank you for your help! I filed <https://bugzilla.tianocore.org/show_bug.cgi?id=216>. The title is BSP exits SMM and closes SMRAM on the S3 resume path before meeting with AP(s) I hope the title is mostly right. I didn't add any other details (I haven't gone through the thread in detail yet, and without that I can't even write up a semi-reasonable report myself). Instead, I referenced this message of yours in the report, and I also linked Paolo's analysis from elsewhere in the thread. I hope this will do for the report. (Also, thank you Paolo, from the amazing analysis -- I haven't digested it yet, but I can already tell it's amazing! :)) > l Case 17 - I do not think it is a real issue, because SMM is out of > resource. > > > l Case 8 - that is a very weird issue. I talk with Jeff again. I do not have > a clear clue yet. >> ASSERT MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(73): >> SpinLock != ((void *) 0) > Here is code. We do not know why there is some code need InitializeSpinLock > after ExitBootServices. > SPIN_LOCK * > EFIAPI > InitializeSpinLock ( > OUT SPIN_LOCK *SpinLock > ) > { > ASSERT (SpinLock != NULL); > > _ReadWriteBarrier(); > *SpinLock = SPIN_LOCK_RELEASED; > _ReadWriteBarrier(); > > return SpinLock; > } > > If you can have a quick check on below, that would be great. > > 1) Which processor triggers this ASSERT? BSP or AP. > > 2) Which module triggers this ASSERT? Which module contains current RIP > value? First, one additional piece of info I have learned is that the issue does not always present itself. Sometimes the boot just works fine, other times the assert fires. Using the QEMU monitor, I managed to get the following information with the "info cpus" command: * CPU #0: pc=0x00000000c1844555 (halted) thread_id=7835 CPU #1: pc=0x00000000c1844555 (halted) thread_id=7836 CPU #2: pc=0x00000000c1844555 (halted) thread_id=7837 CPU #3: pc=0x000000007ffd17ca thread_id=7838 VCPU#3 is an AP (the last AP), I think. The instruction pointer seems to point into SMRAM again. In the OVMF log, I see Loading SMM driver at 0x0007FFC9000 EntryPoint=0x0007FFC9253 PiSmmCpuDxeSmm.efi So the offset into PiSmmCpuDxeSmm.efi is 0x87CA (or, relative to the entry point, 0x8577, 0x253 bytes less). Running objdump -x -S Build/OvmfIa32/DEBUG_GCC48/IA32/PiSmmCpuDxeSmm.debug first I see confirmation that start address 0x00000253 and then 000087bd <CpuDeadLoop>: VOID EFIAPI CpuDeadLoop ( VOID ) { 87bd: 55 push %ebp 87be: 89 e5 mov %esp,%ebp 87c0: 83 ec 10 sub $0x10,%esp volatile UINTN Index; for (Index = 0; Index == 0;); 87c3: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp) 87ca: 8b 45 fc mov -0x4(%ebp),%eax <-- HERE 87cd: 85 c0 test %eax,%eax 87cf: 74 f9 je 87ca <CpuDeadLoop+0xd> } 87d1: c9 leave 87d2: c3 ret This seems consistent with an assertion failure. I searched UefiCpuPkg/PiSmmCpuDxeSmm/ for InitializeSpinLock(), and the SmiRendezvous() function [UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c] looks like a possible caller: // // The BUSY lock is initialized to Released state. This needs to // be done early enough to be ready for BSP's SmmStartupThisAp() // call. E.g., with Relaxed AP flow, SmmStartupThisAp() may be // called immediately after AP's present flag is detected. // InitializeSpinLock (mSmmMpSyncData->CpuData[CpuIndex].Busy); Just a guess, of course. > At same time, all my OS test is on real platform. I have not setup OVMF env > to run an OS yet. > If you can share a step by step to me, that would be great. (1) Grab a host computer with a CPU that supports VMX and EPT. (2) Download and install Fedora 24 (for example): https://getfedora.org/en/workstation/download/ http://docs.fedoraproject.org/install-guide (3) Install the "qemu-system-x86" package with DNF dnf install qemu-system-x86 (4) clone edk2 with git (5) embed OpenSSL optionally (for secure boot); see "CryptoPkg/Library/OpensslLib/Patch-HOWTO.txt" (6) build OVMF: source edksetup.sh make -C "$EDK_TOOLS_PATH" # Ia32 build \ -a IA32 \ -p OvmfPkg/OvmfPkgIa32.dsc \ -D SMM_REQUIRE -D SECURE_BOOT_ENABLE \ -t GCC5 -b DEBUG # Ia32X64 build \ -a IA32 -a X64 \ -p OvmfPkg/OvmfPkgIa32X64.dsc \ -D SMM_REQUIRE -D SECURE_BOOT_ENABLE \ -t GCC5 -b DEBUG (7) Create disk images: qemu-img create -f qcow2 -o compat=1.1 -o cluster_size=65536 \ -o preallocation=metadata -o lazy_refcounts=on disk-ia32.img 100G qemu-img create -f qcow2 -o compat=1.1 -o cluster_size=65536 \ -o preallocation=metadata -o lazy_refcounts=on disk-ia32x64.img 100G (8) For a 64-bit guest OS, you can again use the Fedora 24 Workstation that you downloaded already (the ISO image). For 32-bit guest OS, this one used to work: https://www.happyassassin.net/fedlet-a-fedora-remix-for-bay-trail-tablets/ minimally the 20141209 release. Hm... actually, I think the maintainer of that image has discontinued the downloadable files :( So, I don't know what 32-bit UEFI OS to recommend for testing. 32-bit Windows doesn't boot on OVMF (I looked into that earlier, several times, with some help from a Microsoft developer, but we couldn't solve it), so I can't recommend Windows as an alternative. Perhaps you can use https://linuxiumcomau.blogspot.com/2016/10/running-ubuntu-on-intel-bay-trail-and.html as a 32-bit guest OS, I never tried. (9) Anyway, once you have an installer ISO, set the "ISO" environment variable to the ISO image's full pathname, and then run QEMU like this: # Settings for Ia32 only: ISO=... DISK=.../disk-ia32.img FW=.../Build/OvmfIa32/DEBUG_GCC5/FV/OVMF_CODE.fd TEMPLATE=.../Build/OvmfIa32/DEBUG_GCC5/FV/OVMF_VARS.fd VARS=vars-32.fd QEMU_COMMAND="qemu-system-i386 -cpu coreduo,-nx" DEBUG=debug-32.log # Settings for Ia32X64 only: ISO=... DISK=.../disk-ia32x64.img FW=.../Build/Ovmf3264/DEBUG_GCC5/FV/OVMF_CODE.fd TEMPLATE=.../Build/Ovmf3264/DEBUG_GCC5/FV/OVMF_VARS.fd VARS=vars-3264.fd QEMU_COMMAND=qemu-system-x86_64 DEBUG=debug-3264.log # Common commands for both target arches: # create variable store from varstore template # if the former doesn't exist yet if ! [ -e "$VARS" ]; then cp -- "$TEMPLATE" "$VARS" fi $QEMU_COMMAND \ -machine q35,smm=on,accel=kvm \ -m 4096 \ -smp sockets=1,cores=2,threads=2 \ -global driver=cfi.pflash01,property=secure,value=on \ -drive if=pflash,format=raw,unit=0,file=${FW},readonly=on \ -drive if=pflash,format=raw,unit=1,file=${VARS} \ \ -chardev file,id=debugfile,path=$DEBUG \ -device isa-debugcon,iobase=0x402,chardev=debugfile \ \ -chardev stdio,id=char0,signal=off,mux=on \ -mon chardev=char0,mode=readline,default \ -serial chardev:char0 \ \ -drive id=iso,if=none,format=raw,readonly,file=$ISO \ -drive id=disk,if=none,format=qcow2,file=$DISK \ \ -device virtio-scsi-pci,id=scsi0 \ -device scsi-cd,drive=iso,bus=scsi0.0,bootindex=2 \ -device scsi-hd,drive=disk,bus=scsi0.0,bootindex=1 \ \ -device VGA This will capture the OVMF debug output in the $DEBUG file. Also, the terminal where you run the command can be switched between the guest's serial console and the QEMU monitor with [Ctrl-A C]. Thanks Laszlo > > Thank you > Yao Jiewen > > From: edk2-devel [mailto:edk2-devel-boun...@lists.01.org] On Behalf Of Laszlo > Ersek > Sent: Tuesday, November 8, 2016 9:22 AM > To: Yao, Jiewen <jiewen....@intel.com<mailto:jiewen....@intel.com>> > Cc: Tian, Feng <feng.t...@intel.com<mailto:feng.t...@intel.com>>; > edk2-de...@ml01.01.org<mailto:edk2-de...@ml01.01.org>; Kinney, Michael D > <michael.d.kin...@intel.com<mailto:michael.d.kin...@intel.com>>; Paolo > Bonzini <pbonz...@redhat.com<mailto:pbonz...@redhat.com>>; Fan, Jeff > <jeff....@intel.com<mailto:jeff....@intel.com>>; Zeng, Star > <star.z...@intel.com<mailto:star.z...@intel.com>> > Subject: Re: [edk2] [PATCH V2 0/6] Enable SMM page level protection. > > On 11/04/16 10:30, Jiewen Yao wrote: >> ==== below is V2 description ==== >> 1) PiSmmCpu: resolve OVMF multiple processors boot hang issue. >> 2) PiSmmCpu: Add debug info on StartupAp() fails. >> 3) PiSmmCpu: Add ASSERT for AllocatePages(). >> 4) PiSmmCpu: Add protection detail in commit message. >> 5) UefiCpuPkg.dsc: Add page table footprint info in commit message. >> >> ==== below is V1 description ==== >> This series patch enables SMM page level protection. >> Features are: >> 1) PiSmmCore reports SMM PE image code/data information >> in EdkiiPiSmmMemoryAttributeTable, if the SMM image is page aligned. >> 2) PiSmmCpu consumes EdkiiPiSmmMemoryAttributeTable >> and set XD for data page and RO for code page. >> 3) PiSmmCpu enables Static Paging for X64 according to >> PcdCpuSmmStaticPageTable. If it is true, 1G paging for above 4G >> is used as long as it is supported. >> 4) PiSmmCpu sets importance data structure to be read only, >> such as Gdt, Idt, SmmEntrypoint, and PageTable itself. >> >> tested platform: >> 1) Intel internal platform (X64). >> 2) EDKII Quark IA32 >> 3) EDKII Vlv2 X64 >> 4) EDKII OVMF IA32 and IA32X64. (with -smp 8) >> >> Cc: Jeff Fan >> <jeff....@intel.com<mailto:jeff....@intel.com<mailto:jeff....@intel.com%3cmailto:jeff....@intel.com>>> >> Cc: Feng Tian >> <feng.t...@intel.com<mailto:feng.t...@intel.com<mailto:feng.t...@intel.com%3cmailto:feng.t...@intel.com>>> >> Cc: Star Zeng >> <star.z...@intel.com<mailto:star.z...@intel.com<mailto:star.z...@intel.com%3cmailto:star.z...@intel.com>>> >> Cc: Michael D Kinney >> <michael.d.kin...@intel.com<mailto:michael.d.kin...@intel.com<mailto:michael.d.kin...@intel.com%3cmailto:michael.d.kin...@intel.com>>> >> Cc: Laszlo Ersek >> <ler...@redhat.com<mailto:ler...@redhat.com<mailto:ler...@redhat.com%3cmailto:ler...@redhat.com>>> >> Contributed-under: TianoCore Contribution Agreement 1.0 >> Signed-off-by: Jiewen Yao >> <jiewen....@intel.com<mailto:jiewen....@intel.com<mailto:jiewen....@intel.com%3cmailto:jiewen....@intel.com>>> > > I have new test results. Let's start with the table again: > > Legend: > > - "untested" means the test was not executed because the same test > failed or proved unreliable in a less demanding configuration already, > > - "n/a" means a setting or test case was impossible, > > - "fail" and "unreliable" (lower case) are outside the scope of this > series; they either capture the pre-series status, or are expected > even with the series applied due to the pre-series status, > > - "FAIL" and "UNRELIABLE" mean regressions caused (or exposed) by the > series. > > In all cases, 36 bits were used as address width in the CPU HOB (--> up > to 64GB guest-phys address space). > > series OVMF > VCPU boot S3 resume > # applied platform PcdCpuMaxLogicalProcessorNumber PcdCpuSmmStaticPageTable > topology result result > -- ------- -------- ------------------------------- ------------------------ > -------- ------ --------- > 1 no Ia32 64 n/a > 1x2x2 pass unreliable > 2 no Ia32 255 n/a > 52x2x2 pass untested > 3 no Ia32 255 n/a > 53x2x2 unreliable untested > 4 no Ia32X64 64 n/a > 1x2x2 pass unreliable > 5 no Ia32X64 255 n/a > 52x2x2 pass untested > 6 no Ia32X64 255 n/a > 54x2x2 fail n/a > 7 v2 Ia32 64 FALSE > 1x2x2 pass untested > 8 v2 Ia32 64 TRUE > 1x2x2 FAIL untested > 9 v2 Ia32 255 FALSE > 52x2x2 pass untested > 10 v2 Ia32 255 FALSE > 53x2x2 untested untested > 11 v2 Ia32 255 TRUE > 52x2x2 untested untested > 12 v2 Ia32 255 TRUE > 53x2x2 untested untested > 13 v2 Ia32X64 64 FALSE > 1x2x2 pass unreliable > 14 v2 Ia32X64 64 TRUE > 1x2x2 pass untested > 15 v2 Ia32X64 255 FALSE > 52x2x2 pass untested > 16 v2 Ia32X64 255 FALSE > 54x2x2 untested untested > 17 v2 Ia32X64 255 TRUE > 52x2x2 FAIL untested > 18 v2 Ia32X64 255 TRUE > 54x2x2 untested untested > > * Case 8: this test case failed with v2 as well, but this time with > different symptoms: > >> FSOpen: Open '\EFI\fedora\grubia32.efi' Success >> InstallProtocolInterface: [EfiLoadedImageProtocol] 7E4037A8 >> Loading driver at 0x0007DA7F000 EntryPoint=0x0007DA7F400 >> InstallProtocolInterface: [EfiLoadedImageDevicePathProtocol] 7E403A90 >> PixelBlueGreenRedReserved8BitPerColor >> ConvertPages: Incompatible memory types >> PixelBlueGreenRedReserved8BitPerColor >> ConvertPages: Incompatible memory types >> SmmInstallProtocolInterface: [EdkiiSmmExitBootServicesProtocol] 0 >> MpInitExitBootServicesCallback() done! >> ASSERT MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(73): >> SpinLock != ((void *) 0) > > I didn't try to narrow this down. > > * Case 13 (the "unreliable S3 resume" case): Here the news are both bad > and good. The good news is for Jiewen: this patch series does not > cause the unreliability, it "only" amplifies it severely. The bad news > is correspondingly for everyone else: S3 resume is actually unreliable > even in case 4, that is, without this series applied, it's just the > failure rate is much-much lower. > > Namely, in my new testing, in case 13, S3 resume failed 8 times out of > 21 tries. (I stopped testing at the 8th failure.) > > Whereas in case 4, S3 resume failed with *identical symptoms* (e.g., > exact same RIP=000000000009f0fd), 1 time out of 55 tries. (It was try > #12 that failed; I continued testing and aborted the test after the > 55th try.) > > So, while the series hugely amplifies the failure rate, the failure > does exist without the series. Which is why I modified the case 4 > results in the table, and also lower-cased the word "unreliable" in > case 13. > > Below I will return to this problem separately; let's go over the rest > of the table first. > > * Case 17: I guess this is not a real failure, I'm just including it for > completeness, as PcdCpuSmmStaticPageTable==TRUE is known to present > additional SMRAM demand (see the commit message on patch V2 4/6). This > case fails with > >> SmmLockBox Command - 4 >> SmmLockBoxSmmLib SetLockBoxAttributes - Enter >> SmmLockBoxSmmLib SetLockBoxAttributes - Exit (Success) >> SmmLockBox SmmLockBoxHandler Exit >> SmmLockBoxDxeLib SetLockBoxAttributes - Exit (Success) >> SmmInstallProtocolInterface: [EfiSmmReadyToLockProtocol] 0 >> ASSERT UefiCpuPkg/PiSmmCpuDxeSmm/CpuS3.c(892): mGdtForAp != ((void *) 0) > > which is an SMRAM allocation failure. If I lower the VCPU count to > 50x2x2, then the guest boots fine. > > ----*---- > > Before I get to the S3 resume problem (which, again, reproduces without > this series, although much less frequently), I'd like to comment on the > removal of the ASSERT(), from v1 to v2, in the FlushTlbForAll() > function, on the return value of SmmBlockingStartupThisAp(). This change > allows v2 to proceed past that point; however, I'm seeing a whole lot of > >> !mSmmMpSyncData->CpuData[1].Present >> !mSmmMpSyncData->CpuData[2].Present >> !mSmmMpSyncData->CpuData[3].Present >> ... > > messages in the OVMF boot log, interspersed with > >> ConvertPageEntryAttribute 0x7F92B067->0x7F92B065 > > style messages. (That is, one error message for each AP, per > ConvertPageEntryAttribute() message.) > > Is this okay / intentional? The number of these messages can go up to > several thousands and that sort of drowns out everything else in the > log. > > It's also not easy to mask the message, because it's logged on the > DEBUG_ERROR level. > > ----*---- > > * Okay, so the S3 problem. Last time I suspected that the failure point > (RIP=9f0fd) was in the Linux guest's S3 wakeup vector (which starts at > 9A1D0, according to the OVMF log). In order to test this idea, I > exercised this series with S3 against a Windows 8.1 guest (--> case 13 > again). The failure reproduced on the second S3 resume, with identical > RIP, despite the Windows wakeup vector being located elsewhere (at > 0x1000). > > Quoting the OVMF log leading up to the resume: > >> Call AsmDisablePaging64() to return to S3 Resume in PEI Phase >> Install PPI: [PeiPostScriptTablePpi] >> Install PPI: [EfiEndOfPeiSignalPpi] >> Notify: PPI Guid: [EfiEndOfPeiSignalPpi], Peim notify entry point: 857895 >> PeiMpInitLib: CpuMpEndOfPeiCallback () invoked >> Transfer to 16bit OS waking vector - 1000 > > QEMU log (same as before): > >> KVM internal error. Suberror: 1 >> KVM internal error. Suberror: 1 >> emulation failure >> emulation failure >> RAX=0000000000000002 RBX=0000000000000000 RCX=000000007ffdb168 >> RDX=000000007fedd058 >> RSI=0000000000000004 RDI=000000007fedd040 RBP=0000000000000000 >> RSP=000000007e1a7000 >> R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 >> R11=0000000000000000 >> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 >> R15=0000000000000000 >> RIP=000000000009f0fd RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0 >> ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA] >> CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA] >> SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA] >> DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA] >> FS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA] >> GS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA] >> LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT >> TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy >> GDT= 000000007f294000 00000047 >> IDT= 000000007f294048 00000fff >> CR0=e0000011 CR2=0000000000000000 CR3=000000007ff7f000 CR4=00000220 >> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 >> DR3=0000000000000000 >> DR6=00000000ffff0ff0 DR7=0000000000000400 >> EFER=0000000000000500 >> Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? >> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? >> ?? ?? >> RAX=0000000000000001 RBX=0000000000000000 RCX=000000007ffdb168 >> RDX=000000007fedd070 >> RSI=0000000000000004 RDI=000000007fedd058 RBP=0000000000000000 >> RSP=000000007e19f000 >> R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 >> R11=0000000000000000 >> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 >> R15=0000000000000000 >> RIP=000000000009f0fd RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0 >> ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA] >> CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA] >> SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA] >> DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA] >> FS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA] >> GS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA] >> LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT >> TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy >> GDT= 000000007f294000 00000047 >> IDT= 000000007f294048 00000fff >> CR0=e0000011 CR2=0000000000000000 CR3=000000007ff7f000 CR4=00000220 >> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 >> DR3=0000000000000000 >> DR6=00000000ffff0ff0 DR7=0000000000000400 >> EFER=0000000000000500 >> Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? >> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? >> ?? ?? > > So, we can exclude the suspicion that the problem is guest OS > dependent. > > * Then I looked for the base address of the page containing the > RIP=9f0fd address, in earlier parts of the OVMF log, on the hunch that > some firmware component might have allocated that area actually. Here > we go: > >> Loading PEIM at 0x000008552C0 EntryPoint=0x000008554E0 CpuMpPei.efi >> AP Loop Mode is 1 >> WakeupBufferStart = 9F000, WakeupBufferSize = 1000 > > That is, the failure hits (when it hits -- not always) in the area > where the CpuMpPei driver *borrows* memory for the startup vector of > the APs, for the purposes of the MP service PPI. ("Wakeup" is an > overloaded word here; the "wakeup buffer" has nothing to do with S3 > resume, it just serves for booting the APs temporarily in PEI, for > implementing the MP service PPI.) > > When exiting the PEI phase (on the S3 resume path), CpuMpPei restores > the original contents of this area. This occurs just before > transfering control to the guest OS wakeup vector: see the > "EfiEndOfPeiSignalPpi" and "CpuMpEndOfPeiCallback" strings just above, > quoted from the OVMF log. > > I documented (parts of) this logic in OVMF commit > > https://github.com/tianocore/edk2/commit/e3e3090a959a0 > > (see the code comments as well). > > * At that time, I thought to have identified a memory management bug in > CpuMpPei; see the following discussion and bug report for details: > > https://www.mail-archive.com/edk2-devel@lists.01.org/msg13892.html > https://bugzilla.tianocore.org/show_bug.cgi?id=67 > > However, with the extraction / introduction of MpInitLib, this issue > has been fixed: GetWakeupBuffer() now calls > CheckOverlapWithAllocatedBuffer(), so that "memory management bug" is > no longer; we shouldn't be looking there for the root cause. > > * Either way, I don't understand why anything would want to execute code > in the one page that happens to host the MP services PPI startup > buffer for APs during PEI. > > Not understanding the "why", I looked at the "what", and resorted to > tracing KVM. Because the problem readily reproduces with this series > applied (case 13), it wasn't hard to start the tracing while the guest > was suspended, and capture just the actions that led from the > KVM-level wakeup to the failure. > > The QEMU state dumps are visible above in the email. I've also > uploaded the compressed OVMF log and the textual KVM trace here: > > > http://people.redhat.com/lersek/s3-crash-8d1dfed7-ca92-4e25-8d2b-b1c9ac2a53db/ > > I sincerely hope that Paolo will have a field day with the KVM trace > :) I managed to identify the following curiosities (remember this is > all on the S3 resume path): > > * First, the VCPUs (there are four of them) enter and leave SMM in a > really funky pattern: > > vcpu#0 vcpu#1 vcpu#2 vcpu#3 > ------ ------ ------ ------ > enter > | > leave > > enter > | > leave > > enter > | > leave > > enter > | > leave > > enter enter > enter | enter | > | | | | > leave | | | > | | | > enter | | | > | | | | > leave leave leave leave > > That is, first we have each VCPU enter and leave SMM in complete > isolation (1, 2, 3, 0). Then VCPUs 1 and 3 enter SMM together, soon > followed by VPCUS 0 and 2, also together. VCPU#0 drops out of SMM > temporarily (it comes back in later), while the other three remain > in SMM. Finally all four of them leave SMM together. > > After which the problem occurs. > > * Second, the instruction that causes things to blow up is <0f aa>, > i.e., RSM. I have absolutely no clue why RSM is executed: > > (a) in the area that used to host the AP startup routine for the MP > services PPI -- note that we also have "Transfer to 16bit OS waking > vector" in the log, so CpuMpEndOfPeiCallback() restores the borrowed > area completeley! --, > > (b) and why *after* all four VCPUs have just left SMM, together. > > * The RSM instruction is handled successfully elsewhere, for example > when all four VCPUs leave SMM, at the bottom of the diagram above: > >> CPU-24447 [002] 39841.982810: kvm_emulate_insn: 0:7ffbf179: 0f aa >> CPU-24446 [000] 39841.982810: kvm_emulate_insn: 0:7ffbd179: 0f aa >> CPU-24445 [005] 39841.982810: kvm_emulate_insn: 0:7ffbb179: 0f aa >> CPU-24444 [006] 39841.982811: kvm_emulate_insn: 0:7ffb9179: 0f aa > > * The guest-phys address 7ff7f000 that we see just before the error: > >> CPU-24447 [002] 39841.982825: kvm_page_fault: address 7ff7f000 >> error_code 83 >> CPU-24446 [000] 39841.982825: kvm_page_fault: address 7ff7f000 >> error_code 83 >> CPU-24447 [002] 39841.982826: kvm_emulate_insn: 0:9f0fd: 0f aa >> CPU-24444 [006] 39841.982827: kvm_exit: reason >> EXTERNAL_INTERRUPT rip 0xffffffff813a954f info 0 800000fc >> CPU-24447 [002] 39841.982827: kvm_emulate_insn: 0:9f0fd: 0f aa FAIL >> CPU-24447 [002] 39841.982827: kvm_userspace_exit: reason >> KVM_EXIT_INTERNAL_ERROR (17) > > can be found higher up in the trace; namely, it is written to CR3 > several times. It's the root of the page tables. > > * The 7F80_1000..7FFF_FFFF guest-phys addresses are all in SMRAM. > > * I also tried the "info tlb" monitor command, via "virsh > qemu-monitor-command --hmp", while the guest was auto-paused after the > crash. > > I cannot provide results: QEMU appeared to return a message that would > be longer than 16MB after encoding by libvirt, and libvirt rejected > that ("Unable to encode message payload", see VIR_NET_MESSAGE_MAX). > > Anyway, the KVM trace, and the QEMU register dump, look consistent > with what Paolo said about "Code=?? ?? ??...": > > The question marks usually mean that the page tables do not map a > page at that address. > > CR3=000000007ff7f000 points into SMRAM, but we are outside of SMM > (SMM=0). We can't translate *any* guest-virtual address, as we can't > even begin walking the page tables. > > Thanks > Laszlo > _______________________________________________ > edk2-devel mailing list > edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org%3cmailto:edk2-devel@lists.01.org>> > https://lists.01.org/mailman/listinfo/edk2-devel > _______________________________________________ > edk2-devel mailing list > edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org> > https://lists.01.org/mailman/listinfo/edk2-devel > _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel