Thanks to report case 3 issue on bugzillar.

Let's focus on Case 8.
It seems another random failure issue.

I did more test.

1)      I tested some other our internal real platform for UEFI32 OS boot. I 
cannot reproduce the ASSERT.

2)      I wrote a small test app to call ExitBootServices and send SMI. I run 
it on current my windows QEMU but I still cannot reproduce the ASSERT.

It seem your env is the only way to repo the issue. I am trying to follow your 
step by step to install OS on QEMU/KVM. I haven't finish all thing yet, because 
of some network proxy issue. :(

Your information and analysis is great. It gives us some clue.

I guess the same thing as you and checked: InitializeSpinLock 
(mSmmMpSyncData->CpuData[CpuIndex].Busy);

This address is initialized in InitializeMpSyncData(), with 
gSmmCpuPrivate->SmmCoreEntryContext.NumberOfCpus which is got from 
MpServices->GetNumberOfProcessors().
I do not know why this address is zero.

I also did not quite understand below log.

* CPU #0: pc=0x00000000c1844555 (halted) thread_id=7835
  CPU #1: pc=0x00000000c1844555 (halted) thread_id=7836
  CPU #2: pc=0x00000000c1844555 (halted) thread_id=7837
  CPU #3: pc=0x000000007ffd17ca thread_id=7838

As I recall, writing to B2 only cause BSP get SMI on OVMF. AP does not enter 
SMM mode.
So why #3 can enter SMM mode? Is that expected behavior? Or unexpected behavior?
If this is expected, how this happened? Does OS send 
SendSmiIpiAllExcludingSelf, after ExitBootServices()?

I will see if I can finish QEMU/KVM installation tomorrow.

If you have some idea on why and how #3 enter SMM, please let us know.


Thank you
Yao Jiewen


From: Laszlo Ersek [mailto:ler...@redhat.com]
Sent: Thursday, November 10, 2016 4:46 AM
To: Yao, Jiewen <jiewen....@intel.com>
Cc: Tian, Feng <feng.t...@intel.com>; edk2-de...@ml01.01.org; Kinney, Michael D 
<michael.d.kin...@intel.com>; Paolo Bonzini <pbonz...@redhat.com>; Fan, Jeff 
<jeff....@intel.com>; Zeng, Star <star.z...@intel.com>
Subject: Re: [edk2] [PATCH V2 0/6] Enable SMM page level protection.

On 11/09/16 07:25, Yao, Jiewen wrote:
> Hi Laszlo
> I will fix DEBUG message issue in V3 patch.
>
> Below is rest issues:
>
>
> l  Case 13: S3 fails randomly.
> A good news: I worked with Jeff Fan to root-cause the S3 resume issue. Here 
> is detail.
>
>
> 1)      We believe the dead CPU is AP. Not BSP.
> The reason is that:
>
> 1.1)   The BSP already transfer control to OS waking vector. The GDT/IDT/CR3 
> should be set by OS.
>
> 1.2)   The current dead CPU still has GDT/IDT point to a BIOS reserved 
> memory. The CS/DS/SS is typical BIOS X64 mode setting.
>
> 1.3)   The current dead CPU still has CR3 in SMM. (Which is obvious wrong)
>
>
> 2)      Based upon the 1), we reviewed S3 resume AP flow.
> Current BSP will wake up AP in SMRAM, for security consideration. At that 
> time, we are using SMM mode CR3. It is OK for BSP because BSP is NOT in SMM 
> mode yet. Even after SMM rebase, we can still use it because SMRR is not set 
> in first SMM rebase.
> Current BSP just uses its own context to initialize AP. So that AP takes BSP 
> CR3, which is SMM CR3, unfortunately.
> After BSP initialized APs, the AP is put to HALT-LOOP in X64 mode. It is the 
> last straw, because X64 mode halt still need paging.
>
>
> 3)      The error happen, once the AP receives an interrupt (for whatever 
> reason), AP starts executing code. However, that that time the AP might not 
> be in SMM mode. It means SMM CR3 is not available. And then we see this.
>
>
> 4)      I guess we did not see the error, or this is RANDOM issue, because it 
> depends on if AP receives an interrupt before BSP send INIT-SIPI-SIPI.
>
>
> 5)      The fix, I think, should be below:
> We should always put AP to protected mode, so that no paging is needed.
> We should put AP in above 1M reserved memory, instead of <1M memory, because 
> <1M memory is restored.
>
>
> Would you please file a bugzillar? I think we need assign CPU owner to fix 
> that critical issue.
>
> There is no need to do more investigation. Thanks for your great help on 
> that. :)

Thank you for your help!

I filed <https://bugzilla.tianocore.org/show_bug.cgi?id=216>. The title is

    BSP exits SMM and closes SMRAM on the S3 resume path before
    meeting with AP(s)

I hope the title is mostly right. I didn't add any other details (I
haven't gone through the thread in detail yet, and without that I can't
even write up a semi-reasonable report myself). Instead, I referenced
this message of yours in the report, and I also linked Paolo's analysis
from elsewhere in the thread. I hope this will do for the report.

(Also, thank you Paolo, from the amazing analysis -- I haven't digested
it yet, but I can already tell it's amazing! :))

> l  Case 17 - I do not think it is a real issue, because SMM is out of 
> resource.
>
>
> l  Case 8 - that is a very weird issue. I talk with Jeff again. I do not have 
> a clear clue yet.
>> ASSERT MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(73): 
>> SpinLock != ((void *) 0)
> Here is code. We do not know why there is some code need InitializeSpinLock 
> after ExitBootServices.
> SPIN_LOCK *
> EFIAPI
> InitializeSpinLock (
>   OUT      SPIN_LOCK                 *SpinLock
>   )
> {
>   ASSERT (SpinLock != NULL);
>
>   _ReadWriteBarrier();
>   *SpinLock = SPIN_LOCK_RELEASED;
>   _ReadWriteBarrier();
>
>   return SpinLock;
> }
>
> If you can have a quick check on below, that would be great.
>
> 1)      Which processor triggers this ASSERT? BSP or AP.
>
> 2)      Which module triggers this ASSERT? Which module contains current RIP 
> value?

First, one additional piece of info I have learned is that the issue
does not always present itself. Sometimes the boot just works fine,
other times the assert fires.

Using the QEMU monitor, I managed to get the following information with
the "info cpus" command:

* CPU #0: pc=0x00000000c1844555 (halted) thread_id=7835
  CPU #1: pc=0x00000000c1844555 (halted) thread_id=7836
  CPU #2: pc=0x00000000c1844555 (halted) thread_id=7837
  CPU #3: pc=0x000000007ffd17ca thread_id=7838

VCPU#3 is an AP (the last AP), I think. The instruction pointer seems to
point into SMRAM again.

In the OVMF log, I see

Loading SMM driver at 0x0007FFC9000 EntryPoint=0x0007FFC9253
PiSmmCpuDxeSmm.efi

So the offset into PiSmmCpuDxeSmm.efi is 0x87CA (or, relative to the
entry point, 0x8577, 0x253 bytes less).

Running

  objdump -x -S Build/OvmfIa32/DEBUG_GCC48/IA32/PiSmmCpuDxeSmm.debug

first I see confirmation that

  start address 0x00000253

and then

000087bd <CpuDeadLoop>:
VOID
EFIAPI
CpuDeadLoop (
  VOID
  )
{
    87bd:       55                      push   %ebp
    87be:       89 e5                   mov    %esp,%ebp
    87c0:       83 ec 10                sub    $0x10,%esp
  volatile UINTN  Index;

  for (Index = 0; Index == 0;);
    87c3:       c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%ebp)
    87ca:       8b 45 fc                mov    -0x4(%ebp),%eax  <-- HERE
    87cd:       85 c0                   test   %eax,%eax
    87cf:       74 f9                   je     87ca <CpuDeadLoop+0xd>
}
    87d1:       c9                      leave
    87d2:       c3                      ret

This seems consistent with an assertion failure.

I searched UefiCpuPkg/PiSmmCpuDxeSmm/ for InitializeSpinLock(), and the
SmiRendezvous() function [UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c] looks
like a possible caller:

      //
      // The BUSY lock is initialized to Released state. This needs to
      // be done early enough to be ready for BSP's SmmStartupThisAp()
      // call. E.g., with Relaxed AP flow, SmmStartupThisAp() may be
      // called immediately after AP's present flag is detected.
      //
      InitializeSpinLock (mSmmMpSyncData->CpuData[CpuIndex].Busy);

Just a guess, of course.

> At same time, all my OS test is on real platform. I have not setup OVMF env 
> to run an OS yet.
> If you can share a step by step to me, that would be great.

(1) Grab a host computer with a CPU that supports VMX and EPT.

(2) Download and install Fedora 24 (for example):

https://getfedora.org/en/workstation/download/
http://docs.fedoraproject.org/install-guide

(3) Install the "qemu-system-x86" package with DNF

dnf install qemu-system-x86

(4) clone edk2 with git

(5) embed OpenSSL optionally (for secure boot); see
"CryptoPkg/Library/OpensslLib/Patch-HOWTO.txt"

(6) build OVMF:

source edksetup.sh
make -C "$EDK_TOOLS_PATH"

# Ia32
build \
  -a IA32 \
  -p OvmfPkg/OvmfPkgIa32.dsc \
  -D SMM_REQUIRE -D SECURE_BOOT_ENABLE \
  -t GCC5 -b DEBUG

# Ia32X64
build \
  -a IA32 -a X64 \
  -p OvmfPkg/OvmfPkgIa32X64.dsc \
  -D SMM_REQUIRE -D SECURE_BOOT_ENABLE \
  -t GCC5 -b DEBUG

(7) Create disk images:

qemu-img create -f qcow2 -o compat=1.1 -o cluster_size=65536 \
  -o preallocation=metadata -o lazy_refcounts=on disk-ia32.img 100G

qemu-img create -f qcow2 -o compat=1.1 -o cluster_size=65536 \
  -o preallocation=metadata -o lazy_refcounts=on disk-ia32x64.img 100G

(8) For a 64-bit guest OS, you can again use the Fedora 24 Workstation
that you downloaded already (the ISO image).

For 32-bit guest OS, this one used to work:

https://www.happyassassin.net/fedlet-a-fedora-remix-for-bay-trail-tablets/

minimally the 20141209 release. Hm... actually, I think the maintainer
of that image has discontinued the downloadable files :(

So, I don't know what 32-bit UEFI OS to recommend for testing.

32-bit Windows doesn't boot on OVMF (I looked into that earlier, several
times, with some help from a Microsoft developer, but we couldn't solve
it), so I can't recommend Windows as an alternative.

Perhaps you can use

https://linuxiumcomau.blogspot.com/2016/10/running-ubuntu-on-intel-bay-trail-and.html

as a 32-bit guest OS, I never tried.

(9) Anyway, once you have an installer ISO, set the "ISO" environment
variable to the ISO image's full pathname, and then run QEMU like this:

# Settings for Ia32 only:

ISO=...
DISK=.../disk-ia32.img
FW=.../Build/OvmfIa32/DEBUG_GCC5/FV/OVMF_CODE.fd
TEMPLATE=.../Build/OvmfIa32/DEBUG_GCC5/FV/OVMF_VARS.fd
VARS=vars-32.fd
QEMU_COMMAND="qemu-system-i386 -cpu coreduo,-nx"
DEBUG=debug-32.log

# Settings for Ia32X64 only:

ISO=...
DISK=.../disk-ia32x64.img
FW=.../Build/Ovmf3264/DEBUG_GCC5/FV/OVMF_CODE.fd
TEMPLATE=.../Build/Ovmf3264/DEBUG_GCC5/FV/OVMF_VARS.fd
VARS=vars-3264.fd
QEMU_COMMAND=qemu-system-x86_64
DEBUG=debug-3264.log

# Common commands for both target arches:

# create variable store from varstore template
# if the former doesn't exist yet
if ! [ -e "$VARS" ]; then
  cp -- "$TEMPLATE" "$VARS"
fi

$QEMU_COMMAND \
  -machine q35,smm=on,accel=kvm \
  -m 4096 \
  -smp sockets=1,cores=2,threads=2 \
  -global driver=cfi.pflash01,property=secure,value=on \
  -drive if=pflash,format=raw,unit=0,file=${FW},readonly=on \
  -drive if=pflash,format=raw,unit=1,file=${VARS} \
  \
  -chardev file,id=debugfile,path=$DEBUG \
  -device isa-debugcon,iobase=0x402,chardev=debugfile \
  \
  -chardev stdio,id=char0,signal=off,mux=on \
  -mon chardev=char0,mode=readline,default \
  -serial chardev:char0 \
  \
  -drive id=iso,if=none,format=raw,readonly,file=$ISO \
  -drive id=disk,if=none,format=qcow2,file=$DISK \
  \
  -device virtio-scsi-pci,id=scsi0 \
  -device scsi-cd,drive=iso,bus=scsi0.0,bootindex=2 \
  -device scsi-hd,drive=disk,bus=scsi0.0,bootindex=1 \
  \
  -device VGA

This will capture the OVMF debug output in the $DEBUG file. Also, the
terminal where you run the command can be switched between the guest's
serial console and the QEMU monitor with [Ctrl-A C].

Thanks
Laszlo

>
> Thank you
> Yao Jiewen
>
> From: edk2-devel [mailto:edk2-devel-boun...@lists.01.org] On Behalf Of Laszlo 
> Ersek
> Sent: Tuesday, November 8, 2016 9:22 AM
> To: Yao, Jiewen <jiewen....@intel.com<mailto:jiewen....@intel.com>>
> Cc: Tian, Feng <feng.t...@intel.com<mailto:feng.t...@intel.com>>; 
> edk2-de...@ml01.01.org<mailto:edk2-de...@ml01.01.org>; Kinney, Michael D 
> <michael.d.kin...@intel.com<mailto:michael.d.kin...@intel.com>>; Paolo 
> Bonzini <pbonz...@redhat.com<mailto:pbonz...@redhat.com>>; Fan, Jeff 
> <jeff....@intel.com<mailto:jeff....@intel.com>>; Zeng, Star 
> <star.z...@intel.com<mailto:star.z...@intel.com>>
> Subject: Re: [edk2] [PATCH V2 0/6] Enable SMM page level protection.
>
> On 11/04/16 10:30, Jiewen Yao wrote:
>> ==== below is V2 description ====
>> 1) PiSmmCpu: resolve OVMF multiple processors boot hang issue.
>> 2) PiSmmCpu: Add debug info on StartupAp() fails.
>> 3) PiSmmCpu: Add ASSERT for AllocatePages().
>> 4) PiSmmCpu: Add protection detail in commit message.
>> 5) UefiCpuPkg.dsc: Add page table footprint info in commit message.
>>
>> ==== below is V1 description ====
>> This series patch enables SMM page level protection.
>> Features are:
>> 1) PiSmmCore reports SMM PE image code/data information
>> in EdkiiPiSmmMemoryAttributeTable, if the SMM image is page aligned.
>> 2) PiSmmCpu consumes EdkiiPiSmmMemoryAttributeTable
>> and set XD for data page and RO for code page.
>> 3) PiSmmCpu enables Static Paging for X64 according to
>> PcdCpuSmmStaticPageTable. If it is true, 1G paging for above 4G
>> is used as long as it is supported.
>> 4) PiSmmCpu sets importance data structure to be read only,
>> such as Gdt, Idt, SmmEntrypoint, and PageTable itself.
>>
>> tested platform:
>> 1) Intel internal platform (X64).
>> 2) EDKII Quark IA32
>> 3) EDKII Vlv2  X64
>> 4) EDKII OVMF IA32 and IA32X64. (with -smp 8)
>>
>> Cc: Jeff Fan 
>> <jeff....@intel.com<mailto:jeff....@intel.com<mailto:jeff....@intel.com%3cmailto:jeff....@intel.com>>>
>> Cc: Feng Tian 
>> <feng.t...@intel.com<mailto:feng.t...@intel.com<mailto:feng.t...@intel.com%3cmailto:feng.t...@intel.com>>>
>> Cc: Star Zeng 
>> <star.z...@intel.com<mailto:star.z...@intel.com<mailto:star.z...@intel.com%3cmailto:star.z...@intel.com>>>
>> Cc: Michael D Kinney 
>> <michael.d.kin...@intel.com<mailto:michael.d.kin...@intel.com<mailto:michael.d.kin...@intel.com%3cmailto:michael.d.kin...@intel.com>>>
>> Cc: Laszlo Ersek 
>> <ler...@redhat.com<mailto:ler...@redhat.com<mailto:ler...@redhat.com%3cmailto:ler...@redhat.com>>>
>> Contributed-under: TianoCore Contribution Agreement 1.0
>> Signed-off-by: Jiewen Yao 
>> <jiewen....@intel.com<mailto:jiewen....@intel.com<mailto:jiewen....@intel.com%3cmailto:jiewen....@intel.com>>>
>
> I have new test results. Let's start with the table again:
>
> Legend:
>
> - "untested" means the test was not executed because the same test
>   failed or proved unreliable in a less demanding configuration already,
>
> - "n/a" means a setting or test case was impossible,
>
> - "fail" and "unreliable" (lower case) are outside the scope of this
>   series; they either capture the pre-series status, or are expected
>   even with the series applied due to the pre-series status,
>
> - "FAIL" and "UNRELIABLE" mean regressions caused (or exposed) by the
>   series.
>
> In all cases, 36 bits were used as address width in the CPU HOB (--> up
> to 64GB guest-phys address space).
>
>    series  OVMF                                                              
> VCPU     boot       S3 resume
>  # applied platform PcdCpuMaxLogicalProcessorNumber PcdCpuSmmStaticPageTable 
> topology result     result
> -- ------- -------- ------------------------------- ------------------------ 
> -------- ------     ---------
>  1 no      Ia32      64                             n/a                      
> 1x2x2    pass       unreliable
>  2 no      Ia32     255                             n/a                      
> 52x2x2   pass       untested
>  3 no      Ia32     255                             n/a                      
> 53x2x2   unreliable untested
>  4 no      Ia32X64   64                             n/a                      
> 1x2x2    pass       unreliable
>  5 no      Ia32X64  255                             n/a                      
> 52x2x2   pass       untested
>  6 no      Ia32X64  255                             n/a                      
> 54x2x2   fail       n/a
>  7 v2      Ia32      64                             FALSE                    
> 1x2x2    pass       untested
>  8 v2      Ia32      64                             TRUE                     
> 1x2x2    FAIL       untested
>  9 v2      Ia32     255                             FALSE                    
> 52x2x2   pass       untested
> 10 v2      Ia32     255                             FALSE                    
> 53x2x2   untested   untested
> 11 v2      Ia32     255                             TRUE                     
> 52x2x2   untested   untested
> 12 v2      Ia32     255                             TRUE                     
> 53x2x2   untested   untested
> 13 v2      Ia32X64   64                             FALSE                    
> 1x2x2    pass       unreliable
> 14 v2      Ia32X64   64                             TRUE                     
> 1x2x2    pass       untested
> 15 v2      Ia32X64  255                             FALSE                    
> 52x2x2   pass       untested
> 16 v2      Ia32X64  255                             FALSE                    
> 54x2x2   untested   untested
> 17 v2      Ia32X64  255                             TRUE                     
> 52x2x2   FAIL       untested
> 18 v2      Ia32X64  255                             TRUE                     
> 54x2x2   untested   untested
>
> * Case 8: this test case failed with v2 as well, but this time with
>   different symptoms:
>
>> FSOpen: Open '\EFI\fedora\grubia32.efi' Success
>> InstallProtocolInterface: [EfiLoadedImageProtocol] 7E4037A8
>> Loading driver at 0x0007DA7F000 EntryPoint=0x0007DA7F400
>> InstallProtocolInterface: [EfiLoadedImageDevicePathProtocol] 7E403A90
>> PixelBlueGreenRedReserved8BitPerColor
>> ConvertPages: Incompatible memory types
>> PixelBlueGreenRedReserved8BitPerColor
>> ConvertPages: Incompatible memory types
>> SmmInstallProtocolInterface: [EdkiiSmmExitBootServicesProtocol] 0
>> MpInitExitBootServicesCallback() done!
>> ASSERT MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(73): 
>> SpinLock != ((void *) 0)
>
>   I didn't try to narrow this down.
>
> * Case 13 (the "unreliable S3 resume" case): Here the news are both bad
>   and good. The good news is for Jiewen: this patch series does not
>   cause the unreliability, it "only" amplifies it severely. The bad news
>   is correspondingly for everyone else: S3 resume is actually unreliable
>   even in case 4, that is, without this series applied, it's just the
>   failure rate is much-much lower.
>
>   Namely, in my new testing, in case 13, S3 resume failed 8 times out of
>   21 tries. (I stopped testing at the 8th failure.)
>
>   Whereas in case 4, S3 resume failed with *identical symptoms* (e.g.,
>   exact same RIP=000000000009f0fd), 1 time out of 55 tries. (It was try
>   #12 that failed; I continued testing and aborted the test after the
>   55th try.)
>
>   So, while the series hugely amplifies the failure rate, the failure
>   does exist without the series. Which is why I modified the case 4
>   results in the table, and also lower-cased the word "unreliable" in
>   case 13.
>
>   Below I will return to this problem separately; let's go over the rest
>   of the table first.
>
> * Case 17: I guess this is not a real failure, I'm just including it for
>   completeness, as PcdCpuSmmStaticPageTable==TRUE is known to present
>   additional SMRAM demand (see the commit message on patch V2 4/6). This
>   case fails with
>
>> SmmLockBox Command - 4
>> SmmLockBoxSmmLib SetLockBoxAttributes - Enter
>> SmmLockBoxSmmLib SetLockBoxAttributes - Exit (Success)
>> SmmLockBox SmmLockBoxHandler Exit
>> SmmLockBoxDxeLib SetLockBoxAttributes - Exit (Success)
>> SmmInstallProtocolInterface: [EfiSmmReadyToLockProtocol] 0
>> ASSERT UefiCpuPkg/PiSmmCpuDxeSmm/CpuS3.c(892): mGdtForAp != ((void *) 0)
>
>   which is an SMRAM allocation failure. If I lower the VCPU count to
>   50x2x2, then the guest boots fine.
>
> ----*----
>
> Before I get to the S3 resume problem (which, again, reproduces without
> this series, although much less frequently), I'd like to comment on the
> removal of the ASSERT(), from v1 to v2, in the FlushTlbForAll()
> function, on the return value of SmmBlockingStartupThisAp(). This change
> allows v2 to proceed past that point; however, I'm seeing a whole lot of
>
>> !mSmmMpSyncData->CpuData[1].Present
>> !mSmmMpSyncData->CpuData[2].Present
>> !mSmmMpSyncData->CpuData[3].Present
>> ...
>
> messages in the OVMF boot log, interspersed with
>
>> ConvertPageEntryAttribute 0x7F92B067->0x7F92B065
>
> style messages. (That is, one error message for each AP, per
> ConvertPageEntryAttribute() message.)
>
> Is this okay / intentional? The number of these messages can go up to
> several thousands and that sort of drowns out everything else in the
> log.
>
> It's also not easy to mask the message, because it's logged on the
> DEBUG_ERROR level.
>
> ----*----
>
> * Okay, so the S3 problem. Last time I suspected that the failure point
>   (RIP=9f0fd) was in the Linux guest's S3 wakeup vector (which starts at
>   9A1D0, according to the OVMF log). In order to test this idea, I
>   exercised this series with S3 against a Windows 8.1 guest (--> case 13
>   again). The failure reproduced on the second S3 resume, with identical
>   RIP, despite the Windows wakeup vector being located elsewhere (at
>   0x1000).
>
>   Quoting the OVMF log leading up to the resume:
>
>> Call AsmDisablePaging64() to return to S3 Resume in PEI Phase
>> Install PPI: [PeiPostScriptTablePpi]
>> Install PPI: [EfiEndOfPeiSignalPpi]
>> Notify: PPI Guid: [EfiEndOfPeiSignalPpi], Peim notify entry point: 857895
>> PeiMpInitLib: CpuMpEndOfPeiCallback () invoked
>> Transfer to 16bit OS waking vector - 1000
>
>   QEMU log (same as before):
>
>> KVM internal error. Suberror: 1
>> KVM internal error. Suberror: 1
>> emulation failure
>> emulation failure
>> RAX=0000000000000002 RBX=0000000000000000 RCX=000000007ffdb168 
>> RDX=000000007fedd058
>> RSI=0000000000000004 RDI=000000007fedd040 RBP=0000000000000000 
>> RSP=000000007e1a7000
>> R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 
>> R11=0000000000000000
>> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 
>> R15=0000000000000000
>> RIP=000000000009f0fd RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
>> ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>> CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
>> SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>> DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>> FS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>> GS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>> LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
>> TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
>> GDT=     000000007f294000 00000047
>> IDT=     000000007f294048 00000fff
>> CR0=e0000011 CR2=0000000000000000 CR3=000000007ff7f000 CR4=00000220
>> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 
>> DR3=0000000000000000
>> DR6=00000000ffff0ff0 DR7=0000000000000400
>> EFER=0000000000000500
>> Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? 
>> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? 
>> ?? ??
>> RAX=0000000000000001 RBX=0000000000000000 RCX=000000007ffdb168 
>> RDX=000000007fedd070
>> RSI=0000000000000004 RDI=000000007fedd058 RBP=0000000000000000 
>> RSP=000000007e19f000
>> R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 
>> R11=0000000000000000
>> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 
>> R15=0000000000000000
>> RIP=000000000009f0fd RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
>> ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>> CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
>> SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>> DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>> FS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>> GS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>> LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
>> TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
>> GDT=     000000007f294000 00000047
>> IDT=     000000007f294048 00000fff
>> CR0=e0000011 CR2=0000000000000000 CR3=000000007ff7f000 CR4=00000220
>> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 
>> DR3=0000000000000000
>> DR6=00000000ffff0ff0 DR7=0000000000000400
>> EFER=0000000000000500
>> Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? 
>> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? 
>> ?? ??
>
>   So, we can exclude the suspicion that the problem is guest OS
>   dependent.
>
> * Then I looked for the base address of the page containing the
>   RIP=9f0fd address, in earlier parts of the OVMF log, on the hunch that
>   some firmware component might have allocated that area actually. Here
>   we go:
>
>> Loading PEIM at 0x000008552C0 EntryPoint=0x000008554E0 CpuMpPei.efi
>> AP Loop Mode is 1
>> WakeupBufferStart = 9F000, WakeupBufferSize = 1000
>
>   That is, the failure hits (when it hits -- not always) in the area
>   where the CpuMpPei driver *borrows* memory for the startup vector of
>   the APs, for the purposes of the MP service PPI. ("Wakeup" is an
>   overloaded word here; the "wakeup buffer" has nothing to do with S3
>   resume, it just serves for booting the APs temporarily in PEI, for
>   implementing the MP service PPI.)
>
>   When exiting the PEI phase (on the S3 resume path), CpuMpPei restores
>   the original contents of this area. This occurs just before
>   transfering control to the guest OS wakeup vector: see the
>   "EfiEndOfPeiSignalPpi" and "CpuMpEndOfPeiCallback" strings just above,
>   quoted from the OVMF log.
>
>   I documented (parts of) this logic in OVMF commit
>
>     https://github.com/tianocore/edk2/commit/e3e3090a959a0
>
>   (see the code comments as well).
>
> * At that time, I thought to have identified a memory management bug in
>   CpuMpPei; see the following discussion and bug report for details:
>
>     https://www.mail-archive.com/edk2-devel@lists.01.org/msg13892.html
>     https://bugzilla.tianocore.org/show_bug.cgi?id=67
>
>   However, with the extraction / introduction of MpInitLib, this issue
>   has been fixed: GetWakeupBuffer() now calls
>   CheckOverlapWithAllocatedBuffer(), so that "memory management bug" is
>   no longer; we shouldn't be looking there for the root cause.
>
> * Either way, I don't understand why anything would want to execute code
>   in the one page that happens to host the MP services PPI startup
>   buffer for APs during PEI.
>
>   Not understanding the "why", I looked at the "what", and resorted to
>   tracing KVM. Because the problem readily reproduces with this series
>   applied (case 13), it wasn't hard to start the tracing while the guest
>   was suspended, and capture just the actions that led from the
>   KVM-level wakeup to the failure.
>
>   The QEMU state dumps are visible above in the email. I've also
>   uploaded the compressed OVMF log and the textual KVM trace here:
>
>     
> http://people.redhat.com/lersek/s3-crash-8d1dfed7-ca92-4e25-8d2b-b1c9ac2a53db/
>
>   I sincerely hope that Paolo will have a field day with the KVM trace
>   :) I managed to identify the following curiosities (remember this is
>   all on the S3 resume path):
>
>   * First, the VCPUs (there are four of them) enter and leave SMM in a
>     really funky pattern:
>
>       vcpu#0  vcpu#1  vcpu#2  vcpu#3
>       ------  ------  ------  ------
>               enter
>                |
>               leave
>
>                       enter
>                         |
>                       leave
>
>                               enter
>                                 |
>                               leave
>
>       enter
>         |
>       leave
>
>               enter           enter
>        enter    |     enter     |
>          |      |       |       |
>        leave    |       |       |
>                 |       |       |
>        enter    |       |       |
>          |      |       |       |
>        leave  leave   leave   leave
>
>     That is, first we have each VCPU enter and leave SMM in complete
>     isolation (1, 2, 3, 0). Then VCPUs 1 and 3 enter SMM together, soon
>     followed by VPCUS 0 and 2, also together. VCPU#0 drops out of SMM
>     temporarily (it comes back in later), while the other three remain
>     in SMM. Finally all four of them leave SMM together.
>
>     After which the problem occurs.
>
>   * Second, the instruction that causes things to blow up is <0f aa>,
>     i.e., RSM. I have absolutely no clue why RSM is executed:
>
>     (a) in the area that used to host the AP startup routine for the MP
>     services PPI -- note that we also have "Transfer to 16bit OS waking
>     vector" in the log, so CpuMpEndOfPeiCallback() restores the borrowed
>     area completeley! --,
>
>     (b) and why *after* all four VCPUs have just left SMM, together.
>
>   * The RSM instruction is handled successfully elsewhere, for example
>     when all four VCPUs leave SMM, at the bottom of the diagram above:
>
>> CPU-24447 [002] 39841.982810: kvm_emulate_insn:     0:7ffbf179: 0f aa
>> CPU-24446 [000] 39841.982810: kvm_emulate_insn:     0:7ffbd179: 0f aa
>> CPU-24445 [005] 39841.982810: kvm_emulate_insn:     0:7ffbb179: 0f aa
>> CPU-24444 [006] 39841.982811: kvm_emulate_insn:     0:7ffb9179: 0f aa
>
>   * The guest-phys address 7ff7f000 that we see just before the error:
>
>> CPU-24447 [002] 39841.982825: kvm_page_fault:       address 7ff7f000 
>> error_code 83
>> CPU-24446 [000] 39841.982825: kvm_page_fault:       address 7ff7f000 
>> error_code 83
>> CPU-24447 [002] 39841.982826: kvm_emulate_insn:     0:9f0fd: 0f aa
>> CPU-24444 [006] 39841.982827: kvm_exit:             reason 
>> EXTERNAL_INTERRUPT rip 0xffffffff813a954f info 0 800000fc
>> CPU-24447 [002] 39841.982827: kvm_emulate_insn:     0:9f0fd: 0f aa FAIL
>> CPU-24447 [002] 39841.982827: kvm_userspace_exit:   reason 
>> KVM_EXIT_INTERNAL_ERROR (17)
>
>     can be found higher up in the trace; namely, it is written to CR3
>     several times. It's the root of the page tables.
>
>   * The 7F80_1000..7FFF_FFFF guest-phys addresses are all in SMRAM.
>
> * I also tried the "info tlb" monitor command, via "virsh
>   qemu-monitor-command --hmp", while the guest was auto-paused after the
>   crash.
>
>   I cannot provide results: QEMU appeared to return a message that would
>   be longer than 16MB after encoding by libvirt, and libvirt rejected
>   that ("Unable to encode message payload", see VIR_NET_MESSAGE_MAX).
>
>   Anyway, the KVM trace, and the QEMU register dump, look consistent
>   with what Paolo said about "Code=?? ?? ??...":
>
>     The question marks usually mean that the page tables do not map a
>     page at that address.
>
>   CR3=000000007ff7f000 points into SMRAM, but we are outside of SMM
>   (SMM=0). We can't translate *any* guest-virtual address, as we can't
>   even begin walking the page tables.
>
> Thanks
> Laszlo
> _______________________________________________
> edk2-devel mailing list
> edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org%3cmailto:edk2-devel@lists.01.org>>
> https://lists.01.org/mailman/listinfo/edk2-devel
> _______________________________________________
> edk2-devel mailing list
> edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>
> https://lists.01.org/mailman/listinfo/edk2-devel
>
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel

Reply via email to