On 12/04/13 18:17, Jordan Justen wrote:
> On Tue, Dec 3, 2013 at 9:46 PM, Laszlo Ersek <ler...@redhat.com> wrote:
>>> Install PPI: 88C9D306-0900-4EB5-8260-3E2DBEDA1F89
>>> Install PPI: 605EA650-C65C-42E1-BA80-91A52AB618C6
>>
>> These two PPIs (BootScriptDonePpi, EndOfPeiPpi) are installed by
>> S3ResumeBootOs() early.
>>
>>> Transfer to 16bit OS waking vector - 9A1D0
>>
>> In my testing anyway, which is case #2 from the above list, the
>> AsmTransferControl() assembly code is invoked, implemented in
>> MdeModulePkg/Universal/Acpi/BootScriptExecutorDxe/X64/S3Asm.S.
>>
>>> SecCoreStartupWithStack(0xFFFCC000, 0x80000)
>>
>> And, it reboots the virtual machine. It's an infinite reboot loop
>> actually, always going through the same resume path and rebooting in the
>> same spot.
>
> Does the OS waking vector appear to have some code?
>
> I can think of one *major* issue for OVMF S3. We run PEI from RAM.
> This allows us to compress the PEI code.
>
> Therefore, on S3 resume, we will overwrite a big chunk of memory when
> we decompress the main FV. This is going to trash a big chunk of RAM
> likely used by the OS. It starts at PcdOvmfMemFvBase and has a size of
> PcdOvmfMemFvSize + 2MB. (The 2MB is actually closer to 1MB, and is
> some extra RAM used during decompression).
>
> Oh, and one other chunk of RAM we use is SEC/PEI Temporary RAM at
> 0x70000-0x80000. (SecMain.c SecCoreStartupWithStack)
>
> I guess the first quick hack would be to reserve these memory ranges
> so the OS will not use them. (EfiACPIMemoryNVS) (I *very briefly*
> looked over your patches, and I didn't notice them taking this into
> account. Sorry if I missed it.)

No, you're right. I did think of this briefly, but I didn't have the
details. Thanks for them, I'll certainly have to fix this.

However, the reboot problem manifests *before* we jump to the waking
vector. AsmTransferControl() consists of several parts, and the reboot
happens somewhere in the middle, when we try to switch from one part to
the next. Actually there's no "leap" intended there, the code
progresses linearly (as a human would read it); the LRET is only needed
because we want to reload CS with another selector value (0x18). The
GDT is set up earlier and 0x18 is valid.

I tested whether the (intended) target location of the LRET is reached,
and it is not. (It's easy to test by adding a small infinite loop,
moving it around, and seeing if the VM is spinning with or without
producing a bunch of output on the debug port.) It's *really* that
internally-targeted LRET that causes a reboot.

I did some KVM tracing last night too, and it doesn't even seem to
trap. I also tested today on my AMD/SVM box, and the behavior is
identical.

And, now I have some idea why it happens.

I hacked PlatformPei and S3Resume2Pei to read the CS and log it. (Of
course without seeing the actual GDT entries that these values point to
we can't say much, but we can say something.) I also added a small
count-down loop in gcc inline assembly that uses LRETQ for the
iteration -- I load CS to RAX, push RAX, compute the RIP in RAX and
push that too, then LRETQ.

In PlatformPei, after cold boot *or* resume, Cs=0x18, and the loop with
LRETQ works.

In S3Resume2Pei, after resume, Cs=0x0418, and the loop with LRETQ still
works (IOW whatever GDT we use, it has a good entry for the 0x0418
selector).

But, there's this section in S3ResumeExecuteBootScript():

  InterruptStatus = SaveAndDisableInterrupts ();
  //
  // Need to make sure the GDT is loaded with values that support long mode and real mode.
  //
  AsmWriteGdtr (&mGdt);
  //
  // update segment selectors per the new GDT.
  //
  AsmSetDataSelectors (DATA_SEGEMENT_SELECTOR);
  //
  // Restore interrupt state.
  //
  SetInterruptState (InterruptStatus);

This code doesn't change the value of CS, it remains 0x0418. But
immediately after this code, the first LRETQ (for which I push the
current CS, ie. CS=0x0418) triggers a reboot. Which is probably
justified, because the new GDT has no entry for the 0x0418 selector at
all. (The highest selector under mGdt.mGdtEntries is 0x40.)

I *guess* when doing an LRETQ, the *old* CS value matters too, for
comparing privilege levels between old and new, or whatever. (The code
in AsmTransferControl() doesn't try to reuse the current CS, it simply
sets a new, *valid* CS. But, apparently, having a busted *old* CS could
suffice for a reboot.)

Thanks
Laszlo
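
For reference, the selector arithmetic behind the "no entry for 0x0418"
observation can be sketched in C as below. This is only an illustration,
not edk2 code: decode_selector() and selector_in_table() are
hypothetical helper names, and the limit of 0x47 used in the usage note
assumes that the new GDT holds nine 8-byte descriptors (selectors 0x00
through 0x40). The sketch checks only whether a selector falls inside
the table; it does not model privilege checks or the guess about the
old CS value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fields of an x86 segment selector. */
typedef struct {
    uint16_t index; /* descriptor index, selector bits 15..3 */
    bool     ldt;   /* table indicator, bit 2: false = GDT, true = LDT */
    uint8_t  rpl;   /* requested privilege level, bits 1..0 */
} selector_fields;

/* Split a 16-bit selector into its three fields. */
static selector_fields decode_selector(uint16_t sel)
{
    selector_fields f;

    f.index = (uint16_t)(sel >> 3);
    f.ldt   = (sel & 0x4) != 0;
    f.rpl   = (uint8_t)(sel & 0x3);
    return f;
}

/*
 * Return true if the 8-byte descriptor named by `sel` lies entirely
 * within a descriptor table whose limit (offset of the last valid
 * byte) is `limit`.
 */
static bool selector_in_table(uint16_t sel, uint16_t limit)
{
    /* Byte offset of the descriptor: the selector with the low 3 bits cleared. */
    uint32_t first_byte = (uint32_t)(sel & ~0x7u);

    return first_byte + 7u <= limit;
}
```

With the assumed limit of 0x47, selector_in_table(0x18, 0x47) and
selector_in_table(0x40, 0x47) are true, while
selector_in_table(0x0418, 0x47) is false: 0x0418 decodes to GDT index
0x83 with RPL 0, far beyond a nine-entry table.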