On 11/02/15 23:49, Laszlo Ersek wrote:
> On 11/02/15 23:29, Jordan Justen wrote:
>> On 2015-11-02 12:16:12, Laszlo Ersek wrote:
>>> On 11/02/15 19:53, Alex Williamson wrote:
>>>> On Mon, 2015-11-02 at 19:31 +0100, Laszlo Ersek wrote:
>>>>> On 10/31/15 18:50, Laszlo Ersek wrote:
>>>>>
>>>>>> I'm very sorry, but I don't think I can spend time on this,
>>>>>> unless someone gives me ssh and/or console access to a host that
>>>>>> readily reproduces the bug, with the latest kvm/master, qemu, and
>>>>>> ekd2 builds.
>>>>>
>>>>> We just got lucky, the problem reproduces on my oldie workstation
>>>>> (HP Z400, (R) Xeon(R) CPU W3550 @ 3.07GHz,
>>>>> family/model/stepping/microcode = 6/26/5/0x19).
>>>>>
>>>>> OVMF works fine with the most recent Fedora 21 host kernel,
>>>>> "kernel-4.1.10-100.fc21.x86_64", and it breaks on the crispy new
>>>>> upstream Linux 4.3 release.
>>>>>
>>>>> The OVMF log contains a number of failed ASSERT()s, printed by the
>>>>> APs, in a way that is very characteristic of multi-threading bugs:
>>>>>
>>>>> --*--
>>>>>
>>>>> Detect CPU count: 2
>>>>> AASSSSEERRTT  
>>>>> //hhoomem/el/alcaocso/ss/rscr/cu/pusptsrteraema/me/dekd2k-2g-igti-ts-vsnv/nM/dMedPekPgk/gL/iLbirbarrayr/yB/aBsaesSeySnycnhcrhornoiznatiizoantLiiobn/LSiybn/cShyrnocnhirzoantiizoantGicocn.Gcc(c1.9c8()1:9
>>>>>  8L)o:c kLVoaclkuVea l=u=e  (=(=U I(N(TUNI)N T2N))  |2|)  L|o|c 
>>>>> kLVoaclkuVea l=u=e  (=(=U I(N(TUNI)N T1N))^M
>>>>> 1)
>>>>> ASSERT 
>>>>> /home/lacos/src/upstream/edk2-git-svn/MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(198):
>>>>>  LockValue == ((UINTN) 2) || LockValue == ((UINTN) 1)
>>>>>
>>>>> --*--
>>>>>
>>>>> (Note that the above is for a 8 VCPU guest. Cf. "Detect CPU count:
>>>>> 2".)
>>>>>
>>>>>
>>>>> Maybe important, maybe not: my workstation's IOMMU does *not* have
>>>>> snoop control:
>>>>>
>>>>> [    3.043297] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap 
>>>>> c90780106f0462 ecap f02076
>>>>>
>>>>> Where 0xf02076 & 0x80 == 0.
>>>>>
>>>>> Whereas my laptop *does* have Snoop Control:
>>>>>
>>>>> [    0.030486] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap 
>>>>> c0000020660462 ecap f0101a
>>>>> [    0.030491] DMAR: dmar1: reg_base_addr fed91000 ver 1:0 cap 
>>>>> d2008020660462 ecap f010da
>>>>>
>>>>> The first IOMMU doesn't have Snoop Control, but its scope only
>>>>> extends to the integrated Intel Graphics Device. The second IOMMU,
>>>>> which covers the SD card reader I used for testing (and all other
>>>>> PCI devices), *does* have Snoop Control. 0xf010da & 0x80 = 0x80.
>>>>>
>>>>> However, SC support should not matter, due to kernel commit
>>>>> fb279950ba02e3210a16b11ecfa8871f3ee0ca49, so I think we might be
>>>>> looking an independent issue. (Possibly in CpuDxe, only exposed by
>>>>> the new host
>>>>> kernel.)
>>>>
>>>> Snoop Control is what I immediately jumped to as well, but I've
>>>> never seen the problem on my desktop, which reports the following
>>>> ecaps: f0101a & f0105a.  Thanks,
>>>
>>> This issue is actually very easy to reproduce on my workstation. It
>>> has 1 socket, 4 cores, 2 threads/core. If I start a guest with 32
>>> VCPUs (i.e., heavily over-subscribing the host), then the issue hits
>>> immediately. Device assignment is not necessary.
>>>
>>> But, it also reproduces fairly regularly with just 8 VCPUs (even
>>> with VCPUs pinned to separate logical processors on the host).
>>>
>>> Judging purely from the KVM trace captured during such a 32-VCPU
>>> startup, I'm suspecting a general thread scheduling (or signaling)
>>> change sometime in the v4.1..v4.2 interval. It takes *extremely*
>>> long after the second SIPI (which should wake the AP threads) until
>>> the first "kvm_entry" message shows up for the last nonzero VCPU.
>>>
>>> Below is an excerpt. The first line shows the 2nd SIPI being
>>> delivered to the last VCPU. The lines after that show the first time
>>> that each nonzero VCPU reports in with "kvm_entry".
>>>
>>>  qemu-system-x86-19114 [005]  6668.931183: kvm_apic_accept_irq:  apicid 1f 
>>> vec 159 (SIPI|edge)
>>> ...
>>>  qemu-system-x86-19136  [006]  6668.931953:  kvm_entry:  vcpu  22
>>>  qemu-system-x86-19133  [003]  6668.931953:  kvm_entry:  vcpu  19
>>>  qemu-system-x86-19138  [007]  6668.931953:  kvm_entry:  vcpu  24
>>>  qemu-system-x86-19118  [004]  6668.934967:  kvm_entry:  vcpu  4
>>>  qemu-system-x86-19119  [006]  6668.941502:  kvm_entry:  vcpu  5
>>>  qemu-system-x86-19121  [004]  6668.944957:  kvm_entry:  vcpu  7
>>>  qemu-system-x86-19124  [001]  6668.944957:  kvm_entry:  vcpu  10
>>>  qemu-system-x86-19137  [000]  6668.948561:  kvm_entry:  vcpu  23
>>>  qemu-system-x86-19140  [002]  6668.948561:  kvm_entry:  vcpu  26
>>>  qemu-system-x86-19131  [003]  6668.948561:  kvm_entry:  vcpu  17
>>>  qemu-system-x86-19122  [007]  6668.948561:  kvm_entry:  vcpu  8
>>>  qemu-system-x86-19120  [001]  6668.964832:  kvm_entry:  vcpu  6
>>>  qemu-system-x86-19142  [002]  6668.964962:  kvm_entry:  vcpu  28
>>>  qemu-system-x86-19125  [000]  6668.964964:  kvm_entry:  vcpu  11
>>>  qemu-system-x86-19139  [007]  6668.964965:  kvm_entry:  vcpu  25
>>>  qemu-system-x86-19123  [002]  6668.968954:  kvm_entry:  vcpu  9
>>>  qemu-system-x86-19130  [000]  6668.968955:  kvm_entry:  vcpu  16
>>>  qemu-system-x86-19144  [000]  6668.974965:  kvm_entry:  vcpu  30
>>>  qemu-system-x86-19132  [003]  6668.974967:  kvm_entry:  vcpu  18
>>>  qemu-system-x86-19126  [002]  6668.981415:  kvm_entry:  vcpu  12
>>>  qemu-system-x86-19143  [003]  6668.981416:  kvm_entry:  vcpu  29
>>>  qemu-system-x86-19134  [000]  6668.981418:  kvm_entry:  vcpu  20
>>>  qemu-system-x86-19127  [003]  6669.002546:  kvm_entry:  vcpu  13
>>>  qemu-system-x86-19117  [000]  6669.004972:  kvm_entry:  vcpu  3
>>>  qemu-system-x86-19135  [003]  6669.004973:  kvm_entry:  vcpu  21
>>>  qemu-system-x86-19145  [002]  6669.004974:  kvm_entry:  vcpu  31
>>>  qemu-system-x86-19116  [005]  6669.005002:  kvm_entry:  vcpu  2
>>>  qemu-system-x86-19128  [000]  6669.014976:  kvm_entry:  vcpu  14
>>>  qemu-system-x86-19129  [002]  6669.015058:  kvm_entry:  vcpu  15
>>>  qemu-system-x86-19115  [002]  6669.029626:  kvm_entry:  vcpu  1
>>>  qemu-system-x86-19141  [005]  6669.034981:  kvm_entry:  vcpu  27
>>>
>>> The last VCPU (#27) reports in 6669.034981 - 6668.931183 = 103.798
>>> ms after the second SIPI's delivery is completed.
>>>
>>> That is about twice as long as the virtual BSP is willing to wait
>>> for the virtual AP detection to complete. (The default timeout is 50
>>> ms.)
>>
>> This doesn't appear to be reproducing the same issue that Janusz is
>> seeing.
>
> You might be correct, but I can only work with what I can reproduce.
>
>> In that case even a *10 second* timeout was not enough for 8
>> processors to start up.
>
> Note that I didn't try to await until the AP enumeration *completed*
> in the problematic cases. The only thing I "measured" in the above was
> the (nearly) *first* sign of life for each VCPU, after the
> INIT-SIPI-SIPI. Your timeout depends on the *last* relevant act of the
> "slowest" CPU.
>
> In any case, even if this issue is independent, the success vs.
> failure question is pretty clear on my workstation, wherever I am in
> the bisection. Assuming I end up with a relevant-looking culprit
> commit, we could do something about it, and ask Janusz to test it too.

- I bisected v4.1..v4.3.
- I never assigned any device to the guest. In fact this was my
  sophisticated test script:

> OVMF_BIN=/home/vm-images/firmware/OVMF_CODE.fd
> VARSTORE=/root/vars.fd
> LOG=/root/ovmf.log
>
> /usr/bin/qemu-system-x86_64 \
>   -machine accel=kvm \
>   -smp cpus=10 \
>   -m 4096 \
>   -device qxl-vga \
>   -drive if=pflash,format=raw,unit=0,file="$OVMF_BIN",readonly=on \
>   -drive if=pflash,format=raw,unit=1,file="$VARSTORE" \
>   -debugcon file:"$LOG" \
>   -global isa-debugcon.iobase=0x402 \
>   -net none

As I said, my workstation has 8 logical PCPUs (1*4*2). Using the above
minimal oversubscription (10 VCPUs), and requiring 10 successful OVMF
boots to qualify a kernel build "good" during bisection, I got *very
clear* decisions for each build, and the result is:

> commit b18d5431acc7a2fd22767925f3a6f597aa4bd29e
> Author: Xiao Guangrong <guangrong.x...@linux.intel.com>
> Date:   Mon Jun 15 16:55:21 2015 +0800
>
>     KVM: x86: fix CR0.CD virtualization
>
>     Currently, CR0.CD is not checked when we virtualize memory cache type for
>     noncoherent_dma guests, this patch fixes it by :
>
>     - setting UC for all memory if CR0.CD = 1
>     - zapping all the last sptes in MMU if CR0.CD is changed
>
>     Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
>     Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
>
> :040000 040000 a08dd8837198d53aab75569607c23e8dd05bb257 
> 25c21bcc6ca57f76477e7d0c540213909d9d000b M      arch

Here's the bisection log:

> git bisect start
> # bad: [6a13feb9c82803e2b815eca72fa7a9f5561d7861] Linux 4.3
> git bisect bad 6a13feb9c82803e2b815eca72fa7a9f5561d7861
> # good: [b953c0d234bc72e8489d3bf51a276c5c4ec85345] Linux 4.1
> git bisect good b953c0d234bc72e8489d3bf51a276c5c4ec85345
> # bad: [31191a85fb875cf123cea56bbfd34f4b941f3c79] perf report: Add support 
> for srcfile sort key
> git bisect bad 31191a85fb875cf123cea56bbfd34f4b941f3c79
> # bad: [5b4ca4447757019f11a601b0009534ef84bed801] Merge tag 'media/v4.2-1' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
> git bisect bad 5b4ca4447757019f11a601b0009534ef84bed801
> # bad: [8a8c35fadfaf55629a37ef1a8ead1b8fb32581d2] mm: kmemleak_alloc_percpu() 
> should follow the gfp from per_alloc()
> git bisect bad 8a8c35fadfaf55629a37ef1a8ead1b8fb32581d2
> # good: [10b4b096d0c7e9f1b5f84c2a0658b2963e1e6ed0] Merge tag 'gpio-v4.2-1' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
> git bisect good 10b4b096d0c7e9f1b5f84c2a0658b2963e1e6ed0
> # good: [36a1624d8844b6c165daf61649e6b68c02d0835f] Merge tag 'for-4.2' of 
> git://git.infradead.org/battery-2.6
> git bisect good 36a1624d8844b6c165daf61649e6b68c02d0835f
> # good: [4b1f2af6752a4cc9acc1c22ddf3842478965f113] Merge branch 'for-linus' 
> of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
> git bisect good 4b1f2af6752a4cc9acc1c22ddf3842478965f113
> # good: [08d183e3c1f650b4db1d07d764502116861542fa] Merge tag 'powerpc-4.2-1' 
> of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux
> git bisect good 08d183e3c1f650b4db1d07d764502116861542fa
> # bad: [4e241557fc1cb560bd9e77ca1b4a9352732a5427] Merge tag 'for-linus' of 
> git://git.kernel.org/pub/scm/virt/kvm/kvm
> git bisect bad 4e241557fc1cb560bd9e77ca1b4a9352732a5427
> # good: [0ad83caa21bff9f383c10e73b452425709573042] x86: kvmclock: set 
> scheduler clock stable
> git bisect good 0ad83caa21bff9f383c10e73b452425709573042
> # good: [05fe125fa3237de2ec5bada80031e694de78909c] Merge tag 
> 'kvm-arm-for-4.2' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
> git bisect good 05fe125fa3237de2ec5bada80031e694de78909c
> # bad: [19efffa244071ccd0385b240d03adb38feaab04e] KVM: MTRR: sort variable 
> MTRRs
> git bisect bad 19efffa244071ccd0385b240d03adb38feaab04e
> # bad: [eb839917a75207b89799e3500211163cb6de0dea] KVM: MTRR: handle 
> MSR_MTRRcap in kvm_mtrr_get_msr
> git bisect bad eb839917a75207b89799e3500211163cb6de0dea
> # good: [0b8ba4a2b65860ac1fcd5be6cfca5d3b27a5b170] KVM: fix checkpatch.pl 
> errors in kvm/coalesced_mmio.h
> git bisect good 0b8ba4a2b65860ac1fcd5be6cfca5d3b27a5b170
> # bad: [b18d5431acc7a2fd22767925f3a6f597aa4bd29e] KVM: x86: fix CR0.CD 
> virtualization
> git bisect bad b18d5431acc7a2fd22767925f3a6f597aa4bd29e
> # good: [f104765b4f81fd74d69e0eb161e89096deade2db] KVM: nSVM: Check for NRIPS 
> support before updating control field
> git bisect good f104765b4f81fd74d69e0eb161e89096deade2db
> # first bad commit: [b18d5431acc7a2fd22767925f3a6f597aa4bd29e] KVM: x86: fix 
> CR0.CD virtualization

(Side note: I had to fix up one of the early builds with "git
cherry-pick --no-commit 82819ffb42fb45197bacf3223191deca31d3eb91",
because otherwise the build failed, but that's not relevant here.)

So, what does this mean? Because we've *always* known about this commit
(after Xiao told us about it)! Importantly, what does it mean that the
bug reproduces *without* an assigned device, and even in *v4.3*, where
Xiao's fixup

> commit fb279950ba02e3210a16b11ecfa8871f3ee0ca49
> Author: Xiao Guangrong <guangrong.x...@intel.com>
> Date:   Thu Jul 16 03:25:56 2015 +0800
>
>     KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED

is in place (starting with v4.2-rc4)?

It means that the original commit b18d5431acc7 modified *two* functions
(namely vmx_get_mt_mask() and kvm_set_cr0()), and that commit fb279950ba
fixed up *only one* of them (vmx_get_mt_mask()).

This code remains unchanged in v4.3, from b18d5431acc7:

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 43f0df7..43fdb10 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -621,6 +621,10 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
>
>       if ((cr0 ^ old_cr0) & update_bits)
>               kvm_mmu_reset_context(vcpu);
> +
> +     if ((cr0 ^ old_cr0) & X86_CR0_CD)
> +             kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
> +
>       return 0;
>  }
>  EXPORT_SYMBOL_GPL(kvm_set_cr0);

Xiao actually pointed this out (look at the first paragraph):

> Kernel commit b18d5431acc causes the vCPUs need more time to startup
> as:
> - it zaps all the mappings for the guest memory in EPT or shadow page
>    table, it requires VM-exits to rebuild the mappings for all memory
>    access.
>
> - if there is device passthrough-ed in guest and IOMMU lacks snooping
>    control feature, the memory will become UC after CR0.CD is set to
>    1.

We've been focusing on the second paragraph (vmx_get_mt_mask()), but
apparently the kvm_set_cr0() change, which was not quirked, has a large
impact on some hosts.

To me an interesting aspect of the above hunk is that *any one* VCPU
flipping its CR0.CD will re-set nested paging *globally*, which will
likely impact all other VCPUs as well, I guess.

Plus, with kvm_zap_gfn_range() taking the guest-global kvm->mmu_lock
spin-lock, perhaps there is a bit of host-side lock contention too, when
the APs start up all at once.

Anyway, with the following host kernel change, the AP startup problem
goes away (tested on top of v4.3):

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9a9a198..4f978ad 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -622,7 +622,8 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
>       if ((cr0 ^ old_cr0) & update_bits)
>               kvm_mmu_reset_context(vcpu);
>
> -     if ((cr0 ^ old_cr0) & X86_CR0_CD)
> +     if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED) &&
> +         (cr0 ^ old_cr0) & X86_CR0_CD)
>               kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
>
>       return 0;

(Honestly I just imitated fb279950ba here; I'm not making any better
argument for this diff. But, independently, I wonder why this hunk
didn't have the noncoherent DMA check either, originally.)

Janusz, could you rebuild your host kernel with this patch, and share
the results? (I'm also attaching the same as a formatted patch, so you
can apply it with "git am" easily.)

Thanks
Laszlo
>From a5b5aa46bc6bf98f171b2a20c252164a66bb8ceb Mon Sep 17 00:00:00 2001
From: Laszlo Ersek <ler...@redhat.com>
Date: Tue, 3 Nov 2015 01:45:58 +0100
Subject: [PATCH] KVM: x86: obey KVM_X86_QUIRK_CD_NW_CLEARED in kvm_set_cr0().

Commit b18d5431acc7 ("KVM: x86: fix CR0.CD virtualization") was
technically correct, but it broke OVMF guests by slowing down various
parts of the firmware.

Commit fb279950ba02 ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED") quirked the
first function modified by b18d5431acc7, vmx_get_mt_mask(), for OVMF's
sake. This restored the speed of the OVMF code that runs before
PlatformPei (including the memory intensive LZMA decompression in SEC).

This patch extends the quirk to the second function modified by
b18d5431acc7, kvm_set_cr0(). This should eliminate the intrusive slowdown
in the CpuDxe AP startup.

Signed-off-by: Laszlo Ersek <ler...@redhat.com>
---
 arch/x86/kvm/x86.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9a9a198..4f978ad 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -622,7 +622,8 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 	if ((cr0 ^ old_cr0) & update_bits)
 		kvm_mmu_reset_context(vcpu);
 
-	if ((cr0 ^ old_cr0) & X86_CR0_CD)
+	if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED) &&
+	    (cr0 ^ old_cr0) & X86_CR0_CD)
 		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
 
 	return 0;
-- 
1.8.3.1

_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel

Reply via email to