On 2015-11-02 17:14:12, Laszlo Ersek wrote:
> On 11/02/15 23:49, Laszlo Ersek wrote:
> > On 11/02/15 23:29, Jordan Justen wrote:
> >> On 2015-11-02 12:16:12, Laszlo Ersek wrote:
> >>> On 11/02/15 19:53, Alex Williamson wrote:
> >>>> On Mon, 2015-11-02 at 19:31 +0100, Laszlo Ersek wrote:
> >>>>> On 10/31/15 18:50, Laszlo Ersek wrote:
> >>>>>
> >>>>>> I'm very sorry, but I don't think I can spend time on this,
> >>>>>> unless someone gives me ssh and/or console access to a host that
> >>>>>> readily reproduces the bug, with the latest kvm/master, qemu, and
> >>>>>> ekd2 builds.
> >>>>>
> >>>>> We just got lucky, the problem reproduces on my oldie workstation
> >>>>> (HP Z400, (R) Xeon(R) CPU W3550 @ 3.07GHz,
> >>>>> family/model/stepping/microcode = 6/26/5/0x19).
> >>>>>
> >>>>> OVMF works fine with the most recent Fedora 21 host kernel,
> >>>>> "kernel-4.1.10-100.fc21.x86_64", and it breaks on the crispy new
> >>>>> upstream Linux 4.3 release.
> >>>>>
> >>>>> The OVMF log contains a number of failed ASSERT()s, printed by the
> >>>>> APs, in a way that is very characteristic of multi-threading bugs:
> >>>>>
> >>>>> --*--
> >>>>>
> >>>>> Detect CPU count: 2
> >>>>> AASSSSEERRTT  
> >>>>> //hhoomem/el/alcaocso/ss/rscr/cu/pusptsrteraema/me/dekd2k-2g-igti-ts-vsnv/nM/dMedPekPgk/gL/iLbirbarrayr/yB/aBsaesSeySnycnhcrhornoiznatiizoantLiiobn/LSiybn/cShyrnocnhirzoantiizoantGicocn.Gcc(c1.9c8()1:9
> >>>>>  8L)o:c kLVoaclkuVea l=u=e  (=(=U I(N(TUNI)N T2N))  |2|)  L|o|c 
> >>>>> kLVoaclkuVea l=u=e  (=(=U I(N(TUNI)N T1N))^M
> >>>>> 1)
> >>>>> ASSERT 
> >>>>> /home/lacos/src/upstream/edk2-git-svn/MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(198):
> >>>>>  LockValue == ((UINTN) 2) || LockValue == ((UINTN) 1)
> >>>>>
> >>>>> --*--
> >>>>>
> >>>>> (Note that the above is for a 8 VCPU guest. Cf. "Detect CPU count:
> >>>>> 2".)
> >>>>>
> >>>>>
> >>>>> Maybe important, maybe not: my workstation's IOMMU does *not* have
> >>>>> snoop control:
> >>>>>
> >>>>> [    3.043297] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap 
> >>>>> c90780106f0462 ecap f02076
> >>>>>
> >>>>> Where 0xf02076 & 0x80 == 0.
> >>>>>
> >>>>> Whereas my laptop *does* have Snoop Control:
> >>>>>
> >>>>> [    0.030486] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap 
> >>>>> c0000020660462 ecap f0101a
> >>>>> [    0.030491] DMAR: dmar1: reg_base_addr fed91000 ver 1:0 cap 
> >>>>> d2008020660462 ecap f010da
> >>>>>
> >>>>> The first IOMMU doesn't have Snoop Control, but its scope only
> >>>>> extends to the integrated Intel Graphics Device. The second IOMMU,
> >>>>> which covers the SD card reader I used for testing (and all other
> >>>>> PCI devices), *does* have Snoop Control. 0xf010da & 0x80 = 0x80.
> >>>>>
> >>>>> However, SC support should not matter, due to kernel commit
> >>>>> fb279950ba02e3210a16b11ecfa8871f3ee0ca49, so I think we might be
> >>>>> looking an independent issue. (Possibly in CpuDxe, only exposed by
> >>>>> the new host
> >>>>> kernel.)
> >>>>
> >>>> Snoop Control is what I immediately jumped to as well, but I've
> >>>> never seen the problem on my desktop, which reports the following
> >>>> ecaps: f0101a & f0105a.  Thanks,
> >>>
> >>> This issue is actually very easy to reproduce on my workstation. It
> >>> has 1 socket, 4 cores, 2 threads/core. If I start a guest with 32
> >>> VCPUs (i.e., heavily over-subscribing the host), then the issue hits
> >>> immediately. Device assignment is not necessary.
> >>>
> >>> But, it also reproduces fairly regularly with just 8 VCPUs (even
> >>> with VCPUs pinned to separate logical processors on the host).
> >>>
> >>> Judging purely from the KVM trace captured during such a 32-VCPU
> >>> startup, I'm suspecting a general thread scheduling (or signaling)
> >>> change sometime in the v4.1..v4.2 interval. It takes *extremely*
> >>> long after the second SIPI (which should wake the AP threads) until
> >>> the first "kvm_entry" message shows up for the last nonzero VCPU.
> >>>
> >>> Below is an excerpt. The first line shows the 2nd SIPI being
> >>> delivered to the last VCPU. The lines after that show the first time
> >>> that each nonzero VCPU reports in with "kvm_entry".
> >>>
> >>>  qemu-system-x86-19114 [005]  6668.931183: kvm_apic_accept_irq:  apicid 
> >>> 1f vec 159 (SIPI|edge)
> >>> ...
> >>>  qemu-system-x86-19136  [006]  6668.931953:  kvm_entry:  vcpu  22
> >>>  qemu-system-x86-19133  [003]  6668.931953:  kvm_entry:  vcpu  19
> >>>  qemu-system-x86-19138  [007]  6668.931953:  kvm_entry:  vcpu  24
> >>>  qemu-system-x86-19118  [004]  6668.934967:  kvm_entry:  vcpu  4
> >>>  qemu-system-x86-19119  [006]  6668.941502:  kvm_entry:  vcpu  5
> >>>  qemu-system-x86-19121  [004]  6668.944957:  kvm_entry:  vcpu  7
> >>>  qemu-system-x86-19124  [001]  6668.944957:  kvm_entry:  vcpu  10
> >>>  qemu-system-x86-19137  [000]  6668.948561:  kvm_entry:  vcpu  23
> >>>  qemu-system-x86-19140  [002]  6668.948561:  kvm_entry:  vcpu  26
> >>>  qemu-system-x86-19131  [003]  6668.948561:  kvm_entry:  vcpu  17
> >>>  qemu-system-x86-19122  [007]  6668.948561:  kvm_entry:  vcpu  8
> >>>  qemu-system-x86-19120  [001]  6668.964832:  kvm_entry:  vcpu  6
> >>>  qemu-system-x86-19142  [002]  6668.964962:  kvm_entry:  vcpu  28
> >>>  qemu-system-x86-19125  [000]  6668.964964:  kvm_entry:  vcpu  11
> >>>  qemu-system-x86-19139  [007]  6668.964965:  kvm_entry:  vcpu  25
> >>>  qemu-system-x86-19123  [002]  6668.968954:  kvm_entry:  vcpu  9
> >>>  qemu-system-x86-19130  [000]  6668.968955:  kvm_entry:  vcpu  16
> >>>  qemu-system-x86-19144  [000]  6668.974965:  kvm_entry:  vcpu  30
> >>>  qemu-system-x86-19132  [003]  6668.974967:  kvm_entry:  vcpu  18
> >>>  qemu-system-x86-19126  [002]  6668.981415:  kvm_entry:  vcpu  12
> >>>  qemu-system-x86-19143  [003]  6668.981416:  kvm_entry:  vcpu  29
> >>>  qemu-system-x86-19134  [000]  6668.981418:  kvm_entry:  vcpu  20
> >>>  qemu-system-x86-19127  [003]  6669.002546:  kvm_entry:  vcpu  13
> >>>  qemu-system-x86-19117  [000]  6669.004972:  kvm_entry:  vcpu  3
> >>>  qemu-system-x86-19135  [003]  6669.004973:  kvm_entry:  vcpu  21
> >>>  qemu-system-x86-19145  [002]  6669.004974:  kvm_entry:  vcpu  31
> >>>  qemu-system-x86-19116  [005]  6669.005002:  kvm_entry:  vcpu  2
> >>>  qemu-system-x86-19128  [000]  6669.014976:  kvm_entry:  vcpu  14
> >>>  qemu-system-x86-19129  [002]  6669.015058:  kvm_entry:  vcpu  15
> >>>  qemu-system-x86-19115  [002]  6669.029626:  kvm_entry:  vcpu  1
> >>>  qemu-system-x86-19141  [005]  6669.034981:  kvm_entry:  vcpu  27
> >>>
> >>> The last VCPU (#27) reports in 6669.034981 - 6668.931183 = 103.798
> >>> ms after the second SIPI's delivery is completed.
> >>>
> >>> That is about twice as long as the virtual BSP is willing to wait
> >>> for the virtual AP detection to complete. (The default timeout is 50
> >>> ms.)
> >>
> >> This doesn't appear to be reproducing the same issue that Janusz is
> >> seeing.
> >
> > You might be correct, but I can only work with what I can reproduce.
> >
> >> In that case even a *10 second* timeout was not enough for 8
> >> processors to start up.
> >
> > Note that I didn't try to await until the AP enumeration *completed*
> > in the problematic cases. The only thing I "measured" in the above was
> > the (nearly) *first* sign of life for each VCPU, after the
> > INIT-SIPI-SIPI. Your timeout depends on the *last* relevant act of the
> > "slowest" CPU.
> >
> > In any case, even if this issue is independent, the success vs.
> > failure question is pretty clear on my workstation, wherever I am in
> > the bisection. Assuming I end up with a relevant-looking culprit
> > commit, we could do something about it, and ask Janusz to test it too.
> 
> - I bisected v4.1..v4.3.
> - I never assigned any device to the guest. In fact this was my
>   sophisticated test script:
> 
> > OVMF_BIN=/home/vm-images/firmware/OVMF_CODE.fd
> > VARSTORE=/root/vars.fd
> > LOG=/root/ovmf.log
> >
> > /usr/bin/qemu-system-x86_64 \
> >   -machine accel=kvm \
> >   -smp cpus=10 \
> >   -m 4096 \
> >   -device qxl-vga \
> >   -drive if=pflash,format=raw,unit=0,file="$OVMF_BIN",readonly=on \
> >   -drive if=pflash,format=raw,unit=1,file="$VARSTORE" \
> >   -debugcon file:"$LOG" \
> >   -global isa-debugcon.iobase=0x402 \
> >   -net none
> 
> As I said, my workstation has 8 logical PCPUs (1*4*2). Using the above
> minimal oversubscription (10 VCPUs), and requiring 10 successful OVMF
> boots to qualify a kernel build "good" during bisection, I got *very
> clear* decisions for each build, and the result is:
> 
> > commit b18d5431acc7a2fd22767925f3a6f597aa4bd29e
> > Author: Xiao Guangrong <guangrong.x...@linux.intel.com>
> > Date:   Mon Jun 15 16:55:21 2015 +0800
> >
> >     KVM: x86: fix CR0.CD virtualization
> >
> >     Currently, CR0.CD is not checked when we virtualize memory cache type 
> > for
> >     noncoherent_dma guests, this patch fixes it by :
> >
> >     - setting UC for all memory if CR0.CD = 1
> >     - zapping all the last sptes in MMU if CR0.CD is changed
> >
> >     Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
> >     Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> >
> > :040000 040000 a08dd8837198d53aab75569607c23e8dd05bb257 
> > 25c21bcc6ca57f76477e7d0c540213909d9d000b M      arch
> 
> Here's the bisection log:
> 
> > git bisect start
> > # bad: [6a13feb9c82803e2b815eca72fa7a9f5561d7861] Linux 4.3
> > git bisect bad 6a13feb9c82803e2b815eca72fa7a9f5561d7861
> > # good: [b953c0d234bc72e8489d3bf51a276c5c4ec85345] Linux 4.1
> > git bisect good b953c0d234bc72e8489d3bf51a276c5c4ec85345
> > # bad: [31191a85fb875cf123cea56bbfd34f4b941f3c79] perf report: Add support 
> > for srcfile sort key
> > git bisect bad 31191a85fb875cf123cea56bbfd34f4b941f3c79
> > # bad: [5b4ca4447757019f11a601b0009534ef84bed801] Merge tag 'media/v4.2-1' 
> > of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
> > git bisect bad 5b4ca4447757019f11a601b0009534ef84bed801
> > # bad: [8a8c35fadfaf55629a37ef1a8ead1b8fb32581d2] mm: 
> > kmemleak_alloc_percpu() should follow the gfp from per_alloc()
> > git bisect bad 8a8c35fadfaf55629a37ef1a8ead1b8fb32581d2
> > # good: [10b4b096d0c7e9f1b5f84c2a0658b2963e1e6ed0] Merge tag 'gpio-v4.2-1' 
> > of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
> > git bisect good 10b4b096d0c7e9f1b5f84c2a0658b2963e1e6ed0
> > # good: [36a1624d8844b6c165daf61649e6b68c02d0835f] Merge tag 'for-4.2' of 
> > git://git.infradead.org/battery-2.6
> > git bisect good 36a1624d8844b6c165daf61649e6b68c02d0835f
> > # good: [4b1f2af6752a4cc9acc1c22ddf3842478965f113] Merge branch 'for-linus' 
> > of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
> > git bisect good 4b1f2af6752a4cc9acc1c22ddf3842478965f113
> > # good: [08d183e3c1f650b4db1d07d764502116861542fa] Merge tag 
> > 'powerpc-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux
> > git bisect good 08d183e3c1f650b4db1d07d764502116861542fa
> > # bad: [4e241557fc1cb560bd9e77ca1b4a9352732a5427] Merge tag 'for-linus' of 
> > git://git.kernel.org/pub/scm/virt/kvm/kvm
> > git bisect bad 4e241557fc1cb560bd9e77ca1b4a9352732a5427
> > # good: [0ad83caa21bff9f383c10e73b452425709573042] x86: kvmclock: set 
> > scheduler clock stable
> > git bisect good 0ad83caa21bff9f383c10e73b452425709573042
> > # good: [05fe125fa3237de2ec5bada80031e694de78909c] Merge tag 
> > 'kvm-arm-for-4.2' of 
> > git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
> > git bisect good 05fe125fa3237de2ec5bada80031e694de78909c
> > # bad: [19efffa244071ccd0385b240d03adb38feaab04e] KVM: MTRR: sort variable 
> > MTRRs
> > git bisect bad 19efffa244071ccd0385b240d03adb38feaab04e
> > # bad: [eb839917a75207b89799e3500211163cb6de0dea] KVM: MTRR: handle 
> > MSR_MTRRcap in kvm_mtrr_get_msr
> > git bisect bad eb839917a75207b89799e3500211163cb6de0dea
> > # good: [0b8ba4a2b65860ac1fcd5be6cfca5d3b27a5b170] KVM: fix checkpatch.pl 
> > errors in kvm/coalesced_mmio.h
> > git bisect good 0b8ba4a2b65860ac1fcd5be6cfca5d3b27a5b170
> > # bad: [b18d5431acc7a2fd22767925f3a6f597aa4bd29e] KVM: x86: fix CR0.CD 
> > virtualization
> > git bisect bad b18d5431acc7a2fd22767925f3a6f597aa4bd29e
> > # good: [f104765b4f81fd74d69e0eb161e89096deade2db] KVM: nSVM: Check for 
> > NRIPS support before updating control field
> > git bisect good f104765b4f81fd74d69e0eb161e89096deade2db
> > # first bad commit: [b18d5431acc7a2fd22767925f3a6f597aa4bd29e] KVM: x86: 
> > fix CR0.CD virtualization
> 
> (Side note: I had to fix up one of the early builds with "git
> cherry-pick --no-commit 82819ffb42fb45197bacf3223191deca31d3eb91",
> because otherwise the build failed, but that's not relevant here.)
> 
> So, what does this mean? Because we've *always* known about this commit
> (after Xiao told us about it)! Importantly, what does it mean that the
> bug reproduces *without* an assigned device, and even in *v4.3*, where
> Xiao's fixup
> 
> > commit fb279950ba02e3210a16b11ecfa8871f3ee0ca49
> > Author: Xiao Guangrong <guangrong.x...@intel.com>
> > Date:   Thu Jul 16 03:25:56 2015 +0800
> >
> >     KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED
> 
> is in place (starting with v4.2-rc4)?
> 
> It means that the original commit b18d5431acc7 modified *two* functions
> (namely vmx_get_mt_mask() and kvm_set_cr0()), and that commit fb279950ba
> fixed up *only one* of them (vmx_get_mt_mask()).
> 
> This code remains unchanged in v4.3, from b18d5431acc7:
> 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 43f0df7..43fdb10 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -621,6 +621,10 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long 
> > cr0)
> >
> >       if ((cr0 ^ old_cr0) & update_bits)
> >               kvm_mmu_reset_context(vcpu);
> > +
> > +     if ((cr0 ^ old_cr0) & X86_CR0_CD)
> > +             kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
> > +
> >       return 0;
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_set_cr0);
> 
> Xiao actually pointed this out (look at the first paragraph):
> 
> > Kernel commit b18d5431acc causes the vCPUs need more time to startup
> > as:
> > - it zaps all the mappings for the guest memory in EPT or shadow page
> >    table, it requires VM-exits to rebuild the mappings for all memory
> >    access.
> >
> > - if there is device passthrough-ed in guest and IOMMU lacks snooping
> >    control feature, the memory will become UC after CR0.CD is set to
> >    1.
> 
> We've been focusing on the second paragraph (vmx_get_mt_mask()), but
> apparently the kvm_set_cr0() change, which was not quirked, has a large
> impact on some hosts.
> 
> To me an interesting aspect of the above hunk is that *any one* VCPU
> flipping its CR0.CD will re-set nested paging *globally*, which will
> likely impact all other VCPUs as well, I guess.
> 
> Plus, with kvm_zap_gfn_range() taking the guest-global kvm->mmu_lock
> spin-lock, perhaps there is a bit of host-side lock contention too, when
> the APs start up all at once.
> 
> Anyway, with the following host kernel change, the AP startup problem
> goes away (tested on top of v4.3):
> 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 9a9a198..4f978ad 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -622,7 +622,8 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long 
> > cr0)
> >       if ((cr0 ^ old_cr0) & update_bits)
> >               kvm_mmu_reset_context(vcpu);
> >
> > -     if ((cr0 ^ old_cr0) & X86_CR0_CD)
> > +     if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED) &&
> > +         (cr0 ^ old_cr0) & X86_CR0_CD)
> >               kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
> >
> >       return 0;
> 
> (Honestly I just imitated fb279950ba here; I'm not making any better
> argument for this diff. But, independently, I wonder why this hunk
> didn't have the noncoherent DMA check either, originally.)
> 
> Janusz, could you rebuild your host kernel with this patch, and share
> the results? (I'm also attaching the same as a formatted patch, so you
> can apply it with "git am" easily.)

One note related to decompression caching. I also created this patch:

https://github.com/jljusten/edk2/commit/7a1d3799

In my https://github.com/jljusten/edk2/tree/decompress-mtrr branch.

Infernix tried it and it didn't seem to totally fix the issue. (See
comments on https://github.com/tianocore/edk2/issues/21) I don't know
if Janusz got a chance to try it.

I didn't clean it up and send it out, since I didn't have any evidence
that it helped...

-Jordan
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel

Reply via email to