On 11/02/15 19:53, Alex Williamson wrote:
> On Mon, 2015-11-02 at 19:31 +0100, Laszlo Ersek wrote:
>> On 10/31/15 18:50, Laszlo Ersek wrote:
>>
>>> I'm very sorry, but I don't think I can spend time on this, unless
>>> someone gives me ssh and/or console access to a host that readily
>>> reproduces the bug, with the latest kvm/master, qemu, and edk2
>>> builds.
>>
>> We just got lucky: the problem reproduces on my aging workstation (HP Z400,
>> Intel(R) Xeon(R) CPU W3550 @ 3.07GHz, family/model/stepping/microcode =
>> 6/26/5/0x19).
>>
>> OVMF works fine with the most recent Fedora 21 host kernel, 
>> "kernel-4.1.10-100.fc21.x86_64", and it breaks on the crispy new upstream 
>> Linux 4.3 release.
>>
>> The OVMF log contains a number of failed ASSERT()s, printed by the APs, in a 
>> way that is very characteristic of multi-threading bugs:
>>
>> --*--
>>
>> Detect CPU count: 2
>> AASSSSEERRTT  
>> //hhoomem/el/alcaocso/ss/rscr/cu/pusptsrteraema/me/dekd2k-2g-igti-ts-vsnv/nM/dMedPekPgk/gL/iLbirbarrayr/yB/aBsaesSeySnycnhcrhornoiznatiizoantLiiobn/LSiybn/cShyrnocnhirzoantiizoantGicocn.Gcc(c1.9c8()1:9
>>  8L)o:c kLVoaclkuVea l=u=e  (=(=U I(N(TUNI)N T2N))  |2|)  L|o|c kLVoaclkuVea 
>> l=u=e  (=(=U I(N(TUNI)N T1N))^M 
>> 1)
>> ASSERT 
>> /home/lacos/src/upstream/edk2-git-svn/MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(198):
>>  LockValue == ((UINTN) 2) || LockValue == ((UINTN) 1)
>>
>> --*--
>>
>> (Note that the above is for an 8 VCPU guest. Cf. "Detect CPU count: 2".)
>>
>>
>> Maybe important, maybe not: my workstation's IOMMU does *not* have snoop 
>> control:
>>
>> [    3.043297] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap c90780106f0462 ecap f02076
>>
>> Where 0xf02076 & 0x80 == 0.
>>
>> Whereas my laptop *does* have Snoop Control:
>>
>> [    0.030486] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap c0000020660462 ecap f0101a
>> [    0.030491] DMAR: dmar1: reg_base_addr fed91000 ver 1:0 cap d2008020660462 ecap f010da
>>
>> The first IOMMU doesn't have Snoop Control, but its scope only extends to 
>> the integrated Intel Graphics Device. The second IOMMU, which covers the SD 
>> card reader I used for testing (and all other PCI devices), *does* have 
>> Snoop Control: 0xf010da & 0x80 == 0x80.
>>
>> However, SC support should not matter, due to kernel commit 
>> fb279950ba02e3210a16b11ecfa8871f3ee0ca49, so I think we might be looking at
>> an independent issue. (Possibly in CpuDxe, only exposed by the new host
>> kernel.)
> 
> Snoop Control is what I immediately jumped to as well, but I've never
> seen the problem on my desktop, which reports the following ecaps:
> f0101a & f0105a.  Thanks,
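
For reference, Snoop Control is bit 7 of the IOMMU's extended capability
register, so checking any of the ecap values from dmesg is a quick shell
exercise; for example, for the two units I quoted from my own machines above:

  $ echo $(( (0xf02076 >> 7) & 1 ))    # workstation dmar0: no SC
  0
  $ echo $(( (0xf010da >> 7) & 1 ))    # laptop dmar1: SC present
  1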

This issue is actually very easy to reproduce on my workstation. It has 1 
socket, 4 cores, 2 threads/core. If I start a guest with 32 VCPUs (i.e., 
heavily over-subscribing the host), then the issue hits immediately. Device 
assignment is not necessary.
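
For the record, the reproducer is roughly the following command line; the
machine type, memory size and image paths are incidental (placeholders here),
and the -debugcon options only serve to capture the OVMF debug log:

  qemu-system-x86_64 \
    -machine q35,accel=kvm \
    -smp 32 \
    -m 4096 \
    -drive if=pflash,format=raw,readonly=on,file=OVMF_CODE.fd \
    -drive if=pflash,format=raw,file=OVMF_VARS.fd \
    -debugcon file:ovmf.log -global isa-debugcon.iobase=0x402 \
    -display none -serial stdio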

But, it also reproduces fairly regularly with just 8 VCPUs (even with VCPUs 
pinned to separate logical processors on the host).
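
By "pinned" I mean a 1:1 mapping of each VCPU to its own logical processor;
with a libvirt-managed guest (the domain name "ovmf-test" below is just a
placeholder), that is simply:

  for v in $(seq 0 7); do virsh vcpupin ovmf-test "$v" "$v"; done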

Judging purely from the KVM trace captured during such a 32-VCPU startup, I
suspect a general thread scheduling (or signaling) change sometime in the
v4.1..v4.2 interval. It takes *extremely* long after the second SIPI (which
should wake the AP threads) until the first "kvm_entry" message shows up for
the last nonzero VCPU.
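
The trace itself was captured on the host with something along these lines,
started in a separate terminal and interrupted with Ctrl-C once the guest got
past (or gave up on) AP detection:

  trace-cmd record -e kvm -o trace.dat
  trace-cmd report trace.dat > trace.txt

The "kvm" event subsystem includes kvm_entry and kvm_apic_accept_irq, which is
all the excerpt below relies on.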

Below is an excerpt. The first line shows the 2nd SIPI being delivered to the 
last VCPU. The lines after that show the first time that each nonzero VCPU 
reports in with "kvm_entry".

 qemu-system-x86-19114 [005]  6668.931183: kvm_apic_accept_irq:  apicid 1f vec 159 (SIPI|edge)
...
 qemu-system-x86-19136  [006]  6668.931953:  kvm_entry:  vcpu  22
 qemu-system-x86-19133  [003]  6668.931953:  kvm_entry:  vcpu  19
 qemu-system-x86-19138  [007]  6668.931953:  kvm_entry:  vcpu  24
 qemu-system-x86-19118  [004]  6668.934967:  kvm_entry:  vcpu  4
 qemu-system-x86-19119  [006]  6668.941502:  kvm_entry:  vcpu  5
 qemu-system-x86-19121  [004]  6668.944957:  kvm_entry:  vcpu  7
 qemu-system-x86-19124  [001]  6668.944957:  kvm_entry:  vcpu  10
 qemu-system-x86-19137  [000]  6668.948561:  kvm_entry:  vcpu  23
 qemu-system-x86-19140  [002]  6668.948561:  kvm_entry:  vcpu  26
 qemu-system-x86-19131  [003]  6668.948561:  kvm_entry:  vcpu  17
 qemu-system-x86-19122  [007]  6668.948561:  kvm_entry:  vcpu  8
 qemu-system-x86-19120  [001]  6668.964832:  kvm_entry:  vcpu  6
 qemu-system-x86-19142  [002]  6668.964962:  kvm_entry:  vcpu  28
 qemu-system-x86-19125  [000]  6668.964964:  kvm_entry:  vcpu  11
 qemu-system-x86-19139  [007]  6668.964965:  kvm_entry:  vcpu  25
 qemu-system-x86-19123  [002]  6668.968954:  kvm_entry:  vcpu  9
 qemu-system-x86-19130  [000]  6668.968955:  kvm_entry:  vcpu  16
 qemu-system-x86-19144  [000]  6668.974965:  kvm_entry:  vcpu  30
 qemu-system-x86-19132  [003]  6668.974967:  kvm_entry:  vcpu  18
 qemu-system-x86-19126  [002]  6668.981415:  kvm_entry:  vcpu  12
 qemu-system-x86-19143  [003]  6668.981416:  kvm_entry:  vcpu  29
 qemu-system-x86-19134  [000]  6668.981418:  kvm_entry:  vcpu  20
 qemu-system-x86-19127  [003]  6669.002546:  kvm_entry:  vcpu  13
 qemu-system-x86-19117  [000]  6669.004972:  kvm_entry:  vcpu  3
 qemu-system-x86-19135  [003]  6669.004973:  kvm_entry:  vcpu  21
 qemu-system-x86-19145  [002]  6669.004974:  kvm_entry:  vcpu  31
 qemu-system-x86-19116  [005]  6669.005002:  kvm_entry:  vcpu  2
 qemu-system-x86-19128  [000]  6669.014976:  kvm_entry:  vcpu  14
 qemu-system-x86-19129  [002]  6669.015058:  kvm_entry:  vcpu  15
 qemu-system-x86-19115  [002]  6669.029626:  kvm_entry:  vcpu  1
 qemu-system-x86-19141  [005]  6669.034981:  kvm_entry:  vcpu  27
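
A list like the above is easy to extract from the full report; for example,
the awk one-liner below keeps the first kvm_entry line per VCPU thread. (On
this kernel the kvm_entry message is just "vcpu N", so the last field
identifies the VCPU; restrict the input to the time range after the second
SIPI if needed.)

  trace-cmd report trace.dat | awk '/kvm_entry:/ { if (!seen[$NF]++) print }'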

The last VCPU (#27) reports in 6669.034981 - 6668.931183 = 103.798 ms after the 
second SIPI's delivery is completed.

That is about twice as long as the virtual BSP is willing to wait for the 
virtual AP detection to complete. (The default timeout is 50 ms.)

For comparison, within that 50 ms, the virtual BSP manages to read the emulated
ACPI PM timer a whopping 6443 times! Each of those reads means a userspace
exit, with QEMU doing the emulation, and so on.
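
A heavily caveated sketch of how such a count can be pulled out of the same
trace: 19114 is the BSP's thread in this particular run, the window is just
the 50 ms following the second SIPI, and at this point of the boot essentially
every port I/O exit of the BSP is a PM timer read:

  trace-cmd report trace.dat |
    awk -v t0=6668.931183 -v t1=6668.981183 \
        '/kvm_pio:/ && $1 ~ /-19114$/ && $3+0 >= t0 && $3+0 < t1 { n++ }
         END { print n+0 }'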

I haven't compared this to v4.1 numbers (yet?), but the inter-thread latency 
seems quite bad. (I'm sure it boosts throughput though.)

I'll try to bisect this, in the v4.1..b18d5431acc range. Hopefully the problem
was introduced *before* b18d5431acc, which masks it with the very long LZMA
decompression time.
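
In git terms that is something like

  git bisect start b18d5431acc~1 v4.1    # <bad> <good>

taking the parent of the masking commit as the "bad" endpoint, then rebuilding
and rebooting the host kernel and retesting the 32-VCPU startup at each step.
(This only works out if the failure actually reproduces at that endpoint,
hence the "hopefully" above.)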

Thanks
Laszlo
