[Bug 1856335] Re: Cache Layout wrong on many Zen Arch CPUs

Heiko Sieger Fri, 10 Jul 2020 07:51:46 -0700

@Jan: this coreinfo output looks good.

I finally managed to get the core/cache alignment right, I believe:


  <vcpu placement="static" current="24">32</vcpu>
  <vcpus>
    <vcpu id="0" enabled="yes" hotpluggable="no"/>
    <vcpu id="1" enabled="yes" hotpluggable="yes"/>
    <vcpu id="2" enabled="yes" hotpluggable="yes"/>
    <vcpu id="3" enabled="yes" hotpluggable="yes"/>
    <vcpu id="4" enabled="yes" hotpluggable="yes"/>
    <vcpu id="5" enabled="yes" hotpluggable="yes"/>
    <vcpu id="6" enabled="no" hotpluggable="yes"/>
    <vcpu id="7" enabled="no" hotpluggable="yes"/>
    <vcpu id="8" enabled="yes" hotpluggable="yes"/>
    <vcpu id="9" enabled="yes" hotpluggable="yes"/>
    <vcpu id="10" enabled="yes" hotpluggable="yes"/>
    <vcpu id="11" enabled="yes" hotpluggable="yes"/>
    <vcpu id="12" enabled="yes" hotpluggable="yes"/>
    <vcpu id="13" enabled="yes" hotpluggable="yes"/>
    <vcpu id="14" enabled="no" hotpluggable="yes"/>
    <vcpu id="15" enabled="no" hotpluggable="yes"/>
    <vcpu id="16" enabled="yes" hotpluggable="yes"/>
    <vcpu id="17" enabled="yes" hotpluggable="yes"/>
    <vcpu id="18" enabled="yes" hotpluggable="yes"/>
    <vcpu id="19" enabled="yes" hotpluggable="yes"/>
    <vcpu id="20" enabled="yes" hotpluggable="yes"/>
    <vcpu id="21" enabled="yes" hotpluggable="yes"/>
    <vcpu id="22" enabled="no" hotpluggable="yes"/>
    <vcpu id="23" enabled="no" hotpluggable="yes"/>
    <vcpu id="24" enabled="yes" hotpluggable="yes"/>
    <vcpu id="25" enabled="yes" hotpluggable="yes"/>
    <vcpu id="26" enabled="yes" hotpluggable="yes"/>
    <vcpu id="27" enabled="yes" hotpluggable="yes"/>
    <vcpu id="28" enabled="yes" hotpluggable="yes"/>
    <vcpu id="29" enabled="yes" hotpluggable="yes"/>
    <vcpu id="30" enabled="no" hotpluggable="yes"/>
    <vcpu id="31" enabled="no" hotpluggable="yes"/>
  </vcpus>
  <cputune>
    <vcpupin vcpu="0" cpuset="0"/>
    <vcpupin vcpu="1" cpuset="12"/>
    <vcpupin vcpu="2" cpuset="1"/>
    <vcpupin vcpu="3" cpuset="13"/>
    <vcpupin vcpu="4" cpuset="2"/>
    <vcpupin vcpu="5" cpuset="14"/>
    <vcpupin vcpu="8" cpuset="3"/>
    <vcpupin vcpu="9" cpuset="15"/>
    <vcpupin vcpu="10" cpuset="4"/>
    <vcpupin vcpu="11" cpuset="16"/>
    <vcpupin vcpu="12" cpuset="5"/>
    <vcpupin vcpu="13" cpuset="17"/>
    <vcpupin vcpu="16" cpuset="6"/>
    <vcpupin vcpu="17" cpuset="18"/>
    <vcpupin vcpu="18" cpuset="7"/>
    <vcpupin vcpu="19" cpuset="19"/>
    <vcpupin vcpu="20" cpuset="8"/>
    <vcpupin vcpu="21" cpuset="20"/>
    <vcpupin vcpu="24" cpuset="9"/>
    <vcpupin vcpu="25" cpuset="21"/>
    <vcpupin vcpu="26" cpuset="10"/>
    <vcpupin vcpu="27" cpuset="22"/>
    <vcpupin vcpu="28" cpuset="11"/>
    <vcpupin vcpu="29" cpuset="23"/>
  </cputune>

...
  <cpu mode="host-passthrough" check="none">
    <topology sockets="1" dies="1" cores="16" threads="2"/>
    <cache mode="passthrough"/>
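
The pinning table above follows a regular pattern, so it can be generated rather than typed by hand. The following is only a sketch: the function name, its parameters and the default values (4 CCXs, 3 usable cores per CCX padded to 4 guest core slots, host SMT siblings numbered core + 12) are assumptions inferred from the XML above, not anything QEMU or libvirt provides.

```python
# Hypothetical helper: names and defaults are assumptions inferred from the
# pinning table in this post, not part of QEMU or libvirt.
def ccx_layout(ccx_count=4, cores_per_ccx=3, slots_per_ccx=4,
               threads=2, host_sibling_offset=12):
    enabled = {}   # guest vCPU id -> host CPU it is pinned to
    disabled = []  # guest vCPU ids left disabled as padding
    for ccx in range(ccx_count):
        for slot in range(slots_per_ccx):
            for thread in range(threads):
                vcpu = (ccx * slots_per_ccx + slot) * threads + thread
                if slot < cores_per_ccx:
                    host_core = ccx * cores_per_ccx + slot
                    # Assumed host numbering: sibling thread = core + offset.
                    enabled[vcpu] = host_core + thread * host_sibling_offset
                else:
                    disabled.append(vcpu)
    return enabled, disabled


pins, padding = ccx_layout()
for vcpu, cpu in sorted(pins.items()):
    print(f'<vcpupin vcpu="{vcpu}" cpuset="{cpu}"/>')
```

With the defaults this yields 24 pinned vCPUs plus 8 disabled padding vCPUs, 32 in total, which matches sockets * dies * cores * threads = 1 * 1 * 16 * 2 in the topology element.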


The Windows Coreinfo output is this:

Logical to Physical Processor Map:
**----------------  Physical Processor 0 (Hyperthreaded)
--**--------------  Physical Processor 1 (Hyperthreaded)
----**------------  Physical Processor 2 (Hyperthreaded)
------**----------  Physical Processor 3 (Hyperthreaded)
--------**--------  Physical Processor 4 (Hyperthreaded)
----------**------  Physical Processor 5 (Hyperthreaded)
------------**----  Physical Processor 6 (Hyperthreaded)
--------------**--  Physical Processor 7 (Hyperthreaded)
----------------**  Physical Processor 8 (Hyperthreaded)

Logical Processor to Socket Map:
******************  Socket 0

Logical Processor to NUMA Node Map:
******************  NUMA Node 0

No NUMA nodes.

Logical Processor to Cache Map:
**----------------  Data Cache          0, Level 1,   32 KB, Assoc   8, LineSize  64
**----------------  Instruction Cache   0, Level 1,   32 KB, Assoc   8, LineSize  64
**----------------  Unified Cache       0, Level 2,  512 KB, Assoc   8, LineSize  64
******------------  Unified Cache       1, Level 3,   16 MB, Assoc  16, LineSize  64
--**--------------  Data Cache          1, Level 1,   32 KB, Assoc   8, LineSize  64
--**--------------  Instruction Cache   1, Level 1,   32 KB, Assoc   8, LineSize  64
--**--------------  Unified Cache       2, Level 2,  512 KB, Assoc   8, LineSize  64
----**------------  Data Cache          2, Level 1,   32 KB, Assoc   8, LineSize  64
----**------------  Instruction Cache   2, Level 1,   32 KB, Assoc   8, LineSize  64
----**------------  Unified Cache       3, Level 2,  512 KB, Assoc   8, LineSize  64
------**----------  Data Cache          3, Level 1,   32 KB, Assoc   8, LineSize  64
------**----------  Instruction Cache   3, Level 1,   32 KB, Assoc   8, LineSize  64
------**----------  Unified Cache       4, Level 2,  512 KB, Assoc   8, LineSize  64
------******------  Unified Cache       5, Level 3,   16 MB, Assoc  16, LineSize  64
--------**--------  Data Cache          4, Level 1,   32 KB, Assoc   8, LineSize  64
--------**--------  Instruction Cache   4, Level 1,   32 KB, Assoc   8, LineSize  64
--------**--------  Unified Cache       6, Level 2,  512 KB, Assoc   8, LineSize  64
----------**------  Data Cache          5, Level 1,   32 KB, Assoc   8, LineSize  64
----------**------  Instruction Cache   5, Level 1,   32 KB, Assoc   8, LineSize  64
----------**------  Unified Cache       7, Level 2,  512 KB, Assoc   8, LineSize  64
------------**----  Data Cache          6, Level 1,   32 KB, Assoc   8, LineSize  64
------------**----  Instruction Cache   6, Level 1,   32 KB, Assoc   8, LineSize  64
------------**----  Unified Cache       8, Level 2,  512 KB, Assoc   8, LineSize  64
------------******  Unified Cache       9, Level 3,   16 MB, Assoc  16, LineSize  64
--------------**--  Data Cache          7, Level 1,   32 KB, Assoc   8, LineSize  64
--------------**--  Instruction Cache   7, Level 1,   32 KB, Assoc   8, LineSize  64
--------------**--  Unified Cache      10, Level 2,  512 KB, Assoc   8, LineSize  64
----------------**  Data Cache          8, Level 1,   32 KB, Assoc   8, LineSize  64
----------------**  Instruction Cache   8, Level 1,   32 KB, Assoc   8, LineSize  64
----------------**  Unified Cache      11, Level 2,  512 KB, Assoc   8, LineSize  64

Logical Processor to Group Map:
******************  Group 0
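
To check the L3 grouping without reading the masks by eye, the Level 3 lines of the cache map can be parsed mechanically. This is a minimal sketch, assuming the Coreinfo text format shown above; the function name `l3_groups` and the regular expression are illustrative, not part of Coreinfo. The sample input is copied from the Level 3 lines of the output above.

```python
import re

# Matches a cache-map line for a Level 3 cache: mask, then "Unified Cache <id>".
L3_LINE = re.compile(r"^([*-]+)\s+Unified Cache\s+(\d+), Level 3")


def l3_groups(coreinfo_text):
    """Return {cache id: [logical processor indices sharing that L3]}."""
    groups = {}
    for line in coreinfo_text.splitlines():
        m = L3_LINE.match(line.strip())
        if m:
            mask, cache_id = m.group(1), int(m.group(2))
            groups[cache_id] = [i for i, c in enumerate(mask) if c == "*"]
    return groups


sample = """\
******------------  Unified Cache       1, Level 3,   16 MB, Assoc  16, LineSize  64
------******------  Unified Cache       5, Level 3,   16 MB, Assoc  16, LineSize  64
------------******  Unified Cache       9, Level 3,   16 MB, Assoc  16, LineSize  64
"""
groups = l3_groups(sample)
for cache_id, cpus in groups.items():
    print(cache_id, cpus)
```

In the output above, each L3 covers six logical processors, i.e. three hyperthreaded cores, which is the 3-core-per-CCX grouping the configuration is aiming for.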


I haven't been able to test yet whether it performs as expected; I still
need to do that.

Of course it would be great if QEMU were patched to recognize the correct
CCX alignment, as I'm not sure whether this weird setup carries a penalty,
or how large it would be.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1856335

Title:
  Cache Layout wrong on many Zen Arch CPUs

Status in QEMU:
  New

Bug description:
  AMD CPUs have L3 cache per 2, 3 or 4 cores. Currently, TOPOEXT seems
  to always map the cache as if it were a 4-core-per-CCX CPU, which is
  incorrect and costs upwards of 30% performance (more realistically 10%)
  in applications that are aware of the L3 cache layout.

  Example on a 4-CCX CPU (1950X with 8 cores and no SMT):

    <cpu mode='custom' match='exact' check='full'>
      <model fallback='forbid'>EPYC-IBPB</model>
      <vendor>AMD</vendor>
      <topology sockets='1' cores='8' threads='1'/>

  In Windows, Coreinfo reports correctly:

  ****----  Unified Cache 1, Level 3,    8 MB, Assoc  16, LineSize  64
  ----****  Unified Cache 6, Level 3,    8 MB, Assoc  16, LineSize  64

  On a 3-CCX CPU (3960X with 6 cores and no SMT):

   <cpu mode='custom' match='exact' check='full'>
      <model fallback='forbid'>EPYC-IBPB</model>
      <vendor>AMD</vendor>
      <topology sockets='1' cores='6' threads='1'/>

  In Windows, Coreinfo reports incorrectly:

  ****--  Unified Cache  1, Level 3,    8 MB, Assoc  16, LineSize  64
  ----**  Unified Cache  6, Level 3,    8 MB, Assoc  16, LineSize  64
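
The two Coreinfo results above are what a fixed grouping of four cores per L3 produces. The sketch below is only an illustration of that arithmetic, not QEMU's actual code; the function name `l3_masks` is made up for this example.

```python
def l3_masks(cores, cores_per_ccx):
    """Build Coreinfo-style masks when cores are grouped per L3 in order."""
    masks = []
    for start in range(0, cores, cores_per_ccx):
        end = min(start + cores_per_ccx, cores)
        masks.append("-" * start + "*" * (end - start) + "-" * (cores - end))
    return masks


print(l3_masks(8, 4))  # 8 cores, 4 per L3: matches the correct report
print(l3_masks(6, 4))  # 6 cores, still 4 per L3: matches the incorrect report
print(l3_masks(6, 3))  # 6 cores, 3 per L3: grouping a 3-core CCX would need
```

With six cores, a 4-per-L3 grouping splits the guest 4 + 2, whereas a 3-core CCX needs 3 + 3.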

  Validated against 3.0, 3.1, 4.1 and 4.2 versions of qemu-kvm.

  With newer QEMU there is a fix (that does behave correctly) using the dies
  parameter:
   <qemu:arg value='cores=3,threads=1,dies=2,sockets=1'/>

  The problem is that the dies are exposed differently from how AMD does
  it natively: they are exposed to Windows as sockets, which means that
  if you are not a business user, you can't ever have a machine with
  more than two CCXs (6 cores), as consumer versions of Windows only
  support two sockets. (Should this be reported as a separate bug?)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1856335/+subscriptions
