On Fri, May 29, 2020 at 4:47 AM Van Haaren, Harry <harry.van.haa...@intel.com> wrote:
>
> > -----Original Message-----
> > From: William Tu <u9012...@gmail.com>
> > Sent: Friday, May 29, 2020 2:19 AM
> > To: Van Haaren, Harry <harry.van.haa...@intel.com>
> > Cc: ovs-dev@openvswitch.org; i.maxim...@ovn.org
> > Subject: Re: [ovs-dev] [PATCH v2 5/5] dpif-lookup: add avx512 gather
> > implementation
> >
> > On Wed, May 27, 2020 at 12:21:43PM +0000, Van Haaren, Harry wrote:
> <snip hashing details>
> > > As a result, hashing identical data in different .c files produces
> > > different hash values.
> > >
> > > From the OVS docs
> > > (http://docs.openvswitch.org/en/latest/intro/install/general/),
> > > the following enables the native ISA for your build, or else just
> > > enables SSE4.2 and popcount:
> > > ./configure CFLAGS="-g -O2 -march=native"
> > > ./configure CFLAGS="-g -O2 -march=nehalem"
> >
> > Hi Harry,
> >
> > Thanks for the info!
> > I can make it work now, with
> > ./configure CFLAGS="-g -O2 -msse4.2 -march=native"
>
> OK - that's good - the root cause of the bug/hash-mismatch is confirmed!
>
> > using a similar setup:
> > ovs-ofctl add-flow br0 'actions=drop'
> > ovs-appctl dpif-netdev/subtable-lookup-set avx512_gather 5
> > ovs-vsctl add-port br0 tg0 -- set int tg0 type=dpdk \
> >     options:dpdk-devargs=vdev:net_pcap0,rx_pcap=/root/ovs/p0.pcap,infinite_rx=1
> >
> > The performance seems a little worse (9.7 Mpps -> 8.7 Mpps).
> > I wonder whether it's due to running it in a VM (however, I don't
> > have a physical machine).
>
> Performance degradations are not expected; let me try to understand
> the performance data posted below and work through it.
>
> Agree that isolating the hardware and being able to verify the
> environment would help in removing potential noise, but let us work
> with the setup you have. Do you know what CPU it is you're running on?
Thanks! I think it's skylake

root@instance-3:~/ovs# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) CPU @ 2.00GHz
Stepping:              3
CPU MHz:               2000.176
BogoMIPS:              4000.35
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              39424K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
                       cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall
                       nx pdpe1gb rdtscp lm constant_tsc rep_good nopl
                       xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq
                       ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt
                       aes xsave avx f16c rdrand hypervisor lahf_lm abm
                       3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp
                       fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid
                       rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb
                       avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1
                       xsaves arat md_clear arch_capabilities

root@instance-3:~/ovs# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 03)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)

>
> It seems you have EMC enabled (as per OVS defaults). The stats posted show
> an approx 10:1 ratio of hits in EMC vs. DPCLS. This likely adds noise to
> the measurements, as only ~10% of the packets hit the changes in DPCLS.
>
> Also, in the perf top profile dp_netdev_input__ takes more cycles than
> miniflow_extract, and memcmp() is present, indicating the EMC is consuming
> CPU cycles to perform its duties.
>
> I guess our simple test case is failing to show what we're trying to
> measure; as you know, an EMC likes low flow counts, all explaining why
> DPCLS is only ~2% of CPU time.
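For anyone wanting to repeat this experiment, the OVS documentation describes two knobs that take the EMC out of the measurement (a sketch; the exact option names assume a recent OVS release, and tg0 is the port name from the setup earlier in the thread):

```shell
# Disable EMC insertion globally: emc-insert-inv-prob is the inverse
# insertion probability, and 0 means "never insert", so the EMC drains
# and all traffic falls through to DPCLS.
ovs-vsctl set Open_vSwitch . other_config:emc-insert-inv-prob=0

# Or disable EMC lookups on a specific port (per-port knob in newer releases):
ovs-vsctl set Interface tg0 other_config:emc-enable=false
```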
> > <snip>
> Removed details of CPU profiles & PMD stats for AVX512 and generic DPCLS
> to trim the conversation. Very helpful to see into your system, and I'm
> a big fan of perf top and friends - so this was useful to see, thanks!
> (Future readers: check the mailing list "thread" view for the previous
> post's details.)
>
> > Is there anything I should double check?
>
> Would you mind re-testing with EMC disabled? Likely DPCLS will show up as
> a much larger % in the CPU profile, and this might provide some new
> insights.

OK, with EMC disabled, the performance gap is a little smaller. Now we
don't see memcmp.

=== generic ===
drop rate: 8.65 Mpps

pmd thread numa_id 0 core_id 1:
  packets received: 223168512
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 0
  smc hits: 0
  megaflow hits: 223167820
  avg. subtable lookups per megaflow hit: 1.00
  miss with success upcall: 1
  miss with failed upcall: 659
  avg. packets per output batch: 0.00
  idle cycles: 0 (0.00%)
  processing cycles: 51969566520 (100.00%)
  avg cycles per packet: 232.87 (51969566520/223168512)
  avg processing cycles per packet: 232.87 (51969566520/223168512)

 19.17%  pmd-c01/id:9  ovs-vswitchd  [.] dpcls_subtable_lookup_mf_u0w4_u1w1
 18.93%  pmd-c01/id:9  ovs-vswitchd  [.] miniflow_extract
 16.15%  pmd-c01/id:9  ovs-vswitchd  [.] eth_pcap_rx_infinite
 11.34%  pmd-c01/id:9  ovs-vswitchd  [.] dp_netdev_input__
 10.51%  pmd-c01/id:9  ovs-vswitchd  [.] miniflow_hash_5tuple
  6.88%  pmd-c01/id:9  ovs-vswitchd  [.] free_dpdk_buf
  5.63%  pmd-c01/id:9  ovs-vswitchd  [.] fast_path_processing
  4.95%  pmd-c01/id:9  ovs-vswitchd  [.] cmap_find_batch

=== AVX512 ===
drop rate: 8.28 Mpps

pmd thread numa_id 0 core_id 1:
  packets received: 138495296
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 0
  smc hits: 0
  megaflow hits: 138494847
  avg. subtable lookups per megaflow hit: 1.00
  miss with success upcall: 1
  miss with failed upcall: 416
  avg. packets per output batch: 0.00
  idle cycles: 0 (0.00%)
  processing cycles: 33452482260 (100.00%)
  avg cycles per packet: 241.54 (33452482260/138495296)
  avg processing cycles per packet: 241.54 (33452482260/138495296)

 19.78%  pmd-c01/id:9  ovs-vswitchd  [.] miniflow_extract
 17.73%  pmd-c01/id:9  ovs-vswitchd  [.] eth_pcap_rx_infinite
 13.53%  pmd-c01/id:9  ovs-vswitchd  [.] dpcls_avx512_gather_skx_mf_4_1
 12.00%  pmd-c01/id:9  ovs-vswitchd  [.] dp_netdev_input__
 10.94%  pmd-c01/id:9  ovs-vswitchd  [.] miniflow_hash_5tuple
  7.80%  pmd-c01/id:9  ovs-vswitchd  [.] free_dpdk_buf
  5.97%  pmd-c01/id:9  ovs-vswitchd  [.] fast_path_processing
  5.23%  pmd-c01/id:9  ovs-vswitchd  [.] cmap_find_batch

I'm not able to get the current CPU frequency, probably due to running in
a VM?

root@instance-3:~/ovs# modprobe acpi-cpufreq
root@instance-3:~/ovs# cpufreq-info
cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to cpuf...@vger.kernel.org, please.
analyzing CPU 0:
  no or unknown cpufreq driver is active on this CPU
  maximum transition latency: 4294.55 ms.

Regards,
William
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
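A sanity check on the figures in this thread: at the VM's nominal 2.0 GHz clock, the reported avg cycles per packet should roughly reproduce the measured drop rates (a back-of-the-envelope sketch; it assumes the PMD core actually runs at the nominal frequency, which the failed cpufreq-info above leaves unconfirmed):

```shell
# packets/sec = (cycles/sec) / (cycles per packet), using the pmd-stats above.
awk 'BEGIN {
    hz = 2.0e9                                         # nominal clock from lscpu
    printf "generic: %.2f Mpps\n", hz / 232.87 / 1e6   # measured: 8.65 Mpps
    printf "avx512:  %.2f Mpps\n", hz / 241.54 / 1e6   # measured: 8.28 Mpps
}'
# -> generic: 8.59 Mpps
# -> avx512:  8.28 Mpps
```

The close agreement suggests the cycle accounting and the measured rates are consistent, i.e. the ~4% gap really is extra cycles per packet rather than a measurement artifact.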