On Fri, May 29, 2020 at 4:47 AM Van Haaren, Harry <harry.van.haa...@intel.com> wrote:
>
> > -----Original Message-----
> > From: William Tu <u9012...@gmail.com>
> > Sent: Friday, May 29, 2020 2:19 AM
> > To: Van Haaren, Harry <harry.van.haa...@intel.com>
> > Cc: ovs-dev@openvswitch.org; i.maxim...@ovn.org
> > Subject: Re: [ovs-dev] [PATCH v2 5/5] dpif-lookup: add avx512 gather
> > implementation
> >
> > On Wed, May 27, 2020 at 12:21:43PM +0000, Van Haaren, Harry wrote:
> <snip hashing details>
> > > As a result, hashing identical data in different .c files produces
> > > different hash values.
> > >
> > > From the OVS docs
> > > (http://docs.openvswitch.org/en/latest/intro/install/general/),
> > > the following enables the native ISA for your build, or else just
> > > enables SSE4.2 and popcount:
> > > ./configure CFLAGS="-g -O2 -march=native"
> > > ./configure CFLAGS="-g -O2 -march=nehalem"
> >
> > Hi Harry,
> >
> > Thanks for the info!
> > I can make it work now, with
> > ./configure CFLAGS="-g -O2 -msse4.2 -march=native"
>
> OK - that's good - the root cause of the bug/hash-mismatch is confirmed!
>
> > using a similar setup:
> > ovs-ofctl add-flow br0 'actions=drop'
> > ovs-appctl dpif-netdev/subtable-lookup-set avx512_gather 5
> > ovs-vsctl add-port br0 tg0 -- set int tg0 type=dpdk \
> >     options:dpdk-devargs=vdev:net_pcap0,rx_pcap=/root/ovs/p0.pcap,infinite_rx=1
> >
> > The performance seems a little worse (9.7 Mpps -> 8.7 Mpps).
> > I wonder whether it's due to running it in a VM (however, I don't
> > have a physical machine).
>
> Performance degradations are not expected; let me try to understand
> the performance data posted below and work through it.
>
> Agree that isolating the hardware and being able to verify the
> environment would help in removing potential noise, but let us work
> with the setup you have. Do you know what CPU it is you're running on?
Thanks! I think it's skylake

root@instance-3:~/ovs# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) CPU @ 2.00GHz
Stepping:              3
CPU MHz:               2000.176
BogoMIPS:              4000.35
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              39424K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
                       cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall
                       nx pdpe1gb rdtscp lm constant_tsc rep_good nopl
                       xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq
                       ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt
                       aes xsave avx f16c rdrand hypervisor lahf_lm abm
                       3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp
                       fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid
                       rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb
                       avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1
                       xsaves arat md_clear arch_capabilities

root@instance-3:~/ovs# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 03)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)

>
> It seems you have EMC enabled (as per OVS defaults). The stats posted show
> an approx 10:1 ratio of hits in EMC vs. DPCLS. This likely adds noise to
> the measurements, as only ~10% of the packets hit the changes in DPCLS.
>
> Also, in the perf top profile dp_netdev_input__ takes more cycles than
> miniflow_extract, and memcmp() is present, indicating the EMC is consuming
> CPU cycles to perform its duties.
>
> I guess our simple test case is failing to show what we're trying to
> measure; as you know, an EMC likes low flow counts, all explaining why
> DPCLS is only ~2% of CPU time.
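For anyone wanting to repeat this experiment, the OVS documentation describes two knobs that take the EMC out of the measurement (a sketch; the exact option names assume a recent OVS release, and tg0 is the port name from the setup earlier in the thread):

```shell
# Disable EMC insertion globally: emc-insert-inv-prob is the inverse
# insertion probability, and 0 means "never insert", so the EMC drains
# and all traffic falls through to DPCLS.
ovs-vsctl set Open_vSwitch . other_config:emc-insert-inv-prob=0

# Or disable EMC lookups on a specific port (per-port knob in newer releases):
ovs-vsctl set Interface tg0 other_config:emc-enable=false
```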
> > <snip>
> Removed details of CPU profiles & PMD stats for AVX512 and generic DPCLS
> to trim the conversation. Very helpful to see into your system, and I'm
> a big fan of perf top and friends - so this was useful to see, thanks!
> (Future readers: check the mailing list "thread" view for the previous
> post's details.)
>
> > Is there anything I should double check?
>
> Would you mind re-testing with EMC disabled? Likely DPCLS will show up as
> a much larger % in the CPU profile, and this might provide some new
> insights.

OK, with EMC disabled, the performance gap is a little smaller. Now we
don't see memcmp.

=== generic ===
drop rate: 8.65 Mpps

pmd thread numa_id 0 core_id 1:
  packets received: 223168512
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 0
  smc hits: 0
  megaflow hits: 223167820
  avg. subtable lookups per megaflow hit: 1.00
  miss with success upcall: 1
  miss with failed upcall: 659
  avg. packets per output batch: 0.00
  idle cycles: 0 (0.00%)
  processing cycles: 51969566520 (100.00%)
  avg cycles per packet: 232.87 (51969566520/223168512)
  avg processing cycles per packet: 232.87 (51969566520/223168512)

 19.17%  pmd-c01/id:9  ovs-vswitchd  [.] dpcls_subtable_lookup_mf_u0w4_u1w1
 18.93%  pmd-c01/id:9  ovs-vswitchd  [.] miniflow_extract
 16.15%  pmd-c01/id:9  ovs-vswitchd  [.] eth_pcap_rx_infinite
 11.34%  pmd-c01/id:9  ovs-vswitchd  [.] dp_netdev_input__
 10.51%  pmd-c01/id:9  ovs-vswitchd  [.] miniflow_hash_5tuple
  6.88%  pmd-c01/id:9  ovs-vswitchd  [.] free_dpdk_buf
  5.63%  pmd-c01/id:9  ovs-vswitchd  [.] fast_path_processing
  4.95%  pmd-c01/id:9  ovs-vswitchd  [.] cmap_find_batch

=== AVX512 ===
drop rate: 8.28 Mpps

pmd thread numa_id 0 core_id 1:
  packets received: 138495296
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 0
  smc hits: 0
  megaflow hits: 138494847
  avg. subtable lookups per megaflow hit: 1.00
  miss with success upcall: 1
  miss with failed upcall: 416
  avg. packets per output batch: 0.00
  idle cycles: 0 (0.00%)
  processing cycles: 33452482260 (100.00%)
  avg cycles per packet: 241.54 (33452482260/138495296)
  avg processing cycles per packet: 241.54 (33452482260/138495296)

 19.78%  pmd-c01/id:9  ovs-vswitchd  [.] miniflow_extract
 17.73%  pmd-c01/id:9  ovs-vswitchd  [.] eth_pcap_rx_infinite
 13.53%  pmd-c01/id:9  ovs-vswitchd  [.] dpcls_avx512_gather_skx_mf_4_1
 12.00%  pmd-c01/id:9  ovs-vswitchd  [.] dp_netdev_input__
 10.94%  pmd-c01/id:9  ovs-vswitchd  [.] miniflow_hash_5tuple
  7.80%  pmd-c01/id:9  ovs-vswitchd  [.] free_dpdk_buf
  5.97%  pmd-c01/id:9  ovs-vswitchd  [.] fast_path_processing
  5.23%  pmd-c01/id:9  ovs-vswitchd  [.] cmap_find_batch

I'm not able to get the current CPU frequency, probably due to running in
a VM?

root@instance-3:~/ovs# modprobe acpi-cpufreq
root@instance-3:~/ovs# cpufreq-info
cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to cpuf...@vger.kernel.org, please.
analyzing CPU 0:
  no or unknown cpufreq driver is active on this CPU
  maximum transition latency: 4294.55 ms.

Regards,
William
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
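A sanity check on the figures in this thread: at the VM's nominal 2.0 GHz clock, the reported avg cycles per packet should roughly reproduce the measured drop rates (a back-of-the-envelope sketch; it assumes the PMD core actually runs at the nominal frequency, which the failed cpufreq-info above leaves unconfirmed):

```shell
# packets/sec = (cycles/sec) / (cycles per packet), using the pmd-stats above.
awk 'BEGIN {
    hz = 2.0e9                                         # nominal clock from lscpu
    printf "generic: %.2f Mpps\n", hz / 232.87 / 1e6   # measured: 8.65 Mpps
    printf "avx512:  %.2f Mpps\n", hz / 241.54 / 1e6   # measured: 8.28 Mpps
}'
# -> generic: 8.59 Mpps
# -> avx512:  8.28 Mpps
```

The close agreement suggests the cycle accounting and the measured rates are consistent, i.e. the ~4% gap really is extra cycles per packet rather than a measurement artifact.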