Hi Harry,

Thanks for your feedback!
> > > > On Wed, May 06, 2020 at 02:06:09PM +0100, Harry van Haaren wrote:
> > > > > This commit adds an AVX-512 dpcls lookup implementation.
> > > > > It uses the AVX-512 SIMD ISA to perform multiple miniflow
> > > > > operations in parallel.
> >
> > <snip lots of code/patch contents for readability>
> >
> > > Hi Harry,
> > >
> > > I managed to find a machine with AVX-512 in Google Cloud and did some
> > > performance testing. I saw lower performance when enabling AVX-512, so
> > > I believe I did something wrong. Do you mind having a look?
> >
> > <snip>
> >
> > > 3) start OVS, set the AVX-512 lookup, and start the traffic generator
> > > ovs-appctl dpif-netdev/subtable-lookup-set avx512_gather 5
> > > ovs-vsctl add-port br0 tg0 -- set int tg0 type=dpdk \
> > >   options:dpdk-devargs=vdev:net_pcap0,rx_pcap=/root/ovs/p0.pcap,infinite_rx=1
>
> The first command (enabling the AVX512 lookup) prints some output at INFO
> log level; please ensure it is there:
>
> 2020-05-20T09:39:09Z|00262|dpif_netdev_lookup|INFO|Subtable function 'avx512_gather' set priority to 4
> 2020-05-20T09:39:09Z|00006|dpif_netdev(pmd-c15/id:99)|INFO|reprobing sub func, 5 1

Yes, I got these INFO logs. For reference, my full setup:

ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x3
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
ovs-ofctl add-flow br0 'actions=drop'
ovs-appctl dpif-netdev/subtable-lookup-set avx512_gather 5
ovs-vsctl add-port br0 tg0 -- set int tg0 type=dpdk \
  options:dpdk-devargs=vdev:net_pcap0,rx_pcap=/root/ovs/p0.pcap,infinite_rx=1

LOG:
2020-05-20T13:49:26.542Z|00047|dpdk|INFO|DPDK Enabled - initialized
2020-05-20T13:49:26.544Z|00048|connmgr|INFO|br0<->unix#2: 1 flow_mods in the last 0 s (1 adds)
2020-05-20T13:49:26.547Z|00049|dpif_netdev_lookup|INFO|Subtable function 'avx512_gather' set priority to 5
2020-05-20T13:49:26.553Z|00050|netdev_dpdk|INFO|Device 'vdev:net_pcap0,rx_pcap=/root/ovs/p0.pcap,infinite_rx=1' attached to DPDK
2020-05-20T13:49:26.553Z|00051|dpif_netdev|INFO|PMD thread on numa_id: 0, core id: 0 created.
2020-05-20T13:49:26.554Z|00052|dpif_netdev|INFO|PMD thread on numa_id: 0, core id: 1 created.
2020-05-20T13:49:26.554Z|00053|dpif_netdev|INFO|There are 2 pmd threads on numa node 0
2020-05-20T13:49:26.554Z|00054|dpdk|INFO|Device with port_id=0 already stopped
2020-05-20T13:49:26.648Z|00055|netdev_dpdk|WARN|Rx checksum offload is not supported on port 0
2020-05-20T13:49:26.648Z|00056|netdev_dpdk|WARN|Interface tg0 does not support MTU configuration, max packet size supported is 1500.
2020-05-20T13:49:26.648Z|00057|netdev_dpdk|INFO|Port 0: 02:70:63:61:70:00
2020-05-20T13:49:26.648Z|00058|dpif_netdev|INFO|Core 0 on numa node 0 assigned port 'tg0' rx queue 0 (measured processing cycles 0).
2020-05-20T13:49:26.648Z|00059|bridge|INFO|bridge br0: added interface tg0 on port 1
2020-05-20T13:49:26.648Z|00001|ofproto_dpif_upcall(pmd-c00/id:9)|WARN|upcall_cb failure: ukey installation fails
2020-05-20T13:49:27.562Z|00002|dpif_netdev(pmd-c00/id:9)|INFO|reprobing sub func, 4 1

> > 4) dp flows with miniflow info
> <snip>
>
> It seems the "packets:0, bytes:0, used:never" tags indicate that there is
> no traffic hitting these rules at all?
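(Aside, for anyone following the thread: the two numbers in the "reprobing sub func, 5 1" log line and in "dp-extra-info:miniflow_bits(4,1)" are the popcounts of the two 64-bit units of the subtable mask's miniflow map, and the specialized functions such as dpcls_avx512_gather_skx_mf_5_1 are compiled for exactly those counts. Below is a toy sketch of that dispatch idea only; it is not the actual OVS code, and all names in it are invented for illustration.)

```c
/* Toy sketch (NOT OVS code): pick a lookup routine specialized for the
 * popcounts of the two 64-bit units of a subtable mask's miniflow map,
 * falling back to a generic routine. All names here are invented. */
#include <assert.h>
#include <stdint.h>

struct toy_map { uint64_t bits[2]; };       /* simplified stand-in */

typedef const char *(*toy_lookup_fn)(void);

static const char *lookup_mf_4_1(void) { return "mf_4_1"; }
static const char *lookup_mf_5_1(void) { return "mf_5_1"; }
static const char *lookup_generic(void) { return "generic"; }

/* Probe the map's per-unit bit counts, akin to what the
 * "reprobing sub func, B0 B1" log message reports. */
static toy_lookup_fn
toy_probe(const struct toy_map *map)
{
    int b0 = __builtin_popcountll(map->bits[0]);
    int b1 = __builtin_popcountll(map->bits[1]);

    if (b0 == 4 && b1 == 1) { return lookup_mf_4_1; }
    if (b0 == 5 && b1 == 1) { return lookup_mf_5_1; }
    return lookup_generic;
}
```

The point of specializing on the bit counts is that the compiler can unroll the per-unit loops with constant trip counts inside each variant.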
> Output here (with traffic running for a while) shows:
> packets:621588543, bytes:37295312580, used:0.000s, dp:ovs, actions:dpdk1,
> dp-extra-info:miniflow_bits(4,1)

Thanks. These are the rules being hit:

root@instance-3:~/ovs# ovs-appctl dpctl/dump-flows -m | grep -v never
flow-dump from pmd on cpu core: 0
ufid:f06998a0-9ff8-4ee5-b12f-5d7e2fcc7f0f, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(tg0),packet_type(ns=0,id=0),eth(src=42:01:0a:b6:00:01/00:00:00:00:00:00,dst=42:01:0a:b6:00:02/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=169.254.169.254/0.0.0.0,dst=10.182.0.2/0.0.0.0,proto=6/0,tos=0/0,ttl=64/0,frag=no),tcp(src=80/0,dst=51642/0),tcp_flags(0/0), packets:3942096, bytes:2511115152, used:0.001s, flags:P., dp:ovs, actions:drop, dp-extra-info:miniflow_bits(4,1)
ufid:cb3a6eac-3a7d-4e0d-a145-414dd482b4b9, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(tg0),packet_type(ns=0,id=0),eth(src=42:01:0a:b6:00:01/00:00:00:00:00:00,dst=42:01:0a:b6:00:02/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=169.254.169.254/0.0.0.0,dst=10.182.0.2/0.0.0.0,proto=6/0,tos=0/0,ttl=64/0,frag=no),tcp(src=80/0,dst=51650/0),tcp_flags(0/0), packets:2779552, bytes:172332224, used:0.000s, flags:S., dp:ovs, actions:drop, dp-extra-info:miniflow_bits(4,1)
ufid:781f3f48-ffd7-424f-ae99-62158ba05cbd, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(tg0),packet_type(ns=0,id=0),eth(src=42:01:0a:b6:00:02/00:00:00:00:00:00,dst=42:01:0a:b6:00:01/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=10.182.0.2/0.0.0.0,dst=169.254.169.254/0.0.0.0,proto=6/0,tos=0/0,ttl=64/0,frag=no),tcp(src=51650/0,dst=80/0),tcp_flags(0/0), packets:637373, bytes:216706820, used:0.000s, flags:P., dp:ovs, actions:drop, dp-extra-info:miniflow_bits(4,1)

> > 5) pmd-stat-show
> > root@instance-3:~/ovs# ovs-appctl dpif-netdev/pmd-stats-show
> > pmd thread numa_id 0 core_id 0:
> >   packets received: 19838528
> >   packet recirculations: 0
> >   avg. datapath passes per packet: 1.00
> >   emc hits: 0
> >   smc hits: 0
> >   megaflow hits: 0
> >   avg. subtable lookups per megaflow hit: 0.00  (---> this doesn't look right...)
> >   miss with success upcall: 78
> >   miss with failed upcall: 19838418
> >   avg. packets per output batch: 2.00
> >   idle cycles: 0 (0.00%)
> >   processing cycles: 103431787838 (100.00%)
> >   avg cycles per packet: 5213.68 (103431787838/19838528)
> >   avg processing cycles per packet: 5213.68 (103431787838/19838528)
>
> Would you try the pmd-stats-show command before setting the AVX512 lookup?
> If the issue is still present, it would indicate it is not related to the
> exact lookup implementation.

Before setting AVX512:

### Scalar Lookup
pmd thread numa_id 0 core_id 0:
  packets received: 77470176
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 70423947
  smc hits: 0
  megaflow hits: 7045897
  avg. subtable lookups per megaflow hit: 1.00
  miss with success upcall: 1
  miss with failed upcall: 331
  avg. packets per output batch: 0.00
  idle cycles: 0 (0.00%)
  processing cycles: 19596627706 (100.00%)
  avg cycles per packet: 252.96 (19596627706/77470176)
  avg processing cycles per packet: 252.96 (19596627706/77470176)

### AVX512 Lookup
(restarted ovs-vswitchd and additionally ran "ovs-appctl
dpif-netdev/subtable-lookup-set avx512_gather 5")

pmd thread numa_id 0 core_id 0:
  packets received: 1178784
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 0
  smc hits: 0
  megaflow hits: 0
  avg. subtable lookups per megaflow hit: 0.00
  miss with success upcall: 13
  miss with failed upcall: 1178739   ---> this does not look right
  avg. packets per output batch: 0.00
  idle cycles: 0 (0.00%)
  processing cycles: 5408870500 (100.00%)
  avg cycles per packet: 4588.52 (5408870500/1178784)
  avg processing cycles per packet: 4588.52 (5408870500/1178784)

> > 6) gdb also looks not right; I didn't see any AVX-512 instructions
> > (gdb) b avx512_lookup_impl
> > Breakpoint 2 at 0x55e92342a8df: avx512_lookup_impl. (4 locations)
> > Dump of assembler code for function dpcls_avx512_gather_skx_mf_5_1:
> > 96      const uint64_t tbl_u0 = subtable->mask.mf.map.bits[0];
> >    0x000055e92342a8df <+31>:  mov 0x30(%rdi),%r8
> > 97      const uint64_t tbl_u1 = subtable->mask.mf.map.bits[1];
> >    0x000055e92342a8e3 <+35>:  mov 0x38(%rdi),%r9
> > 98      ovs_assert(__builtin_popcountll(tbl_u0) == bit_count_u0);
> <snip some ASM>
>
> (gdb) disas dpcls_avx512_gather_skx_mf_5_1
> <snip preamble>
>    0x0000555556103f34 <+724>: vmovdqu64 0x28(%rdi),%zmm2{%k1}{z}
>    0x0000555556103f3e <+734>: vmovdqu64 0x18(%rcx),%zmm0{%k1}{z}
>    0x0000555556103f48 <+744>: vpandd %zmm0,%zmm1,%zmm0
>    0x0000555556103f4e <+750>: vpcmpeqq %zmm2,%zmm0,%k7{%k1}
>
> The disassembly here shows AVX512 register usage, as expected.

OK, I tried "(gdb) disas dpcls_avx512_gather_skx_mf_5_1" and it works for me;
I can see AVX-512 instructions. Setting a breakpoint at run time also works:

(gdb) b dpcls_avx512_gather_skx_mf_4_1
Thread 13 "pmd-c00/id:9" hit Breakpoint 1, dpcls_avx512_gather_skx_mf_4_1
    (subtable=0x7f732c008210, keys_map=1, keys=0x7f733af2a798,
    rules=0x7f733af2a7a0) at lib/dpif-netdev-lookup-avx512-gather.c:212
212     DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 1)

Using perf record also shows:

  3.11%  pmd-c00/id:9  ovs-vswitchd  [.] dpcls_avx512_gather_skx_mf_4_1

dpcls_avx512_gather_skx_mf_4_1  /usr/local/sbin/ovs-vswitchd
[Percent: local period]
  0.48 │  lea    -0x1(%rdi),%rax
       │  _mm512_srli_epi64():
       │    return (__m512i) __builtin_ia32_psrlqi512_mask ((__v8di) __A, __B,
       │  vpsrlq $0x4,%zmm0,%zmm1
       │  _mm512_shuffle_epi8():
  0.80 │  vpandd %zmm3,%zmm0,%zmm0
       │  avx512_lookup_impl():
       │    ULLONG_FOR_EACH_1 (i, keys_map) {
  0.32 │  and    %rax,%rdi
       │  _mm512_shuffle_epi8():
  0.16 │  vpandd %zmm1,%zmm3,%zmm1
  0.48 │  vpshufb %zmm0,%zmm4,%zmm0
  0.80 │  vpshufb %zmm1,%zmm4,%zmm1

> Note that "avx512_lookup_impl" is a static function in a .c file, so it is
> not visible outside the compilation unit. Further, it is also marked
> ALWAYS_INLINE, so even inside the compilation unit there isn't a symbol
> with that name. I'm surprised GDB let me set a breakpoint on it.
> Disassembling it doesn't work:
>
> (gdb) b avx512_lookup_impl
> Breakpoint 2 at 0x5555561035af: avx512_lookup_impl. (4 locations)
> (gdb) disas avx512_lookup_impl
> No symbol "avx512_lookup_impl" in current context.
>
> The functions it is inlined into are available for disassembly, as their
> symbols do exist in the binary. (Side note: going to add the dpcls_ prefix
> to the _any function for consistency in naming with the others):
>   dpcls_avx512_gather_skx_mf_4_0
>   dpcls_avx512_gather_skx_mf_4_1
>   dpcls_avx512_gather_skx_mf_5_1
>   avx512_gather_any
>
> Disassembling the _any version of the avx512 lookup function here
> shows the AVX512 instructions, using ZMM registers and {k} masks.
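(Another aside: the vpsrlq/vpandd/vpshufb/vpaddb/vpsadbw run visible in the perf annotate above is the classic in-register popcount via a 16-entry nibble lookup table, done byte-parallel across a ZMM register. For readers less familiar with the trick, here is a scalar C rendition of the same idea; this is an illustration only, not the OVS kernel.)

```c
/* Scalar rendition of the nibble-LUT popcount that the AVX-512 code
 * performs with vpshufb on ZMM registers: split each byte into low and
 * high nibbles, look each up in a 16-entry table of bit counts, and sum. */
#include <assert.h>
#include <stdint.h>

static const uint8_t nibble_bits[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

static unsigned
popcount64_nibble_lut(uint64_t x)
{
    unsigned count = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t byte = (x >> (i * 8)) & 0xff;
        count += nibble_bits[byte & 0x0f];   /* vpandd + vpshufb (low nibble)  */
        count += nibble_bits[byte >> 4];     /* vpsrlq + vpshufb (high nibble) */
    }
    /* The SIMD version sums the per-byte counts with vpaddb + vpsadbw. */
    return count;
}
```

In the vectorized form, one vpshufb performs 64 of these table lookups at once, which is why the sequence is preferred over scalar popcount in the hot loop.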
> (gdb) disas avx512_gather_mf_any
> Dump of assembler code for function avx512_gather_mf_any:
>    0x0000555556103fb0 <+0>:   lea 0x8(%rsp),%r10
>    0x0000555556103fb5 <+5>:   and $0xffffffffffffffc0,%rsp
>    0x0000555556103fb9 <+9>:   pushq -0x8(%r10)
> <skipping preamble/pushes etc, to the fun AVX512 part>
>    0x00005555561040dd <+301>: vpandd %zmm0,%zmm5,%zmm0
>    0x00005555561040e3 <+307>: or %rdi,%rax
>    0x00005555561040e6 <+310>: test %r8,%r8
>    0x00005555561040e9 <+313>: kmovb %eax,%k4
>    0x00005555561040ed <+317>: vpsrlq $0x4,%zmm0,%zmm2
>    0x00005555561040f4 <+324>: vpandd %zmm3,%zmm0,%zmm0
>    0x00005555561040fa <+330>: vpandd %zmm2,%zmm3,%zmm2
>    0x0000555556104100 <+336>: vpshufb %zmm0,%zmm4,%zmm0
>    0x0000555556104106 <+342>: vpshufb %zmm2,%zmm4,%zmm2
>    0x000055555610410c <+348>: vpaddb %zmm2,%zmm0,%zmm0
>    0x0000555556104112 <+354>: vpsadbw %zmm7,%zmm0,%zmm0
>    0x0000555556104118 <+360>: vpaddq %zmm1,%zmm0,%zmm0
>    0x000055555610411e <+366>: vmovdqa64 %zmm8,%zmm1
>    0x0000555556104124 <+372>: vpgatherqq 0x18(%r9,%zmm0,8),%zmm1{%k3}
>    0x000055555610412c <+380>: vpandq %zmm6,%zmm1,%zmm0{%k4}{z}
>
> Would you try some of the above and see whether it can be reproduced?

By the way, I see OVS reprobing every second:

2020-05-20T14:15:15.113Z|00373|dpif_netdev(pmd-c00/id:9)|INFO|reprobing sub func, 4 1
2020-05-20T14:15:16.129Z|00374|dpif_netdev(pmd-c00/id:9)|INFO|reprobing sub func, 4 1
2020-05-20T14:15:17.138Z|00375|dpif_netdev(pmd-c00/id:9)|INFO|reprobing sub func, 4 1
2020-05-20T14:15:18.150Z|00376|dpif_netdev(pmd-c00/id:9)|INFO|reprobing sub func, 4 1
2020-05-20T14:15:19.170Z|00377|dpif_netdev(pmd-c00/id:9)|INFO|reprobing sub func, 4 1
2020-05-20T14:15:20.184Z|00378|dpif_netdev(pmd-c00/id:9)|INFO|reprobing sub func, 4 1
2020-05-20T14:15:21.219Z|00379|dpif_netdev(pmd-c00/id:9)|INFO|reprobing sub func, 4 1

Regards,
William
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev