Hi Florin, I ran some experiments going cross-NUMA and see that I am not able to go beyond 50G. I tried different numbers of worker threads (5, 8 and 10), and up to 10 iperf servers. I am attaching the show run output with 10 workers. When I run the same experiment in Linux, I don't see a difference in bandwidth - iperf servers on both NUMA nodes are able to achieve 100G. Do you have any suggestions on other experiments to try?
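For context, the 10-worker/10-rx-queue shape corresponds to a startup.conf stanza roughly like this (a sketch, not the exact config from this box; the core list and PCI address are illustrative):

cpu {
  main-core 0
  corelist-workers 1-10
}
dpdk {
  dev 0000:15:00.0 {
    num-rx-queues 10
  }
}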
I also tried a single iperf connection: the bandwidth dropped from 33G (iperf on the same NUMA node as the worker) to 23G (a different NUMA node).

Thanks,
Vijay

On Sat, Sep 12, 2020 at 2:40 PM Florin Coras <fcoras.li...@gmail.com> wrote:
> Hi Vijay,
>
> On Sep 12, 2020, at 12:06 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>
> Hi Florin,
>
> On Sat, Sep 12, 2020 at 11:44 AM Florin Coras <fcoras.li...@gmail.com> wrote:
>
>> Hi Vijay,
>>
>> On Sep 12, 2020, at 10:06 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>
>> Hi Florin,
>>
>> On Fri, Sep 11, 2020 at 11:23 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>>
>>> Hi Vijay,
>>>
>>> Quick replies inline.
>>>
>>> On Sep 11, 2020, at 7:27 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>
>>> Hi Florin,
>>>
>>> Thanks once again for looking at this issue. Please see inline:
>>>
>>> On Fri, Sep 11, 2020 at 2:06 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>
>>>> Hi Vijay,
>>>>
>>>> Inline.
>>>>
>>>> On Sep 11, 2020, at 1:08 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>
>>>> Hi Florin,
>>>>
>>>> Thanks for the response. Please see inline:
>>>>
>>>> On Fri, Sep 11, 2020 at 10:42 AM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>
>>>>> Hi Vijay,
>>>>>
>>>>> Cool experiment. More inline.
>>>>>
>>>>> > On Sep 11, 2020, at 9:42 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>> >
>>>>> > Hi,
>>>>> >
>>>>> > I am using iperf3 as a client on an Ubuntu 18.04 Linux machine connected to another server running VPP using 100G NICs. Both servers are Intel Xeon with 24 cores.
>>>>>
>>>>> May I ask the frequency for those cores? Also, what type of nic are you using?
>>>>
>>>> 2700 MHz.
>>>>
>>>> Probably this somewhat limits throughput per single connection compared to my testbed, where the Intel cpu boosts to 4GHz.
>>>
>>> Please see below, I noticed an anomaly.
>>>
>>>> The nic is a Pensando DSC100.
>>>>
>>>> Okay, not sure what to expect there. Since this mostly stresses the rx side, what's the number of rx descriptors? Typically I test with 256; with more connections and higher throughput you might need more.
>>>
>>> This is the default - comments seem to suggest that is 1024. I don't see any rx queue empty errors on the nic, which probably means there are sufficient buffers.
>>>
>>> Reasonable. Might want to try to reduce it down to 256, but performance will depend a lot on other things as well.
>>
>> This seems to help, but I do get rx queue empty nic drops. More below.
>>
>> That's somewhat expected to happen either when 1) the peer tries to probe for more throughput and bursts a bit more than we can handle, or 2) a full vpp dispatch takes too long, which could happen because of the memcpy in tcp-established.
>>
>>>>> > I am trying to push 100G traffic from the iperf Linux TCP client by starting 10 parallel iperf connections on different port numbers and pinning them to different cores on the sender side. On the VPP receiver side I have 10 worker threads and 10 rx-queues in dpdk, and I am running iperf3 using the VCL library as follows:
>>>>> >
>>>>> > taskset 0x00400 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000" &
>>>>> > taskset 0x00800 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9001" &
>>>>> > taskset 0x01000 sh -c "LD_PRELOAD=/usr/lib/x86_64
>>>>> > ...
>>>>> >
>>>>> > MTU is set to 9216 everywhere, and TCP MSS is set to 8200 on the client:
>>>>> >
>>>>> > taskset 0x0001 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9000
>>>>> > taskset 0x0002 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9001
>>>>> > ...
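The vcl.conf referenced by VCL_CONFIG in the commands above would be along these lines (a sketch; rx-fifo-size is the knob swept further down, and the segment-size value is illustrative):

vcl {
  rx-fifo-size 4000000
  tx-fifo-size 4000000
  segment-size 4000000000
}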
>>>>> Could you try first with only 1 iperf server/client pair, just to see where performance is with that?
>>>>
>>>> These are the numbers I get:
>>>>
>>>> rx-fifo-size 65536: ~8G
>>>> rx-fifo-size 524288: 22G
>>>> rx-fifo-size 4000000: 25G
>>>>
>>>> Okay, so 4MB is probably the sweet spot. Btw, could you check the vector rate (and the errors) in this case also?
>>>
>>> I noticed that adding "enable-tcp-udp-checksum" back seems to improve performance. Not sure if this is an issue with the dpdk driver for the nic. Anyway, in the "show hardware" flags I see now that tcp and udp checksum offloads are enabled:
>>>
>>> root@server:~# vppctl show hardware
>>>               Name                Idx   Link  Hardware
>>> eth0                               1     up   dsc1
>>>   Link speed: 100 Gbps
>>>   Ethernet address 00:ae:cd:03:79:51
>>>   ### UNKNOWN ###
>>>     carrier up full duplex mtu 9000
>>>     flags: admin-up pmd maybe-multiseg rx-ip4-cksum
>>>     Devargs:
>>>     rx: queues 4 (max 16), desc 1024 (min 16 max 32768 align 1)
>>>     tx: queues 5 (max 16), desc 1024 (min 16 max 32768 align 1)
>>>     pci: device 1dd8:1002 subsystem 1dd8:400a address 0000:15:00.00 numa 0
>>>     max rx packet len: 9208
>>>     promiscuous: unicast off all-multicast on
>>>     vlan offload: strip off filter off qinq off
>>>     rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum vlan-filter
>>>                        jumbo-frame scatter
>>>     rx offload active: ipv4-cksum udp-cksum tcp-cksum jumbo-frame scatter
>>>     tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum tcp-tso
>>>                        outer-ipv4-cksum multi-segs mbuf-fast-free outer-udp-cksum
>>>     tx offload active: multi-segs
>>>     rss avail:         ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>>     rss active:        ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>>     tx burst function: ionic_xmit_pkts
>>>     rx burst function: ionic_recv_pkts
>>>
>>> With this I get better performance per iperf3 connection - about 30.5G. Show run output attached (1connection.txt).
>>>
>>> Interesting. Yes, dpdk does request rx ip/tcp checksum offload when possible, but it currently (unless some of the pending patches were merged) does not mark the packet appropriately, and ip4-local will recompute/validate the checksum. From your logs, it seems ip4-local needs ~1.8e3 cycles in the 1 connection setup and ~3.1e3 for 7 connections. That's a lot, so it seems to confirm that the checksum is recomputed.
>>>
>>> So it's somewhat counterintuitive that performance improves. How do the show run numbers change? Could be that performance worsens because of tcp's congestion recovery/flow control, i.e., the packets are processed faster but some component starts dropping/queues get full.
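For reference, both knobs discussed in this exchange live in the dpdk section of startup.conf; a sketch (the PCI address matches the show hardware output above, and the values are illustrative):

dpdk {
  enable-tcp-udp-checksum
  dev 0000:15:00.0 {
    num-rx-desc 256
  }
}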
>> That's interesting. I got confused by the "show hardware" output, since it doesn't show anything against "tx offload active". You are right, though: it definitely uses fewer cycles without this option present, so I took it out for further tests. I am attaching the show run output for both the 1 connection and 7 connection cases without this option present. With 1 connection, it appears VPP is not loaded at all, since there is no batching happening?
>>
>> That's probably because you're using 9kB frames. It's practically equivalent to LRO, so vpp doesn't need to work too much. Did throughput increase at all?
>
> Throughput varied between 26-30G.
>
> Sounds reasonable for the cpu frequency.
>
>> With 7 connections I do see it getting around 90-92G. When I drop the rx queue to 256, I do see some nic drops, but performance improves and I am getting 99G now.
>>
>> Awesome!
>>
>> Can you please explain why this makes a difference? Does it have to do with caches?
>>
>> There's probably several things at play. First of all, we back pressure the sender with minimal cost, i.e., we minimize the data that we queue and we just drop as soon as we run out of space. So instead of us trying to buffer large bursts and deal with them later, we force the sender to drop the rate. Second, as you already guessed, this probably improves cache utilization because we end up touching fewer buffers.
>
> I see. I was trying to accomplish something similar by limiting the rx-fifo-size (rmem in linux) for each connection. So there is no issue with the ring size being equal to the VPP batch size? While VPP is working on a batch, what happens if more packets come in?
>
> They will be dropped. Typically tcp pacing should make sure that packets are not delivered in bursts; instead they're spread over an rtt. For instance, see how small the vector rate is for 1 connection. Even if you multiply it by 4 (to reach 100Gbps), the vector rate is still small.
>
>> Are the other cores kind of unusable now due to being on a different numa? With Linux TCP, I believe I was able to use most of the cores and scale the number of connections.
>>
>> They're all usable; it's just that cross-numa memcpy is more expensive (the session layer buffers the data for the apps in shared memory fifos). As the sessions are scaled up, each session will carry less data, so moving some of them to the other numa should not be a problem. But it all ultimately depends on the efficiency of the UPI interconnect.
>
> Sure, I will try these experiments.
>
> Sounds good. Let me know how it goes.
>
> Regards,
> Florin
>
> Thanks,
> Vijay
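To put a rough number on the cross-numa memcpy cost mentioned above, independent of vpp, a small libnuma micro-test like the following compares same-node vs cross-node copy bandwidth (a sketch; node ids, buffer size and iteration count are illustrative; build with gcc -O2 numa_memcpy.c -lnuma):

/* numa_memcpy.c: compare same-node vs cross-node memcpy bandwidth.
 * Illustrative sketch only; assumes a 2-node box and libnuma. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <numa.h>

#define SZ (256UL << 20) /* 256 MB per buffer */
#define ITERS 10

static double copy_secs (void *dst, const void *src)
{
  struct timespec a, b;
  clock_gettime (CLOCK_MONOTONIC, &a);
  for (int i = 0; i < ITERS; i++)
    memcpy (dst, src, SZ);
  clock_gettime (CLOCK_MONOTONIC, &b);
  return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
}

int main (void)
{
  if (numa_available () < 0 || numa_max_node () < 1)
    {
      fprintf (stderr, "need libnuma and at least two NUMA nodes\n");
      return 1;
    }
  /* pin to node 0 so all loads/stores originate there */
  numa_run_on_node (0);
  void *src = numa_alloc_onnode (SZ, 0);
  void *local = numa_alloc_onnode (SZ, 0);
  void *remote = numa_alloc_onnode (SZ, 1);
  if (!src || !local || !remote)
    {
      fprintf (stderr, "allocation failed\n");
      return 1;
    }
  /* touch all pages so they are actually placed on the right nodes */
  memset (src, 1, SZ);
  memset (local, 0, SZ);
  memset (remote, 0, SZ);
  printf ("node0 -> node0: %.2f GB/s\n",
          (double) ITERS * SZ / copy_secs (local, src) / 1e9);
  printf ("node0 -> node1: %.2f GB/s\n",
          (double) ITERS * SZ / copy_secs (remote, src) / 1e9);
  numa_free (src, SZ);
  numa_free (local, SZ);
  numa_free (remote, SZ);
  return 0;
}

If the node0 -> node1 number comes out well below node0 -> node0, that gap is the interconnect ceiling the cross-numa fifo copies run into.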
root@server:~# vppctl show run; vppctl show error; vppctl show tcp stats
Thread 0 vpp_main (lcore 0)
Time 5.7, 10 sec internal node vector rate 0.00 loops/sec 1321718.14
  vector rates in 0.0000e0, out 0.0000e0, drop 0.0000e0, punt 0.0000e0
             Name                 State       Calls     Vectors  Suspends    Clocks  Vectors/Call
api-rx-from-ring                any wait          0           0         1    4.04e4         0.00
cnat-scanner-process            any wait          0           0         6    1.45e4         0.00
dpdk-process                    any wait          0           0         2    2.42e4         0.00
fib-walk                        any wait          0           0         3    1.34e4         0.00
ikev2-manager-process           any wait          0           0         6    1.41e4         0.00
ip4-full-reassembly-expire-wal  any wait          0           0         1    1.48e4         0.00
ip4-sv-reassembly-expire-walk   any wait          0           0         1    1.22e4         0.00
ip6-full-reassembly-expire-wal  any wait          0           0         1    1.39e4         0.00
ip6-mld-process                 any wait          0           0         6    5.97e3         0.00
ip6-ra-process                  any wait          0           0         6    5.06e3         0.00
ip6-sv-reassembly-expire-walk   any wait          0           0         1    1.38e4         0.00
session-queue-main              polling      988272           0         0    1.08e2         0.00
session-queue-process           any wait          0           0         5    5.27e3         0.00
statseg-collector-process       time wait         0           0         1    1.54e5         0.00
unix-cli-local:14               active            3           0         6    5.59e8         0.00
unix-cli-new-session            any wait          0           0         7    5.39e3         0.00
unix-epoll-input                polling      988272           0         0    1.22e4         0.00
wg-timer-manager                any wait          0           0       567    1.31e3         0.00
---------------
Thread 1 vpp_wk_0 (lcore 1)
Time 5.7, 10 sec internal node vector rate 1.00 loops/sec 6457147.63
  vector rates in 5.2893e-1, out 0.0000e0, drop 5.2893e-1, punt 0.0000e0
             Name                 State       Calls     Vectors  Suspends    Clocks  Vectors/Call
dpdk-input                      polling    37708180           3         0    1.29e9         0.00
drop                            active            3           3         0    3.97e3         1.00
error-drop                      active            3           3         0    5.08e3         1.00
ethernet-input                  active            3           3         0    2.58e3         1.00
llc-input                       active            3           3         0    2.16e3         1.00
session-queue                   polling    37708180           0         0    1.51e2         0.00
unix-epoll-input                polling       36789           0         0    6.05e2         0.00
---------------
Thread 2 vpp_wk_1 (lcore 2)
Time 5.7, 10 sec internal node vector rate 15.03 loops/sec 3736.73
  vector rates in 9.7673e4, out 3.8728e3, drop 0.0000e0, punt 0.0000e0
             Name                 State       Calls     Vectors  Suspends    Clocks  Vectors/Call
dpdk-input                      polling       21966      532020         0    7.71e2        24.22
dsc1-output                     active        21966       21966         0    3.97e2         1.00
dsc1-tx                         active        21966       21966         0    7.66e2         1.00
ethernet-input                  active        21966      532020         0    3.69e1        24.22
ip4-input-no-checksum           active        21966      532020         0    3.86e1        24.22
ip4-local                       active        21966      532020         0    4.66e1        24.22
ip4-lookup                      active        21966      553986         0    4.25e1        25.22
ip4-rewrite                     active        21966       21966         0    3.58e2         1.00
session-queue                   polling       21966       21966         0    2.49e3         1.00
tcp4-established                active        21966      532020         0    2.22e4        24.22
tcp4-input                      active        21966      532020         0    8.35e1        24.22
tcp4-output                     active        21966       21966         0    4.56e2         1.00
unix-epoll-input                polling          21           0         0    2.81e3         0.00
---------------
Thread 3 vpp_wk_2 (lcore 3)
Time 5.7, 10 sec internal node vector rate 0.00 loops/sec 6487712.48
  vector rates in 0.0000e0, out 0.0000e0, drop 0.0000e0, punt 0.0000e0
             Name                 State       Calls     Vectors  Suspends    Clocks  Vectors/Call
dpdk-input                      polling    37929582           0         0    1.05e2         0.00
session-queue                   polling    37929582           0         0    1.46e2         0.00
unix-epoll-input                polling       37004           0         0    5.81e2         0.00
---------------
Thread 4 vpp_wk_3 (lcore 4)
Time 5.7, 10 sec internal node vector rate 6.77 loops/sec 23743.38
  vector rates in 1.0241e5, out 8.9279e3, drop 0.0000e0, punt 0.0000e0
             Name                 State       Calls     Vectors  Suspends    Clocks  Vectors/Call
dpdk-input                      polling      169652      530220         0    8.41e2         3.13
dsc1-output                     active        50638       50638         0    3.75e2         1.00
dsc1-tx                         active        50638       50638         0    5.77e2         1.00
ethernet-input                  active        50638      530220         0    6.01e1        10.47
ip4-input-no-checksum           active        50638      530220         0    6.48e1        10.47
ip4-local                       active        50638      530220         0    6.98e1        10.47
ip4-lookup                      active        52422      580858         0    6.79e1        11.08
ip4-rewrite                     active        50638       50638         0    3.07e2         1.00
session-queue                   polling      169652       50638         0    2.60e3          .29
tcp4-established                active        50638      530220         0    2.18e4        10.47
tcp4-input                      active        50638      530220         0    1.13e2        10.47
tcp4-output                     active        50638       50638         0    4.29e2         1.00
unix-epoll-input                polling         165           0         0    2.59e3         0.00
---------------
Thread 5 vpp_wk_4 (lcore 5)
Time 5.7, 10 sec internal node vector rate 6.72 loops/sec 14214.08
  vector rates in 1.0323e5, out 9.0450e3, drop 0.0000e0, punt 0.0000e0
             Name                 State       Calls     Vectors  Suspends    Clocks  Vectors/Call
dpdk-input                      polling      192249      534183         0    8.43e2         2.78
dsc1-output                     active        51302       51302         0    5.27e2         1.00
dsc1-tx                         active        51302       51302         0    6.42e2         1.00
ethernet-input                  active        51302      534183         0    5.90e1        10.41
ip4-input-no-checksum           active        51302      534183         0    6.57e1        10.41
ip4-local                       active        51302      534183         0    6.99e1        10.41
ip4-lookup                      active        53310      585485         0    6.84e1        10.98
ip4-rewrite                     active        51302       51302         0    3.08e2         1.00
session-queue                   polling      192249       51302         0    2.65e3          .27
tcp4-established                active        51302      534183         0    2.16e4        10.41
tcp4-input                      active        51302      534183         0    1.15e2        10.41
tcp4-output                     active        51302       51302         0    4.35e2         1.00
unix-epoll-input                polling         187           0         0    2.92e3         0.00
---------------
Thread 6 vpp_wk_5 (lcore 6)
Time 5.7, 10 sec internal node vector rate 6.71 loops/sec 12804.05
  vector rates in 1.0392e5, out 9.1274e3, drop 0.0000e0, punt 0.0000e0
             Name                 State       Calls     Vectors  Suspends    Clocks  Vectors/Call
dpdk-input                      polling      193823      537666         0    8.47e2         2.77
dsc1-output                     active        51769       51769         0    3.88e2         1.00
dsc1-tx                         active        51769       51769         0    5.84e2         1.00
ethernet-input                  active        51769      537666         0    6.09e1        10.39
ip4-input-no-checksum           active        51769      537666         0    6.65e1        10.39
ip4-local                       active        51769      537666         0    6.98e1        10.39
ip4-lookup                      active        53803      589435         0    6.84e1        10.96
ip4-rewrite                     active        51769       51769         0    3.05e2         1.00
session-queue                   polling      193823       51769         0    2.64e3          .27
tcp4-established                active        51769      537666         0    2.15e4        10.39
tcp4-input                      active        51769      537666         0    1.11e2        10.39
tcp4-output                     active        51769       51769         0    4.52e2         1.00
unix-epoll-input                polling         190           0         0    2.79e3         0.00
---------------
Thread 7 vpp_wk_6 (lcore 7)
Time 5.7, 10 sec internal node vector rate 6.94 loops/sec 23387.01
  vector rates in 1.0199e5, out 8.6644e3, drop 0.0000e0, punt 0.0000e0
             Name                 State       Calls     Vectors  Suspends    Clocks  Vectors/Call
dpdk-input                      polling      123459      529372         0    8.24e2         4.29
dsc1-output                     active        49143       49143         0    3.82e2         1.00
dsc1-tx                         active        49143       49143         0    5.76e2         1.00
ethernet-input                  active        49143      529372         0    5.81e1        10.77
ip4-input-no-checksum           active        49143      529372         0    6.36e1        10.77
ip4-local                       active        49143      529372         0    6.85e1        10.77
ip4-lookup                      active        50285      578515         0    6.71e1        11.50
ip4-rewrite                     active        49143       49143         0    3.05e2         1.00
session-queue                   polling      123459       49143         0    2.55e3          .39
tcp4-established                active        49143      529372         0    2.19e4        10.77
tcp4-input                      active        49143      529372         0    1.11e2        10.77
tcp4-output                     active        49143       49143         0    4.42e2         1.00
unix-epoll-input                polling         121           0         0    2.91e3         0.00
---------------
Thread 8 vpp_wk_7 (lcore 8)
Time 5.7, 10 sec internal node vector rate 6.78 loops/sec 29254.26
  vector rates in 1.0262e5, out 8.9327e3, drop 0.0000e0, punt 0.0000e0
             Name                 State       Calls     Vectors  Suspends    Clocks  Vectors/Call
dpdk-input                      polling      164555      531357         0    8.38e2         3.23
dsc1-output                     active        50665       50665         0    3.76e2         1.00
dsc1-tx                         active        50665       50665         0    5.64e2         1.00
ethernet-input                  active        50665      531357         0    6.99e1        10.49
ip4-input-no-checksum           active        50665      531357         0    6.52e1        10.49
ip4-local                       active        50665      531357         0    6.94e1        10.49
ip4-lookup                      active        52381      582022         0    6.81e1        11.11
ip4-rewrite                     active        50665       50665         0    3.09e2         1.00
session-queue                   polling      164555       50665         0    2.79e3          .31
tcp4-established                active        50665      531357         0    2.17e4        10.49
tcp4-input                      active        50665      531357         0    1.11e2        10.49
tcp4-output                     active        50665       50665         0    4.41e2         1.00
unix-epoll-input                polling         161           0         0    2.54e3         0.00
---------------
Thread 9 vpp_wk_8 (lcore 9)
Time 5.7, 10 sec internal node vector rate 15.03 loops/sec 3745.72
  vector rates in 9.7569e4, out 3.8681e3, drop 0.0000e0, punt 0.0000e0
             Name                 State       Calls     Vectors  Suspends    Clocks  Vectors/Call
dpdk-input                      polling       21939      531456         0    7.78e2        24.22
dsc1-output                     active        21939       21939         0    4.17e2         1.00
dsc1-tx                         active        21939       21939         0    7.74e2         1.00
ethernet-input                  active        21939      531456         0    3.41e1        24.22
ip4-input-no-checksum           active        21939      531456         0    3.97e1        24.22
ip4-local                       active        21939      531456         0    4.69e1        24.22
ip4-lookup                      active        21939      553395         0    4.24e1        25.22
ip4-rewrite                     active        21939       21939         0    3.29e2         1.00
session-queue                   polling       21939       21939         0    2.56e3         1.00
tcp4-established                active        21939      531456         0    2.22e4        24.22
tcp4-input                      active        21939      531456         0    8.49e1        24.22
tcp4-output                     active        21939       21939         0    4.55e2         1.00
unix-epoll-input                polling          21           0         0    2.91e3         0.00
---------------
Thread 10 vpp_wk_9 (lcore 10)
Time 5.7, 10 sec internal node vector rate 6.76 loops/sec 17685.06
  vector rates in 1.0209e5, out 8.9169e3, drop 0.0000e0, punt 0.0000e0
             Name                 State       Calls     Vectors  Suspends    Clocks  Vectors/Call
dpdk-input                      polling      188203      528461         0    8.53e2         2.81
dsc1-output                     active        50575       50575         0    3.77e2         1.00
dsc1-tx                         active        50575       50575         0    5.76e2         1.00
ethernet-input                  active        50575      528461         0    5.96e1        10.45
ip4-input-no-checksum           active        50575      528461         0    6.71e1        10.45
ip4-local                       active        50575      528461         0    7.02e1        10.45
ip4-lookup                      active        52470      579036         0    6.94e1        11.04
ip4-rewrite                     active        50575      50575          0    3.18e2         1.00
session-queue                   polling      188203       50575         0    2.72e3          .27
tcp4-established                active        50575      528461         0    2.19e4        10.45
tcp4-input                      active        50575      528461         0    1.10e2        10.45
tcp4-output                     active        50575       50575         0    4.55e2         1.00
unix-epoll-input                polling         183           0         0    2.51e3         0.00
   Count           Node                 Reason
       3       llc-input            unknown llc ssap/dsap
   21977       session-queue        Packets transmitted
  532286       tcp4-established     Packets pushed into rx fifo
   21977       tcp4-output          Packets sent
   50658       session-queue        Packets transmitted
  530475       tcp4-established     Packets pushed into rx fifo
   50658       tcp4-output          Packets sent
   51320       session-queue        Packets transmitted
  534409       tcp4-established     Packets pushed into rx fifo
   51320       tcp4-output          Packets sent
   51791       session-queue        Packets transmitted
  537914       tcp4-established     Packets pushed into rx fifo
   51791       tcp4-output          Packets sent
   49166       session-queue        Packets transmitted
  529610       tcp4-established     Packets pushed into rx fifo
   49166       tcp4-output          Packets sent
   50689       session-queue        Packets transmitted
  531611       tcp4-established     Packets pushed into rx fifo
   50689       tcp4-output          Packets sent
   21950       session-queue        Packets transmitted
  531725       tcp4-established     Packets pushed into rx fifo
   21950       tcp4-output          Packets sent
   50590       session-queue        Packets transmitted
  528710       tcp4-established     Packets pushed into rx fifo
   50590       tcp4-output          Packets sent
Thread 0:
Thread 1:
Thread 2:
Thread 3:
Thread 4:
Thread 5:
Thread 6:
Thread 7:
Thread 8:
Thread 9:
Thread 10: