Hi Florin,

Thanks for the response. Please see inline:
On Fri, Sep 11, 2020 at 10:42 AM Florin Coras <fcoras.li...@gmail.com> wrote:

> Hi Vijay,
>
> Cool experiment. More inline.
>
> > On Sep 11, 2020, at 9:42 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
> >
> > Hi,
> >
> > I am using iperf3 as a client on an Ubuntu 18.04 Linux machine
> > connected to another server running VPP using 100G NICs. Both servers
> > are Intel Xeon with 24 cores.
>
> May I ask the frequency for those cores? Also what type of nic are you
> using?

2700 MHz. The nic is a Pensando DSC100.

> > I am trying to push 100G traffic from the iperf Linux TCP client by
> > starting 10 parallel iperf connections on different port numbers and
> > pinning them to different cores on the sender side. On the VPP
> > receiver side I have 10 worker threads and 10 rx-queues in dpdk, and
> > running iperf3 using the VCL library as follows:
> >
> > taskset 0x00400 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000" &
> > taskset 0x00800 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9001" &
> > taskset 0x01000 sh -c "LD_PRELOAD=/usr/lib/x86_64
> > ...
> >
> > MTU is set to 9216 everywhere, and TCP MSS set to 8200 on the client:
> >
> > taskset 0x0001 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9000
> > taskset 0x0002 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9001
> > ...
>
> Could you try first with only 1 iperf server/client pair, just to see
> where performance is with that?

These are the numbers I get:

rx-fifo-size 65536:    ~8G
rx-fifo-size 524288:    22G
rx-fifo-size 4000000:   25G
rx-fifo-size 8000000:   25G

> > I see that I am not able to push beyond 50-60G. I tried different
> > sizes for the vcl rx-fifo-size - 64K, 256K and 1M. With 1M fifo size,
> > I see that tcp latency as reported on the client increases, but not a
> > significant improvement in bandwidth. Are there any suggestions to
> > achieve 100G bandwidth? I am using a vpp build from master.
>
> Depends a lot on how many connections you’re running in parallel. With
> only one connection, buffer occupancy might go up, so 1-2MB might be
> better.

With the current run I increased this to 8000000.

> Could you also check how busy vpp is with “clear run”, wait at least 1
> second and then “show run”. That will give you per node/worker vector
> rates. If they go above 100 vectors/dispatch the workers are busy so you
> could increase their number and implicitly the number of sessions. Note
> however that RSS is not perfect so you can get more connections on one
> worker.

I am attaching the output of this to the email (10 iperf connections, 4
worker threads).

> > Pasting below the output of vpp and vcl conf files:
> >
> > cpu {
> >   main-core 0
> >   workers 10
>
> You can pin vpp’s workers to cores with: corelist-workers c1,c3-cN to
> avoid overlap with iperf. You might want to start with 1 worker and work
> your way up from there. In my testing, 1 worker should be enough to
> saturate a 40Gbps nic with 1 iperf connection. Maybe you need a couple
> more to reach 100, but I wouldn’t expect more.

I changed this to 4 cores and pinned them as you suggested.

> > }
> >
> > buffers {
> >   buffers-per-numa 65536
>
> Unless you need the buffers for something else, 16k might be enough.
>
> >   default data-size 9216
>
> Hm, no idea about the impact of this on performance. Session layer can
> build chained buffers so you can also try with this removed to see if it
> changes anything.

For now, I kept this setting.
> > }
> >
> > dpdk {
> >   dev 0000:15:00.0 {
> >     name eth0
> >     num-rx-queues 10
>
> Keep this in sync with the number of workers.
>
> >   }
> >   enable-tcp-udp-checksum
>
> This enables sw checksum. For better performance, you’ll have to remove
> it. It will be needed however if you want to turn tso on.

OK, removed.

> > }
> >
> > session {
> >   evt_qs_memfd_seg
> > }
> > socksvr { socket-name /tmp/vpp-api.sock }
> >
> > tcp {
> >   mtu 9216
> >   max-rx-fifo 262144
>
> This is only used to compute the window scale factor. Given that your
> fifos might be larger, I would remove it. By default the value is 32MB
> and gives a wnd_scale of 10 (should be okay).

When I was testing with the Linux TCP stack on both sides, I was
restricting the receive window per socket using net.ipv4.tcp_rmem to get
better latency numbers. I want to mimic that with VPP. What is the right
way to restrict the rcv_wnd on VPP?

> > }
> >
> > vcl.conf:
> > vcl {
> >   max-workers 1
>
> No need to constrain it.
>
> >   rx-fifo-size 262144
> >   tx-fifo-size 262144
>
> As previously mentioned you can configure them to be larger.

Made them 8000000. Attaching the show run output with 4 workers to this
email. Still getting about 50G.

Thanks,
Vijay
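For reference, the combined config after the changes above should look
roughly like this. It is a sketch rather than a verbatim dump: the
corelist-workers values are taken from the lcores in the attached show run,
num-rx-queues is assumed to have been brought in sync with the 4 workers as
suggested, and max-rx-fifo / vcl max-workers are left out here pending the
rcv_wnd question.

cpu {
  main-core 0
  corelist-workers 1-4
}

buffers {
  buffers-per-numa 65536
  default data-size 9216
}

dpdk {
  dev 0000:15:00.0 {
    name eth0
    num-rx-queues 4
  }
  # enable-tcp-udp-checksum removed, so checksum is offloaded to the nic
}

session { evt_qs_memfd_seg }
socksvr { socket-name /tmp/vpp-api.sock }

tcp {
  mtu 9216
}

vcl.conf:

vcl {
  rx-fifo-size 8000000
  tx-fifo-size 8000000
}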
vpp# show run
Thread 0 vpp_main (lcore 0)
Time 3.0, 10 sec internal node vector rate 0.00   loops/sec 1279589.36
  vector rates in 0.0000e0, out 0.0000e0, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls     Vectors   Suspends     Clocks   Vectors/Call
cnat-scanner-process             any wait           0           0          3     5.74e3           0.00
dpdk-process                     any wait           0           0          1     1.84e4           0.00
fib-walk                         any wait           0           0          2     9.05e3           0.00
ikev2-manager-process            any wait           0           0          3     7.19e3           0.00
ip6-mld-process                  any wait           0           0          3     1.58e3           0.00
ip6-ra-process                   any wait           0           0          3     3.69e3           0.00
session-queue-main                polling      526430           0          0     1.06e2           0.00
session-queue-process            any wait           0           0          3     4.99e3           0.00
unix-cli-local:1                   active           0           0          9     3.41e5           0.00
unix-epoll-input                  polling      526430           0          0     1.22e4           0.00
wg-timer-manager                 any wait           0           0        300     3.33e2           0.00
---------------
Thread 1 vpp_wk_0 (lcore 1)
Time 3.0, 10 sec internal node vector rate 140.92   loops/sec 655.72
  vector rates in 1.6822e5, out 1.9057e3, drop 6.6564e-1, punt 0.0000e0
             Name                 State         Calls     Vectors   Suspends     Clocks   Vectors/Call
dpdk-input                        polling        1952      499712          0     4.82e2         256.00
drop                               active           2           2          0     4.20e3           1.00
dsc1-output                        active        1952        5726          0     5.18e2           2.93
dsc1-tx                            active        1952        5726          0     8.42e2           2.93
error-drop                         active           2           2          0     3.13e3           1.00
ethernet-input                     active        1952      499712          0     2.49e1         256.00
ip4-input-no-checksum              active        1952      499710          0     2.24e1         255.99
ip4-local                          active        1952      499710          0     2.62e1         255.99
ip4-lookup                         active        3904      505436          0     2.87e1         129.47
ip4-rewrite                        active        1952        5726          0     2.77e2           2.93
llc-input                          active           2           2          0     5.83e3           1.00
session-queue                     polling        1952        5726          0     2.24e3           2.93
snap-input                         active           1           1          0     8.42e3           1.00
tcp4-established                   active        1952      499710          0     1.25e4         255.99
tcp4-input                         active        1952      499710          0     7.05e1         255.99
tcp4-output                        active        1952        5726          0     1.00e3           2.93
unix-epoll-input                  polling           2           0          0     3.30e3           0.00
---------------
Thread 2 vpp_wk_1 (lcore 2)
Time 3.0, 10 sec internal node vector rate 124.65   loops/sec 838.86
  vector rates in 1.7466e5, out 8.1308e2, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls     Vectors   Suspends     Clocks   Vectors/Call
dpdk-input                        polling        2443      522342          0     4.95e2         213.81
dsc1-output                        active        2443        2443          0     1.25e3           1.00
dsc1-tx                            active        2443        2443          0     1.71e3           1.00
ethernet-input                     active        2443      522342          0     2.56e1         213.81
ip4-input-no-checksum              active        2443      522342          0     2.42e1         213.81
ip4-local                          active        2443      522342          0     2.69e1         213.81
ip4-lookup                         active        3253      524785          0     2.79e1         161.32
ip4-rewrite                        active        2443        2443          0     5.84e2           1.00
session-queue                     polling        2443        2443          0     3.77e3           1.00
tcp4-established                   active        2443      522342          0     1.19e4         213.81
tcp4-input                         active        2443      522342          0     6.78e1         213.81
tcp4-output                        active        2443        2443          0     2.03e3           1.00
unix-epoll-input                  polling           3           0          0     2.88e3           0.00
---------------
Thread 3 vpp_wk_2 (lcore 3)
Time 3.0, 10 sec internal node vector rate 140.97   loops/sec 663.92
  vector rates in 1.6899e5, out 1.9101e3, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls     Vectors   Suspends     Clocks   Vectors/Call
dpdk-input                        polling        1961      502016          0     4.82e2         256.00
dsc1-output                        active        1961        5739          0     5.05e2           2.93
dsc1-tx                            active        1961        5739          0     9.02e2           2.93
ethernet-input                     active        1961      502016          0     2.41e1         256.00
ip4-input-no-checksum              active        1961      502016          0     2.28e1         256.00
ip4-local                          active        1961      502016          0     2.58e1         256.00
ip4-lookup                         active        3922      507755          0     2.86e1         129.46
ip4-rewrite                        active        1961        5739          0     2.90e2           2.93
session-queue                     polling        1961        5739          0     2.12e3           2.93
tcp4-established                   active        1961      502016          0     1.24e4         256.00
tcp4-input                         active        1961      502016          0     7.00e1         256.00
tcp4-output                        active        1961        5739          0     7.43e2           2.93
unix-epoll-input                  polling           2           0          0     3.58e3           0.00
---------------
Thread 4 vpp_wk_3 (lcore 4)
Time 3.0, 10 sec internal node vector rate 140.97   loops/sec 1016.21
  vector rates in 2.5521e5, out 2.8436e3, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls     Vectors   Suspends     Clocks   Vectors/Call
dpdk-input                        polling        2962      758272          0     4.67e2         256.00
dsc1-output                        active        2962        8544          0     3.96e2           2.88
dsc1-tx                            active        2962        8544          0     7.67e2           2.88
ethernet-input                     active        2962      758272          0     2.17e1         256.00
ip4-input-no-checksum              active        2962      758272          0     2.11e1         256.00
ip4-local                          active        2962      758272          0     2.55e1         256.00
ip4-lookup                         active        5924      766816          0     2.59e1         129.44
ip4-rewrite                        active        2962        8544          0     2.49e2           2.88
session-queue                     polling        2962        8544          0     1.66e3           2.88
tcp4-established                   active        2962      758272          0     8.03e3         256.00
tcp4-input                         active        2962      758272          0     7.07e1         256.00
tcp4-output                        active        2962        8544          0     5.75e2           2.88
unix-epoll-input                  polling           2           0          0     4.58e3           0.00
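A rough back-of-the-envelope check on the numbers above (my arithmetic,
assuming close to one full ~8200-byte MSS of payload per received packet):

  aggregate input rate:  1.6822e5 + 1.7466e5 + 1.6899e5 + 2.5521e5 ~= 7.7e5 pkts/sec
  approximate goodput:   7.7e5 pkts/sec * 8200 bytes * 8 bits ~= 50 Gbit/s

which lines up with the ~50G iperf reports. tcp4-established is the dominant
per-packet cost (~1.2e4 clocks on three of the four workers); at 2.7 GHz that
alone limits a worker to roughly 2.2e5 packets/sec, so the bottleneck looks
like per-worker CPU in the TCP path rather than fifo sizing.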