Hi Florin,

In the 1 iperf connection test, I get different numbers every time I run. When I ran today:
- iperf and vpp on the same numa node as the pci device: 50Gbps (although in different runs I saw 30Gbps also)
- vpp on the same numa node as the pci device, iperf on the other numa node: 28Gbps
- vpp and iperf both on the other numa node from the pci device: 36Gbps

These numbers vary from test to test, but I was never able to get beyond 50G with 10 connections with iperf on the other numa node. As I mentioned in the previous email, when I repeat this test with Linux TCP as the server, I am able to get 100G no matter which cores I start iperf on.

Thanks,
Vijay

On Mon, Sep 14, 2020 at 8:30 PM Florin Coras <fcoras.li...@gmail.com> wrote:

> Hi Vijay,
>
> In this sort of setup, with few connections, probably it’s inevitable to lose throughput because of the cross-numa memcpy. In your 1 iperf connection test, did you only change iperf’s numa or vpp’s worker as well?
>
> Regards,
> Florin
>
> On Sep 14, 2020, at 6:35 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>
> Hi Florin,
>
> I ran some experiments by going cross numa, and see that I am not able to go beyond 50G. I tried with a different number of worker threads (5, 8 and 10), and going up to 10 iperf servers. I am attaching the show run output with 10 workers. When I run the same experiment in Linux, I don't see a difference in the bandwidth - iperf on both numa nodes is able to achieve 100G. Do you have any suggestions on other experiments to try?
>
> I also did try 1 iperf connection - and the bandwidth dropped from 33G to 23G for the same numa core vs different.
>
> Thanks,
>
> Vijay
>
> On Sat, Sep 12, 2020 at 2:40 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>
>> Hi Vijay,
>>
>> On Sep 12, 2020, at 12:06 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>
>> Hi Florin,
>>
>> On Sat, Sep 12, 2020 at 11:44 AM Florin Coras <fcoras.li...@gmail.com> wrote:
>>
>>> Hi Vijay,
>>>
>>> On Sep 12, 2020, at 10:06 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>
>>> Hi Florin,
>>>
>>> On Fri, Sep 11, 2020 at 11:23 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>
>>>> Hi Vijay,
>>>>
>>>> Quick replies inline.
>>>>
>>>> On Sep 11, 2020, at 7:27 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>
>>>> Hi Florin,
>>>>
>>>> Thanks once again for looking at this issue. Please see inline:
>>>>
>>>> On Fri, Sep 11, 2020 at 2:06 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>
>>>>> Hi Vijay,
>>>>>
>>>>> Inline.
>>>>>
>>>>> On Sep 11, 2020, at 1:08 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>>
>>>>> Hi Florin,
>>>>>
>>>>> Thanks for the response. Please see inline:
>>>>>
>>>>> On Fri, Sep 11, 2020 at 10:42 AM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>>
>>>>>> Hi Vijay,
>>>>>>
>>>>>> Cool experiment. More inline.
>>>>>>
>>>>>> > On Sep 11, 2020, at 9:42 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>>> >
>>>>>> > Hi,
>>>>>> >
>>>>>> > I am using iperf3 as a client on an Ubuntu 18.04 Linux machine connected to another server running VPP using 100G NICs. Both servers are Intel Xeon with 24 cores.
>>>>>>
>>>>>> May I ask the frequency for those cores? Also what type of nic are you using?
>>>>>
>>>>> 2700 MHz.
>>>>>
>>>>> Probably this somewhat limits throughput per single connection compared to my testbed where the Intel cpu boosts to 4GHz.
>>>>
>>>> Please see below, I noticed an anomaly.
>>>>
>>>>> The nic is a Pensando DSC100.
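Since the core frequency and the nic's numa locality both come up in this exchange, a quick sanity check on the vpp host might look like the sketch below. The pci address is the one that appears in the show hardware output further down the thread (sysfs uses the single-digit function form), and the exact lscpu field names can vary by distro:

    # which numa node the nic sits on (-1 means single-node/unknown)
    cat /sys/bus/pci/devices/0000:15:00.0/numa_node
    # core frequency and core-to-numa-node mapping
    lscpu | grep -Ei 'mhz|numa'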
>>>>> Okay, not sure what to expect there. Since this mostly stresses the rx side, what’s the number of rx descriptors? Typically I test with 256; with more connections/higher throughput you might need more.
>>>>
>>>> This is the default - the comments seem to suggest it is 1024. I don't see any rx queue empty errors on the nic, which probably means there are sufficient buffers.
>>>>
>>>> Reasonable. Might want to try to reduce it down to 256 but performance will depend a lot on other things as well.
>>>
>>> This seems to help, but I do get rx queue empty nic drops. More below.
>>>
>>> That’s somewhat expected to happen either when 1) the peer tries to probe for more throughput and bursts a bit more than we can handle or 2) a full vpp dispatch takes too long, which could happen because of the memcpy in tcp-established.
>>>
>>>>>> > I am trying to push 100G traffic from the iperf Linux TCP client by starting 10 parallel iperf connections on different port numbers and pinning them to different cores on the sender side. On the VPP receiver side I have 10 worker threads and 10 rx-queues in dpdk, and running iperf3 using the VCL library as follows:
>>>>>> >
>>>>>> > taskset 0x00400 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000" &
>>>>>> > taskset 0x00800 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9001" &
>>>>>> > taskset 0x01000 sh -c "LD_PRELOAD=/usr/lib/x86_64
>>>>>> > ...
>>>>>> >
>>>>>> > MTU is set to 9216 everywhere, and TCP MSS set to 8200 on the client:
>>>>>> >
>>>>>> > taskset 0x0001 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9000
>>>>>> > taskset 0x0002 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9001
>>>>>> > ...
>>>>>>
>>>>>> Could you try first with only 1 iperf server/client pair, just to see where performance is with that?
>>>>>
>>>>> These are the numbers I get:
>>>>> rx-fifo-size 65536: ~8G
>>>>> rx-fifo-size 524288: 22G
>>>>> rx-fifo-size 4000000: 25G
>>>>>
>>>>> Okay, so 4MB is probably the sweet spot. Btw, could you check the vector rate (and the errors) in this case also?
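As a reference for the fifo sizes being swept above, they would be carried in the vcl.conf passed via VCL_CONFIG in the iperf3 server commands. A minimal sketch with the 4MB value, assuming the standard VCL options and omitting everything else (the actual file contents of this setup are not shown in the thread):

    vcl {
      rx-fifo-size 4000000
      tx-fifo-size 4000000
    }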
>>>> I noticed that adding "enable-tcp-udp-checksum" back seems to improve performance. Not sure if this is an issue with the dpdk driver for the nic. Anyway, in the "show hardware" flags I see now that tcp and udp checksum offloads are enabled:
>>>>
>>>> root@server:~# vppctl show hardware
>>>>               Name                Idx   Link  Hardware
>>>> eth0                               1     up   dsc1
>>>>   Link speed: 100 Gbps
>>>>   Ethernet address 00:ae:cd:03:79:51
>>>>   ### UNKNOWN ###
>>>>     carrier up full duplex mtu 9000
>>>>     flags: admin-up pmd maybe-multiseg rx-ip4-cksum
>>>>     Devargs:
>>>>     rx: queues 4 (max 16), desc 1024 (min 16 max 32768 align 1)
>>>>     tx: queues 5 (max 16), desc 1024 (min 16 max 32768 align 1)
>>>>     pci: device 1dd8:1002 subsystem 1dd8:400a address 0000:15:00.00 numa 0
>>>>     max rx packet len: 9208
>>>>     promiscuous: unicast off all-multicast on
>>>>     vlan offload: strip off filter off qinq off
>>>>     rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum vlan-filter jumbo-frame scatter
>>>>     rx offload active: ipv4-cksum udp-cksum tcp-cksum jumbo-frame scatter
>>>>     tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum tcp-tso outer-ipv4-cksum multi-segs mbuf-fast-free outer-udp-cksum
>>>>     tx offload active: multi-segs
>>>>     rss avail:         ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>>>     rss active:        ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>>>     tx burst function: ionic_xmit_pkts
>>>>     rx burst function: ionic_recv_pkts
>>>>
>>>> With this I get better performance per iperf3 connection - about 30.5G. Show run output attached (1connection.txt).
>>>>
>>>> Interesting. Yes, dpdk does request offload rx ip/tcp checksum computation when possible but it currently (unless some of the pending patches were merged) does not mark the packet appropriately and ip4-local will recompute/validate the checksum. From your logs, it seems ip4-local needs ~1.8e3 cycles in the 1 connection setup and ~3.1e3 for 7 connections. That’s a lot, so it seems to confirm that the checksum is recomputed.
>>>>
>>>> So, it’s somewhat counterintuitive that performance improves. How do the show run numbers change? Could be that performance worsens because of tcp’s congestion recovery/flow control, i.e., the packets are processed faster but some component starts dropping/queues get full.
>>>
>>> That's interesting. I got confused by the "show hardware" output since it doesn't show any output against "tx offload active". You are right, though - it definitely uses fewer cycles without this option present, so I took it out for further tests. I am attaching the show run output for both the 1 connection and 7 connection cases without this option present. With 1 connection, it appears VPP is not loaded at all since there is no batching happening?
>>>
>>> That’s probably because you’re using 9kB frames. It’s practically equivalent to LRO so vpp doesn’t need to work too much. Did throughput increase at all?
>>
>> Throughput varied between 26-30G.
>>
>> Sounds reasonable for the cpu frequency.
>>
>>> With 7 connections I do see it getting around 90-92G. When I drop the rx queue to 256, I do see some nic drops, but performance improves and I am getting 99G now.
>>>
>>> Awesome!
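For completeness, both knobs touched in this exchange - the per-device rx descriptor count and the checksum option that was later removed - live in the dpdk section of vpp's startup.conf. A sketch under the assumption of the 10-worker setup described earlier, with the device address taken from the show hardware output above (written without the trailing function zero):

    dpdk {
      # enable-tcp-udp-checksum was tried above and then removed for the later tests
      dev 0000:15:00.0 {
        num-rx-queues 10
        num-rx-desc 256
      }
    }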
>>> Can you please explain why this makes a difference? Does it have to do with caches?
>>>
>>> There’s probably several things at play. First of all, we back pressure the sender with minimal cost, i.e., we minimize the data that we queue and we just drop as soon as we run out of space. So instead of us trying to buffer large bursts and deal with them later, we force the sender to drop the rate. Second, as you already guessed, this probably improves cache utilization because we end up touching fewer buffers.
>>
>> I see. I was trying to accomplish something similar by limiting the rx-fifo-size (rmem in linux) for each connection. So there is no issue with the ring size being equal to the VPP batch size? While VPP is working on a batch, what happens if more packets come in?
>>
>> They will be dropped. Typically tcp pacing should make sure that packets are not delivered in bursts; instead they’re spread over an rtt. For instance, see how small the vector rate is for 1 connection. Even if you multiply it by 4 (to reach 100Gbps) the vector rate is still small.
>>
>>> Are the other cores kind of unusable now due to being on a different numa? With Linux TCP, I believe I was able to use most of the cores and scale the number of connections.
>>>
>>> They’re all usable, but it’s just that cross-numa memcpy is more expensive (the session layer buffers the data for the apps in the shared memory fifos). As the sessions are scaled up, each session will carry less data, so moving some of them to the other numa should not be a problem. But it all ultimately depends on the efficiency of the UPI interconnect.
>>
>> Sure, I will try these experiments.
>>
>> Sounds good. Let me know how it goes.
>>
>> Regards,
>> Florin
>>
>> Thanks,
>>
>> Vijay
>>
>> <show_run_10_conn_cross_numa.txt>
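On the vpp side of the cross-numa experiments, worker placement is fixed in the cpu section of startup.conf. A minimal sketch with illustrative core numbers rather than the ones from this setup:

    cpu {
      # pick cores on the nic's numa node, or deliberately split across nodes to test cross-numa behavior
      main-core 1
      corelist-workers 2-11
    }

Where the main thread and workers actually landed can then be checked with "vppctl show threads", and iperf placement is controlled with taskset as in the commands earlier in the thread.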