Hi Florin,

Got it. So what you are saying is that TCP applications cannot be linked directly with VPP. They have to run as a separate process and go through the VCL library, although they can be optimized to avoid one extra memcpy. In the future that memcpy _may_ be avoided completely, but the applications would still have to live in a separate process.

Thanks,
Vijay
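For context, a traditional application such as iperf3 attaches to the VPP host stack through the VCL preload library and a small vcl.conf. The snippet below is only a minimal sketch, assuming the standard vcl configuration section; the fifo and segment sizes are illustrative values in line with those discussed later in the thread, not recommendations.

vcl {
  # per-session fifo sizes; 4MB rx is the "sweet spot" found later in this thread
  rx-fifo-size 4000000
  tx-fifo-size 4000000
  # shared memory segment backing the fifos (illustrative value)
  segment-size 1000000000
}

The application is then launched with LD_PRELOAD pointing at libvcl_ldpreload.so and VCL_CONFIG pointing at this file, as in the taskset commands quoted further down.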
On Tue, Sep 15, 2020 at 11:44 AM Florin Coras <fcoras.li...@gmail.com> wrote:

> Hi Vijay,
>
> Currently, builtin applications can only receive data from tcp in a session’s rx fifo. That’s a deliberate choice because, at scale, out of order data could end up consuming a lot of buffers, i.e., buffers are queued but cannot be consumed by the app until the gaps are filled. Still, builtin apps can avoid the extra memcpy vcl needs to do for traditional apps.
>
> Now, there have been talks and we have been considering the option of linking vlib buffers into the fifos (to avoid the memcpy) but there’s no ETA for that.
>
> Regards,
> Florin
>
> On Sep 15, 2020, at 11:32 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>
> Hi Florin,
>
> Sure yes, and better still would be for the app to integrate directly with VPP to even avoid the shared fifo copy, I assume. It’s just that the VCL library gives a quick way to get some benchmark numbers with existing applications. Thanks for all the help. I have a much better idea now.
>
> Thanks,
> Vijay
>
> On Tue, Sep 15, 2020 at 11:25 AM Florin Coras <fcoras.li...@gmail.com> wrote:
>
>> Hi Vijay,
>>
>> Yes, that is the case for this iperf3 test. The data is already in user space, and could be passed to the app in the shape of iovecs, to avoid the extra memcpy, but the app would need to be changed to have it release the memory whenever it’s done reading it. In case of iperf3 it would be on the spot, because it discards it.
>>
>> For completeness, note that we don’t currently have vcl apis to expose the fifo chunks as iovecs, but they shouldn’t be that difficult.
>>
>> Regards,
>> Florin
>>
>> On Sep 15, 2020, at 10:47 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>
>> Hi Florin,
>>
>> I just realized that maybe in the VPP case there is an extra copy - once from mbuf to shared fifo, and once from shared fifo to application buffer. In Linux, there is probably just the copy from kernel space to user space. Please correct me if I am wrong. If so, what I am doing is not an apples to apples comparison.
>>
>> Thanks,
>> Vijay
>>
>> On Tue, Sep 15, 2020 at 8:54 AM Vijay Sampath <vsamp...@gmail.com> wrote:
>>
>>> Hi Florin,
>>>
>>> In the 1 iperf connection test, I get different numbers every time I run. When I ran today:
>>>
>>> - iperf and vpp on the same numa node as the pci device: 50Gbps (although in different runs I saw 30Gbps also)
>>> - vpp on the same numa node as the pci device, iperf on the other numa node: 28Gbps
>>> - vpp and iperf both on the other numa node from the pci device: 36Gbps
>>>
>>> These numbers vary from test to test, but I was never able to get beyond 50G with 10 connections with iperf on the other numa node. As I mentioned in the previous email, when I repeat this test with Linux TCP as the server, I am able to get 100G no matter which cores I start iperf on.
>>>
>>> Thanks,
>>> Vijay
>>>
>>> On Mon, Sep 14, 2020 at 8:30 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>
>>>> Hi Vijay,
>>>>
>>>> In this sort of setup, with few connections, probably it’s inevitable to lose throughput because of the cross-numa memcpy. In your 1 iperf connection test, did you only change iperf’s numa or vpp’s worker as well?
>>>>
>>>> Regards,
>>>> Florin
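Since the follow-up below revolves around which numa node vpp's workers and the iperf processes land on, here is a rough sketch of how that placement can be controlled. The core numbers are assumptions for illustration only; the pci address is taken from the show hardware output further down (sysfs uses the short 0000:15:00.0 form).

# find the numa node the nic is attached to
cat /sys/bus/pci/devices/0000:15:00.0/numa_node

# startup.conf: keep vpp's main thread and workers on cores of that node
cpu {
  main-core 1
  corelist-workers 2-11
}

# start each iperf3 server pinned to a core on the same node
taskset -c 12 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000"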
>>>> On Sep 14, 2020, at 6:35 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>
>>>> Hi Florin,
>>>>
>>>> I ran some experiments by going cross numa, and see that I am not able to go beyond 50G. I tried with a different number of worker threads (5, 8 and 10), and going up to 10 iperf servers. I am attaching the show run output with 10 workers. When I run the same experiment in Linux, I don’t see a difference in the bandwidth - iperf on both numa nodes is able to achieve 100G. Do you have any suggestions on other experiments to try?
>>>>
>>>> I also did try 1 iperf connection - and the bandwidth dropped from 33G to 23G for the same numa node vs a different one.
>>>>
>>>> Thanks,
>>>> Vijay
>>>>
>>>> On Sat, Sep 12, 2020 at 2:40 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>
>>>>> Hi Vijay,
>>>>>
>>>>> On Sep 12, 2020, at 12:06 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>>
>>>>> Hi Florin,
>>>>>
>>>>> On Sat, Sep 12, 2020 at 11:44 AM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>>
>>>>>> Hi Vijay,
>>>>>>
>>>>>> On Sep 12, 2020, at 10:06 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Florin,
>>>>>>
>>>>>> On Fri, Sep 11, 2020 at 11:23 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Vijay,
>>>>>>>
>>>>>>> Quick replies inline.
>>>>>>>
>>>>>>> On Sep 11, 2020, at 7:27 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi Florin,
>>>>>>>
>>>>>>> Thanks once again for looking at this issue. Please see inline:
>>>>>>>
>>>>>>> On Fri, Sep 11, 2020 at 2:06 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Vijay,
>>>>>>>>
>>>>>>>> Inline.
>>>>>>>>
>>>>>>>> On Sep 11, 2020, at 1:08 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi Florin,
>>>>>>>>
>>>>>>>> Thanks for the response. Please see inline:
>>>>>>>>
>>>>>>>> On Fri, Sep 11, 2020 at 10:42 AM Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Vijay,
>>>>>>>>>
>>>>>>>>> Cool experiment. More inline.
>>>>>>>>>
>>>>>>>>> > On Sep 11, 2020, at 9:42 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> > Hi,
>>>>>>>>> >
>>>>>>>>> > I am using iperf3 as a client on an Ubuntu 18.04 Linux machine connected to another server running VPP using 100G NICs. Both servers are Intel Xeon with 24 cores.
>>>>>>>>>
>>>>>>>>> May I ask the frequency for those cores? Also what type of nic are you using?
>>>>>>>>
>>>>>>>> 2700 MHz.
>>>>>>>>
>>>>>>>> Probably this somewhat limits throughput per single connection compared to my testbed where the Intel cpu boosts to 4GHz.
>>>>>>>
>>>>>>> Please see below, I noticed an anomaly.
>>>>>>>
>>>>>>>> The nic is a Pensando DSC100.
>>>>>>>>
>>>>>>>> Okay, not sure what to expect there. Since this mostly stresses the rx side, what’s the number of rx descriptors? Typically I test with 256; with more connections and higher throughput you might need more.
>>>>>>>
>>>>>>> This is the default - comments seem to suggest that is 1024. I don’t see any rx queue empty errors on the nic, which probably means there are sufficient buffers.
>>>>>>>
>>>>>>> Reasonable. Might want to try to reduce it down to 256 but performance will depend a lot on other things as well.
>>>>>>
>>>>>> This seems to help, but I do get rx queue empty nic drops. More below.
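As background on the knob being discussed here, the per-device rx descriptor count sits in the dpdk stanza of VPP's startup.conf. A minimal sketch, assuming the dpdk plugin's per-device options; the pci address matches the show hardware output below, and the values reflect the experiment (10 rx queues, 256 descriptors) rather than a recommendation:

dpdk {
  dev 0000:15:00.0 {
    num-rx-queues 10
    num-tx-queues 10
    # default is typically 1024; 256 is the value suggested above
    num-rx-desc 256
  }
}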
>>>>>> That’s somewhat expected to happen either when 1) the peer tries to probe for more throughput and bursts a bit more than we can handle or 2) a full vpp dispatch takes too long, which could happen because of the memcpy in tcp-established.
>>>>>>
>>>>>>>>> > I am trying to push 100G traffic from the iperf Linux TCP client by starting 10 parallel iperf connections on different port numbers and pinning them to different cores on the sender side. On the VPP receiver side I have 10 worker threads and 10 rx-queues in dpdk, and running iperf3 using the VCL library as follows:
>>>>>>>>> >
>>>>>>>>> > taskset 0x00400 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000" &
>>>>>>>>> > taskset 0x00800 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9001" &
>>>>>>>>> > taskset 0x01000 sh -c "LD_PRELOAD=/usr/lib/x86_64
>>>>>>>>> > ...
>>>>>>>>> >
>>>>>>>>> > MTU is set to 9216 everywhere, and TCP MSS set to 8200 on client:
>>>>>>>>> >
>>>>>>>>> > taskset 0x0001 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9000
>>>>>>>>> > taskset 0x0002 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9001
>>>>>>>>> > ...
>>>>>>>>>
>>>>>>>>> Could you try first with only 1 iperf server/client pair, just to see where performance is with that?
>>>>>>>>
>>>>>>>> These are the numbers I get:
>>>>>>>> rx-fifo-size 65536: ~8G
>>>>>>>> rx-fifo-size 524288: 22G
>>>>>>>> rx-fifo-size 4000000: 25G
>>>>>>>>
>>>>>>>> Okay, so 4MB is probably the sweet spot. Btw, could you check the vector rate (and the errors) in this case also?
>>>>>>>
>>>>>>> I noticed that adding "enable-tcp-udp-checksum" back seems to improve performance. Not sure if this is an issue with the dpdk driver for the nic. Anyway, in the "show hardware" flags I see now that tcp and udp checksum offloads are enabled:
>>>>>>>
>>>>>>> root@server:~# vppctl show hardware
>>>>>>>               Name                Idx   Link  Hardware
>>>>>>> eth0                               1     up   dsc1
>>>>>>>   Link speed: 100 Gbps
>>>>>>>   Ethernet address 00:ae:cd:03:79:51
>>>>>>>   ### UNKNOWN ###
>>>>>>>     carrier up full duplex mtu 9000
>>>>>>>     flags: admin-up pmd maybe-multiseg rx-ip4-cksum
>>>>>>>     Devargs:
>>>>>>>     rx: queues 4 (max 16), desc 1024 (min 16 max 32768 align 1)
>>>>>>>     tx: queues 5 (max 16), desc 1024 (min 16 max 32768 align 1)
>>>>>>>     pci: device 1dd8:1002 subsystem 1dd8:400a address 0000:15:00.00 numa 0
>>>>>>>     max rx packet len: 9208
>>>>>>>     promiscuous: unicast off all-multicast on
>>>>>>>     vlan offload: strip off filter off qinq off
>>>>>>>     rx offload avail: vlan-strip ipv4-cksum udp-cksum tcp-cksum vlan-filter jumbo-frame scatter
>>>>>>>     rx offload active: ipv4-cksum udp-cksum tcp-cksum jumbo-frame scatter
>>>>>>>     tx offload avail: vlan-insert ipv4-cksum udp-cksum tcp-cksum tcp-tso outer-ipv4-cksum multi-segs mbuf-fast-free outer-udp-cksum
>>>>>>>     tx offload active: multi-segs
>>>>>>>     rss avail: ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>>>>>>     rss active: ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>>>>>>     tx burst function: ionic_xmit_pkts
>>>>>>>     rx burst function: ionic_recv_pkts
>>>>>>>
>>>>>>> With this I get better performance per iperf3 connection - about 30.5G. Show run output attached (1connection.txt).
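For reference, the "enable-tcp-udp-checksum" knob being toggled here lives in the dpdk section of startup.conf; a minimal sketch (whether it helps or hurts appears to be driver dependent, as discussed next):

dpdk {
  # request rx tcp/udp checksum offload from the nic; driver dependent
  enable-tcp-udp-checksum
}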
>>>>>>> Interesting. Yes, dpdk does request offload rx ip/tcp checksum computation when possible but it currently (unless some of the pending patches were merged) does not mark the packet appropriately and ip4-local will recompute/validate the checksum. From your logs, it seems ip4-local needs ~1.8e3 cycles in the 1 connection setup and ~3.1e3 for 7 connections. That’s a lot, so it seems to confirm that the checksum is recomputed.
>>>>>>>
>>>>>>> So, it’s somewhat counterintuitive that performance improves. How do the show run numbers change? Could be that performance worsens because of tcp’s congestion recovery/flow control, i.e., the packets are processed faster but some component starts dropping/queues get full.
>>>>>>
>>>>>> That’s interesting. I got confused by the "show hardware" output since it doesn’t show any output against "tx offload active". You are right, though; it definitely uses fewer cycles without this option present, so I took it out for further tests. I am attaching the show run output for both the 1 connection and 7 connection case without this option present. With 1 connection, it appears VPP is not loaded at all since there is no batching happening?
>>>>>>
>>>>>> That’s probably because you’re using 9kB frames. It’s practically equivalent to LRO so vpp doesn’t need to work too much. Did throughput increase at all?
>>>>>
>>>>> Throughput varied between 26-30G.
>>>>>
>>>>> Sounds reasonable for the cpu frequency.
>>>>>
>>>>>> With 7 connections I do see it getting around 90-92G. When I drop the rx queue to 256, I do see some nic drops, but performance improves and I am getting 99G now.
>>>>>>
>>>>>> Awesome!
>>>>>>
>>>>>> Can you please explain why this makes a difference? Does it have to do with caches?
>>>>>>
>>>>>> There’s probably several things at play. First of all, we back pressure the sender with minimal cost, i.e., we minimize the data that we queue and we just drop as soon as we run out of space. So instead of us trying to buffer large bursts and deal with them later, we force the sender to drop the rate. Second, as you already guessed, this probably improves cache utilization because we end up touching fewer buffers.
>>>>>
>>>>> I see. I was trying to accomplish something similar by limiting the rx-fifo-size (rmem in linux) for each connection. So there is no issue with the ring size being equal to the VPP batch size? While VPP is working on a batch, what happens if more packets come in?
>>>>>
>>>>> They will be dropped. Typically tcp pacing should make sure that packets are not delivered in bursts; instead they’re spread over an rtt. For instance, see how small the vector rate is for 1 connection. Even if you multiply it by 4 (to reach 100Gbps) the vector rate is still small.
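The show run and error comparisons referenced throughout this exchange can be reproduced with a simple measurement loop like the sketch below; the commands are standard vppctl debug CLI, and the 10-second window is arbitrary:

# reset counters, let the iperf traffic run for a fixed window, then inspect
vppctl clear run
vppctl clear errors
sleep 10
vppctl show run              # per-node vector rates and clocks/packet (ip4-local, tcp4-established, ...)
vppctl show errors           # drop counters, e.g. nic rx queue empty or tcp drops
vppctl show session verbose  # per-session state and fifo usage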
>>>>>> Are the other cores kind of unusable now due to being on a different numa? With Linux TCP, I believe I was able to use most of the cores and scale the number of connections.
>>>>>>
>>>>>> They’re all usable, but it’s just that cross-numa memcpy is more expensive (the session layer buffers the data for the apps in the shared memory fifos). As the sessions are scaled up, each session will carry less data, so moving some of them to the other numa should not be a problem. But it all ultimately depends on the efficiency of the UPI interconnect.
>>>>>
>>>>> Sure, I will try these experiments.
>>>>>
>>>>> Sounds good. Let me know how it goes.
>>>>>
>>>>> Regards,
>>>>> Florin
>>>>>
>>>>> Thanks,
>>>>> Vijay
>>>>>
>>>>> <show_run_10_conn_cross_numa.txt>