Hi Florin,

I just realized that maybe in the VPP case there is an extra copy: the data is
copied once from the mbuf to the shared fifo, and again from the shared fifo to
the application buffer. In Linux, there is probably just the one copy from
kernel space to user space. Please correct me if I am wrong. If so, what I am
doing is not an apples-to-apples comparison.

Thanks,

Vijay

On Tue, Sep 15, 2020 at 8:54 AM Vijay Sampath <vsamp...@gmail.com> wrote:

> Hi Florin,
>
> In the 1 iperf connection test, I get different numbers every time I run.
> When I ran today:
>
> - iperf and vpp on the same numa node as the pci device: 50Gbps (although in
> different runs I saw 30Gbps also)
> - vpp on the same numa node as the pci device, iperf on the other numa node: 28Gbps
> - vpp and iperf both on the other numa node from the pci device: 36Gbps
>
> These numbers vary from test to test, but I was never able to get
> beyond 50G with 10 connections with iperf on the other numa node. As I
> mentioned in the previous email, when I repeat this test with Linux TCP as
> the server, I am able to get 100G no matter which cores I start iperf on.
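>
> (For reference, the vpp side of this placement is controlled from the
> startup.conf cpu stanza, roughly as sketched below; the core numbers are only
> illustrative, not my exact ones, and iperf itself is placed with taskset:)
>
> cpu {
>   main-core 1
>   # for the "same numa" runs, workers on cores local to the nic's numa node
>   corelist-workers 2-11
> }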
>
> Thanks,
>
> Vijay
>
> On Mon, Sep 14, 2020 at 8:30 PM Florin Coras <fcoras.li...@gmail.com>
> wrote:
>
>> Hi Vijay,
>>
>> In this sort of setup, with few connections, probably it’s inevitable to
>> lose throughput because of the cross-numa memcpy. In your 1 iperf
>> connection test, did you only change iperf’s numa or vpp’s worker as well?
>>
>> Regards,
>> Florin
>>
>> On Sep 14, 2020, at 6:35 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>
>> Hi Florin,
>>
>> I ran some experiments going cross-numa, and I see that I am not able to
>> go beyond 50G. I tried different numbers of worker threads (5, 8 and
>> 10), and going up to 10 iperf servers. I am attaching the show run output
>> with 10 workers. When I run the same experiment in Linux, I don't see a
>> difference in the bandwidth - iperf servers on both numa nodes are able to
>> achieve 100G. Do you have any suggestions on other experiments to try?
>>
>> I also tried 1 iperf connection - the bandwidth dropped from 33G to
>> 23G when moving iperf from the same numa node as the nic to the other one.
>>
>> Thanks,
>>
>> Vijay
>>
>> On Sat, Sep 12, 2020 at 2:40 PM Florin Coras <fcoras.li...@gmail.com>
>> wrote:
>>
>>> Hi Vijay,
>>>
>>>
>>> On Sep 12, 2020, at 12:06 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>
>>> Hi Florin,
>>>
>>> On Sat, Sep 12, 2020 at 11:44 AM Florin Coras <fcoras.li...@gmail.com>
>>> wrote:
>>>
>>>> Hi Vijay,
>>>>
>>>>
>>>> On Sep 12, 2020, at 10:06 AM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>
>>>> Hi Florin,
>>>>
>>>> On Fri, Sep 11, 2020 at 11:23 PM Florin Coras <fcoras.li...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Vijay,
>>>>>
>>>>> Quick replies inline.
>>>>>
>>>>> On Sep 11, 2020, at 7:27 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
>>>>>
>>>>> Hi Florin,
>>>>>
>>>>> Thanks once again for looking at this issue. Please see inline:
>>>>>
>>>>> On Fri, Sep 11, 2020 at 2:06 PM Florin Coras <fcoras.li...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Vijay,
>>>>>>
>>>>>> Inline.
>>>>>>
>>>>>> On Sep 11, 2020, at 1:08 PM, Vijay Sampath <vsamp...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi Florin,
>>>>>>
>>>>>> Thanks for the response. Please see inline:
>>>>>>
>>>>>> On Fri, Sep 11, 2020 at 10:42 AM Florin Coras <fcoras.li...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Vijay,
>>>>>>>
>>>>>>> Cool experiment. More inline.
>>>>>>>
>>>>>>> > On Sep 11, 2020, at 9:42 AM, Vijay Sampath <vsamp...@gmail.com>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > Hi,
>>>>>>> >
>>>>>>> > I am using iperf3 as a client on an Ubuntu 18.04 Linux machine
>>>>>>> > connected to another server running VPP using 100G NICs. Both servers
>>>>>>> > are Intel Xeon with 24 cores.
>>>>>>>
>>>>>>> May I ask the frequency for those cores? Also what type of nic are
>>>>>>> you using?
>>>>>>>
>>>>>>
>>>>>> 2700 MHz.
>>>>>>
>>>>>>
>>>>>> Probably this somewhat limits throughput per single connection
>>>>>> compared to my testbed where the Intel cpu boosts to 4GHz.
>>>>>>
>>>>>
>>>>> Please see below, I noticed an anomaly.
>>>>>
>>>>>
>>>>>> The nic is a Pensando DSC100.
>>>>>>
>>>>>>
>>>>>> Okay, not sure what to expect there. Since this mostly stresses the
>>>>>> rx side, what’s the number of rx descriptors? Typically I test with 256,
>>>>>> with more connections and higher throughput you might need more.
>>>>>>
>>>>>
>>>>> This is the default - comments seem to suggest that it is 1024. I don't
>>>>> see any rx queue empty errors on the nic, which probably means there are
>>>>> sufficient buffers.
>>>>>
>>>>>
>>>>> Reasonable. Might want to try to reduce it down to 256 but performance
>>>>> will depend a lot on other things as well.
>>>>>
>>>>
>>>> This seems to help, but I do get rx queue empty nic drops. More below.
>>>>
>>>>
>>>> That’s somewhat expected to happen either when 1) the peer tries to
>>>> probe for more throughput and bursts a bit more than we can handle, or 2) a
>>>> full vpp dispatch takes too long, which could happen because of the memcpy
>>>> in tcp-established.
>>>>
>>>>
>>>>
>>>>>
>>>>>>> > I am trying to push 100G traffic from the iperf Linux TCP client by
>>>>>>> > starting 10 parallel iperf connections on different port numbers and
>>>>>>> > pinning them to different cores on the sender side. On the VPP receiver
>>>>>>> > side I have 10 worker threads and 10 rx-queues in dpdk, and I am running
>>>>>>> > iperf3 using the VCL library as follows:
>>>>>>> >
>>>>>>> > taskset 0x00400 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000" &
>>>>>>> > taskset 0x00800 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9001" &
>>>>>>> > taskset 0x01000 sh -c "LD_PRELOAD=/usr/lib/x86_64
>>>>>>> > ...
>>>>>>> >
>>>>>>> > MTU is set to 9216 everywhere, and TCP MSS set to 8200 on client:
>>>>>>> >
>>>>>>> > taskset 0x0001 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9000
>>>>>>> > taskset 0x0002 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9001
>>>>>>> > ...
>>>>>>>
>>>>>>> Could you try first with only 1 iperf server/client pair, just to
>>>>>>> see where performance is with that?
>>>>>>>
>>>>>>
>>>>>> These are the numbers I get
>>>>>> rx-fifo-size 65536: ~8G
>>>>>> rx-fifo-size 524288: 22G
>>>>>> rx-fifo-size 4000000: 25G
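>>>>>>
>>>>>> (These sizes come from the vcl.conf passed via VCL_CONFIG; a minimal
>>>>>> sketch of the 4MB case, other vcl settings omitted:)
>>>>>>
>>>>>> vcl {
>>>>>>   rx-fifo-size 4000000
>>>>>>   tx-fifo-size 4000000
>>>>>> }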
>>>>>>
>>>>>>
>>>>>> Okay, so 4MB is probably the sweet spot. Btw, could you check the
>>>>>> vector rate (and the errors) in this case also?
>>>>>>
>>>>>
>>>>> I noticed that adding "enable-tcp-udp-checksum" back seems to improve
>>>>> performance. Not sure if this is an issue with the dpdk driver for the 
>>>>> nic.
>>>>> Anyway in the "show hardware" flags I see now that tcp and udp checksum
>>>>> offloads are enabled:
>>>>>
>>>>> root@server:~# vppctl show hardware
>>>>>               Name                Idx   Link  Hardware
>>>>> eth0                               1     up   dsc1
>>>>>   Link speed: 100 Gbps
>>>>>   Ethernet address 00:ae:cd:03:79:51
>>>>>   ### UNKNOWN ###
>>>>>     carrier up full duplex mtu 9000
>>>>>     flags: admin-up pmd maybe-multiseg rx-ip4-cksum
>>>>>     Devargs:
>>>>>     rx: queues 4 (max 16), desc 1024 (min 16 max 32768 align 1)
>>>>>     tx: queues 5 (max 16), desc 1024 (min 16 max 32768 align 1)
>>>>>     pci: device 1dd8:1002 subsystem 1dd8:400a address 0000:15:00.00
>>>>> numa 0
>>>>>     max rx packet len: 9208
>>>>>     promiscuous: unicast off all-multicast on
>>>>>     vlan offload: strip off filter off qinq off
>>>>>     rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum
>>>>> vlan-filter
>>>>>                        jumbo-frame scatter
>>>>>     rx offload active: ipv4-cksum udp-cksum tcp-cksum jumbo-frame
>>>>> scatter
>>>>>     tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum
>>>>> tcp-tso
>>>>>                        outer-ipv4-cksum multi-segs mbuf-fast-free
>>>>> outer-udp-cksum
>>>>>     tx offload active: multi-segs
>>>>>     rss avail:         ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>>>>     rss active:        ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>>>>>     tx burst function: ionic_xmit_pkts
>>>>>     rx burst function: ionic_recv_pkts
>>>>>
>>>>> With this I get better performance per iperf3 connection - about
>>>>> 30.5G. Show run output attached (1connection.txt)
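>>>>>
>>>>> (For reference, the option is just a flag in the startup.conf dpdk stanza;
>>>>> a sketch, with the dev line matching the pci address shown above:)
>>>>>
>>>>> dpdk {
>>>>>   dev 0000:15:00.0
>>>>>   enable-tcp-udp-checksum
>>>>> }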
>>>>>
>>>>>
>>>>> Interesting. Yes, dpdk does request offload rx ip/tcp checksum
>>>>> computation when possible but it currently (unless some of the pending
>>>>> patches were merged) does not mark the packet appropriately and ip4-local
>>>>> will recompute/validate the checksum. From your logs, it seems ip4-local
>>>>> needs ~1.8e3 cycles in the 1 connection setup and ~3.1e3 for 7 
>>>>> connections.
>>>>> That’s a lot, so it seems to confirm that the checksum is recomputed.
>>>>>
>>>>> So, it’s somewhat counterintuitive that performance
>>>>> improves. How do the show run numbers change? Could be that performance
>>>>> worsens because of tcp’s congestion recovery/flow control, i.e., the
>>>>> packets are processed faster but some component starts dropping/queues get
>>>>> full.
>>>>>
>>>>
>>>> That's interesting. I got confused by the "show hardware" output since
>>>> it doesn't show any output against "tx offload active". You are right,
>>>> though; it definitely uses fewer cycles without this option present, so I
>>>> took it out for further tests. I am attaching the show run output for both
>>>> the 1 connection and 7 connection cases without this option present. With 1
>>>> connection, it appears VPP is not loaded at all since there is no batching
>>>> happening?
>>>>
>>>>
>>>> That’s probably because you’re using 9kB frames. It’s practically
>>>> equivalent to LRO so vpp doesn’t need to work too much. Did throughput
>>>> increase at all?
>>>>
>>>
>>> Throughput varied between 26-30G.
>>>
>>>
>>> Sounds reasonable for the cpu frequency.
>>>
>>>
>>>
>>>>
>>>> With 7 connections I do see it getting around 90-92G. When I drop the
>>>> rx queue size to 256, I do see some nic drops, but performance improves and I am
>>>> getting 99G now.
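>>>>
>>>> (For completeness, the descriptor count can be set per device with
>>>> num-rx-desc in the startup.conf dpdk stanza; a sketch:)
>>>>
>>>> dpdk {
>>>>   dev 0000:15:00.0 {
>>>>     num-rx-desc 256
>>>>   }
>>>> }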
>>>>
>>>>
>>>> Awesome!
>>>>
>>>> Can you please explain why this makes a difference? Does it have to do
>>>> with caches?
>>>>
>>>>
>>>> There are probably several things at play. First of all, we back pressure
>>>> the sender with minimal cost, i.e., we minimize the data that we queue and
>>>> we just drop as soon as we run out of space. So instead of us trying to
>>>> buffer large bursts and deal with them later, we force the sender to drop
>>>> the rate. Second, as you already guessed, this probably improves cache
>>>> utilization because we end up touching fewer buffers.
>>>>
>>>
>>> I see. I was trying to accomplish something similar by limiting the
>>> rx-fifo-size (rmem in linux) for each connection. So there is no issue with
>>> the ring size being equal to the VPP batch size? While VPP is working on a
>>> batch, what happens if more packets come in?
>>>
>>>
>>> They will be dropped. Typically tcp pacing should make sure that packets
>>> are not delivered in bursts; instead they’re spread over an rtt. For
>>> instance, see how small the vector rate is for 1 connection. Even if you
>>> multiply it by 4 (to reach 100Gbps) the vector rate is still small.
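>>>
>>> (Concretely, that is the Vectors/Call column in the receiver's
>>>
>>> vppctl show run
>>>
>>> output; vpp dispatches at most 256 packets per frame, so values well below
>>> 256 per worker mean the bursts stay small.)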
>>>
>>>
>>>
>>>>
>>>>
>>>> Are the other cores kind of unusable now due to being on a different
>>>> numa? With Linux TCP, I believe I was able to use most of the cores and
>>>> scale the number of connections.
>>>>
>>>>
>>>> They’re all usable but it’s just that cross-numa memcpy is more
>>>> expensive (session layer buffers the data for the apps in the shared memory
>>>> fifos). As the sessions are scaled up, each session will carry less data,
>>>> so moving some of them to the other numa should not be a problem. But it
>>>> all ultimately depends on the efficiency of the UPI interconnect.
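>>>>
>>>> (On the iperf side, one way to do the per-numa placement is numactl rather
>>>> than raw taskset masks; the node id below is illustrative:
>>>>
>>>> numactl --cpunodebind=1 --membind=1 sh -c "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000"
>>>>
>>>> which keeps both the iperf threads and their memory on one node.)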
>>>>
>>>
>>>
>>> Sure, I will try these experiments.
>>>
>>>
>>> Sounds good. Let me know how it goes.
>>>
>>> Regards,
>>> Florin
>>>
>>>
>>> Thanks,
>>>
>>> Vijay
>>>
>>>
>>> <show_run_10_conn_cross_numa.txt>
>>
>>
>>