Hi Vijay, 

Quick replies inline. 

> On Sep 11, 2020, at 7:27 PM, Vijay Sampath <vsamp...@gmail.com> wrote:
> 
> Hi Florin,
> 
> Thanks once again for looking at this issue. Please see inline:
> 
> On Fri, Sep 11, 2020 at 2:06 PM Florin Coras <fcoras.li...@gmail.com 
> <mailto:fcoras.li...@gmail.com>> wrote:
> Hi Vijay, 
> 
> Inline.
> 
>> On Sep 11, 2020, at 1:08 PM, Vijay Sampath <vsamp...@gmail.com 
>> <mailto:vsamp...@gmail.com>> wrote:
>> 
>> Hi Florin,
>> 
>> Thanks for the response. Please see inline:
>> 
>> On Fri, Sep 11, 2020 at 10:42 AM Florin Coras <fcoras.li...@gmail.com 
>> <mailto:fcoras.li...@gmail.com>> wrote:
>> Hi Vijay, 
>> 
>> Cool experiment. More inline. 
>> 
>> > On Sep 11, 2020, at 9:42 AM, Vijay Sampath <vsamp...@gmail.com 
>> > <mailto:vsamp...@gmail.com>> wrote:
>> > 
>> > Hi,
>> > 
>> > I am using iperf3 as a client on an Ubuntu 18.04 Linux machine connected 
>> > to another server running VPP using 100G NICs. Both servers are Intel Xeon 
>> > with 24 cores.
>> 
>> May I ask the frequency for those cores? Also what type of nic are you using?
>> 
>> 2700 MHz. 
> 
> Probably this somewhat limits throughput per single connection compared to my 
> testbed where the Intel cpu boosts to 4GHz. 
>  
> Please see below, I noticed an anomaly. 
> 
> 
>> The nic is a Pensando DSC100.
> 
> Okay, not sure what to expect there. Since this mostly stresses the rx side, 
> what’s the number of rx descriptors? Typically I test with 256; with more 
> connections/higher throughput you might need more. 
>  
> This is the default - comments seem to suggest that it is 1024. I don't see any 
> rx queue empty errors on the nic, which probably means there are sufficient 
> buffers. 

Reasonable. You might want to try reducing it to 256, but performance will 
depend a lot on other things as well. 
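
In case it helps, the descriptor count can be set per nic in startup.conf's dpdk 
stanza; a rough sketch with the values discussed above (your device address, 4 
rx queues to match the 4 workers) would be: 

dpdk {
  dev 0000:15:00.0 {
    name eth0
    num-rx-queues 4
    num-rx-desc 256    # start low, bump up if you see rx queue empty errors
    num-tx-desc 1024
  }
}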

>> > I am trying to push 100G traffic from the iperf Linux TCP client by 
>> > starting 10 parallel iperf connections on different port numbers and 
>> > pinning them to different cores on the sender side. On the VPP receiver 
>> > side I have 10 worker threads and 10 rx-queues in dpdk, and running iperf3 
>> > using VCL library as follows
>> > 
>> > taskset 0x00400 sh -c 
>> > "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so 
>> > VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9000" &
>> > taskset 0x00800 sh -c 
>> > "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libvcl_ldpreload.so 
>> > VCL_CONFIG=/etc/vpp/vcl.conf iperf3 -s -4 -p 9001" &
>> > taskset 0x01000 sh -c "LD_PRELOAD=/usr/lib/x86_64
>> > ...
>> > 
>> > MTU is set to 9216 everywhere, and TCP MSS set to 8200 on client:
>> > 
>> > taskset 0x0001 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9000
>> > taskset 0x0002 iperf3 -c 10.1.1.102 -M 8200 -Z -t 6000 -p 9001
>> > ...
>> 
>> Could you try first with only 1 iperf server/client pair, just to see where 
>> performance is with that? 
>> 
>> These are the numbers I get
>> rx-fifo-size 65536: ~8G
>> rx-fifo-size 524288: 22G
>> rx-fifo-size 4000000: 25G
> 
> Okay, so 4MB is probably the sweet spot. Btw, could you check the vector rate 
> (and the errors) in this case also?  
> 
> I noticed that adding "enable-tcp-udp-checksum" back seems to improve 
> performance. Not sure if this is an issue with the dpdk driver for the nic. 
> Anyway in the "show hardware" flags I see now that tcp and udp checksum 
> offloads are enabled:
> 
> root@server:~# vppctl show hardware
>               Name                Idx   Link  Hardware
> eth0                               1     up   dsc1
>   Link speed: 100 Gbps
>   Ethernet address 00:ae:cd:03:79:51
>   ### UNKNOWN ###
>     carrier up full duplex mtu 9000
>     flags: admin-up pmd maybe-multiseg rx-ip4-cksum
>     Devargs:
>     rx: queues 4 (max 16), desc 1024 (min 16 max 32768 align 1)
>     tx: queues 5 (max 16), desc 1024 (min 16 max 32768 align 1)
>     pci: device 1dd8:1002 subsystem 1dd8:400a address 0000:15:00.00 numa 0
>     max rx packet len: 9208
>     promiscuous: unicast off all-multicast on
>     vlan offload: strip off filter off qinq off
>     rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum vlan-filter
>                        jumbo-frame scatter
>     rx offload active: ipv4-cksum udp-cksum tcp-cksum jumbo-frame scatter
>     tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum tcp-tso
>                        outer-ipv4-cksum multi-segs mbuf-fast-free 
> outer-udp-cksum
>     tx offload active: multi-segs
>     rss avail:         ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>     rss active:        ipv4-tcp ipv4-udp ipv4 ipv6-tcp ipv6-udp ipv6
>     tx burst function: ionic_xmit_pkts
>     rx burst function: ionic_recv_pkts
> 
> With this I get better performance per iperf3 connection - about 30.5G. Show 
> run output attached (1connection.txt)

Interesting. Yes, dpdk does request rx ip/tcp checksum offload when possible, 
but it currently (unless some of the pending patches were merged) does not mark 
the packet appropriately, so ip4-local will recompute/validate the checksum. 
From your logs, it seems ip4-local needs ~1.8e3 cycles in the 1 connection 
setup and ~3.1e3 for 7 connections. That’s a lot, so it seems to confirm that 
the checksum is recomputed. 

So it’s somewhat counterintuitive that performance improves. How do the show 
run numbers change? It could be that, without the sw checksum, performance 
worsens because of tcp’s congestion recovery/flow control, i.e., the packets 
are processed faster but some component starts dropping them or queues fill 
up. 
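
To compare apples to apples, something like the sequence below, run from the 
host shell while traffic is flowing, should be enough to grab the per-node 
clocks and errors with and without the sw checksum (assuming vppctl talks to 
your instance over the default cli socket): 

vppctl clear run
vppctl clear error
sleep 1
vppctl show run
vppctl show error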

> 
>  
>> rx-fifo-size 8000000: 25G
>>  
>> 
>> > 
>> > I see that I am not able to push beyond 50-60G. I tried different sizes 
>> > for the vcl rx-fifo-size - 64K, 256K and 1M. With 1M fifo size, I see that 
>> > tcp latency as reported on the client increases, but not a significant 
>> > improvement in bandwidth. Are there any suggestions to achieve 100G 
>> > bandwidth? I am using a vpp build from master.
>> 
>> Depends a lot on how many connections you’re running in parallel. With only 
>> one connection, buffer occupancy might go up, so 1-2MB might be better. 
>> 
>> With the current run I increased this to 8000000. 
>> 
>> Could you also check how busy vpp is with “clear run”, wait at least 1 
>> second, and then “show run”. That will give you per node/worker vector 
>> rates. If they go above 100 vectors/dispatch the workers are busy, so you 
>> could increase their number and implicitly the number of sessions. Note 
>> however that RSS is not perfect, so you can get more connections on one 
>> worker.  
>> 
>> I am attaching the output of this to the email (10 iperf connections, 4 
>> worker threads)
> 
> It’s clearly saturated. Could also do a “clear error”/“show error” and “clear 
> tcp stats”/“show tcp stats”? 
> 
> Because this is purely a server/receiver scenario for vpp, and because 
> tcp4-established seems to need a lot of clocks, make sure that iperf runs on 
> the same numa node that vpp’s workers and the nic are on. To see the nic’s 
> numa, use “show hardware”. 
> 
> For instance, in my testbed at ~37.5Gbps and 1 connection, tcp4-established 
> needs around 7e2 clocks. In your case it goes as high as 1.2e4, so it doesn’t 
> look like it’s only frequency related. 
> 
> I now repeated this test with all cores and nic on numa 0. Cores 1-4 are used 
> by VPP and 5-11 by iperf. I get about 63G. I am attaching the vpp statistics 
> for this case (7connection.txt). Looks like in this case nothing is hashing 
> to core 4.

It can happen; that’s why I would recommend first optimizing for 1 connection 
and 1 worker and then trying to increase the number of connections/workers. 
Still, good to see you’re now getting the same performance with only 4 
connections :-)
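
For the 1 worker starting point, the cpu stanza can be as simple as the sketch 
below (core numbers are only an example; keep them on the nic's numa and out of 
iperf's taskset masks): 

cpu {
  main-core 1
  corelist-workers 2    # one worker, pinned, no overlap with iperf
}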

> 
>> > Pasting below the output of vpp and vcl conf files:
>> > 
>> > cpu {
>> >   main-core 0
>> >   workers 10
>> 
>> You can pin vpp’s workers to cores with: corelist-workers c1,c3-cN to avoid 
>> overlap with iperf. You might want to start with 1 worker and work your way 
>> up from there. In my testing, 1 worker should be enough to saturate a 40Gbps 
>> nic with 1 iperf connection. Maybe you need a couple more to reach 100, but 
>> I wouldn’t expect more. 
>> 
>> I changed this to 4 cores and pinned them as you suggested.
> 
> See above wrt how vpp’s workers, iperf and the nic should all be on the same 
> numa. Make sure iperf and vpp’s workers don’t overlap. 
> 
> Done.
>  
> 
>>  
>> 
>> > }
>> > 
>> > buffers {
>> >   buffers-per-numa 65536
>> 
>> Unless you need the buffers for something else, 16k might be enough. 
>> 
>> >   default data-size 9216
>> 
>> Hm, no idea about the impact of this on performance. Session layer can build 
>> chained buffers so you can also try with this removed to see if it changes 
>> anything. 
>> 
>> For now, I kept this setting.
> 
> If possible, try with 1460 mtu and 2kB buffers, to see if that changes 
> anything. 
> 
> Sure I will try this. I am hitting some issues with the link not coming up 
> when I reduce the buffer data-size. It could be a driver issue.

Understood. 
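
For reference, once the link issue is sorted out, the smaller-buffer variant I 
had in mind would be roughly the following in startup.conf (values only 
illustrative): 

buffers {
  buffers-per-numa 16384
  default data-size 2048   # enough for a 1460 mtu frame
}
tcp {
  mtu 1460
}
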
>  
>> 
>> > }
>> > 
>> > dpdk {
>> >   dev 0000:15:00.0 {
>> >         name eth0
>> >         num-rx-queues 10
>> 
>> Keep this in sync with the number of workers
>> 
>> >   }
>> >   enable-tcp-udp-checksum
>> 
>> This enables sw checksum. For better performance, you’ll have to remove it. 
>> It will be needed however if you want to turn tso on. 
>> 
>> ok. removed.
>>  
>> 
>> > }
>> > 
>> > session {
>> >   evt_qs_memfd_seg
>> > }
>> > socksvr { socket-name /tmp/vpp-api.sock}
>> > 
>> > tcp {
>> >   mtu 9216
>> >   max-rx-fifo 262144
>> 
>> This is only used to compute the window scale factor. Given that your fifos 
>> might be larger, I would remove it. By default the value is 32MB and gives a 
>> wnd_scale of 10 (should be okay). 
>> 
>> When I was testing with Linux TCP stack on both sides, I was restricting the 
>> receive window per socket using net.ipv4.tcp_rmem to get better latency 
>> numbers. I want to mimic that with VPP. What is the right way to restrict 
>> the rcv_wnd on VPP?
> 
> The rcv_wnd is controlled by the rx fifo size. This value will limit the 
> wnd_scale, and the actual fifo size, if larger than 256kB, won’t be correctly 
> advertised. So it would be better to remove this and control the window only 
> through the rx fifo size. 
> 
> Sure, so I assume rx-fifo-size in vcl.conf is a per socket fifo size? 

Yup. 
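
In case it's useful as a reference, a minimal vcl.conf for this kind of test 
could look like the sketch below. I'm assuming the 4MB rx fifo that seemed to 
be the sweet spot above and a socket api path matching your socksvr config; 
segment-size just has to be large enough to hold all the fifos: 

vcl {
  rx-fifo-size 4000000
  tx-fifo-size 4000000
  segment-size 1000000000
  api-socket-name /tmp/vpp-api.sock
}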

Regards, 
Florin

> 
> Thanks,
> 
> Vijay
> <1connection.txt><7connection.txt>
