Hi Luca, 

We don’t yet support PMTU discovery in the stack, so TCP uses a fixed 1460 MTU. Unless 
you changed that, we shouldn’t generate jumbo packets. If we do, I’ll have to take 
a look at it :)
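
For reference, the segment size that follows from a fixed MTU is just the MTU minus 
the IP and TCP headers. A rough illustration only (not the actual stack code; it 
assumes IPv4 and no TCP options):

    #include <stdint.h>

    #define FIXED_MTU    1460   /* fixed value used while PMTU discovery is missing */
    #define IP4_HDR_LEN  20     /* IPv4 header, no options */
    #define TCP_HDR_LEN  20     /* TCP header, no options */

    /* Largest TCP payload per packet under the fixed MTU: 1460 - 40 = 1420. */
    static inline uint16_t
    default_tcp_mss (void)
    {
      return FIXED_MTU - IP4_HDR_LEN - TCP_HDR_LEN;
    }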

If you already have your own transport protocol, using memif is the natural way to 
go. Using the session layer makes sense only if you can implement your 
transport within vpp in a way that leverages vectorization, or if it can 
leverage the existing transports (see, for instance, the TLS implementation).

Until today [1] the stack allowed excessive batching (generating multiple 
frames in one dispatch loop), but we now restrict that to one frame per loop. 
This is still far from proper pacing, which is on our todo list. 
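
Conceptually the change amounts to something like the sketch below (illustrative 
only, with a stand-in fifo type and the packetization elided, not the actual patch):

    #include <stdint.h>

    #define FRAME_MAX_PKTS 256      /* packets in one vlib frame */
    #define SEG_BYTES      1420     /* payload bytes consumed per segment */

    typedef struct { uint32_t bytes_pending; } tx_fifo_t;   /* stand-in type */

    /* Cap the burst at one frame's worth of segments per dispatch loop. */
    static int
    session_tx_burst (tx_fifo_t *f)
    {
      int n_pkts = 0;
      while (f->bytes_pending > 0 && n_pkts < FRAME_MAX_PKTS)
        {
          uint32_t take = f->bytes_pending < SEG_BYTES ? f->bytes_pending : SEG_BYTES;
          f->bytes_pending -= take;   /* packetize 'take' bytes (elided) */
          n_pkts++;
        }
      /* Leftover data waits for the next dispatch; proper pacing would
         additionally space these transmissions out in time. */
      return n_pkts;
    }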

Florin

[1] https://gerrit.fd.io/r/#/c/12439/


> On May 9, 2018, at 4:21 AM, Luca Muscariello (lumuscar) <lumus...@cisco.com> 
> wrote:
> 
> Florin,
>
> Thanks for the slide deck, I’ll check it soon.
>
> BTW, the VPP/DPDK test was using jumbo frames by default, so the TCP stack had a 
> little advantage over the Linux TCP stack, which was using a 1500B MTU by default.
>
> By manually setting the DPDK MTU to 1500B, the goodput goes down to 8.5Gbps, which 
> compares to 4.5Gbps for Linux w/o TSO. Also, the congestion window adaptation is 
> not the same.
>
> BTW, for what we’re doing it is difficult to reuse the VPP session layer as 
> it is.
> Our transport stack uses a different kind of namespace and mux/demux is also 
> different.
>
> We are using memif as the underlying driver, which does not seem to be a
> bottleneck, as we can also control batching there. Also, we have our own
> shared memory downstream of memif inside VPP, through a plugin.
>
> What we observed is that delay-based congestion control does not like
> VPP batching much (or batching in general), and we are using DBCG.
>
> Linux TSO has the same problem, but it has TCP pacing to limit the bad effects 
> of bursts on RTT/losses and on the flow control laws.
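>
> Conceptually, pacing just spreads a window of data over an RTT instead of sending 
> it back to back; a rough sketch (not the Linux implementation):
>
>     /* Pace at roughly cwnd/srtt instead of bursting a full window. */
>     double pacing_rate (double cwnd_bytes, double srtt_sec)
>     {
>       return cwnd_bytes / srtt_sec;        /* bytes per second */
>     }
>
>     double next_send_time (double now, double pkt_bytes, double rate)
>     {
>       return now + pkt_bytes / rate;       /* gap before the next packet */
>     }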
>
> I guess you’re aware of these issues already.
>
> Luca
>
>
> From: Florin Coras <fcoras.li...@gmail.com>
> Date: Monday 7 May 2018 at 22:23
> To: Luca Muscariello <lumus...@cisco.com>
> Cc: Luca Muscariello <lumuscar+f...@cisco.com>, "vpp-dev@lists.fd.io" 
> <vpp-dev@lists.fd.io>
> Subject: Re: [vpp-dev] TCP performance - TSO - HW offloading in general.
>
> Yes, the whole host stack uses shared memory segments and fifos that the 
> session layer manages. For a brief description of the session layer see [1, 
> 2]. Apart from that, unfortunately, we don’t have any other dev 
> documentation. src/vnet/session/segment_manager.[ch] has some good examples 
> of how to allocate segments and fifos. Under application_interface.h check 
> app_[send|recv]_[stream|dgram]_raw for examples on how to read/write to the 
> fifos.
>
> Now, regarding writing to the fifos: they are lock free, but size increments are 
> atomic, since the assumption is that we’ll always have one reader and one writer. 
> Still, batching helps. VCL doesn’t do it, but iperf probably does. 
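>
> To give a flavor of the lock-free setup, here is a simplified toy sketch (not the 
> real svm_fifo_t layout); only the shared occupancy counter is updated atomically:
>
>     #include <stdint.h>
>
>     typedef struct
>     {
>       uint32_t cursize;   /* bytes enqueued, the only field both sides update */
>       uint32_t nitems;    /* capacity */
>       uint32_t head;      /* consumer-only cursor */
>       uint32_t tail;      /* producer-only cursor */
>       uint8_t *data;
>     } toy_fifo_t;
>
>     static uint32_t
>     toy_fifo_enqueue (toy_fifo_t *f, const uint8_t *src, uint32_t len)
>     {
>       uint32_t used = __atomic_load_n (&f->cursize, __ATOMIC_ACQUIRE);
>       uint32_t space = f->nitems - used;
>       len = len < space ? len : space;
>       for (uint32_t i = 0; i < len; i++)
>         f->data[(f->tail + i) % f->nitems] = src[i];
>       f->tail = (f->tail + len) % f->nitems;
>       __atomic_fetch_add (&f->cursize, len, __ATOMIC_RELEASE);
>       return len;
>     }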
>
> Hope this helps, 
> Florin
>
> [1] https://wiki.fd.io/view/VPP/HostStack/SessionLayerArchitecture
> [2] https://wiki.fd.io/images/1/15/Vpp-hoststack-kc-eu-18.pdf
> 
> 
>> On May 7, 2018, at 11:35 AM, Luca Muscariello (lumuscar) <lumus...@cisco.com> wrote:
>>
>> Florin,
>>
>> So the TCP stack does not connect to VPP using memif.
>> I’ll check the shared memory you mentioned.
>>
>> For our transport stack we’re using memif. Nothing to 
>> do with TCP though.
>>
>> From iperf3 to VPP there must be copies anyway. 
>> There must be some batching with timing, though, 
>> while doing these copies.
>>
>> Is there any doc of svm_fifo usage?
>>
>> Thanks
>> Luca 
>> 
>> On 7 May 2018, at 20:00, Florin Coras <fcoras.li...@gmail.com> wrote:
>> 
>>> Hi Luca,
>>>
>>> I guess, as you did, that it’s vectorization. VPP is really good at pushing 
>>> packets, whereas Linux is good at using all the hw optimizations. 
>>>
>>> The stack uses its own shared memory mechanisms (check svm_fifo_t), but 
>>> given that you did the testing with iperf3, I suspect the edge is not 
>>> there. That is, I guess they’re not abusing syscalls with lots of small 
>>> writes. Moreover, the fifos are not zero-copy: apps do have to write to the 
>>> fifo, and vpp has to packetize that data. 
>>>
>>> Florin
>>> 
>>> 
>>>> On May 7, 2018, at 10:29 AM, Luca Muscariello (lumuscar) <lumus...@cisco.com> wrote:
>>>>
>>>> Hi Florin 
>>>>
>>>> Thanks for the info.
>>>>
>>>> So, how do you explain that the VPP TCP stack beats the Linux
>>>> implementation by doubling the goodput?
>>>> Does it come from vectorization? 
>>>> Any special memif optimization underneath?
>>>>
>>>> Luca 
>>>> 
>>>> On 7 May 2018, at 18:17, Florin Coras <fcoras.li...@gmail.com> wrote:
>>>> 
>>>>> Hi Luca, 
>>>>>
>>>>> We don’t yet support TSO because it requires support within all of vpp 
>>>>> (think tunnels). Still, it’s on our list. 
>>>>>
>>>>> As for crypto offload, we do have support for IPsec offload with QAT 
>>>>> cards, and we’re now working with Ping and Ray from Intel on accelerating 
>>>>> the TLS OpenSSL engine with QAT cards as well. 
>>>>>
>>>>> Regards, 
>>>>> Florin
>>>>> 
>>>>> 
>>>>>> On May 7, 2018, at 7:53 AM, Luca Muscariello <lumuscar+f...@cisco.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> A few questions about the TCP stack and HW offloading.
>>>>>> Below is the experiment under test.
>>>>>>
>>>>>>   +---------------+                            +---------------+
>>>>>>   | LXC           |        +------------+      |           LXC |
>>>>>>   | Iperf3        |  DPDK  |   Nexus    | DPDK |        Iperf3 |
>>>>>>   |  TCP / VPP    +--------+   Switch   +------+   TCP / VPP   |
>>>>>>   |               |  10GE  |            | 10GE |               |
>>>>>>   +---------------+        +------------+      +---------------+
>>>>>>
>>>>>>
>>>>>> Using the Linux kernel w/ or w/o TSO I get an iperf3 goodput of 9.5Gbps 
>>>>>> or 4.5Gbps, respectively.
>>>>>> Using the VPP TCP stack I get 9.2Gbps, i.e. roughly the same max goodput 
>>>>>> as Linux w/ TSO.
>>>>>>
>>>>>> Is there any TSO implementation already in VPP one can take advantage of?
>>>>>>
>>>>>> Side question: is there any crypto offloading service available in VPP? 
>>>>>> Essentially for the computation of RSA-1024/2048 and ECDSA-192/256 
>>>>>> signatures.
>>>>>>
>>>>>> Thanks
>>>>>> Luca
>>>>>>
>>>>> 
>>>>>
>>> 
>>>
>>> 
