> > > >>> This version of the patch seems to have a negative impact on
> > > >>> performance for burst traffic profile [1].
> > > >>> Benefits seen with the previous version (v2) were up to ~1.6x for
> > > >>> 1568 byte packets, compared to ~1.2x seen with the current design
> > > >>> (v3), as measured on new Intel hardware that supports DSA [2],
> > > >>> CPU @ 1.8GHz.
> > > >>> The cause of the drop seems to be the excessive vhost txq
> > > >>> contention across the PMD threads.
> > > >>
> > > >> So it means the Tx/Rx queue pairs aren't consumed by the same PMD
> > > >> thread. Can you confirm?
> > > >
> > > > Yes, the completion polls for a given txq happen on a single PMD
> > > > thread (the same thread where its corresponding rxq is being
> > > > polled), but other threads can submit (enqueue) packets on the same
> > > > txq, which leads to contention.
>
> It seems the 40% perf degradation is caused by virtqueue contention
> between the Rx and Tx PMD threads. But I am really curious about what
> causes up to a 40% perf drop:
> is it cores busy-waiting due to the spinlock, or cache thrashing of the
> virtqueue struct?
> Or something else?
>
> In the latest vhost patch, I have replaced the spinlock with a try-lock
> to avoid busy-waiting.
> If the OVS data path can also avoid busy-waiting, will it help
> performance? Could we have a try?
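[For reference, the try-lock idea above amounts to roughly the following: the polling thread attempts the lock once and simply skips the queue on contention instead of spinning, retrying on its next poll iteration. This is only an illustrative stand-in using pthread; the vhost patch uses its own spinlock API, and txq_poll_completions is a hypothetical name:]

```c
#include <pthread.h>
#include <stdbool.h>

struct txq {
    pthread_mutex_t lock;
    /* ... virtqueue / in-flight DMA state ... */
};

/* Poll one txq for DMA completions.  Returns false if another thread
 * currently holds the lock; the caller just moves on instead of
 * busy-waiting and retries on its next poll iteration. */
static bool
txq_poll_completions(struct txq *q)
{
    if (pthread_mutex_trylock(&q->lock) != 0) {
        return false;            /* Contended: skip, don't spin. */
    }
    /* ... check dmadev completions, free completed mbufs ... */
    pthread_mutex_unlock(&q->lock);
    return true;
}
```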
>
> > >
> > > Why can't this process be lockless?
> > > If we have to lock the device, maybe we can do both submission and
> > > completion from the thread that polls the corresponding Rx queue?
> > > Tx threads may enqueue mbufs to some lockless ring inside
> > > rte_vhost_enqueue_burst. The Rx thread may dequeue them and submit
> > > jobs to the dma device and check completions. No locks required.
>
> The lockless ring is like batching or caching for Tx packets. It can be
> done directly in OVS, IMHO. For example, a Tx queue has a lockless
> ring: the Tx thread inserts packets into the ring, and the Rx thread
> consumes packets from the ring, submits the copies, and polls for
> completions.
>
> Thanks,
> Jiayu
> > >
> >
> > Thank you for the comments, Ilya.
> >
> > Hi Jiayu, Maxime,
> >
> > Could I request your opinions on this from the vhost library perspective?
> >
> > Thanks and regards,
> > Sunil
>
Hi All,
An update on this.
After we improved the software to work better with the hardware, we no
longer see the drop in performance reported earlier, and we are now
getting stable performance results.
We also investigated Ilya's lockless ring suggestion to reduce the
amount of contention.
The updated results are shown below; the numbers are the relative gain
compared to the CPU-only baseline for the three different methods tried
for async.
In each case the configuration was: 4 dataplane threads, 32 vhost
ports, VXLAN traffic [1], lossy tests.
------------------------------------------------------------------------------------------------------------
|| Traffic type               ||                              burst mode[1]                               ||
------------------------------------------------------------------------------------------------------------
|| Frame size/Implementation  || CPU | work defer | V3 patch | V3 patch + lockless ring in OVS for async* ||
------------------------------------------------------------------------------------------------------------
|| 114                        ||  1  |    0.85    |   0.74   |                    0.77                    ||
------------------------------------------------------------------------------------------------------------
|| 2098                       ||  1  |    1.85    |   1.63   |                    1.75                    ||
------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------
|| Traffic type               ||                             scatter mode[1]                              ||
------------------------------------------------------------------------------------------------------------
|| Frame size/Implementation  || CPU | work defer | V3 patch | V3 patch + lockless ring in OVS for async* ||
------------------------------------------------------------------------------------------------------------
|| 114                        ||  1  |    0.79    |   0.78   |                    0.83                    ||
------------------------------------------------------------------------------------------------------------
|| 2098                       ||  1  |    1.51    |   1.50   |                    1.60                    ||
------------------------------------------------------------------------------------------------------------
This data is based on new Intel hardware that supports DSA [2], CPU @ 1.8GHz.
From an OVS code complexity point of view, here are the three
implementations ranked from most to least complex:
1. Work defer. Complexity is added to dpif-netdev as well as
netdev-dpdk, with async-free logic in both.
2. V3 + lockless ring. Complexity is added just to netdev-dpdk, with
async-free logic in OVS under the RX API wrapper, plus the lockless
ring complexity added in netdev-dpdk.
3. V3. Complexity is added just to netdev-dpdk, with async-free logic
in OVS under the RX API wrapper.
In all of the above implementations, the ownership (configuration and
use) of the dmadev resides with OVS in netdev-dpdk.
Work defer clearly provides the best performance but also adds the most
complexity.
In our view the additional performance merits the additional
complexity, but we are open to thoughts/comments from others.
*Note: a DPDK rte_ring in MP/SC mode was used as the lockless ring.
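To make the lockless-ring variant concrete: Tx threads stash mbufs in a
per-txq ring instead of taking the vhost txq lock, and the polling (Rx)
thread drains the ring, submits the DMA copies, and polls completions.
Our implementation uses a DPDK rte_ring in MP/SC mode; the sketch below
is a self-contained stand-in that shows only the single-producer case
with C11 atomics (rte_ring extends the same head/tail scheme to multiple
producers with a CAS on the producer head), and all names here are
illustrative:

```c
/* Self-contained SPSC stand-in for the per-txq lockless ring.
 * In the real patch this is an rte_ring holding struct rte_mbuf *. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 256            /* must be a power of two */
#define RING_MASK (RING_SIZE - 1)

struct pkt_ring {
    _Atomic size_t head;         /* next slot the producer writes */
    _Atomic size_t tail;         /* next slot the consumer reads */
    void *slots[RING_SIZE];
};

/* Tx thread: stash the packet instead of touching the vhost txq. */
static bool
ring_enqueue(struct pkt_ring *r, void *pkt)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail >= RING_SIZE) {
        return false;            /* Ring full: caller may drop or retry. */
    }
    r->slots[head & RING_MASK] = pkt;
    /* Publish the slot before advancing head so the consumer sees it. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Rx/polling thread: drain a burst, then submit the DMA copies and
 * poll completions without contending with the Tx threads. */
static size_t
ring_dequeue_burst(struct pkt_ring *r, void **pkts, size_t max)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    size_t avail = head - tail;
    size_t n = avail < max ? avail : max;

    for (size_t i = 0; i < n; i++) {
        pkts[i] = r->slots[(tail + i) & RING_MASK];
    }
    atomic_store_explicit(&r->tail, tail + n, memory_order_release);
    return n;
}
```

With this shape, only the polling thread ever touches the vhost txq and
the dmadev for a given queue, which is what removes the lock contention
seen in the earlier numbers.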
[1]:
https://builders.intel.com/docs/networkbuilders/open-vswitch-optimized-deployment-benchmark-technology-guide.pdf
[2]: https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
Thanks and Regards,
Sunil
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev