> > > >>> This version of the patch seems to have a negative impact on
> > > >>> performance for burst traffic profile [1].
> > > >>> Benefits seen with the previous version (v2) were up to ~1.6x for
> > > >>> 1568 byte packets, compared to ~1.2x seen with the current design
> > > >>> (v3), as measured on new Intel hardware that supports DSA [2],
> > > >>> CPU @ 1.8GHz.
> > > >>> The cause of the drop seems to be the excessive vhost txq
> > > >>> contention across the PMD threads.
> > > >>
> > > >> So it means the Tx/Rx queue pairs aren't consumed by the same PMD
> > > >> thread. Can you confirm?
> > > >
> > > > Yes, the completion polls for a given txq happen on a single PMD
> > > > thread (the same thread where its corresponding rxq is being
> > > > polled), but other threads can submit (enqueue) packets on the same
> > > > txq, which leads to contention.
>
> It seems the 40% perf degradation is caused by virtqueue contention
> between the Rx and Tx PMD threads. But I am really curious about what
> causes up to a 40% perf drop:
> is it cores busy-waiting due to the spinlock, or cache thrashing of the
> virtqueue struct?
> Or something else?
>
> In the latest vhost patch, I have replaced the spinlock with a try-lock
> to avoid busy-waiting.
> If the OVS data path can also avoid busy-waiting, will it help
> performance? Could we have a try?
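[For reference, the try-lock idea above amounts to roughly the following: the polling thread attempts the lock once and simply skips the queue on contention instead of spinning, retrying on its next poll iteration. This is only an illustrative stand-in using pthread; the vhost patch uses its own spinlock API, and txq_poll_completions is a hypothetical name:]

```c
#include <pthread.h>
#include <stdbool.h>

struct txq {
    pthread_mutex_t lock;
    /* ... virtqueue / in-flight DMA state ... */
};

/* Poll one txq for DMA completions.  Returns false if another thread
 * currently holds the lock; the caller just moves on instead of
 * busy-waiting and retries on its next poll iteration. */
static bool
txq_poll_completions(struct txq *q)
{
    if (pthread_mutex_trylock(&q->lock) != 0) {
        return false;            /* Contended: skip, don't spin. */
    }
    /* ... check dmadev completions, free completed mbufs ... */
    pthread_mutex_unlock(&q->lock);
    return true;
}
```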
>
> > >
> > > Why can't this process be lockless?
> > > If we have to lock the device, maybe we can do both submission and
> > > completion from the thread that polls the corresponding Rx queue?
> > > Tx threads may enqueue mbufs to some lockless ring inside
> > > rte_vhost_enqueue_burst. The Rx thread may dequeue them and submit
> > > jobs to the dma device and check completions. No locks required.
>
> The lockless ring is like batching or caching for Tx packets. It can be
> done directly in OVS, IMHO. For example, a Tx queue has a lockless
> ring: the Tx thread inserts packets into the ring, and the Rx thread
> consumes packets from the ring, submits the copies, and polls for
> completions.
>
> Thanks,
> Jiayu
> > >
> >
> > Thank you for the comments, Ilya.
> >
> > Hi Jiayu, Maxime,
> >
> > Could I request your opinions on this from the vhost library perspective?
> >
> > Thanks and regards,
> > Sunil
>
Hi All,
An update on this.
After we improved the software to work better with the hardware, we no
longer see the drop in performance reported earlier, and we are now
getting stable performance results.
We also investigated Ilya's lockless ring suggestion to reduce the
amount of contention.
The updated results are shown below; the numbers are the relative gain
compared to the CPU-only baseline for the three different methods tried
for async.
In each case the configuration was: 4 dataplane threads, 32 vhost
ports, VXLAN traffic [1], lossy tests.
------------------------------------------------------------------------------------------------------------
|| Traffic type               ||                              burst mode[1]                               ||
------------------------------------------------------------------------------------------------------------
|| Frame size/Implementation  || CPU | work defer | V3 patch | V3 patch + lockless ring in OVS for async* ||
------------------------------------------------------------------------------------------------------------
|| 114                        ||  1  |    0.85    |   0.74   |                    0.77                    ||
------------------------------------------------------------------------------------------------------------
|| 2098                       ||  1  |    1.85    |   1.63   |                    1.75                    ||
------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------
|| Traffic type               ||                             scatter mode[1]                              ||
------------------------------------------------------------------------------------------------------------
|| Frame size/Implementation  || CPU | work defer | V3 patch | V3 patch + lockless ring in OVS for async* ||
------------------------------------------------------------------------------------------------------------
|| 114                        ||  1  |    0.79    |   0.78   |                    0.83                    ||
------------------------------------------------------------------------------------------------------------
|| 2098                       ||  1  |    1.51    |   1.50   |                    1.60                    ||
------------------------------------------------------------------------------------------------------------
This data is based on new Intel hardware that supports DSA [2], CPU @ 1.8GHz.
From an OVS code complexity point of view, here are the three
implementations ranked from most to least complex:
1. Work defer. Complexity is added to dpif-netdev as well as
netdev-dpdk, with async-free logic in both.
2. V3 + lockless ring. Complexity is added just to netdev-dpdk, with
async-free logic in OVS under the RX API wrapper, plus the lockless
ring complexity added in netdev-dpdk.
3. V3. Complexity is added just to netdev-dpdk, with async-free logic
in OVS under the RX API wrapper.
In all of the above implementations, the ownership (configuration and
use) of the dmadev resides with OVS in netdev-dpdk.
Work defer clearly provides the best performance but also adds the most
complexity.
In our view the additional performance merits the additional
complexity, but we are open to thoughts/comments from others.
*Note: a DPDK rte_ring in MP/SC mode was used as the lockless ring.
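To make the lockless-ring variant concrete: Tx threads stash mbufs in a
per-txq ring instead of taking the vhost txq lock, and the polling (Rx)
thread drains the ring, submits the DMA copies, and polls completions.
Our implementation uses a DPDK rte_ring in MP/SC mode; the sketch below
is a self-contained stand-in that shows only the single-producer case
with C11 atomics (rte_ring extends the same head/tail scheme to multiple
producers with a CAS on the producer head), and all names here are
illustrative:

```c
/* Self-contained SPSC stand-in for the per-txq lockless ring.
 * In the real patch this is an rte_ring holding struct rte_mbuf *. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 256            /* must be a power of two */
#define RING_MASK (RING_SIZE - 1)

struct pkt_ring {
    _Atomic size_t head;         /* next slot the producer writes */
    _Atomic size_t tail;         /* next slot the consumer reads */
    void *slots[RING_SIZE];
};

/* Tx thread: stash the packet instead of touching the vhost txq. */
static bool
ring_enqueue(struct pkt_ring *r, void *pkt)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail >= RING_SIZE) {
        return false;            /* Ring full: caller may drop or retry. */
    }
    r->slots[head & RING_MASK] = pkt;
    /* Publish the slot before advancing head so the consumer sees it. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Rx/polling thread: drain a burst, then submit the DMA copies and
 * poll completions without contending with the Tx threads. */
static size_t
ring_dequeue_burst(struct pkt_ring *r, void **pkts, size_t max)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    size_t avail = head - tail;
    size_t n = avail < max ? avail : max;

    for (size_t i = 0; i < n; i++) {
        pkts[i] = r->slots[(tail + i) & RING_MASK];
    }
    atomic_store_explicit(&r->tail, tail + n, memory_order_release);
    return n;
}
```

With this shape, only the polling thread ever touches the vhost txq and
the dmadev for a given queue, which is what removes the lock contention
seen in the earlier numbers.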
[1]:
https://builders.intel.com/docs/networkbuilders/open-vswitch-optimized-deployment-benchmark-technology-guide.pdf
[2]: https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
Thanks and Regards,
Sunil
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev