Hi all,

> -----Original Message-----
> From: Pai G, Sunil <[email protected]>
> Sent: Wednesday, February 9, 2022 4:32 PM
> To: Ilya Maximets <[email protected]>; Maxime Coquelin
> <[email protected]>; [email protected]; Hu, Jiayu
> <[email protected]>
> Cc: Van Haaren, Harry <[email protected]>; Ferriter, Cian
> <[email protected]>; Stokes, Ian <[email protected]>;
> [email protected]; Mcnamara, John <[email protected]>
> Subject: RE: [PATCH RFC dpdk-latest v3 0/1] Enable vhost async API's in OvS.
> 
> > >>> This version of the patch seems to have negative impact on
> > >>> performance
> > >> for burst traffic profile[1].
> > >>> Benefits seen with the previous version (v2) was up to ~1.6x for
> > >>> 1568 byte
> > >> packets compared to ~1.2x seen with the current design (v3) as
> > >> measured on new Intel hardware that supports DSA [2] , CPU @ 1.8Ghz.
> > >>> The cause of the drop seems to be because of the excessive vhost
> > >>> txq
> > >> contention across the PMD threads.
> > >>
> > >> So it means the Tx/Rx queue pairs aren't consumed by the same PMD
> > >> thread. can you confirm?
> > >
> > > Yes, the completion polls for a given txq happens on a single PMD
> > thread(on the same thread where its corresponding rxq is being polled)
> > but other threads can submit(enqueue) packets on the same txq,  which
> > leads to contention.

It seems the ~40% perf degradation is caused by virtqueue contention between the
Rx and Tx PMD threads, but I am really curious what exactly causes such a large
drop. Is it cores busy-waiting on the spinlock, cache thrashing of the virtqueue
struct, or something else?

In the latest vhost patch, I have replaced the spinlock with a try-lock to avoid
busy-waiting. If the OVS data path can also avoid busy-waiting, will it help
performance? Could we give it a try?

> >
> > Why this process can't be lockless?
> > If we have to lock the device, maybe we can do both submission and
> > completion from the thread that polls corresponding Rx queue?
> > Tx threads may enqueue mbufs to some lockless ring inside the
> > rte_vhost_enqueue_burst.  Rx thread may dequeue them and submit jobs
> > to dma device and check completions.  No locks required.

The lockless ring is essentially batching or caching of Tx packets, and IMHO it
can be done directly in OVS. For example, each Tx queue gets a lockless ring:
Tx threads insert packets into the ring, and the Rx thread consumes packets from
the ring, submits the copies, and polls for completions.

Thanks,
Jiayu
> >
> 
> Thank you for the comments, Ilya.
> 
> Hi Jiayu, Maxime,
> 
> Could I request your opinions on this from the vhost library perspective ?
> 
> Thanks and regards,
> Sunil

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
