Hi all,

> -----Original Message-----
> From: Pai G, Sunil <[email protected]>
> Sent: Wednesday, February 9, 2022 4:32 PM
> To: Ilya Maximets <[email protected]>; Maxime Coquelin
> <[email protected]>; [email protected]; Hu, Jiayu
> <[email protected]>
> Cc: Van Haaren, Harry <[email protected]>; Ferriter, Cian
> <[email protected]>; Stokes, Ian <[email protected]>;
> [email protected]; Mcnamara, John <[email protected]>
> Subject: RE: [PATCH RFC dpdk-latest v3 0/1] Enable vhost async API's in OvS.
>
> > >>> This version of the patch seems to have a negative impact on
> > >>> performance for the burst traffic profile [1].
> > >>> Benefits seen with the previous version (v2) were up to ~1.6x for
> > >>> 1568-byte packets, compared to ~1.2x seen with the current design
> > >>> (v3), as measured on new Intel hardware that supports DSA [2],
> > >>> CPU @ 1.8 GHz.
> > >>> The cause of the drop seems to be excessive vhost txq contention
> > >>> across the PMD threads.
> > >>
> > >> So it means the Tx/Rx queue pairs aren't consumed by the same PMD
> > >> thread. Can you confirm?
> > >
> > > Yes, the completion polls for a given txq happen on a single PMD
> > > thread (the same thread where its corresponding rxq is being polled),
> > > but other threads can submit (enqueue) packets on the same txq, which
> > > leads to contention.
It seems the 40% perf degradation is caused by virtqueue contention between
the Rx and Tx PMD threads. But I am really curious what causes up to a 40%
perf drop: is it cores busy-waiting on the spinlock, cache thrashing of the
virtqueue struct, or something else?

In the latest vhost patch, I have replaced the spinlock with a try-lock to
avoid busy-waiting. If the OVS data path can also avoid busy-waiting, will
it help performance? Could we have a try?

> > Why can't this process be lockless?
> > If we have to lock the device, maybe we can do both submission and
> > completion from the thread that polls the corresponding Rx queue.
> > Tx threads may enqueue mbufs to some lockless ring inside
> > rte_vhost_enqueue_burst. The Rx thread may dequeue them and submit jobs
> > to the DMA device and check completions. No locks required.

The lockless ring is like batching or caching for Tx packets. It can be
done directly in OVS, IMHO. For example, a Tx queue has a lockless ring:
Tx threads insert packets into the ring, and the Rx thread consumes packets
from the ring, submits the copies, and polls for completion.

Thanks,
Jiayu

> >
> > Thank you for the comments, Ilya.
>
> Hi Jiayu, Maxime,
>
> Could I request your opinions on this from the vhost library perspective?
>
> Thanks and regards,
> Sunil

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
