Hi,
Still, stability and correctness are much more important than performance. As I said, enabling WC brings a more-than-20X performance benefit; compared to that, the difference between rte_wmb and rte_io_wmb is not that important. And in my test, the performance with rte_wmb is not that bad, especially with large packets, which are the normal use case.
BTW, I've searched the Linux kernel code and don't see any NTB device on an Arm platform, so I don't think you need to worry about the performance hit on Arm.

Best Regards
Xiaoyun Li

> -----Original Message-----
> From: Gavin Hu [mailto:gavin...@arm.com]
> Sent: Monday, December 23, 2019 16:38
> To: Li, Xiaoyun <xiaoyun...@intel.com>; Wu, Jingjing <jingjing...@intel.com>
> Cc: dev@dpdk.org; Maslekar, Omkar <omkar.masle...@intel.com>; sta...@dpdk.org; nd <n...@arm.com>; jer...@marvell.com; Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>; Richardson, Bruce <bruce.richard...@intel.com>; nd <n...@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v2] raw/ntb: fix write memory barrier issue
>
> Hi Xiaoyun,
>
> > -----Original Message-----
> > From: Li, Xiaoyun <xiaoyun...@intel.com>
> > Sent: Monday, December 23, 2019 3:52 PM
> > To: Gavin Hu <gavin...@arm.com>; Wu, Jingjing <jingjing...@intel.com>
> > Cc: dev@dpdk.org; Maslekar, Omkar <omkar.masle...@intel.com>; sta...@dpdk.org; nd <n...@arm.com>
> > Subject: RE: [dpdk-dev] [PATCH v2] raw/ntb: fix write memory barrier issue
> >
> > Hi
> > I reconsidered and retested this issue, and I still need to use rte_wmb instead of rte_io_wmb.
> >
> > To achieve high performance, NTB needs to turn on the WC (write combining) feature; the performance difference with and without WC enabled is more than 20X. And when WC is enabled, rte_io_wmb cannot guarantee that the stores complete in order; only rte_wmb can guarantee that.
> >
> > And in my retest, when sending 64-byte packets, using rte_io_wmb causes an out-of-order issue and memory corruption on the rx side, while using rte_wmb is fine.
> That's true. Even though x86 is known as strongly ordered, once the region is declared as 'write combine', it is the interconnect or the PCI root complex that may do the reordering ('write combining'/'write coalescing'), which caused this problem.
> IMO, the rte_io_*mb barriers on x86 should be promoted to stronger ones if WC is involved (but would that sap performance for non-WC memory?).
> https://code.dpdk.org/dpdk/latest/source/lib/librte_eal/common/include/arch/x86/rte_atomic.h#L78
>
> Using rte_wmb will hurt performance on aarch64 as well, since PCI device memory accesses to a single device are strongly ordered there, so the strongest barrier, rte_wmb, is not necessary.
> > So I can only use the v1 patch and suspend the v2 patch in patchwork.
> >
> > Best Regards
> > Xiaoyun Li
> >
> > > -----Original Message-----
> > > From: Gavin Hu (Arm Technology China) [mailto:gavin...@arm.com]
> > > Sent: Monday, December 16, 2019 18:50
> > > To: Li, Xiaoyun <xiaoyun...@intel.com>; Wu, Jingjing <jingjing...@intel.com>
> > > Cc: dev@dpdk.org; Maslekar, Omkar <omkar.masle...@intel.com>; sta...@dpdk.org; nd <n...@arm.com>
> > > Subject: RE: [dpdk-dev] [PATCH v2] raw/ntb: fix write memory barrier issue
> > >
> > > > -----Original Message-----
> > > > From: dev <dev-boun...@dpdk.org> On Behalf Of Xiaoyun Li
> > > > Sent: Monday, December 16, 2019 9:59 AM
> > > > To: jingjing...@intel.com
> > > > Cc: dev@dpdk.org; omkar.masle...@intel.com; Xiaoyun Li <xiaoyun...@intel.com>; sta...@dpdk.org
> > > > Subject: [dpdk-dev] [PATCH v2] raw/ntb: fix write memory barrier issue
> > > >
> > > > All buffers and ring info should be written before the tail register update.
> > > > This patch relocates the write memory barrier before the tail register update to avoid potential issues.
> > > >
> > > > Fixes: 11b5c7daf019 ("raw/ntb: add enqueue and dequeue functions")
> > > > Cc: sta...@dpdk.org
> > > >
> > > > Signed-off-by: Xiaoyun Li <xiaoyun...@intel.com>
> > > > ---
> > > > v2:
> > > > * Replaced rte_wmb with rte_io_wmb since rte_io_wmb is enough.
> > > > ---
> > > >  drivers/raw/ntb/ntb.c | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/raw/ntb/ntb.c b/drivers/raw/ntb/ntb.c
> > > > index ad7f6abfd..c7de86f36 100644
> > > > --- a/drivers/raw/ntb/ntb.c
> > > > +++ b/drivers/raw/ntb/ntb.c
> > > > @@ -683,8 +683,8 @@ ntb_enqueue_bufs(struct rte_rawdev *dev,
> > > >  			   sizeof(struct ntb_used) * nb1);
> > > >  		rte_memcpy(txq->tx_used_ring, tx_used + nb1,
> > > >  			   sizeof(struct ntb_used) * nb2);
> > > > +	rte_io_wmb();
> > > As both txq->tx_used_ring and *txq->used_cnt physically reside on the PCI device side, rte_io_wmb is correct to ensure the ordering.
> > > >  	*txq->used_cnt = txq->last_used;
> > > > -	rte_wmb();
> > > >
> > > >  	/* update queue stats */
> > > >  	hw->ntb_xstats[NTB_TX_BYTES_ID + off] += bytes;
> > > > @@ -789,8 +789,8 @@ ntb_dequeue_bufs(struct rte_rawdev *dev,
> > > >  			   sizeof(struct ntb_desc) * nb1);
> > > >  		rte_memcpy(rxq->rx_desc_ring, rx_desc + nb1,
> > > >  			   sizeof(struct ntb_desc) * nb2);
> > > > +	rte_io_wmb();
> > > >  	*rxq->avail_cnt = rxq->last_avail;
> > > > -	rte_wmb();
> > > >
> > > >  	/* update queue stats */
> > > >  	off = NTB_XSTATS_NUM * ((size_t)context + 1);
> > > > --
> > > > 2.17.1
> > >
> > > Reviewed-by: Gavin Hu <gavin...@arm.com>