On Mon, 2019-01-28 at 13:43 +0000, Ola Liljedahl wrote: > On Mon, 2019-01-28 at 13:34 +0000, Jerin Jacob Kollanukkaran wrote: > > On Fri, 2019-01-25 at 17:21 +0000, Eads, Gage wrote: > > > > -----Original Message----- > > > > From: Ola Liljedahl [mailto:ola.liljed...@arm.com] > > > > Sent: Wednesday, January 23, 2019 4:16 AM > > > > To: Eads, Gage <gage.e...@intel.com>; dev@dpdk.org > > > > Cc: olivier.m...@6wind.com; step...@networkplumber.org; nd > > > > <n...@arm.com>; Richardson, Bruce <bruce.richard...@intel.com>; > > > > arybche...@solarflare.com; Ananyev, Konstantin > > > > <konstantin.anan...@intel.com> > > > > Subject: Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking > > > > implementation > > > > > > > > s. > > > > > > > > > > > You can tell this code was written when I thought x86-64 was > > > > > the > > > > > only > > > > > viable target :). Yes, you are correct. > > > > > > > > > > With regards to using __atomic intrinsics, I'm planning on > > > > > taking > > > > > a > > > > > similar approach to the functions duplicated in > > > > > rte_ring_generic.h and > > > > > rte_ring_c11_mem.h: one version that uses rte_atomic > > > > > functions > > > > > (and > > > > > thus stricter memory ordering) and one that uses __atomic > > > > > intrinsics > > > > > (and thus can benefit from more relaxed memory ordering). > > > > What's the advantage of having two different implementations? > > > > What > > > > is the > > > > disadvantage? > > > > > > > > The existing ring buffer code originally had only the "legacy" > > > > implementation > > > > which was kept when the __atomic implementation was added. The > > > > reason > > > > claimed was that some older compilers for x86 do not support > > > > GCC > > > > __atomic > > > > builtins. But I thought there was consensus that new > > > > functionality > > > > could have > > > > only __atomic implementations. > > > > > > > When CONFIG_RTE_RING_USE_C11_MEM_MODEL was introduced, it was > > > left > > > disabled for thunderx[1] for performance reasons. Assuming that > > > hasn't changed, the advantage to having two versions is to best > > > support all of DPDK's platforms. The disadvantage is of course > > > duplicated code and the additional maintenance burden. > > > > > > That said, if the thunderx maintainers are ok with it, I'm > > > certainly > > The ring code was so fundamental building block for DPDK, there > > was > > difference in performance and there was already legacy code so > > introducing C11_MEM_MODEL was justified IMO. > > > > For the nonblocking implementation, I am happy to test with > > three ARM64 microarchitectures and share the result with > > C11_MEM_MODEL > > vs non C11_MEM_MODLE performance. > We should ensure the C11 memory model version enforces minimal > ordering > requirements:
I agree. I think, We should have enough test case for performance measurement in order to choose algorithms and quantify the other variables like C11 vs non C11, LDXP/STXP vs CASP etc. > 1) when computing number of available slots, allow for underflow > (head and tail > observed in unexpected order) instead of imposing read order with an > additional > read barrier. > 2) We could cheat a little and use an explicit LoadStore barrier > instead of > store-release/cas-release in dequeue (which only reads the ring). At > least see > if this improves performance. See such a patch here: > https://github.com/ARM-software/progress64/commit/84c48e9c84100eb5b2d15e54f0dbf7 > 8dfa468805 > > Ideally, C/C++ would have an __ATOMIC_RELEASE_READSONLY memory model > to use in > situations where the shared data was only read before being released. > > > We may need to consider PPC also > > here. So IMO, based on the overall performance result may be can > > decide > > the new code direction. > Does PPC (64-bit POWER?) have support for double-word (128-bit) CAS? I dont know, I was telling wrt in general C11 mem model for PPC. > > -- > Ola Liljedahl, Networking System Architect, Arm > Phone +46706866373, Skype ola.liljedahl >