> > This is exactly the point I have been trying to make on this "C11
> > atomics" thread. Maybe today the C11-style atomics fit the ARMv8.0
> > ISA perfectly, but the day the ARM ISA has proper "far atomics",
> > they are not optimal any more. The atomics API is targeting "the
> > multi-core scalable" way of incrementing those in-memory counters.
> > That process does not include an acq/rel retry cycle.
>
> If the current odp_atomics.h is intended only for counters, then both
> the name and the implementation are wrong.

"atomic" is more familiar for people that have used these (far) atomic instructions before (e.g. "atomic add" in ISA, not "counter add"). It's also in align with similar kernel and DPDK APIs, which makes porting work easier between these three. I think either one could be used, but "atomic" is more familiar. > > Acquire/release has nothing to do with LL/SC. Acquire and release > are memory orderings which can be associated with any atomic operation > (they don't make sense for non-atomic operations). Any atomic? Meaning also the "far atomic" instructions? If application uses only far atomics (others atomics are used through various lock implementations), does it need to define acq/rel ordering still? > ARMv8 load-acquire > is a load instruction that can be used e.g. in ticketlock_lock() when > waiting > for the 'current' variable to become equal to your ticket. Memory accesses > after this load must be prevented from moving up before the load-acquire. > Memory accesses before this load-acquire are allowed to move down after > load-acquire. A DMB or sync (PPC or MIPS) is unnecessarily heavy, why > wait for *all* preceding stores to be globally observable before we can > acquire > the lock? A "far" atomic update with release ordering makes sense when > incrementing the ticketlock 'current' variable in order to release the > lock. > This avoid the DMB or SYNC before the increment operation. We have > benchmarks that should the detrimental effects of full barriers. This is all correct for a lock implementation. When implementing different locks for ARMv8 you should take advantage of those features of ISA. You can optimize lock implementations for ARM as you wish. We are now discussing whether API needs to expose acq/rel/etc C11 memory models. I think it should not, it's too low level detail. > > > My odp_counter.h API uses relaxed memory order. fetch_and_add, add, > fetch_and_inc etc can be mapped directly to atomic corresponding > instructions > if such are available. 
> See the implementation for OCTEON that uses laa, saa, lai etc.
>
> > As Victor and I have noted, SW lock implementation abstraction is
> > not a hugely important goal for the ODP API. GCC __atomic already
> > provides a pretty good abstraction for that. If the user really
> > cares about a lock (or lock-free algorithm) implementation, it's
> > better to write it in assembly and take out all chances for any
> > abstraction to spoil the algorithm.
>
> I disagree 100% with this. There is no need to write anything at
> all in assembler. The inline assembler in the atomics implementation
> could be replaced by proper compiler support. Indeed, I asked if
> we couldn't relax our requirement of C99 compatibility and allow C11
> usage in the implementation as well. But this was denied, so I set out
> to recreate the necessary support in a C99-compliant way. Victor has
> pointed to a different approach which avoids the usage of a
> proprietary atomics API and I will have a look at this.

Victor and I have mentioned the __atomic built-ins in this same context
many times before (during the past months). It's an implementation
trade-off whether one uses __atomic or direct assembly (abstraction vs
full control). Abstraction comes with a cost: e.g. you cannot be sure
that a "relaxed __atomic add by one" always uses the optimal "atomic
increment" instruction on all compiler versions, etc. It may generate a
functionally correct but less scalable sequence of instructions (e.g.
by not using far atomics). As said above, GCC __atomic is a pretty good
abstraction towards C11 atomics, and thus the ODP API does not have to
duplicate it. It's also safe to use __atomic in the linux-generic
implementation.

> I also believe that SW lock and synchronization performance will be
> very important for some ODP implementations and I prefer not to have
> to reimplement all of linux-generic just to be able to do it in a more
> efficient and scalable way.
> Doing it in linux-generic will also benefit many others; many ODP
> implementations might borrow SW implementations from linux-generic.

True. You can #ifdef and optimize all lock/barrier implementations for
ARM in linux-generic without changing the API.

-Petri

_______________________________________________
lng-odp mailing list
lng-odp@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/lng-odp
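
For readers following the thread, here is a minimal sketch of the two
usage patterns being contrasted: a relaxed counter update, and a ticket
lock that uses a load-acquire while waiting and a release increment to
unlock. It uses the GCC __atomic built-ins mentioned above. All type and
function names (counter_t, counter_add, ticketlock_t and so on) are
hypothetical and chosen for illustration only; this is not the ODP API
or the linux-generic implementation, and whether the compiler emits a
single atomic instruction for any of these calls depends on the target
and compiler version.

```c
#include <stdint.h>

/* Hypothetical types for illustration only; not the ODP API. */
typedef struct {
    uint64_t v;
} counter_t;

typedef struct {
    uint32_t next;     /* next ticket to hand out */
    uint32_t current;  /* ticket currently being served */
} ticketlock_t;

/* Counter update: only atomicity is needed, so relaxed ordering is
 * used. No ordering is imposed on surrounding memory accesses. */
static inline void counter_add(counter_t *c, uint64_t n)
{
    (void)__atomic_fetch_add(&c->v, n, __ATOMIC_RELAXED);
}

static inline uint64_t counter_read(counter_t *c)
{
    return __atomic_load_n(&c->v, __ATOMIC_RELAXED);
}

static inline void ticketlock_init(ticketlock_t *l)
{
    l->next = 0;
    l->current = 0;
}

static inline void ticketlock_lock(ticketlock_t *l)
{
    /* Taking a ticket needs atomicity only, so relaxed is enough. */
    uint32_t ticket = __atomic_fetch_add(&l->next, 1, __ATOMIC_RELAXED);

    /* Load-acquire: accesses after this load may not move up before it. */
    while (__atomic_load_n(&l->current, __ATOMIC_ACQUIRE) != ticket)
        ; /* spin until our ticket is served */
}

static inline void ticketlock_unlock(ticketlock_t *l)
{
    /* Release increment: accesses before it may not move down past it,
     * which avoids a separate full barrier before the update. */
    (void)__atomic_fetch_add(&l->current, 1, __ATOMIC_RELEASE);
}
```

In this sketch the memory ordering stays inside the lock implementation,
while the counter functions expose no ordering parameter at all, which
is one way to read the API question discussed in this thread.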