Re: [lng-odp] [ODP/PATCH v2] Look ma, no barriers! C11 memory model

Savolainen, Petri (NSN - FI/Espoo) Mon, 10 Nov 2014 00:58:41 -0800

> > This is exactly the point I have been trying to make on this "C11
> atomics" thread. Maybe today, the C11 style atomics fit ARMv8.0 ISA
> perfectly, but the day when ARM ISA will have proper "far atomics" - it's
> not optimal any more. The atomics API is targeting "the multi-core
> scalable" way of incrementing those in-memory counters. That process does
> not include acq/rel retry cycle.
> If the current odp_atomics.h is intended only for counters, then both the
> name and the implementation are wrong.


"atomic" is more familiar for people that have used these (far) atomic 
instructions before (e.g. "atomic add" in ISA, not "counter add"). It's also in 
line with similar kernel and DPDK APIs, which makes porting work easier
between these three. I think either one could be used, but "atomic" is more 
familiar.


> 
> Acquire/release has nothing to do with LL/SC. Acquire and release
> are memory orderings which can be associated with any atomic operation
> (they don't make sense for non-atomic operations).

Any atomic? Meaning also the "far atomic" instructions? If an application uses
only far atomics (other atomics are used through various lock
implementations), does it still need to define acq/rel ordering?

> ARMv8 load-acquire
> is a load instruction that can be used e.g. in ticketlock_lock() when
> waiting for the 'current' variable to become equal to your ticket. Memory
> accesses after this load must be prevented from moving up before the
> load-acquire. Memory accesses before this load-acquire are allowed to move
> down after the load-acquire. A DMB or sync (PPC or MIPS) is unnecessarily
> heavy, why wait for *all* preceding stores to be globally observable before
> we can acquire the lock? A "far" atomic update with release ordering makes
> sense when incrementing the ticketlock 'current' variable in order to
> release the lock. This avoids the DMB or SYNC before the increment
> operation. We have benchmarks that show the detrimental effects of full
> barriers.

This is all correct for a lock implementation. When implementing different
locks for ARMv8 you should take advantage of those features of the ISA. You can
optimize lock implementations for ARM as you wish.
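
For illustration only, the ticket lock described in the quoted text could be
sketched roughly as below. This assumes C11 <stdatomic.h> is available; the
type and function names (apart from ticketlock_lock() and 'current') are made
up for this sketch, and it is not the linux-generic implementation.

```c
#include <stdatomic.h>

/* Hypothetical ticket lock type, for illustration only. */
typedef struct {
    atomic_uint next;    /* next ticket to hand out */
    atomic_uint current; /* ticket currently being served */
} ticketlock_t;

static void ticketlock_init(ticketlock_t *lock)
{
    atomic_init(&lock->next, 0);
    atomic_init(&lock->current, 0);
}

static void ticketlock_lock(ticketlock_t *lock)
{
    /* Taking a ticket needs atomicity only, so relaxed ordering is enough. */
    unsigned ticket = atomic_fetch_add_explicit(&lock->next, 1,
                                                memory_order_relaxed);

    /* Load-acquire: accesses after this load cannot move up before it. */
    while (atomic_load_explicit(&lock->current,
                                memory_order_acquire) != ticket)
        ; /* spin until our ticket is served */
}

static void ticketlock_unlock(ticketlock_t *lock)
{
    /* Release ordering: accesses made inside the critical section cannot
     * move down past this increment. No full barrier is requested. */
    atomic_fetch_add_explicit(&lock->current, 1, memory_order_release);
}
```

The acquire and release orderings stay inside the lock implementation here;
a caller only sees ticketlock_lock() and ticketlock_unlock().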

We are now discussing whether the API needs to expose the acq/rel/etc C11
memory orderings. I think it should not, that is too low-level a detail.

> 
> 
> My odp_counter.h API uses relaxed memory order. fetch_and_add, add,
> fetch_and_inc etc can be mapped directly to the corresponding atomic
> instructions if such are available. See the implementation for OCTEON
> that uses laa, saa, lai etc.
> 
> 
> >
> > As Victor and I have noted, SW lock implementation abstraction is not
> a hugely important goal for ODP API. GCC __atomic already provides a pretty
> good abstraction for that. If user really cares about lock (or lock free
> algorithm) implementation, it's better to write it in assembly and take out
> all chances for any abstraction to spoil the algorithm.
> I disagree 100% with this. There is no need to write anything at
> all in assembler. The inline assembler in the atomics implementation
> could be replaced by the proper compiler support. Indeed I asked if
> we couldn't relax our requirement of C99 compatibility and allow C11
> usage in the implementation as well. But this was denied so I set out
> to recreate the necessary support in a C99 compliant way. Victor has
> pointed to a different approach which avoids the usage of a proprietary
> atomics API and I will have a look at this.

Victor and I have mentioned __atomic built-ins in this same context already
many times before (during past months). It's an implementation trade-off whether
one uses __atomic or direct assembly (abstraction vs full control). Abstraction
comes with a cost - e.g. you cannot be sure that a "relaxed __atomic add by
one" always uses the optimal "atomic increment" instruction on all compiler
versions, etc. It may generate a functionally correct but less scalable sequence
of instructions (e.g. by not using far atomics).

As said above, GCC __atomic is a pretty good abstraction towards C11 atomics,
and thus ODP API does not have to duplicate it. It's also safe to use __atomic
built-ins in the linux-generic implementation.
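
As a rough sketch of that idea, a relaxed counter on top of the GCC __atomic
built-ins could look like the code below. The counter_* names are hypothetical
and are not the odp_atomics.h or odp_counter.h API. Which instructions the
compiler actually emits (far atomic or a retry loop) depends on the compiler
version and target, which is exactly the trade-off mentioned above.

```c
#include <stdint.h>

/* Hypothetical counter type, for illustration only. */
typedef struct {
    uint64_t v;
} counter_t;

static inline void counter_init(counter_t *c, uint64_t val)
{
    __atomic_store_n(&c->v, val, __ATOMIC_RELAXED);
}

/* Add without using the old value; relaxed order, no barriers requested. */
static inline void counter_add(counter_t *c, uint64_t n)
{
    (void)__atomic_fetch_add(&c->v, n, __ATOMIC_RELAXED);
}

/* Add and return the value the counter held before the addition. */
static inline uint64_t counter_fetch_add(counter_t *c, uint64_t n)
{
    return __atomic_fetch_add(&c->v, n, __ATOMIC_RELAXED);
}

static inline uint64_t counter_read(counter_t *c)
{
    return __atomic_load_n(&c->v, __ATOMIC_RELAXED);
}
```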


> 
> I also believe that SW lock and synchronization performance will be very
> important for some ODP implementations and I prefer not to have to
> reimplement all of linux-generic just to be able to do it in a more
> efficient and scalable way.
> Doing it in linux-generic will also benefit many others, many ODP
> implementations might borrow SW-implementations from linux-generic.
>

True. You can #ifdef and optimize all lock/barrier implementations for ARM in 
linux-generic without changing the API. 
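
A minimal sketch of what I mean, assuming GCC or a compatible compiler: the
lock functions below keep the same signatures on every architecture and only
the body of the spin-wait hint is selected with #ifdef. The my_spinlock_* and
cpu_relax names are made up for this example and are not existing ODP code.

```c
#include <stdbool.h>

/* Hypothetical lock type; the API below is identical on all targets. */
typedef struct {
    bool locked;
} my_spinlock_t;

/* Architecture-specific spin-wait hint, chosen at compile time. */
static inline void cpu_relax(void)
{
#if defined(__aarch64__)
    __asm__ volatile("yield" ::: "memory");
#elif defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#else
    __asm__ volatile("" ::: "memory"); /* compiler barrier only */
#endif
}

static inline void my_spinlock_init(my_spinlock_t *lock)
{
    __atomic_clear(&lock->locked, __ATOMIC_RELAXED);
}

/* Returns 1 if the lock was taken, 0 if it was already held. */
static inline int my_spinlock_trylock(my_spinlock_t *lock)
{
    return !__atomic_test_and_set(&lock->locked, __ATOMIC_ACQUIRE);
}

static inline void my_spinlock_lock(my_spinlock_t *lock)
{
    while (__atomic_test_and_set(&lock->locked, __ATOMIC_ACQUIRE))
        cpu_relax();
}

static inline void my_spinlock_unlock(my_spinlock_t *lock)
{
    __atomic_clear(&lock->locked, __ATOMIC_RELEASE);
}
```

An ARM-specific build could replace more than the wait hint in the same way,
and callers would not notice since the function signatures stay the same.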

-Petri
 