Oh, and I forgot to mention the LL/SC style of CAS offered by some
architectures with (by default) weak memory models.  With
Load-Linked/Store-Conditional the compare-and-swap isn't a single atomic
instruction underneath; instead, the CPU ensures that the store-conditional
only succeeds if the underlying cacheline wasn't taken away since the
load-linked (so the store can fail even if the value at the address is still
the expected one).
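
To make that concrete, here is roughly how it surfaces through C++11 atomics
(just a sketch; the function name is mine, and the LL/SC mapping described in
the comments is the typical one, not something the standard guarantees):

    #include <atomic>

    // Weak-CAS retry loop.  On LL/SC machines (ARM, POWER, RISC-V) the weak
    // compare-exchange typically compiles down to a load-linked/
    // store-conditional pair, so it may fail even though `observed` still
    // matches -- e.g. because the cacheline reservation was lost -- and the
    // loop simply retries.
    int fetch_add_via_weak_cas(std::atomic<int>& counter, int delta) {
        int observed = counter.load(std::memory_order_relaxed);
        while (!counter.compare_exchange_weak(observed, observed + delta,
                                              std::memory_order_acq_rel,
                                              std::memory_order_relaxed)) {
            // `observed` now holds the freshly re-read value; go around again.
        }
        return observed;
    }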

On Wed, Jan 4, 2017 at 2:59 PM, Vitaly Davidovich <vita...@gmail.com> wrote:

> Probably worth mentioning that "CAS" is a bit too generic.  For instance,
> you can have weak and strong CAS, with some architectures only providing
> strong (e.g. Intel) and some providing/allowing both.  Depending on whether
> a weak or strong CAS is used, the memory ordering/pipeline implications
> will differ (and thus the local cost, i.e. to the core executing it, will
> be higher or lower).  Then there are cases where the "CAS" operation can
> trigger bus-lock escalation, e.g. lock'd instructions on Intel where the
> operand crosses cachelines or sits in uncacheable memory; those are more
> costly than when the bus lock doesn't have to be asserted.
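>
> For concreteness, here's how that weak/strong split surfaces in C++11 (a
> sketch only; the instruction mappings in the comments are the usual ones,
> not a guarantee):
>
>     #include <atomic>
>
>     // compare_exchange_strong only fails when the value actually differs;
>     // on LL/SC machines the compiler hides spurious failures behind its
>     // own retry loop, while on x86 both forms become a lock cmpxchg.
>     bool try_claim_strong(std::atomic<bool>& flag) {
>         bool expected = false;
>         return flag.compare_exchange_strong(expected, true);
>     }
>
>     // compare_exchange_weak may fail spuriously and is meant to sit in a
>     // caller-supplied retry loop; on LL/SC machines it can remain a bare
>     // LL/SC pair.  Relaxed orderings keep the local cost down when no
>     // acquire/release semantics are needed.
>     bool try_claim_weak(std::atomic<bool>& flag) {
>         bool expected = false;
>         return flag.compare_exchange_weak(expected, true,
>                                           std::memory_order_relaxed,
>                                           std::memory_order_relaxed);
>     }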
>
> For cacheable memory, as Gil mentioned, I believe the gist is that the
> implementation relies on the cache coherence protocol that's already built
> in (after all, plain stores to memory are already arbitrated properly by
> the caches).  The "CAS" instruction/use cases, however, can add
> ordering/memory constraints that impact performance (no different in that
> sense than, say, a plain store followed by a full CPU memory fence).
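>
> Roughly, in C++ terms (the x86 mappings in the comments are the common
> ones compilers emit, not something the language standard promises):
>
>     #include <atomic>
>
>     std::atomic<int> flag{0};
>
>     void plain_store() {
>         // An ordinary mov on x86: coherence still arbitrates it, and it's
>         // cheap locally because nothing waits for the store buffer to drain.
>         flag.store(1, std::memory_order_relaxed);
>     }
>
>     void store_plus_full_fence() {
>         // mov followed by an mfence-class barrier; the fence is what costs,
>         // and it's in the same ballpark as the ordering a lock'd RMW implies.
>         flag.store(1, std::memory_order_relaxed);
>         std::atomic_thread_fence(std::memory_order_seq_cst);
>     }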
>
> On Wed, Jan 4, 2017 at 2:39 PM, Avi Kivity <a...@scylladb.com> wrote:
>
>> Gil covered the implementation details; as to overhead, it can be quite
>> low if there is no cacheline contention.  Agner's tables list Skylake lock
>> cmpxchg as having a throughput of 1 insn per 18 cycles, which is fairly
>> amazing. However, as soon as you have contention, this tanks completely due
>> to the associated barriers stopping everything else while the data is moved
>> around.
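>>
>> A crude way to see that gap yourself (an uncalibrated sketch, not a proper
>> benchmark): time a few threads doing CAS increments on one shared counter,
>> then on per-thread counters padded to separate cachelines, e.g.:
>>
>>     #include <atomic>
>>     #include <chrono>
>>     #include <cstdio>
>>     #include <thread>
>>     #include <vector>
>>
>>     // One counter per cacheline so the "uncontended" case stays uncontended.
>>     struct alignas(64) PaddedCounter { std::atomic<long> v{0}; };
>>
>>     // Rough ns-per-CAS; `shared` == true makes every thread hit one counter.
>>     long run(int threads, bool shared, long iters) {
>>         std::vector<PaddedCounter> counters(shared ? 1 : threads);
>>         auto start = std::chrono::steady_clock::now();
>>         std::vector<std::thread> pool;
>>         for (int t = 0; t < threads; ++t) {
>>             pool.emplace_back([&counters, shared, iters, t] {
>>                 std::atomic<long>& c = counters[shared ? 0 : t].v;
>>                 for (long i = 0; i < iters; ++i) {
>>                     long cur = c.load(std::memory_order_relaxed);
>>                     while (!c.compare_exchange_weak(cur, cur + 1)) { /* retry */ }
>>                 }
>>             });
>>         }
>>         for (auto& th : pool) th.join();
>>         auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
>>                       std::chrono::steady_clock::now() - start).count();
>>         return static_cast<long>(ns / (threads * iters));
>>     }
>>
>>     int main() {
>>         std::printf("uncontended: ~%ld ns/CAS\n", run(4, false, 1000000));
>>         std::printf("contended:   ~%ld ns/CAS\n", run(4, true, 1000000));
>>     }
>>
>> (Compile with something like g++ -std=c++17 -O2 -pthread; on a multi-core
>> x86 box the contended figure is usually several times worse, often much
>> more.)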
>>
>>
>>
>>
>> On 01/04/2017 03:43 PM, Yunpeng Li wrote:
>>
>>> Hi there,
>>>      Could someone help shed some light on what the hardware actually does
>>> to implement atomic operations such as CAS? In particular, what are the
>>> differences and overheads across the spectrum from
>>> single-thread-single-core-single-socket to
>>> hyper-thread-multi-core-multi-socket architectures?
>>>       The Google results are either too hardcore or too high-level; it
>>> would be great if someone could give a “middlecore” introduction for
>>> software guys like me😀
>>>
>>>       Thanks in advance
>>>       Yunpeng Li
>>>
>>>
