[lock-free] Atomic operations on two cachelines

Oleg Zabluda Fri, 21 Feb 2014 05:38:13 -0800

Currently, on Intel x86, there are no locked operations for operands larger 
than one cacheline.
Locked operations whose operand crosses a cacheline boundary are allowed, but 
they take 4-6K clock cycles and are probably implemented
as "stop the world" (emulation of the "LOCK# pin").
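
To make "operand crossing cacheline" concrete, here is a minimal sketch of the
address check. The 64-byte line size and the names kLineSize and
crosses_cacheline are assumptions for illustration, not something the hardware
or any library exposes under these names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cachelines (typical for current x86, not guaranteed).
constexpr std::uintptr_t kLineSize = 64;

// True if the byte range [addr, addr + size) touches more than one cacheline.
inline bool crosses_cacheline(std::uintptr_t addr, std::size_t size) {
    assert(size > 0);
    // Compare the line index of the first and the last byte of the operand.
    return addr / kLineSize != (addr + size - 1) / kLineSize;
}
```

An 8-byte operand at offset 56 stays within one line, while the same operand
at offset 60 straddles two lines and would take the slow path described above.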


AFAIK, locked ops on a single cacheline are implemented like so:

0. flush "load buffers", if needed
1. get the cacheline C1 in E state.
2. stop listening to coherency traffic on C1
3. do the op
4. flush "store buffers" if needed (cacheline may go to M state)
5. start listening to coherency traffic on C1
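
For reference, this is roughly what a single-cacheline locked op looks like
from the software side. It is a sketch only: atomic_max is a hypothetical
helper, and the statement that compilers lower compare_exchange on x86 to a
LOCK-prefixed CMPXCHG is the usual case, not a guarantee.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Atomically raise `target` to at least `value`.
// Returns the value that was observed before any update.
inline std::uint64_t atomic_max(std::atomic<std::uint64_t>& target,
                                std::uint64_t value) {
    // The discussion only applies when the atomic is implemented lock-free.
    assert(target.is_lock_free());
    std::uint64_t seen = target.load(std::memory_order_relaxed);
    // Each compare_exchange is one locked read-modify-write on one cacheline.
    while (seen < value &&
           !target.compare_exchange_weak(seen, value,
                                         std::memory_order_seq_cst,
                                         std::memory_order_relaxed)) {
        // On failure `seen` has been reloaded with the current value; retry.
    }
    return seen;
}
```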

In principle, a locked instruction involving 2 cachelines could be
implemented (?) similarly, without taking any global "stop the world"
lock:

0. flush "load buffers", if needed
1. get cacheline C1 with lower address in E state
2. stop listening to coherency traffic on C1
3. get cacheline C2 with higher address in E state
4. stop listening to coherency traffic on C2
5. do the op
6. flush "store buffers" if needed (cachelines may go to M state)
7. start listening to coherency traffic on C1,C2.
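
Taking the lower address first is the same trick software uses to avoid
deadlock when two locks are needed: if everyone acquires in address order, two
cores can never each hold one line while waiting for the other. A software
analogy of the steps above, as a sketch only: the mutex stands in for "holding
the line in E state and ignoring coherency traffic", and Line and two_line_op
are made-up names.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <utility>

// One "cacheline" worth of data plus a stand-in for exclusive ownership.
struct alignas(64) Line {
    std::mutex owned;        // models steps 1-2 / 3-4 (exclusive, not snooped)
    std::uint64_t data = 0;
};

// Apply `op` to both lines as one indivisible step.
template <class Op>
void two_line_op(Line& a, Line& b, Op op) {
    assert(&a != &b);
    Line* lo = &a;
    Line* hi = &b;
    // Always acquire the lower address first, as in steps 1 and 3.
    if (std::less<Line*>{}(hi, lo)) std::swap(lo, hi);
    std::lock_guard<std::mutex> first(lo->owned);
    std::lock_guard<std::mutex> second(hi->owned);
    op(a.data, b.data);      // step 5: do the op
}                            // guards release here, modelling step 7
```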

I am not entirely sure why nobody has ever (?) implemented it like that.
It's useful not so much for operands crossing cachelines as for
dealing with larger operands (sort of like CMPXCHG16B). This could be
used in place of the common case (?) of a small struct protected by an
intrusive lock, making things faster and getting rid of the lock.
Unlike RTM/HLE, there is no need for a fallback mechanism: the operation
can't be "permanently failing", because the "critical section" is
well-defined in advance. On x86 the opcode could be a LOCKed
string operation or a LOCKed SIMD operation (which may be a good idea
(?) even if aligned).
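
For concreteness, this is the kind of "small struct protected by an intrusive
lock" meant above. It is a hypothetical example (Stats and record are made-up
names); the payload is 24 bytes, which is more than CMPXCHG16B can cover, so
today it needs the embedded lock.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// A small struct with the lock embedded next to the data it protects.
struct alignas(64) Stats {
    std::atomic_flag lock = ATOMIC_FLAG_INIT;  // intrusive spinlock
    std::uint64_t count = 0;
    std::uint64_t sum = 0;
    std::uint64_t max = 0;

    // Update all three fields as one critical section.
    void record(std::uint64_t sample) {
        while (lock.test_and_set(std::memory_order_acquire)) {
            // spin until the current holder clears the flag
        }
        ++count;
        sum += sample;
        if (sample > max) max = sample;
        lock.clear(std::memory_order_release);
    }
};

// Lock and payload together occupy a single 64-byte aligned block.
static_assert(sizeof(Stats) == 64, "Stats is expected to fill one 64-byte block");
```

The body of record is fixed and known in advance, which is the property the
proposed wide locked operation would rely on to do the same update without the
lock field.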

My best guess is that it is due to the high cost of indivisible
instructions with multiple memory operands (x86 doesn't have any). The
cost is high because a lot of expensive state has to be kept for
speculative execution rollback. This is why

1. string operations are divisible (visible as separate loads and stores).
2. HLE/RTM abort on page faults, etc.
