This RFC provides softmmu/cpu primitives for cmpxchg operations. These primitives are not TCG ops; they are meant to be called from target helper functions. They apply to both system and user modes.
To improve performance on architectures that have "locked" versions of regular ops (e.g. x86), helpers are also provided for most of these atomic ops (e.g. atomic_add_and_fetch). Performance numbers comparing atomic_* helpers against cmpxchg-only emulation of atomics on x86 are in patch 20's commit log. The cmpxchg helpers are also used to emulate LL/SC (on arm/aarch64), starting from patch 21.

# Correctness

Using cmpxchg to emulate LL/SC is not correct, because it can suffer from the ABA problem. However, I argue that a fully correct implementation is not necessary for most real applications out there. In other words, while it would be trivial to write a parallel program that would be emulated incorrectly due to ABA, most parallel code is written assuming only cmpxchg is available, e.g. the kernel and gcc provide cmpxchg and other (simpler) "atomic" ops, but nothing that would prevent ABA. [*]

[*] http://yarchive.net/comp/linux/cmpxchg_ll_sc_portability.html

I propose to make the cmpxchg-based implementation the default, and to later add an option to enable a worse-performing yet correct LL/SC implementation. I have a working solution for both user and system modes, but it requires instrumenting *all* stores with helpers, which has significant overhead compared to cmpxchg. See this bar chart, with measurements taken on an Intel Haswell:

  http://imgur.com/KKb7S4t

Overhead for those SPEC workloads is on average ~20% ("store tracking"). Note that most of this comes from calling the helpers ("store helpers only"). This implementation, however, scales well, since no single lock is taken to emulate atomic accesses (each cache line, when necessary, gets its own lock). If there's interest, I can share this implementation.

# Performance and Scalability

The cmpxchg-based implementation in this RFC scales. It requires no cross-CPU communication apart from the inevitable cache line contention in the guest workload. In other words, no external subsystems (e.g. TLBs, CPU loop) are touched, nor are any heavily-contended locks added. The only locks in this RFC are added to emulate 16-byte atomics; for now just a small table of locks is used. Using locks to emulate these atomics is OK, since no "regular" (i.e. non-atomic) loads/stores can race with 16-byte atomics. See the commit logs in patches 22 (ARM) and 26 (aarch64) for a performance comparison against the existing linux-user emulation on a 64-core system. Tests are done with a newly added benchmark that stresses the cache hierarchy with a configurable level of contention (patch 19).

# Testing

All implementations (x86_64, ARM, aarch64) pass the ck_pr validation tests in concurrencykit[*] (ck/regressions/ck_pr_validate) in user mode. I have not tested the "paired" flavours of LL/SC on aarch64, so help on this would be appreciated. I haven't even compile-tested on 32-bit hosts.

[*] http://concurrencykit.org/

# Why this is an RFC, and not a PATCH set

I'd like to have your input on the following:

- Is it OK *not* to write on a failed cmpxchg on user-only x86? In system mode we have the write TLB access, so we'd get a write fault (all atomic/cmpxchg ops fault as write faults, which AFAIK is the right thing to do), so that should suffice.

- What should we do when atomic ops are used on something other than RAM? Should we have a non-atomic "slow path" for these cases, or is it OK to assume such code is bogus? For now, I just wrote XXX.

- In target-i386 code, I might have added unnecessary temps; I'm not sure which temps need to retain a meaningful value after an instruction is translated, so what I did is make sure they retain the same value they would have had if the lock prefix weren't there.

- Is it worth adding more checks to ensure that the LOCK prefix is where it should be? In some places I'm assuming that "if (LOCK_PREFIX)" implies a memory operand (not a register operand). Assuming that instructions are always well-formed might be overly optimistic.
- How/where should we fail when cmpxchg16b is performed on an address that is not 16-byte aligned? x86 requires cmpxchg16b to be aligned, but does not enforce alignment for the other atomics.

- Unaligned atomic ops: AFAIK only x86 supports them. I'm punting on this for now, since I'm passing the possibly-unaligned address straight to the host compiler. AFAIK gcc deals with this by enlarging the size of the atomic to be used, but if for instance we're doing a cmpxchg8b on an unaligned address, very few hosts have a cmpxchg16b to fall back on.

- What should we do on host architectures that cannot guarantee that regular accesses are atomic? For instance, when emulating a 64-bit guest on a 32-bit host. This will break a lot of parallel code unless we serialize all loads/stores. I assume we simply won't support MTTCG on these hosts and be done with it, right? Otherwise we'd have to instrument all loads and stores.

- ARM's cmpxchg syscall in linux-user could be improved; I've ignored it so far.

Thanks,

		Emilio