No attempt is made to fix or update the text; that is left for the next patch in the series.
Signed-off-by: Paolo Bonzini <pbonz...@redhat.com> --- docs/devel/atomics.rst | 446 +++++++++++++++++++++++++++++++++++++++++ docs/devel/atomics.txt | 403 ------------------------------------- docs/devel/index.rst | 1 + 3 files changed, 447 insertions(+), 403 deletions(-) create mode 100644 docs/devel/atomics.rst delete mode 100644 docs/devel/atomics.txt diff --git a/docs/devel/atomics.rst b/docs/devel/atomics.rst new file mode 100644 index 0000000000..83ed3d6981 --- /dev/null +++ b/docs/devel/atomics.rst @@ -0,0 +1,446 @@ +========================= +Atomic operations in QEMU +========================= + +CPUs perform independent memory operations effectively in random order. +but this can be a problem for CPU-CPU interaction (including interactions +between QEMU and the guest). Multi-threaded programs use various tools +to instruct the compiler and the CPU to restrict the order to something +that is consistent with the expectations of the programmer. + +The most basic tool is locking. Mutexes, condition variables and +semaphores are used in QEMU, and should be the default approach to +synchronization. Anything else is considerably harder, but it's +also justified more often than one would like. The two tools that +are provided by ``qemu/atomic.h`` are memory barriers and atomic operations. + +Macros defined by ``qemu/atomic.h`` fall in three camps: + +- compiler barriers: ``barrier()``; + +- weak atomic access and manual memory barriers: ``atomic_read()``, + ``atomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``, ``smp_mb_acquire()``, + ``smp_mb_release()``, ``smp_read_barrier_depends()``; + +- sequentially consistent atomic access: everything else. + + +Compiler memory barrier +======================= + +``barrier()`` prevents the compiler from moving the memory accesses either +side of it to the other side. The compiler barrier has no direct effect +on the CPU, which may then reorder things however it wishes. + +``barrier()`` is mostly used within ``qemu/atomic.h`` itself. On some +architectures, CPU guarantees are strong enough that blocking compiler +optimizations already ensures the correct order of execution. In this +case, ``qemu/atomic.h`` will reduce stronger memory barriers to simple +compiler barriers. + +Still, ``barrier()`` can be useful when writing code that can be interrupted +by signal handlers. + + +Sequentially consistent atomic access +===================================== + +Most of the operations in the ``qemu/atomic.h`` header ensure *sequential +consistency*, where "the result of any execution is the same as if the +operations of all the processors were executed in some sequential order, +and the operations of each individual processor appear in this sequence +in the order specified by its program". + +``qemu/atomic.h`` provides the following set of atomic read-modify-write +operations:: + + void atomic_inc(ptr) + void atomic_dec(ptr) + void atomic_add(ptr, val) + void atomic_sub(ptr, val) + void atomic_and(ptr, val) + void atomic_or(ptr, val) + + typeof(*ptr) atomic_fetch_inc(ptr) + typeof(*ptr) atomic_fetch_dec(ptr) + typeof(*ptr) atomic_fetch_add(ptr, val) + typeof(*ptr) atomic_fetch_sub(ptr, val) + typeof(*ptr) atomic_fetch_and(ptr, val) + typeof(*ptr) atomic_fetch_or(ptr, val) + typeof(*ptr) atomic_fetch_xor(ptr, val) + typeof(*ptr) atomic_fetch_inc_nonzero(ptr) + typeof(*ptr) atomic_xchg(ptr, val) + typeof(*ptr) atomic_cmpxchg(ptr, old, new) + +all of which return the old value of ``*ptr``. 
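+As an illustration, a reference counter can be built directly on these
+primitives.  This is only a sketch: ``MyObject``, ``my_object_ref()`` and
+``my_object_unref()`` are made-up names, not functions from the QEMU tree::
+
+    typedef struct MyObject {
+        int refcount;
+        /* ... payload ... */
+    } MyObject;
+
+    void my_object_ref(MyObject *obj)
+    {
+        /* Atomically increment the counter; the old value is not needed. */
+        atomic_inc(&obj->refcount);
+    }
+
+    void my_object_unref(MyObject *obj)
+    {
+        /* atomic_fetch_dec() returns the value the counter held *before*
+         * the decrement, so exactly one thread sees 1 and frees the
+         * object.  The implied full barriers order all earlier accesses
+         * to the object before the free. */
+        if (atomic_fetch_dec(&obj->refcount) == 1) {
+            g_free(obj);
+        }
+    }
+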
These operations are +polymorphic; they operate on any type that is as wide as a pointer. + +Similar operations return the new value of ``*ptr``:: + + typeof(*ptr) atomic_inc_fetch(ptr) + typeof(*ptr) atomic_dec_fetch(ptr) + typeof(*ptr) atomic_add_fetch(ptr, val) + typeof(*ptr) atomic_sub_fetch(ptr, val) + typeof(*ptr) atomic_and_fetch(ptr, val) + typeof(*ptr) atomic_or_fetch(ptr, val) + typeof(*ptr) atomic_xor_fetch(ptr, val) + +Sequentially consistent loads and stores can be done using:: + + atomic_fetch_add(ptr, 0) for loads + atomic_xchg(ptr, val) for stores + +However, they are quite expensive on some platforms, notably POWER and +Arm. Therefore, qemu/atomic.h provides two primitives with slightly +weaker constraints:: + + typeof(*ptr) atomic_mb_read(ptr) + void atomic_mb_set(ptr, val) + +The semantics of these primitives map to Java volatile variables, +and are strongly related to memory barriers as used in the Linux +kernel (see below). + +As long as you use atomic_mb_read and atomic_mb_set, accesses cannot +be reordered with each other, and it is also not possible to reorder +"normal" accesses around them. + +However, and this is the important difference between +atomic_mb_read/atomic_mb_set and sequential consistency, it is important +for both threads to access the same volatile variable. It is not the +case that everything visible to thread A when it writes volatile field f +becomes visible to thread B after it reads volatile field g. The store +and load have to "match" (i.e., be performed on the same volatile +field) to achieve the right semantics. + + +These operations operate on any type that is as wide as an int or smaller. + + +Weak atomic access and manual memory barriers +============================================= + +Compared to sequentially consistent atomic access, programming with +weaker consistency models can be considerably more complicated. +In general, if the algorithm you are writing includes both writes +and reads on the same side, it is generally simpler to use sequentially +consistent primitives. + +When using this model, variables are accessed with: + +- ``atomic_read()`` and ``atomic_set()``; these prevent the compiler from + optimizing accesses out of existence and creating unsolicited + accesses, but do not otherwise impose any ordering on loads and + stores: both the compiler and the processor are free to reorder + them. + +- ``atomic_load_acquire()``, which guarantees the LOAD to appear to + happen, with respect to the other components of the system, + before all the LOAD or STORE operations specified afterwards. + Operations coming before ``atomic_load_acquire()`` can still be + reordered after it. + +- ``atomic_store_release()``, which guarantees the STORE to appear to + happen, with respect to the other components of the system, + after all the LOAD or STORE operations specified afterwards. + Operations coming after ``atomic_store_release()`` can still be + reordered after it. + +Restrictions to the ordering of accesses can also be specified +using the memory barrier macros: ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``, +``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``. + +Memory barriers control the order of references to shared memory. +They come in six kinds: + +- ``smp_rmb()`` guarantees that all the LOAD operations specified before + the barrier will appear to happen before all the LOAD operations + specified after the barrier with respect to the other components of + the system. 
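+  For instance, a consumer that polls a flag and then reads the data it
+  protects needs an ``smp_rmb()`` between the two loads, paired with a
+  matching ``smp_wmb()`` on the producer side (see the pairing section
+  below).  In this sketch ``msg``, ``msg_ready`` and ``process()`` are
+  made-up names::
+
+      if (atomic_read(&msg_ready)) {
+          /* Keep the load of msg from being reordered before the
+           * load of msg_ready. */
+          smp_rmb();
+          process(atomic_read(&msg));
+      }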
+ + In other words, ``smp_rmb()`` puts a partial ordering on loads, but is not + required to have any effect on stores. + +- ``smp_wmb()`` guarantees that all the STORE operations specified before + the barrier will appear to happen before all the STORE operations + specified after the barrier with respect to the other components of + the system. + + In other words, ``smp_wmb()`` puts a partial ordering on stores, but is not + required to have any effect on loads. + +- ``smp_mb_acquire()`` guarantees that all the LOAD operations specified before + the barrier will appear to happen before all the LOAD or STORE operations + specified after the barrier with respect to the other components of + the system. + +- ``smp_mb_release()`` guarantees that all the STORE operations specified *after* + the barrier will appear to happen after all the LOAD or STORE operations + specified *before* the barrier with respect to the other components of + the system. + +- ``smp_mb()`` guarantees that all the LOAD and STORE operations specified + before the barrier will appear to happen before all the LOAD and + STORE operations specified after the barrier with respect to the other + components of the system. + + ``smp_mb()`` puts a partial ordering on both loads and stores. It is + stronger than both a read and a write memory barrier; it implies both + ``smp_mb_acquire()`` and ``smp_mb_release()``, but it also prevents STOREs + coming before the barrier from overtaking LOADs coming after the + barrier and vice versa. + +- ``smp_read_barrier_depends()`` is a weaker kind of read barrier. On + most processors, whenever two loads are performed such that the + second depends on the result of the first (e.g., the first load + retrieves the address to which the second load will be directed), + the processor will guarantee that the first LOAD will appear to happen + before the second with respect to the other components of the system. + However, this is not always true---for example, it was not true on + Alpha processors. Whenever this kind of access happens to shared + memory (that is not protected by a lock), a read barrier is needed, + and ``smp_read_barrier_depends()`` can be used instead of ``smp_rmb()``. + + Note that the first load really has to have a _data_ dependency and not + a control dependency. If the address for the second load is dependent + on the first load, but the dependency is through a conditional rather + than actually loading the address itself, then it's a _control_ + dependency and a full read barrier or better is required. + + +This is the set of barriers that is required *between* two ``atomic_read()`` +and ``atomic_set()`` operations to achieve sequential consistency: + + +----------------+-------------------------------------------------------+ + | | 2nd operation | + | +------------------+-----------------+------------------+ + | 1st operation | (after last) | atomic_read | atomic_set | + +----------------+------------------+-----------------+------------------+ + | (before first) | .. | none | smp_mb_release() | + +----------------+------------------+-----------------+------------------+ + | atomic_read | smp_mb_acquire() | smp_rmb() [1]_ | [2]_ | + +----------------+------------------+-----------------+------------------+ + | atomic_set | none | smp_mb() [3]_ | smp_wmb() | + +----------------+------------------+-----------------+------------------+ + + .. [1] Or smp_read_barrier_depends(). + + .. [2] This requires a load-store barrier. 
This is achieved by + either smp_mb_acquire() or smp_mb_release(). + + .. [3] This requires a store-load barrier. On most machines, the only + way to achieve this is a full barrier. + + +You can see that the two possible definitions of ``atomic_mb_read()`` +and ``atomic_mb_set()`` are the following: + + 1) | atomic_mb_read(p) = atomic_read(p); smp_mb_acquire() + | atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); smp_mb() + + 2) | atomic_mb_read(p) = smp_mb() atomic_read(p); smp_mb_acquire() + | atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); + +Usually the former is used, because ``smp_mb()`` is expensive and a program +normally has more reads than writes. Therefore it makes more sense to +make ``atomic_mb_set()`` the more expensive operation. + +There are two common cases in which atomic_mb_read and atomic_mb_set +generate too many memory barriers, and thus it can be useful to manually +place barriers, or use atomic_load_acquire/atomic_store_release instead: + +- when a data structure has one thread that is always a writer + and one thread that is always a reader, manual placement of + memory barriers makes the write side faster. Furthermore, + correctness is easy to check for in this case using the "pairing" + trick that is explained below: + + +----------------------------------------------------------------------+ + | thread 1 | + +-----------------------------------+----------------------------------+ + | before | after | + +===================================+==================================+ + | :: | :: | + | | | + | (other writes) | | + | atomic_mb_set(&a, x) | atomic_store_release(&a, x) | + | atomic_mb_set(&b, y) | atomic_store_release(&b, y) | + +-----------------------------------+----------------------------------+ + + +----------------------------------------------------------------------+ + | thread 2 | + +-----------------------------------+----------------------------------+ + | before | after | + +===================================+==================================+ + | :: | :: | + | | | + | y = atomic_mb_read(&b) | y = atomic_load_acquire(&b) | + | x = atomic_mb_read(&a) | x = atomic_load_acquire(&a) | + | (other reads) | | + +-----------------------------------+----------------------------------+ + + Note that the barrier between the stores in thread 1, and between + the loads in thread 2, has been optimized here to a write or a + read memory barrier respectively. On some architectures, notably + ARMv7, smp_mb_acquire and smp_mb_release are just as expensive as + smp_mb, but smp_rmb and/or smp_wmb are more efficient. + +- sometimes, a thread is accessing many variables that are otherwise + unrelated to each other (for example because, apart from the current + thread, exactly one other thread will read or write each of these + variables). In this case, it is possible to "hoist" the implicit + barriers provided by ``atomic_mb_read()`` and ``atomic_mb_set()`` outside + a loop. 
For example, the above definition ``atomic_mb_read()`` gives + the following transformation: + + +-----------------------------------+----------------------------------+ + | before | after | + +===================================+==================================+ + | :: | :: | + | | | + | n = 0; | n = 0; | + | for (i = 0; i < 10; i++) | for (i = 0; i < 10; i++) | + | n += atomic_mb_read(&a[i]); | n += atomic_read(&a[i]); | + | | smp_mb_acquire(); | + +-----------------------------------+----------------------------------+ + + Similarly, atomic_mb_set() can be transformed as follows: + + +-----------------------------------+----------------------------------+ + | before | after | + +===================================+==================================+ + | :: | :: | + | | | + | | smp_mb_release(); | + | for (i = 0; i < 10; i++) | for (i = 0; i < 10; i++) | + | atomic_mb_set(&a[i], false); | atomic_set(&a[i], false); | + | | smp_mb(); | + +-----------------------------------+----------------------------------+ + + + The other thread can still use ``atomic_mb_read()``/``atomic_mb_set()``. + +The two tricks can be combined. In this case, splitting a loop in +two lets you hoist the barriers out of the loops _and_ eliminate the +expensive ``smp_mb()``: + + +-----------------------------------+----------------------------------+ + | before | after | + +===================================+==================================+ + | :: | :: | + | | | + | | smp_mb_release(); | + | for (i = 0; i < 10; i++) { | for (i = 0; i < 10; i++) | + | atomic_mb_set(&a[i], false); | atomic_set(&a[i], false); | + | atomic_mb_set(&b[i], false); | smb_wmb(); | + | } | for (i = 0; i < 10; i++) | + | | atomic_set(&a[i], false); | + | | smp_mb(); | + +-----------------------------------+----------------------------------+ + + +Memory barrier pairing +---------------------- + +A useful rule of thumb is that memory barriers should always, or almost +always, be paired with another barrier. In the case of QEMU, however, +note that the other barrier may actually be in a driver that runs in +the guest! + +For the purposes of pairing, ``smp_read_barrier_depends()`` and ``smp_rmb()`` +both count as read barriers. A read barrier shall pair with a write +barrier or a full barrier; a write barrier shall pair with a read +barrier or a full barrier. A full barrier can pair with anything. +For example: + + +--------------------+------------------------------+ + | thread 1 | thread 2 | + +====================+==============================+ + | :: | :: | + | | | + | a = 1; | | + | smp_wmb(); | | + | b = 2; | x = b; | + | | smp_rmb(); | + | | y = a; | + +--------------------+------------------------------+ + +Note that the "writing" thread is accessing the variables in the +opposite order as the "reading" thread. This is expected: stores +before the write barrier will normally match the loads after the +read barrier, and vice versa. The same is true for more than 2 +access and for data dependency barriers: + + +----------------------+------------------------------+ + | thread 1 | thread 2 | + +======================+==============================+ + | :: | :: | + | | | + | b[2] = 1; | | + | smp_wmb(); | | + | x->i = 2; | | + | smp_wmb(); | | + | a = x; | x = a; | + | | smp_read_barrier_depends(); | + | | y = x->i; | + | | smp_read_barrier_depends(); | + | | z = b[y]; | + +----------------------+------------------------------+ + +``smp_wmb()`` also pairs with ``atomic_mb_read()`` and ``smp_mb_acquire()``. 
+and ``smp_rmb()`` also pairs with ``atomic_mb_set()`` and ``smp_mb_release()``. + + +Comparison with Linux kernel memory barriers +============================================ + +Here is a list of differences between Linux kernel atomic operations +and memory barriers, and the equivalents in QEMU: + +- atomic operations in Linux are always on a 32-bit int type and + use a boxed ``atomic_t`` type; atomic operations in QEMU are polymorphic + and use normal C types. + +- Originally, ``atomic_read`` and ``atomic_set`` in Linux gave no guarantee + at all. Linux 4.1 updated them to implement volatile + semantics via ``ACCESS_ONCE`` (or the more recent ``READ``/``WRITE_ONCE``). + + QEMU's ``atomic_read`` and ``atomic_set`` implement C11 atomic relaxed + semantics if the compiler supports it, and volatile semantics otherwise. + Both semantics prevent the compiler from doing certain transformations; + the difference is that atomic accesses are guaranteed to be atomic, + while volatile accesses aren't. Thus, in the volatile case we just cross + our fingers hoping that the compiler will generate atomic accesses, + since we assume the variables passed are machine-word sized and + properly aligned. + + No barriers are implied by ``atomic_read`` and ``atomic_set`` in either Linux + or QEMU. + +- atomic read-modify-write operations in Linux are of three kinds: + + ===================== ========================================= + ``atomic_OP`` returns void + ``atomic_OP_return`` returns new value of the variable + ``atomic_fetch_OP`` returns the old value of the variable + ``atomic_cmpxchg`` returns the old value of the variable + ===================== ========================================= + + In QEMU, the second kind does not exist. Currently Linux has + atomic_fetch_or only. QEMU provides and, or, inc, dec, add, sub. + +- different atomic read-modify-write operations in Linux imply + a different set of memory barriers; in QEMU, all of them enforce + sequential consistency, which means they imply full memory barriers + before and after the operation. + +- Linux does not have an equivalent of ``atomic_mb_set()``. In particular, + note that ``smp_store_mb()`` is a little weaker than ``atomic_mb_set()``. + ``atomic_mb_read()`` compiles to the same instructions as Linux's + ``smp_load_acquire()``, but this should be treated as an implementation + detail. + +Sources +======= + +- ``Documentation/memory-barriers.txt`` from the Linux kernel diff --git a/docs/devel/atomics.txt b/docs/devel/atomics.txt deleted file mode 100644 index 67bdf82628..0000000000 --- a/docs/devel/atomics.txt +++ /dev/null @@ -1,403 +0,0 @@ -CPUs perform independent memory operations effectively in random order. -but this can be a problem for CPU-CPU interaction (including interactions -between QEMU and the guest). Multi-threaded programs use various tools -to instruct the compiler and the CPU to restrict the order to something -that is consistent with the expectations of the programmer. - -The most basic tool is locking. Mutexes, condition variables and -semaphores are used in QEMU, and should be the default approach to -synchronization. Anything else is considerably harder, but it's -also justified more often than one would like. The two tools that -are provided by qemu/atomic.h are memory barriers and atomic operations. 
- -Macros defined by qemu/atomic.h fall in three camps: - -- compiler barriers: barrier(); - -- weak atomic access and manual memory barriers: atomic_read(), - atomic_set(), smp_rmb(), smp_wmb(), smp_mb(), smp_mb_acquire(), - smp_mb_release(), smp_read_barrier_depends(); - -- sequentially consistent atomic access: everything else. - - -COMPILER MEMORY BARRIER -======================= - -barrier() prevents the compiler from moving the memory accesses either -side of it to the other side. The compiler barrier has no direct effect -on the CPU, which may then reorder things however it wishes. - -barrier() is mostly used within qemu/atomic.h itself. On some -architectures, CPU guarantees are strong enough that blocking compiler -optimizations already ensures the correct order of execution. In this -case, qemu/atomic.h will reduce stronger memory barriers to simple -compiler barriers. - -Still, barrier() can be useful when writing code that can be interrupted -by signal handlers. - - -SEQUENTIALLY CONSISTENT ATOMIC ACCESS -===================================== - -Most of the operations in the qemu/atomic.h header ensure *sequential -consistency*, where "the result of any execution is the same as if the -operations of all the processors were executed in some sequential order, -and the operations of each individual processor appear in this sequence -in the order specified by its program". - -qemu/atomic.h provides the following set of atomic read-modify-write -operations: - - void atomic_inc(ptr) - void atomic_dec(ptr) - void atomic_add(ptr, val) - void atomic_sub(ptr, val) - void atomic_and(ptr, val) - void atomic_or(ptr, val) - - typeof(*ptr) atomic_fetch_inc(ptr) - typeof(*ptr) atomic_fetch_dec(ptr) - typeof(*ptr) atomic_fetch_add(ptr, val) - typeof(*ptr) atomic_fetch_sub(ptr, val) - typeof(*ptr) atomic_fetch_and(ptr, val) - typeof(*ptr) atomic_fetch_or(ptr, val) - typeof(*ptr) atomic_fetch_xor(ptr, val) - typeof(*ptr) atomic_fetch_inc_nonzero(ptr) - typeof(*ptr) atomic_xchg(ptr, val) - typeof(*ptr) atomic_cmpxchg(ptr, old, new) - -all of which return the old value of *ptr. These operations are -polymorphic; they operate on any type that is as wide as a pointer. - -Similar operations return the new value of *ptr: - - typeof(*ptr) atomic_inc_fetch(ptr) - typeof(*ptr) atomic_dec_fetch(ptr) - typeof(*ptr) atomic_add_fetch(ptr, val) - typeof(*ptr) atomic_sub_fetch(ptr, val) - typeof(*ptr) atomic_and_fetch(ptr, val) - typeof(*ptr) atomic_or_fetch(ptr, val) - typeof(*ptr) atomic_xor_fetch(ptr, val) - -Sequentially consistent loads and stores can be done using: - - atomic_fetch_add(ptr, 0) for loads - atomic_xchg(ptr, val) for stores - -However, they are quite expensive on some platforms, notably POWER and -Arm. Therefore, qemu/atomic.h provides two primitives with slightly -weaker constraints: - - typeof(*ptr) atomic_mb_read(ptr) - void atomic_mb_set(ptr, val) - -The semantics of these primitives map to Java volatile variables, -and are strongly related to memory barriers as used in the Linux -kernel (see below). - -As long as you use atomic_mb_read and atomic_mb_set, accesses cannot -be reordered with each other, and it is also not possible to reorder -"normal" accesses around them. - -However, and this is the important difference between -atomic_mb_read/atomic_mb_set and sequential consistency, it is important -for both threads to access the same volatile variable. 
It is not the -case that everything visible to thread A when it writes volatile field f -becomes visible to thread B after it reads volatile field g. The store -and load have to "match" (i.e., be performed on the same volatile -field) to achieve the right semantics. - - -These operations operate on any type that is as wide as an int or smaller. - - -WEAK ATOMIC ACCESS AND MANUAL MEMORY BARRIERS -============================================= - -Compared to sequentially consistent atomic access, programming with -weaker consistency models can be considerably more complicated. -In general, if the algorithm you are writing includes both writes -and reads on the same side, it is generally simpler to use sequentially -consistent primitives. - -When using this model, variables are accessed with: - -- atomic_read() and atomic_set(); these prevent the compiler from - optimizing accesses out of existence and creating unsolicited - accesses, but do not otherwise impose any ordering on loads and - stores: both the compiler and the processor are free to reorder - them. - -- atomic_load_acquire(), which guarantees the LOAD to appear to - happen, with respect to the other components of the system, - before all the LOAD or STORE operations specified afterwards. - Operations coming before atomic_load_acquire() can still be - reordered after it. - -- atomic_store_release(), which guarantees the STORE to appear to - happen, with respect to the other components of the system, - after all the LOAD or STORE operations specified afterwards. - Operations coming after atomic_store_release() can still be - reordered after it. - -Restrictions to the ordering of accesses can also be specified -using the memory barrier macros: smp_rmb(), smp_wmb(), smp_mb(), -smp_mb_acquire(), smp_mb_release(), smp_read_barrier_depends(). - -Memory barriers control the order of references to shared memory. -They come in six kinds: - -- smp_rmb() guarantees that all the LOAD operations specified before - the barrier will appear to happen before all the LOAD operations - specified after the barrier with respect to the other components of - the system. - - In other words, smp_rmb() puts a partial ordering on loads, but is not - required to have any effect on stores. - -- smp_wmb() guarantees that all the STORE operations specified before - the barrier will appear to happen before all the STORE operations - specified after the barrier with respect to the other components of - the system. - - In other words, smp_wmb() puts a partial ordering on stores, but is not - required to have any effect on loads. - -- smp_mb_acquire() guarantees that all the LOAD operations specified before - the barrier will appear to happen before all the LOAD or STORE operations - specified after the barrier with respect to the other components of - the system. - -- smp_mb_release() guarantees that all the STORE operations specified *after* - the barrier will appear to happen after all the LOAD or STORE operations - specified *before* the barrier with respect to the other components of - the system. - -- smp_mb() guarantees that all the LOAD and STORE operations specified - before the barrier will appear to happen before all the LOAD and - STORE operations specified after the barrier with respect to the other - components of the system. - - smp_mb() puts a partial ordering on both loads and stores. 
It is - stronger than both a read and a write memory barrier; it implies both - smp_mb_acquire() and smp_mb_release(), but it also prevents STOREs - coming before the barrier from overtaking LOADs coming after the - barrier and vice versa. - -- smp_read_barrier_depends() is a weaker kind of read barrier. On - most processors, whenever two loads are performed such that the - second depends on the result of the first (e.g., the first load - retrieves the address to which the second load will be directed), - the processor will guarantee that the first LOAD will appear to happen - before the second with respect to the other components of the system. - However, this is not always true---for example, it was not true on - Alpha processors. Whenever this kind of access happens to shared - memory (that is not protected by a lock), a read barrier is needed, - and smp_read_barrier_depends() can be used instead of smp_rmb(). - - Note that the first load really has to have a _data_ dependency and not - a control dependency. If the address for the second load is dependent - on the first load, but the dependency is through a conditional rather - than actually loading the address itself, then it's a _control_ - dependency and a full read barrier or better is required. - - -This is the set of barriers that is required *between* two atomic_read() -and atomic_set() operations to achieve sequential consistency: - - | 2nd operation | - |-----------------------------------------------| - 1st operation | (after last) | atomic_read | atomic_set | - ---------------+----------------+-------------+----------------| - (before first) | | none | smp_mb_release | - ---------------+----------------+-------------+----------------| - atomic_read | smp_mb_acquire | smp_rmb | ** | - ---------------+----------------+-------------+----------------| - atomic_set | none | smp_mb()*** | smp_wmb() | - ---------------+----------------+-------------+----------------| - - * Or smp_read_barrier_depends(). - - ** This requires a load-store barrier. This is achieved by - either smp_mb_acquire() or smp_mb_release(). - - *** This requires a store-load barrier. On most machines, the only - way to achieve this is a full barrier. - - -You can see that the two possible definitions of atomic_mb_read() -and atomic_mb_set() are the following: - - 1) atomic_mb_read(p) = atomic_read(p); smp_mb_acquire() - atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); smp_mb() - - 2) atomic_mb_read(p) = smp_mb() atomic_read(p); smp_mb_acquire() - atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); - -Usually the former is used, because smp_mb() is expensive and a program -normally has more reads than writes. Therefore it makes more sense to -make atomic_mb_set() the more expensive operation. - -There are two common cases in which atomic_mb_read and atomic_mb_set -generate too many memory barriers, and thus it can be useful to manually -place barriers, or use atomic_load_acquire/atomic_store_release instead: - -- when a data structure has one thread that is always a writer - and one thread that is always a reader, manual placement of - memory barriers makes the write side faster. 
Furthermore, - correctness is easy to check for in this case using the "pairing" - trick that is explained below: - - thread 1 thread 1 - ------------------------- ------------------------ - (other writes) - atomic_mb_set(&a, x) atomic_store_release(&a, x) - atomic_mb_set(&b, y) atomic_store_release(&b, y) - - => - thread 2 thread 2 - ------------------------- ------------------------ - y = atomic_mb_read(&b) y = atomic_load_acquire(&b) - x = atomic_mb_read(&a) x = atomic_load_acquire(&a) - (other reads) - - Note that the barrier between the stores in thread 1, and between - the loads in thread 2, has been optimized here to a write or a - read memory barrier respectively. On some architectures, notably - ARMv7, smp_mb_acquire and smp_mb_release are just as expensive as - smp_mb, but smp_rmb and/or smp_wmb are more efficient. - -- sometimes, a thread is accessing many variables that are otherwise - unrelated to each other (for example because, apart from the current - thread, exactly one other thread will read or write each of these - variables). In this case, it is possible to "hoist" the implicit - barriers provided by atomic_mb_read() and atomic_mb_set() outside - a loop. For example, the above definition atomic_mb_read() gives - the following transformation: - - n = 0; n = 0; - for (i = 0; i < 10; i++) => for (i = 0; i < 10; i++) - n += atomic_mb_read(&a[i]); n += atomic_read(&a[i]); - smp_mb_acquire(); - - Similarly, atomic_mb_set() can be transformed as follows: - - smp_mb_release(); - for (i = 0; i < 10; i++) => for (i = 0; i < 10; i++) - atomic_mb_set(&a[i], false); atomic_set(&a[i], false); - smp_mb(); - - - The other thread can still use atomic_mb_read()/atomic_mb_set(). - -The two tricks can be combined. In this case, splitting a loop in -two lets you hoist the barriers out of the loops _and_ eliminate the -expensive smp_mb(): - - smp_mb_release(); - for (i = 0; i < 10; i++) { => for (i = 0; i < 10; i++) - atomic_mb_set(&a[i], false); atomic_set(&a[i], false); - atomic_mb_set(&b[i], false); smb_wmb(); - } for (i = 0; i < 10; i++) - atomic_set(&a[i], false); - smp_mb(); - - -Memory barrier pairing ----------------------- - -A useful rule of thumb is that memory barriers should always, or almost -always, be paired with another barrier. In the case of QEMU, however, -note that the other barrier may actually be in a driver that runs in -the guest! - -For the purposes of pairing, smp_read_barrier_depends() and smp_rmb() -both count as read barriers. A read barrier shall pair with a write -barrier or a full barrier; a write barrier shall pair with a read -barrier or a full barrier. A full barrier can pair with anything. -For example: - - thread 1 thread 2 - =============== =============== - a = 1; - smp_wmb(); - b = 2; x = b; - smp_rmb(); - y = a; - -Note that the "writing" thread is accessing the variables in the -opposite order as the "reading" thread. This is expected: stores -before the write barrier will normally match the loads after the -read barrier, and vice versa. The same is true for more than 2 -access and for data dependency barriers: - - thread 1 thread 2 - =============== =============== - b[2] = 1; - smp_wmb(); - x->i = 2; - smp_wmb(); - a = x; x = a; - smp_read_barrier_depends(); - y = x->i; - smp_read_barrier_depends(); - z = b[y]; - -smp_wmb() also pairs with atomic_mb_read() and smp_mb_acquire(). -and smp_rmb() also pairs with atomic_mb_set() and smp_mb_release(). 
- - -COMPARISON WITH LINUX KERNEL MEMORY BARRIERS -============================================ - -Here is a list of differences between Linux kernel atomic operations -and memory barriers, and the equivalents in QEMU: - -- atomic operations in Linux are always on a 32-bit int type and - use a boxed atomic_t type; atomic operations in QEMU are polymorphic - and use normal C types. - -- Originally, atomic_read and atomic_set in Linux gave no guarantee - at all. Linux 4.1 updated them to implement volatile - semantics via ACCESS_ONCE (or the more recent READ/WRITE_ONCE). - - QEMU's atomic_read/set implement, if the compiler supports it, C11 - atomic relaxed semantics, and volatile semantics otherwise. - Both semantics prevent the compiler from doing certain transformations; - the difference is that atomic accesses are guaranteed to be atomic, - while volatile accesses aren't. Thus, in the volatile case we just cross - our fingers hoping that the compiler will generate atomic accesses, - since we assume the variables passed are machine-word sized and - properly aligned. - No barriers are implied by atomic_read/set in either Linux or QEMU. - -- atomic read-modify-write operations in Linux are of three kinds: - - atomic_OP returns void - atomic_OP_return returns new value of the variable - atomic_fetch_OP returns the old value of the variable - atomic_cmpxchg returns the old value of the variable - - In QEMU, the second kind does not exist. Currently Linux has - atomic_fetch_or only. QEMU provides and, or, inc, dec, add, sub. - -- different atomic read-modify-write operations in Linux imply - a different set of memory barriers; in QEMU, all of them enforce - sequential consistency, which means they imply full memory barriers - before and after the operation. - -- Linux does not have an equivalent of atomic_mb_set(). In particular, - note that smp_store_mb() is a little weaker than atomic_mb_set(). - atomic_mb_read() compiles to the same instructions as Linux's - smp_load_acquire(), but this should be treated as an implementation - detail. - -SOURCES -======= - -* Documentation/memory-barriers.txt from the Linux kernel - -* "The JSR-133 Cookbook for Compiler Writers", available at - http://g.oswego.edu/dl/jmm/cookbook.html diff --git a/docs/devel/index.rst b/docs/devel/index.rst index b734ba4655..a9e1200dff 100644 --- a/docs/devel/index.rst +++ b/docs/devel/index.rst @@ -17,6 +17,7 @@ Contents: loads-stores memory migration + atomics stable-process testing decodetree -- 2.18.2