Re: [Qemu-devel] [PATCH 04/30] add a header file for atomic operations

liu ping fan Tue, 02 Jul 2013 19:26:36 -0700

On Sat, Jun 29, 2013 at 2:26 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> We're already using them in several places, but __sync builtins are just
> too ugly to type, and do not provide seqcst load/store operations.
>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> ---
>  docs/atomics.txt         | 345 
> +++++++++++++++++++++++++++++++++++++++++++++++
>  hw/display/qxl.c         |   3 +-
>  hw/virtio/vhost.c        |   9 +-
>  include/qemu/atomic.h    | 190 +++++++++++++++++++++-----
>  migration.c              |   3 +-
>  tests/test-thread-pool.c |   8 +-
>  6 files changed, 514 insertions(+), 44 deletions(-)
>  create mode 100644 docs/atomics.txt
>
> diff --git a/docs/atomics.txt b/docs/atomics.txt
> new file mode 100644
> index 0000000..e2ce04b
> --- /dev/null
> +++ b/docs/atomics.txt
> @@ -0,0 +1,345 @@
> +CPUs perform independent memory operations effectively in random order.
> +but this can be a problem for CPU-CPU interaction (including interactions
> +between QEMU and the guest).  Multi-threaded programs use various tools
> +to instruct the compiler and the CPU to restrict the order to something
> +that is consistent with the expectations of the programmer.
> +
> +The most basic tool is locking.  Mutexes, condition variables and
> +semaphores are used in QEMU, and should be the default approach to
> +synchronization.  Anything else is considerably harder, but it's
> +also justified more often than one would like.  The two tools that
> +are provided by qemu/atomic.h are memory barriers and atomic operations.
> +
> +Macros defined by qemu/atomic.h fall in three camps:
> +
> +- compiler barriers: barrier();
> +
> +- weak atomic access and manual memory barriers: atomic_read(),
> +  atomic_set(), smp_rmb(), smp_wmb(), smp_mb(), smp_read_barrier_depends();
> +
> +- sequentially consistent atomic access: everything else.
> +
> +
> +COMPILER MEMORY BARRIER
> +=======================
> +
> +barrier() prevents the compiler from moving the memory accesses either
> +side of it to the other side.  The compiler barrier has no direct effect
> +on the CPU, which may then reorder things however it wishes.
> +
> +barrier() is mostly used within qemu/atomic.h itself.  On some
> +architectures, CPU guarantees are strong enough that blocking compiler
> +optimizations already ensures the correct order of execution.  In this
> +case, qemu/atomic.h will reduce stronger memory barriers to simple
> +compiler barriers.
> +
> +Still, barrier() can be useful when writing code that can be interrupted
> +by signal handlers.
> +
> +
> +SEQUENTIALLY CONSISTENT ATOMIC ACCESS
> +=====================================
> +
> +Most of the operations in the qemu/atomic.h header ensure *sequential
> +consistency*, where "the result of any execution is the same as if the
> +operations of all the processors were executed in some sequential order,
> +and the operations of each individual processor appear in this sequence
> +in the order specified by its program".
> +
> +qemu/atomic.h provides the following set of atomic read-modify-write
> +operations:
> +
> +    typeof(*ptr) atomic_inc(ptr)
> +    typeof(*ptr) atomic_dec(ptr)
> +    typeof(*ptr) atomic_add(ptr, val)
> +    typeof(*ptr) atomic_sub(ptr, val)
> +    typeof(*ptr) atomic_and(ptr, val)
> +    typeof(*ptr) atomic_or(ptr, val)
> +    typeof(*ptr) atomic_xchg(ptr, val
> +    typeof(*ptr) atomic_cmpxchg(ptr, old, new)
> +
> +all of which return the old value of *ptr.  These operations are
> +polymorphic; they operate on any type that is as wide as an int.
> +
> +Sequentially consistent loads and stores can be done using:
> +
> +    atomic_add(ptr, 0) for loads
> +    atomic_xchg(ptr, val) for stores
> +
> +However, they are quite expensive on some platforms, notably POWER and
> +ARM.  Therefore, qemu/atomic.h provides two primitives with slightly
> +weaker constraints:
> +
> +    typeof(*ptr) atomic_mb_read(ptr)
> +    void         atomic_mb_set(ptr, val)
> +
> +The semantics of these primitives map to Java volatile variables,
> +and are strongly related to memory barriers as used in the Linux
> +kernel (see below).
> +
> +As long as you use atomic_mb_read and atomic_mb_set, accesses cannot
> +be reordered with each other, and it is also not possible to reorder
> +"normal" accesses around them.
> +
> +However, and this is the important difference between
> +atomic_mb_read/atomic_mb_set and sequential consistency, it is important
> +for both threads to access the same volatile variable.  It is not the
> +case that everything visible to thread A when it writes volatile field f
> +becomes visible to thread B after it reads volatile field g. The store
> +and load have to "match" (i.e., be performed on the same volatile
> +field) to achieve the right semantics.
> +
> +
> +These operations operate on any type that is as wide as an int or smaller.
> +
> +
> +WEAK ATOMIC ACCESS AND MANUAL MEMORY BARRIERS
> +=============================================
> +
> +Compared to sequentially consistent atomic access, programming with
> +weaker consistency models can be considerably more complicated.
> +In general, if the algorithm you are writing includes both writes
> +and reads on the same side, it is generally simpler to use sequentially
> +consistent primitives.
> +
> +When using this model, variables are accessed with atomic_read() and
> +atomic_set(), and restrictions to the ordering of accesses is enforced
> +using the smp_rmb(), smp_wmb(), smp_mb() and smp_read_barrier_depends()
> +memory barriers.
> +
> +atomic_read() and atomic_set() prevents the compiler from using
> +optimizations that might otherwise optimize accesses out of existence
> +on the one hand, or that might create unsolicited accesses on the other.
> +In general this should not have any effect, because the same compiler
> +barriers are already implied by memory barriers.  However, it is useful
> +to do so, because it tells readers which variables are shared with
> +other threads, and which are local to the current thread or protected
> +by other, more mundane means.
> +
> +Memory barriers control the order of references to shared memory.
> +They come in four kinds:
> +
> +- smp_rmb() guarantees that all the LOAD operations specified before
> +  the barrier will appear to happen before all the LOAD operations
> +  specified after the barrier with respect to the other components of
> +  the system.
> +
> +  In other words, smp_rmb() puts a partial ordering on loads, but is not
> +  required to have any effect on stores.
> +
> +- smp_wmb() guarantees that all the STORE operations specified before
> +  the barrier will appear to happen before all the STORE operations
> +  specified after the barrier with respect to the other components of
> +  the system.
> +
> +  In other words, smp_wmb() puts a partial ordering on stores, but is not
> +  required to have any effect on loads.
> +
> +- smp_mb() guarantees that all the LOAD and STORE operations specified
> +  before the barrier will appear to happen before all the LOAD and
> +  STORE operations specified after the barrier with respect to the other
> +  components of the system.
> +
> +  smp_mb() puts a partial ordering on both loads and stores.  It is
> +  stronger than both a read and a write memory barrier; it implies both
> +  smp_rmb() and smp_wmb(), but it also prevents STOREs coming before the
> +  barrier from overtaking LOADs coming after the barrier and vice versa.
> +
> +- smp_read_barrier_depends() is a weaker kind of read barrier.  On
> +  most processors, whenever two loads are performed such that the
> +  second depends on the result of the first (e.g., the first load
> +  retrieves the address to which the second load will be directed),
> +  the processor will guarantee that the first LOAD will appear to happen
> +  before the second with respect to the other components of the system.
> +  However, this is not always true---for example, it was not true on
> +  Alpha processors.  Whenever this kind of access happens to shared
> +  memory (that is not protected by a lock), a read barrier is needed,
> +  and smp_read_barrier_depends() can be used instead of smp_rmb().
> +
> +  Note that the first load really has to have a _data_ dependency and not
> +  a control dependency.  If the address for the second load is dependent
> +  on the first load, but the dependency is through a conditional rather
> +  than actually loading the address itself, then it's a _control_
> +  dependency and a full read barrier or better is required.
> +
> +
> +This is the set of barriers that is required *between* two atomic_read()
> +and atomic_set() operations to achieve sequential consistency:
> +
> +                    |               2nd operation             |
> +                    |-----------------------------------------|
> +     1st operation  | (after last) | atomic_read | atomic_set |
> +     ---------------+--------------+-------------+------------|
> +     (before first) |              | none        | smp_wmb()  |
> +     ---------------+--------------+-------------+------------|
> +     atomic_read    | smp_rmb()    | smp_rmb()*  | **         |
> +     ---------------+--------------+-------------+------------|
> +     atomic_set     | none         | smp_mb()*** | smp_wmb()  |
> +     ---------------+--------------+-------------+------------|
> +
> +       * Or smp_read_barrier_depends().
> +
> +      ** This requires a load-store barrier.  How to achieve this varies
> +         depending on the machine, but in practice smp_rmb()+smp_wmb()
> +         should have the desired effect.  For example, on PowerPC the
> +         lwsync instruction is a combined load-load, load-store and
> +         store-store barrier.
> +
> +     *** This requires a store-load barrier.  On most machines, the only
> +         way to achieve this is a full barrier.
> +
> +
> +You can see that the two possible definitions of atomic_mb_read()
> +and atomic_mb_set() are the following:
> +
> +    1) atomic_mb_read(p)   = atomic_read(p); smp_rmb()
> +       atomic_mb_set(p, v) = smp_wmb(); atomic_set(p, v); smp_mb()
> +
> +    2) atomic_mb_read(p)   = smp_mb() atomic_read(p); smp_rmb()
> +       atomic_mb_set(p, v) = smp_wmb(); atomic_set(p, v);
> +
> +Usually the former is used, because smp_mb() is expensive and a program
> +normally has more reads than writes.  Therefore it makes more sense to
> +make atomic_mb_set() the more expensive operation.
> +
> +There are two common cases in which atomic_mb_read and atomic_mb_set
> +generate too many memory barriers, and thus it can be useful to manually
> +place barriers instead:
> +
> +- when a data structure has one thread that is always a writer
> +  and one thread that is always a reader, manual placement of
> +  memory barriers makes the write side faster.  Furthermore,
> +  correctness is easy to check for in this case using the "pairing"
> +  trick that is explained below:
> +
> +     thread 1                                thread 1
> +     -------------------------               ------------------------
> +     (other writes)
> +                                             smp_wmb()
> +     atomic_mb_set(&a, x)                    atomic_set(&a, x)
> +                                             smp_wmb()
> +     atomic_mb_set(&b, y)                    atomic_set(&b, y)
> +
> +                                       =>
> +     thread 2                                thread 2
> +     -------------------------               ------------------------
> +     y = atomic_mb_read(&b)                  y = atomic_read(&b)
> +                                             smp_rmb()
> +     x = atomic_mb_read(&a)                  x = atomic_read(&a)
> +                                             smp_rmb()
> +
> +- sometimes, a thread is accessing many variables that are otherwise
> +  unrelated to each other (for example because, apart from the current
> +  thread, exactly one other thread will read or write each of these
> +  variables).  In this case, it is possible to "hoist" the implicit
> +  barriers provided by atomic_mb_read() and atomic_mb_set() outside
> +  a loop.  For example, the above definition atomic_mb_read() gives
> +  the following transformation:
> +
> +     n = 0;                                  n = 0;
> +     for (i = 0; i < 10; i++)          =>    for (i = 0; i < 10; i++)
> +       n += atomic_mb_read(&a[i]);             n += atomic_read(&a[i]);
> +                                             smp_rmb();
> +
> +  Similarly, atomic_mb_set() can be transformed as follows:
> +  smp_mb():
> +
> +                                             smp_wmb();
> +     for (i = 0; i < 10; i++)          =>    for (i = 0; i < 10; i++)
> +       atomic_mb_set(&a[i], false);            atomic_set(&a[i], false);
> +                                             smp_mb();
> +
> +
> +The two tricks can be combined.  In this case, splitting a loop in
> +two lets you hoist the barriers out of the loops _and_ eliminate the
> +expensive smp_mb():
> +
> +                                             smp_wmb();
> +     for (i = 0; i < 10; i++) {        =>    for (i = 0; i < 10; i++)
> +       atomic_mb_set(&a[i], false);            atomic_set(&a[i], false);
> +       atomic_mb_set(&b[i], false);          smb_wmb();
> +     }                                       for (i = 0; i < 10; i++)
> +                                               atomic_set(&a[i], false);
> +                                             smp_mb();
> +
> +  The other thread can still use atomic_mb_read()/atomic_mb_set()
> +
> +
> +Memory barrier pairing
> +----------------------
> +
> +A useful rule of thumb is that memory barriers should always, or almost
> +always, be paired with another barrier.  In the case of QEMU, however,
> +note that the other barrier may actually be in a driver that runs in
> +the guest!
> +
> +For the purposes of pairing, smp_read_barrier_depends() and smp_rmb()
> +both count as read barriers.  A read barriers shall pair with a write
> +barrier or a full barrier; a write barrier shall pair with a read
> +barrier or a full barrier.  A full barrier can pair with anything.
> +For example:
> +
> +        thread 1             thread 2
> +        ===============      ===============
> +        a = 1;
> +        smp_wmb();
> +        b = 2;               x = b;
> +                             smp_rmb();
> +                             y = a;
> +
> +Note that the "writing" thread are accessing the variables in the
> +opposite order as the "reading" thread.  This is expected: stores
> +before the write barrier will normally match the loads after the
> +read barrier, and vice versa.  The same is true for more than 2
> +access and for data dependency barriers:
> +
> +        thread 1             thread 2
> +        ===============      ===============
> +        b[2] = 1;
> +        smp_wmb();
> +        x->i = 2;
> +        smp_wmb();
> +        a = x;               x = a;
> +                             smp_read_barrier_depends();
> +                             y = x->i;
> +                             smp_read_barrier_depends();
> +                             z = b[y];
> +
> +smp_wmb() also pairs with atomic_mb_read(), and smp_rmb() also pairs
> +with atomic_mb_set().
> +
> +
> +COMPARISON WITH LINUX KERNEL MEMORY BARRIERS
> +============================================
> +
> +Here is a list of differences between Linux kernel atomic operations
> +and memory barriers, and the equivalents in QEMU:
> +
> +- atomic operations in Linux are always on a 32-bit int type and


32-bit?  int should be the integer type that the target processor is
most efficient working with.

Regards,
Pingfan
> +  use a boxed atomic_t type; atomic operations in QEMU are polymorphic
> +  and use normal C types.
> +
> +- atomic_read and atomic_set in Linux give no guarantee at all;
> +  atomic_read and atomic_set in QEMU include a compiler barrier
> +  (similar to the ACCESS_ONCE macro in Linux).
> +
> +- most atomic read-modify-write operations in Linux return void;
> +  in QEMU, all of them return the old value of the variable.
> +
> +- different atomic read-modify-write operations in Linux imply
> +  a different set of memory barriers; in QEMU, all of them enforce
> +  sequential consistency, which means they imply full memory barriers
> +  before and after the operation.
> +
> +- Linux does not have an equivalent of atomic_mb_read() and
> +  atomic_mb_set().  In particular, note that set_mb() is a little
> +  weaker than atomic_mb_set().
> +
> +
> +SOURCES
> +=======
> +
> +* Documentation/memory-barriers.txt from the Linux kernel
> +
> +* "The JSR-133 Cookbook for Compiler Writers", available at
> +  http://g.oswego.edu/dl/jmm/cookbook.html
> diff --git a/hw/display/qxl.c b/hw/display/qxl.c
> index 3862d7a..f24cb4e 100644
> --- a/hw/display/qxl.c
> +++ b/hw/display/qxl.c
> @@ -23,6 +23,7 @@
>  #include "qemu-common.h"
>  #include "qemu/timer.h"
>  #include "qemu/queue.h"
> +#include "qemu/atomic.h"
>  #include "monitor/monitor.h"
>  #include "sysemu/sysemu.h"
>  #include "trace.h"
> @@ -1726,7 +1727,7 @@ static void qxl_send_events(PCIQXLDevice *d, uint32_t 
> events)
>          trace_qxl_send_events_vm_stopped(d->id, events);
>          return;
>      }
> -    old_pending = __sync_fetch_and_or(&d->ram->int_pending, le_events);
> +    old_pending = atomic_or(&d->ram->int_pending, le_events);
>      if ((old_pending & le_events) == le_events) {
>          return;
>      }
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 96ab625..8f6ab13 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -16,6 +16,7 @@
>  #include <sys/ioctl.h>
>  #include "hw/virtio/vhost.h"
>  #include "hw/hw.h"
> +#include "qemu/atomic.h"
>  #include "qemu/range.h"
>  #include <linux/vhost.h>
>  #include "exec/address-spaces.h"
> @@ -47,11 +48,9 @@ static void vhost_dev_sync_region(struct vhost_dev *dev,
>              addr += VHOST_LOG_CHUNK;
>              continue;
>          }
> -        /* Data must be read atomically. We don't really
> -         * need the barrier semantics of __sync
> -         * builtins, but it's easier to use them than
> -         * roll our own. */
> -        log = __sync_fetch_and_and(from, 0);
> +        /* Data must be read atomically. We don't really need barrier 
> semantics
> +         * but it's easier to use atomic_* than roll our own. */
> +        log = atomic_xchg(from, 0);
>          while ((bit = sizeof(log) > sizeof(int) ?
>                  ffsll(log) : ffs(log))) {
>              hwaddr page_addr;
> diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
> index 10becb6..04d64d0 100644
> --- a/include/qemu/atomic.h
> +++ b/include/qemu/atomic.h
> @@ -1,68 +1,194 @@
> -#ifndef __QEMU_BARRIER_H
> -#define __QEMU_BARRIER_H 1
> +/*
> + * Simple interface for atomic operations.
> + *
> + * Copyright (C) 2013 Red Hat, Inc.
> + *
> + * Author: Paolo Bonzini <pbonz...@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
>
> -/* Compiler barrier */
> -#define barrier()   asm volatile("" ::: "memory")
> +#ifndef __QEMU_ATOMIC_H
> +#define __QEMU_ATOMIC_H 1
>
> -#if defined(__i386__)
> +#include "qemu/compiler.h"
>
> -#include "qemu/compiler.h"        /* QEMU_GNUC_PREREQ */
> +/* For C11 atomic ops */
>
> -/*
> - * Because of the strongly ordered x86 storage model, wmb() and rmb() are 
> nops
> - * on x86(well, a compiler barrier only).  Well, at least as long as
> - * qemu doesn't do accesses to write-combining memory or non-temporal
> - * load/stores from C code.
> - */
> -#define smp_wmb()   barrier()
> -#define smp_rmb()   barrier()
> +/* Compiler barrier */
> +#define barrier()   ({ asm volatile("" ::: "memory"); (void)0; })
> +
> +#ifndef __ATOMIC_RELAXED
>
>  /*
> - * We use GCC builtin if it's available, as that can use
> - * mfence on 32 bit as well, e.g. if built with -march=pentium-m.
> - * However, on i386, there seem to be known bugs as recently as 4.3.
> - * */
> -#if QEMU_GNUC_PREREQ(4, 4)
> -#define smp_mb() __sync_synchronize()
> + * We use GCC builtin if it's available, as that can use mfence on
> + * 32-bit as well, e.g. if built with -march=pentium-m. However, on
> + * i386 the spec is buggy, and the implementation followed it until
> + * 4.3 (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36793).
> + */
> +#if defined(__i386__) || defined(__x86_64__)
> +#if !QEMU_GNUC_PREREQ(4, 4)
> +#if defined __x86_64__
> +#define smp_mb()    ({ asm volatile("mfence" ::: "memory"); (void)0; })
>  #else
> -#define smp_mb() asm volatile("lock; addl $0,0(%%esp) " ::: "memory")
> +#define smp_mb()    ({ asm volatile("lock; addl $0,0(%%esp) " ::: "memory"); 
> (void)0; })
> +#endif
> +#endif
>  #endif
>
> -#elif defined(__x86_64__)
>
> +#ifdef __alpha__
> +#define smp_read_barrier_depends()   asm volatile("mb":::"memory")
> +#endif
> +
> +#if defined(__i386__) || defined(__x86_64__) || defined(__s390x__)
> +
> +/*
> + * Because of the strongly ordered storage model, wmb() and rmb() are nops
> + * here (a compiler barrier only).  QEMU doesn't do accesses to 
> write-combining
> + * qemu memory or non-temporal load/stores from C code.
> + */
>  #define smp_wmb()   barrier()
>  #define smp_rmb()   barrier()
> -#define smp_mb() asm volatile("mfence" ::: "memory")
> +
> +/*
> + * __sync_lock_test_and_set() is documented to be an acquire barrier only,
> + * but it is a full barrier at the hardware level.  Add a compiler barrier
> + * to make it a full barrier also at the compiler level.
> + */
> +#define atomic_xchg(ptr, i)    (barrier(), __sync_lock_test_and_set(ptr, i))
> +
> +/*
> + * Load/store with Java volatile semantics.
> + */
> +#define atomic_mb_set(ptr, i)  ((void)atomic_xchg(ptr, i))
>
>  #elif defined(_ARCH_PPC)
>
>  /*
>   * We use an eieio() for wmb() on powerpc.  This assumes we don't
>   * need to order cacheable and non-cacheable stores with respect to
> - * each other
> + * each other.
> + *
> + * smp_mb has the same problem as on x86 for not-very-new GCC
> + * (http://patchwork.ozlabs.org/patch/126184/, Nov 2011).
>   */
> -#define smp_wmb()   asm volatile("eieio" ::: "memory")
> -
> +#define smp_wmb()   ({ asm volatile("eieio" ::: "memory"); (void)0; })
>  #if defined(__powerpc64__)
> -#define smp_rmb()   asm volatile("lwsync" ::: "memory")
> +#define smp_rmb()   ({ asm volatile("lwsync" ::: "memory"); (void)0; })
>  #else
> -#define smp_rmb()   asm volatile("sync" ::: "memory")
> +#define smp_rmb()   ({ asm volatile("sync" ::: "memory"); (void)0; })
>  #endif
> +#define smp_mb()    ({ asm volatile("sync" ::: "memory"); (void)0; })
>
> -#define smp_mb()   asm volatile("sync" ::: "memory")
> +#endif /* _ARCH_PPC */
>
> -#else
> +#endif /* C11 atomics */
>
>  /*
>   * For (host) platforms we don't have explicit barrier definitions
>   * for, we use the gcc __sync_synchronize() primitive to generate a
>   * full barrier.  This should be safe on all platforms, though it may
> - * be overkill for wmb() and rmb().
> + * be overkill for smp_wmb() and smp_rmb().
>   */
> +#ifndef smp_mb
> +#define smp_mb()    __sync_synchronize()
> +#endif
> +
> +#ifndef smp_wmb
> +#ifdef __ATOMIC_RELEASE
> +#define smp_wmb()   __atomic_thread_fence(__ATOMIC_RELEASE)
> +#else
>  #define smp_wmb()   __sync_synchronize()
> -#define smp_mb()   __sync_synchronize()
> +#endif
> +#endif
> +
> +#ifndef smp_rmb
> +#ifdef __ATOMIC_ACQUIRE
> +#define smp_rmb()   __atomic_thread_fence(__ATOMIC_ACQUIRE)
> +#else
>  #define smp_rmb()   __sync_synchronize()
> +#endif
> +#endif
> +
> +#ifndef smp_read_barrier_depends
> +#ifdef __ATOMIC_CONSUME
> +#define smp_read_barrier_depends()   __atomic_thread_fence(__ATOMIC_CONSUME)
> +#else
> +#define smp_read_barrier_depends()   barrier()
> +#endif
> +#endif
>
> +#ifndef atomic_read
> +#define atomic_read(ptr)       (*(__typeof__(*ptr) *volatile) (ptr))
>  #endif
>
> +#ifndef atomic_set
> +#define atomic_set(ptr, i)     ((*(__typeof__(*ptr) *volatile) (ptr)) = (i))
> +#endif
> +
> +/* These have the same semantics as Java volatile variables.
> + * See http://gee.cs.oswego.edu/dl/jmm/cookbook.html:
> + * "1. Issue a StoreStore barrier (wmb) before each volatile store."
> + *  2. Issue a StoreLoad barrier after each volatile store.
> + *     Note that you could instead issue one before each volatile load, but
> + *     this would be slower for typical programs using volatiles in which
> + *     reads greatly outnumber writes. Alternatively, if available, you
> + *     can implement volatile store as an atomic instruction (for example
> + *     XCHG on x86) and omit the barrier. This may be more efficient if
> + *     atomic instructions are cheaper than StoreLoad barriers.
> + *  3. Issue LoadLoad and LoadStore barriers after each volatile load."
> + *
> + * If you prefer to think in terms of "pairing" of memory barriers,
> + * an atomic_mb_read pairs with an atomic_mb_set.
> + *
> + * And for the few ia64 lovers that exist, an atomic_mb_read is a ld.acq,
> + * while an atomic_mb_set is a st.rel followed by a memory barrier.
> + *
> + * These are a bit weaker than __atomic_load/store with __ATOMIC_SEQ_CST
> + * (see docs/atomics.txt), and I'm not sure that __ATOMIC_ACQ_REL is enough.
> + * Just always use the barriers manually by the rules above.
> + */
> +#ifndef atomic_mb_read
> +#define atomic_mb_read(ptr)    ({           \
> +    typeof(*ptr) _val = atomic_read(ptr);   \
> +    smp_rmb();                              \
> +    _val;                                   \
> +})
> +#endif
> +
> +#ifndef atomic_mb_set
> +#define atomic_mb_set(ptr, i)  do {         \
> +    smp_wmb();                              \
> +    atomic_set(ptr, i);                     \
> +    smp_mb();                               \
> +} while (0)
> +#endif
> +
> +#ifndef atomic_xchg
> +#ifdef __ATOMIC_SEQ_CST
> +#define atomic_xchg(ptr, i)    ({                           \
> +    typeof(*ptr) _new = (i), _old;                          \
> +    __atomic_exchange(ptr, &_new, &_old, __ATOMIC_SEQ_CST); \
> +    _old;                                                   \
> +})
> +#elif defined __clang__
> +#define atomic_xchg(ptr, i)    __sync_exchange(ptr, i)
> +#else
> +/* __sync_lock_test_and_set() is documented to be an acquire barrier only.  
> */
> +#define atomic_xchg(ptr, i)    (smp_mb(), __sync_lock_test_and_set(ptr, i))
> +#endif
> +#endif
> +
> +/* Provide shorter names for GCC atomic builtins.  */
> +#define atomic_inc(ptr)        __sync_fetch_and_add(ptr, 1)
> +#define atomic_dec(ptr)        __sync_fetch_and_add(ptr, -1)
> +#define atomic_add             __sync_fetch_and_add
> +#define atomic_sub             __sync_fetch_and_sub
> +#define atomic_and             __sync_fetch_and_and
> +#define atomic_or              __sync_fetch_and_or
> +#define atomic_cmpxchg         __sync_val_compare_and_swap
> +
>  #endif
> diff --git a/migration.c b/migration.c
> index 058f9e6..83f5691 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -290,8 +290,7 @@ static void migrate_fd_cleanup(void *opaque)
>
>  static void migrate_finish_set_state(MigrationState *s, int new_state)
>  {
> -    if (__sync_val_compare_and_swap(&s->state, MIG_STATE_ACTIVE,
> -                                    new_state) == new_state) {
> +    if (atomic_cmpxchg(&s->state, MIG_STATE_ACTIVE, new_state) == new_state) 
> {
>          trace_migrate_set_state(new_state);
>      }
>  }
> diff --git a/tests/test-thread-pool.c b/tests/test-thread-pool.c
> index 22915aa..dbf2fc8 100644
> --- a/tests/test-thread-pool.c
> +++ b/tests/test-thread-pool.c
> @@ -17,15 +17,15 @@ typedef struct {
>  static int worker_cb(void *opaque)
>  {
>      WorkerTestData *data = opaque;
> -    return __sync_fetch_and_add(&data->n, 1);
> +    return atomic_inc(&data->n);
>  }
>
>  static int long_cb(void *opaque)
>  {
>      WorkerTestData *data = opaque;
> -    __sync_fetch_and_add(&data->n, 1);
> +    atomic_inc(&data->n);
>      g_usleep(2000000);
> -    __sync_fetch_and_add(&data->n, 1);
> +    atomic_inc(&data->n);
>      return 0;
>  }
>
> @@ -169,7 +169,7 @@ static void test_cancel(void)
>      /* Cancel the jobs that haven't been started yet.  */
>      num_canceled = 0;
>      for (i = 0; i < 100; i++) {
> -        if (__sync_val_compare_and_swap(&data[i].n, 0, 3) == 0) {
> +        if (atomic_cmpxchg(&data[i].n, 0, 3) == 0) {
>              data[i].ret = -ECANCELED;
>              bdrv_aio_cancel(data[i].aiocb);
>              active--;
> --
> 1.8.1.4
>
>
>

Re: [Qemu-devel] [PATCH 04/30] add a header file for atomic operations

Reply via email to