During normal operation, the virtio host first writes the used index
and then checks whether it should interrupt the guest
by reading the guest's avail flag/used event index values.
The guest does the reverse: it writes the avail flag/used event index,
then checks the used ring.
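
Roughly, in simplified C (this is only an illustration: the struct and
function names below are made up and are not the actual virtio ring layout
or QEMU code):

    #include <stdint.h>

    struct ring_state {                 /* minimal model of the shared state */
        volatile uint16_t used_idx;     /* written by host, read by guest    */
        volatile uint16_t avail_flags;  /* written by guest, read by host    */
    };

    /* host: publish completed work, then decide whether to notify */
    static int host_publish_and_check(struct ring_state *s, uint16_t idx)
    {
        s->used_idx = idx;              /* 1. used index write               */
        return s->avail_flags == 0;     /* 2. avail flag read: interrupt?    */
    }

    /* guest: enable notifications, then re-check for pending work */
    static int guest_enable_and_check(struct ring_state *s, uint16_t seen)
    {
        s->avail_flags = 0;             /* 1. avail flag write: enable irqs  */
        return s->used_idx != seen;     /* 2. used index read: ring empty?   */
    }

Each side performs a store followed by a load of the other side's data,
which is exactly the pattern that needs a full memory barrier.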

The ordering is important: if the host's avail flag/used event index read
is reordered before the used index write, we can in effect get this timing:

host avail flag/used event index read
                guest enable interrupts: this performs
                         avail flag/used event index write
                guest check used ring: ring is empty
host used index write

This timing results in a lost interrupt: the guest will never be
notified about the used ring update.
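
In other words, this is the classic store-followed-by-load pattern: on x86,
a write barrier (smp_wmb) or a plain compiler barrier does not help, because
a store followed by a load of a different location can appear reordered;
only a full barrier prevents that. Sketching the fix on the simplified model
above (using the gcc __sync_synchronize() builtin directly instead of the
smp_mb() macro this patch adds):

    static int host_publish_and_check_fixed(struct ring_state *s, uint16_t idx)
    {
        s->used_idx = idx;              /* used index write                  */
        __sync_synchronize();           /* full barrier: order store vs load */
        return s->avail_flags == 0;     /* avail flag read                   */
    }

    static int guest_enable_and_check_fixed(struct ring_state *s, uint16_t seen)
    {
        s->avail_flags = 0;             /* avail flag write                  */
        __sync_synchronize();           /* full barrier: order store vs load */
        return s->used_idx != seen;     /* used index read                   */
    }

With both barriers in place, at least one side is guaranteed to observe the
other side's write, so either the host decides to send the interrupt or the
guest sees the non-empty used ring. The guest side barrier is the guest
driver's responsibility; this patch only adds the missing host side one.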

This has actually been observed in the field when using qemu-kvm with
the guest vcpu and the qemu I/O thread running on different host cpus.
It only seems to trigger on some specific hardware,
and only with userspace virtio: vhost has the necessary smp_mb() in
place to prevent the reordering, so the same workload stalls forever
waiting for an interrupt with vhost=off but works fine with vhost=on.

Insert an smp_mb() barrier in userspace virtio to
enforce the correct ordering.
Applying this patch fixed the race condition we observed.
Tested on x86_64. I inspected the code generated by the new macro
for i386 and ppc, but did not run virtio on those hosts.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---
 hw/virtio.c    |    2 ++
 qemu-barrier.h |   23 ++++++++++++++++++++---
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/hw/virtio.c b/hw/virtio.c
index f805790..6449746 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -693,6 +693,8 @@ static bool vring_notify(VirtIODevice *vdev, VirtQueue *vq)
 {
     uint16_t old, new;
     bool v;
+    /* We need to expose used array entries before checking used event. */
+    smp_mb();
     /* Always notify when queue is empty (when feature acknowledge) */
     if (((vdev->guest_features & (1 << VIRTIO_F_NOTIFY_ON_EMPTY)) &&
          !vq->inuse && vring_avail_idx(vq) == vq->last_avail_idx)) {
diff --git a/qemu-barrier.h b/qemu-barrier.h
index c11bb2b..f6722a8 100644
--- a/qemu-barrier.h
+++ b/qemu-barrier.h
@@ -4,7 +4,7 @@
 /* Compiler barrier */
 #define barrier()   asm volatile("" ::: "memory")
 
-#if defined(__i386__) || defined(__x86_64__)
+#if defined(__i386__)
 
 /*
  * Because of the strongly ordered x86 storage model, wmb() is a nop
@@ -13,15 +13,31 @@
  * load/stores from C code.
  */
 #define smp_wmb()   barrier()
+/*
+ * We use the GCC builtin if it's available, as that can use
+ * mfence on 32 bit as well, e.g. if built with -march=pentium-m.
+ * However, on i386, there seem to be known bugs as recently as 4.3.
+ */
+#if defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 4))
+#define smp_mb() __sync_synchronize()
+#else
+#define smp_mb() asm volatile("lock; addl $0,0(%%esp) " ::: "memory")
+#endif
+
+#elif defined(__x86_64__)
+
+#define smp_wmb()   barrier()
+#define smp_mb() asm volatile("mfence" ::: "memory")
 
 #elif defined(_ARCH_PPC)
 
 /*
- * We use an eieio() for a wmb() on powerpc.  This assumes we don't
+ * We use an eieio() for wmb() and a sync for mb() on powerpc.  This assumes we don't
  * need to order cacheable and non-cacheable stores with respect to
  * each other
  */
 #define smp_wmb()   asm volatile("eieio" ::: "memory")
+#define smp_mb()   asm volatile("sync" ::: "memory")
 
 #else
 
@@ -29,9 +45,10 @@
  * For (host) platforms we don't have explicit barrier definitions
  * for, we use the gcc __sync_synchronize() primitive to generate a
  * full barrier.  This should be safe on all platforms, though it may
- * be overkill.
+ * be overkill for wmb().
  */
 #define smp_wmb()   __sync_synchronize()
+#define smp_mb()   __sync_synchronize()
 
 #endif
 
-- 
1.7.9.111.gf3fb0
