During normal operation, the virtio host first writes the used ring index
and then checks whether it should interrupt the guest by reading the
guest's avail flag/used event index values.  The guest does the reverse:
it writes the index/flag, then checks the used ring.
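
(Editorial illustration, not part of this patch: a minimal C sketch of the
two ordering rules above.  The names used_idx, avail_flags,
host_publish_and_check() and guest_enable_and_recheck() are made up, and
flag handling is reduced to a single suppression bit.)

/*
 * Sketch only: models "write your side, full barrier, then read the
 * peer's side" for both host and guest.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define VRING_AVAIL_F_NO_INTERRUPT 1

static volatile uint16_t used_idx;    /* written by host, read by guest */
static volatile uint16_t avail_flags; /* written by guest, read by host */

#define smp_mb() __sync_synchronize() /* full memory barrier */

/* Host side: publish a used entry, then decide whether to notify. */
static bool host_publish_and_check(void)
{
    used_idx++;  /* expose the used ring update ...                       */
    smp_mb();    /* ... and only then look at the guest's suppression flag */
    return !(avail_flags & VRING_AVAIL_F_NO_INTERRUPT);
}

/* Guest side: re-enable notifications, then re-check the used ring. */
static bool guest_enable_and_recheck(uint16_t last_seen)
{
    avail_flags &= ~VRING_AVAIL_F_NO_INTERRUPT; /* ask for interrupts ... */
    smp_mb();    /* ... and only then re-read the used index               */
    return used_idx != last_seen;  /* work we would otherwise miss         */
}

int main(void)
{
    /* Single-threaded smoke test; the interesting case is two CPUs. */
    bool notify = host_publish_and_check();
    bool pending = guest_enable_and_recheck(0);
    printf("notify=%d pending=%d\n", notify, pending);
    return 0;
}

The point of the full barrier is that a store (the index/flag write) must be
ordered before a later load (the peer's flag/index read); on x86 that needs
mfence or a locked instruction, so a compiler barrier or smp_wmb() alone is
not enough.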

The ordering is important: if the host's avail flag read bypasses the used
index write, we could in effect get this timing:

    host: avail flag/used event index read
                    guest: enable interrupts: this performs the
                           avail flag/used event index write
                    guest: check used ring: ring is empty
    host: used index write

This timing results in a lost interrupt: the guest will never be notified
about the used ring update.

This has actually been observed in the field with qemu-kvm, when the guest
vcpu and the qemu io thread run on different host cpus.  It only seems to
trigger on some specific hardware, and only with userspace virtio: vhost
has the necessary smp_mb() in place to prevent the reordering, so the same
workload stalls forever waiting for an interrupt with vhost=off but works
fine with vhost=on.

Insert an smp_mb() barrier operation in userspace virtio to ensure the
correct ordering.  Applying this patch fixed the race condition we have
observed.

Tested on x86_64.  I checked the code generated by the new macro for i386
and ppc but didn't run virtio.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---
 hw/virtio.c    |    2 ++
 qemu-barrier.h |   23 ++++++++++++++++++++---
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/hw/virtio.c b/hw/virtio.c
index f805790..6449746 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -693,6 +693,8 @@ static bool vring_notify(VirtIODevice *vdev, VirtQueue *vq)
 {
     uint16_t old, new;
     bool v;
+    /* We need to expose used array entries before checking used event. */
+    smp_mb();
     /* Always notify when queue is empty (when feature acknowledge) */
     if (((vdev->guest_features & (1 << VIRTIO_F_NOTIFY_ON_EMPTY)) &&
          !vq->inuse && vring_avail_idx(vq) == vq->last_avail_idx)) {
diff --git a/qemu-barrier.h b/qemu-barrier.h
index c11bb2b..f6722a8 100644
--- a/qemu-barrier.h
+++ b/qemu-barrier.h
@@ -4,7 +4,7 @@
 /* Compiler barrier */
 #define barrier()   asm volatile("" ::: "memory")
 
-#if defined(__i386__) || defined(__x86_64__)
+#if defined(__i386__)
 
 /*
  * Because of the strongly ordered x86 storage model, wmb() is a nop
@@ -13,15 +13,31 @@
  * load/stores from C code.
  */
 #define smp_wmb()   barrier()
+/*
+ * We use GCC builtin if it's available, as that can use
+ * mfence on 32 bit as well, e.g. if built with -march=pentium-m.
+ * However, on i386, there seem to be known bugs as recently as 4.3.
+ */
+#if defined(__GNUC__) && __GNUC__ >= 4 && __GNUC_MINOR__ >= 4
+#define smp_mb() __sync_synchronize()
+#else
+#define smp_mb() asm volatile("lock; addl $0,0(%%esp) " ::: "memory")
+#endif
+
+#elif defined(__x86_64__)
+
+#define smp_wmb()   barrier()
+#define smp_mb() asm volatile("mfence" ::: "memory")
 
 #elif defined(_ARCH_PPC)
 
 /*
- * We use an eieio() for a wmb() on powerpc.  This assumes we don't
+ * We use an eieio() for wmb() and mb() on powerpc.  This assumes we don't
  * need to order cacheable and non-cacheable stores with respect to
  * each other
  */
 #define smp_wmb()   asm volatile("eieio" ::: "memory")
+#define smp_mb()   asm volatile("eieio" ::: "memory")
 
 #else
 
@@ -29,9 +45,10 @@
  * For (host) platforms we don't have explicit barrier definitions
  * for, we use the gcc __sync_synchronize() primitive to generate a
  * full barrier.  This should be safe on all platforms, though it may
- * be overkill.
+ * be overkill for wmb().
  */
 #define smp_wmb()   __sync_synchronize()
+#define smp_mb()   __sync_synchronize()
 
 #endif
 
-- 
1.7.9.111.gf3fb0
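
(Editorial illustration, not part of the patch: a standalone sketch that
mirrors the barrier selection added above, which may help when inspecting
the generated code with e.g. "gcc -O2 -S probe.c" on different hosts.  The
file name probe.c and the probe_barriers() helper are made up.)

/* Standalone copy of the smp_wmb()/smp_mb() selection for inspection. */
#define barrier()   asm volatile("" ::: "memory")

#if defined(__i386__)
#define smp_wmb()   barrier()
#if defined(__GNUC__) && __GNUC__ >= 4 && __GNUC_MINOR__ >= 4
#define smp_mb()    __sync_synchronize()
#else
#define smp_mb()    asm volatile("lock; addl $0,0(%%esp) " ::: "memory")
#endif
#elif defined(__x86_64__)
#define smp_wmb()   barrier()
#define smp_mb()    asm volatile("mfence" ::: "memory")
#elif defined(_ARCH_PPC)
#define smp_wmb()   asm volatile("eieio" ::: "memory")
#define smp_mb()    asm volatile("eieio" ::: "memory")
#else
#define smp_wmb()   __sync_synchronize()
#define smp_mb()    __sync_synchronize()
#endif

int flag;

int probe_barriers(void)
{
    flag = 1;   /* store that must become visible before the load below */
    smp_mb();   /* expect mfence / lock addl / eieio in the assembly     */
    smp_wmb();  /* write barrier: compiler-only on x86                   */
    return flag;
}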