On Tue, 24 Apr 2012, Paolo Bonzini wrote:

> On 24/04/2012 16:20, Michael S. Tsirkin wrote:
> > On Tue, Apr 24, 2012 at 03:46:25PM +0200, Paolo Bonzini wrote:
> >> On 23/04/2012 15:19, Michael S. Tsirkin wrote:
> >>> During normal operation, virtio first writes a used index
> >>> and then checks whether it should interrupt the guest
> >>> by reading guest avail index/flag values.
> >>>
> >>> Guest does the reverse: writes the index/flag,
> >>> then checks the used ring.
> >>>
> >>> The ordering is important: if the host avail flag read bypasses the used
> >>> index write, we could in effect get this timing:
> >>>
> >>>     host avail flag read
> >>>     guest enable interrupts: avail flag write
> >>>     guest check used ring: ring is empty
> >>>     host used index write
> >>>
> >>> which results in a lost interrupt: guest will never be notified
> >>> about the used ring update.
> >>>
> >>> This can actually happen when using kvm with an io thread,
> >>> such that the guest vcpu and qemu run on different host cpus,
> >>> and it has actually been observed in the field
> >>> (but only seems to trigger on very specific processor types)
> >>> with userspace virtio: vhost has the necessary smp_mb()
> >>> in place to prevent the reordering, so the same workload stalls
> >>> forever waiting for an interrupt with vhost=off but works
> >>> fine with vhost=on.
> >>>
> >>> Insert an smp_mb barrier operation in userspace virtio to
> >>> ensure the correct ordering.
> >>> Applying this patch fixed the race condition we have observed.
> >>> Tested on x86_64. I checked the code generated by the new macro
> >>> for i386 and ppc but didn't run virtio.
> >>>
> >>> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> >>> ---
> >>>  hw/virtio.c    |  2 ++
> >>>  qemu-barrier.h | 23 ++++++++++++++++++++---
> >>>  2 files changed, 22 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/hw/virtio.c b/hw/virtio.c
> >>> index f805790..6449746 100644
> >>> --- a/hw/virtio.c
> >>> +++ b/hw/virtio.c
> >>> @@ -693,6 +693,8 @@ static bool vring_notify(VirtIODevice *vdev, VirtQueue *vq)
> >>>  {
> >>>      uint16_t old, new;
> >>>      bool v;
> >>> +    /* We need to expose used array entries before checking used event. */
> >>> +    mb();
> >>
> >> mb() vs. smp_mb()?
> >
> > rhel used wmb() everywhere so this keeps it consistent.
> > upstream switched to smp_wmb so I added smp_mb there.
> >
> >>>      /* Always notify when queue is empty (when feature acknowledge) */
> >>>      if (((vdev->guest_features & (1 << VIRTIO_F_NOTIFY_ON_EMPTY)) &&
> >>>           !vq->inuse && vring_avail_idx(vq) == vq->last_avail_idx)) {
> >>>
> >>> diff --git a/qemu-barrier.h b/qemu-barrier.h
> >>> index c11bb2b..f6722a8 100644
> >>> --- a/qemu-barrier.h
> >>> +++ b/qemu-barrier.h
> >>> @@ -4,7 +4,7 @@
> >>>  /* Compiler barrier */
> >>>  #define barrier() asm volatile("" ::: "memory")
> >>>
> >>> -#if defined(__i386__) || defined(__x86_64__)
> >>> +#if defined(__i386__)
> >>>
> >>>  /*
> >>>   * Because of the strongly ordered x86 storage model, wmb() is a nop
> >>> @@ -13,15 +13,31 @@
> >>>   * load/stores from C code.
> >>>   */
> >>>  #define smp_wmb() barrier()
> >>> +/*
> >>> + * We use GCC builtin if it's available, as that can use
> >>> + * mfence on 32 bit as well, e.g. if built with -march=pentium-m.
> >>> + * However, on i386, there seem to be known bugs as recently as 4.3.
> >>> + * */
> >>
> >> Do you know what those bugs are? Either add a pointer, or there is no
> >> reason to have cruft that is only backed by hearsay.
> >
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36793
> > I'll add this link in the commit log.
>
> Ok, thanks. I modified my "atomics" branch to add it too.
> >>> +#if defined(_GNUC__) && __GNUC__ >= 4 && __GNUC_MINOR__ >= 4
__GNUC__ perhaps?

> >>> +#define smp_mb() __sync_synchronize()
> >>> +#else
> >>> +#define smp_mb() asm volatile("lock; addl $0,0(%%esp) " ::: "memory")
> >>> +#endif
> >>> +
> >>> +#elif defined(__x86_64__)
> >>> +
> >>> +#define smp_wmb() barrier()
> >>> +#define smp_mb() asm volatile("mfence" ::: "memory")
> >>>
> >>>  #elif defined(_ARCH_PPC)
> >>>
> >>>  /*
> >>> - * We use an eieio() for a wmb() on powerpc. This assumes we don't
> >>> + * We use an eieio() for wmb() and mb() on powerpc. This assumes we don't
> >>>   * need to order cacheable and non-cacheable stores with respect to
> >>>   * each other
> >>>   */
> >>>  #define smp_wmb() asm volatile("eieio" ::: "memory")
> >>> +#define smp_mb() asm volatile("eieio" ::: "memory")
> >>
> >> smp_mb() is hwsync under PPC,
> >
> > This one?
> > __asm__ __volatile__ ("sync" : : : "memory")
> >
> >> but I would just trust GCC.
> >>
> >> Paolo
> >
> > __sync_synchronize? Unfortunately it's still pretty new.
>
> So is KVM on PPC. :)
>
> Paolo

-- 
mailto:av1...@comtv.ru