(re-adding cc:s) On Tue, 22 Mar 2016 15:18:05 +0800 Fam Zheng <f...@redhat.com> wrote:
> On Tue, 03/22 15:10, tu bo wrote: > > Hi Fam: > > > > On 03/21/2016 06:57 PM, Fam Zheng wrote: > > >On Thu, 03/17 19:03, tu bo wrote: > > >> > > >>On 03/17/2016 08:39 AM, Fam Zheng wrote: > > >>>On Wed, 03/16 14:45, Paolo Bonzini wrote: > > >>>> > > >>>> > > >>>>On 16/03/2016 14:38, Christian Borntraeger wrote: > > >>>>>>If you just remove the calls to virtio_queue_host_notifier_read, here > > >>>>>>and in virtio_queue_aio_set_host_notifier_fd_handler, does it work > > >>>>>>(keeping patches 2-4 in)? > > >>>>> > > >>>>>With these changes and patch 2-4 it does no longer locks up. > > >>>>>I keep it running some hour to check if a crash happens. > > >>>>> > > >>>>>Tu Bo, your setup is currently better suited for reproducing. Can you > > >>>>>also check? > > >>>> > > >>>>Great, I'll prepare a patch to virtio then sketching the solution that > > >>>>Conny agreed with. > > >>>> > > >>>>While Fam and I agreed that patch 1 is not required, I'm not sure if the > > >>>>mutex is necessary in the end. > > >>> > > >>>If we can fix this from the virtio_queue_host_notifier_read side, the > > >>>mutex/BH > > >>>are not necessary; but OTOH the mutex does catch such bugs, so maybe > > >>>it's good > > >>>to have it. I'm not sure about the BH. > > >>> > > >>>And on a hindsight I realize we don't want patches 2-3 too. Actually the > > >>>begin/end pair won't work as expected because of the blk_set_aio_context. > > >>> > > >>>Let's hold on this series. > > >>> > > >>>> > > >>>>So if Tu Bo can check without the virtio_queue_host_notifier_read calls, > > >>>>and both with/without Fam's patches, it would be great. > > >>> > > >>>Tu Bo, only with/withoug patch 4, if you want to check. Sorry for the > > >>>noise. > > >>> > > >>1. without the virtio_queue_host_notifier_read calls, without patch 4 > > >> > > >>crash happens very often, > > >> > > >>(gdb) bt > > >>#0 bdrv_co_do_rw (opaque=0x0) at block/io.c:2172 > > >>#1 0x000002aa165da37e in coroutine_trampoline (i0=<optimized out>, > > >>i1=1812051552) at util/coroutine-ucontext.c:79 > > >>#2 0x000003ff7dd5150a in __makecontext_ret () from /lib64/libc.so.6 > > >> > > >> > > >>2. without the virtio_queue_host_notifier_read calls, with patch 4 > > >> > > >>crash happens very often, > > >> > > >>(gdb) bt > > >>#0 bdrv_co_do_rw (opaque=0x0) at block/io.c:2172 > > >>#1 0x000002aa39dda43e in coroutine_trampoline (i0=<optimized out>, > > >>i1=-1677715600) at util/coroutine-ucontext.c:79 > > >>#2 0x000003ffab6d150a in __makecontext_ret () from /lib64/libc.so.6 > > >> > > >> > > > > > >Tu Bo, > > > > > >Could you help test this patch (on top of master, without patch 4)? > > > > > >diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c > > >index 08275a9..47f8043 100644 > > >--- a/hw/virtio/virtio.c > > >+++ b/hw/virtio/virtio.c > > >@@ -1098,7 +1098,14 @@ void virtio_queue_notify_vq(VirtQueue *vq) > > > > > > void virtio_queue_notify(VirtIODevice *vdev, int n) > > > { > > >- virtio_queue_notify_vq(&vdev->vq[n]); > > >+ VirtQueue *vq = &vdev->vq[n]; > > >+ EventNotifier *n; > > >+ n = virtio_queue_get_host_notifier(vq); > > >+ if (n) { > > >+ event_notifier_set(n); > > >+ } else { > > >+ virtio_queue_notify_vq(vq); > > >+ } > > > } > > > > > > uint16_t virtio_queue_vector(VirtIODevice *vdev, int n) > > > > > > > > > > I got a build error as below, > > > > /BUILD/qemu-2.5.50/hw/virtio/virtio.c: In function 'virtio_queue_notify': > > /BUILD/qemu-2.5.50/hw/virtio/virtio.c:1102:20: error: 'n' redeclared > > as different kind of symbol > > EventNotifier *n; > > ^ > > /BUILD/qemu-2.5.50/hw/virtio/virtio.c:1099:50: note: previous > > definition of 'n' was here > > void virtio_queue_notify(VirtIODevice *vdev, int n) > > > > > > Then I did some change for your patch as below, > > diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c > > index 08275a9..a10da39 100644 > > --- a/hw/virtio/virtio.c > > +++ b/hw/virtio/virtio.c > > @@ -1098,7 +1098,14 @@ void virtio_queue_notify_vq(VirtQueue *vq) > > > > void virtio_queue_notify(VirtIODevice *vdev, int n) > > { > > - virtio_queue_notify_vq(&vdev->vq[n]); > > + VirtQueue *vq = &vdev->vq[n]; > > + EventNotifier *en; > > + en = virtio_queue_get_host_notifier(vq); > > + if (en) { > > + event_notifier_set(en); > > + } else { > > + virtio_queue_notify_vq(vq); > > + } > > } > > > > uint16_t virtio_queue_vector(VirtIODevice *vdev, int n) > > > > With qemu master + modified patch above(without patch 4, without > > Conny's patches), I did NOT get crash so far. thanks > > Yes, it was a mistake. Thanks for the testing! I'm wondering what we learn from this. Bypassing notify_vq() removes the race, but that's not the solution here. So far, we had the best results with my refactoring + the mutex/bh change. Two problems: - We don't really understand yet why my refactoring helps, but passing the assign value is a good canditate (and it's agreed that this fixes a bug, I think.) - There's some problem with the bh, if I understood Stefan correctly.