On Wed, Apr 13, 2016 at 05:51:15AM -0400, Marc-André Lureau wrote:
> Hi
> 
> ----- Original Message -----
> > Hi Marc,
> > 
> > First of all, sorry again for the late response!
> > 
> > Last time I tried with your first version, I found a few issues
> > related to reconnect, mainly the acked_features being lost. While
> > checking your new code, I found that you've already solved that,
> > which is great.
> > 
> > So, I tried harder this time; your patches work great, except that
> > I found a few nits.
> > 
> > On Fri, Apr 01, 2016 at 01:16:21PM +0200, marcandre.lur...@redhat.com wrote:
> > > From: Marc-André Lureau <marcandre.lur...@redhat.com>
> > ...
> > > +Slave message types
> > > +-------------------
> > > +
> > > + * VHOST_USER_SLAVE_SHUTDOWN:
> > > +      Id: 1
> > > +      Master payload: N/A
> > > +      Slave payload: u64
> > > +
> > > +      Request the master to shutdown the slave. A 0 reply is for
> > > +      success, in which case the slave may close all connections
> > > +      immediately and quit.
> > 
> > Assume we are using OVS + DPDK here, so that we could have two
> > vhost-user connections. When OVS initiates a restart, it might
> > unregister the two connections one by one. In that case, two
> > VHOST_USER_SLAVE_SHUTDOWN requests will be sent, and two replies
> > will be received. Therefore, I don't think it's proper to ask the
> > backend implementation to quit here.
> 
> On a success reply, the master has sent all the commands to finish
> the connection. So the slave must flush/finish all pending requests
> first.

Yes, that's okay. Here I just mean the "close __all__ connections" and
"quit" part. First, the slave should do the cleanup/flush/finish on
its own connection, but not on all of them, right?

Second, as stated, quitting might not make sense, as we may have more
connections.
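To make that concrete, below is a rough sketch of how I'd expect a
backend with multiple connections to react (just a sketch; the
vhost_connection struct and the helpers are made-up names for
illustration, not from DPDK or any real library):

    #include <stdint.h>
    #include <stdio.h>

    /* Made-up per-connection state, for illustration only. */
    struct vhost_connection {
        int id;
        int active;
    };

    /* Flush pending requests and close this one connection. */
    void vhost_connection_teardown(struct vhost_connection *conn)
    {
        printf("connection %d: flushed and closed\n", conn->id);
        conn->active = 0;
    }

    /* Invoked when VHOST_USER_SLAVE_SHUTDOWN arrives on one connection. */
    uint64_t handle_slave_shutdown(struct vhost_connection *conn)
    {
        vhost_connection_teardown(conn);
        /*
         * Do NOT exit() here: the other vhost-user connection may
         * still be in use, and a second SHUTDOWN will follow for it.
         */
        return 0; /* 0 means success, per the proposed spec */
    }

That is, a success reply should only commit the slave to tearing down
the connection the request arrived on; whether the process quits
afterwards should be left to the backend.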
> I think this should be enough, otherwise we may need a new explicit
> message?
> 
> > >      switch (msg.request) {
> > > +    case VHOST_USER_SLAVE_SHUTDOWN: {
> > > +        uint64_t success = 1; /* 0 is for success */
> > > +        if (dev->stop) {
> > > +            dev->stop(dev);
> > > +            success = 0;
> > > +        }
> > > +        msg.payload.u64 = success;
> > > +        msg.size = sizeof(msg.payload.u64);
> > > +        size = send(u->slave_fd, &msg, VHOST_USER_HDR_SIZE + msg.size,
> > > +                    0);
> > > +        if (size != VHOST_USER_HDR_SIZE + msg.size) {
> > > +            error_report("Failed to write reply.");
> > > +        }
> > > +        break;
> > 
> > You might want to remove the slave_fd from the watch list? We
> > might also need to close slave_fd here, assuming that we will no
> > longer use it once VHOST_USER_SLAVE_SHUTDOWN is received?
> 
> Makes sense, I will change that in the next update.
> 
> > I'm asking because I found a seg fault issue sometimes, due to
> > opaque being NULL.

Oh, I was wrong: it's u being NULL, not opaque.

> I would be interested to see the backtrace or have a reproducer.

It's just the normal test steps: start a vhost-user switch (I'm using
the DPDK vhost-switch example), kill it, and wait for a while (more
than 10s, or even longer); then I saw a seg fault:

    (gdb) p dev
    $4 = (struct vhost_dev *) 0x555556571bf0
    (gdb) p u
    $5 = (struct vhost_user *) 0x0
    (gdb) where
    #0  0x0000555555798612 in slave_read (opaque=0x555556571bf0)
        at /home/yliu/qemu/hw/virtio/vhost-user.c:539
    #1  0x0000555555a343a4 in aio_dispatch (ctx=0x55555655f560)
        at /home/yliu/qemu/aio-posix.c:327
    #2  0x0000555555a2738b in aio_ctx_dispatch (source=0x55555655f560,
        callback=0x0, user_data=0x0) at /home/yliu/qemu/async.c:233
    #3  0x00007ffff51032a6 in g_main_context_dispatch ()
        from /lib64/libglib-2.0.so.0
    #4  0x0000555555a3239e in glib_pollfds_poll ()
        at /home/yliu/qemu/main-loop.c:213
    #5  0x0000555555a3247b in os_host_main_loop_wait (timeout=29875848)
        at /home/yliu/qemu/main-loop.c:258
    #6  0x0000555555a3252b in main_loop_wait (nonblocking=0)
        at /home/yliu/qemu/main-loop.c:506
    #7  0x0000555555846e35 in main_loop () at /home/yliu/qemu/vl.c:1934
    #8  0x000055555584e6bf in main (argc=31, argv=0x7fffffffe078,
        envp=0x7fffffffe178) at /home/yliu/qemu/vl.c:4658

	--yliu