On Wed, May 06, 2020 at 06:08:34PM -0400, Raphael Norwitz wrote: > As you correctly point out, this code needs to be looked at more > carefully so that > if the device does disconnect in the background we can handle the migration > path > gracefully. In particular, we need to decide whether a migration > should be allowed > to continue if a device disconnects durning the migration stage. > > mst, any thoughts?
Why not? It can't change state while disconnected, so it just makes things easier. > Have you looked at the suggestion I gave Li Feng to move vhost_dev_cleanup() > into the connection path in vhost-user-blk? I’m not sure if he’s > actively working on it, > but I would prefer if we can find a way to keep some state around > between reconnects > so we aren’t constantly checking dev->started. A device can be stopped > for reasons > other than backend disconnect so I’d rather not reuse this field to > check for backend > disconnect failures. > > On Thu, Apr 30, 2020 at 9:57 AM Dima Stepanov <dimas...@yandex-team.ru> wrote: > > > > If vhost-user daemon is used as a backend for the vhost device, then we > > should consider a possibility of disconnect at any moment. If such > > disconnect happened in the vhost_migration_log() routine the vhost > > device structure will be clean up. > > At the start of the vhost_migration_log() function there is a check: > > if (!dev->started) { > > dev->log_enabled = enable; > > return 0; > > } > > To be consistent with this check add the same check after calling the > > vhost_dev_set_log() routine. This in general help not to break a > > Could you point to the specific asserts which are being triggered? > > > migration due the assert() message. But it looks like that this code > > should be revised to handle these errors more carefully. > > > > In case of vhost-user device backend the fail paths should consider the > > state of the device. In this case we should skip some function calls > > during rollback on the error paths, so not to get the NULL dereference > > errors. > > > > Signed-off-by: Dima Stepanov <dimas...@yandex-team.ru> > > --- > > hw/virtio/vhost.c | 39 +++++++++++++++++++++++++++++++++++---- > > 1 file changed, 35 insertions(+), 4 deletions(-) > > > > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c > > index 3ee50c4..d5ab96d 100644 > > --- a/hw/virtio/vhost.c > > +++ b/hw/virtio/vhost.c > > @@ -787,6 +787,17 @@ static int vhost_dev_set_features(struct vhost_dev > > *dev, > > static int vhost_dev_set_log(struct vhost_dev *dev, bool enable_log) > > { > > int r, i, idx; > > A couple points here > > > (1) This will fail the live migration if the device is disconnected. > That my be the right thing > to do, but if there are cases where migrations can proceed with > a disconnected device, > this may not be desirable. > > (2) This looks racy. As far as I can tell vhost_dev_set_log() is only > called by vhost_migration_log(), > and as you say one of the first things vhost_migration_log does > is return if dev->started is not > set. What’s to stop a disconnect from clearing the vdev right > after this check, just before > vhost_dev_set_features() is called? > > As stated above, I would prefer it if we could add some state which > would persist between > reconnects which could then be checked in the vhost-user code before > interacting with > the backend. I understand this will be a much more involved change and > will require a lot > of thought. > > Also, regarding (1) above, if the original check in > vhost_migration_log() returns success if the > device is not started why return an error here? I imagine this could > lead to some inconsistent > behavior if the device disconnects before the first check verses > before the second. > > > + > > + if (!dev->started) { > > + /* > > + * If vhost-user daemon is used as a backend for the > > + * device and the connection is broken, then the vhost_dev > > + * structure will be reset all its values to 0. > > + * Add additional check for the device state. > > + */ > > + return -1; > > + } > > + > > r = vhost_dev_set_features(dev, enable_log); > > if (r < 0) { > > goto err_features; > > @@ -801,12 +812,19 @@ static int vhost_dev_set_log(struct vhost_dev *dev, > > bool enable_log) > > }