Re: [PATCH v2] migration: Handle block device inactivation failures better

Eric Blake Tue, 02 May 2023 13:18:01 -0700

On Tue, May 02, 2023 at 11:54:12AM +0200, Kevin Wolf wrote:
> Hi Eric,
> 
> you asked me for a review downstream, but since you would have to bring
> back any problem to upstream anyway, let's discuss it here. For the
> start, let me state that (a) I don't fully understand why this patch
> fixes things and (b) I hate this function. More below.


Sadly, I also fall into (a) I don't know if this patch fully fixes
things, or if we're playing whack-a-mole and another bug is still
lurking, and (b) the control flow is indeed horrendous, where I'm also
not sure if I'm fully understanding how migration is supposed to be
working.

> 
> Am 14.04.2023 um 17:33 hat Eric Blake geschrieben:
> > Consider what happens when performing a migration between two host
> > machines connected to an NFS server serving multiple block devices to
> > the guest, when the NFS server becomes unavailable.  The migration
> > attempts to inactivate all block devices on the source (a necessary
> > step before the destination can take over); but if the NFS server is
> > non-responsive, the attempt to inactivate can itself fail.  When that
> > happens, the destination fails to get the migrated guest (good,
> > because the source wasn't able to flush everything properly):
> > 
> >   (qemu) qemu-kvm: load of migration failed: Input/output error
> > 
> > at which point, our only hope for the guest is for the source to take
> > back control.  With the current code base, the host outputs a message, but 
> > then appears to resume:
> > 
> >   (qemu) qemu-kvm: qemu_savevm_state_complete_precopy_non_iterable: 
> > bdrv_inactivate_all() failed (-1)
> > 
> >   (src qemu)info status
> >    VM status: running
> > 
> > but a second migration attempt now asserts:
> > 
> >   (src qemu) qemu-kvm: ../block.c:6738: int 
> > bdrv_inactivate_recurse(BlockDriverState *): Assertion `!(bs->open_flags & 
> > BDRV_O_INACTIVE)' failed.
> > 
> > Whether the guest is recoverable on the source after the first failure
> > is debatable, but what we do not want is to have qemu itself fail due
> > to an assertion.  It looks like the problem is as follows:
> > 
> > In migration.c:migration_completion(), the source sets 'inactivate' to
> > true (since COLO is not enabled), then tries
> > savevm.c:qemu_savevm_state_complete_precopy() with a request to
> > inactivate block devices.  In turn, this calls
> > block.c:bdrv_inactivate_all(), which fails when flushing runs up
> > against the non-responsive NFS server.  With savevm failing, we are
> > now left in a state where some, but not all, of the block devices have
> > been inactivated; but migration_completion() then jumps to 'fail'
> > rather than 'fail_invalidate' and skips an attempt to reclaim those
> > those disks by calling bdrv_activate_all().  Even if we do attempt to
> > reclaim disks, we aren't taking note of failure there, either.
> 
> Why do we even jump to 'fail'? In other words, should 'fail_inactivate'
> really be called 'fail' and everything should jump there?
> 
> Greg added the 'fail_inactivate' label in fe904ea8242, but the commit
> message doesn't seem to tell why he left one goto. I see no reason why
> we wouldn't want to reactivate in this case, too. Maybe it's just for
> the colo case?

At the time of fe904ea8, all actions done by fail_invalidate: were
guarded by a mere
 if (s->state == MIGRATION_STATUS_ACTIVE)
Later, in 6039dd5b1c the guard was expanded to
 if (s->state == MIGRATION_STATUS_ACTIVE || s->state == MIGRATION_STATUS_DEVICE)

But reading the rest of the function (at either those prior commit
points, or at the present), I'm inclined to agree that all remaining
'goto fail' either happened at a point where s->state cannot match in
the first place (so consolidating labels is harmless), or is the ONE
place after we already checked s->state == MIGRATION_STATUS_ACTIVE or
just changed s->state to MIGRATION_STATUS_DEVICE, and our failure
could be from one of vm_stop_force_state(RUN_STATE_FINISH_MIGRATE)
(nothing to reactivate, as we haven't inactivated yet),
migration_maybe_pause() (where we change state, but still haven't
inactivated), or from qemu_savevm_state_complete_precopy() (where
'goto fail' does skip out on the reactivation, so consolidating the
labels would make sense to me, although at the time I wrote the patch,
I was too afraid to change the code that drastically).

Are we sure that bdrv_activate_all() is safe to call no matter what?
We already know that a request to inactivate when something is already
inactive asserts, but a request to activate something that is already
active is generally a no-op.  If so, I'm tending to agree with you
that having a single label where we ALWAYS attempt reactivation after
failure makes sense.

I'll post another patch along those lines.

> 
> > Thus, we have reached a state where the migration engine has forgotten
> > all state about whether a block device is inactive, because we did not
> > set s->block_inactive in enough places; so migration allows the source
> > to reach vm_start() and resume execution, violating the block layer
> > invariant that the guest CPUs should not be restarted while a device
> > is inactive.  Note that the code in migration.c:migrate_fd_cancel()
> > will also try to reactivate all block devices if s->block_inactive was
> > set, but because we failed to set that flag after the first failure,
> > the source assumes it has reclaimed all devices, even though it still
> > has remaining inactivated devices and does not try again.  Normally,
> > qmp_cont() will also try to reactivate all disks (or correctly fail if
> > the disks are not reclaimable because NFS is not yet back up), but the
> > auto-resumption of the source after a migration failure does not go
> > through qmp_cont().  And because we have left the block layer in an
> > inconsistent state with devices still inactivated, the later migration
> > attempt is hitting the assertion failure.
> > 
> > Since it is important to not resume the source with inactive disks,
> > this patch marks s->block_inactive before attempting inactivation,
> > rather than after succeeding, in order to prevent any vm_start() until
> > it has successfully reactivated all devices.
> 
> Here's the part that I don't understand: Even if you set
> s->block_inactive, where do we actually use this value and reactivate
> the image?
> 
> The only reader of the field is migrate_fd_cancel(), which is only
> called by migration_cancel() (a very small wrapper, it's a mystery why
> this exists when it's the only caller). migration_cancel() in turn is
> called in very few places:
> 
> * qmp_migrate_cancel: In our case, migration fails by itself, it's not
>   cancelled from QMP. So this is not where we're coming from.
> 
> * ram_mig_ram_block_resized: This one is an internal error during
>   migration, but what we're seeing is not related to RAM at all. So this
>   isn't where we're coming from either.
> 
> * migration_shutdown: Only called while shutting down QEMU. Doesn't look
>   like our case either.
> 
> So while this patch fixes some state inconsistencies, how is it fixing
> anything for the reported bug when this state is never used in the
> relevant places?
> 
> (That I don't understand the fix is what blocks my downstream review.
> The rest of my points are really only for upstream anyway.)

I may get to respin the downstream patch based on my upcoming upstream
patch...

> 
> > See also https://bugzilla.redhat.com/show_bug.cgi?id=2058982
> > 
> > Signed-off-by: Eric Blake <ebl...@redhat.com>
> > 
> > ---
> > 
> > v2: Set s->block_inactive sooner [Juan]
> > ---
> >  migration/migration.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> > 
> > diff --git a/migration/migration.c b/migration/migration.c
> > index bda47891933..cb0d42c0610 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -3444,13 +3444,11 @@ static void migration_completion(MigrationState *s)
> >                                              MIGRATION_STATUS_DEVICE);
> >              }
> >              if (ret >= 0) {
> > +                s->block_inactive = inactivate;
> >                  qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
> >                  ret = qemu_savevm_state_complete_precopy(s->to_dst_file, 
> > false,
> >                                                           inactivate);
> >              }
> > -            if (inactivate && ret >= 0) {
> > -                s->block_inactive = true;
> > -            }
> 
> This part of the code has now really become unintuitive. After commit
> f07fa4cbf0b we had perfectly intuitive code:
> 
>     ret = bdrv_inactivate_all();
>     if (ret >= 0) {
>         s->block_inactive = true;
>     }
> 
> Since then, the bdrv_inactivate_all() call has been moved to
> qemu_savevm_state_complete_precopy_non_iterable(), and now you changed
> the order because even on failure, we could end up with some inactivated
> nodes. I'm not arguing that either was a bad change, but the assignment
> to s->block_inactive looks really random now.
> 
> I think this desperately needs a comment.

Sure, I can add it.

> 
> >          }
> >          qemu_mutex_unlock_iothread();
> > 
> > @@ -3522,6 +3520,7 @@ fail_invalidate:
> >          bdrv_activate_all(&local_err);
> >          if (local_err) {
> >              error_report_err(local_err);
> > +            s->block_inactive = true;
> 
> bdrv_activate_all() never inactivates a node that was active before. So
> it seems that this line only ever comes into play if s->block_inactive
> was incorrect before.
> 
> I feel what we should do here is only try to activate if
> s->block_inactive was set above, and then have a single 'fail' label
> that always runs the re-activation code.
> 
> >          } else {
> >              s->block_inactive = false;
> >          }
> 
> Kevin
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [PATCH v2] migration: Handle block device inactivation failures better

Reply via email to