I found the real culprit: the bug began to occur with the following commit:
https://github.com/qemu/qemu/commit/e253f4b89796967d03a455d1df2ae6bda8cc7d01

Since that commit, the mirror target bs uses a BlockBackend, so the mirror target bs gets inactivated in bdrv_inactivate_all(). That is why the mirror target bs was not inactivated in qemu-2.6.0 but is inactivated in qemu-2.7.0-rc0.
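The failing sequence itself can be driven over QMP. Below is a minimal repro sketch: drive-mirror and migrate are the real QMP commands involved, while the socket path, drive id, target filename and destination URI are made up for illustration, and the reply handling is simplified (a real client must also cope with asynchronous events):

    #!/usr/bin/env python3
    # Sketch: start a drive-mirror job, then migrate while the job is
    # still running. After the commit above, the mirror target sits
    # behind a BlockBackend, so bdrv_inactivate_all() inactivates it at
    # migration completion while the job can still write to it.
    import json
    import socket

    sock = socket.socket(socket.AF_UNIX)
    sock.connect("/tmp/qmp-src.sock")   # hypothetical source QMP socket
    reader = sock.makefile()

    def qmp(cmd, **args):
        # Simplification: assumes the next line is the reply, not an event.
        sock.sendall(json.dumps({"execute": cmd, "arguments": args}).encode())
        return json.loads(reader.readline())

    json.loads(reader.readline())       # consume the QMP greeting
    qmp("qmp_capabilities")             # enter command mode

    # Start mirroring the guest disk to a new image (names are examples).
    qmp("drive-mirror", device="drive0", target="/tmp/mirror.qcow2",
        sync="full", format="qcow2")

    # Migrate without completing/cancelling the job first: this is the
    # b/c race described below, which crashes the source.
    qmp("migrate", uri="tcp:dest-host:4444")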
-----Original Message-----
From: Dr. David Alan Gilbert [mailto:dgilb...@redhat.com]
Sent: Monday, October 09, 2017 7:56 PM
To: Kevin Wolf <kw...@redhat.com>
Cc: wangjie (P) <wangji...@huawei.com>; qemu-devel@nongnu.org; mre...@redhat.com; quint...@redhat.com; pbonz...@redhat.com; fuweiwei (C) <fuweiw...@huawei.com>; ebl...@redhat.com; berra...@redhat.com; kcham...@redhat.com <kcham...@redhat.com>; f...@redhat.com
Subject: Re: ping RE: question: I found a qemu crash about migration

* Kevin Wolf (kw...@redhat.com) wrote:
> On 29.09.2017 at 21:06, Dr. David Alan Gilbert wrote:
> > * Kevin Wolf (kw...@redhat.com) wrote:
> > > On 28.09.2017 at 19:01, Dr. David Alan Gilbert wrote:
> > > > Hi,
> > > >   This is a 'fun' bug; I had a good chat to kwolf about it earlier.
> > > > A proper fix really needs to be done together with libvirt so
> > > > that we can sequence:
> > > >   a) The stopping of the CPU on the source
> > > >   b) The termination of the mirroring block job
> > > >   c) The inactivation of the block devices on the source
> > > >      (bdrv_inactivate_all)
> > > >   d) The activation of the block devices on the destination
> > > >      (bdrv_invalidate_cache_all)
> > > >   e) The start of the CPU on the destination
> > > >
> > > > It looks like you're hitting a race between b/c; we've had races
> > > > between c/d in the past and moved the bdrv_inactivate_all.
> > > >
> > > > During the discussion we ended up with two proposed solutions;
> > > > both of them require one extra command and one extra migration
> > > > capability.
> > > >
> > > > The block way
> > > > -------------
> > > >   1) Add a new migration capability pause-at-complete
> > > >   2) Add a new migration state almost-complete
> > > >   3) After saving devices, if pause-at-complete is set,
> > > >      transition to almost-complete
> > > >   4) Add a new command (migration-continue) that causes the
> > > >      migration to inactivate the devices (c) and send the final
> > > >      EOF to the destination.
> > > >
> > > > You set pause-at-complete, wait until migrate hits
> > > > almost-complete; clean up the mirror job, and then do
> > > > migration-continue. When it completes do 'cont' on the destination.
> > >
> > > Especially since you're concerned about the exact position of the
> > > pause, the following would be a little safer (using the
> > > opportunity to suggest slightly different names, too):
> > >
> > > 1) Add a new migration capability pause-at-complete
> > > 2) Add a new migration state ready
> > > 3) After migration has converged (i.e. the background phase of the
> > >    migration has completed) and we are ready to actually switch to
> > >    the destination, stop the VM, transition to 'ready' and emit a
> > >    QMP event
> > > 4) Continue processing QMP commands in the 'ready' phase. This is
> > >    where libvirt is supposed to tear down any running non-VM
> > >    activities like block jobs, NBD servers, etc.
> > > 5) Add a new command (migration-complete) that does the final part
> > >    of migration, including sending the remaining dirty memory,
> > >    device state and the final EOF that signals the destination to
> > >    resume the VM if -S isn't given.
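If this proposal were implemented, the management-side flow might look like the sketch below. To be clear, 'pause-at-complete', the 'ready' state and 'migration-complete' are only names proposed in this thread, not an existing QEMU API; migrate-set-capabilities, query-migrate and block-job-cancel are real commands; the socket path and drive id are made up:

    import json
    import socket
    import time

    sock = socket.socket(socket.AF_UNIX)
    sock.connect("/tmp/qmp-src.sock")   # hypothetical source QMP socket
    reader = sock.makefile()

    def qmp(cmd, **args):
        # Simplification: assumes the next line is the reply, not an event.
        sock.sendall(json.dumps({"execute": cmd, "arguments": args}).encode())
        return json.loads(reader.readline())

    json.loads(reader.readline())       # consume the QMP greeting
    qmp("qmp_capabilities")

    # 1) Opt in to the proposed pause point (proposed capability name).
    qmp("migrate-set-capabilities",
        capabilities=[{"capability": "pause-at-complete", "state": True}])

    # 2) Start the migration as usual.
    qmp("migrate", uri="tcp:dest-host:4444")

    # 3) Wait for the proposed 'ready' state: the VM is stopped but the
    #    source still owns the images.
    while qmp("query-migrate")["return"]["status"] != "ready":
        time.sleep(1)

    # 4) Tear down non-VM users of the images, e.g. the mirror job.
    qmp("block-job-cancel", device="drive0")

    # 5) Proposed command: send the remaining dirty RAM, the device
    #    state and the final EOF, inactivating block devices only now.
    qmp("migration-complete")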
> > What worries me here is whether some other subsection is going to
> > say that this pause must happen after the device state save, e.g. if
> > the device state save causes them to push out the last
> > block/packet/etc - and I've got no real way to say whether that
> > point is any better or worse than the point I was suggesting.
> > And the position in the cycle when the pause happens is part of the API.
>
> The important point here is that the VM is stopped. If device
> emulation is doing anything by itself while the VM is stopped, that
> is a bug. In other words, that last block/packet/etc. must already
> have been pushed out when we stopped the VM, so there is nothing left
> to push out when we save the state.
>
> With correctly written device code, there is no difference whether we
> save the device state first or last; without external influence it
> will be the same.
>
> The obvious external thing that I can see is monitor commands, e.g.
> the monitor allows access to I/O ports, which can change the device
> state. However, monitor commands are completely under the control of
> the user, so they can always choose the order that they need.
>
> In any case, saving the device state only immediately before doing
> the switch makes sure that we consider any changes that have still
> been made.
>
> > > The main reason why I advocate this "block way" is that it looks
> > > a lot less hacky, but more future-proof to me. Instead of adding
> > > a one-off hack for the problem at hand, it introduces a phase
> > > where libvirt can cleanly tear down any activity it still has
> > > running on the source qemu.
> >
> > Yes, if we get that phase right.
> >
> > > We got away without doing this because we never did a clean
> > > handover of resources from the source to the destination. From
> > > the perspective of a management tool, we had these distinct
> > > phases during migration:
> > >
> > > a) 'migrate' command:
> > >    Source VM is still running and in control of all resources
> > >    (like disk images), migrating some state in the background
> > >
> > > b) Migration converges:
> > >    Both VMs are stopped (assuming -S is given on the destination,
> > >    otherwise this phase is skipped), the destination is in
> > >    control of the resources
> > >
> > > c) 'cont' command:
> > >    Destination VM is running and in control of the resources
> > >
> > > There is an asymmetry here because there is no phase where the
> > > VMs are stopped, but the source is in control of the resources
> > > (this is the additional state that my proposal introduces).
> > >
> > > As long as we didn't really manage control of the resources with
> > > locks, we could abuse b) for things where the source still
> > > accessed the resources if we knew that the destination wouldn't
> > > use them yet (it's not an assumption that I'm willing to trust
> > > too much) and it would still kind of work in the common cases.
> > > But it's not really the correct thing to do, it has probably
> > > always caused bugs in corner cases, and the locking mechanisms
> > > are pointing that out now.
> > >
> > > We're now noticing this with block devices because we're using
> > > locking, but I wouldn't be surprised if we found sooner or later
> > > that there are more resources that should really be handed over
> > > in a cleanly designed way and that work only by accident as long
> > > as you don't do that. So I don't think a local hack for mirroring
> > > (or even all block jobs) is sufficient to be future-proof.
> > Without the block-job we were actually pretty safe - up until the
> > EOF we knew the destination wouldn't start, and similarly we
> > wouldn't start using a device on the destination before the source
> > had saved that device.
>
> s/Without the block-job/With only the guest device/
>
> As long as only a guest device is using a resource, we're pretty
> safe, yes. This is because the VM is stopped when we actually
> migrate, so the user of the resource is already shut down without
> management tool interaction and the resource isn't actually in use
> at this point.
>
> Anything that keeps using a resource while the VM is stopped is a
> potential problem. Apart from block jobs, this includes at least the
> built-in NBD server, which also needs to be shut down before we can
> complete migration. And probably other things I'm forgetting right now.
>
> > My worry is where the edge of 'source is in control of the device'
> > is - is that up to the point at which the device save ends? Before
> > that? After that? What? It seems to be pretty arbitrary. Unless we
> > can understand why it's where it is, then I'm not sure if it's a
> > 'clean migration phase model'.
>
> I can tell you where in the phase model it is: The source gives up
> control of the image during 'migration-complete', i.e. at the
> transition between the 'ready' and the 'completed' phase on the source.
>
> Within this transition, it's less clear where it has to be, but
> that's the same with its current place. I expect all of this code to
> stay mostly unchanged, but just called a bit later, so the place
> where we inactivate images today sounds like a good candidate. It's
> important that it's done before sending the EOF packet that triggers
> the destination to try and take control of the images.
>
> > > > The migration way
> > > > -----------------
> > > >   1) Stop doing (d) when the destination is started with -S
> > > >      since it happens anyway when 'cont' is issued
> > > >   2) Add a new migration capability ext-manage-storage
> > > >   3) When 'ext-manage-storage' is set, we don't bother doing (c)
> > > >   4) Add a new command 'block-inactivate' on the source
> > > >
> > > > You set ext-manage-storage, do the migrate and when it's
> > > > finished clean up the block job, block-inactivate on the
> > > > source, and then cont on the destination.
> > >
> > > You would have to inactivate every single block node, which is
> > > not a thing libvirt can currently do (they don't even really
> > > manage individual nodes yet). Apart from that I believe it is
> > > much too low-level and users shouldn't have to worry about what
> > > implications migration has for block devices. A clean migration
> > > phase model should be much easier to understand.
> >
> > Yep, I see that.
> > (Having said that, this extra round-trip with management is adding
> > extra downtime, and we'll have to work out things like what happens
> > if you cancel or fail in that time).
>
> libvirt can opt in to the additional phase only when it actually
> needs it.
>
> I'm not sure if the requirement can change while migration is already
> running. I don't think so currently, but if yes, we might need a way
> for management tools to change the capability setting after the fact.

OK, I'm going to look at doing this in the next few days.

Dave

> Kevin

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
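For comparison, a sketch of the 'migration way' sequence quoted above. Again, 'ext-manage-storage' and 'block-inactivate' are only the names proposed in this thread, not an existing API, and the socket path and drive id are made up; the final 'cont' would be issued on the destination's monitor:

    import json
    import socket
    import time

    sock = socket.socket(socket.AF_UNIX)
    sock.connect("/tmp/qmp-src.sock")   # hypothetical source QMP socket
    reader = sock.makefile()

    def qmp(cmd, **args):
        # Simplification: assumes the next line is the reply, not an event.
        sock.sendall(json.dumps({"execute": cmd, "arguments": args}).encode())
        return json.loads(reader.readline())

    json.loads(reader.readline())       # consume the QMP greeting
    qmp("qmp_capabilities")

    # Proposed capability: tell QEMU to skip (c), the bdrv_inactivate_all
    # step, leaving storage handover to the management tool.
    qmp("migrate-set-capabilities",
        capabilities=[{"capability": "ext-manage-storage", "state": True}])
    qmp("migrate", uri="tcp:dest-host:4444")

    while qmp("query-migrate")["return"]["status"] != "completed":
        time.sleep(1)

    # Management now owns the ordering: finish the job first, then hand
    # the images over explicitly, then 'cont' on the destination.
    qmp("block-job-cancel", device="drive0")
    qmp("block-inactivate")             # proposed command; per Kevin's
                                        # objection it would have to
                                        # cover every block node

The sketch also shows why Kevin considers this too low-level: the management tool must know exactly which nodes need inactivating and in what order, which is precisely what the 'ready' phase model hides.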