Re: [Qemu-devel] [PATCH] spapr: manage hotplugged devices while the VM is not started

David Gibson Sun, 18 Jun 2017 06:41:15 -0700

On Sun, Jun 18, 2017 at 07:37:00AM -0500, Michael Roth wrote:
> Quoting David Gibson (2017-06-18 04:59:33)
> > On Fri, Jun 16, 2017 at 11:15:46AM -0500, Michael Roth wrote:
> > > Quoting David Gibson (2017-06-16 09:40:53)
> > > > On Fri, Jun 16, 2017 at 03:53:12PM +0200, Igor Mammedov wrote:
> > > > > On Wed, 14 Jun 2017 19:27:12 -0500
> > > > > Michael Roth <mdr...@linux.vnet.ibm.com> wrote:
> > > > > 
> > > > > > Quoting Igor Mammedov (2017-06-14 04:00:01)
> > > > > > > On Tue, 13 Jun 2017 16:42:45 -0500
> > > > > > > Michael Roth <mdr...@linux.vnet.ibm.com> wrote:
> > > > > > >   
> > > > > > > > Quoting Igor Mammedov (2017-06-09 03:27:33)  
> > > > > [...]
> > > > > 
> > > > > > > Currently I'd suggest to look into always migrate DRCs if cpu 
> > > > > > > hotplug
> > > > > > > is enabled even if dev->hotplugged is false (not nice but it 
> > > > > > > might work).
> > > > > > > Consider:
> > > > > > >   SRC1: hotplug CPU1 => CPU1.hotplugged = true
> > > > > > >   DST1: -device CPU1 => CPU1.hotplugged = false
> > > > > > > so in current code relying on CPU1.hotplugged would not work as 
> > > > > > > expected,
> > > > > > > it works by accident because libvirt uses device_add on target
> > > > > > >   DST1: device_add CPU1 => CPU1.hotplugged = true  
> > > > > > 
> > > > > > It's actually the reverse for us, DST1: -device CPU1 works, because
> > > > > > default DRC state for CPU1.hotplugged = false matches the state a
> > > > > > hotplugged CPU will be brought to after onlining at the source, so
> > > > > > we don't send it over the wire in the first place once it reaches
> > > > > > that post-hotplug/coldplug state. So things work as expected, even
> > > > > > though technically the source has dev->hotplugged == true, whereas
> > > > > > the dest has dev->hotplugged == false.
> > > > > in your case it seems fragile to rely on -device setting hotplugged 
> > > > > cpu
> > > > > on target the way you want.
> > > > > 
> > > > > it could be:
> > > > > 
> > > > > SRC: hotplug CPU and start migration with DST: -device 
> > > > > cpu[,hotplugged=false]
> > > > >   *: machine is in hotplugging state and one would lose transitional 
> > > > > state if DRC is not migrated.
> > > > >  
> > > > > > It's the DST1: device_add case that wasn't accounted for when the 
> > > > > > DRC
> > > > > > migration patches were written, as those don't default to coldplug,
> > > > > > so, because the source doesn't send it, it ends up being presented
> > > > > > in pre-hotplug state because the dest doesn't know that the guest
> > > > > > already onlined the resource and transitioned it to
> > > > > > post-hotplug/coldplug state. Ironically, dev->hotplugged
> > > > > > is true on both source and dest in this case, but it ends up being
> > > > > > the broken one.
> > > > > it looks like for hotplugged CPUs DRC state should be always migrated
> > > > 
> > > > Yeah, I think in the first instance we should just unconditionally
> > > > transfer DRC state for newer machine types (at least once I clean up
> > > > what exactly the DRC state _is_).  For old machine types things are
> > > > still likely to be broken, but it won't be a regression.
> > > > 
> > > > If we can work out a way to fix things for older machine types, that's
> > > > a bonus, but it looks like it's very difficult, maybe impossible.
> > > > First priority should be sane semantics for the newer machine types.
> > > > 
> > > > > > But your point stands, the fact that both situations are possible
> > > > > > means we can't currently rely on dev->hotplugged without migrating
> > > > > > it, infering it based on QEMU lifecycle, or forcing management to
> > > > > > set it.
> > > > > > 
> > > > > > But that raises a 2nd point. Our dilemma isn't that we can't
> > > > > > rely on dev->hotplugged being synchronized (though if it
> > > > > > was we could build something around that), our dilemma is
> > > > > > that we make the following assumption in our code:
> > > > > > 
> > > > > > "Devices present at start-time will be handled the same way,
> > > > > > on source or dest, regardless of whether they were added via
> > > > > > cmdline or via device_add prior to machine start / migration
> > > > > > stream processing."
> > > > > > 
> > > > > > And I think that's a sensible expectation, since in theory
> > > > > > even the source could build up a machine via device_add
> > > > > > prior to starting it, and the reasonable default there is
> > > > > > dev->hotplugged = false rather than the opposite. That
> > > > > > suggests a need to fix things outside of migration.
> > > > > Agreed to a degree, i.e.
> > > > > 
> > > > >   -device/device_add before machine has been started
> > > > > without migration should follow coldplug path
> > > > > 
> > > > > it shouldn't cause problems for CPU/mem hotplug on x86
> > > > > and maybe will work for PCI (it may change behavior of
> > > > > ACPI based hotplug and bridges),
> > > > > CCing Marcel to confirm.
> > > > 
> > > > So for ppc the main problem with early plugged devices is a
> > > > duplication of device tree info between that delivered by CAS, and
> > > > that delivered through configure-connector.  I see two basic
> > > > approaches to fixing this:
> > > > 
> > > > 1) At CAS, reset all DRCs before building the device tree to send to
> > > > the guest.  That will essentially "convert" everything present at CAS
> > > > time into coldplugged device.  There might still be a problem if there
> > > > are existing hotplug events in the queue from before CAS.
> > > > 
> > > > 2) When building the device tree (at both CAS and reset) check the DRC
> > > > state, and omit DT information for anything that's not in CONFIGURED
> > > > state.  That should be correct, because the guest should call
> > > > configure-connector for anything not yet in CONFIGURED state, at which
> > > > point it will get the DT information.
> > > > 
> > > > I'm not sure which approach is best yet.  I'm not 100% sure of (1) is
> > > > correct in all cases, but it is a bit simpler than (2).
> > > 
> > > This fixes up devices added before CAS (whether via -device/device_add)
> > > on the source, but at the time of migration we will have (usually)
> > > already completed CAS. We also can't repeat the CAS-time fixups on the
> > > target, because that would reset any DRC state sent via migration (which
> > > would include things like transitional states outside of
> > > post-hotplug/coldplug).
> > 
> > Sorry, I wasn't clear.  The options above are for the problems with
> > hotplug during early boot, they won't help migration cases.
> > 
> > > So either we sent all DRC state (regardless of hotplug or not), or
> > > we need a reliable starting state on the dest from which we can
> > > determine what needs to be sent by the source. The 2 options are:
> > 
> > So for new machine types, I think we should always send DRC state.
> > 
> > > a) ensure all devices on the dest start off in a coldplug state, at
> > >    point we can simply send DRC for anything hotplugged. This isn't
> > >    the case now because device_add on the target results in the initial
> > >    state being hotplug, where -device's are in coldplug. I'm proposing
> > >    we ensure both these cases default coldplug state on the target if
> > >    device_add occurs before the dest is started.
> > 
> > I think that will become the case with my Part IV DRC cleanups.  IIRC,
> > there is a system reset performed before processing the incoming
> > migration.  With my DRC changes we reset the DRC state based only on
> 
> There's a system-wide reset, but it happens too early in cases where
> libvirt opts to specify devices via device_add on the dest rather than
> cmdline.
> 
> In that case the sequence ends up being:
> 
> 1) initialize cmdline devices and run their coldplug hooks
> 2) qdev_machine_creation_done(), all devices initialized
>    after this point will be dev->hotplugged == true
> 3) issue system-wide reset
> 4) libvirt issues device_add for the CPU we've hotplugged on the source
>    dev->hotplugged == true.
> 5) we set the DRC state to "pre-hotplug" in anticipation that the
>    that guest needs to finish onlining devices (same way we'd
>    handle it on the source). Except this is breaks, since the source
>    was assuming we'd put it in "coldplug/post-hotplug" just like we
>    would for cmdline-specified devices. That assumption doesn't hold
>    when libvirt opts to use device_add to build up a machine
> 6) dest starts receiving migration stream
> 
> So adding that reset prior to 6) is actually what I'm proposing we
> should do.


Ah, right.  I mistakenly assumed that the reset would happen
immediately prior to processing the incoming stream.

> (as well as adding a reset prior to machine start in general, even
> outside of migration, to handle boot-time devices specified via
> device_add, which I think Igor is on board with already. if we have
> agreement on both the migration and non-migration cases we can
> generalize the approach a bit more)
> 
> > the device presence at reset time.  Since the device will be
> > hotplugged before the incoming migration, it should be reset to
> > coldplug equivalent state.
> 
> My understanding is that your patches move all our DRC-specific
> coldplug hooks to a DRC reset() function, so the hotplug handler
> will always assume that that scenario is handled by the DRC reset
> path. I think that that's indeed the proper way to handle this.

Well, the patches I have so far don't quite get there (even including
the ones drafted byt not posted).  But that's the direction I'm
heading in, yes.

> But in the scenario above, we won't get that DRC reset() prior
> to 6), so that assumption doesn't hold. We need to ensure a
> system-wide reset happens prior to 6) even with your proposed
> series.
> 
> I'm simplifying a little bit though: technically, we can address
> this specific scenario by always migrating DRCs for
> dev->hotplugged devices. But that end up being a band-aid: if you
> hotplug a CPU on the source, then do a reboot, *then* do
> migration you will hit that exact same issue.
> 
> So I really think we need a machine-wide reset prior to 6)
> regardless.

We probably don't need a true machine-wide reset - just to call the
reset() hook on every DRC.

> 
> > 
> > > b) ensure dev->hotplugged is in sync between src/dest by migrating it
> > >    or forcing management to set it. Based on that knowledge, we can
> > >    determine on the src side what the default state will be on the dest,
> > >    and from that knowledge determine what state needs to be sent. Igor
> > >    has suggested an approach for handling it this way.
> > 
> > This sounds like a bad idea - I don't know what other effects altering
> > this generic qemu state will have.
> > 
> > > Personally I'm leaning toward a), since assuming that dest devices will
> > > start in a coldplug state is basically what most of the migration code
> > > already does (since up until fairly recently, the hotplugged devices
> > > were generally specified via cmdline on the dest and not via
> > > device_add)
> > 
> > Yes, I agree.
> > 
> > > > > > So far, all QEMU's existing migration code has managed ok
> > > > > > with the dest being starting with dev->hotplugged == false
> > > > > > via cmdline devices, even though maybe they were hotplugged
> > > > > > on the source. To me, it makes more sense to maintain this
> > > > > > behavior by fixing up this relatively new use-case of
> > > > > > adding devices via device_add before start to match the
> > > > > > same expectations we have around cmdline-specified devices.
> > > > > > 
> > > > > > This would fix migration for spapr, leave it working for
> > > > > > everyone else (since that's basically what we expect for
> > > > > > everything except newer-style cpu hotplug), and also make
> > > > > > the device-add-before-start be truly synonymous with
> > > > > > cmdline-created devices (which is applicable even outside
> > > > > > of migration).
> > > > > Neither -device or device_add can't really bring DRC/CPU into
> > > > > the state that devices might be at the moment when their state
> > > > > is transferred to target.
> > > > > 
> > > > > i.e. any state that has been changed after -device/device_add
> > > > > on SRC, should be in migration stream. I'd say even if state would
> > > > > eventually go back default (coldplugged) when hotplug is completed.
> > > > > So trying to avoid transmitting runtime state to optimize some bytes
> > > > > on migration stream is just asking for trouble.
> > > > 
> > > > Yeah, I'm coming to the same conclusion.  Note that the reason for
> > > > omitting state wasn't to save space in the migration stream, but to
> > > > make the stream compatible with older versions which never transmit
> > > > the state - at least in as many cases as possible.
> > > > 
> > > > > [...]
> > > > > > > 
> > > > > > > May be we should
> > > > > > >  1. make DeviceState:hotpluggable property write-able again
> > > > > > >  2. transmit DeviceState:hotpluggable as part of migration stream
> > > > > > >  3. add generic migration hook which will check if target and
> > > > > > >     source value match, if value differs => fail/abort migration.
> > > > > > >  4. in case values mismatch mgmt will be forced to explicitly
> > > > > > >     provide hotplugged property value on -device/device_add
> > > > > > > That would enforce consistent DeviceState:hotpluggable value
> > > > > > > on target and source.
> > > > > > > We can enforce it only for new machine types so it won't break
> > > > > > > old mgmt tools with old machine types but would force mgmt
> > > > > > > for new machines to use hotplugged property on target
> > > > > > > so QEMU could rely on its value for migration purposes.
> > > > > > >   
> > > > > > 
> > > > > > That would work, and generalizing this beyond spapr seems
> > > > > > appropriate.
> > > > > > 
> > > > > > It also has reasonable semantics, and it would work for us
> > > > > > *provided that* we always send DRC state for hotplugged devices
> > > > > > and not just DRCs in a transitional state:
> > > > > > 
> > > > > > SRC1: device_add $cpu
> > > > > >  -> dev->hotplugged == true
> > > > > >  -> device starts in pre-hotplug, ends up in post-hotplug state  
> > > > > >     after guest onlines it
> > > > > > <migrate>
> > > > > > DST1: device_add $cpu,hotplugged=true
> > > > > >  -> dev->hotplugged == true
> > > > > >  -> device starts in pre-hotplug state. guest sends updated state  
> > > > > >     to transition DRC to post-hotplug
> > > > > > 
> > > > > > But what about stuff like mem/pci? Currently, migration works for
> > > > > > cases like:
> > > > > > 
> > > > > > SRC1: device_add virtio-net-pci
> > > > > > DST1: qemu -device virtio-net-pci
> > > > > > 
> > > > > > Even though DST1 has dev->hotplugged == false, and SRC1 has the
> > > > > > opposite. So for new machines, checking SRC1:dev->hotplugged ==
> > > > > > DST1:dev->hotplugged would fail, even though the migration
> > > > > > scenario is unchanged from before.
> > > > > > 
> > > > > > So management would now have to do:
> > > > > > 
> > > > > > SRC1: device_add virtio-net-pci
> > > > > > DST1: qemu -device virtio-net-pci,hotplugged=true
> > > > > > 
> > > > > > But the code behavior is a bit different then, since we now get
> > > > > > an ACPI hotplug event via the hotplug handler. Maybe the
> > > > > > migration stream fixes that up for us, but I think we would need
> > > > > > to audit this and similar cases to be sure.
> > > > > > 
> > > > > > That's all fine if it's necessary, but I feel like this is
> > > > > > the hard way to address what's actually a much more specific
> > > > > > issue: that device_add before machine-start doesn't currently
> > > > > > match the behavior for a device started via cmdline. i.e.
> > > > > > dev->hotplugged in the former vs. !dev->hotplugged in the
> > > > > > latter. I don't really see a good reason these 2 cases should
> > > > > > be different, and we can bring them to parity by doing
> > > > > > something like:
> > > > > > 
> > > > > > 1. For device_adds after qdev_machine_creation_done(), but
> > > > > >    before machine start, set a flag: reset_before_start.
> > > > > > 2. At the start of processing migration stream, or unpausing
> > > > > >    a -S guest (in the non-migration case), issue a system-wide
> > > > > >    reset if reset_before_start is set.
> > > > > > 3. reset handlers will already unset dev->hotplugged at that
> > > > > >    point and re-execute all the hotplug hooks with
> > > > > >    dev->hotplugged == false. This should put everything in
> > > > > >    a state that's identical to cmdline-created devices.
> > > > > instead of flag for non migration case we could use
> > > > >  RUN_STATE_PRELAUNCH -> RUN_STATE_RUNNING
> > > > > transition to reset all devices or
> > > > > maybe do something like this:
> > > > 
> > > > Hrm, does the general reset call happen now before or after this
> > > > transition?  Resetting DRCs at CAS time should accomplish the same 
> > > > thing.
> > > > 
> > > > > diff --git a/hw/core/qdev.c b/hw/core/qdev.c
> > > > > index 0ce45a2..cdeb8f8 100644
> > > > > --- a/hw/core/qdev.c
> > > > > +++ b/hw/core/qdev.c
> > > > > 
> > > > > @@ -1020,7 +1010,7 @@ static void device_initfn(Object *obj)
> > > > >      ObjectClass *class;
> > > > >      Property *prop;
> > > > >  
> > > > > -    if (qdev_hotplug) {
> > > > > +    if (runstate_check(RUN_STATE_RUNNING) || ...) {
> > > > >          dev->hotplugged = 1;
> > > > >          qdev_hot_added = true;
> > > > >      }
> > > > > 
> > > > > > 4. Only allow management to do device_add before it sends
> > > > > >    the migration stream (if management doesn't already guard
> > > > > >    against this then it's probably a bug anyway)
> > > > > seems like Juan already took care of it.
> > > > > 
> > > > > > This allows management to treat device_add/cmdline as being
> > > > > > completely synonymous for guests that haven't started yet,
> > > > > > both for -incoming and -S in general, and it maintains
> > > > > > the behavior that existing migration code expects of
> > > > > > cmdline-specified devices.
> > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson

signature.asc
Description: PGP signature

Re: [Qemu-devel] [PATCH] spapr: manage hotplugged devices while the VM is not started

Reply via email to