On Sun, Jun 18, 2017 at 07:37:00AM -0500, Michael Roth wrote: > Quoting David Gibson (2017-06-18 04:59:33) > > On Fri, Jun 16, 2017 at 11:15:46AM -0500, Michael Roth wrote: > > > Quoting David Gibson (2017-06-16 09:40:53) > > > > On Fri, Jun 16, 2017 at 03:53:12PM +0200, Igor Mammedov wrote: > > > > > On Wed, 14 Jun 2017 19:27:12 -0500 > > > > > Michael Roth <mdr...@linux.vnet.ibm.com> wrote: > > > > > > > > > > > Quoting Igor Mammedov (2017-06-14 04:00:01) > > > > > > > On Tue, 13 Jun 2017 16:42:45 -0500 > > > > > > > Michael Roth <mdr...@linux.vnet.ibm.com> wrote: > > > > > > > > > > > > > > > Quoting Igor Mammedov (2017-06-09 03:27:33) > > > > > [...] > > > > > > > > > > > > Currently I'd suggest to look into always migrate DRCs if cpu > > > > > > > hotplug > > > > > > > is enabled even if dev->hotplugged is false (not nice but it > > > > > > > might work). > > > > > > > Consider: > > > > > > > SRC1: hotplug CPU1 => CPU1.hotplugged = true > > > > > > > DST1: -device CPU1 => CPU1.hotplugged = false > > > > > > > so in current code relying on CPU1.hotplugged would not work as > > > > > > > expected, > > > > > > > it works by accident because libvirt uses device_add on target > > > > > > > DST1: device_add CPU1 => CPU1.hotplugged = true > > > > > > > > > > > > It's actually the reverse for us, DST1: -device CPU1 works, because > > > > > > default DRC state for CPU1.hotplugged = false matches the state a > > > > > > hotplugged CPU will be brought to after onlining at the source, so > > > > > > we don't send it over the wire in the first place once it reaches > > > > > > that post-hotplug/coldplug state. So things work as expected, even > > > > > > though technically the source has dev->hotplugged == true, whereas > > > > > > the dest has dev->hotplugged == false. > > > > > in your case it seems fragile to rely on -device setting hotplugged > > > > > cpu > > > > > on target the way you want. > > > > > > > > > > it could be: > > > > > > > > > > SRC: hotplug CPU and start migration with DST: -device > > > > > cpu[,hotplugged=false] > > > > > *: machine is in hotplugging state and one would lose transitional > > > > > state if DRC is not migrated. > > > > > > > > > > > It's the DST1: device_add case that wasn't accounted for when the > > > > > > DRC > > > > > > migration patches were written, as those don't default to coldplug, > > > > > > so, because the source doesn't send it, it ends up being presented > > > > > > in pre-hotplug state because the dest doesn't know that the guest > > > > > > already onlined the resource and transitioned it to > > > > > > post-hotplug/coldplug state. Ironically, dev->hotplugged > > > > > > is true on both source and dest in this case, but it ends up being > > > > > > the broken one. > > > > > it looks like for hotplugged CPUs DRC state should be always migrated > > > > > > > > Yeah, I think in the first instance we should just unconditionally > > > > transfer DRC state for newer machine types (at least once I clean up > > > > what exactly the DRC state _is_). For old machine types things are > > > > still likely to be broken, but it won't be a regression. > > > > > > > > If we can work out a way to fix things for older machine types, that's > > > > a bonus, but it looks like it's very difficult, maybe impossible. > > > > First priority should be sane semantics for the newer machine types. > > > > > > > > > > But your point stands, the fact that both situations are possible > > > > > > means we can't currently rely on dev->hotplugged without migrating > > > > > > it, infering it based on QEMU lifecycle, or forcing management to > > > > > > set it. > > > > > > > > > > > > But that raises a 2nd point. Our dilemma isn't that we can't > > > > > > rely on dev->hotplugged being synchronized (though if it > > > > > > was we could build something around that), our dilemma is > > > > > > that we make the following assumption in our code: > > > > > > > > > > > > "Devices present at start-time will be handled the same way, > > > > > > on source or dest, regardless of whether they were added via > > > > > > cmdline or via device_add prior to machine start / migration > > > > > > stream processing." > > > > > > > > > > > > And I think that's a sensible expectation, since in theory > > > > > > even the source could build up a machine via device_add > > > > > > prior to starting it, and the reasonable default there is > > > > > > dev->hotplugged = false rather than the opposite. That > > > > > > suggests a need to fix things outside of migration. > > > > > Agreed to a degree, i.e. > > > > > > > > > > -device/device_add before machine has been started > > > > > without migration should follow coldplug path > > > > > > > > > > it shouldn't cause problems for CPU/mem hotplug on x86 > > > > > and maybe will work for PCI (it may change behavior of > > > > > ACPI based hotplug and bridges), > > > > > CCing Marcel to confirm. > > > > > > > > So for ppc the main problem with early plugged devices is a > > > > duplication of device tree info between that delivered by CAS, and > > > > that delivered through configure-connector. I see two basic > > > > approaches to fixing this: > > > > > > > > 1) At CAS, reset all DRCs before building the device tree to send to > > > > the guest. That will essentially "convert" everything present at CAS > > > > time into coldplugged device. There might still be a problem if there > > > > are existing hotplug events in the queue from before CAS. > > > > > > > > 2) When building the device tree (at both CAS and reset) check the DRC > > > > state, and omit DT information for anything that's not in CONFIGURED > > > > state. That should be correct, because the guest should call > > > > configure-connector for anything not yet in CONFIGURED state, at which > > > > point it will get the DT information. > > > > > > > > I'm not sure which approach is best yet. I'm not 100% sure of (1) is > > > > correct in all cases, but it is a bit simpler than (2). > > > > > > This fixes up devices added before CAS (whether via -device/device_add) > > > on the source, but at the time of migration we will have (usually) > > > already completed CAS. We also can't repeat the CAS-time fixups on the > > > target, because that would reset any DRC state sent via migration (which > > > would include things like transitional states outside of > > > post-hotplug/coldplug). > > > > Sorry, I wasn't clear. The options above are for the problems with > > hotplug during early boot, they won't help migration cases. > > > > > So either we sent all DRC state (regardless of hotplug or not), or > > > we need a reliable starting state on the dest from which we can > > > determine what needs to be sent by the source. The 2 options are: > > > > So for new machine types, I think we should always send DRC state. > > > > > a) ensure all devices on the dest start off in a coldplug state, at > > > point we can simply send DRC for anything hotplugged. This isn't > > > the case now because device_add on the target results in the initial > > > state being hotplug, where -device's are in coldplug. I'm proposing > > > we ensure both these cases default coldplug state on the target if > > > device_add occurs before the dest is started. > > > > I think that will become the case with my Part IV DRC cleanups. IIRC, > > there is a system reset performed before processing the incoming > > migration. With my DRC changes we reset the DRC state based only on > > There's a system-wide reset, but it happens too early in cases where > libvirt opts to specify devices via device_add on the dest rather than > cmdline. > > In that case the sequence ends up being: > > 1) initialize cmdline devices and run their coldplug hooks > 2) qdev_machine_creation_done(), all devices initialized > after this point will be dev->hotplugged == true > 3) issue system-wide reset > 4) libvirt issues device_add for the CPU we've hotplugged on the source > dev->hotplugged == true. > 5) we set the DRC state to "pre-hotplug" in anticipation that the > that guest needs to finish onlining devices (same way we'd > handle it on the source). Except this is breaks, since the source > was assuming we'd put it in "coldplug/post-hotplug" just like we > would for cmdline-specified devices. That assumption doesn't hold > when libvirt opts to use device_add to build up a machine > 6) dest starts receiving migration stream > > So adding that reset prior to 6) is actually what I'm proposing we > should do.
Ah, right. I mistakenly assumed that the reset would happen immediately prior to processing the incoming stream. > (as well as adding a reset prior to machine start in general, even > outside of migration, to handle boot-time devices specified via > device_add, which I think Igor is on board with already. if we have > agreement on both the migration and non-migration cases we can > generalize the approach a bit more) > > > the device presence at reset time. Since the device will be > > hotplugged before the incoming migration, it should be reset to > > coldplug equivalent state. > > My understanding is that your patches move all our DRC-specific > coldplug hooks to a DRC reset() function, so the hotplug handler > will always assume that that scenario is handled by the DRC reset > path. I think that that's indeed the proper way to handle this. Well, the patches I have so far don't quite get there (even including the ones drafted byt not posted). But that's the direction I'm heading in, yes. > But in the scenario above, we won't get that DRC reset() prior > to 6), so that assumption doesn't hold. We need to ensure a > system-wide reset happens prior to 6) even with your proposed > series. > > I'm simplifying a little bit though: technically, we can address > this specific scenario by always migrating DRCs for > dev->hotplugged devices. But that end up being a band-aid: if you > hotplug a CPU on the source, then do a reboot, *then* do > migration you will hit that exact same issue. > > So I really think we need a machine-wide reset prior to 6) > regardless. We probably don't need a true machine-wide reset - just to call the reset() hook on every DRC. > > > > > > b) ensure dev->hotplugged is in sync between src/dest by migrating it > > > or forcing management to set it. Based on that knowledge, we can > > > determine on the src side what the default state will be on the dest, > > > and from that knowledge determine what state needs to be sent. Igor > > > has suggested an approach for handling it this way. > > > > This sounds like a bad idea - I don't know what other effects altering > > this generic qemu state will have. > > > > > Personally I'm leaning toward a), since assuming that dest devices will > > > start in a coldplug state is basically what most of the migration code > > > already does (since up until fairly recently, the hotplugged devices > > > were generally specified via cmdline on the dest and not via > > > device_add) > > > > Yes, I agree. > > > > > > > > So far, all QEMU's existing migration code has managed ok > > > > > > with the dest being starting with dev->hotplugged == false > > > > > > via cmdline devices, even though maybe they were hotplugged > > > > > > on the source. To me, it makes more sense to maintain this > > > > > > behavior by fixing up this relatively new use-case of > > > > > > adding devices via device_add before start to match the > > > > > > same expectations we have around cmdline-specified devices. > > > > > > > > > > > > This would fix migration for spapr, leave it working for > > > > > > everyone else (since that's basically what we expect for > > > > > > everything except newer-style cpu hotplug), and also make > > > > > > the device-add-before-start be truly synonymous with > > > > > > cmdline-created devices (which is applicable even outside > > > > > > of migration). > > > > > Neither -device or device_add can't really bring DRC/CPU into > > > > > the state that devices might be at the moment when their state > > > > > is transferred to target. > > > > > > > > > > i.e. any state that has been changed after -device/device_add > > > > > on SRC, should be in migration stream. I'd say even if state would > > > > > eventually go back default (coldplugged) when hotplug is completed. > > > > > So trying to avoid transmitting runtime state to optimize some bytes > > > > > on migration stream is just asking for trouble. > > > > > > > > Yeah, I'm coming to the same conclusion. Note that the reason for > > > > omitting state wasn't to save space in the migration stream, but to > > > > make the stream compatible with older versions which never transmit > > > > the state - at least in as many cases as possible. > > > > > > > > > [...] > > > > > > > > > > > > > > May be we should > > > > > > > 1. make DeviceState:hotpluggable property write-able again > > > > > > > 2. transmit DeviceState:hotpluggable as part of migration stream > > > > > > > 3. add generic migration hook which will check if target and > > > > > > > source value match, if value differs => fail/abort migration. > > > > > > > 4. in case values mismatch mgmt will be forced to explicitly > > > > > > > provide hotplugged property value on -device/device_add > > > > > > > That would enforce consistent DeviceState:hotpluggable value > > > > > > > on target and source. > > > > > > > We can enforce it only for new machine types so it won't break > > > > > > > old mgmt tools with old machine types but would force mgmt > > > > > > > for new machines to use hotplugged property on target > > > > > > > so QEMU could rely on its value for migration purposes. > > > > > > > > > > > > > > > > > > > That would work, and generalizing this beyond spapr seems > > > > > > appropriate. > > > > > > > > > > > > It also has reasonable semantics, and it would work for us > > > > > > *provided that* we always send DRC state for hotplugged devices > > > > > > and not just DRCs in a transitional state: > > > > > > > > > > > > SRC1: device_add $cpu > > > > > > -> dev->hotplugged == true > > > > > > -> device starts in pre-hotplug, ends up in post-hotplug state > > > > > > after guest onlines it > > > > > > <migrate> > > > > > > DST1: device_add $cpu,hotplugged=true > > > > > > -> dev->hotplugged == true > > > > > > -> device starts in pre-hotplug state. guest sends updated state > > > > > > to transition DRC to post-hotplug > > > > > > > > > > > > But what about stuff like mem/pci? Currently, migration works for > > > > > > cases like: > > > > > > > > > > > > SRC1: device_add virtio-net-pci > > > > > > DST1: qemu -device virtio-net-pci > > > > > > > > > > > > Even though DST1 has dev->hotplugged == false, and SRC1 has the > > > > > > opposite. So for new machines, checking SRC1:dev->hotplugged == > > > > > > DST1:dev->hotplugged would fail, even though the migration > > > > > > scenario is unchanged from before. > > > > > > > > > > > > So management would now have to do: > > > > > > > > > > > > SRC1: device_add virtio-net-pci > > > > > > DST1: qemu -device virtio-net-pci,hotplugged=true > > > > > > > > > > > > But the code behavior is a bit different then, since we now get > > > > > > an ACPI hotplug event via the hotplug handler. Maybe the > > > > > > migration stream fixes that up for us, but I think we would need > > > > > > to audit this and similar cases to be sure. > > > > > > > > > > > > That's all fine if it's necessary, but I feel like this is > > > > > > the hard way to address what's actually a much more specific > > > > > > issue: that device_add before machine-start doesn't currently > > > > > > match the behavior for a device started via cmdline. i.e. > > > > > > dev->hotplugged in the former vs. !dev->hotplugged in the > > > > > > latter. I don't really see a good reason these 2 cases should > > > > > > be different, and we can bring them to parity by doing > > > > > > something like: > > > > > > > > > > > > 1. For device_adds after qdev_machine_creation_done(), but > > > > > > before machine start, set a flag: reset_before_start. > > > > > > 2. At the start of processing migration stream, or unpausing > > > > > > a -S guest (in the non-migration case), issue a system-wide > > > > > > reset if reset_before_start is set. > > > > > > 3. reset handlers will already unset dev->hotplugged at that > > > > > > point and re-execute all the hotplug hooks with > > > > > > dev->hotplugged == false. This should put everything in > > > > > > a state that's identical to cmdline-created devices. > > > > > instead of flag for non migration case we could use > > > > > RUN_STATE_PRELAUNCH -> RUN_STATE_RUNNING > > > > > transition to reset all devices or > > > > > maybe do something like this: > > > > > > > > Hrm, does the general reset call happen now before or after this > > > > transition? Resetting DRCs at CAS time should accomplish the same > > > > thing. > > > > > > > > > diff --git a/hw/core/qdev.c b/hw/core/qdev.c > > > > > index 0ce45a2..cdeb8f8 100644 > > > > > --- a/hw/core/qdev.c > > > > > +++ b/hw/core/qdev.c > > > > > > > > > > @@ -1020,7 +1010,7 @@ static void device_initfn(Object *obj) > > > > > ObjectClass *class; > > > > > Property *prop; > > > > > > > > > > - if (qdev_hotplug) { > > > > > + if (runstate_check(RUN_STATE_RUNNING) || ...) { > > > > > dev->hotplugged = 1; > > > > > qdev_hot_added = true; > > > > > } > > > > > > > > > > > 4. Only allow management to do device_add before it sends > > > > > > the migration stream (if management doesn't already guard > > > > > > against this then it's probably a bug anyway) > > > > > seems like Juan already took care of it. > > > > > > > > > > > This allows management to treat device_add/cmdline as being > > > > > > completely synonymous for guests that haven't started yet, > > > > > > both for -incoming and -S in general, and it maintains > > > > > > the behavior that existing migration code expects of > > > > > > cmdline-specified devices. > > > > > > > > > > > > > > > > > > > > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
signature.asc
Description: PGP signature