On Thu, Feb 11, 2021 at 07:52:45PM -0300, Daniel Henrique Barboza wrote:
> There is a reliable way to make a CPU hotunplug fail in the pseries
> machine. Hotplug a CPU A, then offline all other CPUs inside the guest
> but A. When trying to hotunplug A the guest kernel will refuse to do
> it, because A is now the last online CPU of the guest. PAPR has no
> 'error callback' in this situation to report back to the platform,
> so the guest kernel will deny the unplug in silent and QEMU will never
> know what happened. The unplug pending state of A will remain until
> the guest is shutdown or rebooted.
>
> Previous attempts of fixing it (see [1] and [2]) were aimed at trying to
> mitigate the effects of the problem. In [1] we were trying to guess which
> guest CPUs were online to forbid hotunplug of the last online CPU in the QEMU
> layer, avoiding the scenario described above because QEMU is now failing
> in behalf of the guest. This is not robust because the last online CPU of
> the guest can change while we're in the middle of the unplug process, and
> our initial assumptions are now invalid. In [2] we were accepting that our
> unplug process is uncertain and the user should be allowed to spam the IRQ
> hotunplug queue of the guest in case the CPU hotunplug fails.
>
> This patch presents another alternative, using the timeout infrastructure
> introduced in the previous patch. CPU hotunplugs in the pSeries machine will
> now timeout after 15 seconds. This is a long time for a single CPU unplug
> to occur, regardless of guest load - although the user is *strongly* encouraged
> to *not* hotunplug devices from a guest under high load - and we can be sure
> that something went wrong if it takes longer than that for the guest to release
> the CPU (the same can't be said about memory hotunplug - more on that in the
> next patch).
>
> Timing out the unplug operation will reset the unplug state of the CPU and
> allow the user to try it again, regardless of the error situation that
> prevented the hotunplug to occur. Of all the not so pretty fixes/mitigations
> for CPU hotunplug errors in pSeries, timing out the operation is an admission
> that we have no control in the process, and must assume the worst case if
> the operation doesn't succeed in a sensible time frame.
>
> [1] https://lists.gnu.org/archive/html/qemu-devel/2021-01/msg03353.html
> [2] https://lists.gnu.org/archive/html/qemu-devel/2021-01/msg04400.html
>
> Reported-by: Xujun Ma <x...@redhat.com>
> Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1911414
> Signed-off-by: Daniel Henrique Barboza <danielhb...@gmail.com>

Reviewed-by: David Gibson <da...@gibson.dropbear.id.au>

> ---
>  hw/ppc/spapr.c             |  4 ++++
>  hw/ppc/spapr_drc.c         | 17 +++++++++++++++++
>  include/hw/ppc/spapr_drc.h |  3 +++
>  3 files changed, 24 insertions(+)
>
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index b066df68cb..ecce8abf14 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -3724,6 +3724,10 @@ void spapr_core_unplug_request(HotplugHandler *hotplug_dev, DeviceState *dev,
>      if (!spapr_drc_unplug_requested(drc)) {
>          spapr_drc_unplug_request(drc);
>          spapr_hotplug_req_remove_by_index(drc);
> +    } else {
> +        error_setg(errp, "core-id %d unplug is still pending, %d seconds "
> +                   "timeout remaining",
> +                   cc->core_id, spapr_drc_unplug_timeout_remaining_sec(drc));

Reporting this information is a nice touch.

>      }
>  }
>
> diff --git a/hw/ppc/spapr_drc.c b/hw/ppc/spapr_drc.c
> index c88bb524c5..c143bfb6d3 100644
> --- a/hw/ppc/spapr_drc.c
> +++ b/hw/ppc/spapr_drc.c
> @@ -398,6 +398,12 @@ void spapr_drc_unplug_request(SpaprDrc *drc)
>
>      drc->unplug_requested = true;
>
> +    if (drck->unplug_timeout_seconds != 0) {
> +        timer_mod(drc->unplug_timeout_timer,
> +                  qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
> +                  drck->unplug_timeout_seconds * 1000);
> +    }
> +
>      if (drc->state != drck->empty_state) {
>          trace_spapr_drc_awaiting_quiesce(spapr_drc_index(drc));
>          return;
> @@ -406,6 +412,16 @@ void spapr_drc_unplug_request(SpaprDrc *drc)
>      spapr_drc_release(drc);
>  }
>
> +int spapr_drc_unplug_timeout_remaining_sec(SpaprDrc *drc)
> +{
> +    if (drc->unplug_requested && timer_pending(drc->unplug_timeout_timer)) {
> +        return (qemu_timeout_ns_to_ms(drc->unplug_timeout_timer->expire_time) -
> +                qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL)) / 1000;

Hmm.  Reaching into the timer's internal fields isn't ideal.  I wonder
if we should add a helper in the timer code for reporting this
information.
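
Something along these lines, perhaps.  This is only an untested,
standalone sketch of the idea: DemoTimer and both demo_* functions are
hypothetical names made up for illustration, not the existing QEMU
timer API, and the struct is a stand-in rather than the real timer
layout.  The point is just that the "time left" arithmetic would live
next to the timer, and the DRC code would only ask for the result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a timer; NOT the real QEMUTimer layout. */
typedef struct DemoTimer {
    int64_t expire_time_ns;   /* absolute deadline, in nanoseconds */
    bool pending;             /* true while the timer is armed */
} DemoTimer;

/*
 * Hypothetical helper: milliseconds left until the timer fires, given
 * the current clock value in nanoseconds.  Returns 0 if the timer is
 * not armed or the deadline has already passed.  Partial milliseconds
 * are rounded up (a choice made for this sketch).
 */
static int64_t demo_timer_remaining_ms(const DemoTimer *t, int64_t now_ns)
{
    assert(t != NULL);

    if (!t->pending || t->expire_time_ns <= now_ns) {
        return 0;
    }
    return (t->expire_time_ns - now_ns + 999999) / 1000000;
}

/*
 * Caller side, in the spirit of spapr_drc_unplug_timeout_remaining_sec():
 * no access to the timer's fields, only to the helper.
 */
static int demo_unplug_timeout_remaining_sec(const DemoTimer *t,
                                             bool unplug_requested,
                                             int64_t now_ns)
{
    if (!unplug_requested) {
        return 0;
    }
    return (int)(demo_timer_remaining_ms(t, now_ns) / 1000);
}
```

With a 15 second deadline and the clock at 5 seconds, the caller-side
function above would report 10 seconds remaining.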

> +    }
> +
> +    return 0;
> +}
> +
>  bool spapr_drc_reset(SpaprDrc *drc)
>  {
>      SpaprDrcClass *drck = SPAPR_DR_CONNECTOR_GET_CLASS(drc);
> @@ -706,6 +722,7 @@ static void spapr_drc_cpu_class_init(ObjectClass *k, void *data)
>      drck->drc_name_prefix = "CPU ";
>      drck->release = spapr_core_release;
>      drck->dt_populate = spapr_core_dt_populate;
> +    drck->unplug_timeout_seconds = 15;
>  }
>
>  static void spapr_drc_pci_class_init(ObjectClass *k, void *data)
> diff --git a/include/hw/ppc/spapr_drc.h b/include/hw/ppc/spapr_drc.h
> index b2e6222d09..26599c385a 100644
> --- a/include/hw/ppc/spapr_drc.h
> +++ b/include/hw/ppc/spapr_drc.h
> @@ -211,6 +211,8 @@ typedef struct SpaprDrcClass {
>
>      int (*dt_populate)(SpaprDrc *drc, struct SpaprMachineState *spapr,
>                         void *fdt, int *fdt_start_offset, Error **errp);
> +
> +    int unplug_timeout_seconds;
>  } SpaprDrcClass;
>
>  typedef struct SpaprDrcPhysical {
> @@ -246,6 +248,7 @@ int spapr_dt_drc(void *fdt, int offset, Object *owner, uint32_t drc_type_mask);
>   */
>  void spapr_drc_attach(SpaprDrc *drc, DeviceState *d);
>  void spapr_drc_unplug_request(SpaprDrc *drc);
> +int spapr_drc_unplug_timeout_remaining_sec(SpaprDrc *drc);
>
>  /*
>   * Reset all DRCs, causing pending hot-plug/unplug requests to complete.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson