On Thu, Feb 11, 2021 at 07:52:46PM -0300, Daniel Henrique Barboza wrote:
> Handling errors in memory hotunplug in the pSeries machine is more
> complex than any other device type, because there are all the
> complications that other devices has, and more.
>
> For instance, determining a timeout for a DIMM hotunplug must consider
> if it's a Hash-MMU or a Radix-MMU guest, because Hash guests takes
> longer to hotunplug DIMMs. The size of the DIMM is also a factor,
> given that longer DIMMs naturally takes longer to be hotunplugged from
> the kernel. And there's also the guest memory usage to be considered:
> if there's a process that is consuming memory that would be lost by
> the DIMM unplug, the kernel will postpone the unplug process until the
> process finishes, and then initiate the regular hotunplug process. The
> first two considerations are manageable, but the last one is a deal
> breaker.
>
> There is no sane way for the pSeries machine to determine the memory
> load in the guest when attempting a DIMM hotunplug - and even if there
> was a way, the guest can start using all the RAM in the middle of the
> unplug process and invalidate our previous assumptions - and in result
> we can't even begin to calculate a timeout for the operation. This
> means that we can't implement a viable timeout mechanism for memory
> unplug in pSeries.
>
> Going back to why we would consider an unplug timeout, the reason is
> that we can't know if the kernel is giving up the unplug. Turns out
> that, sometimes, we can. Consider a failed memory hotunplug attempt
> where the kernel will error out with the following message:
>
> 'pseries-hotplug-mem: Memory indexed-count-remove failed, adding any
> removed LMBs'
>
> This happens when there is a LMB that the kernel gave up in removing,
> and the LMBs marked for removal of the same DIMM are now being added
> back. This process happens

We need to be a little careful about terminology here.  From the
guest's point of view, there's no such thing as a DIMM, only LMBs.
What the guest is doing here is essentially rejecting a single
"index + number" DRC unplug request, which corresponds to one DIMM on
the qemu side.

> in the pseries kernel in [1], dlpar_memory_remove_by_ic() into
> dlpar_add_lmb(), and after that update_lmb_associativity_index(). In
> this function, the kernel is configuring the LMB DRC connector again.
> Note that this is a valid usage in LOPAR, as stated in section
> "ibm,configure-connector RTAS Call":
>
> 'A subsequent sequence of calls to ibm,configure-connector with the
> same entry from the “ibm,drc-indexes” or “ibm,drc-info” property will
> restart the configuration of devices which were not completely
> configured.'
>
> We can use this kernel behavior in our favor. If a DRC connector
> reconfiguration for a LMB that we marked as unplug pending happens,
> this indicates that the kernel changed its mind about the unplug and
> is reasserting that it will keep using the DIMM. In this case, it's
> safe to assume that the whole DIMM unplug was cancelled.
>
> This patch hops into rtas_ibm_configure_connector() and, in the
> scenario described above, clear the unplug state for the DIMM device.
> This will not solve all the problems we still have with memory unplug,
> but it will cover this case where the kernel reconfigures LMBs after a
> failed unplug. We are a bit more resilient, without using an
> unreliable timeout, and we didn't make the remaining error cases any
> worse.

I wonder if we could use this as a beginning of a hotplug failure
reporting mechanism.  As noted, this is explicitly allowed by PAPR and
I think in general it makes sense that a configure-connector would
re-assert that the guest is using the resource and we can't unplug it.
Could we extend guests to do an indicative configure-connector on any
unplug they know they can't complete?
Or if configure-connector is too disruptive could we use an (extra)
H_SET_INDICATOR to "UNISOLATE" state?  If I'm reading right, that
should be both permitted and a no-op for existing PAPR implementations,
so it should be a pretty safe way to add that indication.

>
> [1] arch/powerpc/platforms/pseries/hotplug-memory.c
>
> Signed-off-by: Daniel Henrique Barboza <danielhb...@gmail.com>
> ---
>  hw/ppc/spapr.c         | 30 ++++++++++++++++++++++++++++++
>  hw/ppc/spapr_drc.c     | 14 ++++++++++++++
>  include/hw/ppc/spapr.h |  2 ++
>  3 files changed, 46 insertions(+)
>
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index ecce8abf14..4bcded4a1a 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -3575,6 +3575,36 @@ static SpaprDimmState *spapr_recover_pending_dimm_state(SpaprMachineState *ms,
>      return spapr_pending_dimm_unplugs_add(ms, avail_lmbs, dimm);
>  }
>
> +void spapr_clear_pending_dimm_unplug_state(SpaprMachineState *spapr,
> +                                           PCDIMMDevice *dimm)
> +{
> +    SpaprDimmState *ds = spapr_pending_dimm_unplugs_find(spapr, dimm);
> +    SpaprDrc *drc;
> +    uint32_t nr_lmbs;
> +    uint64_t size, addr_start, addr;
> +    int i;
> +
> +    if (ds) {
> +        spapr_pending_dimm_unplugs_remove(spapr, ds);
> +    }

Hrm... how would !ds arise?  Could this just be an assert?

> +
> +    size = memory_device_get_region_size(MEMORY_DEVICE(dimm), &error_abort);
> +    nr_lmbs = size / SPAPR_MEMORY_BLOCK_SIZE;
> +
> +    addr_start = object_property_get_uint(OBJECT(dimm), PC_DIMM_ADDR_PROP,
> +                                          &error_abort);
> +
> +    addr = addr_start;
> +    for (i = 0; i < nr_lmbs; i++) {
> +        drc = spapr_drc_by_id(TYPE_SPAPR_DRC_LMB,
> +                              addr / SPAPR_MEMORY_BLOCK_SIZE);
> +        g_assert(drc);
> +
> +        drc->unplug_requested = false;
> +        addr += SPAPR_MEMORY_BLOCK_SIZE;
> +    }
> +}
> +
>  /* Callback to be called during DRC release. */
>  void spapr_lmb_release(DeviceState *dev)
>  {
> diff --git a/hw/ppc/spapr_drc.c b/hw/ppc/spapr_drc.c
> index c143bfb6d3..eae941233a 100644
> --- a/hw/ppc/spapr_drc.c
> +++ b/hw/ppc/spapr_drc.c
> @@ -1230,6 +1230,20 @@ static void rtas_ibm_configure_connector(PowerPCCPU *cpu,
>
>      drck = SPAPR_DR_CONNECTOR_GET_CLASS(drc);
>
> +    /*
> +     * This indicates that the kernel is reconfiguring a LMB due to
> +     * a failed hotunplug. Clear the pending unplug state for the whole
> +     * DIMM.
> +     */
> +    if (spapr_drc_type(drc) == SPAPR_DR_CONNECTOR_TYPE_LMB &&
> +        drc->unplug_requested) {
> +
> +        /* This really shouldn't happen in this point, but ... */
> +        g_assert(drc->dev);

I'm a little worried that a buggy or malicious guest could trigger
this assert.

> +
> +        spapr_clear_pending_dimm_unplug_state(spapr, PC_DIMM(drc->dev));
> +    }
> +
>      if (!drc->fdt) {
>          void *fdt;
>          int fdt_size;
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index ccbeeca1de..5bcc8f3bb8 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -847,6 +847,8 @@ int spapr_hpt_shift_for_ramsize(uint64_t ramsize);
>  int spapr_reallocate_hpt(SpaprMachineState *spapr, int shift, Error **errp);
>  void spapr_clear_pending_events(SpaprMachineState *spapr);
>  void spapr_clear_pending_hotplug_events(SpaprMachineState *spapr);
> +void spapr_clear_pending_dimm_unplug_state(SpaprMachineState *spapr,
> +                                           PCDIMMDevice *dimm);
>  int spapr_max_server_number(SpaprMachineState *spapr);
>  void spapr_store_hpte(PowerPCCPU *cpu, hwaddr ptex,
>                        uint64_t pte0, uint64_t pte1);

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson