On Tue, 3 Mar 2020 14:43:48 +1100 David Gibson <da...@gibson.dropbear.id.au> wrote:
> The Real Mode Area (RMA) is the part of memory which a guest can access > when in real (MMU off) mode. Of course, for a guest under KVM, the MMU > isn't really turned off, it's just in a special translation mode - Virtual > Real Mode Area (VRMA) - which looks like real mode in guest mode. > > The mechanics of how this works when using the hash MMU (HPT) put a > constraint on the size of the RMA, which depends on the size of the > HPT. So, the latter part of spapr_setup_hpt_and_vrma() clamps the RMA > we advertise to the guest based on this VRMA limit. > > There are several things wrong with this: > 1) spapr_setup_hpt_and_vrma() doesn't actually clamp, it takes the minimum > of Node 0 memory size and the VRMA limit. That will *often* work the > same as clamping, but there can be other constraints on RMA size which > supersede Node 0 memory size. We have real bugs caused by this > (currently worked around in the guest kernel) > 2) Some callers of spapr_setup_hpt_and_vrma() are in a situation where > we're past the point that we can actually advertise an RMA limit to the > guest > 3) But most fundamentally, the VRMA limit depends on host configuration > (page size) which shouldn't be visible to the guest, but this partially > exposes it. This can cause problems with migration in certain edge > cases, although we will mostly get away with it. > > In practice, this clamping is almost never applied anyway. With 64kiB > pages and the normal rules for sizing of the HPT, the theoretical VRMA > limit will be 4x(guest memory size) and so never hit. It will hit with > 4kiB pages, where it will be (guest memory size)/4. However all mainstream > distro kernels for POWER have used a 64kiB page size for at least 10 years. > > So, simply replace this logic with a check that the RMA we've calculated > based only on guest visible configuration will fit within the host implied > VRMA limit. This can break if running HPT guests on a host kernel with > 4kiB page size. As noted that's very rare. There also exist several > possible workarounds: > * Change the host kernel to use 64kiB pages > * Use radix MMU (RPT) guests instead of HPT > * Use 64kiB hugepages on the host to back guest memory > * Increase the guest memory size so that the RMA hits one of the fixed > limits before the RMA limit. This is relatively easy on POWER8 which > has a 16GiB limit, harder on POWER9 which has a 1TiB limit. > * Use a guest NUMA configuration which artificially constrains the RMA > within the VRMA limit (the RMA must always fit within Node 0). > > Previously, on KVM, we also temporarily reduced the rma_size to 256M so > that the we'd load the kernel and initrd safely, regardless of the VRMA > limit. This was a) confusing, b) could significantly limit the size of > images we could load and c) introduced a behavioural difference between > KVM and TCG. So we remove that as well. > > Signed-off-by: David Gibson <da...@gibson.dropbear.id.au> > Reviewed-by: Alexey Kardashevskiy <a...@ozlabs.ru> > --- Reviewed-by: Greg Kurz <gr...@kaod.org> > hw/ppc/spapr.c | 28 ++++++++++------------------ > hw/ppc/spapr_hcall.c | 4 ++-- > include/hw/ppc/spapr.h | 3 +-- > 3 files changed, 13 insertions(+), 22 deletions(-) > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c > index 18bf4bc3de..ef7667455c 100644 > --- a/hw/ppc/spapr.c > +++ b/hw/ppc/spapr.c > @@ -1569,7 +1569,7 @@ void spapr_reallocate_hpt(SpaprMachineState *spapr, int > shift, > spapr_set_all_lpcrs(0, LPCR_HR | LPCR_UPRT); > } > > -void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr) > +void spapr_setup_hpt(SpaprMachineState *spapr) > { > int hpt_shift; > > @@ -1585,10 +1585,16 @@ void spapr_setup_hpt_and_vrma(SpaprMachineState > *spapr) > } > spapr_reallocate_hpt(spapr, hpt_shift, &error_fatal); > > - if (spapr->vrma_adjust) { > + if (kvm_enabled()) { > hwaddr vrma_limit = kvmppc_vrma_limit(spapr->htab_shift); > > - spapr->rma_size = MIN(spapr_node0_size(MACHINE(spapr)), vrma_limit); > + /* Check our RMA fits in the possible VRMA */ > + if (vrma_limit < spapr->rma_size) { > + error_report("Unable to create %" HWADDR_PRIu > + "MiB RMA (VRMA only allows %" HWADDR_PRIu "MiB", > + spapr->rma_size / MiB, vrma_limit / MiB); > + exit(EXIT_FAILURE); > + } > } > } > > @@ -1628,7 +1634,7 @@ static void spapr_machine_reset(MachineState *machine) > spapr->patb_entry = PATE1_GR; > spapr_set_all_lpcrs(LPCR_HR | LPCR_UPRT, LPCR_HR | LPCR_UPRT); > } else { > - spapr_setup_hpt_and_vrma(spapr); > + spapr_setup_hpt(spapr); > } > > qemu_devices_reset(); > @@ -2695,20 +2701,6 @@ static void spapr_machine_init(MachineState *machine) > > spapr->rma_size = node0_size; > > - /* With KVM, we don't actually know whether KVM supports an > - * unbounded RMA (PR KVM) or is limited by the hash table size > - * (HV KVM using VRMA), so we always assume the latter > - * > - * In that case, we also limit the initial allocations for RTAS > - * etc... to 256M since we have no way to know what the VRMA size > - * is going to be as it depends on the size of the hash table > - * which isn't determined yet. > - */ > - if (kvm_enabled()) { > - spapr->vrma_adjust = 1; > - spapr->rma_size = MIN(spapr->rma_size, 0x10000000); > - } > - > /* Actually we don't support unbounded RMA anymore since we added > * proper emulation of HV mode. The max we can get is 16G which > * also happens to be what we configure for PAPR mode so make sure > diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c > index c2b3286625..40c86e91eb 100644 > --- a/hw/ppc/spapr_hcall.c > +++ b/hw/ppc/spapr_hcall.c > @@ -1458,7 +1458,7 @@ static void > spapr_check_setup_free_hpt(SpaprMachineState *spapr, > spapr_free_hpt(spapr); > } else if (!(patbe_new & PATE1_GR)) { > /* RADIX->HASH || NOTHING->HASH : Allocate HPT */ > - spapr_setup_hpt_and_vrma(spapr); > + spapr_setup_hpt(spapr); > } > return; > } > @@ -1845,7 +1845,7 @@ static target_ulong > h_client_architecture_support(PowerPCCPU *cpu, > * (because the guest isn't going to use radix) then set it up here. > */ > if ((spapr->patb_entry & PATE1_GR) && !guest_radix) { > /* legacy hash or new hash: */ > - spapr_setup_hpt_and_vrma(spapr); > + spapr_setup_hpt(spapr); > } > > if (fdt_bufsize < sizeof(hdr)) { > diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h > index a4216935a1..90dbc55931 100644 > --- a/include/hw/ppc/spapr.h > +++ b/include/hw/ppc/spapr.h > @@ -156,7 +156,6 @@ struct SpaprMachineState { > SpaprPendingHpt *pending_hpt; /* in-progress resize */ > > hwaddr rma_size; > - int vrma_adjust; > uint32_t fdt_size; > uint32_t fdt_initial_size; > void *fdt_blob; > @@ -795,7 +794,7 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool > reset, size_t space); > void spapr_events_init(SpaprMachineState *sm); > void spapr_dt_events(SpaprMachineState *sm, void *fdt); > void close_htab_fd(SpaprMachineState *spapr); > -void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr); > +void spapr_setup_hpt(SpaprMachineState *spapr); > void spapr_free_hpt(SpaprMachineState *spapr); > SpaprTceTable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn); > void spapr_tce_table_enable(SpaprTceTable *tcet,