On Thu, Feb 24, 2022 at 08:34:40PM +0000, Joao Martins wrote:
> On 2/24/22 20:12, Michael S. Tsirkin wrote:
> > On Thu, Feb 24, 2022 at 08:04:48PM +0000, Joao Martins wrote:
> >> On 2/24/22 19:54, Michael S. Tsirkin wrote:
> >>> On Thu, Feb 24, 2022 at 07:44:26PM +0000, Joao Martins wrote:
> >>>> On 2/24/22 18:30, Michael S. Tsirkin wrote:
> >>>>> On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:
> >>>>>> On 2/24/22 17:23, Michael S. Tsirkin wrote:
> >>>>>>> On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:
> >>>>>>>> On 2/23/22 23:35, Joao Martins wrote:
> >>>>>>>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
> >>>>>>>>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>>>>>>>>>> +                                          uint64_t pci_hole64_size)
> >>>>>>>>>>> +{
> >>>>>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>>>>>>>> +    uint32_t eax, vendor[3];
> >>>>>>>>>>> +
> >>>>>>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>>>>>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
> >>>>>>>>>>> +        return;
> >>>>>>>>>>> +    }
> >>>>>>>>>>
> >>>>>>>>>> Wait a sec, should this actually be tying things to the host CPU ID?
> >>>>>>>>>> It's really about what we present to the guest though, isn't it?
> >>>>>>>>>
> >>>>>>>>> Using cpuid was the easier catch-all, without going into Linux
> >>>>>>>>> UAPI specifics. But it doesn't have to tie in there; it is only
> >>>>>>>>> for systems with an IOMMU present.
> >>>>>>>>>
> >>>>>>>>>> Also, can't we tie this to whether the AMD IOMMU is present?
> >>>>>>>>>>
> >>>>>>>>> I think so, I can add that. Something like an amd_iommu_exists()
> >>>>>>>>> helper in util/vfio-helpers.c which checks if there are any sysfs
> >>>>>>>>> child entries that start with ivhd in /sys/class/iommu/. Given that
> >>>>>>>>> this HT region has been hardcoded in the IOMMU reserved regions
> >>>>>>>>> since >=4.11 (up to latest), I don't think it's even worth checking
> >>>>>>>>> that the range exists in:
> >>>>>>>>>
> >>>>>>>>>     /sys/kernel/iommu_groups/0/reserved_regions
> >>>>>>>>>
> >>>>>>>>> (Also that sysfs ABI is >= 4.11 only)
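(As an aside, for anyone who does want to sanity-check that sysfs file: a rough,
self-contained sketch of such a check is below. The helper name and its use here
are made up for illustration; it only assumes the documented
"<start> <end> <type>" line format of reserved_regions and is not part of the
series.)

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative helper: return true if [start, end] is fully covered by an
 * entry in the given reserved_regions file (>= 4.11 sysfs ABI, one
 * "<start> <end> <type>" line per region).
 */
static bool reserved_regions_cover(const char *path,
                                   uint64_t start, uint64_t end)
{
    char type[32];
    uint64_t rstart, rend;
    bool found = false;
    FILE *f = fopen(path, "r");

    if (!f) {
        return false;
    }
    while (fscanf(f, "%" SCNx64 " %" SCNx64 " %31s", &rstart, &rend, type) == 3) {
        if (rstart <= start && end <= rend) {
            found = true;
            break;
        }
    }
    fclose(f);
    return found;
}

int main(void)
{
    /* AMD HyperTransport hole: the last 12G below 1T. */
    bool hit = reserved_regions_cover(
        "/sys/kernel/iommu_groups/0/reserved_regions",
        0xfd00000000ULL, 0xffffffffffULL);

    printf("HT hole listed as reserved: %s\n", hit ? "yes" : "no");
    return 0;
}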
> >>>>>>>>
> >>>>>>>> Here's what I have staged in my local tree to address your comment.
> >>>>>>>>
> >>>>>>>> Naturally the first chunk is what's affected by this patch; the rest
> >>>>>>>> is a predecessor patch to introduce qemu_amd_iommu_is_present().
> >>>>>>>> Seems to pass all the tests and whatnot.
> >>>>>>>>
> >>>>>>>> I am not entirely sure this is the right place to put such a helper,
> >>>>>>>> open to suggestions. Wrt the naming of the helper, I tried to follow
> >>>>>>>> the rest of the file's style.
> >>>>>>>>
> >>>>>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >>>>>>>> index a9be5d33a291..2ea4430d5dcc 100644
> >>>>>>>> --- a/hw/i386/pc.c
> >>>>>>>> +++ b/hw/i386/pc.c
> >>>>>>>> @@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>>>>>>>                                            uint64_t pci_hole64_size)
> >>>>>>>>  {
> >>>>>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>>>>> -    uint32_t eax, vendor[3];
> >>>>>>>>
> >>>>>>>> -    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>>>>>>> -    if (!IS_AMD_VENDOR(vendor)) {
> >>>>>>>> +    if (!qemu_amd_iommu_is_present()) {
> >>>>>>>>          return;
> >>>>>>>>      }
> >>>>>>>>
> >>>>>>>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> >>>>>>>> index 7bcce3bceb0f..eb4ea071ecec 100644
> >>>>>>>> --- a/include/qemu/osdep.h
> >>>>>>>> +++ b/include/qemu/osdep.h
> >>>>>>>> @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
> >>>>>>>>   */
> >>>>>>>>  size_t qemu_get_host_physmem(void);
> >>>>>>>>
> >>>>>>>> +/**
> >>>>>>>> + * qemu_amd_iommu_is_present:
> >>>>>>>> + *
> >>>>>>>> + * Operating system agnostic way of querying if an AMD IOMMU
> >>>>>>>> + * is present.
> >>>>>>>> + *
> >>>>>>>> + */
> >>>>>>>> +bool qemu_amd_iommu_is_present(void);
> >>>>>>>> +
> >>>>>>>>  /*
> >>>>>>>>   * Toggle write/execute on the pages marked MAP_JIT
> >>>>>>>>   * for the current thread.
> >>>>>>>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> >>>>>>>> index f2be7321c59f..54cef21217c4 100644
> >>>>>>>> --- a/util/oslib-posix.c
> >>>>>>>> +++ b/util/oslib-posix.c
> >>>>>>>> @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
> >>>>>>>>  #endif
> >>>>>>>>      return 0;
> >>>>>>>>  }
> >>>>>>>> +
> >>>>>>>> +bool qemu_amd_iommu_is_present(void)
> >>>>>>>> +{
> >>>>>>>> +    bool found = false;
> >>>>>>>> +#ifdef CONFIG_LINUX
> >>>>>>>> +    struct dirent *entry;
> >>>>>>>> +    char *path;
> >>>>>>>> +    DIR *dir;
> >>>>>>>> +
> >>>>>>>> +    path = g_strdup_printf("/sys/class/iommu");
> >>>>>>>> +    dir = opendir(path);
> >>>>>>>> +    if (!dir) {
> >>>>>>>> +        g_free(path);
> >>>>>>>> +        return found;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    do {
> >>>>>>>> +        entry = readdir(dir);
> >>>>>>>> +        if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
> >>>>>>>> +            found = true;
> >>>>>>>> +            break;
> >>>>>>>> +        }
> >>>>>>>> +    } while (entry);
> >>>>>>>> +
> >>>>>>>> +    g_free(path);
> >>>>>>>> +    closedir(dir);
> >>>>>>>> +#endif
> >>>>>>>> +    return found;
> >>>>>>>> +}
> >>>>>>>
> >>>>>>> why are we checking whether an AMD IOMMU is present on the host?
> >>>>>>> Isn't what we care about whether it's present in the VM? What we
> >>>>>>> are reserving, after all, is a range of GPAs, not actual host PAs ...
> >>>>>>>
> >>>>>> Right.
> >>>>>>
> >>>>>> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
> >>>>>> so we need to avoid mapping that portion of the address space entirely
> >>>>>> and use some other portion of the GPA space. This has to do with the
> >>>>>> host IOMMU, which is where the DMA mapping of guest PAs takes place.
> >>>>>> So, if you do not have a host IOMMU, you can't service guest DMA/PCIe
> >>>>>> via VFIO through the host IOMMU, and therefore you don't have this
> >>>>>> problem.
> >>>>>>
> >>>>>> Whether one uses a guest IOMMU or not does not affect the result; it
> >>>>>> would be more a case of mimicking real hardware than fixing the issue
> >>>>>> this series addresses.
> >>>>>
> >>>>>
> >>>>> Hmm okay ... my first reaction was to say let's put it under VFIO then.
> >>>>> And ideally move the logic reporting reserved ranges there too.
> >>>>> However, I think vdpa has the same issue too.
> >>>>> CC Jason for an opinion.
> >>>>
> >>>> It does have the same problem.
> >>>>
> >>>> This is not specific to VFIO; it applies to anything that uses the IOMMU.
> >>>
> >>> Right. Most VMs don't use the host IOMMU though ;) It's unfortunate
> >>> that we don't have a global "enable-vfio" flag since vfio devices
> >>> can be hot-plugged. I think this is not the first time we wanted
> >>> something like this, right Alex?
> >>>
> >>>> It's just that VFIO doesn't let you shoot yourself in the foot :)
> >>>>
> >>>> vDPA would need to validate iova ranges as well to prevent mapping on
> >>>> reserved IOVAs similar to commit 9b77e5c7984 ("vfio/type1: check dma
> >>>> map request is within a valid iova range"). Now you could argue that
> >>>> it's an IOMMU core problem, maybe.
> >>>>
> >>>> But I see this as a separate problem, because regardless of said
> >>>> validation, your guest provisioning is still going to fail for guests
> >>>> with >=1010G of max GPA, and even if it doesn't fail you shouldn't be
> >>>> letting it DMA map those in the first place (e.g. you could see
> >>>> spurious INVALID_DEVICE_REQUEST fault events from the IOMMU if DMA is
> >>>> attempted over the first megabyte of that 1T hole).
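(To make the kind of validation meant here concrete: it boils down to an
interval-overlap check before honouring a map request, roughly as sketched
below. This is a standalone illustration with made-up names, not the actual
vfio/type1 or vDPA code.)

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A reserved IOVA window that must never be DMA-mapped. */
struct iova_range {
    uint64_t start;
    uint64_t end;   /* inclusive */
};

/*
 * Refuse a DMA map request that overlaps any reserved window; the actual
 * vfio/type1 code walks a list of *valid* ranges instead, but the effect
 * is the same.
 */
static bool dma_map_request_is_valid(const struct iova_range *reserved,
                                     size_t nr, uint64_t iova, uint64_t size)
{
    uint64_t last = iova + size - 1;

    if (size == 0 || last < iova) {     /* empty request or wrap-around */
        return false;
    }
    for (size_t i = 0; i < nr; i++) {
        if (iova <= reserved[i].end && last >= reserved[i].start) {
            return false;               /* overlaps a reserved window */
        }
    }
    return true;
}

With the HT hole ({ 0xfd00000000, 0xffffffffff }) in that list, a map over the
first megabyte of the 1T hole would be refused upfront instead of surfacing
later as an IOMMU fault.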
> >>>
> >>> I wonder what's the status on a system without an IOMMU.
> >>>
> >> In theory you should be OK. Also it's worth keeping in mind that AMD
> >> machines with >=1T have this 12G hole marked as Reserved, so even DMA,
> >> at least when the CPU is the initiator, should be fine in the worst
> >> case scenario.
> > 
> > Not sure I understand this last sentence.
> > 
> I was trying to say that the E820 from hardware is marked as Reserved and
> any DMA from/to endpoints from kernel-supplied CPU addresses will use those
> reserved ranges.

meaning "will *not* use" I guess?

> >>>>> Also, some concerns
> >>>>> - I think this changes memory layout for working VMs not using VFIO.
> >>>>>   Better preserve the old layout for old machine types?
> >>>>>
> >>>> Oh definitely, and I do that in this series. See the last commit, and
> >>>> in the past it was also a machine-property exposed to userspace.
> >>>> Otherwise I would be breaking (badly) compat/migration.
> >>>>
> >>>> I would like to emphasize that these memory layout changes are
> >>>> *exclusively* and *only* for AMD hosts with guest memory bigger than
> >>>> ~950G-~1010G. It's not every guest, but a fairly limited subset of
> >>>> big-memory guests that are not the norm.
> >>>>
> >>>> The phys-bits property errors might give us a bit more trouble, though,
> >>>> so it might help to be more conservative.
> >>>>
> >>>>> - You mention the phys bits issue very briefly, and it's pretty
> >>>>>   concerning. Do we maybe want to also disable the workaround if phys
> >>>>>   bits is too small?
> >>>>
> >>>> We are doing that here (well, in v4), as suggested by Igor. Note that
> >>>> in this series it's a warning, but in v4 I have it as an error, along
> >>>> with the 32-bit issues that I found through qtest.
> >>>>
> >>>> I share the same concern as you over making this an error because of
> >>>> compatibility. Perhaps we could go back to the previous version and
> >>>> turn it back into a warning, and additionally even disable the
> >>>> relocation altogether. Or, if desired, even depend on a machine
> >>>> property to enable it.
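(For concreteness, the warning/error being discussed boils down to comparing
the highest GPA the chosen layout needs against what phys-bits can address,
along the lines of the sketch below. It is a simplified stand-in, not the exact
code in the series; max_used_gpa and check_phys_bits() are placeholder names.)

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Simplified stand-in: refuse (or merely warn about) a memory layout whose
 * highest guest physical address cannot be addressed with the configured
 * phys-bits. max_used_gpa would come from RAM size + hotplug area +
 * PCI hole64, after any relocation above the HT hole.
 */
static void check_phys_bits(uint64_t max_used_gpa, unsigned phys_bits,
                            bool fatal)
{
    uint64_t max_phys_addr = (phys_bits >= 64) ? UINT64_MAX
                                               : (1ULL << phys_bits) - 1;

    if (max_used_gpa <= max_phys_addr) {
        return;
    }
    fprintf(stderr,
            "%s: max used GPA 0x%" PRIx64 " exceeds phys-bits limit 0x%" PRIx64 "\n",
            fatal ? "error" : "warning", max_used_gpa, max_phys_addr);
    if (fatal) {
        exit(EXIT_FAILURE);
    }
}

Whether such a check warns, errors out, or simply disables the relocation is
exactly the policy question being debated here.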
> >>>
> >>> I would be inclined to disable the relocation. And maybe block vfio? I'm
> >>> not 100% sure about that last one, but this can be a vfio decision to
> >>> make.
> >>>
> >> I don't see this as a VFIO problem (VFIO is actually being a good citizen
> >> and doing the right thing :)). From my perspective this fix is also useful
> >> to vDPA (which we also care about), and thus will also let us avoid DMA
> >> mapping in these GPA ranges.
> >>
> >> The reason I mention VFIO in the cover letter is that it's one visible
> >> UAPI change that users will notice, namely that suddenly their 1TB+ VMs
> >> break.
> > 
> > What I mean is that most VMs don't use either VFIO or VDPA.
> > If we had VFIO/VDPA as an accelerator that needs to be listed
> > upfront when qemu is started, we could solve this with
> > a bit less risk by not changing anything for VMs without these two.
> > 
> That wouldn't work for the vfio/vdpa hotplug case (which we do use), and
> that is part of the reason I picked always avoiding the 1TB hole. VFIO even
> tells you what the allowed IOVA ranges are when you create the container.
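(For reference, reading those allowed ranges from userspace goes through the
type1 info capability chain, roughly as in the sketch below. Error handling is
trimmed, 'container' is assumed to be an already-configured VFIO container fd,
and the struct/ioctl names are as I recall them from linux/vfio.h rather than
taken from this series.)

#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Print the IOVA ranges the host IOMMU will accept for DMA mapping. */
static void print_allowed_iova_ranges(int container)
{
    struct vfio_iommu_type1_info tmp = { .argsz = sizeof(tmp) };
    struct vfio_iommu_type1_info *info;
    struct vfio_info_cap_header *hdr;

    /* First call only reports how much room the capability chain needs. */
    if (ioctl(container, VFIO_IOMMU_GET_INFO, &tmp)) {
        return;
    }
    info = calloc(1, tmp.argsz);
    if (!info) {
        return;
    }
    info->argsz = tmp.argsz;
    if (ioctl(container, VFIO_IOMMU_GET_INFO, info) ||
        !(info->flags & VFIO_IOMMU_INFO_CAPS)) {
        free(info);
        return;
    }

    /* Walk the capability chain looking for the IOVA range capability. */
    hdr = (struct vfio_info_cap_header *)((char *)info + info->cap_offset);
    for (;;) {
        if (hdr->id == VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE) {
            struct vfio_iommu_type1_info_cap_iova_range *cap = (void *)hdr;

            for (unsigned i = 0; i < cap->nr_iovas; i++) {
                printf("allowed IOVA range: 0x%llx - 0x%llx\n",
                       (unsigned long long)cap->iova_ranges[i].start,
                       (unsigned long long)cap->iova_ranges[i].end);
            }
            break;
        }
        if (!hdr->next) {
            break;
        }
        hdr = (struct vfio_info_cap_header *)((char *)info + hdr->next);
    }
    free(info);
}

On a host where the HT range is reserved, the printed list would leave a gap
around it, which is exactly the information QEMU could consult at provisioning
time.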
> 
> The machine property, though, could communicate /the intent/ to add any
> VFIO or vDPA devices in the future and maybe cover for that. That lets one
> avoid any 'accelerator-only' problems and restrict the issues of this
> series to VMs with VFIO/VDPA if the risk is a concern.

Well, Alex nacked that idea (and I certainly trust him where VFIO is
concerned), so I guess for now we'll just live with the risk.

> > Alex what do you think about adding this?
> > 
> > Given we do not have such a thing right now, one can get into a
> > situation where phys bits limit the CPU, more memory is supplied than
> > would fit above the reserved region, we stick the memory over the
> > reserved region, and finally a vfio device is added.
> > 
> The current code considers the maximum possible address, taking into
> account memory hotplug, PCI hole64, etc. So we would need to worry about
> whether VFIO or VDPA is going to be hotplugged at some point in the future
> during the guest lifecycle, to decide whether to alter the memory layout
> at guest provisioning.

Right. And given we currently have no way of knowing, we have to assume
yes. At least we can check whether qemu was configured with VFIO or VDPA.

> > In this last configuration, should the vfio device add fail?
> > Or just warn and try to continue? I think we can code this last
> > decision as part of the vfio code and then Alex will get to decide ;)
> > And yes, a similar thing with vdpa.
> > 
> 
> Of all those cases I feel the machine property is better, and more
> flexible, than having VFIO/VDPA deal with a bad memory layout and
> discovering at a late stage that the user is doing something wrong (and
> thus failing the DMA_MAP operation for those who do check invalid IOVAs).

-- 
MST

