On Thu, Feb 24, 2022 at 08:34:40PM +0000, Joao Martins wrote:
> On 2/24/22 20:12, Michael S. Tsirkin wrote:
> > On Thu, Feb 24, 2022 at 08:04:48PM +0000, Joao Martins wrote:
> >> On 2/24/22 19:54, Michael S. Tsirkin wrote:
> >>> On Thu, Feb 24, 2022 at 07:44:26PM +0000, Joao Martins wrote:
> >>>> On 2/24/22 18:30, Michael S. Tsirkin wrote:
> >>>>> On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:
> >>>>>> On 2/24/22 17:23, Michael S. Tsirkin wrote:
> >>>>>>> On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:
> >>>>>>>> On 2/23/22 23:35, Joao Martins wrote:
> >>>>>>>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
> >>>>>>>>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>>>>>>>>>> +                                          uint64_t pci_hole64_size)
> >>>>>>>>>>> +{
> >>>>>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>>>>>>>> +    uint32_t eax, vendor[3];
> >>>>>>>>>>> +
> >>>>>>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>>>>>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
> >>>>>>>>>>> +        return;
> >>>>>>>>>>> +    }
> >>>>>>>>>>
> >>>>>>>>>> Wait a sec, should this actually be tying things to the host CPU ID?
> >>>>>>>>>> It's really about what we present to the guest though, isn't it?
> >>>>>>>>>
> >>>>>>>>> It was the easier catch-all to use cpuid without going into
> >>>>>>>>> Linux UAPI specifics. But it doesn't have to tie in there; it is only
> >>>>>>>>> for systems with an IOMMU present.
> >>>>>>>>>
> >>>>>>>>>> Also, can't we tie this to whether the AMD IOMMU is present?
> >>>>>>>>>>
> >>>>>>>>> I think so, I can add that. Something like an amd_iommu_exists() helper
> >>>>>>>>> in util/vfio-helpers.c which checks if there are any sysfs child entries
> >>>>>>>>> that start with ivhd in /sys/class/iommu/. Given that this HT region is
> >>>>>>>>> hardcoded in iommu reserved regions since >=4.11 (to latest) I don't
> >>>>>>>>> think it's even worth checking that the range exists in:
> >>>>>>>>>
> >>>>>>>>>     /sys/kernel/iommu_groups/0/reserved_regions
> >>>>>>>>>
> >>>>>>>>> (Also that sysfs ABI is >= 4.11 only)
> >>>>>>>>
> >>>>>>>> Here's what I have staged in my local tree, to address your comment.
> >>>>>>>>
> >>>>>>>> Naturally the first chunk is what's affected by this patch; the rest is a
> >>>>>>>> predecessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
> >>>>>>>> all the tests and whatnot.
> >>>>>>>>
> >>>>>>>> I am not entirely sure this is the right place to put such a helper; open
> >>>>>>>> to suggestions. Wrt the naming of the helper, I tried to follow the rest
> >>>>>>>> of the file's style.
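
(For reference, the reserved_regions check mentioned above could look roughly
like the sketch below. It is illustrative only and not part of the staged
patch; the helper name is made up, and it assumes the >=4.11 sysfs format of
one "<start> <end> <type>" line per region plus an assumed HT window start of
0xfd00000000.)

/*
 * Illustrative sketch only, not part of the staged patch: check whether
 * the >=4.11 sysfs ABI lists a reserved region covering the assumed HT
 * window start. Each line of the file reads "<start> <end> <type>".
 */
#include <stdio.h>
#include <stdbool.h>

#define AMD_HT_START 0xfd00000000ULL   /* assumed start of the reserved HT window */

static bool iommu_group_reserves_ht_window(const char *group)
{
    char path[128];
    char line[256];
    bool found = false;
    FILE *fp;

    snprintf(path, sizeof(path),
             "/sys/kernel/iommu_groups/%s/reserved_regions", group);
    fp = fopen(path, "r");
    if (!fp) {
        return false;   /* pre-4.11 kernel, or no such group */
    }

    while (fgets(line, sizeof(line), fp)) {
        unsigned long long start, end;
        char type[32];

        /* Addresses are printed in hex with a 0x prefix. */
        if (sscanf(line, "%llx %llx %31s", &start, &end, type) == 3 &&
            start <= AMD_HT_START && AMD_HT_START <= end) {
            found = true;
            break;
        }
    }

    fclose(fp);
    return found;
}

Called as, say, iommu_group_reserves_ht_window("0"), this would mirror the
per-group check that the mail above argues is not worth doing, since the range
has been hardcoded in the kernel's reserved regions since 4.11.
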
> >>>>>>>>
> >>>>>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >>>>>>>> index a9be5d33a291..2ea4430d5dcc 100644
> >>>>>>>> --- a/hw/i386/pc.c
> >>>>>>>> +++ b/hw/i386/pc.c
> >>>>>>>> @@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>>>>>>>                                            uint64_t pci_hole64_size)
> >>>>>>>>  {
> >>>>>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>>>>> -    uint32_t eax, vendor[3];
> >>>>>>>>
> >>>>>>>> -    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>>>>>>> -    if (!IS_AMD_VENDOR(vendor)) {
> >>>>>>>> +    if (!qemu_amd_iommu_is_present()) {
> >>>>>>>>          return;
> >>>>>>>>      }
> >>>>>>>>
> >>>>>>>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> >>>>>>>> index 7bcce3bceb0f..eb4ea071ecec 100644
> >>>>>>>> --- a/include/qemu/osdep.h
> >>>>>>>> +++ b/include/qemu/osdep.h
> >>>>>>>> @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
> >>>>>>>>   */
> >>>>>>>>  size_t qemu_get_host_physmem(void);
> >>>>>>>>
> >>>>>>>> +/**
> >>>>>>>> + * qemu_amd_iommu_is_present:
> >>>>>>>> + *
> >>>>>>>> + * Operating system agnostic way of querying if an AMD IOMMU
> >>>>>>>> + * is present.
> >>>>>>>> + *
> >>>>>>>> + */
> >>>>>>>> +bool qemu_amd_iommu_is_present(void);
> >>>>>>>> +
> >>>>>>>>  /*
> >>>>>>>>   * Toggle write/execute on the pages marked MAP_JIT
> >>>>>>>>   * for the current thread.
> >>>>>>>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> >>>>>>>> index f2be7321c59f..54cef21217c4 100644
> >>>>>>>> --- a/util/oslib-posix.c
> >>>>>>>> +++ b/util/oslib-posix.c
> >>>>>>>> @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
> >>>>>>>>  #endif
> >>>>>>>>      return 0;
> >>>>>>>>  }
> >>>>>>>> +
> >>>>>>>> +bool qemu_amd_iommu_is_present(void)
> >>>>>>>> +{
> >>>>>>>> +    bool found = false;
> >>>>>>>> +#ifdef CONFIG_LINUX
> >>>>>>>> +    struct dirent *entry;
> >>>>>>>> +    char *path;
> >>>>>>>> +    DIR *dir;
> >>>>>>>> +
> >>>>>>>> +    path = g_strdup_printf("/sys/class/iommu");
> >>>>>>>> +    dir = opendir(path);
> >>>>>>>> +    if (!dir) {
> >>>>>>>> +        g_free(path);
> >>>>>>>> +        return found;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    do {
> >>>>>>>> +        entry = readdir(dir);
> >>>>>>>> +        if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
> >>>>>>>> +            found = true;
> >>>>>>>> +            break;
> >>>>>>>> +        }
> >>>>>>>> +    } while (entry);
> >>>>>>>> +
> >>>>>>>> +    g_free(path);
> >>>>>>>> +    closedir(dir);
> >>>>>>>> +#endif
> >>>>>>>> +    return found;
> >>>>>>>> +}
> >>>>>>>
> >>>>>>> why are we checking whether an AMD IOMMU is present
> >>>>>>> on the host? Isn't what we care about whether it's
> >>>>>>> present in the VM? What we are reserving after all
> >>>>>>> is a range of GPAs, not actual host PAs ...
> >>>>>>>
> >>>>>> Right.
> >>>>>>
> >>>>>> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
> >>>>>> and so we need to not map that portion of address space entirely
> >>>>>> and use some other portion of the GPA space. This has to
> >>>>>> do with the host IOMMU, which is where the DMA mapping of guest PAs
> >>>>>> takes place. So, if you do not have a host IOMMU, you can't
> >>>>>> service guest DMA/PCIe via VFIO through the host IOMMU, and therefore
> >>>>>> you don't have this problem.
> >>>>>>
> >>>>>> Whether one uses a guest IOMMU or not does not affect the result;
> >>>>>> it would be more of a case of mimicking real hardware, not fixing
> >>>>>> the issue of this series.
> >>>>>
> >>>>>
> >>>>> Hmm okay ... my first reaction was to say let's put it under VFIO then.
> >>>>> And ideally move the logic reporting reserved ranges there too.
> >>>>> However, I think vdpa has the same issue too.
> >>>>> CC Jason for an opinion.
> >>>>
> >>>> It does have the same problem.
> >>>>
> >>>> This is not specific to VFIO; it applies to anything that uses the IOMMU.
> >>>
> >>> Right. Most VMs don't use the host IOMMU though ;) It's unfortunate
> >>> that we don't have a global "enable-vfio" flag since vfio devices
> >>> can be hot-plugged. I think this is not the first time we wanted
> >>> something like this, right Alex?
> >>>
> >>>> It's just that VFIO doesn't let you shoot yourself in the foot :)
> >>>>
> >>>> vDPA would need to validate iova ranges as well to prevent mapping on
> >>>> reserved IOVAs, similar to commit 9b77e5c7984 ("vfio/type1: check dma
> >>>> map request is within a valid iova range"). Now you could argue that
> >>>> it's an IOMMU core problem, maybe.
> >>>>
> >>>> But I see this as a separate problem, because regardless of said
> >>>> validation, your guest provisioning is still going to fail for guests
> >>>> with >=1010G of max GPA, and even if it doesn't fail you shouldn't be
> >>>> letting it DMA map those ranges in the first place (e.g. you could see
> >>>> spurious INVALID_DEVICE_REQUEST fault events from the IOMMU if DMA is
> >>>> attempted over the first megabyte of that 1T hole).
> >>>
> >>> I wonder what's the status on a system without an IOMMU.
> >>>
> >> In theory you should be OK. Also it's worth keeping in mind that AMD machines
> >> with >=1T have this 12G hole marked as Reserved, so even DMA, at least when
> >> the CPU is the initiator, should be fine in the worst-case scenario.
> >
> > Not sure I understand this last sentence.
> >
> I was trying to say that the E820 from hardware is marked as Reserved and any
> DMA from/to endpoints from kernel-supplied CPU addresses will use those
> reserved ranges.
meaning "will *not* use" I guess? > >>>>> Also, some concerns > >>>>> - I think this changes memory layout for working VMs not using VFIO. > >>>>> Better preserve the old layout for old machine types? > >>>>> > >>>> Oh definitely, and I do that in this series. See the last commit, and > >>>> in the past it was also a machine-property exposed to userspace. > >>>> Otherwise I would be breaking (badly) compat/migration. > >>>> > >>>> I would like to emphasize that these memory layout changes are > >>>> *exclusive* and > >>>> *only* for hosts on AMD with guests memory being bigger than > >>>> ~950G-~1010G. > >>>> It's not every guest, but a fairly limited subset of big-memory guests > >>>> that > >>>> are not the norm. > >>>> > >>>> Albeit the phys-bits property errors might gives us a bit more trouble, > >>>> so > >>>> it might help being more conservative. > >>>> > >>>>> - You mention the phys bits issue very briefly, and it's pretty > >>>>> concerning. Do we maybe want to also disable the work around if phys > >>>>> bits is too small? > >>>> > >>>> We are doing that here (well, v4), as suggested by Igor. Note that in > >>>> this series > >>>> it's a warning, but v4 I have it as an error and with the 32-bit issues > >>>> that > >>>> I found through qtest. > >>>> > >>>> I share the same concern as you over making this an error because of > >>>> compatibility. > >>>> Perhaps, we could go back to the previous version and turn back into a > >>>> warning and > >>>> additionally even disabling the relocation all together. Or if desired > >>>> even > >>>> depending on a machine-property to enable it. > >>> > >>> I would be inclined to disable the relocation. And maybe block vfio? I'm > >>> not 100% sure about that last one, but this can be a vfio decision to > >>> make. > >>> > >> I don't see this as a VFIO problem (VFIO is actually being a good citizen > >> and doing the > >> right thing :)). From my perspective this fix is also useful to vDPA > >> (which we also care), > >> and thus will also let us avoid DMA mapping in these GPA ranges. > >> > >> The reason I mention VFIO in cover letter is that it's one visible UAPI > >> change that > >> users will notice that suddenly their 1TB+ VMs break. > > > > What I mean is that most VMs don't use either VFIO or VDPA. > > If we had VFIO,VDPA as an accelerator that needs to be listed > > upfront when qemu is started, we could solve this with > > a bit less risk by not changing anything for VMs without these two. > > > That wouldn't work for the vfio/vdpa hotplug case (which we do use), > and part of the reason I picked to always avoid the 1TB hole. VFIO even tells > you what are those allowed IOVA ranges when you create the container. > > The machine property, though, could communicate /the intent/ to add > any VFIO or vDPA devices in the future and maybe cover for that. That > let's one avoid any 'accelerator-only' problems and restrict the issues > of this series to VMs with VFIO/VDPA if the risk is a concern. Well Alex nacked that idea (and I certainly trust him where VFIO is concerned), I guess for now we'll just live with the risk. > > Alex what do you think about adding this? > > > > Given we do not have such a thing right now, one can get > > into a situation where phys bits limit CPU, then > > more memory is supplied than would fit above reserved region, then > > we stick the memory over the reserved region, and finally > > then vfio device is added. 
> >
> The current code considers the maximum possible address, taking into account
> memory hotplug, the PCI hole64, etc. So we would need to worry about
> whether VFIO or VDPA is going to be hotplugged at some point in the
> future during the guest lifecycle, to decide whether to alter the memory
> layout at guest provisioning.

Right. And given we currently have no way of knowing, we have to assume yes.
At least we can check whether qemu was configured with VFIO or VDPA.

> > In this last configuration, should the vfio device add fail?
> > Or just warn and try to continue? I think we can code this last
> > decision as part of vfio code and then Alex will get to decide ;)
> > And yes, a similar thing with vdpa.
> >
> Of all those cases I would feel the machine property is better,
> and more flexible, than having VFIO/VDPA deal with a bad memory layout and
> discover at a late stage that the user is doing something wrong (and thus
> fail the DMA_MAP operation for those who do check for invalid iovas).

--
MST
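
As an aside, to make the relocation being debated above concrete, a rough
sketch of the kind of decision x86_update_above_4g_mem_start() has to make
could look like the following. This is illustrative only, not the code from
the series; the HT window constant, the 1 TiB boundary and the phys-bits
handling are assumptions for illustration.

/*
 * Illustrative sketch only, not the code from the series: relocate the
 * start of above-4G guest RAM past 1 TiB when the guest's maximum GPA
 * would otherwise overlap the AMD HyperTransport reserved window.
 * The constants and the phys-bits handling are assumptions.
 */
#include <stdint.h>

#define AMD_HT_START     0xfd00000000ULL    /* assumed reserved window start (1012 GiB) */
#define ABOVE_1TB_START  0x10000000000ULL   /* 1 TiB */

static uint64_t relocate_above_4g_start(uint64_t above_4g_start,
                                        uint64_t max_gpa,
                                        unsigned guest_phys_bits)
{
    if (max_gpa < AMD_HT_START) {
        /* Layout never reaches the reserved window; nothing to do. */
        return above_4g_start;
    }
    if (guest_phys_bits <= 40) {
        /*
         * The guest cannot address anything at or above 1 TiB, so the
         * relocation is impossible; per the discussion above, the series
         * warns (or errors in v4) in this situation instead of moving RAM.
         */
        return above_4g_start;
    }
    /* Start above-4G RAM past the hole so no GPA lands inside it. */
    return ABOVE_1TB_START;
}

The real series, as discussed above, also keeps the old layout for old machine
types; this sketch ignores compatibility entirely.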