Hi Michael, >-----Original Message----- >From: Michael S. Tsirkin <m...@redhat.com> >Subject: Re: [PATCH v1 3/6] intel_iommu: Add a framework to check and >sync host IOMMU cap/ecap > >On Mon, Mar 18, 2024 at 02:20:50PM +0100, Eric Auger wrote: >> Hi Michael, >> >> On 3/13/24 12:17, Michael S. Tsirkin wrote: >> > On Wed, Mar 13, 2024 at 07:54:11AM +0000, Duan, Zhenzhong wrote: >> >> >> >>> -----Original Message----- >> >>> From: Michael S. Tsirkin <m...@redhat.com> >> >>> Subject: Re: [PATCH v1 3/6] intel_iommu: Add a framework to check >and >> >>> sync host IOMMU cap/ecap >> >>> >> >>> On Wed, Mar 13, 2024 at 02:52:39AM +0000, Duan, Zhenzhong wrote: >> >>>> Hi Michael, >> >>>> >> >>>>> -----Original Message----- >> >>>>> From: Michael S. Tsirkin <m...@redhat.com> >> >>>>> Subject: Re: [PATCH v1 3/6] intel_iommu: Add a framework to >check and >> >>>>> sync host IOMMU cap/ecap >> >>>>> >> >>>>> On Wed, Feb 28, 2024 at 05:44:29PM +0800, Zhenzhong Duan >wrote: >> >>>>>> From: Yi Liu <yi.l....@intel.com> >> >>>>>> >> >>>>>> Add a framework to check and synchronize host IOMMU cap/ecap >with >> >>>>>> vIOMMU cap/ecap. >> >>>>>> >> >>>>>> The sequence will be: >> >>>>>> >> >>>>>> vtd_cap_init() initializes iommu->cap/ecap. >> >>>>>> vtd_check_hdev() update iommu->cap/ecap based on host >cap/ecap. >> >>>>>> iommu->cap_frozen set when machine create done, iommu- >>cap/ecap >> >>>>> become readonly. >> >>>>>> Implementation details for different backends will be in following >> >>> patches. >> >>>>>> Signed-off-by: Yi Liu <yi.l....@intel.com> >> >>>>>> Signed-off-by: Yi Sun <yi.y....@linux.intel.com> >> >>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.d...@intel.com> >> >>>>>> --- >> >>>>>> include/hw/i386/intel_iommu.h | 1 + >> >>>>>> hw/i386/intel_iommu.c | 50 >> >>>>> ++++++++++++++++++++++++++++++++++- >> >>>>>> 2 files changed, 50 insertions(+), 1 deletion(-) >> >>>>>> >> >>>>>> diff --git a/include/hw/i386/intel_iommu.h >> >>>>> b/include/hw/i386/intel_iommu.h >> >>>>>> index bbc7b96add..c71a133820 100644 >> >>>>>> --- a/include/hw/i386/intel_iommu.h >> >>>>>> +++ b/include/hw/i386/intel_iommu.h >> >>>>>> @@ -283,6 +283,7 @@ struct IntelIOMMUState { >> >>>>>> >> >>>>>> uint64_t cap; /* The value of capability reg */ >> >>>>>> uint64_t ecap; /* The value of extended >> >>>>>> capability reg >*/ >> >>>>>> + bool cap_frozen; /* cap/ecap become read-only >> >>>>>> after >> >>> frozen */ >> >>>>>> uint32_t context_cache_gen; /* Should be in [1,MAX] */ >> >>>>>> GHashTable *iotlb; /* IOTLB */ >> >>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c >> >>>>>> index ffa1ad6429..a9f9dfd6a7 100644 >> >>>>>> --- a/hw/i386/intel_iommu.c >> >>>>>> +++ b/hw/i386/intel_iommu.c >> >>>>>> @@ -35,6 +35,8 @@ >> >>>>>> #include "sysemu/kvm.h" >> >>>>>> #include "sysemu/dma.h" >> >>>>>> #include "sysemu/sysemu.h" >> >>>>>> +#include "hw/vfio/vfio-common.h" >> >>>>>> +#include "sysemu/iommufd.h" >> >>>>>> #include "hw/i386/apic_internal.h" >> >>>>>> #include "kvm/kvm_i386.h" >> >>>>>> #include "migration/vmstate.h" >> >>>>>> @@ -3819,6 +3821,38 @@ VTDAddressSpace >> >>>>> *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, >> >>>>>> return vtd_dev_as; >> >>>>>> } >> >>>>>> >> >>>>>> +static int vtd_check_legacy_hdev(IntelIOMMUState *s, >> >>>>>> + IOMMULegacyDevice *ldev, >> >>>>>> + Error **errp) >> >>>>>> +{ >> >>>>>> + return 0; >> >>>>>> +} >> >>>>>> + >> >>>>>> +static int vtd_check_iommufd_hdev(IntelIOMMUState *s, >> >>>>>> + IOMMUFDDevice *idev, >> >>>>>> + Error **errp) >> >>>>>> +{ >> >>>>>> + return 0; >> >>>>>> +} >> >>>>>> + >> >>>>>> +static int vtd_check_hdev(IntelIOMMUState *s, >> >>> VTDHostIOMMUDevice >> >>>>> *vtd_hdev, >> >>>>>> + Error **errp) >> >>>>>> +{ >> >>>>>> + HostIOMMUDevice *base_dev = vtd_hdev->dev; >> >>>>>> + IOMMUFDDevice *idev; >> >>>>>> + >> >>>>>> + if (base_dev->type == HID_LEGACY) { >> >>>>>> + IOMMULegacyDevice *ldev = container_of(base_dev, >> >>>>>> + IOMMULegacyDevice, >> >>>>>> base); >> >>>>>> + >> >>>>>> + return vtd_check_legacy_hdev(s, ldev, errp); >> >>>>>> + } >> >>>>>> + >> >>>>>> + idev = container_of(base_dev, IOMMUFDDevice, base); >> >>>>>> + >> >>>>>> + return vtd_check_iommufd_hdev(s, idev, errp); >> >>>>>> +} >> >>>>>> + >> >>>>>> static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, >int >> >>>>> devfn, >> >>>>>> HostIOMMUDevice *base_dev, Error >> >>>>>> **errp) >> >>>>>> { >> >>>>>> @@ -3829,6 +3863,7 @@ static int >> >>> vtd_dev_set_iommu_device(PCIBus >> >>>>> *bus, void *opaque, int devfn, >> >>>>>> .devfn = devfn, >> >>>>>> }; >> >>>>>> struct vtd_as_key *new_key; >> >>>>>> + int ret; >> >>>>>> >> >>>>>> assert(base_dev); >> >>>>>> >> >>>>>> @@ -3848,6 +3883,13 @@ static int >> >>> vtd_dev_set_iommu_device(PCIBus >> >>>>> *bus, void *opaque, int devfn, >> >>>>>> vtd_hdev->iommu_state = s; >> >>>>>> vtd_hdev->dev = base_dev; >> >>>>>> >> >>>>>> + ret = vtd_check_hdev(s, vtd_hdev, errp); >> >>>>>> + if (ret) { >> >>>>>> + g_free(vtd_hdev); >> >>>>>> + vtd_iommu_unlock(s); >> >>>>>> + return ret; >> >>>>>> + } >> >>>>>> + >> >>>>>> new_key = g_malloc(sizeof(*new_key)); >> >>>>>> new_key->bus = bus; >> >>>>>> new_key->devfn = devfn; >> >>>>> >> >>>>> Okay. So when VFIO device is created, it will call >> >>> vtd_dev_set_iommu_device >> >>>>> and that in turn will update caps. >> >>>>> >> >>>>> >> >>>>> >> >>>>> >> >>>>>> @@ -4083,7 +4125,9 @@ static void vtd_init(IntelIOMMUState >*s) >> >>>>>> s->iq_dw = false; >> >>>>>> s->next_frcd_reg = 0; >> >>>>>> >> >>>>>> - vtd_cap_init(s); >> >>>>>> + if (!s->cap_frozen) { >> >>>>>> + vtd_cap_init(s); >> >>>>>> + } >> >>>>>> >> >>>>> If it's fronzen it's because VFIO was added after machine done. >> >>>>> And then what? I think caps are just wrong? >> >>>> Not quite get your question on caps being wrong. But try to explains: >> >>>> >> >>>> When a hot plugged vfio device's host iommu cap isn't compatible >with >> >>>> vIOMMU's, hotplug should fail. Currently there is no check for this >and >> >>>> allow hotplug to succeed, but then some issue will reveal later, >> >>>> e.g., vIOMMU's MGAW > host IOMMU's MGAW, guest can setup >iova >> >>>> mapping beyond host supported iova range, then DMA will fail. >> >>>> >> >>>> In fact, before this series, cap is not impacted by VFIO, so it's same >effect of >> >>>> frozen after machine done. >> >>>> >> >>>>> >> >>>>> I think the way to approach this is just by specifying this >> >>>>> as an option on command line. >> >>>> Do you mean add a cap_frozen property to intel_iommu? >> >>>> Vtd_init() is called in realize() and system reset(), so I utilize >> >>>> realize() >to init >> >>> cap >> >>>> and froze cap before system reset(). If cap_frozen is an option, when >it's >> >>> set to >> >>>> false, cap could be updated every system reset and it's not a fix value >any >> >>> more. >> >>>> This may break migration. >> >>> No, I mean either >> >>> 1. add some kind of vfio-iommu device that is not exposed to guest >> >>> but is not hot pluggable >> >> Not quite get, what will such vfio-iommu device be used for if not >exposed to guest. >> > It will update the IOMMU. >> > And do so without need for tricky callbacks. >> > >> >> At the moment the only way VFIO can pass info to vIOMMU is through the >> IOMMU MR API (IOMMUMemoryRegionClass). >> Unfortunately this API is not fully adapted to new use cases because it >> relies on the IOMMU MR to be enabled which causes the VFIO >> >> vfio_listener_region_add() to be called and call the relevant IOMMU MR >callback. Before the IOMMU MR enablement there is no way to convey info >from VFIO to the vIOMMU. That why, for several years and since the >beginning of discussions related to nested IOMMU, Peter Xu, Yi Liu, myself >headed towards the usage of PCIIOMMUOps instead. >> >> >> You will find in >> [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for >hotplugged devices >> https://lore.kernel.org/all/20240117080414.316890-1- >eric.au...@redhat.com/ >> yet another example of such kind of PCIIOMMUOps. In that series instead >of relying on a HostIOMMUDevice object as proposed by Zhenzhong it uses a >direct callback but it is still the same principle and the HostIOMMUDevice >looks better to accomodate both iommufd and VFIO backend. >> >> >> host reserved memory regions is not something we can easiliy describe at >vIOMMU level. The info are fetched from host and propagated to the >vIOMMU >> >> I think nested stage setup would also need this PCIIOMMUOps trick to >setup stage 1 info. So the introduction of the so called HostIOMMUDevice >object has a broader scope than a gaw option or ecap option set I think. >> >> If you don't like this approach, please can you elaborate on the "vfio- >iommu device" proposal above so that we can explore it. >> >> Thank you in advance >> >> Eric > > >Hi Eric, >Sorry about the delay in answering - was on vacation. > >My concern is with user interface not with the internal API used. >The concern is simple - assymetry and complex rules in handling devices. >E.g. a non hotplugged vfio device adjusts the vIOMMU, then you can >remove it and hotplug another one and it works, but if you start >without vfio then hotplug then it might fail because it's too late >to adjust the vIOMMU.
Just clarify the adjustment from vfio is only zeroing the cap/ecap's 1 bits, never set an original 0 bits. So if hotplug fails, then coldplug should also fail for the same device. > >And what I am saying, is that we either want something on the qemu >command line >that will tell the vIOMMU "please use info from the host" or >have management take info from the host and supply that to the vIOMMU. >The former seems more user friendly, for sure. The disadvantage of it >is that one can't e.g. take the least common denominator between two >hosts and make things migrateable in this way. >If we limit our ambition to VFIO that might be acceptable, after all >VFIO already assumes very similar hardware on src and dst of migration. >The later is what we did for aw-bits. We can get host IOMMU's hw cap/ecap by reading below file: /sys/devices/virtual/iommu/dmar[*]/intel-iommu/[cap|ecap] But it's not accurate as kernel command line can limit the support from software aspect, e.g., intel_iommu=off/sm_off. So I'd like to follow up the 2nd way, add two options which defaults 0, + DEFINE_PROP_UINT64("host_cap", IntelIOMMUState, host_cap, 0), + DEFINE_PROP_UINT64("host_ecap", IntelIOMMUState, host_ecap, 0), When any is set by management, it overrides vIOMMU default value, Or else use default. With cap/ecap set by management, we still need to have the cap/ecap check logic between host and vIOMMU, but no update, no frozen. The suggestion for management is to set a value based on /sys/devices/virtual/iommu/dmar[*]/intel-iommu/[cap|ecap] But do not assign a random value. Let me know if I misunderstand anything😊 > >I see lots of acceptable options, but yes it seems nice to have >something single that will work for all IOMMUs and also maybe live under >VFIO? Thus I came up with the idea of a stub device living under vfio >who's job is just to be present on command line and adjust the guest >machine (ATM the viommu but we might see other uses too) to match host. IIUC, kernel doesn’t provide any ioctl() for a stub device to get the least common denominator of cap/ecap of all the host IOMMU units. So I'd like to try your suggestion as above to add options. Thanks Zhenzhong