>-----Original Message-----
>From: Liu, Yi L <[email protected]>
>Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>migration
>
>On 2025/10/15 15:48, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Liu, Yi L <[email protected]>
>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>> migration
>>>
>>> On 2025/10/14 10:31, Duan, Zhenzhong wrote:
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Liu, Yi L <[email protected]>
>>>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>>>> migration
>>>>>
>>>>> On 2025/10/13 10:50, Duan, Zhenzhong wrote:
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Liu, Yi L <[email protected]>
>>>>>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>>>>>> migration
>>>>>>>
>>>>>>> On 2025/9/10 10:37, Zhenzhong Duan wrote:
>>>>>>>> If a VFIO device in the guest switches from an IOMMU domain to a block
>>>>>>>> domain, vtd_address_space_unmap() is called to unmap the whole address
>>>>>>>> space.
>>>>>>>>
>>>>>>>> If that happens during migration, migration fails with legacy VFIO
>>>>>>>> backend as below:
>>>>>>>>
>>>>>>>> Status: failed (vfio_container_dma_unmap(0x561bbbd92d90,
>>>>>>>> 0x100000000000, 0x100000000000) = -7 (Argument list too long))
>>>>>>>
>>>>>>> this should be a giant and busy VM, right? Is a Fixes tag needed, by
>>>>>>> the way?
>>>>>>
>>>>>> VM size is unrelated; it's not a bug, just that the current code doesn't
>>>>>> work well with migration.
>>>>>>
>>>>>> When the device switches from an IOMMU domain to a block domain, the
>>>>>> whole iommu memory region is disabled; this triggers the unmap on the
>>>>>> whole iommu memory region,
>>>>>
>>>>> I got this part.
>>>>>
>>>>>> no matter how many or how large the mappings are in the iommu MR.
>>>>>
>>>>> hmmm. A more explicit question: does this error happen with 4G VM
>>>>> memory as well?
>>>>
>>>> Coincidentally, I remember the QAT team reported this issue just with
>>>> 4G VM memory.
>>>
>>> ok. this might happen with legacy vIOMMU as guest triggers map/unmap.
>>> It can be a large range. But it's still not clear to me how the guest
>>> can map a range of more than 4G if the VM only has 4G memory.
>>
>> It happens when the guest switches from a DMA domain to a block domain;
>> the below sequence is triggered:
>>
>> vtd_context_device_invalidate
>> vtd_address_space_sync
>> vtd_address_space_unmap
>>
>> You can see the whole iommu address space is unmapped; it's unrelated to
>> the actual mappings in the guest.
>
>got it.
>
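
To make this concrete for anyone following the thread, here is a minimal
sketch of the effect (simplified, not the real vtd_address_space_unmap()
code; the helper name is made up and the usual intel_iommu.c headers are
assumed):

    /*
     * Simplified sketch only: the point is that the UNMAP notification
     * spans the notifier's whole range, independent of how much the
     * guest actually mapped. (The real code splits the range into
     * aligned power-of-two chunks.)
     */
    static void vtd_unmap_whole_range_sketch(IOMMUNotifier *n)
    {
        IOMMUTLBEvent event = {
            .type = IOMMU_NOTIFIER_UNMAP,
            .entry = {
                .target_as = &address_space_memory,
                .iova = n->start,                 /* usually 0 */
                .addr_mask = n->end - n->start,   /* whole address space */
                .perm = IOMMU_NONE,
            },
        };

        /*
         * With the legacy VFIO backend this ends up in
         * vfio_container_dma_unmap(), which has to size a dirty bitmap
         * for the whole range during migration.
         */
        memory_region_notify_iommu_one(n, &event);
    }

The notifier range covers the whole address space regardless of what the
guest actually mapped, which is why the range in the error above is so
large.
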
>>>
>>>>
>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Because legacy VFIO limits the maximum bitmap size to 256MB, which
>>>>>>>> maps to 8TB on a 4K page system, the unmap_bitmap ioctl fails when a
>>>>>>>> 16TB sized UNMAP notification is sent.
>>>>>>>>
>>>>>>>> There is no such limitation with the iommufd backend, but it's still
>>>>>>>> not optimal to allocate a large bitmap.
>>>>>>>>
>>>>>>>> Optimize it by iterating over the DMAMap list to unmap each range with
>>>>>>>> an active mapping when migration is active. If migration is not active,
>>>>>>>> unmapping the whole address space in one go is optimal.
>>>>>>>>
>>>>>>>> Signed-off-by: Zhenzhong Duan <[email protected]>
>>>>>>>> Tested-by: Giovanni Cabiddu <[email protected]>
>>>>>>>> ---
>>>>>>>>  hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>> 1 file changed, 42 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>>>>> index 83c5e44413..6876dae727 100644
>>>>>>>> --- a/hw/i386/intel_iommu.c
>>>>>>>> +++ b/hw/i386/intel_iommu.c
>>>>>>>> @@ -37,6 +37,7 @@
>>>>>>>> #include "system/system.h"
>>>>>>>> #include "hw/i386/apic_internal.h"
>>>>>>>> #include "kvm/kvm_i386.h"
>>>>>>>> +#include "migration/misc.h"
>>>>>>>> #include "migration/vmstate.h"
>>>>>>>> #include "trace.h"
>>>>>>>>
>>>>>>>> @@ -4423,6 +4424,42 @@ static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
>>>>>>>> vtd_iommu_unlock(s);
>>>>>>>> }
>>>>>>>>
>>>>>>>> +/*
>>>>>>>> + * Unmapping a large range in one go is not optimal during migration
>>>>>>>> + * because a large dirty bitmap needs to be allocated while there may
>>>>>>>> + * be only small mappings, iterate over DMAMap list to unmap each
>>>>>>>> + * range with active mapping.
>>>>>>>> + */
>>>>>>>> +static void vtd_address_space_unmap_in_migration(VTDAddressSpace *as,
>>>>>>>> +                                                 IOMMUNotifier *n)
>>>>>>>> +{
>>>>>>>> +    const DMAMap *map;
>>>>>>>> +    const DMAMap target = {
>>>>>>>> +        .iova = n->start,
>>>>>>>> +        .size = n->end,
>>>>>>>> +    };
>>>>>>>> +    IOVATree *tree = as->iova_tree;
>>>>>>>> +
>>>>>>>> +    /*
>>>>>>>> +     * DMAMap is created during IOMMU page table sync, it's either 4KB
>>>>>>>> +     * or huge page size and always a power of 2 in size. So the range
>>>>>>>> +     * of DMAMap could be used for UNMAP notification directly.
>>>>>>>> +     */
>>>>>>>> +    while ((map = iova_tree_find(tree, &target))) {
>>>>>>>
>>>>>>> how about an empty iova_tree? If guest has not mapped anything for the
>>>>>>> device, the tree is empty. And it is fine to not unmap anything. While,
>>>>>>> if the device is attached to an identity domain, the iova_tree is empty
>>>>>>> as well. Are we sure that we need not unmap anything here? It looks like
>>>>>>> the answer is yes. But I suspect the unmap failure will happen on the
>>>>>>> vfio side? If yes, we need to consider a complete fix. :)
>>>>>>
>>>>>> I don't get what failure will happen, could you elaborate?
>>>>>> In the case of an identity domain, the IOMMU memory region is disabled,
>>>>>> so no iommu notifier will ever be triggered. vfio_listener monitors the
>>>>>> memory address space; if any memory region is disabled, vfio_listener
>>>>>> will catch it and do dirty tracking.
>>>>>
>>>>> My question comes from the reason why the DMA unmap fails. It is because
>>>>> a big range is given to the kernel while the kernel does not support it.
>>>>> So if VFIO gives a big range as well, it should fail as well. And this is
>>>>> possible when the guest (a VM with a large memory size) switches from an
>>>>> identity domain to a paging domain. In this case, vfio_listener will
>>>>> unmap all the system MRs, and that can be a big range if the VM size is
>>>>> big enough.
>>>>
>>>> Got your point. Yes, currently the vfio_type1 driver limits unmap_bitmap
>>>> to 8TB size. If guest memory is large enough to lead to a memory region
>>>> of more than 8TB size, unmap_bitmap will fail. It's a rare case to live
>>>> migrate a VM with more than 8TB memory; instead of fixing it in qemu with
>>>> a complex change, I'd suggest bumping the below MACRO value to enlarge
>>>> the limit in the kernel, or switching to the iommufd backend which
>>>> doesn't have such a limit.
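
To spell out where the 8TB figure comes from: type1 tracks dirty state with
one bit per 4K page and caps the bitmap at 256MB, so the largest range one
unmap_bitmap call can cover is 256MB * 8 * 4KB = 8TB. A throwaway
calculation, in case it helps (the names below are mine, not the kernel's
macros):

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative names only; they are not the vfio_iommu_type1 macros. */
    #define BITMAP_BYTES_MAX    (256ULL << 20)  /* 256MB dirty bitmap cap  */
    #define TRACK_PAGE_SIZE     (4ULL << 10)    /* one bit per 4K page     */

    int main(void)
    {
        uint64_t max_range = BITMAP_BYTES_MAX * 8 * TRACK_PAGE_SIZE;

        printf("max range per unmap_bitmap: %llu TiB\n",
               (unsigned long long)(max_range >> 40));  /* prints 8 */
        return 0;
    }
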
>>>
>>> This limit shall not affect the usage of device dirty tracking, right?
>>> If yes, add something to tell the user that using the iommufd backend is
>>> better, e.g. if the memory size is bigger than the limit of vfio iommu
>>> type1's dirty bitmap (query cap_mig.max_dirty_bitmap_size), then fail the
>>> user if the user wants the migration capability.
>>
>> Do you mean just dirty tracking instead of migration, like dirty rate?
>> In that case, there is an error print as above; I think that's enough as
>> a hint?
>
>it's not related to dirty rate.
>
>> I guess you mean to add a migration blocker if the limit is reached? It's
>> hard because the limit only matters for the identity domain; a DMA domain
>> in the guest doesn't have such a limit, and we can't know the guest's
>> choice of domain type for each attached VFIO device.
>
>I meant a blocker to boot QEMU if there is a limit. Something like below:
>
>    if (VM memory > 8TB && legacy_container_backend && migration_enabled)
>        fail the VM boot.
OK, I will add the below to vfio_migration_realize() with an extra patch:

    if (!vbasedev->iommufd && current_machine->ram_size > 8 * TiB) {
        /*
         * The 8TB comes from the default kernel and QEMU config; it may be
         * conservative here, as the VM can use large pages or run with a
         * vIOMMU, so the limitation may be relaxed. But 8TB is already
         * quite large for live migration. One can also switch to the
         * IOMMUFD backend if there is a need to migrate a large VM.
         */
        error_setg(&err, "%s: Migration is currently not supported for VMs "
                   "with approximately 8TB or more memory due to a "
                   "limitation in the VFIO type1 driver", vbasedev->name);
        goto add_blocker;
    }
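
(The 8 * TiB constant above just mirrors the default type1 cap, i.e. the
256MB bitmap with 4K pages; if we want to avoid hardcoding it, the threshold
could in principle be derived from the cap_mig.max_dirty_bitmap_size you
mentioned, but for a simple boot-time blocker the fixed value seems good
enough.)
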
Thanks
Zhenzhong