from:"Duan, Zhenzhong"

RE: [PATCH v4 17/17] tests/qtest: Add intel-iommu test

2024-09-30 Thread Duan, Zhenzhong

Sorry, I forgot to update the test to the new parameters
"x-scalable-mode=on,x-fls=on". I will resend this patch only.

Thanks
Zhenzhong

>-Original Message-
>From: Duan, Zhenzhong 
>Subject: [PATCH v4 17/17] tests/qtest: Add intel-iommu test
>
>Add the framework to test the intel-iommu device.
>
>Currently only tested cap/ecap bits correctness in scalable
>modern mode. Also tested cap/ecap bits consistency before
>and after system reset.
>
>Signed-off-by: Zhenzhong Duan 
>Acked-by: Thomas Huth 
>Reviewed-by: Clément Mathieu--Drif
>Acked-by: Jason Wang 
>---
> MAINTAINERS                    |  1 +
> include/hw/i386/intel_iommu.h  |  1 +
> tests/qtest/intel-iommu-test.c | 65 ++
> tests/qtest/meson.build        |  1 +
> 4 files changed, 68 insertions(+)
> create mode 100644 tests/qtest/intel-iommu-test.c
>
>diff --git a/MAINTAINERS b/MAINTAINERS
>index 62f5255f40..331b7c7a13 100644
>--- a/MAINTAINERS
>+++ b/MAINTAINERS
>@@ -3679,6 +3679,7 @@ S: Supported
> F: hw/i386/intel_iommu.c
> F: hw/i386/intel_iommu_internal.h
> F: include/hw/i386/intel_iommu.h
>+F: tests/qtest/intel-iommu-test.c
>
> AMD-Vi Emulation
> S: Orphan
>diff --git a/include/hw/i386/intel_iommu.h
>b/include/hw/i386/intel_iommu.h
>index 4d6acb2314..a1858898f1 100644
>--- a/include/hw/i386/intel_iommu.h
>+++ b/include/hw/i386/intel_iommu.h
>@@ -47,6 +47,7 @@ OBJECT_DECLARE_SIMPLE_TYPE(IntelIOMMUState,
>INTEL_IOMMU_DEVICE)
> #define VTD_HOST_AW_48BIT   48
> #define VTD_HOST_AW_AUTO    0xff
> #define VTD_HAW_MASK(aw)    ((1ULL << (aw)) - 1)
>+#define VTD_MGAW_FROM_CAP(cap)  ((cap >> 16) & 0x3fULL)
>
> #define DMAR_REPORT_F_INTR  (1)
>
>diff --git a/tests/qtest/intel-iommu-test.c b/tests/qtest/intel-iommu-test.c
>new file mode 100644
>index 00..6131e20117
>--- /dev/null
>+++ b/tests/qtest/intel-iommu-test.c
>@@ -0,0 +1,65 @@
>+/*
>+ * QTest testcase for intel-iommu
>+ *
>+ * Copyright (c) 2024 Intel, Inc.
>+ *
>+ * Author: Zhenzhong Duan 
>+ *
>+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
>+ * See the COPYING file in the top-level directory.
>+ */
>+
>+#include "qemu/osdep.h"
>+#include "libqtest.h"
>+#include "hw/i386/intel_iommu_internal.h"
>+
>+#define CAP_MODERN_FIXED1    (VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | \
>+                              VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS)
>+#define ECAP_MODERN_FIXED1   (VTD_ECAP_QI | VTD_ECAP_IR | VTD_ECAP_IRO | \
>+                              VTD_ECAP_MHMV | VTD_ECAP_SMTS | VTD_ECAP_FLTS)
>+
>+static inline uint64_t vtd_reg_readq(QTestState *s, uint64_t offset)
>+{
>+    return qtest_readq(s, Q35_HOST_BRIDGE_IOMMU_ADDR + offset);
>+}
>+
>+static void test_intel_iommu_modern(void)
>+{
>+    uint8_t init_csr[DMAR_REG_SIZE];       /* register values */
>+    uint8_t post_reset_csr[DMAR_REG_SIZE]; /* register values */
>+    uint64_t cap, ecap, tmp;
>+    QTestState *s;
>+
>+    s = qtest_init("-M q35 -device intel-iommu,x-scalable-mode=modern");
>+
>+    cap = vtd_reg_readq(s, DMAR_CAP_REG);
>+    g_assert((cap & CAP_MODERN_FIXED1) == CAP_MODERN_FIXED1);
>+
>+    tmp = cap & VTD_CAP_SAGAW_MASK;
>+    g_assert(tmp == (VTD_CAP_SAGAW_39bit | VTD_CAP_SAGAW_48bit));
>+
>+    tmp = VTD_MGAW_FROM_CAP(cap);
>+    g_assert(tmp == VTD_HOST_AW_48BIT - 1);
>+
>+    ecap = vtd_reg_readq(s, DMAR_ECAP_REG);
>+    g_assert((ecap & ECAP_MODERN_FIXED1) == ECAP_MODERN_FIXED1);
>+
>+    qtest_memread(s, Q35_HOST_BRIDGE_IOMMU_ADDR, init_csr, DMAR_REG_SIZE);
>+
>+    qobject_unref(qtest_qmp(s, "{ 'execute': 'system_reset' }"));
>+    qtest_qmp_eventwait(s, "RESET");
>+
>+    qtest_memread(s, Q35_HOST_BRIDGE_IOMMU_ADDR, post_reset_csr, DMAR_REG_SIZE);
>+    /* Ensure registers are consistent after hard reset */
>+    g_assert(!memcmp(init_csr, post_reset_csr, DMAR_REG_SIZE));
>+
>+    qtest_quit(s);
>+}
>+
>+int main(int argc, char **argv)
>+{
>+    g_test_init(&argc, &argv, NULL);
>+    qtest_add_func("/q35/intel-iommu/modern", test_intel_iommu_modern);
>+
>+    return g_test_run();
>+}
>diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
>index 310865e49c..8a928caf70 100644
>--- a/tests/qtest/meson.build
>+++ b/tests/qtest/meson.build
>@@ -90,6 +90,7 @@ qtests_i386 = \
>   (config_all_devices.has_key('CONFIG_SB16') ? ['fuzz-sb16-test'] : []) +
>\
>   (config_all_devices.has_key('CONFIG_SDHCI_PCI') ? ['fuzz-sdcard-test'] : [])
>+\
>   (config_all_devices.has_key('CONFIG_ESP_PCI') ? ['am53c974-test'] : []) +
>\
>+  (config_all_devices.has_key('CONFIG_VTD') ? ['intel-iommu-test'] : []) +
>\
>   (host_os != 'windows' and   
>  \
>config_all_devices.has_key('CONFIG_ACPI_ERST') ? ['erst-test'] : []) +
>\
>   (config_all_devices.has_key('CONFIG_PCIE_PORT') and
>\
>--
>2.34.1
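
For readers following the MGAW assertion in the test above, here is a minimal standalone sketch of the same decoding. It assumes, as the patch's macro and assertion imply, that the field sits at bits 21:16 of the capability register and stores the address width minus one. The helper make_cap_with_mgaw() is hypothetical and exists only for illustration. The macro here also parenthesizes its argument, which the version in the patch does not.

```c
#include <stdint.h>

#define VTD_HOST_AW_48BIT 48

/* Same extraction as the macro in the patch, with the argument parenthesized. */
#define VTD_MGAW_FROM_CAP(cap) (((cap) >> 16) & 0x3fULL)

/*
 * Hypothetical helper (not part of QEMU): build a CAP value that carries
 * only the MGAW field, encoded as "address width - 1".
 */
static inline uint64_t make_cap_with_mgaw(uint8_t aw_bits)
{
    return ((uint64_t)(aw_bits - 1) & 0x3fULL) << 16;
}
```

With a 48-bit width the field decodes to 47, which is the value the test compares against VTD_HOST_AW_48BIT - 1.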

RE: [PATCH v3 06/17] intel_iommu: Implement stage-1 translation

2024-09-29 Thread Duan, Zhenzhong



>-Original Message-
>From: Liu, Yi L 
>Subject: Re: [PATCH v3 06/17] intel_iommu: Implement stage-1 translation
>
>On 2024/9/11 13:22, Zhenzhong Duan wrote:
>> From: Yi Liu 
>>
>> This adds stage-1 page table walking to support stage-1 only
>> transltion in scalable modern mode.
>
>a typo. s/transltion/translation/

Will fix.

>>
>> Signed-off-by: Yi Liu 
>> Co-developed-by: Clément Mathieu--Drif d...@eviden.com>
>> Signed-off-by: Clément Mathieu--Drif 
>> Signed-off-by: Yi Sun 
>> Signed-off-by: Zhenzhong Duan 
>> ---
>>   hw/i386/intel_iommu_internal.h |  23 ++
>>   hw/i386/intel_iommu.c  | 146
>-
>>   2 files changed, 165 insertions(+), 4 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index 1fa4add9e2..51e9b1fc43 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -433,6 +433,21 @@ typedef union VTDInvDesc VTDInvDesc;
>>   (0x3800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM |
>VTD_SL_TM)) : \
>>   (0x3800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
>>
>> +/* Rsvd field masks for fpte */
>> +#define VTD_FS_UPPER_IGNORED 0xfff0ULL
>> +#define VTD_FPTE_PAGE_L1_RSVD_MASK(aw) \
>> +(~(VTD_HAW_MASK(aw) | VTD_FS_UPPER_IGNORED))
>> +#define VTD_FPTE_PAGE_L2_RSVD_MASK(aw) \
>> +(~(VTD_HAW_MASK(aw) | VTD_FS_UPPER_IGNORED))
>> +#define VTD_FPTE_PAGE_L3_RSVD_MASK(aw) \
>> +(~(VTD_HAW_MASK(aw) | VTD_FS_UPPER_IGNORED))
>> +#define VTD_FPTE_PAGE_L3_FS1GP_RSVD_MASK(aw) \
>> +(0x3fffe000ULL | ~(VTD_HAW_MASK(aw) |
>VTD_FS_UPPER_IGNORED))
>> +#define VTD_FPTE_PAGE_L2_FS2MP_RSVD_MASK(aw) \
>> +(0x1fe000ULL | ~(VTD_HAW_MASK(aw) | VTD_FS_UPPER_IGNORED))
>
>May we follow the same naming for the large page? e.g. LPAGE_L2,
>LPAGE_L3.
>Also follow the order of the SL definitions as well.

Sure, will do.

>
>> +#define VTD_FPTE_PAGE_L4_RSVD_MASK(aw) \
>> +(0x80ULL | ~(VTD_HAW_MASK(aw) | VTD_FS_UPPER_IGNORED))
>> +
>>   /* Masks for PIOTLB Invalidate Descriptor */
>>   #define VTD_INV_DESC_PIOTLB_G (3ULL << 4)
>>   #define VTD_INV_DESC_PIOTLB_ALL_IN_PASID  (2ULL << 4)
>> @@ -525,6 +540,14 @@ typedef struct VTDRootEntry VTDRootEntry;
>>   #define VTD_SM_PASID_ENTRY_AW  7ULL /* Adjusted guest-
>address-width */
>>   #define VTD_SM_PASID_ENTRY_DID(val)((val) &
>VTD_DOMAIN_ID_MASK)
>>
>> +#define VTD_SM_PASID_ENTRY_FLPM  3ULL
>> +#define VTD_SM_PASID_ENTRY_FLPTPTR   (~0xfffULL)
>> +
>> +/* First Level Paging Structure */
>> +/* Masks for First Level Paging Entry */
>> +#define VTD_FL_P        1ULL
>> +#define VTD_FL_RW_MASK  (1ULL << 1)
>> +
>>   /* Second Level Page Translation Pointer*/
>>   #define VTD_SM_PASID_ENTRY_SLPTPTR (~0xfffULL)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index a22bd43b98..6e31a8d383 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -48,6 +48,8 @@
>>
>>   /* pe operations */
>>   #define VTD_PE_GET_TYPE(pe) ((pe)->val[0] &
>VTD_SM_PASID_ENTRY_PGTT)
>> +#define VTD_PE_GET_FL_LEVEL(pe) \
>> +(4 + (((pe)->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM))
>>   #define VTD_PE_GET_SL_LEVEL(pe) \
>>   (2 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW))
>>
>> @@ -755,6 +757,11 @@ static inline bool
>vtd_is_sl_level_supported(IntelIOMMUState *s, uint32_t level)
>>  (1ULL << (level - 2 + VTD_CAP_SAGAW_SHIFT));
>>   }
>>
>> +static inline bool vtd_is_fl_level_supported(IntelIOMMUState *s,
>uint32_t level)
>> +{
>> +return level == VTD_PML4_LEVEL;
>> +}
>> +
>>   /* Return true if check passed, otherwise false */
>>   static inline bool vtd_pe_type_check(X86IOMMUState *x86_iommu,
>>VTDPASIDEntry *pe)
>> @@ -838,6 +845,11 @@ static int
>vtd_get_pe_in_pasid_leaf_table(IntelIOMMUState *s,
>>   return -VTD_FR_PASID_TABLE_ENTRY_INV;
>>   }
>>
>> +if (pgtt == VTD_SM_PASID_ENTRY_FLT &&
>> +!vtd_is_fl_level_supported(s, VTD_PE_GET_FL_LEVEL(pe))) {
>> +return -VTD_FR_PASID_TABLE_ENTRY_INV;
>> +}
>> +
>>   return 0;
>>   }
>>
>> @@ -973,7 +985,11 @@ static uint32_t
>vtd_get_iova_level(IntelIOMMUState *s,
>>
>>   if (s->root_scalable) {
>>   vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
>> -return VTD_PE_GET_SL_LEVEL(&pe);
>> +if (s->scalable_modern) {
>> +return VTD_PE_GET_FL_LEVEL(&pe);
>> +} else {
>> +return VTD_PE_GET_SL_LEVEL(&pe);
>> +}
>>   }
>>
>>   return vtd_ce_get_level(ce);
>> @@ -1060,7 +1076,11 @@ static dma_addr_t
>vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
>>
>>   if (s->root_scalable) {
>>   vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
>> -return pe.val[0] & VTD_SM_PASID_ENTRY_SLPTPTR;
>> +if (s->scalable_modern) {
>> +return pe.val[2] & VTD_SM_PASID_ENTRY_FLP

RE: [PATCH v3 05/17] intel_iommu: Rename slpte to pte

2024-09-29 Thread Duan, Zhenzhong



>-Original Message-
>From: Liu, Yi L 
>Subject: Re: [PATCH v3 05/17] intel_iommu: Rename slpte to pte
>
>On 2024/9/11 13:22, Zhenzhong Duan wrote:
>> From: Yi Liu 
...
>> @@ -1918,13 +1919,13 @@ static bool
>vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>
>>   cc_entry = &vtd_as->context_cache_entry;
>>
>> -/* Try to fetch slpte form IOTLB, we don't need RID2PASID logic */
>> +/* Try to fetch pte form IOTLB, we don't need RID2PASID logic */
>
>s/form/from/

Will fix.

>
>>   if (!rid2pasid) {
>>   iotlb_entry = vtd_lookup_iotlb(s, source_id, pasid, addr);
>>   if (iotlb_entry) {
>> -trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->slpte,
>> +trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->pte,
>>iotlb_entry->domain_id);
>> -slpte = iotlb_entry->slpte;
>> +pte = iotlb_entry->pte;
>>   access_flags = iotlb_entry->access_flags;
>>   page_mask = iotlb_entry->mask;
>>   goto out;
>> @@ -1996,20 +1997,20 @@ static bool
>vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>   return true;
>>   }
>>
>> -/* Try to fetch slpte form IOTLB for RID2PASID slow path */
>> +/* Try to fetch pte form IOTLB for RID2PASID slow path */
>
>s/form/from/. otherwise, looks good to me.

Will fix.

Thanks
Zhenzhong

>
>Reviewed-by: Yi Liu 
>
>>   if (rid2pasid) {
>>   iotlb_entry = vtd_lookup_iotlb(s, source_id, pasid, addr);
>>   if (iotlb_entry) {
>> -trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->slpte,
>> +trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->pte,
>>iotlb_entry->domain_id);
>> -slpte = iotlb_entry->slpte;
>> +pte = iotlb_entry->pte;
>>   access_flags = iotlb_entry->access_flags;
>>   page_mask = iotlb_entry->mask;
>>   goto out;
>>   }
>>   }
>>
>> -ret_fr = vtd_iova_to_slpte(s, &ce, addr, is_write, &slpte, &level,
>> +ret_fr = vtd_iova_to_slpte(s, &ce, addr, is_write, &pte, &level,
>>  &reads, &writes, s->aw_bits, pasid);
>>   if (ret_fr) {
>>   vtd_report_fault(s, -ret_fr, is_fpd_set, source_id,
>> @@ -2017,14 +2018,14 @@ static bool
>vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>   goto error;
>>   }
>>
>> -page_mask = vtd_slpt_level_page_mask(level);
>> +page_mask = vtd_pt_level_page_mask(level);
>>   access_flags = IOMMU_ACCESS_FLAG(reads, writes);
>>   vtd_update_iotlb(s, source_id, vtd_get_domain_id(s, &ce, pasid),
>> - addr, slpte, access_flags, level, pasid);
>> + addr, pte, access_flags, level, pasid);
>>   out:
>>   vtd_iommu_unlock(s);
>>   entry->iova = addr & page_mask;
>> -entry->translated_addr = vtd_get_slpte_addr(slpte, s->aw_bits) &
>page_mask;
>> +entry->translated_addr = vtd_get_pte_addr(pte, s->aw_bits) &
>page_mask;
>>   entry->addr_mask = ~page_mask;
>>   entry->perm = access_flags;
>>   return true;
>
>--
>Regards,
>Yi Liu

RE: [PATCH v3 14/17] intel_iommu: Set default aw_bits to 48 in scalable modern mode

2024-09-28 Thread Duan, Zhenzhong



>-Original Message-
>From: Jason Wang 
>Subject: Re: [PATCH v3 14/17] intel_iommu: Set default aw_bits to 48 in
>scalable modern mode
>
>On Fri, Sep 27, 2024 at 2:39 PM Duan, Zhenzhong
> wrote:
>>
>>
>>
>> >-Original Message-
>> >From: Jason Wang 
>> >Subject: Re: [PATCH v3 14/17] intel_iommu: Set default aw_bits to 48 in
>> >scalable modern mode
>> >
>> >On Wed, Sep 11, 2024 at 1:27 PM Zhenzhong Duan
>> > wrote:
>> >>
>> >> According to VTD spec, stage-1 page table could support 4-level and
>> >> 5-level paging.
>> >>
>> >> However, 5-level paging translation emulation is unsupported yet.
>> >> That means the only supported value for aw_bits is 48.
>> >>
>> >> So default aw_bits to 48 in scalable modern mode. In other cases,
>> >> it is still default to 39 for compatibility.
>> >>
>> >> Add a check to ensure user specified value is 48 in modern mode
>> >> for now.
>> >>
>> >> Signed-off-by: Zhenzhong Duan 
>> >> Reviewed-by: Clément Mathieu--Drifd...@eviden.com>
>> >> ---
>> >>  include/hw/i386/intel_iommu.h |  2 +-
>> >>  hw/i386/intel_iommu.c | 10 +-
>> >>  2 files changed, 10 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/include/hw/i386/intel_iommu.h
>> >b/include/hw/i386/intel_iommu.h
>> >> index b843d069cc..48134bda11 100644
>> >> --- a/include/hw/i386/intel_iommu.h
>> >> +++ b/include/hw/i386/intel_iommu.h
>> >> @@ -45,7 +45,7 @@
>OBJECT_DECLARE_SIMPLE_TYPE(IntelIOMMUState,
>> >INTEL_IOMMU_DEVICE)
>> >>  #define DMAR_REG_SIZE   0x230
>> >>  #define VTD_HOST_AW_39BIT   39
>> >>  #define VTD_HOST_AW_48BIT   48
>> >> -#define VTD_HOST_ADDRESS_WIDTH  VTD_HOST_AW_39BIT
>> >> +#define VTD_HOST_AW_AUTO    0xff
>> >>  #define VTD_HAW_MASK(aw)    ((1ULL << (aw)) - 1)
>> >>
>> >>  #define DMAR_REPORT_F_INTR  (1)
>> >> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> >> index c25211ddaf..949f120456 100644
>> >> --- a/hw/i386/intel_iommu.c
>> >> +++ b/hw/i386/intel_iommu.c
>> >> @@ -3771,7 +3771,7 @@ static Property vtd_properties[] = {
>> >>  ON_OFF_AUTO_AUTO),
>> >>  DEFINE_PROP_BOOL("x-buggy-eim", IntelIOMMUState, buggy_eim,
>> >false),
>> >>  DEFINE_PROP_UINT8("aw-bits", IntelIOMMUState, aw_bits,
>> >> -  VTD_HOST_ADDRESS_WIDTH),
>> >> +  VTD_HOST_AW_AUTO),
>> >
>> >Such a command line API seems weird.
>> >
>> >I think we can stick to the current default, and when scalable modern is
>> >enabled but aw is not specified, we can change aw to 48?
>>
>> Current default is 39. I use VTD_HOST_AW_AUTO to initialize aw as not
>specified.
>
>If I read the code correctly, aw=0xff means "auto". This seems a
>little bit weird.
>
>And even if we change it to auto, we need deal with the migration
>compatibility that stick 39 for old machine types.

0xff isn't the final value. vtd_decide_config() checks for 0xff and
performs the final initialization:

    if (s->aw_bits == VTD_HOST_AW_AUTO) {
        if (s->scalable_modern) {
            s->aw_bits = VTD_HOST_AW_48BIT;
        } else {
            s->aw_bits = VTD_HOST_AW_39BIT;
        }
    }

If an old machine type forces aw to 39, the code above is bypassed and
39 is kept.

>
>> Do we have other way to catch the update if we stick to 39?
>
>I meant I don't understand if there will be any issue if we keep use
>39 as default. Or I may not get the point of this question.

If we default aw to 39, there is no way to tell whether it is a user-forced
value, which we must keep, or the initial default, which we are free to change.
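
To make the point concrete, here is a minimal standalone sketch of the sentinel approach. The function name resolve_aw_bits() is hypothetical; in the patch this logic lives inline in vtd_decide_config(). The separate check that rejects values other than 48 in modern mode is left out on purpose.

```c
#include <stdbool.h>
#include <stdint.h>

#define VTD_HOST_AW_39BIT 39
#define VTD_HOST_AW_48BIT 48
#define VTD_HOST_AW_AUTO  0xff

/*
 * Hypothetical helper: pick the final address width.
 * Only the sentinel is rewritten; an explicit value (from the user or from
 * a machine-type compat property) is returned untouched.
 */
static inline uint8_t resolve_aw_bits(uint8_t aw_bits, bool scalable_modern)
{
    if (aw_bits == VTD_HOST_AW_AUTO) {
        return scalable_modern ? VTD_HOST_AW_48BIT : VTD_HOST_AW_39BIT;
    }
    return aw_bits;
}
```

With a plain default of 39 instead of the sentinel, the call resolve_aw_bits(39, true) could not know whether 39 was requested or merely inherited, which is the ambiguity described above.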

Thanks
Zhenzhong

RE: [PATCH v3 15/17] intel_iommu: Modify x-scalable-mode to be string option to expose scalable modern mode

2024-09-28 Thread Duan, Zhenzhong



>-Original Message-
>From: Jason Wang 
>Subject: Re: [PATCH v3 15/17] intel_iommu: Modify x-scalable-mode to be
>string option to expose scalable modern mode
>
>On Fri, Sep 27, 2024 at 2:39 PM Duan, Zhenzhong
> wrote:
>>
>>
>>
>> >-Original Message-
>> >From: Jason Wang 
>> >Subject: Re: [PATCH v3 15/17] intel_iommu: Modify x-scalable-mode to
>be
>> >string option to expose scalable modern mode
>> >
>> >On Wed, Sep 11, 2024 at 1:27 PM Zhenzhong Duan
>> > wrote:
>> >>
>> >> From: Yi Liu 
>> >>
>> >> Intel VT-d 3.0 introduces scalable mode, and it has a bunch of
>capabilities
>> >> related to scalable mode translation, thus there are multiple
>combinations.
>> >> While this vIOMMU implementation wants to simplify it for user by
>> >providing
>> >> typical combinations. User could config it by "x-scalable-mode" option.
>The
>> >> usage is as below:
>> >>
>> >> "-device intel-iommu,x-scalable-mode=["legacy"|"modern"|"off"]"
>> >>
>> >>  - "legacy": gives support for stage-2 page table
>> >>  - "modern": gives support for stage-1 page table
>> >>  - "off": no scalable mode support
>> >>  - any other string, will throw error
>> >
>> >Though we had "x" prefix, I wonder if this is the best option for
>> >enabling scalable-modern mode since the "on" is illegal after this
>> >change.
>>
>> Yes, I was thinking "x" means not stable user interface yet.
>> But I do agree with you it's better to keep stable user interface whenever
>possible.
>>
>> >
>> >Maybe it's better to just have an "x-fls". Or if we considering the
>> >scalable mode is kind of complete, it's time to get rid of "x" prefix.
>>
>> Ah, I thought this is a question only maintainers and reviewers can decide
>if it's complete.
>> If no voice on that, I'd like to add "x-fls" as you suggested and keep x-
>scalable-mode unchanged.
>
>A question here:
>
>Are there any other major features that are still lacking for scalable
>mode? If not, maybe we can get rid of the "x" prefix?

We don't yet support emulating stage-1 and stage-2 coexistence, or nested
translation through stage-1 and stage-2.

Currently we support only stage-1 or only stage-2 in scalable mode. One
reason is that supporting stage-1 is enough for current usage; the other
is to simplify the nesting series
https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_rfcv2
for review.

Thanks
Zhenzhong

RE: [PATCH v3 12/17] intel_iommu: Add support for PASID-based device IOTLB invalidation

2024-09-28 Thread Duan, Zhenzhong



>-Original Message-
>From: Jason Wang 
>Subject: Re: [PATCH v3 12/17] intel_iommu: Add support for PASID-based
>device IOTLB invalidation
>
>On Fri, Sep 27, 2024 at 3:18 PM Duan, Zhenzhong
> wrote:
>>
>>
>>
>> >-Original Message-
>> >From: Jason Wang 
>> >Subject: Re: [PATCH v3 12/17] intel_iommu: Add support for PASID-
>based
>> >device IOTLB invalidation
>> >
>> >On Wed, Sep 11, 2024 at 1:27 PM Zhenzhong Duan
>> > wrote:
>> >>
>> >> From: Clément Mathieu--Drif 
>> >>
>> >> Signed-off-by: Clément Mathieu--Drif d...@eviden.com>
>> >> Signed-off-by: Zhenzhong Duan 
>> >> ---
>> >>  hw/i386/intel_iommu_internal.h | 11 
>> >>  hw/i386/intel_iommu.c  | 50
>> >++
>> >>  2 files changed, 61 insertions(+)
>> >>
>> >> diff --git a/hw/i386/intel_iommu_internal.h
>> >b/hw/i386/intel_iommu_internal.h
>> >> index 4f2c3a9350..52bdbf3bc5 100644
>> >> --- a/hw/i386/intel_iommu_internal.h
>> >> +++ b/hw/i386/intel_iommu_internal.h
>> >> @@ -375,6 +375,7 @@ typedef union VTDInvDesc VTDInvDesc;
>> >>  #define VTD_INV_DESC_WAIT   0x5 /* Invalidation Wait
>Descriptor
>> >*/
>> >>  #define VTD_INV_DESC_PIOTLB 0x6 /* PASID-IOTLB Invalidate
>Desc
>> >*/
>> >>  #define VTD_INV_DESC_PC 0x7 /* PASID-cache Invalidate
>Desc */
>> >> +#define VTD_INV_DESC_DEV_PIOTLB 0x8 /* PASID-based-DIOTLB
>> >inv_desc*/
>> >>  #define VTD_INV_DESC_NONE   0   /* Not an Invalidate
>Descriptor
>> >*/
>> >>
>> >>  /* Masks for Invalidation Wait Descriptor*/
>> >> @@ -413,6 +414,16 @@ typedef union VTDInvDesc VTDInvDesc;
>> >>  #define VTD_INV_DESC_DEVICE_IOTLB_RSVD_HI 0xffeULL
>> >>  #define VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO 0xffe0fff8
>> >>
>> >> +/* Mask for PASID Device IOTLB Invalidate Descriptor */
>> >> +#define VTD_INV_DESC_PASID_DEVICE_IOTLB_ADDR(val) ((val) & \
>> >> +   0xf000ULL)
>> >> +#define VTD_INV_DESC_PASID_DEVICE_IOTLB_SIZE(val) ((val >> 11) &
>0x1)
>> >> +#define VTD_INV_DESC_PASID_DEVICE_IOTLB_GLOBAL(val) ((val) &
>0x1)
>> >> +#define VTD_INV_DESC_PASID_DEVICE_IOTLB_SID(val) (((val) >> 16) &
>> >0xULL)
>> >> +#define VTD_INV_DESC_PASID_DEVICE_IOTLB_PASID(val) ((val >> 32)
>&
>> >0xfULL)
>> >> +#define VTD_INV_DESC_PASID_DEVICE_IOTLB_RSVD_HI 0x7feULL
>> >> +#define VTD_INV_DESC_PASID_DEVICE_IOTLB_RSVD_LO
>> >0xfff0f000ULL
>> >> +
>> >>  /* Rsvd field masks for spte */
>> >>  #define VTD_SPTE_SNP 0x800ULL
>> >>
>> >> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> >> index d28c862598..4cf56924e1 100644
>> >> --- a/hw/i386/intel_iommu.c
>> >> +++ b/hw/i386/intel_iommu.c
>> >> @@ -3017,6 +3017,49 @@ static void
>> >do_invalidate_device_tlb(VTDAddressSpace *vtd_dev_as,
>> >>  memory_region_notify_iommu(&vtd_dev_as->iommu, 0, event);
>> >>  }
>> >>
>> >> +static bool vtd_process_device_piotlb_desc(IntelIOMMUState *s,
>> >> +   VTDInvDesc *inv_desc)
>> >> +{
>> >> +uint16_t sid;
>> >> +VTDAddressSpace *vtd_dev_as;
>> >> +bool size;
>> >> +bool global;
>> >> +hwaddr addr;
>> >> +uint32_t pasid;
>> >> +
>> >> +if ((inv_desc->hi & VTD_INV_DESC_PASID_DEVICE_IOTLB_RSVD_HI)
>||
>> >> + (inv_desc->lo & VTD_INV_DESC_PASID_DEVICE_IOTLB_RSVD_LO))
>{
>> >> +error_report_once("%s: invalid pasid-based dev iotlb inv desc:"
>> >> +  "hi=%"PRIx64 "(reserved nonzero)",
>> >> +  __func__, inv_desc->hi);
>> >> +return false;
>> >> +}
>> >> +
>> >> +global = VTD_INV_DESC_PASID_DEVICE_IOTLB_GLOBAL(inv_desc-
>>hi);
>> >> +size = VTD_INV_DESC_PASID_DEVICE_IOTLB_SIZE(inv_desc->hi);
>> >> +addr = VTD_INV_DESC_PASID_DEVICE_IOTLB_ADDR(inv_desc->hi);
>> >&

RE: [PATCH v3 12/17] intel_iommu: Add support for PASID-based device IOTLB invalidation

2024-09-27 Thread Duan, Zhenzhong



>-Original Message-
>From: Duan, Zhenzhong
>Subject: RE: [PATCH v3 12/17] intel_iommu: Add support for PASID-based
>device IOTLB invalidation
>
>
>
>>-Original Message-
>>From: Jason Wang 
>>Subject: Re: [PATCH v3 12/17] intel_iommu: Add support for PASID-based
>>device IOTLB invalidation
>>
>>On Wed, Sep 11, 2024 at 1:27 PM Zhenzhong Duan
>> wrote:
>>>
>>> From: Clément Mathieu--Drif 
>>>
>>> Signed-off-by: Clément Mathieu--Drif d...@eviden.com>
>>> Signed-off-by: Zhenzhong Duan 
>>> ---
>>>  hw/i386/intel_iommu_internal.h | 11 
>>>  hw/i386/intel_iommu.c  | 50
>>++
>>>  2 files changed, 61 insertions(+)
>>>
>>> diff --git a/hw/i386/intel_iommu_internal.h
>>b/hw/i386/intel_iommu_internal.h
>>> index 4f2c3a9350..52bdbf3bc5 100644
>>> --- a/hw/i386/intel_iommu_internal.h
>>> +++ b/hw/i386/intel_iommu_internal.h
>>> @@ -375,6 +375,7 @@ typedef union VTDInvDesc VTDInvDesc;
>>>  #define VTD_INV_DESC_WAIT   0x5 /* Invalidation Wait
>Descriptor
>>*/
>>>  #define VTD_INV_DESC_PIOTLB 0x6 /* PASID-IOTLB Invalidate
>Desc
>>*/
>>>  #define VTD_INV_DESC_PC 0x7 /* PASID-cache Invalidate Desc
>*/
>>> +#define VTD_INV_DESC_DEV_PIOTLB 0x8 /* PASID-based-DIOTLB
>>inv_desc*/
>>>  #define VTD_INV_DESC_NONE   0   /* Not an Invalidate Descriptor
>>*/
>>>
>>>  /* Masks for Invalidation Wait Descriptor*/
>>> @@ -413,6 +414,16 @@ typedef union VTDInvDesc VTDInvDesc;
>>>  #define VTD_INV_DESC_DEVICE_IOTLB_RSVD_HI 0xffeULL
>>>  #define VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO 0xffe0fff8
>>>
>>> +/* Mask for PASID Device IOTLB Invalidate Descriptor */
>>> +#define VTD_INV_DESC_PASID_DEVICE_IOTLB_ADDR(val) ((val) & \
>>> +   0xf000ULL)
>>> +#define VTD_INV_DESC_PASID_DEVICE_IOTLB_SIZE(val) ((val >> 11) &
>0x1)
>>> +#define VTD_INV_DESC_PASID_DEVICE_IOTLB_GLOBAL(val) ((val) & 0x1)
>>> +#define VTD_INV_DESC_PASID_DEVICE_IOTLB_SID(val) (((val) >> 16) &
>>0xULL)
>>> +#define VTD_INV_DESC_PASID_DEVICE_IOTLB_PASID(val) ((val >> 32) &
>>0xfULL)
>>> +#define VTD_INV_DESC_PASID_DEVICE_IOTLB_RSVD_HI 0x7feULL
>>> +#define VTD_INV_DESC_PASID_DEVICE_IOTLB_RSVD_LO
>>0xfff0f000ULL
>>> +
>>>  /* Rsvd field masks for spte */
>>>  #define VTD_SPTE_SNP 0x800ULL
>>>
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index d28c862598..4cf56924e1 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -3017,6 +3017,49 @@ static void
>>do_invalidate_device_tlb(VTDAddressSpace *vtd_dev_as,
>>>  memory_region_notify_iommu(&vtd_dev_as->iommu, 0, event);
>>>  }
>>>
>>> +static bool vtd_process_device_piotlb_desc(IntelIOMMUState *s,
>>> +   VTDInvDesc *inv_desc)
>>> +{
>>> +uint16_t sid;
>>> +VTDAddressSpace *vtd_dev_as;
>>> +bool size;
>>> +bool global;
>>> +hwaddr addr;
>>> +uint32_t pasid;
>>> +
>>> +if ((inv_desc->hi & VTD_INV_DESC_PASID_DEVICE_IOTLB_RSVD_HI) ||
>>> + (inv_desc->lo & VTD_INV_DESC_PASID_DEVICE_IOTLB_RSVD_LO)) {
>>> +error_report_once("%s: invalid pasid-based dev iotlb inv desc:"
>>> +  "hi=%"PRIx64 "(reserved nonzero)",
>>> +  __func__, inv_desc->hi);
>>> +return false;
>>> +}
>>> +
>>> +global = VTD_INV_DESC_PASID_DEVICE_IOTLB_GLOBAL(inv_desc->hi);
>>> +size = VTD_INV_DESC_PASID_DEVICE_IOTLB_SIZE(inv_desc->hi);
>>> +addr = VTD_INV_DESC_PASID_DEVICE_IOTLB_ADDR(inv_desc->hi);
>>> +sid = VTD_INV_DESC_PASID_DEVICE_IOTLB_SID(inv_desc->lo);
>>> +if (global) {
>>> +QLIST_FOREACH(vtd_dev_as, &s->vtd_as_with_notifiers, next) {
>>> +if ((vtd_dev_as->pasid != PCI_NO_PASID) &&
>>> +(PCI_BUILD_BDF(pci_bus_num(vtd_dev_as->bus),
>>> +   vtd_dev_as->devfn) == sid)) {
>>> +do_invalidate_device_tlb(vtd_dev_as, size, addr);
>>> +

RE: [PATCH v3 12/17] intel_iommu: Add support for PASID-based device IOTLB invalidation

2024-09-27 Thread Duan, Zhenzhong



>-Original Message-
>From: Jason Wang 
>Subject: Re: [PATCH v3 12/17] intel_iommu: Add support for PASID-based
>device IOTLB invalidation
>
>On Wed, Sep 11, 2024 at 1:27 PM Zhenzhong Duan
> wrote:
>>
>> From: Clément Mathieu--Drif 
>>
>> Signed-off-by: Clément Mathieu--Drif 
>> Signed-off-by: Zhenzhong Duan 
>> ---
>>  hw/i386/intel_iommu_internal.h | 11 
>>  hw/i386/intel_iommu.c  | 50
>++
>>  2 files changed, 61 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index 4f2c3a9350..52bdbf3bc5 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -375,6 +375,7 @@ typedef union VTDInvDesc VTDInvDesc;
>>  #define VTD_INV_DESC_WAIT   0x5 /* Invalidation Wait Descriptor
>*/
>>  #define VTD_INV_DESC_PIOTLB 0x6 /* PASID-IOTLB Invalidate Desc
>*/
>>  #define VTD_INV_DESC_PC 0x7 /* PASID-cache Invalidate Desc 
>> */
>> +#define VTD_INV_DESC_DEV_PIOTLB 0x8 /* PASID-based-DIOTLB
>inv_desc*/
>>  #define VTD_INV_DESC_NONE   0   /* Not an Invalidate Descriptor
>*/
>>
>>  /* Masks for Invalidation Wait Descriptor*/
>> @@ -413,6 +414,16 @@ typedef union VTDInvDesc VTDInvDesc;
>>  #define VTD_INV_DESC_DEVICE_IOTLB_RSVD_HI 0xffeULL
>>  #define VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO 0xffe0fff8
>>
>> +/* Mask for PASID Device IOTLB Invalidate Descriptor */
>> +#define VTD_INV_DESC_PASID_DEVICE_IOTLB_ADDR(val) ((val) & \
>> +   0xf000ULL)
>> +#define VTD_INV_DESC_PASID_DEVICE_IOTLB_SIZE(val) ((val >> 11) & 0x1)
>> +#define VTD_INV_DESC_PASID_DEVICE_IOTLB_GLOBAL(val) ((val) & 0x1)
>> +#define VTD_INV_DESC_PASID_DEVICE_IOTLB_SID(val) (((val) >> 16) &
>0xULL)
>> +#define VTD_INV_DESC_PASID_DEVICE_IOTLB_PASID(val) ((val >> 32) &
>0xfULL)
>> +#define VTD_INV_DESC_PASID_DEVICE_IOTLB_RSVD_HI 0x7feULL
>> +#define VTD_INV_DESC_PASID_DEVICE_IOTLB_RSVD_LO
>0xfff0f000ULL
>> +
>>  /* Rsvd field masks for spte */
>>  #define VTD_SPTE_SNP 0x800ULL
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index d28c862598..4cf56924e1 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -3017,6 +3017,49 @@ static void
>do_invalidate_device_tlb(VTDAddressSpace *vtd_dev_as,
>>  memory_region_notify_iommu(&vtd_dev_as->iommu, 0, event);
>>  }
>>
>> +static bool vtd_process_device_piotlb_desc(IntelIOMMUState *s,
>> +   VTDInvDesc *inv_desc)
>> +{
>> +uint16_t sid;
>> +VTDAddressSpace *vtd_dev_as;
>> +bool size;
>> +bool global;
>> +hwaddr addr;
>> +uint32_t pasid;
>> +
>> +if ((inv_desc->hi & VTD_INV_DESC_PASID_DEVICE_IOTLB_RSVD_HI) ||
>> + (inv_desc->lo & VTD_INV_DESC_PASID_DEVICE_IOTLB_RSVD_LO)) {
>> +error_report_once("%s: invalid pasid-based dev iotlb inv desc:"
>> +  "hi=%"PRIx64 "(reserved nonzero)",
>> +  __func__, inv_desc->hi);
>> +return false;
>> +}
>> +
>> +global = VTD_INV_DESC_PASID_DEVICE_IOTLB_GLOBAL(inv_desc->hi);
>> +size = VTD_INV_DESC_PASID_DEVICE_IOTLB_SIZE(inv_desc->hi);
>> +addr = VTD_INV_DESC_PASID_DEVICE_IOTLB_ADDR(inv_desc->hi);
>> +sid = VTD_INV_DESC_PASID_DEVICE_IOTLB_SID(inv_desc->lo);
>> +if (global) {
>> +QLIST_FOREACH(vtd_dev_as, &s->vtd_as_with_notifiers, next) {
>> +if ((vtd_dev_as->pasid != PCI_NO_PASID) &&
>> +(PCI_BUILD_BDF(pci_bus_num(vtd_dev_as->bus),
>> +   vtd_dev_as->devfn) == sid)) {
>> +do_invalidate_device_tlb(vtd_dev_as, size, addr);
>> +}
>> +}
>> +} else {
>> +pasid = VTD_INV_DESC_PASID_DEVICE_IOTLB_PASID(inv_desc->lo);
>> +vtd_dev_as = vtd_get_as_by_sid_and_pasid(s, sid, pasid);
>> +if (!vtd_dev_as) {
>> +return true;
>> +}
>> +
>> +do_invalidate_device_tlb(vtd_dev_as, size, addr);
>
>Question:
>
>I wonder if current vhost (which has a device IOTLB abstraction via
>virtio-pci) can work with this (PASID based IOTLB invalidation)

Currently it depends on whether caching-mode is on. If it's off, vhost works. E.g.:

-device intel-iommu,caching-mode=off,dma-drain=on,device-iotlb=on,x-scalable-mode=on
-netdev tap,id=tap0,vhost=on,script=/etc/qemu-ifup
-device virtio-net-pci,netdev=tap0,bus=root0,iommu_platform=on,ats=on

It currently doesn't work when caching-mode is on.
The reason is that the Linux kernel has an optimization that sends only the
piotlb invalidation; no device-piotlb invalidation is sent. But I heard from
Yi that the optimization will be dropped, and then it will work with
caching-mode on too.

Thanks
Zhenzhong

RE: [PATCH v3 16/17] intel_iommu: Introduce a property to control FS1GP cap bit setting

2024-09-27 Thread Duan, Zhenzhong



>-Original Message-
>From: Jason Wang 
>Subject: Re: [PATCH v3 16/17] intel_iommu: Introduce a property to control
>FS1GP cap bit setting
>
>On Wed, Sep 11, 2024 at 1:27 PM Zhenzhong Duan
> wrote:
>>
>> This gives user flexibility to turn off FS1GP for debug purpose.
>>
>> It is also useful for future nesting feature. When host IOMMU doesn't
>> support FS1GP but vIOMMU does, nested page table on host side works
>> after turn FS1GP off in vIOMMU.
>>
>> This property has no effect when vIOMMU isn't in scalable modern
>> mode.
>
>It looks to me there's no need to have an "x" prefix for this.

Will remove "x" prefix.

Thanks
Zhenzhong

>
>Other looks good.
>
>>
>> Signed-off-by: Zhenzhong Duan 
>> Reviewed-by: Clément Mathieu--Drif
>> ---
>>  include/hw/i386/intel_iommu.h | 1 +
>>  hw/i386/intel_iommu.c | 5 -
>>  2 files changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/hw/i386/intel_iommu.h
>b/include/hw/i386/intel_iommu.h
>> index 650641544c..f6d9b41b80 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -308,6 +308,7 @@ struct IntelIOMMUState {
>>  bool dma_drain; /* Whether DMA r/w draining enabled */
>>  bool dma_translation;   /* Whether DMA translation supported */
>>  bool pasid; /* Whether to support PASID */
>> +bool fs1gp; /* First Stage 1-GByte Page Support */
>>
>>  /*
>>   * Protects IOMMU states in general.  Currently it protects the
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index bb3ed48281..8b40aace8b 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -3779,6 +3779,7 @@ static Property vtd_properties[] = {
>>  DEFINE_PROP_BOOL("x-pasid-mode", IntelIOMMUState, pasid, false),
>>  DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
>>  DEFINE_PROP_BOOL("dma-translation", IntelIOMMUState,
>dma_translation, true),
>> +DEFINE_PROP_BOOL("x-cap-fs1gp", IntelIOMMUState, fs1gp, true),
>>  DEFINE_PROP_END_OF_LIST(),
>>  };
>>
>> @@ -4507,7 +4508,9 @@ static void vtd_cap_init(IntelIOMMUState *s)
>>  /* TODO: read cap/ecap from host to decide which cap to be exposed.
>*/
>>  if (s->scalable_modern) {
>>  s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_FLTS;
>> -s->cap |= VTD_CAP_FS1GP;
>> +if (s->fs1gp) {
>> +s->cap |= VTD_CAP_FS1GP;
>> +}
>>  } else if (s->scalable_mode) {
>>  s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
>>  }
>> --
>> 2.34.1
>>
>
>Thanks

RE: [PATCH v3 15/17] intel_iommu: Modify x-scalable-mode to be string option to expose scalable modern mode

2024-09-26 Thread Duan, Zhenzhong



>-Original Message-
>From: Jason Wang 
>Subject: Re: [PATCH v3 15/17] intel_iommu: Modify x-scalable-mode to be
>string option to expose scalable modern mode
>
>On Wed, Sep 11, 2024 at 1:27 PM Zhenzhong Duan
> wrote:
>>
>> From: Yi Liu 
>>
>> Intel VT-d 3.0 introduces scalable mode, which has a bunch of capabilities
>> related to scalable mode translation, so there are multiple combinations.
>> This vIOMMU implementation simplifies things for the user by providing
>> typical combinations. The user can configure it with the "x-scalable-mode"
>> option. The usage is as below:
>>
>> "-device intel-iommu,x-scalable-mode=["legacy"|"modern"|"off"]"
>>
>>  - "legacy": gives support for stage-2 page table
>>  - "modern": gives support for stage-1 page table
>>  - "off": no scalable mode support
>>  - any other string, will throw error
>
>Though we have the "x" prefix, I wonder if this is the best option for
>enabling scalable-modern mode, since "on" becomes illegal after this
>change.

Yes, I was thinking "x" means the user interface is not stable yet.
But I do agree with you that it's better to keep the user interface stable
whenever possible.

>
>Maybe it's better to just have an "x-fls". Or if we consider that
>scalable mode is kind of complete, it's time to get rid of the "x" prefix.

Ah, I thought whether it's complete is a question only maintainers and
reviewers can decide.
If there are no other opinions on that, I'd like to add "x-fls" as you
suggested and keep x-scalable-mode unchanged.

Thanks
Zhenzhong
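
To make the concern about "on" concrete, the string option described in the quoted commit message could be parsed roughly as below. This is a hedged sketch, not the patch's code: enum sm_choice and parse_scalable_mode() are hypothetical names introduced here, and only the three accepted strings come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical result type for the three documented values. */
enum sm_choice { SM_OFF, SM_LEGACY, SM_MODERN };

/*
 * Returns true and stores the choice on success. Returns false for any
 * other string, which a caller would turn into a user-visible error.
 */
static bool parse_scalable_mode(const char *val, enum sm_choice *out)
{
    assert(val != NULL && out != NULL);
    if (strcmp(val, "off") == 0) {
        *out = SM_OFF;
    } else if (strcmp(val, "legacy") == 0) {
        *out = SM_LEGACY;
    } else if (strcmp(val, "modern") == 0) {
        *out = SM_MODERN;
    } else {
        return false;   /* this includes "on", the case raised in review */
    }
    return true;
}
```

Under this scheme an existing command line using x-scalable-mode=on would be rejected, which is why keeping x-scalable-mode boolean and adding a separate knob was preferred.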

RE: [PATCH v3 14/17] intel_iommu: Set default aw_bits to 48 in scalable modern mode

2024-09-26 Thread Duan, Zhenzhong



>-Original Message-
>From: Jason Wang 
>Subject: Re: [PATCH v3 14/17] intel_iommu: Set default aw_bits to 48 in
>scalable modern mode
>
>On Wed, Sep 11, 2024 at 1:27 PM Zhenzhong Duan
> wrote:
>>
>> According to VTD spec, stage-1 page table could support 4-level and
>> 5-level paging.
>>
>> However, 5-level paging translation emulation is not supported yet.
>> That means the only supported value for aw_bits is 48.
>>
>> So default aw_bits to 48 in scalable modern mode. In other cases,
>> it still defaults to 39 for compatibility.
>>
>> Add a check to ensure user specified value is 48 in modern mode
>> for now.
>>
>> Signed-off-by: Zhenzhong Duan 
>> Reviewed-by: Clément Mathieu--Drif
>> ---
>>  include/hw/i386/intel_iommu.h |  2 +-
>>  hw/i386/intel_iommu.c | 10 +-
>>  2 files changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/hw/i386/intel_iommu.h
>b/include/hw/i386/intel_iommu.h
>> index b843d069cc..48134bda11 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -45,7 +45,7 @@ OBJECT_DECLARE_SIMPLE_TYPE(IntelIOMMUState,
>INTEL_IOMMU_DEVICE)
>>  #define DMAR_REG_SIZE   0x230
>>  #define VTD_HOST_AW_39BIT   39
>>  #define VTD_HOST_AW_48BIT   48
>> -#define VTD_HOST_ADDRESS_WIDTH  VTD_HOST_AW_39BIT
>> +#define VTD_HOST_AW_AUTO0xff
>>  #define VTD_HAW_MASK(aw)((1ULL << (aw)) - 1)
>>
>>  #define DMAR_REPORT_F_INTR  (1)
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index c25211ddaf..949f120456 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -3771,7 +3771,7 @@ static Property vtd_properties[] = {
>>  ON_OFF_AUTO_AUTO),
>>  DEFINE_PROP_BOOL("x-buggy-eim", IntelIOMMUState, buggy_eim,
>false),
>>  DEFINE_PROP_UINT8("aw-bits", IntelIOMMUState, aw_bits,
>> -  VTD_HOST_ADDRESS_WIDTH),
>> +  VTD_HOST_AW_AUTO),
>
>Such a command line API seems weird.
>
>I think we can stick to the current default, and when scalable modern is
>enabled but aw is not specified, we can change aw to 48?

Current default is 39. I use VTD_HOST_AW_AUTO to mark aw as not specified.
Is there another way to tell that the user left it unset if we stick to 39?

Thanks
Zhenzhong

>
>>  DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode,
>FALSE),
>>  DEFINE_PROP_BOOL("x-scalable-mode", IntelIOMMUState,
>scalable_mode, FALSE),
>>  DEFINE_PROP_BOOL("snoop-control", IntelIOMMUState, snoop_control,
>false),
>> @@ -4686,6 +4686,14 @@ static bool
>vtd_decide_config(IntelIOMMUState *s, Error **errp)
>>  }
>>  }
>>
>> +if (s->aw_bits == VTD_HOST_AW_AUTO) {
>> +if (s->scalable_modern) {
>> +s->aw_bits = VTD_HOST_AW_48BIT;
>> +} else {
>> +s->aw_bits = VTD_HOST_AW_39BIT;
>> +}
>> +}
>> +
>>  if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
>>  (s->aw_bits != VTD_HOST_AW_48BIT) &&
>>  !s->scalable_modern) {
>> --
>> 2.34.1
>>
>
>Thanks
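
The sentinel approach discussed above can be sketched on its own as follows. The constant values match the ones in the quoted diff (39, 48, 0xff); the function name resolve_aw_bits() is hypothetical and exists only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define AW_39BIT  39
#define AW_48BIT  48
#define AW_AUTO   0xff   /* sentinel: the user did not specify aw-bits */

/*
 * Resolve the effective address width. An explicit user value is returned
 * unchanged so that later validation can still reject it; only the sentinel
 * is replaced by a mode-dependent default.
 */
static uint8_t resolve_aw_bits(uint8_t aw_bits, bool scalable_modern)
{
    if (aw_bits != AW_AUTO) {
        return aw_bits;
    }
    return scalable_modern ? AW_48BIT : AW_39BIT;
}
```

The point of the sentinel is that a plain default of 39 cannot be told apart from a user who explicitly asked for 39, so the default could not be raised to 48 in modern mode without overriding an explicit choice.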

RE: [PATCH v3 08/17] intel_iommu: Set accessed and dirty bits during first stage translation

2024-09-26 Thread Duan, Zhenzhong



>-Original Message-
>From: Jason Wang 
>Subject: Re: [PATCH v3 08/17] intel_iommu: Set accessed and dirty bits
>during first stage translation
>
>On Wed, Sep 11, 2024 at 1:26 PM Zhenzhong Duan
> wrote:
>>
>> From: Clément Mathieu--Drif 
>>
>> Signed-off-by: Clément Mathieu--Drif 
>> Signed-off-by: Zhenzhong Duan 
>> ---
>>  hw/i386/intel_iommu_internal.h |  3 +++
>>  hw/i386/intel_iommu.c  | 25 -
>>  2 files changed, 27 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index 668583aeca..7786ef7624 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -324,6 +324,7 @@ typedef enum VTDFaultReason {
>>
>>  /* Output address in the interrupt address range for scalable mode */
>>  VTD_FR_SM_INTERRUPT_ADDR = 0x87,
>> +VTD_FR_FS_BIT_UPDATE_FAILED = 0x91, /* SFS.10 */
>>  VTD_FR_MAX, /* Guard */
>>  } VTDFaultReason;
>>
>> @@ -549,6 +550,8 @@ typedef struct VTDRootEntry VTDRootEntry;
>>  /* Masks for First Level Paging Entry */
>>  #define VTD_FL_P1ULL
>>  #define VTD_FL_RW_MASK  (1ULL << 1)
>> +#define VTD_FL_A0x20
>> +#define VTD_FL_D0x40
>
>Nit: let's use the _MASK suffix for all of them or for none.

Will use:

#define VTD_FL_P1ULL
#define VTD_FL_RW_MASK  (1ULL << 1)
#define VTD_FL_A_MASK   (1ULL << 5)
#define VTD_FL_D_MASK   (1ULL << 6)

Thanks
Zhenzhong
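
As a quick check that the shifted forms are equivalent to the old literals, here is a small sketch. The mask names are shortened versions of the ones proposed above. The helper fl_mark_accessed() is hypothetical, and its rule (A on every access, D only on writes) is an assumption about the intended behaviour; it is deliberately non-atomic and only shows the bit arithmetic.

```c
#include <stdbool.h>
#include <stdint.h>

#define FL_P        1ULL
#define FL_RW_MASK  (1ULL << 1)
#define FL_A_MASK   (1ULL << 5)
#define FL_D_MASK   (1ULL << 6)

/* Compile-time check that the shifted forms equal the old literals. */
_Static_assert(FL_A_MASK == 0x20, "accessed bit is bit 5");
_Static_assert(FL_D_MASK == 0x40, "dirty bit is bit 6");

/*
 * Returns the entry value after an access: A is always set, D only for
 * writes. Plain bit arithmetic, no atomicity and no memory access.
 */
static uint64_t fl_mark_accessed(uint64_t pte, bool is_write)
{
    pte |= FL_A_MASK;
    if (is_write) {
        pte |= FL_D_MASK;
    }
    return pte;
}
```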

RE: [PATCH v3 04/17] intel_iommu: Flush stage-2 cache in PASID-selective PASID-based iotlb invalidation

2024-09-26 Thread Duan, Zhenzhong



>-Original Message-
>From: Jason Wang 
>Subject: Re: [PATCH v3 04/17] intel_iommu: Flush stage-2 cache in PASID-
>selective PASID-based iotlb invalidation
>
>On Wed, Sep 11, 2024 at 1:26 PM Zhenzhong Duan
> wrote:
>>
>> Per spec 6.5.2.4, PASID-selective PASID-based iotlb invalidation will
>> flush stage-2 iotlb entries with matching domain id and pasid.
>>
>> With scalable modern mode introduced, guest could send PASID-selective
>> PASID-based iotlb invalidation to flush both stage-1 and stage-2 entries.
>>
>> By this chance, remove old IOTLB related definition.
>
>Nit: if there's a respin we'd better say those definitions are unused.

Sure, will be:

"By this chance, remove old IOTLB related definitions which were unused."

Thanks
Zhenzhong

>
>Other than this
>
>Acked-by: Jason Wang 
>
>Thanks
>
>>
>> Signed-off-by: Zhenzhong Duan 
>> ---
>>  hw/i386/intel_iommu_internal.h | 14 +++---
>>  hw/i386/intel_iommu.c  | 81
>++
>>  2 files changed, 90 insertions(+), 5 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index 8fa27c7f3b..19e4ed52ca 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -402,11 +402,6 @@ typedef union VTDInvDesc VTDInvDesc;
>>  #define VTD_INV_DESC_IOTLB_AM(val)  ((val) & 0x3fULL)
>>  #define VTD_INV_DESC_IOTLB_RSVD_LO  0xff00ULL
>>  #define VTD_INV_DESC_IOTLB_RSVD_HI  0xf80ULL
>> -#define VTD_INV_DESC_IOTLB_PASID_PASID  (2ULL << 4)
>> -#define VTD_INV_DESC_IOTLB_PASID_PAGE   (3ULL << 4)
>> -#define VTD_INV_DESC_IOTLB_PASID(val)   (((val) >> 32) &
>VTD_PASID_ID_MASK)
>> -#define VTD_INV_DESC_IOTLB_PASID_RSVD_LO
>0xfff001c0ULL
>> -#define VTD_INV_DESC_IOTLB_PASID_RSVD_HI  0xf80ULL
>>
>>  /* Mask for Device IOTLB Invalidate Descriptor */
>>  #define VTD_INV_DESC_DEVICE_IOTLB_ADDR(val) ((val) &
>0xf000ULL)
>> @@ -438,6 +433,15 @@ typedef union VTDInvDesc VTDInvDesc;
>>  (0x3800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM |
>VTD_SL_TM)) : \
>>  (0x3800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
>>
>> +/* Masks for PIOTLB Invalidate Descriptor */
>> +#define VTD_INV_DESC_PIOTLB_G (3ULL << 4)
>> +#define VTD_INV_DESC_PIOTLB_ALL_IN_PASID  (2ULL << 4)
>> +#define VTD_INV_DESC_PIOTLB_PSI_IN_PASID  (3ULL << 4)
>> +#define VTD_INV_DESC_PIOTLB_DID(val)  (((val) >> 16) &
>VTD_DOMAIN_ID_MASK)
>> +#define VTD_INV_DESC_PIOTLB_PASID(val)(((val) >> 32) & 0xfULL)
>> +#define VTD_INV_DESC_PIOTLB_RSVD_VAL0 0xfff0f1c0ULL
>> +#define VTD_INV_DESC_PIOTLB_RSVD_VAL1 0xf80ULL
>> +
>>  /* Information about page-selective IOTLB invalidate */
>>  struct VTDIOTLBPageInvInfo {
>>  uint16_t domain_id;
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 57c24f67b4..be30caef31 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -2656,6 +2656,83 @@ static bool
>vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
>>  return true;
>>  }
>>
>> +static gboolean vtd_hash_remove_by_pasid(gpointer key, gpointer value,
>> + gpointer user_data)
>> +{
>> +VTDIOTLBEntry *entry = (VTDIOTLBEntry *)value;
>> +VTDIOTLBPageInvInfo *info = (VTDIOTLBPageInvInfo *)user_data;
>> +
>> +return ((entry->domain_id == info->domain_id) &&
>> +(entry->pasid == info->pasid));
>> +}
>> +
>> +static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
>> +uint16_t domain_id, uint32_t pasid)
>> +{
>> +VTDIOTLBPageInvInfo info;
>> +VTDAddressSpace *vtd_as;
>> +VTDContextEntry ce;
>> +
>> +info.domain_id = domain_id;
>> +info.pasid = pasid;
>> +
>> +vtd_iommu_lock(s);
>> +g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_pasid,
>> +&info);
>> +vtd_iommu_unlock(s);
>> +
>> +QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
>> +if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>> +  vtd_as->devfn, &ce) &&
>> +domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
>> +uint32_t rid2pasid = VTD_CE_GET_RID2PASID(&ce);
>> +
>> +if ((vtd_as->pasid != PCI_NO_PASID || pasid != rid2pasid) &&
>> +vtd_as->pasid != pasid) {
>> +continue;
>> +}
>> +
>> +if (!s->scalable_modern) {
>> +vtd_address_space_sync(vtd_as);
>> +}
>> +}
>> +}
>> +}
>> +
>> +static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
>> +VTDInvDesc *inv_desc)
>> +{
>> +uint16_t domain_id;
>> +uint32_t pasid;
>> +
>> +if ((inv_desc->val[0] & VTD_INV_DESC_PIOTLB_RSVD_VAL0) ||
>> +(inv_desc->val[1] & VTD_INV_DESC_PIOTLB_RSVD_VAL1)) {
>> +error_report_once("%s: invalid
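
The field layout implied by the new PIOTLB macros in the diff above can be sketched as below. The shift amounts (4, 16, 32) come from the quoted macros; the mask widths are assumptions for this sketch (a 16-bit domain id and a 20-bit PASID), and struct piotlb_fields and piotlb_decode() are hypothetical names.

```c
#include <stdint.h>

#define PIOTLB_G              (3ULL << 4)   /* granularity field */
#define PIOTLB_ALL_IN_PASID   (2ULL << 4)
#define PIOTLB_PSI_IN_PASID   (3ULL << 4)
#define DOMAIN_ID_MASK_16     0xffffULL     /* assumed 16-bit domain id */
#define PASID_MASK_20         0xfffffULL    /* assumed 20-bit PASID */

/* Hypothetical container for the fields of the low descriptor qword. */
struct piotlb_fields {
    uint64_t granularity;  /* PIOTLB_ALL_IN_PASID or PIOTLB_PSI_IN_PASID */
    uint16_t domain_id;
    uint32_t pasid;
};

/* Extract granularity, domain id and PASID from the low qword. */
static struct piotlb_fields piotlb_decode(uint64_t val0)
{
    struct piotlb_fields f;

    f.granularity = val0 & PIOTLB_G;
    f.domain_id = (uint16_t)((val0 >> 16) & DOMAIN_ID_MASK_16);
    f.pasid = (uint32_t)((val0 >> 32) & PASID_MASK_20);
    return f;
}
```

A handler would check the reserved bits first, as the quoted vtd_process_piotlb_desc() does, and then dispatch on the granularity value.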

Re: [PATCH v3 00/17] intel_iommu: Enable stage-1 translation for emulated device

2024-09-26 Thread Duan, Zhenzhong


Hi All,

Kindly ping, more comments are appreciated:)

Thanks
Zhenzhong

On 9/11/2024 1:22 PM, Zhenzhong Duan wrote:

Hi,

Per Jason Wang's suggestion, iommufd nesting series[1] is split into
"Enable stage-1 translation for emulated device" series and
"Enable stage-1 translation for passthrough device" series.

This series enables stage-1 translation support for emulated device
in intel iommu which we called "modern" mode.

PATCH1-5:  Some preparatory work before supporting stage-1 translation
PATCH6-8:  Implement stage-1 translation for emulated device
PATCH9-13: Emulate iotlb invalidation of stage-1 mapping
PATCH14:   Set default aw_bits to 48 in scalable modern mode
PATCH15-16:Expose scalable "modern" mode and "x-cap-fs1gp" to cmdline
PATCH17:   Add qtest

Note that spec revision 3.4 renames "First-level" to "First-stage" and
"Second-level" to "Second-stage". But the scalable mode was added
before that change, so we keep the old flavor, using
First-level/fl/Second-level/sl in code, but use stage-1/stage-2 in commit logs.
Keep in mind that First-level/fl/stage-1 all have the same meaning;
the same goes for Second-level/sl/stage-2.

Qemu code can be found at [2]
The whole nesting series can be found at [3]

[1] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02740.html
[2] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_stage1_emu_v3
[3] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_rfcv2

Thanks
Zhenzhong

Changelog:
v3:
- drop unnecessary !(s->ecap & VTD_ECAP_SMTS) (Clement)
- simplify calculation of return value for vtd_iova_fl_check_canonical() (Liuyi)
- make A/D bit setting atomic (Liuyi)
- refine error msg (Clement, Liuyi)

v2:
- check ecap/cap bits instead of s->scalable_modern in vtd_pe_type_check() 
(Clement)
- declare VTD_ECAP_FLTS/FS1GP after the feature is implemented (Clement)
- define VTD_INV_DESC_PIOTLB_G (Clement)
- make error msg consistent in vtd_process_piotlb_desc() (Clement)
- refine commit log in patch16 (Clement)
- add VTD_ECAP_IR to ECAP_MODERN_FIXED1 (Clement)
- add a knob x-cap-fs1gp to control stage-1 1G paging capability
- collect Clement's R-B

v1:
- define VTD_HOST_AW_AUTO (Clement)
- passing pgtt as a parameter to vtd_update_iotlb (Clement)
- prefix sl_/fl_ to second/first level specific functions (Clement)
- pick reserved bit check from Clement, add his Co-developed-by
- Update test without using libqtest-single.h (Thomas)

rfcv2:
- split from nesting series (Jason)
- merged some commits from Clement
- add qtest (jason)


Clément Mathieu--Drif (4):
   intel_iommu: Check if the input address is canonical
   intel_iommu: Set accessed and dirty bits during first stage
 translation
   intel_iommu: Add an internal API to find an address space with PASID
   intel_iommu: Add support for PASID-based device IOTLB invalidation

Yi Liu (3):
   intel_iommu: Rename slpte to pte
   intel_iommu: Implement stage-1 translation
   intel_iommu: Modify x-scalable-mode to be string option to expose
 scalable modern mode

Yu Zhang (1):
   intel_iommu: Use the latest fault reasons defined by spec

Zhenzhong Duan (9):
   intel_iommu: Make pasid entry type check accurate
   intel_iommu: Add a placeholder variable for scalable modern mode
   intel_iommu: Flush stage-2 cache in PASID-selective PASID-based iotlb
 invalidation
   intel_iommu: Flush stage-1 cache in iotlb invalidation
   intel_iommu: Process PASID-based iotlb invalidation
   intel_iommu: piotlb invalidation should notify unmap
   intel_iommu: Set default aw_bits to 48 in scalable modern mode
   intel_iommu: Introduce a property to control FS1GP cap bit setting
   tests/qtest: Add intel-iommu test

  MAINTAINERS|   1 +
  hw/i386/intel_iommu_internal.h |  91 -
  include/hw/i386/intel_iommu.h  |   9 +-
  hw/i386/intel_iommu.c  | 694 +++--
  tests/qtest/intel-iommu-test.c |  70 
  tests/qtest/meson.build|   1 +
  6 files changed, 735 insertions(+), 131 deletions(-)
  create mode 100644 tests/qtest/intel-iommu-test.c

RE: [PATCH v3 00/17] intel_iommu: Enable stage-1 translation for emulated device

2024-09-11 Thread Duan, Zhenzhong

Hi Clement,

Thanks for your review. Hoping it could be accepted in the foreseeable future.

Thanks
Zhenzhong

>-Original Message-
>From: CLEMENT MATHIEU--DRIF 
>Subject: Re: [PATCH v3 00/17] intel_iommu: Enable stage-1 translation for
>emulated device
>
>Hi Zhenzhong,
>
>Thanks for posting a new version.
>I think it starting to look good.
>Just a few comments.
>
> >cmd
>
>On 11/09/2024 07:22, Zhenzhong Duan wrote:
>>
>> Hi,
>>
>> Per Jason Wang's suggestion, iommufd nesting series[1] is split into
>> "Enable stage-1 translation for emulated device" series and
>> "Enable stage-1 translation for passthrough device" series.
>>
>> This series enables stage-1 translation support for emulated device
>> in intel iommu which we called "modern" mode.
>>
>> PATCH1-5:  Some preparing work before support stage-1 translation
>> PATCH6-8:  Implement stage-1 translation for emulated device
>> PATCH9-13: Emulate iotlb invalidation of stage-1 mapping
>> PATCH14:   Set default aw_bits to 48 in scalable modern mode
>> PATCH15-16:Expose scalable "modern" mode and "x-cap-fs1gp" to cmdline
>> PATCH17:   Add qtest
>>
>> Note in spec revision 3.4, it renames "First-level" to "First-stage",
>> "Second-level" to "Second-stage". But the scalable mode was added
>> before that change. So we keep old flavor using First-level/fl/Second-level/sl
>> in code but change to use stage-1/stage-2 in commit log.
>> But keep in mind First-level/fl/stage-1 all have same meaning,
>> same for Second-level/sl/stage-2.
>>
>> Qemu code can be found at [2]
>> The whole nesting series can be found at [3]
>>
>> [1] https://lists.gnu.org/archive/html/qemu-devel/2024-
>01/msg02740.html
>> [2]
>https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_stage1_em
>u_v3
>> [3]
>https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_rfc
>v2
>>
>> Thanks
>> Zhenzhong
>>
>> Changelog:
>> v3:
>> - drop unnecessary !(s->ecap & VTD_ECAP_SMTS) (Clement)
>> - simplify calculation of return value for vtd_iova_fl_check_canonical()
>(Liuyi)
>> - make A/D bit setting atomic (Liuyi)
>> - refine error msg (Clement, Liuyi)
>>
>> v2:
>> - check ecap/cap bits instead of s->scalable_modern in
>vtd_pe_type_check() (Clement)
>> - declare VTD_ECAP_FLTS/FS1GP after the feature is implemented
>(Clement)
>> - define VTD_INV_DESC_PIOTLB_G (Clement)
>> - make error msg consistent in vtd_process_piotlb_desc() (Clement)
>> - refine commit log in patch16 (Clement)
>> - add VTD_ECAP_IR to ECAP_MODERN_FIXED1 (Clement)
>> - add a knob x-cap-fs1gp to control stage-1 1G paging capability
>> - collect Clement's R-B
>>
>> v1:
>> - define VTD_HOST_AW_AUTO (Clement)
>> - passing pgtt as a parameter to vtd_update_iotlb (Clement)
>> - prefix sl_/fl_ to second/first level specific functions (Clement)
>> - pick reserved bit check from Clement, add his Co-developed-by
>> - Update test without using libqtest-single.h (Thomas)
>>
>> rfcv2:
>> - split from nesting series (Jason)
>> - merged some commits from Clement
>> - add qtest (jason)
>>
>>
>> Clément Mathieu--Drif (4):
>>intel_iommu: Check if the input address is canonical
>>intel_iommu: Set accessed and dirty bits during first stage
>>  translation
>>intel_iommu: Add an internal API to find an address space with PASID
>>intel_iommu: Add support for PASID-based device IOTLB invalidation
>>
>> Yi Liu (3):
>>intel_iommu: Rename slpte to pte
>>intel_iommu: Implement stage-1 translation
>>intel_iommu: Modify x-scalable-mode to be string option to expose
>>  scalable modern mode
>>
>> Yu Zhang (1):
>>intel_iommu: Use the latest fault reasons defined by spec
>>
>> Zhenzhong Duan (9):
>>intel_iommu: Make pasid entry type check accurate
>>intel_iommu: Add a placeholder variable for scalable modern mode
>>intel_iommu: Flush stage-2 cache in PASID-selective PASID-based iotlb
>>  invalidation
>>intel_iommu: Flush stage-1 cache in iotlb invalidation
>>intel_iommu: Process PASID-based iotlb invalidation
>>intel_iommu: piotlb invalidation should notify unmap
>>intel_iommu: Set default aw_bits to 48 in scalable modern mode
>>intel_iommu: Introduce a property to control FS1GP cap bit setting
>>tests/qtest: Add intel-iommu test
>>
>>   MAINTAINERS|   1 +
>>   hw/i386/intel_iommu_internal.h |  91 -
>>   include/hw/i386/intel_iommu.h  |   9 +-
>>   hw/i386/intel_iommu.c  | 694 +++
>--
>>   tests/qtest/intel-iommu-test.c |  70 
>>   tests/qtest/meson.build|   1 +
>>   6 files changed, 735 insertions(+), 131 deletions(-)
>>   create mode 100644 tests/qtest/intel-iommu-test.c
>>
>> --
>> 2.34.1
>>

RE: [PATCH v3 03/17] intel_iommu: Add a placeholder variable for scalable modern mode

2024-09-11 Thread Duan, Zhenzhong



>-Original Message-
>From: CLEMENT MATHIEU--DRIF 
>Subject: Re: [PATCH v3 03/17] intel_iommu: Add a placeholder variable for
>scalable modern mode
>
>
>
>On 11/09/2024 07:22, Zhenzhong Duan wrote:
>>
>> Add a new element scalable_modern in IntelIOMMUState to mark scalable
>> modern mode; this element will eventually be exposed as an intel_iommu
>> property.
>>
>> For now, it's only a placeholder, used for the address width
>> compatibility check and to block host device passthrough until nesting
>> is supported.
>>
>> Signed-off-by: Yi Liu 
>> Signed-off-by: Zhenzhong Duan 
>> ---
>>   include/hw/i386/intel_iommu.h |  1 +
>>   hw/i386/intel_iommu.c | 22 ++
>>   2 files changed, 19 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/hw/i386/intel_iommu.h
>b/include/hw/i386/intel_iommu.h
>> index 1eb05c29fc..788ed42477 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -262,6 +262,7 @@ struct IntelIOMMUState {
>>
>>   bool caching_mode;  /* RO - is cap CM enabled? */
>>   bool scalable_mode; /* RO - is Scalable Mode supported? */
>> +bool scalable_modern;   /* RO - is modern SM supported? */
>>   bool snoop_control; /* RO - is SNP filed supported? */
>>
>>   dma_addr_t root;/* Current root table pointer */
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index e3465fc27d..57c24f67b4 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -3872,7 +3872,13 @@ static bool vtd_check_hiod(IntelIOMMUState
>*s, HostIOMMUDevice *hiod,
>>   return false;
>>   }
>>
>> -return true;
>> +if (!s->scalable_modern) {
>> +/* All checks requested by VTD non-modern mode pass */
>> +return true;
>> +}
>> +
>> +error_setg(errp, "host device is unsupported in scalable modern mode
>yet");
>> +return false;
>>   }
>>
>>   static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int
>devfn,
>> @@ -4262,14 +4268,22 @@ static bool
>vtd_decide_config(IntelIOMMUState *s, Error **errp)
>>   }
>>   }
>>
>> -/* Currently only address widths supported are 39 and 48 bits */
>>   if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
>> -(s->aw_bits != VTD_HOST_AW_48BIT)) {
>> -error_setg(errp, "Supported values for aw-bits are: %d, %d",
>> +(s->aw_bits != VTD_HOST_AW_48BIT) &&
>> +!s->scalable_modern) {
>> +error_setg(errp, "%s mode: supported values for aw-bits
>are: %d, %d",
>> +   s->scalable_mode ? "Scalable legacy" : "Legacy",
>I think we should be consistent in the way we name things.
>s/Scalable legacy/Scalable

Will do.

>>  VTD_HOST_AW_39BIT, VTD_HOST_AW_48BIT);
>>   return false;
>>   }
>>
>> +if ((s->aw_bits != VTD_HOST_AW_48BIT) && s->scalable_modern) {
>> +error_setg(errp,
>> +   "Scalable modern mode: supported values for aw-bits is: 
>> %d",
>> +   VTD_HOST_AW_48BIT);
>> +return false;
>> +}
>> +
>In both conditions, I would rather test the mode first to make the
>intention clearer.
>For instance,
>
>(s->aw_bits != VTD_HOST_AW_48BIT) && s->scalable_modern
>
>would become
>
>s->scalable_modern && (s->aw_bits != VTD_HOST_AW_48BIT)

Sure, will do.

Thanks
Zhenzhong

>
>Apart from these minor comments, the patch looks good to me
>
>>   if (s->scalable_mode && !s->dma_drain) {
>>   error_setg(errp, "Need to set dma_drain for scalable mode");
>>   return false;
>> --
>> 2.34.1
>>
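
With both review comments applied (mode tested first, "Scalable" instead of "Scalable legacy"), the address width check could look roughly like the sketch below. This is an assumption-laden illustration: check_aw_bits() is a hypothetical helper that writes into a caller-supplied buffer instead of using error_setg(), and the message wording is only approximate.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define AW_39BIT 39
#define AW_48BIT 48

/* Writes a message into err and returns false when aw_bits is rejected. */
static bool check_aw_bits(bool scalable_mode, bool scalable_modern,
                          uint8_t aw_bits, char *err, size_t errlen)
{
    /* Modern mode: only 48 bits, since 5-level paging is not emulated. */
    if (scalable_modern && aw_bits != AW_48BIT) {
        snprintf(err, errlen,
                 "Scalable modern mode: supported value for aw-bits is: %d",
                 AW_48BIT);
        return false;
    }
    /* Legacy and scalable (non-modern) modes: 39 or 48 bits. */
    if (!scalable_modern && aw_bits != AW_39BIT && aw_bits != AW_48BIT) {
        snprintf(err, errlen,
                 "%s mode: supported values for aw-bits are: %d, %d",
                 scalable_mode ? "Scalable" : "Legacy",
                 AW_39BIT, AW_48BIT);
        return false;
    }
    return true;
}
```

Putting the mode test first makes each branch read as "in this mode, these widths are allowed", which is the intent behind the reordering suggested above.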

RE: [PATCH v2 00/17] intel_iommu: Enable stage-1 translation for emulated device

2024-09-10 Thread Duan, Zhenzhong

Hi Clement,

Yes, I'll send a v3 this week.

Thanks
Zhenzhong

>-Original Message-
>From: CLEMENT MATHIEU--DRIF 
>Subject: Re: [PATCH v2 00/17] intel_iommu: Enable stage-1 translation for
>emulated device
>
>Hi Zhenzhong,
>
>Do you plan to post a v3 for this series?
>
>Thanks
> >cmd
>
>On 05/08/2024 08:27, Zhenzhong Duan wrote:
>>
>> Hi,
>>
>> Per Jason Wang's suggestion, iommufd nesting series[1] is split into
>> "Enable stage-1 translation for emulated device" series and
>> "Enable stage-1 translation for passthrough device" series.
>>
>> This series enables stage-1 translation support for emulated device
>> in intel iommu which we called "modern" mode.
>>
>> PATCH1-5:  Some preparing work before support stage-1 translation
>> PATCH6-8:  Implement stage-1 translation for emulated device
>> PATCH9-13: Emulate iotlb invalidation of stage-1 mapping
>> PATCH14:   Set default aw_bits to 48 in scalable modern mode
>> PATCH15-16:Expose scalable "modern" mode and "x-cap-fs1gp" to cmdline
>> PATCH17:   Add qtest
>>
>> Note in spec revision 3.4, it renamed "First-level" to "First-stage",
>> "Second-level" to "Second-stage". But the scalable mode was added
>> before that change. So we keep old flavor using First-level/fl/Second-level/sl
>> in code but change to use stage-1/stage-2 in commit log.
>> But keep in mind First-level/fl/stage-1 all have same meaning,
>> same for Second-level/sl/stage-2.
>>
>> Qemu code can be found at [2]
>> The whole nesting series can be found at [3]
>>
>> [1] https://lists.gnu.org/archive/html/qemu-devel/2024-
>01/msg02740.html
>> [2]
>https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_stage1_em
>u_v2
>> [3]
>https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_rfc
>v2
>>
>> Thanks
>> Zhenzhong
>>
>> Changelog:
>> v2:
>> - check ecap/cap bits instead of s->scalable_modern in
>vtd_pe_type_check() (Clement)
>> - declare VTD_ECAP_FLTS/FS1GP after the feature is implemented
>(Clement)
>> - define VTD_INV_DESC_PIOTLB_G (Clement)
>> - make error msg consistent in vtd_process_piotlb_desc() (Clement)
>> - refine commit log in patch16 (Clement)
>> - add VTD_ECAP_IR to ECAP_MODERN_FIXED1 (Clement)
>> - add a knob x-cap-fs1gp to control stage-1 1G paging capability
>> - collect Clement's R-B
>>
>> v1:
>> - define VTD_HOST_AW_AUTO (Clement)
>> - passing pgtt as a parameter to vtd_update_iotlb (Clement)
>> - prefix sl_/fl_ to second/first level specific functions (Clement)
>> - pick reserved bit check from Clement, add his Co-developed-by
>> - Update test without using libqtest-single.h (Thomas)
>>
>> rfcv2:
>> - split from nesting series (Jason)
>> - merged some commits from Clement
>> - add qtest (jason)
>>
>> Clément Mathieu--Drif (4):
>>intel_iommu: Check if the input address is canonical
>>intel_iommu: Set accessed and dirty bits during first stage
>>  translation
>>intel_iommu: Add an internal API to find an address space with PASID
>>intel_iommu: Add support for PASID-based device IOTLB invalidation
>>
>> Yi Liu (3):
>>intel_iommu: Rename slpte to pte
>>intel_iommu: Implement stage-1 translation
>>intel_iommu: Modify x-scalable-mode to be string option to expose
>>  scalable modern mode
>>
>> Yu Zhang (1):
>>intel_iommu: Use the latest fault reasons defined by spec
>>
>> Zhenzhong Duan (9):
>>intel_iommu: Make pasid entry type check accurate
>>intel_iommu: Add a placeholder variable for scalable modern mode
>>intel_iommu: Flush stage-2 cache in PASID-selective PASID-based iotlb
>>  invalidation
>>intel_iommu: Flush stage-1 cache in iotlb invalidation
>>intel_iommu: Process PASID-based iotlb invalidation
>>intel_iommu: piotlb invalidation should notify unmap
>>intel_iommu: Set default aw_bits to 48 in scalable modern mode
>>intel_iommu: Introduce a property to control FS1GP cap bit setting
>>tests/qtest: Add intel-iommu test
>>
>>   MAINTAINERS|   1 +
>>   hw/i386/intel_iommu_internal.h |  91 -
>>   include/hw/i386/intel_iommu.h  |   9 +-
>>   hw/i386/intel_iommu.c  | 689 +++
>--
>>   tests/qtest/intel-iommu-test.c |  70 
>>   tests/qtest/meson.build|   1 +
>>   6 files changed, 731 insertions(+), 130 deletions(-)
>>   create mode 100644 tests/qtest/intel-iommu-test.c
>>
>> --
>> 2.34.1
>>

RE: [PATCH v4 0/2] intel_iommu minor fixes

2024-09-09 Thread Duan, Zhenzhong

Hi Michael,

Kindly ping, it seems this small series was missed.

Thanks
Zhenzhong


>-Original Message-
>From: Duan, Zhenzhong 
>Subject: [PATCH v4 0/2] intel_iommu minor fixes
>
>Hi
>
>Fixes two minor issues in intel iommu.
>See patch for details.
>
>Tested scalable mode and legacy mode with vfio device passthrough: PASS
>Tested intel-iommu.flat in kvm-unit-test: PASS
>
>Thanks
>Zhenzhong
>
>Changelog:
>v4:
>- Use 12 bytes commit id in fix tag (Liu Yi)
>
>v3:
>- add fix tag (Liu Yi)
>- collect R-B
>
>v2:
>- s/take/taking/ (Liu Yi)
>- add patch2 (Liu Yi)
>
>Zhenzhong Duan (2):
>  intel_iommu: Fix invalidation descriptor type field
>  intel_iommu: Make PASID-cache and PIOTLB type invalid in legacy mode
>
> hw/i386/intel_iommu_internal.h | 11 ++-
> hw/i386/intel_iommu.c  | 24 
> 2 files changed, 18 insertions(+), 17 deletions(-)
>
>--
>2.34.1

RE: [PATCH v2 13/17] intel_iommu: piotlb invalidation should notify unmap

2024-08-19 Thread Duan, Zhenzhong



>-Original Message-
>From: Liu, Yi L 
>Subject: Re: [PATCH v2 13/17] intel_iommu: piotlb invalidation should
>notify unmap
>
>On 2024/8/19 17:57, Duan, Zhenzhong wrote:
>>
>>
>>> -Original Message-
>>> From: Liu, Yi L 
>>> Subject: Re: [PATCH v2 13/17] intel_iommu: piotlb invalidation should
>>> notify unmap
>>>
>>> On 2024/8/5 14:27, Zhenzhong Duan wrote:
>>>> This is used by some emulated devices which cache address
>>>> translation results. When piotlb invalidation is issued in the guest,
>>>> those caches should be refreshed.
>>>
>>> Perhaps I have asked this before. :) To me, such emulated devices
>>> should implement an ATS-capability. You may mention the devices that
>>> do not implement ATS-capability, but cache the translation result,
>>> and note that it is better to implement ATS cap if there is need to
>>> cache the translation request.
>>
>> OK, will do. Will be like:
>>
>> "For device that does not implement ATS-capability or disable it
>> but still caches the translation result, it is better to implement ATS cap
>> or enable it if there is need to cache the translation request."
>
>sorry for a typo. s/request/result/

Applied.

Thanks
Zhenzhong

RE: [PATCH v2 13/17] intel_iommu: piotlb invalidation should notify unmap

2024-08-19 Thread Duan, Zhenzhong



>-Original Message-
>From: Liu, Yi L 
>Subject: Re: [PATCH v2 13/17] intel_iommu: piotlb invalidation should
>notify unmap
>
>On 2024/8/5 14:27, Zhenzhong Duan wrote:
>> This is used by some emulated devices which cache address
>> translation results. When piotlb invalidation is issued in the guest,
>> those caches should be refreshed.
>
>Perhaps I have asked this before. :) To me, such emulated devices
>should implement an ATS-capability. You may mention the devices that
>do not implement ATS-capability, but cache the translation result,
>and note that it is better to implement ATS cap if there is need to
>cache the translation request.

OK, will do. Will be like:

"For device that does not implement ATS-capability or disable it
but still caches the translation result, it is better to implement ATS cap
or enable it if there is need to cache the translation request."

Thanks
Zhenzhong

>
>>
>> Signed-off-by: Yi Sun 
>> Signed-off-by: Zhenzhong Duan 
>> Reviewed-by: Clément Mathieu--Drif
>> ---
>>   hw/i386/intel_iommu.c | 35
>++-
>>   1 file changed, 34 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index fa00f85fd7..317e630e08 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -2907,7 +2907,7 @@ static void
>vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
>>   continue;
>>   }
>>
>> -if (!s->scalable_modern) {
>> +if (!s->scalable_modern || !vtd_as_has_map_notifier(vtd_as)) {
>>   vtd_address_space_sync(vtd_as);
>>   }
>>   }
>> @@ -2919,6 +2919,9 @@ static void
>vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>>  bool ih)
>>   {
>>   VTDIOTLBPageInvInfo info;
>> +VTDAddressSpace *vtd_as;
>> +VTDContextEntry ce;
>> +hwaddr size = (1 << am) * VTD_PAGE_SIZE;
>>
>>   info.domain_id = domain_id;
>>   info.pasid = pasid;
>> @@ -2929,6 +2932,36 @@ static void
>vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>>   g_hash_table_foreach_remove(s->iotlb,
>>   vtd_hash_remove_by_page_piotlb, &info);
>>   vtd_iommu_unlock(s);
>> +
>> +QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
>> +if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>> +  vtd_as->devfn, &ce) &&
>> +domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
>> +uint32_t rid2pasid = VTD_CE_GET_RID2PASID(&ce);
>> +IOMMUTLBEvent event;
>> +
>> +if ((vtd_as->pasid != PCI_NO_PASID || pasid != rid2pasid) &&
>> +vtd_as->pasid != pasid) {
>> +continue;
>> +}
>> +
>> +/*
>> + * Page-Selective-within-PASID PASID-based-IOTLB Invalidation
>> + * does not flush stage-2 entries. See spec section 6.5.2.4
>> + */
>> +if (!s->scalable_modern) {
>> +continue;
>> +}
>> +
>> +event.type = IOMMU_NOTIFIER_UNMAP;
>> +event.entry.target_as = &address_space_memory;
>> +event.entry.iova = addr;
>> +event.entry.perm = IOMMU_NONE;
>> +event.entry.addr_mask = size - 1;
>> +event.entry.translated_addr = 0;
>> +memory_region_notify_iommu(&vtd_as->iommu, 0, event);
>> +}
>> +}
>>   }
>>
>>   static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
>
>--
>Regards,
>Yi Liu
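
A minimal standalone sketch of how the unmap range in the quoted vtd_piotlb_page_invalidate() hunk is derived from the address mask. This is not QEMU code: `PAGE_SIZE_4K`, `struct unmap_range` and `piotlb_unmap_range()` are hypothetical names, a 4 KiB base page is assumed in place of VTD_PAGE_SIZE, and the shift is done in 64 bits as a choice of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB base page, standing in for VTD_PAGE_SIZE. */
#define PAGE_SIZE_4K 4096ULL

/* Hypothetical container for the range carried by an UNMAP event. */
struct unmap_range {
    uint64_t iova;      /* start of the invalidated range */
    uint64_t addr_mask; /* size - 1, like event.entry.addr_mask */
};

/*
 * Build the range for a page-selective invalidation: the address mask
 * value `am` selects 2^am contiguous pages starting at `addr`.
 */
static struct unmap_range piotlb_unmap_range(uint64_t addr, uint8_t am)
{
    struct unmap_range r;
    uint64_t size;

    assert(am < 52); /* keeps (1 << am) * 4096 within 64 bits */
    size = (1ULL << am) * PAGE_SIZE_4K;
    r.iova = addr;
    r.addr_mask = size - 1;
    return r;
}
```

With am = 0 the mask covers a single 4 KiB page; with am = 9 it covers 2 MiB.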

RE: [PATCH v2 16/17] intel_iommu: Introduce a property to control FS1GP cap bit setting

2024-08-19 Thread Duan, Zhenzhong



>-Original Message-
>From: Liu, Yi L 
>Subject: Re: [PATCH v2 16/17] intel_iommu: Introduce a property to control
>FS1GP cap bit setting
>
>On 2024/8/15 11:46, Duan, Zhenzhong wrote:
>>
>>
>>> -Original Message-
>>> From: Liu, Yi L 
>>> Subject: Re: [PATCH v2 16/17] intel_iommu: Introduce a property to
>control
>>> FS1GP cap bit setting
>>>
>>> On 2024/8/5 14:27, Zhenzhong Duan wrote:
>>>> When host IOMMU doesn't support FS1GP but vIOMMU does, host
>>> IOMMU
>>>> can't translate stage-1 page table from guest correctly.
>>>
>>> this series is for emulated devices, so the above statement does not
>>> belong to this series. Is there any other reason to have this option?
>>
>> Good catch, will remove this comment.
>> In fact, this patch is mainly for passthrough device where host IOMMU
>doesn't support fs1gp.
>
>I see. To me, as long as the vIOMMU page walk logic supports 1GP large
>pages, it's ok to report the FS1GP cap to VM. But it is still fine to
>have this property to opt-out FS1GP if admin/orchestration layer(e.g. libvirt)
>knows no hw iommu has this capability, so it is better to opt out it
>before invoking QEMU.
>
>Is this your motivation for this property?

Exactly.

>
>>>
>>>> Add a property x-cap-fs1gp for user to turn FS1GP off so that
>>>> nested page table on host side works.
>>>
>>> I guess you would need to sync the FS1GP cap with host before reporting
>it
>>> in vIOMMU when comes to support passthrough devices.
>>
>> Yes, we already have this check, see
>https://github.com/yiliu1765/qemu/commit/b7ac7ce3a2e21eb1b3172743
>ee6f73e80fe67b3a
>
>good to know it. :) Will you fail the VM if the device's iommu does not
>support FS1GP or just mask out the FS1GP?

For a cold-plugged VFIO device, the VM fails to start with the error
"Stage-1 1GB huge page is unsupported by host IOMMU".
For a hot-plugged VFIO device, only the hotplug fails, with the same error.

Per Michael's suggestion, we don't update vIOMMU cap/ecap from host cap/ecap;
only vIOMMU properties can control them.

Thanks
Zhenzhong

RE: [PATCH v2 08/17] intel_iommu: Set accessed and dirty bits during first stage translation

2024-08-15 Thread Duan, Zhenzhong



>-Original Message-
>From: Liu, Yi L 
>Subject: Re: [PATCH v2 08/17] intel_iommu: Set accessed and dirty bits
>during first stage translation
>
>On 2024/8/5 14:27, Zhenzhong Duan wrote:
>> From: Clément Mathieu--Drif 
>>
>> Signed-off-by: Clément Mathieu--Drif 
>> Signed-off-by: Zhenzhong Duan 
>> ---
>>   hw/i386/intel_iommu_internal.h |  3 +++
>>   hw/i386/intel_iommu.c  | 24 
>>   2 files changed, 27 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index 668583aeca..7786ef7624 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -324,6 +324,7 @@ typedef enum VTDFaultReason {
>>
>>   /* Output address in the interrupt address range for scalable mode */
>>   VTD_FR_SM_INTERRUPT_ADDR = 0x87,
>> +VTD_FR_FS_BIT_UPDATE_FAILED = 0x91, /* SFS.10 */
>>   VTD_FR_MAX, /* Guard */
>>   } VTDFaultReason;
>>
>> @@ -549,6 +550,8 @@ typedef struct VTDRootEntry VTDRootEntry;
>>   /* Masks for First Level Paging Entry */
>>   #define VTD_FL_P        1ULL
>>   #define VTD_FL_RW_MASK  (1ULL << 1)
>> +#define VTD_FL_A        0x20
>> +#define VTD_FL_D        0x40
>>
>>   /* Second Level Page Translation Pointer*/
>>   #define VTD_SM_PASID_ENTRY_SLPTPTR (~0xfffULL)
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 6121cca4cd..3c2ceed284 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -1822,6 +1822,7 @@ static const bool vtd_qualified_faults[] = {
>>   [VTD_FR_PASID_TABLE_ENTRY_INV] = true,
>>   [VTD_FR_SM_INTERRUPT_ADDR] = true,
>>   [VTD_FR_FS_NON_CANONICAL] = true,
>> +[VTD_FR_FS_BIT_UPDATE_FAILED] = true,
>>   [VTD_FR_MAX] = false,
>>   };
>>
>> @@ -1939,6 +1940,20 @@ static bool
>vtd_iova_fl_check_canonical(IntelIOMMUState *s, uint64_t iova,
>>   );
>>   }
>>
>> +static MemTxResult vtd_set_flag_in_pte(dma_addr_t base_addr,
>uint32_t index,
>> +   uint64_t pte, uint64_t flag)
>> +{
>> +if (pte & flag) {
>> +return MEMTX_OK;
>> +}
>> +pte |= flag;
>> +pte = cpu_to_le64(pte);
>> +return dma_memory_write(&address_space_memory,
>> +base_addr + index * sizeof(pte),
>> +&pte, sizeof(pte),
>> +MEMTXATTRS_UNSPECIFIED);
>
>Can we ensure this write is atomic? A/D bit setting should be atomic from
>guest p.o.v.

Yes, how about the change below, which sets A and D with a single write?

@@ -2096,7 +2096,7 @@ static int vtd_iova_to_flpte(IntelIOMMUState *s, VTDContextEntry *ce,
     dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce, pasid);
     uint32_t level = vtd_get_iova_level(s, ce, pasid);
     uint32_t offset;
-    uint64_t flpte;
+    uint64_t flpte, flag_ad = VTD_FL_A;

     if (!vtd_iova_fl_check_canonical(s, iova, ce, pasid)) {
         error_report_once("%s: detected non canonical IOVA (iova=0x%" PRIx64 ","
@@ -2134,16 +2134,15 @@ static int vtd_iova_to_flpte(IntelIOMMUState *s, VTDContextEntry *ce,
             return -VTD_FR_PAGING_ENTRY_RSVD;
         }

-        if (vtd_set_flag_in_pte(addr, offset, flpte, VTD_FL_A) != MEMTX_OK) {
+        if (vtd_is_last_pte(flpte, level) && is_write) {
+            flag_ad |= VTD_FL_D;
+        }
+
+        if (vtd_set_flag_in_pte(addr, offset, flpte, flag_ad) != MEMTX_OK) {
             return -VTD_FR_FS_BIT_UPDATE_FAILED;
         }

         if (vtd_is_last_pte(flpte, level)) {
-            if (is_write &&
-                (vtd_set_flag_in_pte(addr, offset, flpte, VTD_FL_D) !=
-                 MEMTX_OK)) {
-                return -VTD_FR_FS_BIT_UPDATE_FAILED;
-            }
             *flptep = flpte;
             *flpte_level = level;
             return 0;
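
A minimal standalone sketch of the flag selection, not QEMU code: `FL_A`/`FL_D` mirror VTD_FL_A/VTD_FL_D from the patch, while `fl_ad_flags()` and `fl_pte_apply_flags()` are hypothetical helpers. One deliberate difference, which is an assumption of this sketch: it compares against the full mask, because with a combined mask a plain `pte & flag` test would return early when only one of the two bits is already set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FL_A 0x20ULL /* Accessed bit, mirrors VTD_FL_A */
#define FL_D 0x40ULL /* Dirty bit, mirrors VTD_FL_D */

/* A is always requested; D only for a write that hits the last level. */
static uint64_t fl_ad_flags(bool is_last_level, bool is_write)
{
    uint64_t flags = FL_A;

    if (is_last_level && is_write) {
        flags |= FL_D;
    }
    return flags;
}

/*
 * Merge the requested flags into *pte in one step.
 * Returns true when *pte changed and has to be written back,
 * false when every requested bit was already set.
 */
static bool fl_pte_apply_flags(uint64_t *pte, uint64_t flags)
{
    assert((flags & ~(FL_A | FL_D)) == 0);

    if ((*pte & flags) == flags) {
        return false;
    }
    *pte |= flags;
    return true;
}
```

The caller would then issue a single write-back of the updated entry instead of one write per bit.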

Thanks
Zhenzhong

RE: [PATCH v2 07/17] intel_iommu: Check if the input address is canonical

2024-08-15 Thread Duan, Zhenzhong



>-Original Message-
>From: Liu, Yi L 
>Subject: Re: [PATCH v2 07/17] intel_iommu: Check if the input address is
>canonical
>
>On 2024/8/5 14:27, Zhenzhong Duan wrote:
>> From: Clément Mathieu--Drif 
>>
>> First stage translation must fail if the address to translate is
>> not canonical.
>>
>> Signed-off-by: Clément Mathieu--Drif 
>> Signed-off-by: Zhenzhong Duan 
>> ---
>>   hw/i386/intel_iommu_internal.h |  2 ++
>>   hw/i386/intel_iommu.c  | 21 +
>>   2 files changed, 23 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index 51e9b1fc43..668583aeca 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -320,6 +320,8 @@ typedef enum VTDFaultReason {
>>   VTD_FR_PASID_ENTRY_P = 0x59,
>>   VTD_FR_PASID_TABLE_ENTRY_INV = 0x5b,  /*Invalid PASID table entry
>*/
>>
>> +VTD_FR_FS_NON_CANONICAL = 0x80, /* SNG.1 : Address for FS not
>canonical.*/
>> +
>>   /* Output address in the interrupt address range for scalable mode */
>>   VTD_FR_SM_INTERRUPT_ADDR = 0x87,
>>   VTD_FR_MAX, /* Guard */
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 0bcbd5b777..6121cca4cd 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -1821,6 +1821,7 @@ static const bool vtd_qualified_faults[] = {
>>   [VTD_FR_PASID_ENTRY_P] = true,
>>   [VTD_FR_PASID_TABLE_ENTRY_INV] = true,
>>   [VTD_FR_SM_INTERRUPT_ADDR] = true,
>> +[VTD_FR_FS_NON_CANONICAL] = true,
>>   [VTD_FR_MAX] = false,
>>   };
>>
>> @@ -1924,6 +1925,20 @@ static inline bool vtd_flpte_present(uint64_t
>flpte)
>>   return !!(flpte & VTD_FL_P);
>>   }
>>
>> +/* Return true if IOVA is canonical, otherwise false. */
>> +static bool vtd_iova_fl_check_canonical(IntelIOMMUState *s, uint64_t
>iova,
>> +VTDContextEntry *ce, uint32_t pasid)
>> +{
>> +uint64_t iova_limit = vtd_iova_limit(s, ce, s->aw_bits, pasid);
>> +uint64_t upper_bits_mask = ~(iova_limit - 1);
>> +uint64_t upper_bits = iova & upper_bits_mask;
>> +bool msb = ((iova & (iova_limit >> 1)) != 0);
>> +return !(
>> + (!msb && (upper_bits != 0)) ||
>> + (msb && (upper_bits != upper_bits_mask))
>> +);
>> +}
>> +
>
>will the below be clearer?
>
> if (msb)
> return upper_bits == upper_bits_mask;
> else
> return !upper_bits;

Yes, clearer, will do.
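
A minimal standalone sketch of the simplified check, not QEMU code: `iova_is_canonical()` is a hypothetical name, and it assumes the IOVA limit is `1ULL << aw` for an address width `aw`, in place of the vtd_iova_limit() call in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * An address is canonical when every bit above the address width
 * equals the most significant bit inside the address width.
 */
static bool iova_is_canonical(uint64_t iova, unsigned aw)
{
    uint64_t iova_limit, upper_bits_mask, upper_bits;
    bool msb;

    assert(aw > 0 && aw < 64);
    iova_limit = 1ULL << aw;              /* assumption: limit is 2^aw */
    upper_bits_mask = ~(iova_limit - 1);  /* bits [63:aw] */
    upper_bits = iova & upper_bits_mask;
    msb = (iova & (iova_limit >> 1)) != 0; /* bit aw-1 */

    if (msb) {
        return upper_bits == upper_bits_mask;
    }
    return !upper_bits;
}
```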

Thanks
Zhenzhong

RE: [PATCH v2 04/17] intel_iommu: Flush stage-2 cache in PASID-selective PASID-based iotlb invalidation

2024-08-14 Thread Duan, Zhenzhong



>-Original Message-
>From: Liu, Yi L 
>Subject: Re: [PATCH v2 04/17] intel_iommu: Flush stage-2 cache in PASID-
>selective PASID-based iotlb invalidation
>
>On 2024/8/5 14:27, Zhenzhong Duan wrote:
>> Per spec 6.5.2.4, PASID-selective PASID-based iotlb invalidation will
>> flush stage-2 iotlb entries with matching domain id and pasid.
>>
>> With scalable modern mode introduced, guest could send PASID-selective
>> PASID-based iotlb invalidation to flush both stage-1 and stage-2 entries.
>
>I'm not quite sure if this is correct. In the last column of Table 21
>in 6.5.2.4, the paging structures of SS will not be invalidated. So it's
>not quite recommended for software to invalidate the iotlb entries with
>PGTT==SS-only by P_IOTLB invalidation, it's more recommended to use the
>IOTLB invalidation.

Hmm, when a PASID is used with SS-only, PASID-based IOTLB invalidation gives
finer granularity: (DID, PASID), vs. (DID) for IOTLB invalidation.

If a non-leaf SS paging entry is updated, IOTLB invalidation should be used,
because the SS paging-structure cache isn't flushed by PASID-selective
PASID-based IOTLB invalidation.
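
A minimal standalone sketch of the granularity difference, not QEMU code: `struct cache_entry` and both flush helpers are hypothetical, and a flat array stands in for the IOTLB hash table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached translation, tagged the way the IOTLB tags entries. */
struct cache_entry {
    uint16_t domain_id;
    uint32_t pasid;
    bool valid;
};

/* IOTLB-style flush: drops every valid entry of the domain. */
static size_t flush_by_domain(struct cache_entry *e, size_t n, uint16_t did)
{
    size_t flushed = 0;

    assert(e != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (e[i].valid && e[i].domain_id == did) {
            e[i].valid = false;
            flushed++;
        }
    }
    return flushed;
}

/* PASID-based flush: drops only entries matching both DID and PASID. */
static size_t flush_by_domain_pasid(struct cache_entry *e, size_t n,
                                    uint16_t did, uint32_t pasid)
{
    size_t flushed = 0;

    assert(e != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (e[i].valid && e[i].domain_id == did && e[i].pasid == pasid) {
            e[i].valid = false;
            flushed++;
        }
    }
    return flushed;
}
```

The (DID, PASID) variant leaves translations of other PASIDs in the same domain cached, which is the finer granularity mentioned above.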

Thanks
Zhenzhong

>
>> By this chance, remove old IOTLB related definition.
>>
>> Signed-off-by: Zhenzhong Duan 
>> ---
>>   hw/i386/intel_iommu_internal.h | 14 +++---
>>   hw/i386/intel_iommu.c  | 81
>++
>>   2 files changed, 90 insertions(+), 5 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index 8fa27c7f3b..19e4ed52ca 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -402,11 +402,6 @@ typedef union VTDInvDesc VTDInvDesc;
>>   #define VTD_INV_DESC_IOTLB_AM(val)  ((val) & 0x3fULL)
>>   #define VTD_INV_DESC_IOTLB_RSVD_LO  0xff00ULL
>>   #define VTD_INV_DESC_IOTLB_RSVD_HI  0xf80ULL
>> -#define VTD_INV_DESC_IOTLB_PASID_PASID  (2ULL << 4)
>> -#define VTD_INV_DESC_IOTLB_PASID_PAGE   (3ULL << 4)
>> -#define VTD_INV_DESC_IOTLB_PASID(val)   (((val) >> 32) &
>VTD_PASID_ID_MASK)
>> -#define VTD_INV_DESC_IOTLB_PASID_RSVD_LO
>0xfff001c0ULL
>> -#define VTD_INV_DESC_IOTLB_PASID_RSVD_HI  0xf80ULL
>>
>>   /* Mask for Device IOTLB Invalidate Descriptor */
>>   #define VTD_INV_DESC_DEVICE_IOTLB_ADDR(val) ((val) &
>0xf000ULL)
>> @@ -438,6 +433,15 @@ typedef union VTDInvDesc VTDInvDesc;
>>   (0x3800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM |
>VTD_SL_TM)) : \
>>   (0x3800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
>>
>> +/* Masks for PIOTLB Invalidate Descriptor */
>> +#define VTD_INV_DESC_PIOTLB_G (3ULL << 4)
>> +#define VTD_INV_DESC_PIOTLB_ALL_IN_PASID  (2ULL << 4)
>> +#define VTD_INV_DESC_PIOTLB_PSI_IN_PASID  (3ULL << 4)
>> +#define VTD_INV_DESC_PIOTLB_DID(val)  (((val) >> 16) & VTD_DOMAIN_ID_MASK)
>> +#define VTD_INV_DESC_PIOTLB_PASID(val)    (((val) >> 32) & 0xfffffULL)
>> +#define VTD_INV_DESC_PIOTLB_RSVD_VAL0 0xfff0f1c0ULL
>> +#define VTD_INV_DESC_PIOTLB_RSVD_VAL1 0xf80ULL
>> +
>>   /* Information about page-selective IOTLB invalidate */
>>   struct VTDIOTLBPageInvInfo {
>>   uint16_t domain_id;
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index c1382a5651..df591419b7 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -2656,6 +2656,83 @@ static bool
>vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
>>   return true;
>>   }
>>
>> +static gboolean vtd_hash_remove_by_pasid(gpointer key, gpointer value,
>> + gpointer user_data)
>> +{
>> +VTDIOTLBEntry *entry = (VTDIOTLBEntry *)value;
>> +VTDIOTLBPageInvInfo *info = (VTDIOTLBPageInvInfo *)user_data;
>> +
>> +return ((entry->domain_id == info->domain_id) &&
>> +(entry->pasid == info->pasid));
>> +}
>> +
>> +static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
>> +uint16_t domain_id, uint32_t pasid)
>> +{
>> +VTDIOTLBPageInvInfo info;
>> +VTDAddressSpace *vtd_as;
>> +VTDContextEntry ce;
>> +
>> +info.domain_id = domain_id;
>> +info.pasid = pasid;
>> +
>> +vtd_iommu_lock(s);
>> +g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_pasid,
>> +&info);
>> +vtd_iommu_unlock(s);
>> +
>> +QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
>> +if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>> +  vtd_as->devfn, &ce) &&
>> +domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
>> +uint32_t rid2pasid = VTD_CE_GET_RID2PASID(&ce);
>> +
>> +if ((vtd_as->pasid != PCI_NO_PASID || pasid != rid2pasid) &&
>> +vtd_as->pasid != pasid) {
>> +continue;
>> +}
>> +
>> +if (!s->scalable_modern) {
>> +vtd_address_space_sync(vtd_as);

RE: [PATCH v2 16/17] intel_iommu: Introduce a property to control FS1GP cap bit setting

2024-08-14 Thread Duan, Zhenzhong



>-Original Message-
>From: Liu, Yi L 
>Subject: Re: [PATCH v2 16/17] intel_iommu: Introduce a property to control
>FS1GP cap bit setting
>
>On 2024/8/5 14:27, Zhenzhong Duan wrote:
>> When host IOMMU doesn't support FS1GP but vIOMMU does, host
>IOMMU
>> can't translate stage-1 page table from guest correctly.
>
>this series is for emulated devices, so the above statement does not
>belong to this series. Is there any other reason to have this option?

Good catch, will remove this comment.
In fact, this patch is mainly for passthrough device where host IOMMU doesn't 
support fs1gp.

>
>> Add a property x-cap-fs1gp for user to turn FS1GP off so that
>> nested page table on host side works.
>
>I guess you would need to sync the FS1GP cap with host before reporting it
>in vIOMMU when comes to support passthrough devices.

Yes, we already have this check, see 
https://github.com/yiliu1765/qemu/commit/b7ac7ce3a2e21eb1b3172743ee6f73e80fe67b3a

Thanks
Zhenzhong

RE: [PATCH v2 14/17] intel_iommu: Set default aw_bits to 48 in scalable modern mode

2024-08-14 Thread Duan, Zhenzhong



>-Original Message-
>From: Liu, Yi L 
>Subject: Re: [PATCH v2 14/17] intel_iommu: Set default aw_bits to 48 in
>scalable modern mode
>
>On 2024/8/5 14:27, Zhenzhong Duan wrote:
>> According to VTD spec, stage-1 page table could support 4-level and
>> 5-level paging.
>>
>> However, 5-level paging translation emulation is unsupported yet.
>> That means the only supported value for aw_bits is 48.
>>
>> So default aw_bits to 48 in scalable modern mode. In other cases,
>> it is still default to 39 for compatibility.
>>
>> Add a check to ensure user specified value is 48 in modern mode
>> for now.
>>
>> Signed-off-by: Zhenzhong Duan 
>> Reviewed-by: Clément Mathieu--Drif
>> ---
>>   include/hw/i386/intel_iommu.h |  2 +-
>>   hw/i386/intel_iommu.c | 16 +++-
>>   2 files changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/hw/i386/intel_iommu.h
>b/include/hw/i386/intel_iommu.h
>> index b843d069cc..48134bda11 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -45,7 +45,7 @@ OBJECT_DECLARE_SIMPLE_TYPE(IntelIOMMUState,
>INTEL_IOMMU_DEVICE)
>>   #define DMAR_REG_SIZE   0x230
>>   #define VTD_HOST_AW_39BIT   39
>>   #define VTD_HOST_AW_48BIT   48
>> -#define VTD_HOST_ADDRESS_WIDTH  VTD_HOST_AW_39BIT
>> +#define VTD_HOST_AW_AUTO        0xff
>>   #define VTD_HAW_MASK(aw)        ((1ULL << (aw)) - 1)
>>
>>   #define DMAR_REPORT_F_INTR  (1)
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 317e630e08..5469ab4f9b 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -3770,7 +3770,7 @@ static Property vtd_properties[] = {
>>   ON_OFF_AUTO_AUTO),
>>   DEFINE_PROP_BOOL("x-buggy-eim", IntelIOMMUState, buggy_eim,
>false),
>>   DEFINE_PROP_UINT8("aw-bits", IntelIOMMUState, aw_bits,
>> -  VTD_HOST_ADDRESS_WIDTH),
>> +  VTD_HOST_AW_AUTO),
>>   DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode,
>FALSE),
>>   DEFINE_PROP_BOOL("x-scalable-mode", IntelIOMMUState,
>scalable_mode, FALSE),
>>   DEFINE_PROP_BOOL("snoop-control", IntelIOMMUState,
>snoop_control, false),
>> @@ -4685,6 +4685,14 @@ static bool
>vtd_decide_config(IntelIOMMUState *s, Error **errp)
>>   }
>>   }
>>
>> +if (s->aw_bits == VTD_HOST_AW_AUTO) {
>> +if (s->scalable_modern) {
>> +s->aw_bits = VTD_HOST_AW_48BIT;
>> +} else {
>> +s->aw_bits = VTD_HOST_AW_39BIT;
>> +}
>> +}
>> +
>>   if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
>>   (s->aw_bits != VTD_HOST_AW_48BIT) &&
>>   !s->scalable_modern) {
>> @@ -4693,6 +4701,12 @@ static bool
>vtd_decide_config(IntelIOMMUState *s, Error **errp)
>>   return false;
>>   }
>>
>> +if ((s->aw_bits != VTD_HOST_AW_48BIT) && s->scalable_modern) {
>> +error_setg(errp, "Supported values for aw-bits are: %d",
>> +   VTD_HOST_AW_48BIT);
>
>call out it is for scalable modern.:)

Sure, it will be:

    if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
        (s->aw_bits != VTD_HOST_AW_48BIT) &&
        !s->scalable_modern) {
        error_setg(errp, "%s mode: supported values for aw-bits are: %d, %d",
                   s->scalable_mode ? "Scalable legacy" : "Legacy",
                   VTD_HOST_AW_39BIT, VTD_HOST_AW_48BIT);
        return false;
    }

    if ((s->aw_bits != VTD_HOST_AW_48BIT) && s->scalable_modern) {
        error_setg(errp,
                   "Scalable modern mode: supported value for aw-bits is: %d",
                   VTD_HOST_AW_48BIT);
        return false;
    }
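
A minimal standalone sketch of the whole aw-bits decision (default selection plus validation), not QEMU code: `decide_aw_bits()` is a hypothetical helper, and the `AW_*` constants mirror VTD_HOST_AW_39BIT, VTD_HOST_AW_48BIT and VTD_HOST_AW_AUTO from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define AW_39BIT 39
#define AW_48BIT 48
#define AW_AUTO  0xff /* "not set by the user" marker */

/*
 * Resolve the default and validate the result.
 * Returns false when the (possibly user-supplied) width is unsupported.
 */
static bool decide_aw_bits(uint8_t *aw_bits, bool scalable_modern)
{
    assert(aw_bits != NULL);

    if (*aw_bits == AW_AUTO) {
        /* Modern mode defaults to 48; other modes stay at 39. */
        *aw_bits = scalable_modern ? AW_48BIT : AW_39BIT;
    }

    if (scalable_modern) {
        /* 5-level paging is not emulated, so only 48 is accepted. */
        return *aw_bits == AW_48BIT;
    }
    return *aw_bits == AW_39BIT || *aw_bits == AW_48BIT;
}
```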

Thanks
Zhenzhong

RE: [PATCH v3 2/2] intel_iommu: Make PASID-cache and PIOTLB type invalid in legacy mode

2024-08-13 Thread Duan, Zhenzhong



>-Original Message-
>From: Liu, Yi L 
>Subject: Re: [PATCH v3 2/2] intel_iommu: Make PASID-cache and PIOTLB
>type invalid in legacy mode
>
>On 2024/8/14 10:26, Zhenzhong Duan wrote:
>> In vtd_process_inv_desc(), VTD_INV_DESC_PC and VTD_INV_DESC_PIOTLB
>are
>> bypassed without scalable mode check. These two types are not valid
>> in legacy mode and we should report error.
>>
>> Fixes: 4a4f219e8a1 ("intel_iommu: add scalable-mode option to make
>scalable mode work")
>
>4a4f219e8a10 would be better. :)

Ah, OK. Michael, let me know if you want me to send a new version.
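
A minimal standalone sketch of the intended behaviour of this patch, not QEMU code: PC and PIOTLB descriptors are accepted only in scalable mode and otherwise fall through to the invalid-type path. `inv_desc_type_accepted()` is a hypothetical helper, and the numeric values for WAIT, PIOTLB and PC are assumptions made for illustration (only 0x1 and 0x2 appear in the quoted headers).

```c
#include <assert.h>
#include <stdbool.h>

enum inv_desc_type {
    INV_DESC_CC     = 0x1,
    INV_DESC_IOTLB  = 0x2,
    INV_DESC_WAIT   = 0x5, /* assumed value */
    INV_DESC_PIOTLB = 0x6, /* assumed value */
    INV_DESC_PC     = 0x7, /* assumed value */
};

/* Returns true when a descriptor of this type is valid in the given mode. */
static bool inv_desc_type_accepted(unsigned type, bool scalable_mode)
{
    assert(type <= 0x7f); /* the type field is 7 bits wide */

    switch (type) {
    case INV_DESC_CC:
    case INV_DESC_IOTLB:
    case INV_DESC_WAIT:
        return true;

    case INV_DESC_PC:
    case INV_DESC_PIOTLB:
        if (scalable_mode) {
            return true;
        }
        /* fallthrough: not valid in legacy mode */
    default:
        return false;
    }
}
```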

Thanks
Zhenzhong

>
>> Suggested-by: Yi Liu 
>> Signed-off-by: Zhenzhong Duan 
>> Reviewed-by: Clément Mathieu--Drif
>> Reviewed-by: Yi Liu 
>> ---
>>   hw/i386/intel_iommu.c | 22 +++---
>>   1 file changed, 11 insertions(+), 11 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 68cb72a481..90cd4e5044 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -2763,17 +2763,6 @@ static bool
>vtd_process_inv_desc(IntelIOMMUState *s)
>>   }
>>   break;
>>
>> -/*
>> - * TODO: the entity of below two cases will be implemented in future
>series.
>> - * To make guest (which integrates scalable mode support patch set in
>> - * iommu driver) work, just return true is enough so far.
>> - */
>> -case VTD_INV_DESC_PC:
>> -break;
>> -
>> -case VTD_INV_DESC_PIOTLB:
>> -break;
>> -
>>   case VTD_INV_DESC_WAIT:
>>   trace_vtd_inv_desc("wait", inv_desc.hi, inv_desc.lo);
>>   if (!vtd_process_wait_desc(s, &inv_desc)) {
>> @@ -2795,6 +2784,17 @@ static bool
>vtd_process_inv_desc(IntelIOMMUState *s)
>>   }
>>   break;
>>
>> +/*
>> + * TODO: the entity of below two cases will be implemented in future
>series.
>> + * To make guest (which integrates scalable mode support patch set in
>> + * iommu driver) work, just return true is enough so far.
>> + */
>> +case VTD_INV_DESC_PC:
>> +case VTD_INV_DESC_PIOTLB:
>> +if (s->scalable_mode) {
>> +break;
>> +}
>> +/* fallthrough */
>>   default:
>>   error_report_once("%s: invalid inv desc: hi=%"PRIx64", lo=%"PRIx64
>> " (unknown type)", __func__, inv_desc.hi,
>
>--
>Regards,
>Yi Liu

RE: [PATCH v2 01/17] intel_iommu: Use the latest fault reasons defined by spec

2024-08-13 Thread Duan, Zhenzhong



>-Original Message-
>From: Liu, Yi L 
>Subject: Re: [PATCH v2 01/17] intel_iommu: Use the latest fault reasons
>defined by spec
>
>On 2024/8/5 14:27, Zhenzhong Duan wrote:
>> From: Yu Zhang 
>>
>> Spec revision 3.0 or above defines more detailed fault reasons for
>> scalable mode. So introduce them into emulation code, see spec
>> section 7.1.2 for details.
>>
>> Note spec revision has no relation with VERSION register, Guest
>> kernel should not use that register to judge what features are
>> supported. Instead cap/ecap bits should be checked.
>>
>> Signed-off-by: Yu Zhang 
>> Signed-off-by: Zhenzhong Duan 
>> Reviewed-by: Clément Mathieu--Drif
>> ---
>>   hw/i386/intel_iommu_internal.h |  9 -
>>   hw/i386/intel_iommu.c  | 25 -
>>   2 files changed, 24 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index 5f32c36943..8fa27c7f3b 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -311,7 +311,14 @@ typedef enum VTDFaultReason {
>> * request while disabled */
>>   VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
>>
>> -VTD_FR_PASID_TABLE_INV = 0x58,  /*Invalid PASID table entry */
>> +/* PASID directory entry access failure */
>> +VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
>> +/* The Present(P) field of pasid directory entry is 0 */
>> +VTD_FR_PASID_DIR_ENTRY_P = 0x51,
>> +VTD_FR_PASID_TABLE_ACCESS_ERR = 0x58, /* PASID table entry
>access failure */
>> +/* The Present(P) field of pasid table entry is 0 */
>> +VTD_FR_PASID_ENTRY_P = 0x59,
>> +VTD_FR_PASID_TABLE_ENTRY_INV = 0x5b,  /*Invalid PASID table entry
>*/
>
>how about making the comment line aligned? Either one line or two lines.

It looks like the original rule is:
if one line would exceed 80 chars, split the definition and its comment into
two lines; if not, just use one line.

I'm following that rule.

Thanks
Zhenzhong

>Besides this, lgtm.
>
>Reviewed-by: Yi Liu 
>
>>
>>   /* Output address in the interrupt address range for scalable mode */
>>   VTD_FR_SM_INTERRUPT_ADDR = 0x87,
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 16d2885fcc..c52912f593 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -796,7 +796,7 @@ static int
>vtd_get_pdire_from_pdir_table(dma_addr_t pasid_dir_base,
>>   addr = pasid_dir_base + index * entry_size;
>>   if (dma_memory_read(&address_space_memory, addr,
>>   pdire, entry_size, MEMTXATTRS_UNSPECIFIED)) {
>> -return -VTD_FR_PASID_TABLE_INV;
>> +return -VTD_FR_PASID_DIR_ACCESS_ERR;
>>   }
>>
>>   pdire->val = le64_to_cpu(pdire->val);
>> @@ -814,6 +814,7 @@ static int
>vtd_get_pe_in_pasid_leaf_table(IntelIOMMUState *s,
>> dma_addr_t addr,
>> VTDPASIDEntry *pe)
>>   {
>> +uint8_t pgtt;
>>   uint32_t index;
>>   dma_addr_t entry_size;
>>   X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
>> @@ -823,7 +824,7 @@ static int
>vtd_get_pe_in_pasid_leaf_table(IntelIOMMUState *s,
>>   addr = addr + index * entry_size;
>>   if (dma_memory_read(&address_space_memory, addr,
>>   pe, entry_size, MEMTXATTRS_UNSPECIFIED)) {
>> -return -VTD_FR_PASID_TABLE_INV;
>> +return -VTD_FR_PASID_TABLE_ACCESS_ERR;
>>   }
>>   for (size_t i = 0; i < ARRAY_SIZE(pe->val); i++) {
>>   pe->val[i] = le64_to_cpu(pe->val[i]);
>> @@ -831,11 +832,13 @@ static int
>vtd_get_pe_in_pasid_leaf_table(IntelIOMMUState *s,
>>
>>   /* Do translation type check */
>>   if (!vtd_pe_type_check(x86_iommu, pe)) {
>> -return -VTD_FR_PASID_TABLE_INV;
>> +return -VTD_FR_PASID_TABLE_ENTRY_INV;
>>   }
>>
>> -if (!vtd_is_level_supported(s, VTD_PE_GET_LEVEL(pe))) {
>> -return -VTD_FR_PASID_TABLE_INV;
>> +pgtt = VTD_PE_GET_TYPE(pe);
>> +if (pgtt == VTD_SM_PASID_ENTRY_SLT &&
>> +!vtd_is_level_supported(s, VTD_PE_GET_LEVEL(pe))) {
>> +return -VTD_FR_PASID_TABLE_ENTRY_INV;
>>   }
>>
>>   return 0;
>> @@ -876,7 +879,7 @@ static int
>vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
>>   }
>>
>>   if (!vtd_pdire_present(&pdire)) {
>> -return -VTD_FR_PASID_TABLE_INV;
>> +return -VTD_FR_PASID_DIR_ENTRY_P;
>>   }
>>
>>   ret = vtd_get_pe_from_pdire(s, pasid, &pdire, pe);
>> @@ -885,7 +888,7 @@ static int
>vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
>>   }
>>
>>   if (!vtd_pe_present(pe)) {
>> -return -VTD_FR_PASID_TABLE_INV;
>> +return -VTD_FR_PASID_ENTRY_P;
>>   }
>>
>>   return 0;
>> @@ -938,7 +941,7 @@ static int vtd_ce_get_pasid_fpd(IntelIOMMUState
>*s,
>>   }
>>
>>   if (!vtd_pdire_present(&pdire)) {
>> -return -VTD_FR_PASID_TABLE_INV;
>> +r

RE: [PATCH] intel_iommu: Fix invalidation descriptor type field

2024-08-13 Thread Duan, Zhenzhong



>-Original Message-
>From: Liu, Yi L 
>Subject: Re: [PATCH] intel_iommu: Fix invalidation descriptor type field
>
>On 2024/8/13 13:53, Zhenzhong Duan wrote:
>> According to spec, invalidation descriptor type is 7bits which is
>> concatenation of bits[11:9] and bits[3:0] of invalidation descriptor.
>>
>> Currently we only pick bits[3:0] as the invalidation type and treat
>> bits[11:9] as reserved zero. This is not a problem for now as bits[11:9]
>> is zero for all current invalidation types. But it will break if newer
>> type occupies bits[11:9].
>>
>> Fix it by take bits[11:9] into type and make reserved bits check accurate.
>
>s/take/taking/

Will fix.
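
A minimal standalone sketch of the 7-bit type extraction described in the commit message, not QEMU code: `inv_desc_type()` is a hypothetical function that follows the same arithmetic as the new VTD_INV_DESC_TYPE(val) macro.

```c
#include <assert.h>
#include <stdint.h>

/*
 * The type is the concatenation of descriptor bits[11:9] and bits[3:0]:
 * shifting right by 5 moves bits[11:9] down to bits[6:4].
 */
static uint8_t inv_desc_type(uint64_t lo)
{
    uint64_t type = ((lo >> 5) & 0x70ULL) | (lo & 0xfULL);

    assert(type <= 0x7f); /* always fits in 7 bits */
    return (uint8_t)type;
}
```

Bits [8:4] of the descriptor do not contribute to the type, so a value such as 0x1f0 yields type 0.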

>
>Reviewed-by: Yi Liu 
>
>There is another fix you may add. In vtd_process_inv_desc(), it should
>treat the type VTD_INV_DESC_PC and VTD_INV_DESC_PIOTLB as invalid type
>if vIOMMU is running in legacy mode.

Ah, indeed, will fix with an additional patch. Thanks for the suggestion.

>
>> Suggested-by: Clément Mathieu--Drif
>> Signed-off-by: Zhenzhong Duan 
>> ---
>> Tested intel-iommu.flat in kvm-unit-test: PASS
>> Tested vfio device hotplug: PASS
>> ---
>>   hw/i386/intel_iommu_internal.h | 11 ++-
>>   hw/i386/intel_iommu.c  |  2 +-
>>   2 files changed, 7 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index 5f32c36943..13d5d129ae 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -356,7 +356,8 @@ union VTDInvDesc {
>>   typedef union VTDInvDesc VTDInvDesc;
>>
>>   /* Masks for struct VTDInvDesc */
>> -#define VTD_INV_DESC_TYPE   0xf
>> +#define VTD_INV_DESC_TYPE(val)  ((((val) >> 5) & 0x70ULL) | \
>> + ((val) & 0xfULL))
>>   #define VTD_INV_DESC_CC 0x1 /* Context-cache Invalidate Desc */
>>   #define VTD_INV_DESC_IOTLB  0x2
>>   #define VTD_INV_DESC_DEVICE 0x3
>> @@ -372,7 +373,7 @@ typedef union VTDInvDesc VTDInvDesc;
>>   #define VTD_INV_DESC_WAIT_IF            (1ULL << 4)
>>   #define VTD_INV_DESC_WAIT_FN            (1ULL << 6)
>>   #define VTD_INV_DESC_WAIT_DATA_SHIFT    32
>> -#define VTD_INV_DESC_WAIT_RSVD_LO   0Xff80ULL
>> +#define VTD_INV_DESC_WAIT_RSVD_LO   0Xf180ULL
>>   #define VTD_INV_DESC_WAIT_RSVD_HI   3ULL
>>
>>   /* Masks for Context-cache Invalidation Descriptor */
>> @@ -383,7 +384,7 @@ typedef union VTDInvDesc VTDInvDesc;
>>   #define VTD_INV_DESC_CC_DID(val)    (((val) >> 16) & VTD_DOMAIN_ID_MASK)
>>   #define VTD_INV_DESC_CC_SID(val)    (((val) >> 32) & 0xffffUL)
>>   #define VTD_INV_DESC_CC_FM(val)     (((val) >> 48) & 3UL)
>> -#define VTD_INV_DESC_CC_RSVD        0xfffcffc0ULL
>> +#define VTD_INV_DESC_CC_RSVD        0xfffcf1c0ULL
>>
>>   /* Masks for IOTLB Invalidate Descriptor */
>>   #define VTD_INV_DESC_IOTLB_G            (3ULL << 4)
>> @@ -393,7 +394,7 @@ typedef union VTDInvDesc VTDInvDesc;
>>   #define VTD_INV_DESC_IOTLB_DID(val) (((val) >> 16) & VTD_DOMAIN_ID_MASK)
>>   #define VTD_INV_DESC_IOTLB_ADDR(val)    ((val) & ~0xfffULL)
>>   #define VTD_INV_DESC_IOTLB_AM(val)  ((val) & 0x3fULL)
>> -#define VTD_INV_DESC_IOTLB_RSVD_LO  0xff00ULL
>> +#define VTD_INV_DESC_IOTLB_RSVD_LO  0xf100ULL
>>   #define VTD_INV_DESC_IOTLB_RSVD_HI  0xf80ULL
>>   #define VTD_INV_DESC_IOTLB_PASID_PASID  (2ULL << 4)
>>   #define VTD_INV_DESC_IOTLB_PASID_PAGE   (3ULL << 4)
>> @@ -406,7 +407,7 @@ typedef union VTDInvDesc VTDInvDesc;
>>   #define VTD_INV_DESC_DEVICE_IOTLB_SIZE(val) ((val) & 0x1)
>>   #define VTD_INV_DESC_DEVICE_IOTLB_SID(val) (((val) >> 32) & 0xffffULL)
>>   #define VTD_INV_DESC_DEVICE_IOTLB_RSVD_HI 0xffeULL
>> -#define VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO 0xffe0fff8
>> +#define VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO 0xffe0f1f0
>>
>>   /* Rsvd field masks for spte */
>>   #define VTD_SPTE_SNP 0x800ULL
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 16d2885fcc..68cb72a481 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -2744,7 +2744,7 @@ static bool
>vtd_process_inv_desc(IntelIOMMUState *s)
>>   return false;
>>   }
>>
>> -desc_type = inv_desc.lo & VTD_INV_DESC_TYPE;
>> +desc_type = VTD_INV_DESC_TYPE(inv_desc.lo);
>>   /* FIXME: should update at first or at last? */
>>   s->iq_last_desc_type = desc_type;
>>
>
>--
>Regards,
>Yi Liu

RE: [PATCH v2 03/17] intel_iommu: Add a placeholder variable for scalable modern mode

2024-08-12 Thread Duan, Zhenzhong



>-Original Message-
>From: CLEMENT MATHIEU--DRIF 
>Subject: Re: [PATCH v2 03/17] intel_iommu: Add a placeholder variable for
>scalable modern mode
>
>
>
>On 13/08/2024 04:20, Duan, Zhenzhong wrote:
>> Caution: External email. Do not open attachments or click links, unless this
>email comes from a known sender and you know the content is safe.
>>
>>
>>> -Original Message-
>>> From: CLEMENT MATHIEU--DRIF 
>>> Subject: Re: [PATCH v2 03/17] intel_iommu: Add a placeholder variable
>for
>>> scalable modern mode
>>>
>>>
>>>
>>> On 08/08/2024 14:31, Duan, Zhenzhong wrote:
>>>>
>>>> On 8/6/2024 2:35 PM, CLEMENT MATHIEU--DRIF wrote:
>>>>> On 05/08/2024 08:27, Zhenzhong Duan wrote:
>>>>>>
>>>>>> Add an new element scalable_mode in IntelIOMMUState to mark
>>> scalable
>>>>>> modern mode, this element will be exposed as an intel_iommu
>property
>>>>>> finally.
>>>>>>
>>>>>> For now, it's only a placehholder and used for address width
>>>>>> compatibility check and block host device passthrough until nesting
>>>>>> is supported.
>>>>>>
>>>>>> Signed-off-by: Yi Liu 
>>>>>> Signed-off-by: Zhenzhong Duan 
>>>>>> ---
>>>>>> include/hw/i386/intel_iommu.h |  1 +
>>>>>> hw/i386/intel_iommu.c | 12 +---
>>>>>> 2 files changed, 10 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/include/hw/i386/intel_iommu.h
>>>>>> b/include/hw/i386/intel_iommu.h
>>>>>> index 1eb05c29fc..788ed42477 100644
>>>>>> --- a/include/hw/i386/intel_iommu.h
>>>>>> +++ b/include/hw/i386/intel_iommu.h
>>>>>> @@ -262,6 +262,7 @@ struct IntelIOMMUState {
>>>>>>
>>>>>> bool caching_mode;  /* RO - is cap CM enabled? */
>>>>>> bool scalable_mode; /* RO - is Scalable Mode
>>>>>> supported? */
>>>>>> +bool scalable_modern;   /* RO - is modern SM supported? */
>>>>>> bool snoop_control; /* RO - is SNP filed
>>>>>> supported? */
>>>>>>
>>>>>> dma_addr_t root;/* Current root table pointer */
>>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>>> index e3465fc27d..c1382a5651 100644
>>>>>> --- a/hw/i386/intel_iommu.c
>>>>>> +++ b/hw/i386/intel_iommu.c
>>>>>> @@ -3872,7 +3872,13 @@ static bool
>>> vtd_check_hiod(IntelIOMMUState
>>>>>> *s, HostIOMMUDevice *hiod,
>>>>>> return false;
>>>>>> }
>>>>>>
>>>>>> -return true;
>>>>>> +if (!s->scalable_modern) {
>>>>>> +/* All checks requested by VTD non-modern mode pass */
>>>>>> +return true;
>>>>>> +}
>>>>>> +
>>>>>> +error_setg(errp, "host device is unsupported in scalable modern
>>>>>> mode yet");
>>>>>> +return false;
>>>>>> }
>>>>>>
>>>>>> static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque,
>>>>>> int devfn,
>>>>>> @@ -4262,9 +4268,9 @@ static bool
>>> vtd_decide_config(IntelIOMMUState
>>>>>> *s, Error **errp)
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> -/* Currently only address widths supported are 39 and 48 bits */
>>>>>> if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
>>>>>> -(s->aw_bits != VTD_HOST_AW_48BIT)) {
>>>>>> +(s->aw_bits != VTD_HOST_AW_48BIT) &&am

RE: [PATCH v2 03/17] intel_iommu: Add a placeholder variable for scalable modern mode

2024-08-12 Thread Duan, Zhenzhong



>-Original Message-
>From: CLEMENT MATHIEU--DRIF 
>Subject: Re: [PATCH v2 03/17] intel_iommu: Add a placeholder variable for
>scalable modern mode
>
>
>
>On 08/08/2024 14:31, Duan, Zhenzhong wrote:
>>
>>
>> On 8/6/2024 2:35 PM, CLEMENT MATHIEU--DRIF wrote:
>>>
>>> On 05/08/2024 08:27, Zhenzhong Duan wrote:
>>>>
>>>>
>>>> Add a new element scalable_modern in IntelIOMMUState to mark scalable
>>>> modern mode; this element will eventually be exposed as an intel_iommu
>>>> property.
>>>>
>>>> For now, it's only a placeholder, used for the address width
>>>> compatibility check and to block host device passthrough until nesting
>>>> is supported.
>>>>
>>>> Signed-off-by: Yi Liu 
>>>> Signed-off-by: Zhenzhong Duan 
>>>> ---
>>>>    include/hw/i386/intel_iommu.h |  1 +
>>>>    hw/i386/intel_iommu.c | 12 +---
>>>>    2 files changed, 10 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/include/hw/i386/intel_iommu.h
>>>> b/include/hw/i386/intel_iommu.h
>>>> index 1eb05c29fc..788ed42477 100644
>>>> --- a/include/hw/i386/intel_iommu.h
>>>> +++ b/include/hw/i386/intel_iommu.h
>>>> @@ -262,6 +262,7 @@ struct IntelIOMMUState {
>>>>
>>>>    bool caching_mode;  /* RO - is cap CM enabled? */
>>>>    bool scalable_mode; /* RO - is Scalable Mode
>>>> supported? */
>>>> +    bool scalable_modern;   /* RO - is modern SM supported? */
>>>>    bool snoop_control; /* RO - is SNP filed
>>>> supported? */
>>>>
>>>>    dma_addr_t root;    /* Current root table pointer */
>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>> index e3465fc27d..c1382a5651 100644
>>>> --- a/hw/i386/intel_iommu.c
>>>> +++ b/hw/i386/intel_iommu.c
>>>> @@ -3872,7 +3872,13 @@ static bool
>vtd_check_hiod(IntelIOMMUState
>>>> *s, HostIOMMUDevice *hiod,
>>>>    return false;
>>>>    }
>>>>
>>>> -    return true;
>>>> +    if (!s->scalable_modern) {
>>>> +    /* All checks requested by VTD non-modern mode pass */
>>>> +    return true;
>>>> +    }
>>>> +
>>>> +    error_setg(errp, "host device is unsupported in scalable modern
>>>> mode yet");
>>>> +    return false;
>>>>    }
>>>>
>>>>    static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque,
>>>> int devfn,
>>>> @@ -4262,9 +4268,9 @@ static bool
>vtd_decide_config(IntelIOMMUState
>>>> *s, Error **errp)
>>>>    }
>>>>    }
>>>>
>>>> -    /* Currently only address widths supported are 39 and 48 bits */
>>>>    if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
>>>> -    (s->aw_bits != VTD_HOST_AW_48BIT)) {
>>>> +    (s->aw_bits != VTD_HOST_AW_48BIT) &&
>>>> +    !s->scalable_modern) {
>>> Why does scalable_modern allow a value other than 39 or 48?
>>> Is it safe?
>>
>> The check for scalable_modern is in patch 14:
>>
>>     if ((s->aw_bits != VTD_HOST_AW_48BIT) && s->scalable_modern) {
>>         error_setg(errp, "Supported values for aw-bits are: %d",
>>                    VTD_HOST_AW_48BIT);
>>         return false;
>>     }
>>
>> Let me know if you prefer to move it into this patch.
>Yes, you are right, it would be better to move the check here.
>
>But I think the first check should also fail even when scalable_modern
>is enabled because values other than 39 and 48 are not supported at all,
>whatever the mode.
>Then, we should check if the value is valid for scalable_modern mode.

Right, I wrote it that way with a possible plan to support VTD_HOST_AW_52BIT.
What about this:

    if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
        (s->aw_bits != VTD_HOST_AW_48BIT) &&
        !s->scalable_modern) {
        error_setg(errp, "Scalable legacy mode: supported values for aw-bits "
                   "are: %d, %d",
                   VTD_HOST_AW_39BIT, VTD_HOST_AW_48BIT);
        return false;
    }

    if ((s->aw_bits != VTD_HOST_AW_48BIT) && s->scalable_modern) {
        error_setg(errp, "Scalable modern mode: supported value for aw-bits "
                   "is: %d",
                   VTD_HOST_AW_48BIT);
        return false;
    }

Thanks
Zhenzhong
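
For illustration only, the ordering suggested in the review (reject widths
that no mode supports first, then apply the mode-specific restriction) could
be sketched as a standalone C helper. vtd_aw_bits_error() and its message
strings are hypothetical and not part of the patch; the constants 39 and 48
mirror VTD_HOST_AW_39BIT and VTD_HOST_AW_48BIT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VTD_HOST_AW_39BIT 39
#define VTD_HOST_AW_48BIT 48

/*
 * Hypothetical helper, not part of the patch: returns NULL when aw_bits is
 * acceptable, otherwise a static string naming the accepted values.
 */
const char *vtd_aw_bits_error(uint8_t aw_bits, bool scalable_modern)
{
    /* Step 1: widths other than 39 and 48 are rejected in every mode. */
    if (aw_bits != VTD_HOST_AW_39BIT && aw_bits != VTD_HOST_AW_48BIT) {
        return "supported values for aw-bits are: 39, 48";
    }
    /* Step 2: scalable modern mode narrows the set to 48 bits only. */
    if (scalable_modern && aw_bits != VTD_HOST_AW_48BIT) {
        return "scalable modern mode: supported value for aw-bits is: 48";
    }
    return NULL;
}

/* Example checks: 39 passes only outside modern mode, 52 never passes. */
void vtd_aw_bits_example(void)
{
    assert(vtd_aw_bits_error(39, false) == NULL);
    assert(vtd_aw_bits_error(39, true) != NULL);
    assert(vtd_aw_bits_error(48, true) == NULL);
    assert(vtd_aw_bits_error(52, false) != NULL);
}
```

With this ordering, adding a wider value later would mean touching step 1
once and then deciding per mode whether step 2 admits it.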

RE: [PATCH v2 04/17] intel_iommu: Flush stage-2 cache in PASID-selective PASID-based iotlb invalidation

2024-08-12 Thread Duan, Zhenzhong



>-Original Message-
>From: CLEMENT MATHIEU--DRIF 
>Subject: Re: [PATCH v2 04/17] intel_iommu: Flush stage-2 cache in PASID-
>selective PASID-based iotlb invalidation
>
>
>
>On 08/08/2024 14:40, Duan, Zhenzhong wrote:
>>
>>
>> On 8/6/2024 2:35 PM, CLEMENT MATHIEU--DRIF wrote:
>>>
>>> On 05/08/2024 08:27, Zhenzhong Duan wrote:
>>>>
>>>>
>>>> Per spec 6.5.2.4, PASID-selective PASID-based iotlb invalidation will
>>>> flush stage-2 iotlb entries with matching domain id and pasid.
>>>>
>>>> With scalable modern mode introduced, the guest could send a
>>>> PASID-selective PASID-based iotlb invalidation to flush both stage-1
>>>> and stage-2 entries.
>>>>
>>>> While at it, remove the old IOTLB-related definitions.
>>>>
>>>> Signed-off-by: Zhenzhong Duan 
>>>> ---
>>>>    hw/i386/intel_iommu_internal.h | 14 +++---
>>>>    hw/i386/intel_iommu.c  | 81
>>>> ++
>>>>    2 files changed, 90 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/hw/i386/intel_iommu_internal.h
>>>> b/hw/i386/intel_iommu_internal.h
>>>> index 8fa27c7f3b..19e4ed52ca 100644
>>>> --- a/hw/i386/intel_iommu_internal.h
>>>> +++ b/hw/i386/intel_iommu_internal.h
>>>> @@ -402,11 +402,6 @@ typedef union VTDInvDesc VTDInvDesc;
>>>>    #define VTD_INV_DESC_IOTLB_AM(val)  ((val) & 0x3fULL)
>>>>    #define VTD_INV_DESC_IOTLB_RSVD_LO 0xff00ULL
>>>>    #define VTD_INV_DESC_IOTLB_RSVD_HI  0xf80ULL
>>>> -#define VTD_INV_DESC_IOTLB_PASID_PASID  (2ULL << 4)
>>>> -#define VTD_INV_DESC_IOTLB_PASID_PAGE   (3ULL << 4)
>>>> -#define VTD_INV_DESC_IOTLB_PASID(val)   (((val) >> 32) &
>>>> VTD_PASID_ID_MASK)
>>>> -#define VTD_INV_DESC_IOTLB_PASID_RSVD_LO
>0xfff001c0ULL
>>>> -#define VTD_INV_DESC_IOTLB_PASID_RSVD_HI  0xf80ULL
>>>>
>>>>    /* Mask for Device IOTLB Invalidate Descriptor */
>>>>    #define VTD_INV_DESC_DEVICE_IOTLB_ADDR(val) ((val) &
>>>> 0xf000ULL)
>>>> @@ -438,6 +433,15 @@ typedef union VTDInvDesc VTDInvDesc;
>>>>    (0x3800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM |
>>>> VTD_SL_TM)) : \
>>>>    (0x3800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
>>>>
>>>> +/* Masks for PIOTLB Invalidate Descriptor */
>>>> +#define VTD_INV_DESC_PIOTLB_G (3ULL << 4)
>>>> +#define VTD_INV_DESC_PIOTLB_ALL_IN_PASID  (2ULL << 4)
>>>> +#define VTD_INV_DESC_PIOTLB_PSI_IN_PASID  (3ULL << 4)
>>>> +#define VTD_INV_DESC_PIOTLB_DID(val)  (((val) >> 16) &
>>>> VTD_DOMAIN_ID_MASK)
>>>> +#define VTD_INV_DESC_PIOTLB_PASID(val)    (((val) >> 32) & 0xfULL)
>>>> +#define VTD_INV_DESC_PIOTLB_RSVD_VAL0 0xfff0f1c0ULL
>>> Why did this value change since last post? The 'type' field should
>>> always be zero in this desc
>>
>> Yes, type[6:4] are all zero for all existing invalidation types, but they
>> are not really reserved bits.
>>
>> So I removed them from VTD_INV_DESC_PIOTLB_RSVD_VAL0.
>Other masks consider these zeroes as reserved.
>I think we should do the same.
>For instance, context cache invalidation is : #define
>VTD_INV_DESC_CC_RSVD 0xfffcffc0ULL

Yes, I'll make a separate patch to fix it.

Thanks
Zhenzhong
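
As a rough sketch of the point under discussion, the reserved mask for the
low qword of a PASID-based IOTLB invalidate descriptor can be derived from
the field layout, once without and once with Type[6:4] counted as reserved.
The bit positions below are an assumption taken from my reading of the VT-d
specification (the mask values quoted in this archive are truncated), and
BITS64, the PIOTLB_* names and the helper functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask with bits hi..lo set (0 <= lo <= hi <= 63). */
#define BITS64(hi, lo) ((~0ULL >> (63 - (hi))) & (~0ULL << (lo)))

/* Assumed field layout of the descriptor's low qword. */
#define PIOTLB_TYPE_LO  BITS64(3, 0)    /* Type[3:0] */
#define PIOTLB_G        BITS64(5, 4)    /* granularity */
#define PIOTLB_TYPE_HI  BITS64(11, 9)   /* Type[6:4], zero for known types */
#define PIOTLB_DID      BITS64(31, 16)  /* domain id */
#define PIOTLB_PASID    BITS64(51, 32)  /* PASID */

/* Loose: only bits that belong to no field are reserved. */
#define PIOTLB_RSVD_LOOSE \
    (~(PIOTLB_TYPE_LO | PIOTLB_G | PIOTLB_TYPE_HI | PIOTLB_DID | PIOTLB_PASID))
/* Strict: Type[6:4] must be zero too, as other descriptor masks require. */
#define PIOTLB_RSVD_STRICT (PIOTLB_RSVD_LOOSE | PIOTLB_TYPE_HI)

bool piotlb_val0_valid(uint64_t val0, bool strict)
{
    return (val0 & (strict ? PIOTLB_RSVD_STRICT : PIOTLB_RSVD_LOOSE)) == 0;
}

uint16_t piotlb_did(uint64_t val0)
{
    return (uint16_t)((val0 & PIOTLB_DID) >> 16);
}

uint32_t piotlb_pasid(uint64_t val0)
{
    return (uint32_t)((val0 & PIOTLB_PASID) >> 32);
}

/* Example: type 6, granularity 2, DID 0x1234, PASID 5. */
void piotlb_rsvd_example(void)
{
    uint64_t val0 = 0x6ULL | (2ULL << 4) | (0x1234ULL << 16) | (5ULL << 32);

    assert(piotlb_val0_valid(val0, false) && piotlb_val0_valid(val0, true));
    assert(piotlb_did(val0) == 0x1234 && piotlb_pasid(val0) == 5);
    /* A stray Type[6:4] bit is caught only by the strict mask. */
    assert(piotlb_val0_valid(val0 | (1ULL << 9), false));
    assert(!piotlb_val0_valid(val0 | (1ULL << 9), true));
    /* Bit 6 belongs to no field, so both masks reject it. */
    assert(!piotlb_val0_valid(val0 | (1ULL << 6), false));
}
```

The strict variant matches the reviewer's preference of treating the always
zero type bits as reserved, consistent with the other descriptor masks.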

Re: [PATCH v2 04/17] intel_iommu: Flush stage-2 cache in PASID-selective PASID-based iotlb invalidation

2024-08-08 Thread Duan, Zhenzhong




On 8/6/2024 2:35 PM, CLEMENT MATHIEU--DRIF wrote:


On 05/08/2024 08:27, Zhenzhong Duan wrote:



Per spec 6.5.2.4, PASID-selective PASID-based iotlb invalidation will
flush stage-2 iotlb entries with matching domain id and pasid.

With scalable modern mode introduced, the guest could send a PASID-selective
PASID-based iotlb invalidation to flush both stage-1 and stage-2 entries.

While at it, remove the old IOTLB-related definitions.

Signed-off-by: Zhenzhong Duan 
---
   hw/i386/intel_iommu_internal.h | 14 +++---
   hw/i386/intel_iommu.c  | 81 ++
   2 files changed, 90 insertions(+), 5 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 8fa27c7f3b..19e4ed52ca 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -402,11 +402,6 @@ typedef union VTDInvDesc VTDInvDesc;
   #define VTD_INV_DESC_IOTLB_AM(val)  ((val) & 0x3fULL)
   #define VTD_INV_DESC_IOTLB_RSVD_LO  0xff00ULL
   #define VTD_INV_DESC_IOTLB_RSVD_HI  0xf80ULL
-#define VTD_INV_DESC_IOTLB_PASID_PASID  (2ULL << 4)
-#define VTD_INV_DESC_IOTLB_PASID_PAGE   (3ULL << 4)
-#define VTD_INV_DESC_IOTLB_PASID(val)   (((val) >> 32) & VTD_PASID_ID_MASK)
-#define VTD_INV_DESC_IOTLB_PASID_RSVD_LO  0xfff001c0ULL
-#define VTD_INV_DESC_IOTLB_PASID_RSVD_HI  0xf80ULL

   /* Mask for Device IOTLB Invalidate Descriptor */
   #define VTD_INV_DESC_DEVICE_IOTLB_ADDR(val) ((val) & 0xf000ULL)
@@ -438,6 +433,15 @@ typedef union VTDInvDesc VTDInvDesc;
   (0x3800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) : 
\
   (0x3800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))

+/* Masks for PIOTLB Invalidate Descriptor */
+#define VTD_INV_DESC_PIOTLB_G (3ULL << 4)
+#define VTD_INV_DESC_PIOTLB_ALL_IN_PASID  (2ULL << 4)
+#define VTD_INV_DESC_PIOTLB_PSI_IN_PASID  (3ULL << 4)
+#define VTD_INV_DESC_PIOTLB_DID(val)  (((val) >> 16) & VTD_DOMAIN_ID_MASK)
+#define VTD_INV_DESC_PIOTLB_PASID(val)(((val) >> 32) & 0xfULL)
+#define VTD_INV_DESC_PIOTLB_RSVD_VAL0 0xfff0f1c0ULL

Why did this value change since last post? The 'type' field should
always be zero in this desc


Yes, type[6:4] are all zero for all existing invalidation types, but they
are not really reserved bits.


So I removed them from VTD_INV_DESC_PIOTLB_RSVD_VAL0.

Thanks

Zhenzhong
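
A minimal sketch of how the granularity field defined in this patch selects
the invalidation type. The three VTD_INV_DESC_PIOTLB_* values are copied from
the patch; the enum and piotlb_decode() are hypothetical names used only for
this example.

```c
#include <assert.h>
#include <stdint.h>

#define VTD_INV_DESC_PIOTLB_G             (3ULL << 4)
#define VTD_INV_DESC_PIOTLB_ALL_IN_PASID  (2ULL << 4)
#define VTD_INV_DESC_PIOTLB_PSI_IN_PASID  (3ULL << 4)

/* Hypothetical result type for the decode step. */
enum piotlb_action {
    PIOTLB_FLUSH_PASID,   /* PASID-selective: flush everything in the PASID */
    PIOTLB_FLUSH_PAGES,   /* page-selective within the PASID */
    PIOTLB_INVALID        /* granularity values 0 and 1 are not handled */
};

enum piotlb_action piotlb_decode(uint64_t val0)
{
    /* Isolate bits 5:4 before comparing against the granularity values. */
    switch (val0 & VTD_INV_DESC_PIOTLB_G) {
    case VTD_INV_DESC_PIOTLB_ALL_IN_PASID:
        return PIOTLB_FLUSH_PASID;
    case VTD_INV_DESC_PIOTLB_PSI_IN_PASID:
        return PIOTLB_FLUSH_PAGES;
    default:
        return PIOTLB_INVALID;
    }
}

/* Example: low byte 0x26 is type 6 with granularity 2, 0x36 granularity 3. */
void piotlb_decode_example(void)
{
    assert(piotlb_decode(0x26) == PIOTLB_FLUSH_PASID);
    assert(piotlb_decode(0x36) == PIOTLB_FLUSH_PAGES);
    assert(piotlb_decode(0x06) == PIOTLB_INVALID);
}
```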

Re: [PATCH v2 03/17] intel_iommu: Add a placeholder variable for scalable modern mode

2024-08-08 Thread Duan, Zhenzhong




On 8/6/2024 2:35 PM, CLEMENT MATHIEU--DRIF wrote:


On 05/08/2024 08:27, Zhenzhong Duan wrote:



Add a new element scalable_modern in IntelIOMMUState to mark scalable
modern mode; this element will eventually be exposed as an intel_iommu
property.

For now, it's only a placeholder, used for the address width
compatibility check and to block host device passthrough until nesting
is supported.

Signed-off-by: Yi Liu 
Signed-off-by: Zhenzhong Duan 
---
   include/hw/i386/intel_iommu.h |  1 +
   hw/i386/intel_iommu.c | 12 +---
   2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 1eb05c29fc..788ed42477 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -262,6 +262,7 @@ struct IntelIOMMUState {

   bool caching_mode;  /* RO - is cap CM enabled? */
   bool scalable_mode; /* RO - is Scalable Mode supported? */
+bool scalable_modern;   /* RO - is modern SM supported? */
   bool snoop_control; /* RO - is SNP filed supported? */

   dma_addr_t root;/* Current root table pointer */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index e3465fc27d..c1382a5651 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3872,7 +3872,13 @@ static bool vtd_check_hiod(IntelIOMMUState *s, 
HostIOMMUDevice *hiod,
   return false;
   }

-return true;
+if (!s->scalable_modern) {
+/* All checks requested by VTD non-modern mode pass */
+return true;
+}
+
+error_setg(errp, "host device is unsupported in scalable modern mode yet");
+return false;
   }

   static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
@@ -4262,9 +4268,9 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error 
**errp)
   }
   }

-/* Currently only address widths supported are 39 and 48 bits */
   if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
-(s->aw_bits != VTD_HOST_AW_48BIT)) {
+(s->aw_bits != VTD_HOST_AW_48BIT) &&
+!s->scalable_modern) {

Why does scalable_modern allow a value other than 39 or 48?
Is it safe?


The check for scalable_modern is in patch 14:

    if ((s->aw_bits != VTD_HOST_AW_48BIT) && s->scalable_modern) {
        error_setg(errp, "Supported values for aw-bits are: %d",
                   VTD_HOST_AW_48BIT);
        return false;
    }

Let me know if you prefer to move it into this patch.

Thanks

Zhenzhong

Re: [PATCH v2 15/17] intel_iommu: Modify x-scalable-mode to be string option to expose scalable modern mode

2024-08-08 Thread Duan, Zhenzhong




On 8/6/2024 2:34 PM, CLEMENT MATHIEU--DRIF wrote:



On 05/08/2024 08:27, Zhenzhong Duan wrote:



From: Yi Liu

Intel VT-d 3.0 introduces scalable mode, which comes with a number of
capabilities related to scalable mode translation, so many combinations
are possible. This vIOMMU implementation simplifies things for the user by
providing typical combinations, selected with the "x-scalable-mode" option.
The usage is as below:

"-device intel-iommu,x-scalable-mode=["legacy"|"modern"|"off"]"

  - "legacy": gives support for stage-2 page table
  - "modern": gives support for stage-1 page table
  - "off": no scalable mode support
  - any other string will throw an error

If x-scalable-mode is not configured, it is equivalent to x-scalable-mode=off.

With scalable modern mode exposed to the user, also make the pasid entry
check in vtd_pe_type_check() more accurate.

Signed-off-by: Yi Liu
Signed-off-by: Yi Sun
Signed-off-by: Zhenzhong Duan
---
  hw/i386/intel_iommu_internal.h |  2 ++
  include/hw/i386/intel_iommu.h  |  1 +
  hw/i386/intel_iommu.c  | 46 ++
  3 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 52bdbf3bc5..af99deb4cd 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -195,6 +195,7 @@
  #define VTD_ECAP_PASID  (1ULL << 40)
  #define VTD_ECAP_SMTS   (1ULL << 43)
  #define VTD_ECAP_SLTS   (1ULL << 46)
+#define VTD_ECAP_FLTS   (1ULL << 47)

  /* CAP_REG */
  /* (offset >> 4) << 24 */
@@ -211,6 +212,7 @@
  #define VTD_CAP_SLLPS   ((1ULL << 34) | (1ULL << 35))
  #define VTD_CAP_DRAIN_WRITE (1ULL << 54)
  #define VTD_CAP_DRAIN_READ  (1ULL << 55)
+#define VTD_CAP_FS1GP   (1ULL << 56)
  #define VTD_CAP_DRAIN   (VTD_CAP_DRAIN_READ | VTD_CAP_DRAIN_WRITE)
  #define VTD_CAP_CM  (1ULL << 7)
  #define VTD_PASID_ID_SHIFT  20
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 48134bda11..650641544c 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -263,6 +263,7 @@ struct IntelIOMMUState {

  bool caching_mode;  /* RO - is cap CM enabled? */
  bool scalable_mode; /* RO - is Scalable Mode supported? */
+char *scalable_mode_str;/* RO - admin's Scalable Mode config */
  bool scalable_modern;   /* RO - is modern SM supported? */
  bool snoop_control; /* RO - is SNP filed supported? */

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 5469ab4f9b..9e973bd710 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -803,16 +803,18 @@ static inline bool 
vtd_is_fl_level_supported(IntelIOMMUState *s, uint32_t level)
  }

  /* Return true if check passed, otherwise false */
-static inline bool vtd_pe_type_check(X86IOMMUState *x86_iommu,
- VTDPASIDEntry *pe)
+static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
  {
  switch (VTD_PE_GET_TYPE(pe)) {
-case VTD_SM_PASID_ENTRY_SLT:
-return true;
-case VTD_SM_PASID_ENTRY_PT:
-return x86_iommu->pt_supported;
  case VTD_SM_PASID_ENTRY_FLT:
+return !!(s->ecap & VTD_ECAP_FLTS);
+case VTD_SM_PASID_ENTRY_SLT:
+return !!(s->ecap & VTD_ECAP_SLTS) || !(s->ecap & VTD_ECAP_SMTS);

Can '!(s->ecap & VTD_ECAP_SMTS)' evaluate to true in this function
even though we have found a pasid entry?


Good suggestion, it's unnecessary, I'll drop that check.

Thanks

Zhenzhong
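
To make the outcome of this exchange concrete, here is a small sketch of the
string option and of the PASID entry type check with the SMTS term dropped.
The three VTD_ECAP_* bits are copied from the patch. The mapping of each mode
to capability bits is an assumption based on the commit message ("legacy"
gives stage-2, "modern" gives stage-1), and the enum tags and both function
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VTD_ECAP_SMTS (1ULL << 43)
#define VTD_ECAP_SLTS (1ULL << 46)
#define VTD_ECAP_FLTS (1ULL << 47)

/* Illustrative PASID entry type tags; not the real encoding. */
enum pe_type { PE_FLT, PE_SLT, PE_PT, PE_OTHER };

/* Returns false for an unknown string; an unset option means "off". */
bool scalable_mode_to_ecap(const char *mode, uint64_t *ecap)
{
    if (mode == NULL || strcmp(mode, "off") == 0) {
        *ecap = 0;
    } else if (strcmp(mode, "legacy") == 0) {
        *ecap = VTD_ECAP_SMTS | VTD_ECAP_SLTS;   /* stage-2 page table */
    } else if (strcmp(mode, "modern") == 0) {
        *ecap = VTD_ECAP_SMTS | VTD_ECAP_FLTS;   /* stage-1 page table */
    } else {
        return false;
    }
    return true;
}

/* Return true if the PASID entry type is usable with these capabilities. */
bool pe_type_ok(uint64_t ecap, bool pt_supported, enum pe_type type)
{
    switch (type) {
    case PE_FLT:
        return !!(ecap & VTD_ECAP_FLTS);
    case PE_SLT:
        /* No SMTS test here: a PASID entry implies scalable mode. */
        return !!(ecap & VTD_ECAP_SLTS);
    case PE_PT:
        return pt_supported;
    default:
        return false;
    }
}

/* Example: modern mode accepts first-level entries only. */
void scalable_mode_example(void)
{
    uint64_t ecap = 0;

    assert(scalable_mode_to_ecap("modern", &ecap));
    assert(pe_type_ok(ecap, false, PE_FLT));
    assert(!pe_type_ok(ecap, false, PE_SLT));
    assert(!scalable_mode_to_ecap("bogus", &ecap));
}
```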

RE: [PATCH v1 11/17] intel_iommu: Extract device IOTLB invalidation logic

2024-07-24 Thread Duan, Zhenzhong

Sure, thanks for the reminder.

BRs.
Zhenzhong

>-Original Message-
>From: CLEMENT MATHIEU--DRIF 
>Subject: Re: [PATCH v1 11/17] intel_iommu: Extract device IOTLB
>invalidation logic
>
>Hi Zhenzhong,
>
>This patch has been merged into staging this morning, be careful when
>re-sending your series.
>Here is the link :
>https://github.com/qemu/qemu/commit/6410f877f5ed535acd01bbfaa4baec379e44d0ef#diff-c19adbf518f644e9b651b67266802e14787292ab9d6cd4210b4f974585be6009


RE: [PATCH v1 17/17] tests/qtest: Add intel-iommu test

2024-07-23 Thread Duan, Zhenzhong



>-Original Message-
>From: CLEMENT MATHIEU--DRIF 
>Subject: Re: [PATCH v1 17/17] tests/qtest: Add intel-iommu test
>
>
>
>On 18/07/2024 10:16, Zhenzhong Duan wrote:
>>
>>
>> Add the framework to test the intel-iommu device.
>>
>> Currently only tested cap/ecap bits correctness in scalable
>> modern mode. Also tested cap/ecap bits consistency before
>> and after system reset.
>>
>> Signed-off-by: Zhenzhong Duan 
>> ---
>>   MAINTAINERS|  1 +
>>   include/hw/i386/intel_iommu.h  |  1 +
>>   tests/qtest/intel-iommu-test.c | 71
>++
>>   tests/qtest/meson.build|  1 +
>>   4 files changed, 74 insertions(+)
>>   create mode 100644 tests/qtest/intel-iommu-test.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 7d9811458c..ec765bf3d3 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -3662,6 +3662,7 @@ S: Supported
>>   F: hw/i386/intel_iommu.c
>>   F: hw/i386/intel_iommu_internal.h
>>   F: include/hw/i386/intel_iommu.h
>> +F: tests/qtest/intel-iommu-test.c
>>
>>   AMD-Vi Emulation
>>   S: Orphan
>> diff --git a/include/hw/i386/intel_iommu.h
>b/include/hw/i386/intel_iommu.h
>> index 650641544c..b1848dbec6 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -47,6 +47,7 @@ OBJECT_DECLARE_SIMPLE_TYPE(IntelIOMMUState,
>INTEL_IOMMU_DEVICE)
>>   #define VTD_HOST_AW_48BIT   48
>>   #define VTD_HOST_AW_AUTO0xff
>>   #define VTD_HAW_MASK(aw)((1ULL << (aw)) - 1)
>> +#define VTD_MGAW_FROM_CAP(cap)  ((cap >> 16) & 0x3fULL)
>>
>>   #define DMAR_REPORT_F_INTR  (1)
>>
>> diff --git a/tests/qtest/intel-iommu-test.c b/tests/qtest/intel-iommu-test.c
>> new file mode 100644
>> index 00..8e07034f6f
>> --- /dev/null
>> +++ b/tests/qtest/intel-iommu-test.c
>> @@ -0,0 +1,71 @@
>> +/*
>> + * QTest testcase for intel-iommu
>> + *
>> + * Copyright (c) 2024 Intel, Inc.
>> + *
>> + * Author: Zhenzhong Duan 
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "libqtest.h"
>> +#include "hw/i386/intel_iommu_internal.h"
>> +
>> +#define CAP_MODERN_FIXED1(VTD_CAP_FRO | VTD_CAP_NFR |
>VTD_CAP_ND | \
>> +  VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS)
>> +#define ECAP_MODERN_FIXED1   (VTD_ECAP_QI |  VTD_ECAP_IRO |
>VTD_ECAP_MHMV | \
>> +  VTD_ECAP_SMTS | VTD_ECAP_FLTS)
>> +
>> +static inline uint32_t vtd_reg_readl(QTestState *s, uint64_t offset)
>> +{
>> +return qtest_readl(s, Q35_HOST_BRIDGE_IOMMU_ADDR + offset);
>> +}
>> +
>> +static inline uint64_t vtd_reg_readq(QTestState *s, uint64_t offset)
>> +{
>> +return qtest_readq(s, Q35_HOST_BRIDGE_IOMMU_ADDR + offset);
>> +}
>> +
>> +static void test_intel_iommu_modern(void)
>> +{
>> +uint8_t init_csr[DMAR_REG_SIZE]; /* register values */
>> +uint8_t post_reset_csr[DMAR_REG_SIZE]; /* register values */
>> +uint64_t cap, ecap, tmp;
>> +QTestState *s;
>> +
>> +s = qtest_init("-M q35 -device intel-iommu,x-scalable-mode=modern");
>> +
>> +cap = vtd_reg_readq(s, DMAR_CAP_REG);
>> +g_assert((cap & CAP_MODERN_FIXED1) == CAP_MODERN_FIXED1);
>> +
>> +tmp = cap & VTD_CAP_SAGAW_MASK;
>> +g_assert(tmp == (VTD_CAP_SAGAW_39bit | VTD_CAP_SAGAW_48bit));
>> +
>> +tmp = VTD_MGAW_FROM_CAP(cap);
>> +g_assert(tmp == VTD_HOST_AW_48BIT - 1);
>> +
>> +ecap = vtd_reg_readq(s, DMAR_ECAP_REG);
>> +g_assert((ecap & ECAP_MODERN_FIXED1) == ECAP_MODERN_FIXED1);
>> +g_assert(ecap & VTD_ECAP_IR);
>Can we add VTD_ECAP_IR to ECAP_MODERN_FIXED1?

Will do.

Thanks
Zhenzhong

>> +
>> +qtest_memread(s, Q35_HOST_BRIDGE_IOMMU_ADDR, init_csr,
>DMAR_REG_SIZE);
>> +
>> +qobject_unref(qtest_qmp(s, "{ 'execute': 'system_reset' }"));
>> +qtest_qmp_eventwait(s, "RESET");
>> +
>> +qtest_memread(s, Q35_HOST_BRIDGE_IOMMU_ADDR, post_reset_csr,
>DMAR_REG_SIZE);
>> +/* Ensure registers are consistent after hard reset */
>> +g_assert(!memcmp(init_csr, post_reset_csr, DMAR_REG_SIZE));
>> +
>> +qtest_quit(s);
>> +}
>> +
>> +int main(int argc, char **argv)
>> +{
>> +g_test_init(&argc, &argv, NULL);
>> +qtest_add_func("/q35/intel-iommu/modern",
>test_intel_iommu_modern);
>> +
>> +return g_test_run();
>> +}
>> diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
>> index 6508bfb1a2..20d05d471b 100644
>> --- a/tests/qtest/meson.build
>> +++ b/tests/qtest/meson.build
>> @@ -79,6 +79,7 @@ qtests_i386 = \
>> (config_all_devices.has_key('CONFIG_SB16') ? ['fuzz-sb16-test'] : []) +
>\
>> (config_all_devices.has_key('CONFIG_SDHCI_PCI') ? ['fuzz-sdcard-test'] :
>[]) +\
>> (config_all_devices.has_key('CONFIG_ESP_PCI')
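
The MGAW assertion in the test above depends on the CAP register holding the
maximum guest address width minus one in bits 21:16, which is what
VTD_MGAW_FROM_CAP extracts. The sketch below shows the round trip;
cap_with_mgaw() is a hypothetical inverse written only for this example, and
the macro is given parentheses around its argument so that it also behaves
with expression arguments.

```c
#include <assert.h>
#include <stdint.h>

#define VTD_HOST_AW_39BIT 39
#define VTD_HOST_AW_48BIT 48

/* Same extraction as the patch, with the argument parenthesized. */
#define VTD_MGAW_FROM_CAP(cap)  (((cap) >> 16) & 0x3fULL)

/* Hypothetical inverse: store (aw_bits - 1) into bits 21:16 of cap. */
uint64_t cap_with_mgaw(uint64_t cap, unsigned aw_bits)
{
    assert(aw_bits >= 1 && aw_bits <= 64);   /* field is 6 bits wide */
    cap &= ~(0x3fULL << 16);                 /* clear the old MGAW field */
    return cap | ((uint64_t)(aw_bits - 1) << 16);
}

/* Example: a 48-bit width is reported as 47, as the qtest expects. */
void mgaw_example(void)
{
    uint64_t cap = cap_with_mgaw(0, VTD_HOST_AW_48BIT);

    assert(VTD_MGAW_FROM_CAP(cap) == VTD_HOST_AW_48BIT - 1);
    /* Expression arguments work because of the added parentheses. */
    assert(VTD_MGAW_FROM_CAP(cap | 0x1ULL) == 47);
}
```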

RE: [PATCH v1 14/17] intel_iommu: piotlb invalidation should notify unmap

2024-07-23 Thread Duan, Zhenzhong



>-Original Message-
>From: CLEMENT MATHIEU--DRIF 
>Subject: Re: [PATCH v1 14/17] intel_iommu: piotlb invalidation should
>notify unmap
>
>
>
>On 24/07/2024 07:45, CLEMENT MATHIEU--DRIF wrote:
>> Maybe I'm missing something but why do we invalidate device IOTLB
>> upon piotlb receipt of a regular IOTLB inv desc?
>> I don't get why we don't wait for a device IOTLB inv desc?
>I thought you were planning to remove that after the last rfc version

Look at vtd_iotlb_page_invalidate(); it does the same thing.
The reason is that even if we don't enable device IOTLB, devices such as vhost
may still cache IOTLB entries, so we need to flush those stale IOTLB entries
in this case.

Thanks
Zhenzhong

>>
>> On 18/07/2024 10:16, Zhenzhong Duan wrote:
>>>
>>>
>>> This is used by some emulated devices which cache address
>>> translation results. When a piotlb invalidation is issued in the guest,
>>> those caches should be refreshed.
>>>
>>> Signed-off-by: Yi Sun 
>>> Signed-off-by: Zhenzhong Duan 
>>> ---
>>>hw/i386/intel_iommu.c | 35
>++-
>>>1 file changed, 34 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index 8b66d6cfa5..c0116497b1 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -2910,7 +2910,7 @@ static void
>vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
>>>continue;
>>>}
>>>
>>> -if (!s->scalable_modern) {
>>> +if (!s->scalable_modern || !vtd_as_has_map_notifier(vtd_as)) {
>>>vtd_address_space_sync(vtd_as);
>>>}
>>>}
>>> @@ -2922,6 +2922,9 @@ static void
>vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>>>   bool ih)
>>>{
>>>VTDIOTLBPageInvInfo info;
>>> +VTDAddressSpace *vtd_as;
>>> +VTDContextEntry ce;
>>> +hwaddr size = (1 << am) * VTD_PAGE_SIZE;
>>>
>>>info.domain_id = domain_id;
>>>info.pasid = pasid;
>>> @@ -2932,6 +2935,36 @@ static void
>vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>>>g_hash_table_foreach_remove(s->iotlb,
>>>vtd_hash_remove_by_page_piotlb, &info);
>>>vtd_iommu_unlock(s);
>>> +
>>> +QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
>>> +if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>>> +  vtd_as->devfn, &ce) &&
>>> +domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
>>> +uint32_t rid2pasid = VTD_CE_GET_RID2PASID(&ce);
>>> +IOMMUTLBEvent event;
>>> +
>>> +if ((vtd_as->pasid != PCI_NO_PASID || pasid != rid2pasid) &&
>>> +vtd_as->pasid != pasid) {
>>> +continue;
>>> +}
>>> +
>>> +/*
>>> + * Page-Selective-within-PASID PASID-based-IOTLB Invalidation
>>> + * does not flush stage-2 entries. See spec section 6.5.2.4
>>> + */
>>> +if (!s->scalable_modern) {
>>> +continue;
>>> +}
>>> +
>>> +event.type = IOMMU_NOTIFIER_UNMAP;
>>> +event.entry.target_as = &address_space_memory;
>>> +event.entry.iova = addr;
>>> +event.entry.perm = IOMMU_NONE;
>>> +event.entry.addr_mask = size - 1;
>>> +event.entry.translated_addr = 0;
>>> +memory_region_notify_iommu(&vtd_as->iommu, 0, event);
>>> +}
>>> +}
>>>}
>>>
>>>static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
>>> --
>>> 2.34.1
>>>
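
Two pieces of the quoted patch are easy to misread: which address spaces a
descriptor's PASID applies to, and how the address mask turns into an unmap
range. The sketch below restates both as standalone helpers. The function
names are hypothetical, and the values chosen for VTD_PAGE_SIZE (4 KiB) and
PCI_NO_PASID (UINT32_MAX) are assumptions about the QEMU definitions. The
shift is done in 64 bits so the result does not depend on am being small.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VTD_PAGE_SIZE 4096ULL      /* assumed 4 KiB base page */
#define PCI_NO_PASID  UINT32_MAX   /* assumed "no PASID" marker */

/*
 * Negation of the patch's "continue" condition: an address space matches
 * when it carries the descriptor's PASID, or when it has no PASID of its
 * own and the descriptor targets the context entry's RID2PASID.
 */
bool piotlb_as_matches(uint32_t as_pasid, uint32_t rid2pasid, uint32_t pasid)
{
    return (as_pasid == PCI_NO_PASID && pasid == rid2pasid) ||
           as_pasid == pasid;
}

/* Bytes covered by an invalidation with address mask am: 2^am pages. */
uint64_t piotlb_range_size(uint8_t am)
{
    assert(am < 52);               /* keep 2^am * 4 KiB within 64 bits */
    return (1ULL << am) * VTD_PAGE_SIZE;
}

/* addr_mask as used for the unmap event: size - 1. */
uint64_t piotlb_addr_mask(uint8_t am)
{
    return piotlb_range_size(am) - 1;
}

/* Example: am = 9 covers 512 pages, i.e. one 2 MiB region. */
void piotlb_notify_example(void)
{
    assert(piotlb_range_size(0) == 4096);
    assert(piotlb_range_size(9) == 2ULL * 1024 * 1024);
    assert(piotlb_addr_mask(9) == 0x1fffff);
    assert(piotlb_as_matches(PCI_NO_PASID, 7, 7));
    assert(!piotlb_as_matches(PCI_NO_PASID, 7, 8));
    assert(piotlb_as_matches(8, 7, 8));
}
```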

RE: [PATCH v1 04/17] intel_iommu: Flush stage-2 cache in PADID-selective PASID-based iotlb invalidation

2024-07-23 Thread Duan, Zhenzhong



>-Original Message-
>From: CLEMENT MATHIEU--DRIF 
>Subject: Re: [PATCH v1 04/17] intel_iommu: Flush stage-2 cache in PADID-
>selective PASID-based iotlb invalidation
>
>
>
>On 24/07/2024 04:59, Duan, Zhenzhong wrote:
>>
>>
>>> -Original Message-
>>> From: CLEMENT MATHIEU--DRIF 
>>> Subject: Re: [PATCH v1 04/17] intel_iommu: Flush stage-2 cache in
>PADID-
>>> selective PASID-based iotlb invalidation
>>>
>>>
>>>
>>> On 18/07/2024 10:16, Zhenzhong Duan wrote:
>>>>
>>>> Per spec 6.5.2.4, PASID-selective PASID-based iotlb invalidation will
>>>> flush stage-2 iotlb entries with matching domain id and pasid.
>>>>
>>>> With scalable modern mode introduced, the guest could send a
>>>> PASID-selective PASID-based iotlb invalidation to flush both stage-1
>>>> and stage-2 entries.
>>>>
>>>> Signed-off-by: Zhenzhong Duan 
>>>> ---
>>>>hw/i386/intel_iommu_internal.h | 10 +
>>>>hw/i386/intel_iommu.c  | 78
>>> ++
>>>>2 files changed, 88 insertions(+)
>>>>
>>>> diff --git a/hw/i386/intel_iommu_internal.h
>>> b/hw/i386/intel_iommu_internal.h
>>>> index 4e0331caba..f71fc91234 100644
>>>> --- a/hw/i386/intel_iommu_internal.h
>>>> +++ b/hw/i386/intel_iommu_internal.h
>>>> @@ -440,6 +440,16 @@ typedef union VTDInvDesc VTDInvDesc;
>>>>(0x3800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM |
>>> VTD_SL_TM)) : \
>>>>(0x3800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
>>>>
>>>> +#define VTD_INV_DESC_PIOTLB_ALL_IN_PASID  (2ULL << 4)
>>>> +#define VTD_INV_DESC_PIOTLB_PSI_IN_PASID  (3ULL << 4)
>>>> +
>>>> +#define VTD_INV_DESC_PIOTLB_RSVD_VAL0 0xfff0ffc0ULL
>>>> +#define VTD_INV_DESC_PIOTLB_RSVD_VAL1 0xf80ULL
>>>> +
>>>> +#define VTD_INV_DESC_PIOTLB_PASID(val)(((val) >> 32) & 0xfULL)
>>>> +#define VTD_INV_DESC_PIOTLB_DID(val)  (((val) >> 16) & \
>>>> + VTD_DOMAIN_ID_MASK)
>>>> +
>>>>/* Information about page-selective IOTLB invalidate */
>>>>struct VTDIOTLBPageInvInfo {
>>>>uint16_t domain_id;
>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>> index 40cbd4a0f4..075a27adac 100644
>>>> --- a/hw/i386/intel_iommu.c
>>>> +++ b/hw/i386/intel_iommu.c
>>>> @@ -2659,6 +2659,80 @@ static bool
>>> vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
>>>>return true;
>>>>}
>>>>
>>>> +static gboolean vtd_hash_remove_by_pasid(gpointer key, gpointer
>value,
>>>> + gpointer user_data)
>>>> +{
>>>> +VTDIOTLBEntry *entry = (VTDIOTLBEntry *)value;
>>>> +VTDIOTLBPageInvInfo *info = (VTDIOTLBPageInvInfo *)user_data;
>>>> +
>>>> +return ((entry->domain_id == info->domain_id) &&
>>>> +(entry->pasid == info->pasid));
>>>> +}
>>>> +
>>>> +static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
>>>> +uint16_t domain_id, uint32_t 
>>>> pasid)
>>>> +{
>>>> +VTDIOTLBPageInvInfo info;
>>>> +VTDAddressSpace *vtd_as;
>>>> +VTDContextEntry ce;
>>>> +
>>>> +info.domain_id = domain_id;
>>>> +info.pasid = pasid;
>>>> +
>>>> +vtd_iommu_lock(s);
>>>> +g_hash_table_foreach_remove(s->iotlb,
>vtd_hash_remove_by_pasid,
>>>> +&info);
>>>> +vtd_iommu_unlock(s);
>>>> +
>>>> +QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
>>>> +if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>>>> +  vtd_as->devfn, &ce) &&
>>&

RE: [PATCH v1 04/17] intel_iommu: Flush stage-2 cache in PADID-selective PASID-based iotlb invalidation

2024-07-23 Thread Duan, Zhenzhong



>-Original Message-
>From: CLEMENT MATHIEU--DRIF 
>Subject: Re: [PATCH v1 04/17] intel_iommu: Flush stage-2 cache in PADID-
>selective PASID-based iotlb invalidation
>
>
>
>On 18/07/2024 10:16, Zhenzhong Duan wrote:
>>
>>
>> Per spec 6.5.2.4, PASID-selective PASID-based iotlb invalidation will
>> flush stage-2 iotlb entries with matching domain id and pasid.
>>
>> With scalable modern mode introduced, the guest could send a
>> PASID-selective PASID-based iotlb invalidation to flush both stage-1
>> and stage-2 entries.
>>
>> Signed-off-by: Zhenzhong Duan 
>> ---
>>   hw/i386/intel_iommu_internal.h | 10 +
>>   hw/i386/intel_iommu.c  | 78
>++
>>   2 files changed, 88 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index 4e0331caba..f71fc91234 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -440,6 +440,16 @@ typedef union VTDInvDesc VTDInvDesc;
>>   (0x3800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM |
>VTD_SL_TM)) : \
>>   (0x3800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
>>
>> +#define VTD_INV_DESC_PIOTLB_ALL_IN_PASID  (2ULL << 4)
>> +#define VTD_INV_DESC_PIOTLB_PSI_IN_PASID  (3ULL << 4)
>> +
>> +#define VTD_INV_DESC_PIOTLB_RSVD_VAL0 0xfff0ffc0ULL
>> +#define VTD_INV_DESC_PIOTLB_RSVD_VAL1 0xf80ULL
>> +
>> +#define VTD_INV_DESC_PIOTLB_PASID(val)(((val) >> 32) & 0xfULL)
>> +#define VTD_INV_DESC_PIOTLB_DID(val)  (((val) >> 16) & \
>> + VTD_DOMAIN_ID_MASK)
>> +
>>   /* Information about page-selective IOTLB invalidate */
>>   struct VTDIOTLBPageInvInfo {
>>   uint16_t domain_id;
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 40cbd4a0f4..075a27adac 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -2659,6 +2659,80 @@ static bool
>vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
>>   return true;
>>   }
>>
>> +static gboolean vtd_hash_remove_by_pasid(gpointer key, gpointer value,
>> + gpointer user_data)
>> +{
>> +VTDIOTLBEntry *entry = (VTDIOTLBEntry *)value;
>> +VTDIOTLBPageInvInfo *info = (VTDIOTLBPageInvInfo *)user_data;
>> +
>> +return ((entry->domain_id == info->domain_id) &&
>> +(entry->pasid == info->pasid));
>> +}
>> +
>> +static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
>> +uint16_t domain_id, uint32_t pasid)
>> +{
>> +VTDIOTLBPageInvInfo info;
>> +VTDAddressSpace *vtd_as;
>> +VTDContextEntry ce;
>> +
>> +info.domain_id = domain_id;
>> +info.pasid = pasid;
>> +
>> +vtd_iommu_lock(s);
>> +g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_pasid,
>> +&info);
>> +vtd_iommu_unlock(s);
>> +
>> +QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
>> +if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>> +  vtd_as->devfn, &ce) &&
>> +domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
>> +uint32_t rid2pasid = VTD_CE_GET_RID2PASID(&ce);
>> +
>> +if ((vtd_as->pasid != PCI_NO_PASID || pasid != rid2pasid) &&
>> +vtd_as->pasid != pasid) {
>> +continue;
>> +}
>> +
>> +if (!s->scalable_modern) {
>> +vtd_address_space_sync(vtd_as);
>> +}
>> +}
>> +}
>> +}
>> +
>> +static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
>> +VTDInvDesc *inv_desc)
>> +{
>> +uint16_t domain_id;
>> +uint32_t pasid;
>> +
>> +if ((inv_desc->val[0] & VTD_INV_DESC_PIOTLB_RSVD_VAL0) ||
>> +(inv_desc->val[1] & VTD_INV_DESC_PIOTLB_RSVD_VAL1)) {
>> +error_report_once("non-zero-field-in-piotlb_inv_desc hi: 0x%"
>PRIx64
>> +  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
>This error is not formatted as the other similar messages we print when
>reserved bits are non-zero.
>Here is what we've done in vtd_process_iotlb_desc:

Sure, will change as below,

>
>     error_report_once("%s: invalid iotlb inv desc: hi=0x%"PRIx64
>   ", lo=0x%"PRIx64" (reserved bits unzero)",
>   __func__, inv_desc->hi, inv_desc->lo);
>> +return false;
>> +}
>> +
>> +domain_id = VTD_INV_DESC_PIOTLB_DID(inv_desc->val[0]);
>> +pasid = VTD_INV_DESC_PIOTLB_PASID(inv_desc->val[0]);
>> +switch (inv_desc->val[0] & VTD_INV_DESC_IOTLB_G) {
>Not critical but why don't we have VTD_INV_DESC_PIOTLB_G?

Will add.

>> +case VTD_INV_DESC_PIOTLB_ALL_IN_PASID:
>> +vtd_piotlb_pasid_invalidate(s, do
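
For readers following the thread, here is a minimal, self-contained sketch of the PASID-selective flush discussed above. It is not the QEMU code: the SKETCH_* names, the array-based cache and the field widths (granularity in bits 5:4, domain id in bits 31:16, PASID in bits 51:32) are assumptions made for illustration; only the matching rule (domain id and PASID both equal) is taken from vtd_hash_remove_by_pasid() in the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed field layout of the low 64 bits of the descriptor (illustrative). */
#define SKETCH_PIOTLB_G_MASK       (3ULL << 4)   /* granularity, bits 5:4 */
#define SKETCH_PIOTLB_ALL_IN_PASID (2ULL << 4)
#define SKETCH_PIOTLB_DID(lo)      ((uint16_t)(((lo) >> 16) & 0xffffULL))
#define SKETCH_PIOTLB_PASID(lo)    ((uint32_t)(((lo) >> 32) & 0xfffffULL))

/* Hypothetical stand-in for a cached translation entry. */
typedef struct SketchTlbEntry {
    uint16_t domain_id;
    uint32_t pasid;
} SketchTlbEntry;

/* Same rule as vtd_hash_remove_by_pasid(): both fields must match. */
static bool sketch_entry_matches(const SketchTlbEntry *e,
                                 uint16_t did, uint32_t pasid)
{
    return e->domain_id == did && e->pasid == pasid;
}

/* Drops matching entries in place and returns how many entries remain. */
static size_t sketch_flush_by_pasid(SketchTlbEntry *tlb, size_t n,
                                    uint64_t desc_lo)
{
    size_t kept = 0;

    if ((desc_lo & SKETCH_PIOTLB_G_MASK) != SKETCH_PIOTLB_ALL_IN_PASID) {
        return n; /* other granularities are out of scope for this sketch */
    }

    uint16_t did = SKETCH_PIOTLB_DID(desc_lo);
    uint32_t pasid = SKETCH_PIOTLB_PASID(desc_lo);

    for (size_t i = 0; i < n; i++) {
        if (!sketch_entry_matches(&tlb[i], did, pasid)) {
            tlb[kept++] = tlb[i];
        }
    }
    return kept;
}
```

Masking with a dedicated PIOTLB granularity mask, rather than reusing the IOTLB one, is the point raised in the review comment above.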

RE: [PATCH v1 03/17] intel_iommu: Add a placeholder variable for scalable modern mode

2024-07-23 Thread Duan, Zhenzhong



>-Original Message-
>From: CLEMENT MATHIEU--DRIF 
>Subject: Re: [PATCH v1 03/17] intel_iommu: Add a placeholder variable for
>scalable modern mode
>
>
>
>On 19/07/2024 05:39, Duan, Zhenzhong wrote:
>>
>>
>>> -Original Message-
>>> From: Duan, Zhenzhong
>>> Subject: RE: [PATCH v1 03/17] intel_iommu: Add a placeholder variable
>for
>>> scalable modern mode
>>>
>>>
>>>
>>>> -Original Message-
>>>> From: Liu, Yi L 
>>>> Subject: Re: [PATCH v1 03/17] intel_iommu: Add a placeholder variable
>for
>>>> scalable modern mode
>>>>
>>>> On 2024/7/19 10:47, Duan, Zhenzhong wrote:
>>>>>
>>>>>> -Original Message-
>>>>>> From: CLEMENT MATHIEU--DRIF 
>>>>>> Subject: Re: [PATCH v1 03/17] intel_iommu: Add a placeholder
>variable
>>>> for
>>>>>> scalable modern mode
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 18/07/2024 10:16, Zhenzhong Duan wrote:
>>>>>>>
>>>>>>> Add a new element scalable_mode in IntelIOMMUState to mark
>>>> scalable
>>>>>>> modern mode, this element will be exposed as an intel_iommu
>>> property
>>>>>>> finally.
>>>>>>>
>>>>>>> For now, it's only a placeholder and used for cap/ecap initialization,
>>>>>>> compatibility check and block host device passthrough until nesting
>>>>>>> is supported.
>>>>>>>
>>>>>>> Signed-off-by: Yi Liu 
>>>>>>> Signed-off-by: Zhenzhong Duan 
>>>>>>> ---
>>>>>>> hw/i386/intel_iommu_internal.h |  2 ++
>>>>>>> include/hw/i386/intel_iommu.h  |  1 +
>>>>>>> hw/i386/intel_iommu.c  | 34 +++--
>-
>>> --
>>>> --
>>>>>>> 3 files changed, 26 insertions(+), 11 deletions(-)
>>>>>>>
>>>>>>> diff --git a/hw/i386/intel_iommu_internal.h
>>>>>> b/hw/i386/intel_iommu_internal.h
>>>>>>> index c0ca7b372f..4e0331caba 100644
>>>>>>> --- a/hw/i386/intel_iommu_internal.h
>>>>>>> +++ b/hw/i386/intel_iommu_internal.h
>>>>>>> @@ -195,6 +195,7 @@
>>>>>>> #define VTD_ECAP_PASID  (1ULL << 40)
>>>>>>> #define VTD_ECAP_SMTS   (1ULL << 43)
>>>>>>> #define VTD_ECAP_SLTS   (1ULL << 46)
>>>>>>> +#define VTD_ECAP_FLTS   (1ULL << 47)
>>>>>>>
>>>>>>> /* CAP_REG */
>>>>>>> /* (offset >> 4) << 24 */
>>>>>>> @@ -211,6 +212,7 @@
>>>>>>> #define VTD_CAP_SLLPS   ((1ULL << 34) | (1ULL << 35))
>>>>>>> #define VTD_CAP_DRAIN_WRITE (1ULL << 54)
>>>>>>> #define VTD_CAP_DRAIN_READ  (1ULL << 55)
>>>>>>> +#define VTD_CAP_FS1GP   (1ULL << 56)
>>>>>>> #define VTD_CAP_DRAIN   (VTD_CAP_DRAIN_READ |
>>>>>> VTD_CAP_DRAIN_WRITE)
>>>>>>> #define VTD_CAP_CM  (1ULL << 7)
>>>>>>> #define VTD_PASID_ID_SHIFT  20
>>>>>>> diff --git a/include/hw/i386/intel_iommu.h
>>>>>> b/include/hw/i386/intel_iommu.h
>>>>>>> index 1eb05c29fc..788ed42477 100644
>>>>>>> --- a/include/hw/i386/intel_iommu.h
>>>>>>> +++ b/include/hw/i386/intel_iommu.h
>>>>>>> @@ -262,6 +262,7 @@ struct IntelIOMMUState {
>>>>>>>
>>>>>>> bool caching_mode;  /* RO - is cap CM enabled? */
>>>>>>> bool scalable_mode; /* RO - is Scalable Mode 
>>>>>>> supported?
>>> */

RE: [PATCH v6 4/9] vfio/{iommufd,container}: Invoke HostIOMMUDevice::realize() during attach_device()

2024-07-23 Thread Duan, Zhenzhong



>-Original Message-
>From: Joao Martins 
>Subject: Re: [PATCH v6 4/9] vfio/{iommufd,container}: Invoke
>HostIOMMUDevice::realize() during attach_device()
>
>On 23/07/2024 08:55, Eric Auger wrote:
>>
>>
>> On 7/23/24 09:44, Cédric Le Goater wrote:
>>> On 7/23/24 09:38, Eric Auger wrote:
 Hi Joao,

 On 7/22/24 23:13, Joao Martins wrote:
> Move the HostIOMMUDevice::realize() to be invoked during the attach
> of the device
> before we allocate IOMMUFD hardware pagetable objects (HWPT). This
> allows the use
> of the hw_caps obtained by IOMMU_GET_HW_INFO that essentially
>tell
> if the IOMMU
> behind the device supports dirty tracking.
>
> Note: The HostIOMMUDevice data from legacy backend is static and
> doesn't
> need any information from the (type1-iommu) backend to be
>initialized.
> In contrast however, the IOMMUFD HostIOMMUDevice data requires
>the
> iommufd FD to be connected and having a devid to be able to
> successfully
 Nit: maybe this comment shall be also added in iommufd.c before the
>call
 to vfio_device_hiod_realize() to avoid someone else to move that call
 earlier at some point
> GET_HW_INFO. This means vfio_device_hiod_realize() is called in
> different places within the backend .attach_device() implementation.
>
> Suggested-by: Cédric Le Goater 
> Signed-off-by: Joao Martins 
> Reviewed-by: Zhenzhong Duan 
> ---
>   include/hw/vfio/vfio-common.h |  1 +
>   hw/vfio/common.c  | 16 ++--
>   hw/vfio/container.c   |  4 
>   hw/vfio/helpers.c | 11 +++
>   hw/vfio/iommufd.c |  4 
>   5 files changed, 26 insertions(+), 10 deletions(-)
>
> diff --git a/include/hw/vfio/vfio-common.h
> b/include/hw/vfio/vfio-common.h
> index 1a96678f8c38..4e44b26d3c45 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -242,6 +242,7 @@ void vfio_region_finalize(VFIORegion *region);
>   void vfio_reset_handler(void *opaque);
>   struct vfio_device_info *vfio_get_device_info(int fd);
>   bool vfio_device_is_mdev(VFIODevice *vbasedev);
> +bool vfio_device_hiod_realize(VFIODevice *vbasedev, Error **errp);
>   bool vfio_attach_device(char *name, VFIODevice *vbasedev,
>   AddressSpace *as, Error **errp);
>   void vfio_detach_device(VFIODevice *vbasedev);
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 784e266e6aab..da12cbd56408 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1537,7 +1537,7 @@ bool vfio_attach_device(char *name,
>VFIODevice
> *vbasedev,
>   {
>   const VFIOIOMMUClass *ops =
>
>
>VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_LEGACY));
> -    HostIOMMUDevice *hiod;
> +    HostIOMMUDevice *hiod = NULL;
>     if (vbasedev->iommufd) {
>   ops =
>
>VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_IOMMUF
>D));
> @@ -1545,21 +1545,17 @@ bool vfio_attach_device(char *name,
> VFIODevice *vbasedev,
>     assert(ops);
>   -    if (!ops->attach_device(name, vbasedev, as, errp)) {
> -    return false;
> -    }
>   -    if (vbasedev->mdev) {
> -    return true;
> +    if (!vbasedev->mdev) {
> +    hiod = HOST_IOMMU_DEVICE(object_new(ops-
>>hiod_typename));
> +    vbasedev->hiod = hiod;
>   }
>   -    hiod = HOST_IOMMU_DEVICE(object_new(ops->hiod_typename));
> -    if (!HOST_IOMMU_DEVICE_GET_CLASS(hiod)->realize(hiod,
>vbasedev,
> errp)) {
> +    if (!ops->attach_device(name, vbasedev, as, errp)) {
>   object_unref(hiod);
> -    ops->detach_device(vbasedev);
> +    vbasedev->hiod = NULL;
>   return false;
>   }
> -    vbasedev->hiod = hiod;
>     return true;
>   }
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index 10cb4b4320ac..9ccdb639ac84 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -914,6 +914,10 @@ static bool vfio_legacy_attach_device(const
> char *name, VFIODevice *vbasedev,
>     trace_vfio_attach_device(vbasedev->name, groupid);
>   +    if (!vfio_device_hiod_realize(vbasedev, errp)) {
> +    return false;
 don't you want to go to err_alloc_ioas instead?
>>>
>>> hmm, the err_alloc_ioas label is in a different function
>>> iommufd_cdev_attach().
>>>
>>> maybe you meant the comment for routine iommufd_cdev_attach() and
>>> label err_connect_bind ?
>>>
>>>
>>> Thanks,
>>>
>>> C.
>>>
>>>
> +    }
> +
>   group = vfio_get_group(groupid, as, errp);
>   if (!group) {
>   return false;
> diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
> index 7e23e9080c9d..ea15c79db0a3 100644
> --- a/hw/vfio/helpers.c
>>
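
As an illustration of the ordering this patch moves to, the sketch below creates the host IOMMU device object first, lets the backend attach step realize it at the point that suits the backend, and drops the object again if the attach fails. All names (SketchDevice, sketch_backend_attach, and so on) are hypothetical stand-ins, and calloc/free only mimic object_new/object_unref; this is not the QEMU implementation.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for HostIOMMUDevice and VFIODevice. */
typedef struct SketchHiod {
    bool realized;
} SketchHiod;

typedef struct SketchDevice {
    bool is_mdev;      /* mdevs have no host IOMMU device */
    bool backend_ok;   /* lets the sketch simulate an attach failure */
    SketchHiod *hiod;
} SketchDevice;

/* Stand-in for the backend .attach_device(); it realizes the hiod itself. */
static bool sketch_backend_attach(SketchDevice *dev)
{
    if (!dev->backend_ok) {
        return false;
    }
    if (dev->hiod) {
        dev->hiod->realized = true;
    }
    return true;
}

static bool sketch_attach_device(SketchDevice *dev)
{
    SketchHiod *hiod = NULL;

    if (!dev->is_mdev) {
        hiod = calloc(1, sizeof(*hiod));
        if (!hiod) {
            return false;
        }
        dev->hiod = hiod;
    }

    if (!sketch_backend_attach(dev)) {
        free(hiod);        /* free(NULL) is a no-op on the mdev path */
        dev->hiod = NULL;
        return false;
    }
    return true;
}
```

Because realize now happens inside the backend attach step, the common code no longer needs a detach call on the error path; it only has to release the object it created.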

RE: [PATCH v6 5/9] vfio/iommufd: Probe and request hwpt dirty tracking capability

2024-07-23 Thread Duan, Zhenzhong



>-Original Message-
>From: Cédric Le Goater 
>Subject: Re: [PATCH v6 5/9] vfio/iommufd: Probe and request hwpt dirty
>tracking capability
>
>On 7/23/24 08:13, Joao Martins wrote:
>> On 23/07/2024 06:11, Duan, Zhenzhong wrote:
>>>
>>>
>>>> -Original Message-
>>>> From: Joao Martins 
>>>> Subject: [PATCH v6 5/9] vfio/iommufd: Probe and request hwpt dirty
>>>> tracking capability
>>>>
>>>> In preparation to using the dirty tracking UAPI, probe whether the
>IOMMU
>>>> supports dirty tracking. This is done via the data stored in
>>>> hiod::caps::hw_caps initialized from GET_HW_INFO.
>>>>
>>>> Qemu doesn't know if VF dirty tracking is supported when allocating
>>>> hardware pagetable in iommufd_cdev_autodomains_get(). This is
>because
>>>> VFIODevice migration state hasn't been initialized *yet* hence it can't
>pick
>>>> between VF dirty tracking vs IOMMU dirty tracking. So, if IOMMU
>supports
>>>> dirty tracking it always creates HWPTs with
>>>> IOMMU_HWPT_ALLOC_DIRTY_TRACKING
>>>> even if later on VFIOMigration decides to use VF dirty tracking instead.
>>>>
>>>> Signed-off-by: Joao Martins 
>>>> ---
>>>> include/hw/vfio/vfio-common.h |  2 ++
>>>> hw/vfio/iommufd.c | 20 
>>>> 2 files changed, 22 insertions(+)
>>>>
>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>>>> common.h
>>>> index 4e44b26d3c45..1e02c98b09ba 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -97,6 +97,7 @@ typedef struct IOMMUFDBackend
>IOMMUFDBackend;
>>>>
>>>> typedef struct VFIOIOASHwpt {
>>>>  uint32_t hwpt_id;
>>>> +uint32_t hwpt_flags;
>>>>  QLIST_HEAD(, VFIODevice) device_list;
>>>>  QLIST_ENTRY(VFIOIOASHwpt) next;
>>>> } VFIOIOASHwpt;
>>>> @@ -139,6 +140,7 @@ typedef struct VFIODevice {
>>>>  OnOffAuto pre_copy_dirty_page_tracking;
>>>>  bool dirty_pages_supported;
>>>>  bool dirty_tracking;
>>>> +bool iommu_dirty_tracking;
>>>>  HostIOMMUDevice *hiod;
>>>>  int devid;
>>>>  IOMMUFDBackend *iommufd;
>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>> index 2324bf892c56..7afea0b041ed 100644
>>>> --- a/hw/vfio/iommufd.c
>>>> +++ b/hw/vfio/iommufd.c
>>>> @@ -110,6 +110,11 @@ static void
>>>> iommufd_cdev_unbind_and_disconnect(VFIODevice *vbasedev)
>>>>  iommufd_backend_disconnect(vbasedev->iommufd);
>>>> }
>>>>
>>>> +static bool iommufd_hwpt_dirty_tracking(VFIOIOASHwpt *hwpt)
>>>> +{
>>>> +return hwpt && hwpt->hwpt_flags &
>>>> IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>>>> +}
>>>> +
>>>> static int iommufd_cdev_getfd(const char *sysfs_path, Error **errp)
>>>> {
>>>>  ERRP_GUARD();
>>>> @@ -246,6 +251,17 @@ static bool
>>>> iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>>>  }
>>>>  }
>>>>
>>>> +/*
>>>> + * This is quite early and VFIO Migration state isn't yet fully
>>>> + * initialized, thus rely only on IOMMU hardware capabilities as to
>>>> + * whether IOMMU dirty tracking is going to be requested. Later
>>>> + * vfio_migration_realize() may decide to use VF dirty tracking
>>>> + * instead.
>>>> + */
>>>> +if (vbasedev->hiod->caps.hw_caps &
>>>> IOMMU_HW_CAP_DIRTY_TRACKING) {
>>>> +flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>>>> +}
>>>> +
>>>>  if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
>>>>  container->ioas_id, flags,
>>>>  IOMMU_HWPT_DATA_NONE, 0, NULL,
>>>> @@ -255,6 +271,7 @@ static bool
>>>> iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>>>
>>>>  hwpt = g_malloc0(sizeof(*hwpt));
>>>>  hwpt->hwpt_id = hwpt_id;
>>>> +hwpt->hwpt_flags = flags;
>>>>  QLIST_INIT(&hwpt->device_list);
>>>>
>>>>  ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id,
>errp);
>>>> @@ -265,8 +282,11 @@ static bool
>>>> iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>>>  }
>>>>
>>>>  vbasedev->hwpt = hwpt;
>>>> +vbasedev->iommu_dirty_tracking =
>>>> iommufd_hwpt_dirty_tracking(hwpt);
>>>
>>> Don't we need to do same if attach to existing hwpt?
>>>
>>
>> Nice catch!
>>
>> Yes, we do need it e.g. we will need this fix up for this patch
>
>
>Fixed on vfio-9.1.

Feel free to add my RB,

Reviewed-by: Zhenzhong Duan 

Thanks
Zhenzhong

>
>Thanks,
>
>C.
>
>
>
>>
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 92b976464283..833a7400486c 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -305,6 +305,7 @@ static bool
>iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>   } else {
>>   vbasedev->hwpt = hwpt;
>>   QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>> +vbasedev->iommu_dirty_tracking =
>iommufd_hwpt_dirty_tracking(hwpt);
>>   return true;
>>   }
>>   }
>>
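
To make the fix-up above concrete, here is a small sketch of why both paths need the assignment: the per-device flag is derived from the hwpt's allocation flags, so it has to be set whether the device gets a freshly allocated hwpt or joins an existing one. The Sketch* types and the flag value are stand-ins chosen for illustration, not the kernel UAPI or QEMU definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for IOMMU_HWPT_ALLOC_DIRTY_TRACKING; the value is illustrative. */
#define SKETCH_HWPT_ALLOC_DIRTY_TRACKING (1u << 1)

typedef struct SketchHwpt {
    uint32_t hwpt_id;
    uint32_t hwpt_flags;
} SketchHwpt;

typedef struct SketchVfioDev {
    SketchHwpt *hwpt;
    bool iommu_dirty_tracking;
} SketchVfioDev;

/* Mirrors iommufd_hwpt_dirty_tracking(): NULL-safe flag test. */
static bool sketch_hwpt_dirty_tracking(const SketchHwpt *hwpt)
{
    return hwpt && (hwpt->hwpt_flags & SKETCH_HWPT_ALLOC_DIRTY_TRACKING);
}

/* Used for both a newly allocated hwpt and an already existing one. */
static void sketch_bind_hwpt(SketchVfioDev *dev, SketchHwpt *hwpt)
{
    dev->hwpt = hwpt;
    dev->iommu_dirty_tracking = sketch_hwpt_dirty_tracking(hwpt);
}
```

Routing both paths through one helper is only one way to avoid the omission; the fix-up in the thread adds the assignment directly to the existing-hwpt branch instead.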

RE: [PATCH v5 06/13] vfio/{iommufd,container}: Remove caps::aw_bits

2024-07-22 Thread Duan, Zhenzhong



>-Original Message-
>From: Joao Martins 
>Subject: Re: [PATCH v5 06/13] vfio/{iommufd,container}: Remove
>caps::aw_bits
>
>On 22/07/2024 06:22, Duan, Zhenzhong wrote:
>>
>>
>>> -Original Message-
>>> From: Joao Martins 
>>> Subject: [PATCH v5 06/13] vfio/{iommufd,container}: Remove
>caps::aw_bits
>>>
>>> Remove caps::aw_bits which requires the bcontainer::iova_ranges being
>>> initialized after device is actually attached. Instead defer that to
>>> .get_cap() and call vfio_device_get_aw_bits() directly.
>>>
>>> This is in preparation for HostIOMMUDevice::realize() being called early
>>> during attach_device().
>>>
>>> Suggested-by: Zhenzhong Duan 
>>> Signed-off-by: Joao Martins 
>>> Reviewed-by: Cédric Le Goater
>>> ---
>>> include/sysemu/host_iommu_device.h | 3 ---
>>> backends/iommufd.c | 3 ++-
>>> hw/vfio/container.c| 5 +
>>> hw/vfio/iommufd.c  | 1 -
>>> 4 files changed, 3 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/sysemu/host_iommu_device.h
>>> b/include/sysemu/host_iommu_device.h
>>> index ee6c813c8b22..cdeeccec7671 100644
>>> --- a/include/sysemu/host_iommu_device.h
>>> +++ b/include/sysemu/host_iommu_device.h
>>> @@ -19,12 +19,9 @@
>>>  * struct HostIOMMUDeviceCaps - Define host IOMMU device capabilities.
>>>  *
>>>  * @type: host platform IOMMU type.
>>> - *
>>> - * @aw_bits: host IOMMU address width. 0xff if no limitation.
>>>  */
>>> typedef struct HostIOMMUDeviceCaps {
>>> uint32_t type;
>>> -uint8_t aw_bits;
>>> } HostIOMMUDeviceCaps;
>>>
>>> #define TYPE_HOST_IOMMU_DEVICE "host-iommu-device"
>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>> index a94d3b90c05c..58032e588f49 100644
>>> --- a/backends/iommufd.c
>>> +++ b/backends/iommufd.c
>>> @@ -18,6 +18,7 @@
>>> #include "qemu/error-report.h"
>>> #include "monitor/monitor.h"
>>> #include "trace.h"
>>> +#include "hw/vfio/vfio-common.h"
>>> #include 
>>> #include 
>>>
>>> @@ -270,7 +271,7 @@ static int
>>> hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
>>> case HOST_IOMMU_DEVICE_CAP_IOMMU_TYPE:
>>> return caps->type;
>>> case HOST_IOMMU_DEVICE_CAP_AW_BITS:
>>> -return caps->aw_bits;
>>> +return vfio_device_get_aw_bits(hiod->agent);
>>
>> I just realized there is an open issue here. hiod->agent is not necessarily a VFIO
>device; it can be a VDPA device.
>> May need a bit more work on this.
>>
>
>Broadly speaking I agree, that this needs some sort of IOMMUDevice
>structure
>with an agent type that it needs to abstract from instead of an opaque object.
>
>But feels unrelated to this patch exactly, as the existing code was already
>making assumptions that ::opaque is a VFIODevice.

Currently only VFIODevice is supported, so hiod->agent can only point to a
VFIODevice.
In the future, when VDPA is supported, hiod->agent can point to some kind of
VDPADevice structure after ::realize() initializes it.

But I'm OK with leaving it to VDPA to fix this, as for now hiod->agent only points
to a VFIODevice.

Thanks
Zhenzhong

>
>> Thanks
>> Zhenzhong
>>
>>> default:
>>> error_setg(errp, "%s: unsupported capability %x", hiod->name, cap);
>>> return -EINVAL;
>>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>>> index 88ede913d6f7..c27f448ba26e 100644
>>> --- a/hw/vfio/container.c
>>> +++ b/hw/vfio/container.c
>>> @@ -1144,7 +1144,6 @@ static bool
>>> hiod_legacy_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
>>> VFIODevice *vdev = opaque;
>>>
>>> hiod->name = g_strdup(vdev->name);
>>> -hiod->caps.aw_bits = vfio_device_get_aw_bits(vdev);
>>> hiod->agent = opaque;
>>>
>>> return true;
>>> @@ -1153,11 +1152,9 @@ static bool
>>> hiod_legacy_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
>>> static int hiod_legacy_vfio_get_cap(HostIOMMUDevice *hiod, int cap,
>>> Error **errp)
>>> {
>>> -HostIOMMUDeviceCaps *caps = &hiod->caps;
>>> -
>>> switch (cap) {
>>> case HOST_IOMMU_DEVICE_CAP_AW_BITS:
>>> -return caps->aw_bits;
>>> +return vfio_device_get_aw_bits(hiod->agent);
>>> default:
>>> error_setg(errp, "%s: unsupported capability %x", hiod->name, cap);
>>> return -EINVAL;
>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>> index 545f4a404125..028533bc39b9 100644
>>> --- a/hw/vfio/iommufd.c
>>> +++ b/hw/vfio/iommufd.c
>>> @@ -724,7 +724,6 @@ static bool
>>> hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
>>>
>>> hiod->name = g_strdup(vdev->name);
>>> caps->type = type;
>>> -caps->aw_bits = vfio_device_get_aw_bits(vdev);
>>>
>>> return true;
>>> }
>>> --
>>> 2.17.2
>>
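
The sketch below illustrates the design choice in this patch: instead of caching the address width at realize time, the capability getter asks the agent when it is queried, so realize can run before the agent's ranges are known. Everything here (SketchAgent, sketch_get_cap, the capability numbers) is a hypothetical stand-in, and it shares the assumption discussed above that the agent is always one known device type.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical capability identifiers for the sketch. */
enum { SKETCH_CAP_IOMMU_TYPE = 1, SKETCH_CAP_AW_BITS = 2 };

/* Stand-in for the device behind hiod->agent. */
typedef struct SketchAgent {
    uint8_t aw_bits;   /* 0xff models "no limitation" */
} SketchAgent;

typedef struct SketchHostIommuDev {
    uint32_t type;       /* cached at realize time */
    SketchAgent *agent;  /* queried lazily */
} SketchHostIommuDev;

static int sketch_agent_aw_bits(const SketchAgent *agent)
{
    return agent->aw_bits;
}

/* The address width is computed on demand rather than read from a cache. */
static int sketch_get_cap(const SketchHostIommuDev *hiod, int cap)
{
    switch (cap) {
    case SKETCH_CAP_IOMMU_TYPE:
        return (int)hiod->type;
    case SKETCH_CAP_AW_BITS:
        return sketch_agent_aw_bits(hiod->agent);
    default:
        return -EINVAL;
    }
}
```

A query made after the agent's state changes sees the new value, which is what allows realize to move earlier in the attach sequence.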

RE: [PATCH v6 3/9] vfio/iommufd: Add hw_caps field to HostIOMMUDeviceCaps

2024-07-22 Thread Duan, Zhenzhong



>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v6 3/9] vfio/iommufd: Add hw_caps field to
>HostIOMMUDeviceCaps
>
>Store the value of @caps returned by iommufd_backend_get_device_info()
>in a new field HostIOMMUDeviceCaps::hw_caps. Right now the only value is
>whether device IOMMU supports dirty tracking
>(IOMMU_HW_CAP_DIRTY_TRACKING).
>
>This is in preparation for HostIOMMUDevice::realize() being called early
>during attach_device().
>
>Signed-off-by: Joao Martins 
>Reviewed-by: Cédric Le Goater 

Reviewed-by: Zhenzhong Duan 

Thanks
Zhenzhong

>---
> include/sysemu/host_iommu_device.h | 4 
> hw/vfio/iommufd.c  | 1 +
> 2 files changed, 5 insertions(+)
>
>diff --git a/include/sysemu/host_iommu_device.h
>b/include/sysemu/host_iommu_device.h
>index d1c10ff7c239..809cced4ba5c 100644
>--- a/include/sysemu/host_iommu_device.h
>+++ b/include/sysemu/host_iommu_device.h
>@@ -19,9 +19,13 @@
>  * struct HostIOMMUDeviceCaps - Define host IOMMU device capabilities.
>  *
>  * @type: host platform IOMMU type.
>+ *
>+ * @hw_caps: host platform IOMMU capabilities (e.g. on IOMMUFD this
>represents
>+ *   the @out_capabilities value returned from
>IOMMU_GET_HW_INFO ioctl)
>  */
> typedef struct HostIOMMUDeviceCaps {
> uint32_t type;
>+uint64_t hw_caps;
> } HostIOMMUDeviceCaps;
>
> #define TYPE_HOST_IOMMU_DEVICE "host-iommu-device"
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index 5bb623879abe..5e2fc1ce089d 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -724,6 +724,7 @@ static bool
>hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
>
> hiod->name = g_strdup(vdev->name);
> caps->type = type;
>+caps->hw_caps = hw_caps;
>
> return true;
> }
>--
>2.17.2

RE: [PATCH v6 5/9] vfio/iommufd: Probe and request hwpt dirty tracking capability

2024-07-22 Thread Duan, Zhenzhong




>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v6 5/9] vfio/iommufd: Probe and request hwpt dirty
>tracking capability
>
>In preparation to using the dirty tracking UAPI, probe whether the IOMMU
>supports dirty tracking. This is done via the data stored in
>hiod::caps::hw_caps initialized from GET_HW_INFO.
>
>Qemu doesn't know if VF dirty tracking is supported when allocating
>hardware pagetable in iommufd_cdev_autodomains_get(). This is because
>VFIODevice migration state hasn't been initialized *yet* hence it can't pick
>between VF dirty tracking vs IOMMU dirty tracking. So, if IOMMU supports
>dirty tracking it always creates HWPTs with
>IOMMU_HWPT_ALLOC_DIRTY_TRACKING
>even if later on VFIOMigration decides to use VF dirty tracking instead.
>
>Signed-off-by: Joao Martins 
>---
> include/hw/vfio/vfio-common.h |  2 ++
> hw/vfio/iommufd.c | 20 
> 2 files changed, 22 insertions(+)
>
>diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>common.h
>index 4e44b26d3c45..1e02c98b09ba 100644
>--- a/include/hw/vfio/vfio-common.h
>+++ b/include/hw/vfio/vfio-common.h
>@@ -97,6 +97,7 @@ typedef struct IOMMUFDBackend IOMMUFDBackend;
>
> typedef struct VFIOIOASHwpt {
> uint32_t hwpt_id;
>+uint32_t hwpt_flags;
> QLIST_HEAD(, VFIODevice) device_list;
> QLIST_ENTRY(VFIOIOASHwpt) next;
> } VFIOIOASHwpt;
>@@ -139,6 +140,7 @@ typedef struct VFIODevice {
> OnOffAuto pre_copy_dirty_page_tracking;
> bool dirty_pages_supported;
> bool dirty_tracking;
>+bool iommu_dirty_tracking;
> HostIOMMUDevice *hiod;
> int devid;
> IOMMUFDBackend *iommufd;
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index 2324bf892c56..7afea0b041ed 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -110,6 +110,11 @@ static void
>iommufd_cdev_unbind_and_disconnect(VFIODevice *vbasedev)
> iommufd_backend_disconnect(vbasedev->iommufd);
> }
>
>+static bool iommufd_hwpt_dirty_tracking(VFIOIOASHwpt *hwpt)
>+{
>+return hwpt && hwpt->hwpt_flags &
>IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>+}
>+
> static int iommufd_cdev_getfd(const char *sysfs_path, Error **errp)
> {
> ERRP_GUARD();
>@@ -246,6 +251,17 @@ static bool
>iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
> }
> }
>
>+/*
>+ * This is quite early and VFIO Migration state isn't yet fully
>+ * initialized, thus rely only on IOMMU hardware capabilities as to
>+ * whether IOMMU dirty tracking is going to be requested. Later
>+ * vfio_migration_realize() may decide to use VF dirty tracking
>+ * instead.
>+ */
>+if (vbasedev->hiod->caps.hw_caps &
>IOMMU_HW_CAP_DIRTY_TRACKING) {
>+flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>+}
>+
> if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
> container->ioas_id, flags,
> IOMMU_HWPT_DATA_NONE, 0, NULL,
>@@ -255,6 +271,7 @@ static bool
>iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>
> hwpt = g_malloc0(sizeof(*hwpt));
> hwpt->hwpt_id = hwpt_id;
>+hwpt->hwpt_flags = flags;
> QLIST_INIT(&hwpt->device_list);
>
> ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
>@@ -265,8 +282,11 @@ static bool
>iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
> }
>
> vbasedev->hwpt = hwpt;
>+vbasedev->iommu_dirty_tracking =
>iommufd_hwpt_dirty_tracking(hwpt);

Don't we need to do same if attach to existing hwpt?

> QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
> QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
>+container->bcontainer.dirty_pages_supported |=
>+vbasedev->iommu_dirty_tracking;
> return true;
> }
>
>--
>2.17.2

RE: [PATCH v6 9/9] vfio/common: Allow disabling device dirty page tracking

2024-07-22 Thread Duan, Zhenzhong




>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v6 9/9] vfio/common: Allow disabling device dirty page
>tracking
>
>The property 'x-pre-copy-dirty-page-tracking' allows disabling the whole
>tracking of VF pre-copy phase of dirty page tracking, though it means
>that it will only be used at the start of the switchover phase.
>
>Add an option that disables the VF dirty page tracking, and fall
>back into container-based dirty page tracking. This also allows to
>use IOMMU dirty tracking even on VFs with their own dirty
>tracker scheme.
>
>Signed-off-by: Joao Martins 

Reviewed-by: Zhenzhong Duan 

Thanks
Zhenzhong

>---
> include/hw/vfio/vfio-common.h | 1 +
> hw/vfio/common.c  | 3 +++
> hw/vfio/migration.c   | 4 +++-
> hw/vfio/pci.c | 3 +++
> 4 files changed, 10 insertions(+), 1 deletion(-)
>
>diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>common.h
>index 1e02c98b09ba..fed499b199f0 100644
>--- a/include/hw/vfio/vfio-common.h
>+++ b/include/hw/vfio/vfio-common.h
>@@ -138,6 +138,7 @@ typedef struct VFIODevice {
> VFIOMigration *migration;
> Error *migration_blocker;
> OnOffAuto pre_copy_dirty_page_tracking;
>+OnOffAuto device_dirty_page_tracking;
> bool dirty_pages_supported;
> bool dirty_tracking;
> bool iommu_dirty_tracking;
>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>index da12cbd56408..36d0cf6585b2 100644
>--- a/hw/vfio/common.c
>+++ b/hw/vfio/common.c
>@@ -199,6 +199,9 @@ bool vfio_devices_all_device_dirty_tracking(const
>VFIOContainerBase *bcontainer)
> VFIODevice *vbasedev;
>
> QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
>+if (vbasedev->device_dirty_page_tracking == ON_OFF_AUTO_OFF) {
>+return false;
>+}
> if (!vbasedev->dirty_pages_supported) {
> return false;
> }
>diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>index cbfaef7afffe..262d42a46e58 100644
>--- a/hw/vfio/migration.c
>+++ b/hw/vfio/migration.c
>@@ -1036,7 +1036,9 @@ bool vfio_migration_realize(VFIODevice
>*vbasedev, Error **errp)
> return !vfio_block_migration(vbasedev, err, errp);
> }
>
>-if (!vbasedev->dirty_pages_supported && !vbasedev-
>>iommu_dirty_tracking) {
>+if ((!vbasedev->dirty_pages_supported ||
>+ vbasedev->device_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
>+!vbasedev->iommu_dirty_tracking) {
> if (vbasedev->enable_migration == ON_OFF_AUTO_AUTO) {
> error_setg(&err,
>"%s: VFIO device doesn't support device and "
>diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>index 8c0f212a163e..a0767de54b8d 100644
>--- a/hw/vfio/pci.c
>+++ b/hw/vfio/pci.c
>@@ -3364,6 +3364,9 @@ static Property vfio_pci_dev_properties[] = {
> DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking",
>VFIOPCIDevice,
> vbasedev.pre_copy_dirty_page_tracking,
> ON_OFF_AUTO_ON),
>+DEFINE_PROP_ON_OFF_AUTO("x-device-dirty-page-tracking",
>VFIOPCIDevice,
>+vbasedev.device_dirty_page_tracking,
>+ON_OFF_AUTO_ON),
> DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
> display, ON_OFF_AUTO_OFF),
> DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
>--
>2.17.2
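
A rough sketch of the check this patch extends: device dirty tracking is usable for a container only if every device both supports it and has not had it switched off via the new property. The enum and struct below are simplified stand-ins for OnOffAuto and VFIODevice, and an array replaces the QLIST walk.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for QEMU's OnOffAuto. */
typedef enum { SKETCH_AUTO, SKETCH_ON, SKETCH_OFF } SketchOnOffAuto;

typedef struct SketchDirtyDev {
    SketchOnOffAuto device_dirty_page_tracking; /* models the x- property */
    bool dirty_pages_supported;                 /* device capability */
} SketchDirtyDev;

/* True only if no device opts out and every device supports tracking. */
static bool sketch_all_device_dirty_tracking(const SketchDirtyDev *devs,
                                             size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (devs[i].device_dirty_page_tracking == SKETCH_OFF) {
            return false;
        }
        if (!devs[i].dirty_pages_supported) {
            return false;
        }
    }
    return true;
}
```

When this returns false, the caller is expected to fall back to container-based (for example IOMMU) dirty tracking, as the commit message describes.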

RE: [PATCH v6 8/9] vfio/migration: Don't block migration device dirty tracking is unsupported

2024-07-22 Thread Duan, Zhenzhong




>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v6 8/9] vfio/migration: Don't block migration device dirty
>tracking is unsupported
>
>By default VFIO migration is set to auto, which will support live
>migration if the migration capability is set *and* also dirty page
>tracking is supported.
>
>For testing purposes one can force enable without dirty page tracking
>via enable-migration=on, but that option is generally left for testing
>purposes.
>
>So starting with IOMMU dirty tracking, it can be used to accommodate the lack of
>VF dirty page tracking allowing us to minimize the VF requirements for
>migration and thus enabling migration by default for those too.
>
>While at it change the error messages to mention IOMMU dirty tracking as
>well.
>
>Signed-off-by: Joao Martins 

Reviewed-by: Zhenzhong Duan 

Thanks
Zhenzhong

>---
> hw/vfio/migration.c | 10 +-
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
>diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>index 34d4be2ce1b1..cbfaef7afffe 100644
>--- a/hw/vfio/migration.c
>+++ b/hw/vfio/migration.c
>@@ -1036,16 +1036,16 @@ bool vfio_migration_realize(VFIODevice
>*vbasedev, Error **errp)
> return !vfio_block_migration(vbasedev, err, errp);
> }
>
>-if (!vbasedev->dirty_pages_supported) {
>+if (!vbasedev->dirty_pages_supported && !vbasedev-
>>iommu_dirty_tracking) {
> if (vbasedev->enable_migration == ON_OFF_AUTO_AUTO) {
> error_setg(&err,
>-   "%s: VFIO device doesn't support device dirty 
>tracking",
>-   vbasedev->name);
>+   "%s: VFIO device doesn't support device and "
>+   "IOMMU dirty tracking", vbasedev->name);
> goto add_blocker;
> }
>
>-warn_report("%s: VFIO device doesn't support device dirty tracking",
>-vbasedev->name);
>+warn_report("%s: VFIO device doesn't support device and "
>+"IOMMU dirty tracking", vbasedev->name);
> }
>
> ret = vfio_block_multiple_devices_migration(vbasedev, errp);
>--
>2.17.2
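
The decision this patch changes can be summarized in a few lines. The sketch below is an illustration with hypothetical names, not the QEMU function: migration is only blocked (in auto mode) or warned about (when forced on) if neither the device nor the IOMMU can track dirty pages.

```c
#include <stdbool.h>

/* Simplified stand-in for QEMU's OnOffAuto. */
typedef enum { SKETCH_MIG_AUTO, SKETCH_MIG_ON, SKETCH_MIG_OFF } SketchMigMode;

typedef enum {
    SKETCH_VERDICT_OK,     /* some dirty tracker is available */
    SKETCH_VERDICT_WARN,   /* forced on without any dirty tracker */
    SKETCH_VERDICT_BLOCK   /* auto mode adds a migration blocker */
} SketchVerdict;

static SketchVerdict sketch_dirty_tracking_verdict(bool device_tracking,
                                                   bool iommu_tracking,
                                                   SketchMigMode mode)
{
    if (!device_tracking && !iommu_tracking) {
        return mode == SKETCH_MIG_AUTO ? SKETCH_VERDICT_BLOCK
                                       : SKETCH_VERDICT_WARN;
    }
    return SKETCH_VERDICT_OK;
}
```

Before the patch, the first condition looked at device tracking alone, so a VF without its own tracker was blocked even when the IOMMU could do the job.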

RE: [PATCH v6 1/9] vfio/iommufd: Introduce auto domain creation

2024-07-22 Thread Duan, Zhenzhong




>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v6 1/9] vfio/iommufd: Introduce auto domain creation
>
>There are generally two modes of operation for IOMMUFD:
>
>1) The simple user API which intends to perform relatively simple things
>with IOMMUs e.g. DPDK. The process generally creates an IOAS and attaches
>to VFIO and mainly performs IOAS_MAP and UNMAP.
>
>2) The native IOMMUFD API where you have fine grained control of the
>IOMMU domain and model it accordingly. This is where most new features
>are being steered to.
>
>For dirty tracking 2) is required, as it needs to ensure that
>the stage-2/parent IOMMU domain will only attach devices
>that support dirty tracking (so far it is all homogeneous in x86, likely
>not the case for smmuv3). Such invariant on dirty tracking provides a
>useful guarantee to VMMs that will refuse incompatible device
>attachments for IOMMU domains.
>
>Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>responsible for creating an IOMMU domain. This is in contrast to the
>'simple API' where the IOMMU domain is created by IOMMUFD
>automatically
>when it attaches to VFIO (usually referred to as autodomains) but it has
>the needed handling for mdevs.
>
>To support dirty tracking with the advanced IOMMUFD API, it needs
>similar logic, where IOMMU domains are created and devices attached to
>compatible domains. Essentially mimicking kernel
>iommufd_device_auto_get_domain(). With mdevs given there's no IOMMU
>domain
>it falls back to IOAS attach.
>
>The auto domain logic allows different IOMMU domains to be created when
>DMA dirty tracking is not desired (and VF can provide it), and others where
>it is. Here it is not used in this way given how VFIODevice migration
>state is initialized after the device attachment. But such mixed mode of
>IOMMU dirty tracking + device dirty tracking is an improvement that can
>be added on. Keep the 'all or nothing' type1 approach that we have
>been using so far between container vs device dirty tracking.
>
>Signed-off-by: Joao Martins 
>---
> include/hw/vfio/vfio-common.h |  9 
> include/sysemu/iommufd.h  |  5 +++
> backends/iommufd.c| 30 +
> hw/vfio/iommufd.c | 84
>+++
> backends/trace-events |  1 +
> 5 files changed, 129 insertions(+)
>
>diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>common.h
>index 98acae8c1c97..1a96678f8c38 100644
>--- a/include/hw/vfio/vfio-common.h
>+++ b/include/hw/vfio/vfio-common.h
>@@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>
> typedef struct IOMMUFDBackend IOMMUFDBackend;
>
>+typedef struct VFIOIOASHwpt {
>+uint32_t hwpt_id;
>+QLIST_HEAD(, VFIODevice) device_list;
>+QLIST_ENTRY(VFIOIOASHwpt) next;
>+} VFIOIOASHwpt;
>+
> typedef struct VFIOIOMMUFDContainer {
> VFIOContainerBase bcontainer;
> IOMMUFDBackend *be;
> uint32_t ioas_id;
>+QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
> } VFIOIOMMUFDContainer;
>
> OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer,
>VFIO_IOMMU_IOMMUFD);
>@@ -135,6 +142,8 @@ typedef struct VFIODevice {
> HostIOMMUDevice *hiod;
> int devid;
> IOMMUFDBackend *iommufd;
>+VFIOIOASHwpt *hwpt;
>+QLIST_ENTRY(VFIODevice) hwpt_next;
> } VFIODevice;
>
> struct VFIODeviceOps {
>diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>index 57d502a1c79a..e917e7591d05 100644
>--- a/include/sysemu/iommufd.h
>+++ b/include/sysemu/iommufd.h
>@@ -50,6 +50,11 @@ int
>iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
> bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t
>devid,
>  uint32_t *type, void *data, uint32_t len,
>  uint64_t *caps, Error **errp);
>+bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t
>dev_id,
>+uint32_t pt_id, uint32_t flags,
>+uint32_t data_type, uint32_t data_len,
>+void *data_ptr, uint32_t *out_hwpt,
>+Error **errp);
>
> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>TYPE_HOST_IOMMU_DEVICE "-iommufd"
> #endif
>diff --git a/backends/iommufd.c b/backends/iommufd.c
>index 48dfd3962474..60a3d14bfab4 100644
>--- a/backends/iommufd.c
>+++ b/backends/iommufd.c
>@@ -207,6 +207,36 @@ int
>iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
> return ret;
> }
>
>+bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t
>dev_id,
>+uint32_t pt_id, uint32_t flags,
>+uint32_t data_type, uint32_t data_len,
>+void *data_ptr, uint32_t *out_hwpt,
>+Error **errp)
>+{
>+int ret, fd = be->fd;
>+struct iommu_hwpt_alloc alloc_hwpt = {
>+.size = sizeof(struct iommu_hwpt_alloc),
>+.flags = flags,
>+.dev_id = dev_id,
>+.p

RE: [PATCH v5 05/13] vfio/iommufd: Introduce auto domain creation

2024-07-22 Thread Duan, Zhenzhong



>-Original Message-
>From: Joao Martins 
>Subject: Re: [PATCH v5 05/13] vfio/iommufd: Introduce auto domain
>creation
>
>On 22/07/2024 06:16, Duan, Zhenzhong wrote:
>>> -Original Message-
>>> From: Joao Martins 
>>> Subject: [PATCH v5 05/13] vfio/iommufd: Introduce auto domain
>creation
>>>
>>> There's generally two modes of operation for IOMMUFD:
>>>
>>> 1) The simple user API which intends to perform relatively simple things
>>> with IOMMUs e.g. DPDK. The process generally creates an IOAS and
>attaches
>>> to VFIO and mainly performs IOAS_MAP and UNMAP.
>>>
>>> 2) The native IOMMUFD API where you have fine grained control of the
>>> IOMMU domain and model it accordingly. This is where most new features
>>> are being steered to.
>>>
>>> For dirty tracking 2) is required, as it needs to ensure that
>>> the stage-2/parent IOMMU domain will only attach devices
>>> that support dirty tracking (so far it is all homogeneous in x86, likely
>>> not the case for smmuv3). Such invariant on dirty tracking provides a
>>> useful guarantee to VMMs that will refuse incompatible device
>>> attachments for IOMMU domains.
>>>
>>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>>> responsible for creating an IOMMU domain. This is in contrast to the
>>> 'simple API' where the IOMMU domain is created by IOMMUFD
>>> automatically
>>> when it attaches to VFIO (usually referred to as autodomains) but it has
>>> the needed handling for mdevs.
>>>
>>> To support dirty tracking with the advanced IOMMUFD API, it needs
>>> similar logic, where IOMMU domains are created and devices attached to
>>> compatible domains. Essentially mimicking kernel
>>> iommufd_device_auto_get_domain(). With mdevs given there's no
>IOMMU
>>> domain
>>> it falls back to IOAS attach.
>>>
>>> The auto domain logic allows different IOMMU domains to be created
>when
>>> DMA dirty tracking is not desired (and VF can provide it), and others
>where
>>> it is. Here it is not used in this way given how VFIODevice migration
>>> state is initialized after the device attachment. But such mixed mode of
>>> IOMMU dirty tracking + device dirty tracking is an improvement that can
>>> be added on. Keep the 'all or nothing' type1 approach that we have
>>> been using so far between container vs device dirty tracking.
>>>
>>> Signed-off-by: Joao Martins 
>>> ---
>>> include/hw/vfio/vfio-common.h |  9 
>>> include/sysemu/iommufd.h  |  5 +++
>>> backends/iommufd.c| 30 +
>>> hw/vfio/iommufd.c | 84
>>> +++
>>> backends/trace-events |  1 +
>>> 5 files changed, 129 insertions(+)
>>>
>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>>> common.h
>>> index 98acae8c1c97..1a96678f8c38 100644
>>> --- a/include/hw/vfio/vfio-common.h
>>> +++ b/include/hw/vfio/vfio-common.h
>>> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>>>
>>> typedef struct IOMMUFDBackend IOMMUFDBackend;
>>>
>>> +typedef struct VFIOIOASHwpt {
>>> +uint32_t hwpt_id;
>>> +QLIST_HEAD(, VFIODevice) device_list;
>>> +QLIST_ENTRY(VFIOIOASHwpt) next;
>>> +} VFIOIOASHwpt;
>>> +
>>> typedef struct VFIOIOMMUFDContainer {
>>> VFIOContainerBase bcontainer;
>>> IOMMUFDBackend *be;
>>> uint32_t ioas_id;
>>> +QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>> } VFIOIOMMUFDContainer;
>>>
>>> OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer,
>>> VFIO_IOMMU_IOMMUFD);
>>> @@ -135,6 +142,8 @@ typedef struct VFIODevice {
>>> HostIOMMUDevice *hiod;
>>> int devid;
>>> IOMMUFDBackend *iommufd;
>>> +VFIOIOASHwpt *hwpt;
>>> +QLIST_ENTRY(VFIODevice) hwpt_next;
>>> } VFIODevice;
>>>
>>> struct VFIODeviceOps {
>>> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>>> index 57d502a1c79a..e917e7591d05 100644
>>> --- a/include/sysemu/iommufd.h
>>> +++ b/include/sysemu/iommufd.h
>>> @@ -50,6 +50,11 @@ int
>>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t
>ioas_id,
>>> bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
>uint32_t
>&

RE: [PATCH v5 09/13] vfio/iommufd: Probe and request hwpt dirty tracking capability

2024-07-22 Thread Duan, Zhenzhong



>-Original Message-
>From: Joao Martins 
>Subject: Re: [PATCH v5 09/13] vfio/iommufd: Probe and request hwpt dirty
>tracking capability
>
>On 22/07/2024 15:09, Joao Martins wrote:
>> On 22/07/2024 09:58, Joao Martins wrote:
>>> On 22/07/2024 07:05, Duan, Zhenzhong wrote:
>>>>
>>>>
>>>>> -Original Message-
>>>>> From: Joao Martins 
>>>>> Subject: [PATCH v5 09/13] vfio/iommufd: Probe and request hwpt
>dirty
>>>>> tracking capability
>>>>>
>>>>> In preparation to using the dirty tracking UAPI, probe whether the
>IOMMU
>>>>> supports dirty tracking. This is done via the data stored in
>>>>> hiod::caps::hw_caps initialized from GET_HW_INFO.
>>>>>
>>>>> Qemu doesn't know if VF dirty tracking is supported when allocating
>>>>> hardware pagetable in iommufd_cdev_autodomains_get(). This is
>because
>>>>> VFIODevice migration state hasn't been initialized *yet* hence it can't
>pick
>>>>> between VF dirty tracking vs IOMMU dirty tracking. So, if IOMMU
>supports
>>>>> dirty tracking it always creates HWPTs with
>>>>> IOMMU_HWPT_ALLOC_DIRTY_TRACKING
>>>>> even if later on VFIOMigration decides to use VF dirty tracking instead.
>>>>
>>>> I thought there is no overhead for HWPT with
>IOMMU_HWPT_ALLOC_DIRTY_TRACKING vs. HWPT without
>IOMMU_HWPT_ALLOC_DIRTY_TRACKING if we don't enable dirty tracking.
>Right?
>>>>
>>>
>>> Correct.
>>>
>>>>>
>>>>> Signed-off-by: Joao Martins 
>>>>> ---
>>>>> include/hw/vfio/vfio-common.h |  1 +
>>>>> hw/vfio/iommufd.c | 19 +++
>>>>> 2 files changed, 20 insertions(+)
>>>>>
>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>>>>> common.h
>>>>> index 4e44b26d3c45..7e530c7869dc 100644
>>>>> --- a/include/hw/vfio/vfio-common.h
>>>>> +++ b/include/hw/vfio/vfio-common.h
>>>>> @@ -97,6 +97,7 @@ typedef struct IOMMUFDBackend
>IOMMUFDBackend;
>>>>>
>>>>> typedef struct VFIOIOASHwpt {
>>>>> uint32_t hwpt_id;
>>>>> +uint32_t hwpt_flags;
>>>>> QLIST_HEAD(, VFIODevice) device_list;
>>>>> QLIST_ENTRY(VFIOIOASHwpt) next;
>>>>> } VFIOIOASHwpt;
>>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>>> index bb44d948c735..2e5c207bbca0 100644
>>>>> --- a/hw/vfio/iommufd.c
>>>>> +++ b/hw/vfio/iommufd.c
>>>>> @@ -110,6 +110,11 @@ static void
>>>>> iommufd_cdev_unbind_and_disconnect(VFIODevice *vbasedev)
>>>>> iommufd_backend_disconnect(vbasedev->iommufd);
>>>>> }
>>>>>
>>>>> +static bool iommufd_hwpt_dirty_tracking(VFIOIOASHwpt *hwpt)
>>>>> +{
>>>>> +return hwpt && hwpt->hwpt_flags &
>>>>> IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>>>>> +}
>>>>> +
>>>>> static int iommufd_cdev_getfd(const char *sysfs_path, Error **errp)
>>>>> {
>>>>> ERRP_GUARD();
>>>>> @@ -246,6 +251,17 @@ static bool
>>>>> iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>>>> }
>>>>> }
>>>>>
>>>>> +/*
>>>>> + * This is quite early and VFIO Migration state isn't yet fully
>>>>> + * initialized, thus rely only on IOMMU hardware capabilities as to
>>>>> + * whether IOMMU dirty tracking is going to be requested. Later
>>>>> + * vfio_migration_realize() may decide to use VF dirty tracking
>>>>> + * instead.
>>>>> + */
>>>>> +if (vbasedev->hiod->caps.hw_caps &
>>>>> IOMMU_HW_CAP_DIRTY_TRACKING) {
>>>>
>>>> Looks there is still reference to hw_caps, then would suggest to bring
>back the NEW CAP.
>>>>
>>> Ah, but below helper is checking for GET_HW_INFO stuff, and not hwpt
>flags
>>> given that we haven't allocated a hwpt yet.
>>>
>>> While I could place this check into a helper it would only have an user. I
>will
>>> need below helper iommufd_hwpt_dirty_tracking() in another patch, so
>this is a
>>> bit of a on

RE: [PATCH 2/2] vfio/ccw: Don't initialize HOST_IOMMU_DEVICE with mdev

2024-07-22 Thread Duan, Zhenzhong



>-Original Message-
>From: Eric Farman 
>Subject: Re: [PATCH 2/2] vfio/ccw: Don't initialize HOST_IOMMU_DEVICE
>with mdev
>
>On Mon, 2024-07-22 at 17:36 +0200, Cédric Le Goater wrote:
>> On 7/22/24 17:09, Joao Martins wrote:
>> > On 22/07/2024 15:57, Eric Farman wrote:
>> > > On Mon, 2024-07-22 at 15:07 +0800, Zhenzhong Duan wrote:
>> > > > mdevs aren't "physical" devices and when asking for backing IOMMU
>info,
>> > > > it fails the entire provisioning of the guest. Fix that by setting
>> > > > vbasedev->mdev true so skipping HostIOMMUDevice initialization in
>the
>> > > > presence of mdevs.
>> > >
>> > > Hmm, picking the two commits that Cedric mentioned in his cover-
>letter reply [1] doesn't "fail the entire provisioning of the guest" for me.
>> > >
>> > > Applying this patch on top of that causes the call from
>vfio_attach_device() to hiod_legacy_vfio_realize() to be skipped, which
>seems odd. What am I missing?
>> > >
>> > > [1] https://lore.kernel.org/qemu-devel/4c9a184b-514c-4276-95ca-
>9ed86623b...@redhat.com/
>> > >
>> >
>> > If you are using IOMMUFD
>> >
>
>Which is not the case in defconfig.
>
>> >  it will fail the entire provisioning i.e. GET_HW_INFO
>> > fails because there's no actual device/IOMMU you can probe hardware
>information
>> > from and you can't start a guest. This happened at least for me in x86
>vfio-pci
>> > mdevs (or at least I reproduced it when trying to test mdev_tty)
>> >
>> > But if you don't support IOMMUFD, then it probably makes no difference
>as type1
>> > doesn't do anything particularly special besides initializing some static
>data.
>
>This was my concern. The static data doesn't look particularly exciting, but it
>does seem strange to
>be skipping over it in the non-iommufd case now.

Thanks Joao and Cédric for helping explain and confirm.

Yes, after this fix HostIOMMUDevice is completely bypassed for mdev, even in
the non-iommufd case.
In the non-iommufd case, the only supported HostIOMMUDevice capability is
aw_bits, which is calculated from bcontainer->iova_ranges, and that is always
NULL for mdev.
So HOST_IOMMU_DEVICE_CAP_AW_BITS_MAX(64) is returned, which is large enough
that the vIOMMU can safely ignore it. We can therefore safely bypass the entire
HostIOMMUDevice for mdev.
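
As a rough illustration of the fallback described above, here is a minimal,
self-contained sketch. It is not the QEMU implementation: struct demo_range,
demo_bits_needed(), demo_aw_bits() and DEMO_AW_BITS_MAX are hypothetical names,
and deriving the width from the upper bound of the last (sorted) range is an
assumption made for the example. Only the behaviour "no IOVA ranges means 64 is
returned" is taken from this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant mirroring the value 64 mentioned in the thread. */
#define DEMO_AW_BITS_MAX 64

/* Hypothetical IOVA range with inclusive bounds (not a QEMU type). */
struct demo_range {
    uint64_t lob;   /* lowest usable IOVA */
    uint64_t upb;   /* highest usable IOVA */
};

/* Number of bits needed to represent v; returns 0 when v is 0. */
static int demo_bits_needed(uint64_t v)
{
    int bits = 0;

    while (v) {
        bits++;
        v >>= 1;
    }
    return bits;
}

/*
 * Address width implied by a list of ranges sorted in ascending order.
 * Assumption: the last range's upper bound decides the width. With no
 * range information at all, fall back to the maximum.
 */
static int demo_aw_bits(const struct demo_range *ranges, size_t n)
{
    if (ranges == NULL || n == 0) {
        return DEMO_AW_BITS_MAX;
    }
    return demo_bits_needed(ranges[n - 1].upb);
}
```

With no range list, as for mdev, demo_aw_bits(NULL, 0) yields the maximum, so
a caller comparing it against a vIOMMU address width never sees a restriction.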

Thanks
Zhenzhong

>
>> > The realize is skipped because you technically don't have a physical host
>IOMMU
>> > directly behind the mdev, but rather some parent function related
>software
>> > entity doing that for you.
>> >
>> > Zhenzhong noticed there were some other mdevs aside from vfio-pci
>and in an
>> > attempt to prevent regression elsewhere it posted for the other mdevs in
>qemu.
>>
>>
>> yes. I confirm with :
>>
>>-device vfio-ap,id=hostdev0,sysfsdev=/sys/bus/mdev/devices/8eb8351a-
>e656-4187-b773-fea4e926310d,iommufd=iommufd0 \
>>-object iommufd,id=iommufd0 \
>>-trace 'iommufd*'
>>
>> iommufd_cdev_getfd  /dev/vfio/devices/vfio4 (fd=28)
>
>Ah, right... Need to enable iommufd AND vfio_device_cdev to get into this
>potential situation. I
>guess this is better than random failures down the road.
>
>Acked-by: Eric Farman 
>
>> iommufd_backend_connect fd=27 owned=1 users=1
>> iommufd_cdev_connect_and_bind  [iommufd=27] Successfully bound
>device 8eb8351a-e656-4187-b773-fea4e926310d (fd=28): output devid=1
>> iommufd_backend_alloc_ioas  iommufd=27 ioas=2
>> iommufd_cdev_alloc_ioas  [iommufd=27] new IOMMUFD container with
>ioasid=2
>> iommufd_cdev_attach_ioas_hwpt  [iommufd=27] Successfully attached
>device 8eb8351a-e656-4187-b773-fea4e926310d (28) to id=2
>> iommufd_backend_map_dma  iommufd=27 ioas=2 iova=0x0
>size=0x2 addr=0x3fd9ff0 readonly=0 (0)
>> iommufd_cdev_device_info  8eb8351a-e656-4187-b773-fea4e926310d
>(28) num_irqs=1 num_regions=0 flags=33
>> iommufd_cdev_detach_ioas_hwpt  [iommufd=27] Successfully detached
>8eb8351a-e656-4187-b773-fea4e926310d
>> iommufd_backend_unmap_dma  iommufd=27 ioas=2 iova=0x0
>size=0x2 (0)
>> iommufd_backend_free_id  iommufd=27 id=2 (0)
>> iommufd_backend_disconnect fd=-1 users=0
>>
>> qemu-kvm: -device vfio-
>ap,id=hostdev0,sysfsdev=/sys/bus/mdev/devices/8eb8351a-e656-4187-
>b773-fea4e926310d,iommufd=iommufd0: vfio 8eb8351a-e656-4187-b773-
>fea4e926310d: Failed to get hardware info: No such file or directory
>>
>>
>>
>> Thanks,
>>
>> C.
>>
>>

RE: [PATCH v5 05/13] vfio/iommufd: Introduce auto domain creation

2024-07-22 Thread Duan, Zhenzhong



>-Original Message-
>From: Cédric Le Goater 
>Subject: Re: [PATCH v5 05/13] vfio/iommufd: Introduce auto domain
>creation
>
>On 7/22/24 10:50, Joao Martins wrote:
>> On 22/07/2024 06:16, Duan, Zhenzhong wrote:
>>>> -Original Message-
>>>> From: Joao Martins 
>>>> Subject: [PATCH v5 05/13] vfio/iommufd: Introduce auto domain
>creation
>>>>
>>>> There's generally two modes of operation for IOMMUFD:
>>>>
>>>> 1) The simple user API which intends to perform relatively simple things
>>>> with IOMMUs e.g. DPDK. The process generally creates an IOAS and
>attaches
>>>> to VFIO and mainly performs IOAS_MAP and UNMAP.
>>>>
>>>> 2) The native IOMMUFD API where you have fine grained control of the
>>>> IOMMU domain and model it accordingly. This is where most new
>features
>>>> are being steered to.
>>>>
>>>> For dirty tracking 2) is required, as it needs to ensure that
>>>> the stage-2/parent IOMMU domain will only attach devices
>>>> that support dirty tracking (so far it is all homogeneous in x86, likely
>>>> not the case for smmuv3). Such invariant on dirty tracking provides a
>>>> useful guarantee to VMMs that will refuse incompatible device
>>>> attachments for IOMMU domains.
>>>>
>>>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>>>> responsible for creating an IOMMU domain. This is in contrast to the
>>>> 'simple API' where the IOMMU domain is created by IOMMUFD
>>>> automatically
>>>> when it attaches to VFIO (usually referred to as autodomains) but it has
>>>> the needed handling for mdevs.
>>>>
>>>> To support dirty tracking with the advanced IOMMUFD API, it needs
>>>> similar logic, where IOMMU domains are created and devices attached
>to
>>>> compatible domains. Essentially mimicking kernel
>>>> iommufd_device_auto_get_domain(). With mdevs given there's no
>IOMMU
>>>> domain
>>>> it falls back to IOAS attach.
>>>>
>>>> The auto domain logic allows different IOMMU domains to be created
>when
>>>> DMA dirty tracking is not desired (and VF can provide it), and others
>where
>>>> it is. Here it is not used in this way given how VFIODevice migration
>>>> state is initialized after the device attachment. But such mixed mode of
>>>> IOMMU dirty tracking + device dirty tracking is an improvement that
>can
>>>> be added on. Keep the 'all or nothing' type1 approach that we have
>>>> been using so far between container vs device dirty tracking.
>>>>
>>>> Signed-off-by: Joao Martins 
>>>> ---
>>>> include/hw/vfio/vfio-common.h |  9 
>>>> include/sysemu/iommufd.h  |  5 +++
>>>> backends/iommufd.c| 30 +
>>>> hw/vfio/iommufd.c | 84
>>>> +++
>>>> backends/trace-events |  1 +
>>>> 5 files changed, 129 insertions(+)
>>>>
>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>>>> common.h
>>>> index 98acae8c1c97..1a96678f8c38 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>>>>
>>>> typedef struct IOMMUFDBackend IOMMUFDBackend;
>>>>
>>>> +typedef struct VFIOIOASHwpt {
>>>> +uint32_t hwpt_id;
>>>> +QLIST_HEAD(, VFIODevice) device_list;
>>>> +QLIST_ENTRY(VFIOIOASHwpt) next;
>>>> +} VFIOIOASHwpt;
>>>> +
>>>> typedef struct VFIOIOMMUFDContainer {
>>>>  VFIOContainerBase bcontainer;
>>>>  IOMMUFDBackend *be;
>>>>  uint32_t ioas_id;
>>>> +QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>> } VFIOIOMMUFDContainer;
>>>>
>>>> OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer,
>>>> VFIO_IOMMU_IOMMUFD);
>>>> @@ -135,6 +142,8 @@ typedef struct VFIODevice {
>>>>  HostIOMMUDevice *hiod;
>>>>  int devid;
>>>>  IOMMUFDBackend *iommufd;
>>>> +VFIOIOASHwpt *hwpt;
>>>> +QLIST_ENTRY(VFIODevice) hwpt_next;
>>>> } VFIODevice;
>>>>
>>>> struct VFIODeviceOps {
>>>> diff --git a/

RE: [PATCH] hw/vfio/container: Fix SIGSEV on vfio_container_instance_finalize()

2024-07-21 Thread Duan, Zhenzhong




>-Original Message-
>From: Eric Auger 
>Subject: [PATCH] hw/vfio/container: Fix SIGSEV on
>vfio_container_instance_finalize()
>
>In vfio_connect_container's error path, the base container is
>removed twice from the VFIOAddressSpace QLIST: first on the
>listener_release_exit label and second, on free_container_exit
>label, through object_unref(container), which calls
>vfio_container_instance_finalize().
>
>Let's remove the first instance.
>
>Fixes: 938026053f4 ("vfio/container: Switch to QOM")
>Signed-off-by: Eric Auger 

Reviewed-by: Zhenzhong Duan 

Thanks
Zhenzhong

>---
> hw/vfio/container.c | 1 -
> 1 file changed, 1 deletion(-)
>
>diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>index 425db1a14c..d8b7c533af 100644
>--- a/hw/vfio/container.c
>+++ b/hw/vfio/container.c
>@@ -657,7 +657,6 @@ static bool vfio_connect_container(VFIOGroup
>*group, AddressSpace *as,
> return true;
> listener_release_exit:
> QLIST_REMOVE(group, container_next);
>-QLIST_REMOVE(bcontainer, next);
> vfio_kvm_device_del_group(group);
> memory_listener_unregister(&bcontainer->listener);
> if (vioc->release) {
>--
>2.41.0

RE: [PATCH v5 11/13] vfio/iommufd: Implement VFIOIOMMUClass::query_dirty_bitmap support

2024-07-21 Thread Duan, Zhenzhong



>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v5 11/13] vfio/iommufd: Implement
>VFIOIOMMUClass::query_dirty_bitmap support
>
>ioctl(iommufd, IOMMU_HWPT_GET_DIRTY_BITMAP, arg) is the UAPI
>that fetches the bitmap that tells what was dirty in an IOVA
>range.
>
>A single bitmap is allocated and used across all the hwpts
>sharing an IOAS which is then used in log_sync() to set Qemu
>global bitmaps.
>
>Signed-off-by: Joao Martins 
>Reviewed-by: Cédric Le Goater 
>Reviewed-by: Eric Auger 

Reviewed-by: Zhenzhong Duan 

Thanks
Zhenzhong

>---
> include/sysemu/iommufd.h |  4 
> backends/iommufd.c   | 29 +
> hw/vfio/iommufd.c| 28 
> backends/trace-events|  1 +
> 4 files changed, 62 insertions(+)
>
>diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>index 6fb412f61144..4c4886c7787b 100644
>--- a/include/sysemu/iommufd.h
>+++ b/include/sysemu/iommufd.h
>@@ -57,6 +57,10 @@ bool
>iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
> Error **errp);
> bool iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be,
>uint32_t hwpt_id,
> bool start, Error **errp);
>+bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>uint32_t hwpt_id,
>+  uint64_t iova, ram_addr_t size,
>+  uint64_t page_size, uint64_t *data,
>+  Error **errp);
>
> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>TYPE_HOST_IOMMU_DEVICE "-iommufd"
> #endif
>diff --git a/backends/iommufd.c b/backends/iommufd.c
>index 1ae4751a1b2c..bd4fd49d2536 100644
>--- a/backends/iommufd.c
>+++ b/backends/iommufd.c
>@@ -262,6 +262,35 @@ bool
>iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be,
> return true;
> }
>
>+bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>+  uint32_t hwpt_id,
>+  uint64_t iova, ram_addr_t size,
>+  uint64_t page_size, uint64_t *data,
>+  Error **errp)
>+{
>+int ret;
>+struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap = {
>+.size = sizeof(get_dirty_bitmap),
>+.hwpt_id = hwpt_id,
>+.iova = iova,
>+.length = size,
>+.page_size = page_size,
>+.data = (uintptr_t)data,
>+};
>+
>+ret = ioctl(be->fd, IOMMU_HWPT_GET_DIRTY_BITMAP,
>&get_dirty_bitmap);
>+trace_iommufd_backend_get_dirty_bitmap(be->fd, hwpt_id, iova, size,
>+   page_size, ret ? errno : 0);
>+if (ret) {
>+error_setg_errno(errp, errno,
>+ "IOMMU_HWPT_GET_DIRTY_BITMAP (iova:
>0x%"HWADDR_PRIx
>+ " size: 0x"RAM_ADDR_FMT") failed", iova, size);
>+return false;
>+}
>+
>+return true;
>+}
>+
> bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t
>devid,
>  uint32_t *type, void *data, uint32_t len,
>  uint64_t *caps, Error **errp)
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index 7137faaf4540..7dd5d43ce06a 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -25,6 +25,7 @@
> #include "qemu/cutils.h"
> #include "qemu/chardev_open.h"
> #include "pci.h"
>+#include "exec/ram_addr.h"
>
> static int iommufd_cdev_map(const VFIOContainerBase *bcontainer,
>hwaddr iova,
> ram_addr_t size, void *vaddr, bool readonly)
>@@ -146,6 +147,32 @@ err:
> return -EINVAL;
> }
>
>+static int iommufd_query_dirty_bitmap(const VFIOContainerBase
>*bcontainer,
>+  VFIOBitmap *vbmap, hwaddr iova,
>+  hwaddr size, Error **errp)
>+{
>+VFIOIOMMUFDContainer *container = container_of(bcontainer,
>+   VFIOIOMMUFDContainer,
>+   bcontainer);
>+unsigned long page_size = qemu_real_host_page_size();
>+VFIOIOASHwpt *hwpt;
>+
>+QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>+if (!iommufd_hwpt_dirty_tracking(hwpt)) {
>+continue;
>+}
>+
>+if (!iommufd_backend_get_dirty_bitmap(container->be, hwpt-
>>hwpt_id,
>+  iova, size, page_size,
>+  (uint64_t *)vbmap->bitmap,
>+  errp)) {
>+return -EINVAL;
>+}
>+}
>+
>+return 0;
>+}
>+
> static int iommufd_cdev_getfd(const char *sysfs_path, Error **errp)
> {
> ERRP_GUARD();
>@@ -756,6 +783,7 @@ static void
>vfio_iommu_iommufd_class_init(ObjectClass *klass, void *data)
> vioc->detach_device = iommufd_cdev_detach;
> vioc->pci_hot_reset = iommufd_cdev_pci

RE: [PATCH v5 10/13] vfio/iommufd: Implement VFIOIOMMUClass::set_dirty_tracking support

2024-07-21 Thread Duan, Zhenzhong




>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v5 10/13] vfio/iommufd: Implement
>VFIOIOMMUClass::set_dirty_tracking support
>
>ioctl(iommufd, IOMMU_HWPT_SET_DIRTY_TRACKING, arg) is the UAPI that
>enables or disables dirty page tracking. The ioctl is used if the hwpt
>has been created with dirty tracking supported domain (stored in
>hwpt::flags) and it is called on the whole list of iommu domains.
>
>Signed-off-by: Joao Martins 
>---
> include/sysemu/iommufd.h |  2 ++
> backends/iommufd.c   | 23 +++
> hw/vfio/iommufd.c| 32 
> backends/trace-events|  1 +
> 4 files changed, 58 insertions(+)
>
>diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>index e917e7591d05..6fb412f61144 100644
>--- a/include/sysemu/iommufd.h
>+++ b/include/sysemu/iommufd.h
>@@ -55,6 +55,8 @@ bool
>iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
> uint32_t data_type, uint32_t data_len,
> void *data_ptr, uint32_t *out_hwpt,
> Error **errp);
>+bool iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be,
>uint32_t hwpt_id,
>+bool start, Error **errp);
>
> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>TYPE_HOST_IOMMU_DEVICE "-iommufd"
> #endif
>diff --git a/backends/iommufd.c b/backends/iommufd.c
>index 58032e588f49..1ae4751a1b2c 100644
>--- a/backends/iommufd.c
>+++ b/backends/iommufd.c
>@@ -239,6 +239,29 @@ bool
>iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
> return true;
> }
>
>+bool iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be,
>+uint32_t hwpt_id, bool start,
>+Error **errp)
>+{
>+int ret;
>+struct iommu_hwpt_set_dirty_tracking set_dirty = {
>+.size = sizeof(set_dirty),
>+.hwpt_id = hwpt_id,
>+.flags = start ? IOMMU_HWPT_DIRTY_TRACKING_ENABLE : 0,
>+};
>+
>+ret = ioctl(be->fd, IOMMU_HWPT_SET_DIRTY_TRACKING, &set_dirty);
>+trace_iommufd_backend_set_dirty(be->fd, hwpt_id, start, ret ? errno :
>0);
>+if (ret) {
>+error_setg_errno(errp, errno,
>+ "IOMMU_HWPT_SET_DIRTY_TRACKING(hwpt_id %u) failed",
>+ hwpt_id);
>+return false;
>+}
>+
>+return true;
>+}
>+
> bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t
>devid,
>  uint32_t *type, void *data, uint32_t len,
>  uint64_t *caps, Error **errp)
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index 2e5c207bbca0..7137faaf4540 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -115,6 +115,37 @@ static bool
>iommufd_hwpt_dirty_tracking(VFIOIOASHwpt *hwpt)
> return hwpt && hwpt->hwpt_flags &
>IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
> }
>
>+static int iommufd_set_dirty_page_tracking(const VFIOContainerBase
>*bcontainer,
>+   bool start, Error **errp)
>+{
>+const VFIOIOMMUFDContainer *container =
>+container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
>+VFIOIOASHwpt *hwpt;
>+
>+QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>+if (!iommufd_hwpt_dirty_tracking(hwpt)) {
>+continue;
>+}
>+
>+if (!iommufd_backend_set_dirty_tracking(container->be,
>+hwpt->hwpt_id, start, errp)) {
>+goto err;
>+}
>+}
>+
>+return 0;
>+
>+err:
>+QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>+if (!iommufd_hwpt_dirty_tracking(hwpt)) {
>+continue;
>+}
>+iommufd_backend_set_dirty_tracking(container->be,
>+   hwpt->hwpt_id, !start, NULL);
>+}

Not sure whether it is worth optimizing this a bit by stopping the rollback at the failing hwpt.
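
A minimal sketch of that idea, under stated assumptions: struct demo_hwpt,
demo_set_tracking() and demo_set_tracking_all() are hypothetical stand-ins
written for this example (an array instead of a QLIST, a flag instead of the
ioctl), not code from the patch. On failure, only the entries visited before
the failing one are reverted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a hardware page table entry. */
struct demo_hwpt {
    bool supports_dirty;   /* allocated with dirty tracking support */
    bool tracking;         /* current tracking state */
    bool fail_on_set;      /* simulate a failing ioctl for this entry */
};

/* Hypothetical stand-in for the backend call that toggles tracking. */
static bool demo_set_tracking(struct demo_hwpt *h, bool start)
{
    if (h->fail_on_set) {
        return false;
    }
    h->tracking = start;
    return true;
}

/*
 * Toggle tracking on every capable entry. If entry i fails, revert only
 * entries [0, i), since entries from i onwards were never switched.
 */
static int demo_set_tracking_all(struct demo_hwpt *list, size_t n, bool start)
{
    size_t i, j;

    for (i = 0; i < n; i++) {
        if (!list[i].supports_dirty) {
            continue;
        }
        if (!demo_set_tracking(&list[i], start)) {
            goto err;
        }
    }
    return 0;

err:
    for (j = 0; j < i; j++) {
        if (list[j].supports_dirty) {
            /* Best effort: a failure while reverting is ignored here. */
            demo_set_tracking(&list[j], !start);
        }
    }
    return -1;
}
```

Whether the saving matters in practice is an open question, as the list of
hwpts per container is expected to be short.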

With or without that,

Reviewed-by: Zhenzhong Duan 

Thanks
Zhenzhong

>+return -EINVAL;
>+}
>+
> static int iommufd_cdev_getfd(const char *sysfs_path, Error **errp)
> {
> ERRP_GUARD();
>@@ -724,6 +755,7 @@ static void
>vfio_iommu_iommufd_class_init(ObjectClass *klass, void *data)
> vioc->attach_device = iommufd_cdev_attach;
> vioc->detach_device = iommufd_cdev_detach;
> vioc->pci_hot_reset = iommufd_cdev_pci_hot_reset;
>+vioc->set_dirty_page_tracking = iommufd_set_dirty_page_tracking;
> };
>
> static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void
>*opaque,
>diff --git a/backends/trace-events b/backends/trace-events
>index 4d8ac02fe7d6..28aca3b859d4 100644
>--- a/backends/trace-events
>+++ b/backends/trace-events
>@@ -16,3 +16,4 @@ iommufd_backend_unmap_dma(int iommufd,
>uint32_t ioas, uint64_t iova, uint64_t si
> iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d
>ioas=%d"
> iommufd_backend_alloc_hwpt(int iommuf

RE: [PATCH v5 09/13] vfio/iommufd: Probe and request hwpt dirty tracking capability

2024-07-21 Thread Duan, Zhenzhong




>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v5 09/13] vfio/iommufd: Probe and request hwpt dirty
>tracking capability
>
>In preparation to using the dirty tracking UAPI, probe whether the IOMMU
>supports dirty tracking. This is done via the data stored in
>hiod::caps::hw_caps initialized from GET_HW_INFO.
>
>Qemu doesn't know if VF dirty tracking is supported when allocating
>hardware pagetable in iommufd_cdev_autodomains_get(). This is because
>VFIODevice migration state hasn't been initialized *yet* hence it can't pick
>between VF dirty tracking vs IOMMU dirty tracking. So, if IOMMU supports
>dirty tracking it always creates HWPTs with
>IOMMU_HWPT_ALLOC_DIRTY_TRACKING
>even if later on VFIOMigration decides to use VF dirty tracking instead.

I thought there is no overhead for a HWPT with IOMMU_HWPT_ALLOC_DIRTY_TRACKING
vs. a HWPT without IOMMU_HWPT_ALLOC_DIRTY_TRACKING as long as we don't enable
dirty tracking. Right?

>
>Signed-off-by: Joao Martins 
>---
> include/hw/vfio/vfio-common.h |  1 +
> hw/vfio/iommufd.c | 19 +++
> 2 files changed, 20 insertions(+)
>
>diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>common.h
>index 4e44b26d3c45..7e530c7869dc 100644
>--- a/include/hw/vfio/vfio-common.h
>+++ b/include/hw/vfio/vfio-common.h
>@@ -97,6 +97,7 @@ typedef struct IOMMUFDBackend IOMMUFDBackend;
>
> typedef struct VFIOIOASHwpt {
> uint32_t hwpt_id;
>+uint32_t hwpt_flags;
> QLIST_HEAD(, VFIODevice) device_list;
> QLIST_ENTRY(VFIOIOASHwpt) next;
> } VFIOIOASHwpt;
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index bb44d948c735..2e5c207bbca0 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -110,6 +110,11 @@ static void
>iommufd_cdev_unbind_and_disconnect(VFIODevice *vbasedev)
> iommufd_backend_disconnect(vbasedev->iommufd);
> }
>
>+static bool iommufd_hwpt_dirty_tracking(VFIOIOASHwpt *hwpt)
>+{
>+return hwpt && hwpt->hwpt_flags &
>IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>+}
>+
> static int iommufd_cdev_getfd(const char *sysfs_path, Error **errp)
> {
> ERRP_GUARD();
>@@ -246,6 +251,17 @@ static bool
>iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
> }
> }
>
>+/*
>+ * This is quite early and VFIO Migration state isn't yet fully
>+ * initialized, thus rely only on IOMMU hardware capabilities as to
>+ * whether IOMMU dirty tracking is going to be requested. Later
>+ * vfio_migration_realize() may decide to use VF dirty tracking
>+ * instead.
>+ */
>+if (vbasedev->hiod->caps.hw_caps &
>IOMMU_HW_CAP_DIRTY_TRACKING) {

It looks like there is still a reference to hw_caps here, so I would suggest
bringing back the NEW CAP.

>+flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>+}
>+
> if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
> container->ioas_id, flags,
> IOMMU_HWPT_DATA_NONE, 0, NULL,
>@@ -255,6 +271,7 @@ static bool
>iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>
> hwpt = g_malloc0(sizeof(*hwpt));
> hwpt->hwpt_id = hwpt_id;
>+hwpt->hwpt_flags = flags;
> QLIST_INIT(&hwpt->device_list);
>
> ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
>@@ -267,6 +284,8 @@ static bool
>iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
> vbasedev->hwpt = hwpt;
> QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
> QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
>+container->bcontainer.dirty_pages_supported |=
>+  iommufd_hwpt_dirty_tracking(hwpt);

If there is at least one hwpt without dirty tracking, shouldn't we make 
bcontainer.dirty_pages_supported false?
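
To make the question concrete, here is a small sketch contrasting the two
aggregation policies. It is only a model: demo_any_dirty() and
demo_all_dirty() are hypothetical helpers over a plain array of per-hwpt
flags, and it does not claim which policy is right for the container.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* "Supported" if at least one hwpt has dirty tracking (the |= in the patch). */
static bool demo_any_dirty(const bool *hwpt_dirty, size_t n)
{
    bool supported = false;

    for (size_t i = 0; i < n; i++) {
        supported |= hwpt_dirty[i];
    }
    return supported;
}

/* "Supported" only if every hwpt has dirty tracking (the stricter reading). */
static bool demo_all_dirty(const bool *hwpt_dirty, size_t n)
{
    bool supported = n > 0;

    for (size_t i = 0; i < n; i++) {
        supported &= hwpt_dirty[i];
    }
    return supported;
}
```

The two helpers only disagree for a mixed list, which is exactly the case
being asked about.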

Thanks
Zhenzhong

> return true;
> }
>
>--
>2.17.2

RE: [PATCH v5 08/13] vfio/{iommufd,container}: Invoke HostIOMMUDevice::realize() during attach_device()

2024-07-21 Thread Duan, Zhenzhong



>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v5 08/13] vfio/{iommufd,container}: Invoke
>HostIOMMUDevice::realize() during attach_device()
>
>Move the HostIOMMUDevice::realize() to be invoked during the attach of
>the device
>before we allocate IOMMUFD hardware pagetable objects (HWPT). This
>allows the use
>of the hw_caps obtained by IOMMU_GET_HW_INFO that essentially tell if
>the IOMMU
>behind the device supports dirty tracking.
>
>Note: The HostIOMMUDevice data from legacy backend is static and doesn't
>need any information from the (type1-iommu) backend to be initialized.
>In contrast however, the IOMMUFD HostIOMMUDevice data requires the
>iommufd FD to be connected and having a devid to be able to successfully
>GET_HW_INFO. This means vfio_device_hiod_realize() is called in
>different places within the backend .attach_device() implementation.
>
>Suggested-by: Cédric Le Goater 
>Signed-off-by: Joao Martins 

Reviewed-by: Zhenzhong Duan 

Thanks
Zhenzhong

>---
> include/hw/vfio/vfio-common.h |  1 +
> hw/vfio/common.c  | 16 ++--
> hw/vfio/container.c   |  4 
> hw/vfio/helpers.c | 11 +++
> hw/vfio/iommufd.c |  4 
> 5 files changed, 26 insertions(+), 10 deletions(-)
>
>diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>common.h
>index 1a96678f8c38..4e44b26d3c45 100644
>--- a/include/hw/vfio/vfio-common.h
>+++ b/include/hw/vfio/vfio-common.h
>@@ -242,6 +242,7 @@ void vfio_region_finalize(VFIORegion *region);
> void vfio_reset_handler(void *opaque);
> struct vfio_device_info *vfio_get_device_info(int fd);
> bool vfio_device_is_mdev(VFIODevice *vbasedev);
>+bool vfio_device_hiod_realize(VFIODevice *vbasedev, Error **errp);
> bool vfio_attach_device(char *name, VFIODevice *vbasedev,
> AddressSpace *as, Error **errp);
> void vfio_detach_device(VFIODevice *vbasedev);
>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>index b0beed44116e..cc14f0e3fe24 100644
>--- a/hw/vfio/common.c
>+++ b/hw/vfio/common.c
>@@ -1544,7 +1544,7 @@ bool vfio_attach_device(char *name, VFIODevice
>*vbasedev,
> {
> const VFIOIOMMUClass *ops =
>
>VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_LEGACY));
>-HostIOMMUDevice *hiod;
>+HostIOMMUDevice *hiod = NULL;
>
> if (vbasedev->iommufd) {
> ops =
>VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_IOMMUF
>D));
>@@ -1552,21 +1552,17 @@ bool vfio_attach_device(char *name,
>VFIODevice *vbasedev,
>
> assert(ops);
>
>-if (!ops->attach_device(name, vbasedev, as, errp)) {
>-return false;
>-}
>
>-if (vbasedev->mdev) {
>-return true;
>+if (!vbasedev->mdev) {
>+hiod = HOST_IOMMU_DEVICE(object_new(ops->hiod_typename));
>+vbasedev->hiod = hiod;
> }
>
>-hiod = HOST_IOMMU_DEVICE(object_new(ops->hiod_typename));
>-if (!HOST_IOMMU_DEVICE_GET_CLASS(hiod)->realize(hiod, vbasedev,
>errp)) {
>+if (!ops->attach_device(name, vbasedev, as, errp)) {
> object_unref(hiod);
>-ops->detach_device(vbasedev);
>+vbasedev->hiod = NULL;
> return false;
> }
>-vbasedev->hiod = hiod;
>
> return true;
> }
>diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>index c27f448ba26e..adb302216e23 100644
>--- a/hw/vfio/container.c
>+++ b/hw/vfio/container.c
>@@ -917,6 +917,10 @@ static bool vfio_legacy_attach_device(const char
>*name, VFIODevice *vbasedev,
>
> trace_vfio_attach_device(vbasedev->name, groupid);
>
>+if (!vfio_device_hiod_realize(vbasedev, errp)) {
>+return false;
>+}
>+
> group = vfio_get_group(groupid, as, errp);
> if (!group) {
> return false;
>diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
>index 7e23e9080c9d..ea15c79db0a3 100644
>--- a/hw/vfio/helpers.c
>+++ b/hw/vfio/helpers.c
>@@ -689,3 +689,14 @@ bool vfio_device_is_mdev(VFIODevice *vbasedev)
> subsys = realpath(tmp, NULL);
> return subsys && (strcmp(subsys, "/sys/bus/mdev") == 0);
> }
>+
>+bool vfio_device_hiod_realize(VFIODevice *vbasedev, Error **errp)
>+{
>+HostIOMMUDevice *hiod = vbasedev->hiod;
>+
>+if (!hiod) {
>+return true;
>+}
>+
>+return HOST_IOMMU_DEVICE_GET_CLASS(hiod)->realize(hiod, vbasedev,
>errp);
>+}
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index 7a10b1e90a6f..bb44d948c735 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -403,6 +403,10 @@ static bool iommufd_cdev_attach(const char
>*name, VFIODevice *vbasedev,
>
> space = vfio_get_address_space(as);
>
>+if (!vfio_device_hiod_realize(vbasedev, errp)) {
>+return false;
>+}
>+
> /* try to attach to an existing container in this space */
> QLIST_FOREACH(bcontainer, &space->containers, next) {
> container = container_of(bcontainer, VFIOIOMMUFDContainer,
>bcontainer);
>--
>2.17.2

RE: [PATCH v5 06/13] vfio/{iommufd,container}: Remove caps::aw_bits

2024-07-21 Thread Duan, Zhenzhong



>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v5 06/13] vfio/{iommufd,container}: Remove caps::aw_bits
>
>Remove caps::aw_bits which requires the bcontainer::iova_ranges being
>initialized after device is actually attached. Instead defer that to
>.get_cap() and call vfio_device_get_aw_bits() directly.
>
>This is in preparation for HostIOMMUDevice::realize() being called early
>during attach_device().
>
>Suggested-by: Zhenzhong Duan 
>Signed-off-by: Joao Martins 
>Reviewed-by: Cédric Le Goater
>---
> include/sysemu/host_iommu_device.h | 3 ---
> backends/iommufd.c | 3 ++-
> hw/vfio/container.c| 5 +
> hw/vfio/iommufd.c  | 1 -
> 4 files changed, 3 insertions(+), 9 deletions(-)
>
>diff --git a/include/sysemu/host_iommu_device.h
>b/include/sysemu/host_iommu_device.h
>index ee6c813c8b22..cdeeccec7671 100644
>--- a/include/sysemu/host_iommu_device.h
>+++ b/include/sysemu/host_iommu_device.h
>@@ -19,12 +19,9 @@
>  * struct HostIOMMUDeviceCaps - Define host IOMMU device capabilities.
>  *
>  * @type: host platform IOMMU type.
>- *
>- * @aw_bits: host IOMMU address width. 0xff if no limitation.
>  */
> typedef struct HostIOMMUDeviceCaps {
> uint32_t type;
>-uint8_t aw_bits;
> } HostIOMMUDeviceCaps;
>
> #define TYPE_HOST_IOMMU_DEVICE "host-iommu-device"
>diff --git a/backends/iommufd.c b/backends/iommufd.c
>index a94d3b90c05c..58032e588f49 100644
>--- a/backends/iommufd.c
>+++ b/backends/iommufd.c
>@@ -18,6 +18,7 @@
> #include "qemu/error-report.h"
> #include "monitor/monitor.h"
> #include "trace.h"
>+#include "hw/vfio/vfio-common.h"
> #include 
> #include 
>
>@@ -270,7 +271,7 @@ static int
>hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
> case HOST_IOMMU_DEVICE_CAP_IOMMU_TYPE:
> return caps->type;
> case HOST_IOMMU_DEVICE_CAP_AW_BITS:
>-return caps->aw_bits;
>+return vfio_device_get_aw_bits(hiod->agent);

I just realized there is an open issue here: hiod->agent is not necessarily a
VFIO device, it can also be a VDPA device.
This may need a bit more work.
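
One possible direction, sketched with made-up types (none of these names exist in QEMU, and this is only an assumption about how it could be handled): keep the kind of agent next to the opaque pointer, so that the get_cap() path only dereferences it as a VFIO device when it really is one.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical and only illustrate the idea. */
enum demo_agent_kind { DEMO_AGENT_VFIO, DEMO_AGENT_VDPA };

struct demo_vfio_dev {
    int aw_bits;
};

struct demo_hiod {
    enum demo_agent_kind kind;  /* what 'agent' actually points to */
    void *agent;
};

#define DEMO_AW_BITS_MAX 64     /* placeholder for "no known limitation" */

static int demo_get_aw_bits(const struct demo_hiod *hiod)
{
    assert(hiod != NULL);

    if (hiod->kind == DEMO_AGENT_VFIO && hiod->agent != NULL) {
        const struct demo_vfio_dev *vdev = hiod->agent;
        return vdev->aw_bits;
    }

    /* Non-VFIO agent: do not reinterpret the pointer, report the maximum. */
    return DEMO_AW_BITS_MAX;
}
```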

Thanks
Zhenzhong

> default:
> error_setg(errp, "%s: unsupported capability %x", hiod->name, cap);
> return -EINVAL;
>diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>index 88ede913d6f7..c27f448ba26e 100644
>--- a/hw/vfio/container.c
>+++ b/hw/vfio/container.c
>@@ -1144,7 +1144,6 @@ static bool
>hiod_legacy_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
> VFIODevice *vdev = opaque;
>
> hiod->name = g_strdup(vdev->name);
>-hiod->caps.aw_bits = vfio_device_get_aw_bits(vdev);
> hiod->agent = opaque;
>
> return true;
>@@ -1153,11 +1152,9 @@ static bool
>hiod_legacy_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
> static int hiod_legacy_vfio_get_cap(HostIOMMUDevice *hiod, int cap,
> Error **errp)
> {
>-HostIOMMUDeviceCaps *caps = &hiod->caps;
>-
> switch (cap) {
> case HOST_IOMMU_DEVICE_CAP_AW_BITS:
>-return caps->aw_bits;
>+return vfio_device_get_aw_bits(hiod->agent);
> default:
> error_setg(errp, "%s: unsupported capability %x", hiod->name, cap);
> return -EINVAL;
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index 545f4a404125..028533bc39b9 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -724,7 +724,6 @@ static bool
>hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
>
> hiod->name = g_strdup(vdev->name);
> caps->type = type;
>-caps->aw_bits = vfio_device_get_aw_bits(vdev);
>
> return true;
> }
>--
>2.17.2

RE: [PATCH v5 05/13] vfio/iommufd: Introduce auto domain creation

2024-07-21 Thread Duan, Zhenzhong




>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v5 05/13] vfio/iommufd: Introduce auto domain creation
>
>There's generally two modes of operation for IOMMUFD:
>
>1) The simple user API which intends to perform relatively simple things
>with IOMMUs e.g. DPDK. The process generally creates an IOAS and attaches
>to VFIO and mainly performs IOAS_MAP and UNMAP.
>
>2) The native IOMMUFD API where you have fine grained control of the
>IOMMU domain and model it accordingly. This is where most new features
>are being steered to.
>
>For dirty tracking 2) is required, as it needs to ensure that
>the stage-2/parent IOMMU domain will only attach devices
>that support dirty tracking (so far it is all homogeneous in x86, likely
>not the case for smmuv3). Such invariant on dirty tracking provides a
>useful guarantee to VMMs that will refuse incompatible device
>attachments for IOMMU domains.
>
>Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>responsible for creating an IOMMU domain. This is in contrast to the
>'simple API' where the IOMMU domain is created by IOMMUFD
>automatically
>when it attaches to VFIO (usually referred as autodomains) but it has
>the needed handling for mdevs.
>
>To support dirty tracking with the advanced IOMMUFD API, it needs
>similar logic, where IOMMU domains are created and devices attached to
>compatible domains. Essentially mimicking kernel
>iommufd_device_auto_get_domain(). With mdevs given there's no IOMMU
>domain
>it falls back to IOAS attach.
>
>The auto domain logic allows different IOMMU domains to be created when
>DMA dirty tracking is not desired (and VF can provide it), and others where
>it is. Here it is not used in this way given how VFIODevice migration
>state is initialized after the device attachment. But such mixed mode of
>IOMMU dirty tracking + device dirty tracking is an improvement that can
>be added on. Keep the 'all or nothing' approach of type1 that we have
>been using so far between container vs device dirty tracking.
>
>Signed-off-by: Joao Martins 
>---
> include/hw/vfio/vfio-common.h |  9 
> include/sysemu/iommufd.h  |  5 +++
> backends/iommufd.c| 30 +
> hw/vfio/iommufd.c | 84
>+++
> backends/trace-events |  1 +
> 5 files changed, 129 insertions(+)
>
>diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>common.h
>index 98acae8c1c97..1a96678f8c38 100644
>--- a/include/hw/vfio/vfio-common.h
>+++ b/include/hw/vfio/vfio-common.h
>@@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>
> typedef struct IOMMUFDBackend IOMMUFDBackend;
>
>+typedef struct VFIOIOASHwpt {
>+uint32_t hwpt_id;
>+QLIST_HEAD(, VFIODevice) device_list;
>+QLIST_ENTRY(VFIOIOASHwpt) next;
>+} VFIOIOASHwpt;
>+
> typedef struct VFIOIOMMUFDContainer {
> VFIOContainerBase bcontainer;
> IOMMUFDBackend *be;
> uint32_t ioas_id;
>+QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
> } VFIOIOMMUFDContainer;
>
> OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer,
>VFIO_IOMMU_IOMMUFD);
>@@ -135,6 +142,8 @@ typedef struct VFIODevice {
> HostIOMMUDevice *hiod;
> int devid;
> IOMMUFDBackend *iommufd;
>+VFIOIOASHwpt *hwpt;
>+QLIST_ENTRY(VFIODevice) hwpt_next;
> } VFIODevice;
>
> struct VFIODeviceOps {
>diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>index 57d502a1c79a..e917e7591d05 100644
>--- a/include/sysemu/iommufd.h
>+++ b/include/sysemu/iommufd.h
>@@ -50,6 +50,11 @@ int
>iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
> bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t
>devid,
>  uint32_t *type, void *data, uint32_t len,
>  uint64_t *caps, Error **errp);
>+bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t
>dev_id,
>+uint32_t pt_id, uint32_t flags,
>+uint32_t data_type, uint32_t data_len,
>+void *data_ptr, uint32_t *out_hwpt,
>+Error **errp);
>
> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>TYPE_HOST_IOMMU_DEVICE "-iommufd"
> #endif
>diff --git a/backends/iommufd.c b/backends/iommufd.c
>index 2b3d51af26d2..a94d3b90c05c 100644
>--- a/backends/iommufd.c
>+++ b/backends/iommufd.c
>@@ -208,6 +208,36 @@ int
>iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
> return ret;
> }
>
>+bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t
>dev_id,
>+uint32_t pt_id, uint32_t flags,
>+uint32_t data_type, uint32_t data_len,
>+void *data_ptr, uint32_t *out_hwpt,
>+Error **errp)
>+{
>+int ret, fd = be->fd;
>+struct iommu_hwpt_alloc alloc_hwpt = {
>+.size = sizeof(struct iommu_hwpt_alloc),
>+.flags = flags,
>+.dev_id = dev_id,
>+

RE: [PATCH v5 01/13] vfio/pci: Extract mdev check into an helper

2024-07-21 Thread Duan, Zhenzhong




>-Original Message-
>From: Joao Martins 
>Sent: Friday, July 19, 2024 8:05 PM
>To: qemu-devel@nongnu.org
>Cc: Liu, Yi L ; Eric Auger ; Duan,
>Zhenzhong ; Alex Williamson
>; Cedric Le Goater ; Jason
>Gunthorpe ; Avihai Horon ; Joao
>Martins 
>Subject: [PATCH v5 01/13] vfio/pci: Extract mdev check into an helper
>
>In preparation to skip initialization of the HostIOMMUDevice for mdev,
>extract the checks that validate if a device is an mdev into helpers.
>
>A vfio_device_is_mdev() is created, and subsystems consult
>VFIODevice::mdev
>to check if it's mdev or not.
>
>Signed-off-by: Joao Martins 

Reviewed-by: Zhenzhong Duan 

Thanks
Zhenzhong

>---
> include/hw/vfio/vfio-common.h |  2 ++
> hw/vfio/helpers.c | 14 ++
> hw/vfio/pci.c | 12 +++-
> 3 files changed, 19 insertions(+), 9 deletions(-)
>
>diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>common.h
>index e8ddf92bb185..98acae8c1c97 100644
>--- a/include/hw/vfio/vfio-common.h
>+++ b/include/hw/vfio/vfio-common.h
>@@ -116,6 +116,7 @@ typedef struct VFIODevice {
> DeviceState *dev;
> int fd;
> int type;
>+bool mdev;
> bool reset_works;
> bool needs_reset;
> bool no_mmap;
>@@ -231,6 +232,7 @@ void vfio_region_exit(VFIORegion *region);
> void vfio_region_finalize(VFIORegion *region);
> void vfio_reset_handler(void *opaque);
> struct vfio_device_info *vfio_get_device_info(int fd);
>+bool vfio_device_is_mdev(VFIODevice *vbasedev);
> bool vfio_attach_device(char *name, VFIODevice *vbasedev,
> AddressSpace *as, Error **errp);
> void vfio_detach_device(VFIODevice *vbasedev);
>diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
>index b14edd46edc9..7e23e9080c9d 100644
>--- a/hw/vfio/helpers.c
>+++ b/hw/vfio/helpers.c
>@@ -675,3 +675,17 @@ int vfio_device_get_aw_bits(VFIODevice *vdev)
>
> return HOST_IOMMU_DEVICE_CAP_AW_BITS_MAX;
> }
>+
>+bool vfio_device_is_mdev(VFIODevice *vbasedev)
>+{
>+g_autofree char *subsys = NULL;
>+g_autofree char *tmp = NULL;
>+
>+if (!vbasedev->sysfsdev) {
>+return false;
>+}
>+
>+tmp = g_strdup_printf("%s/subsystem", vbasedev->sysfsdev);
>+subsys = realpath(tmp, NULL);
>+return subsys && (strcmp(subsys, "/sys/bus/mdev") == 0);
>+}
>diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>index e03d9f3ba546..b34e91468a53 100644
>--- a/hw/vfio/pci.c
>+++ b/hw/vfio/pci.c
>@@ -2963,12 +2963,9 @@ static void vfio_realize(PCIDevice *pdev, Error
>**errp)
> ERRP_GUARD();
> VFIOPCIDevice *vdev = VFIO_PCI(pdev);
> VFIODevice *vbasedev = &vdev->vbasedev;
>-char *subsys;
> int i, ret;
>-bool is_mdev;
> char uuid[UUID_STR_LEN];
> g_autofree char *name = NULL;
>-g_autofree char *tmp = NULL;
>
> if (vbasedev->fd < 0 && !vbasedev->sysfsdev) {
> if (!(~vdev->host.domain || ~vdev->host.bus ||
>@@ -2997,14 +2994,11 @@ static void vfio_realize(PCIDevice *pdev, Error
>**errp)
>  * stays in sync with the active working set of the guest driver.  Prevent
>  * the x-balloon-allowed option unless this is minimally an mdev device.
>  */
>-tmp = g_strdup_printf("%s/subsystem", vbasedev->sysfsdev);
>-subsys = realpath(tmp, NULL);
>-is_mdev = subsys && (strcmp(subsys, "/sys/bus/mdev") == 0);
>-free(subsys);
>+vbasedev->mdev = vfio_device_is_mdev(vbasedev);
>
>-trace_vfio_mdev(vbasedev->name, is_mdev);
>+trace_vfio_mdev(vbasedev->name, vbasedev->mdev);
>
>-if (vbasedev->ram_block_discard_allowed && !is_mdev) {
>+if (vbasedev->ram_block_discard_allowed && !vbasedev->mdev) {
> error_setg(errp, "x-balloon-allowed only potentially compatible "
>"with mdev devices");
> goto error;
>--
>2.17.2

RE: [PATCH v1 03/17] intel_iommu: Add a placeholder variable for scalable modern mode

2024-07-18 Thread Duan, Zhenzhong



>-Original Message-
>From: Duan, Zhenzhong
>Subject: RE: [PATCH v1 03/17] intel_iommu: Add a placeholder variable for
>scalable modern mode
>
>
>
>>-Original Message-
>>From: Liu, Yi L 
>>Subject: Re: [PATCH v1 03/17] intel_iommu: Add a placeholder variable for
>>scalable modern mode
>>
>>On 2024/7/19 10:47, Duan, Zhenzhong wrote:
>>>
>>>
>>>> -Original Message-
>>>> From: CLEMENT MATHIEU--DRIF 
>>>> Subject: Re: [PATCH v1 03/17] intel_iommu: Add a placeholder variable
>>for
>>>> scalable modern mode
>>>>
>>>>
>>>>
>>>> On 18/07/2024 10:16, Zhenzhong Duan wrote:
>>>>> Caution: External email. Do not open attachments or click links, unless
>>this
>>>> email comes from a known sender and you know the content is safe.
>>>>>
>>>>>
>>>>> Add a new element scalable_modern in IntelIOMMUState to mark scalable
>>>>> modern mode; this element will finally be exposed as an intel_iommu
>>>>> property.
>>>>>
>>>>> For now, it's only a placeholder, used for cap/ecap initialization,
>>>>> compatibility checks, and blocking host device passthrough until nesting
>>>>> is supported.
>>>>>
>>>>> Signed-off-by: Yi Liu 
>>>>> Signed-off-by: Zhenzhong Duan 
>>>>> ---
>>>>>hw/i386/intel_iommu_internal.h |  2 ++
>>>>>include/hw/i386/intel_iommu.h  |  1 +
>>>>>hw/i386/intel_iommu.c  | 34 +++---
>--
>>--
>>>>>3 files changed, 26 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/hw/i386/intel_iommu_internal.h
>>>> b/hw/i386/intel_iommu_internal.h
>>>>> index c0ca7b372f..4e0331caba 100644
>>>>> --- a/hw/i386/intel_iommu_internal.h
>>>>> +++ b/hw/i386/intel_iommu_internal.h
>>>>> @@ -195,6 +195,7 @@
>>>>>#define VTD_ECAP_PASID  (1ULL << 40)
>>>>>#define VTD_ECAP_SMTS   (1ULL << 43)
>>>>>#define VTD_ECAP_SLTS   (1ULL << 46)
>>>>> +#define VTD_ECAP_FLTS   (1ULL << 47)
>>>>>
>>>>>/* CAP_REG */
>>>>>/* (offset >> 4) << 24 */
>>>>> @@ -211,6 +212,7 @@
>>>>>#define VTD_CAP_SLLPS   ((1ULL << 34) | (1ULL << 35))
>>>>>#define VTD_CAP_DRAIN_WRITE (1ULL << 54)
>>>>>#define VTD_CAP_DRAIN_READ  (1ULL << 55)
>>>>> +#define VTD_CAP_FS1GP   (1ULL << 56)
>>>>>#define VTD_CAP_DRAIN   (VTD_CAP_DRAIN_READ |
>>>> VTD_CAP_DRAIN_WRITE)
>>>>>#define VTD_CAP_CM  (1ULL << 7)
>>>>>#define VTD_PASID_ID_SHIFT  20
>>>>> diff --git a/include/hw/i386/intel_iommu.h
>>>> b/include/hw/i386/intel_iommu.h
>>>>> index 1eb05c29fc..788ed42477 100644
>>>>> --- a/include/hw/i386/intel_iommu.h
>>>>> +++ b/include/hw/i386/intel_iommu.h
>>>>> @@ -262,6 +262,7 @@ struct IntelIOMMUState {
>>>>>
>>>>>bool caching_mode;  /* RO - is cap CM enabled? */
>>>>>bool scalable_mode; /* RO - is Scalable Mode supported?
>*/
>>>>> +bool scalable_modern;   /* RO - is modern SM supported? */
>>>>>bool snoop_control; /* RO - is SNP filed supported? */
>>>>>
>>>>>dma_addr_t root;/* Current root table pointer */
>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>> index 1cff8b00ae..40cbd4a0f4 100644
>>>>> --- a/hw/i386/intel_iommu.c
>>>>> +++ b/hw/i386/intel_iommu.c
>>>>> @@ -755,16 +755,20 @@ static inline bool
>>>> vtd_is_level_supported(IntelIOMMUState *s, uint32_t level)
>>>>>}
>>>>>
>>>>>/* Return true if check passed, otherwise false */
>>>>> -static inline bool vtd_pe_type_check(X86IOMMUState *x86_iommu,
>>>>> - VTDPASIDEntry *pe)
>>>>> +static inline bool vtd_pe_type_check(IntelIOMMUState *s,
>>>> VTDPASIDEntry *pe)

RE: [PATCH v1 03/17] intel_iommu: Add a placeholder variable for scalable modern mode

2024-07-18 Thread Duan, Zhenzhong



>-Original Message-
>From: Liu, Yi L 
>Subject: Re: [PATCH v1 03/17] intel_iommu: Add a placeholder variable for
>scalable modern mode
>
>On 2024/7/19 10:47, Duan, Zhenzhong wrote:
>>
>>
>>> -Original Message-
>>> From: CLEMENT MATHIEU--DRIF 
>>> Subject: Re: [PATCH v1 03/17] intel_iommu: Add a placeholder variable
>for
>>> scalable modern mode
>>>
>>>
>>>
>>> On 18/07/2024 10:16, Zhenzhong Duan wrote:
>>>> Caution: External email. Do not open attachments or click links, unless
>this
>>> email comes from a known sender and you know the content is safe.
>>>>
>>>>
>>>> Add a new element scalable_modern in IntelIOMMUState to mark scalable
>>>> modern mode; this element will finally be exposed as an intel_iommu
>>>> property.
>>>>
>>>> For now, it's only a placeholder, used for cap/ecap initialization,
>>>> compatibility checks, and blocking host device passthrough until nesting
>>>> is supported.
>>>>
>>>> Signed-off-by: Yi Liu 
>>>> Signed-off-by: Zhenzhong Duan 
>>>> ---
>>>>hw/i386/intel_iommu_internal.h |  2 ++
>>>>include/hw/i386/intel_iommu.h  |  1 +
>>>>hw/i386/intel_iommu.c  | 34 +++-
>--
>>>>3 files changed, 26 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/hw/i386/intel_iommu_internal.h
>>> b/hw/i386/intel_iommu_internal.h
>>>> index c0ca7b372f..4e0331caba 100644
>>>> --- a/hw/i386/intel_iommu_internal.h
>>>> +++ b/hw/i386/intel_iommu_internal.h
>>>> @@ -195,6 +195,7 @@
>>>>#define VTD_ECAP_PASID  (1ULL << 40)
>>>>#define VTD_ECAP_SMTS   (1ULL << 43)
>>>>#define VTD_ECAP_SLTS   (1ULL << 46)
>>>> +#define VTD_ECAP_FLTS   (1ULL << 47)
>>>>
>>>>/* CAP_REG */
>>>>/* (offset >> 4) << 24 */
>>>> @@ -211,6 +212,7 @@
>>>>#define VTD_CAP_SLLPS   ((1ULL << 34) | (1ULL << 35))
>>>>#define VTD_CAP_DRAIN_WRITE (1ULL << 54)
>>>>#define VTD_CAP_DRAIN_READ  (1ULL << 55)
>>>> +#define VTD_CAP_FS1GP   (1ULL << 56)
>>>>#define VTD_CAP_DRAIN   (VTD_CAP_DRAIN_READ |
>>> VTD_CAP_DRAIN_WRITE)
>>>>#define VTD_CAP_CM  (1ULL << 7)
>>>>#define VTD_PASID_ID_SHIFT  20
>>>> diff --git a/include/hw/i386/intel_iommu.h
>>> b/include/hw/i386/intel_iommu.h
>>>> index 1eb05c29fc..788ed42477 100644
>>>> --- a/include/hw/i386/intel_iommu.h
>>>> +++ b/include/hw/i386/intel_iommu.h
>>>> @@ -262,6 +262,7 @@ struct IntelIOMMUState {
>>>>
>>>>bool caching_mode;  /* RO - is cap CM enabled? */
>>>>bool scalable_mode; /* RO - is Scalable Mode supported? 
>>>> */
>>>> +bool scalable_modern;   /* RO - is modern SM supported? */
>>>>bool snoop_control; /* RO - is SNP filed supported? */
>>>>
>>>>dma_addr_t root;/* Current root table pointer */
>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>> index 1cff8b00ae..40cbd4a0f4 100644
>>>> --- a/hw/i386/intel_iommu.c
>>>> +++ b/hw/i386/intel_iommu.c
>>>> @@ -755,16 +755,20 @@ static inline bool
>>> vtd_is_level_supported(IntelIOMMUState *s, uint32_t level)
>>>>}
>>>>
>>>>/* Return true if check passed, otherwise false */
>>>> -static inline bool vtd_pe_type_check(X86IOMMUState *x86_iommu,
>>>> - VTDPASIDEntry *pe)
>>>> +static inline bool vtd_pe_type_check(IntelIOMMUState *s,
>>> VTDPASIDEntry *pe)
>>>>{
>>> What about using the cap/ecap registers to know if the translation types
>>> are supported or not.
>>> Otherwise, we could add a comment to explain why we expect
>>> s->scalable_modern to give us enough information.
>>
>> What about below:
>>
>> /*
>>   *VTD_ECAP_FLTS in ecap is set if s->scalable_modern is true, or else
>VTD_ECAP_SLTS can be set or not depending on s->scalable_mode.
>>   *So it's simple

RE: [PATCH v1 16/17] intel_iommu: Modify x-scalable-mode to be string option

2024-07-18 Thread Duan, Zhenzhong



>-Original Message-
>From: CLEMENT MATHIEU--DRIF 
>Subject: Re: [PATCH v1 16/17] intel_iommu: Modify x-scalable-mode to be
>string option
>
>
>
>On 18/07/2024 10:16, Zhenzhong Duan wrote:
>> Caution: External email. Do not open attachments or click links, unless this
>email comes from a known sender and you know the content is safe.
>>
>>
>> From: Yi Liu 
>>
>> Intel VT-d 3.0 introduces scalable mode, and it has a bunch of capabilities
>> related to scalable mode translation, thus there are multiple combinations.
>> While this vIOMMU implementation wants to simplify it for user by
>providing
>> typical combinations. User could config it by "x-scalable-mode" option. The
>> usage is as below:
>>
>> "-device intel-iommu,x-scalable-mode=["legacy"|"modern"|"off"]"
>>
>>   - "legacy": gives support for stage-2 page table
>>   - "modern": gives support for stage-1 page table
>>   - "off": no scalable mode support
>>   -  if not configured, means no scalable mode support, if not proper
>>  configured, will throw error
>s/proper/properly
>Maybe we could split and rephrase the last bullet point to make it clear
>that "off" is equivalent to not using the option at all

Do you mean splitting the last bullet into a separate paragraph?
If so, how about the following:

  - "legacy": gives support for stage-2 page table
  - "modern": gives support for stage-1 page table
  - "off": no scalable mode support
  - any other string: throws an error

If x-scalable-mode is not configured, it is equivalent to x-scalable-mode=off.
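
For reference, the intended behaviour can be sketched as a small standalone function. The enum and function names here are hypothetical; the patch itself does the string comparison in vtd_decide_config() and stores the result in s->scalable_mode and s->scalable_modern.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names, for illustration only. */
enum demo_sm {
    DEMO_SM_OFF,
    DEMO_SM_LEGACY,
    DEMO_SM_MODERN,
    DEMO_SM_INVALID,
};

/* NULL (option not given) is treated exactly like "off". */
static enum demo_sm demo_parse_scalable_mode(const char *str)
{
    if (str == NULL || strcmp(str, "off") == 0) {
        return DEMO_SM_OFF;
    }
    if (strcmp(str, "legacy") == 0) {
        return DEMO_SM_LEGACY;
    }
    if (strcmp(str, "modern") == 0) {
        return DEMO_SM_MODERN;
    }
    return DEMO_SM_INVALID;     /* any other string: caller reports an error */
}

/* Derive the two booleans the patch sets from an already validated mode. */
static void demo_sm_to_flags(enum demo_sm mode, bool *scalable_mode,
                             bool *scalable_modern)
{
    assert(mode != DEMO_SM_INVALID);
    *scalable_mode = (mode != DEMO_SM_OFF);
    *scalable_modern = (mode == DEMO_SM_MODERN);
}
```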

Thanks
Zhenzhong

>>
>> Signed-off-by: Yi Liu 
>> Signed-off-by: Yi Sun 
>> Signed-off-by: Zhenzhong Duan 
>> ---
>>   include/hw/i386/intel_iommu.h |  1 +
>>   hw/i386/intel_iommu.c | 24 +++-
>>   2 files changed, 24 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/hw/i386/intel_iommu.h
>b/include/hw/i386/intel_iommu.h
>> index 48134bda11..650641544c 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -263,6 +263,7 @@ struct IntelIOMMUState {
>>
>>   bool caching_mode;  /* RO - is cap CM enabled? */
>>   bool scalable_mode; /* RO - is Scalable Mode supported? */
>> +char *scalable_mode_str;/* RO - admin's Scalable Mode config */
>>   bool scalable_modern;   /* RO - is modern SM supported? */
>>   bool snoop_control; /* RO - is SNP filed supported? */
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 2804c3628a..14d05fce1d 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -3770,7 +3770,7 @@ static Property vtd_properties[] = {
>>   DEFINE_PROP_UINT8("aw-bits", IntelIOMMUState, aw_bits,
>> VTD_HOST_AW_AUTO),
>>   DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode,
>FALSE),
>> -DEFINE_PROP_BOOL("x-scalable-mode", IntelIOMMUState,
>scalable_mode, FALSE),
>> +DEFINE_PROP_STRING("x-scalable-mode", IntelIOMMUState,
>scalable_mode_str),
>>   DEFINE_PROP_BOOL("snoop-control", IntelIOMMUState,
>snoop_control, false),
>>   DEFINE_PROP_BOOL("x-pasid-mode", IntelIOMMUState, pasid, false),
>>   DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
>> @@ -4686,6 +4686,28 @@ static bool
>vtd_decide_config(IntelIOMMUState *s, Error **errp)
>>   }
>>   }
>>
>> +if (s->scalable_mode_str &&
>> +(strcmp(s->scalable_mode_str, "off") &&
>> + strcmp(s->scalable_mode_str, "modern") &&
>> + strcmp(s->scalable_mode_str, "legacy"))) {
>> +error_setg(errp, "Invalid x-scalable-mode config,"
>> + "Please use \"modern\", \"legacy\" or \"off\"");
>> +return false;
>> +}
>> +
>> +if (s->scalable_mode_str &&
>> +!strcmp(s->scalable_mode_str, "legacy")) {
>> +s->scalable_mode = true;
>> +s->scalable_modern = false;
>> +} else if (s->scalable_mode_str &&
>> +!strcmp(s->scalable_mode_str, "modern")) {
>> +s->scalable_mode = true;
>> +s->scalable_modern = true;
>> +} else {
>> +s->scalable_mode = false;
>> +s->scalable_modern = false;
>> +}
>> +
>>   if (s->aw_bits == VTD_HOST_AW_AUTO) {
>>   if (s->scalable_modern) {
>>   s->aw_bits = VTD_HOST_AW_48BIT;
>> --
>> 2.34.1
>>
>LGTM

RE: [PATCH v1 03/17] intel_iommu: Add a placeholder variable for scalable modern mode

2024-07-18 Thread Duan, Zhenzhong



>-Original Message-
>From: CLEMENT MATHIEU--DRIF 
>Subject: Re: [PATCH v1 03/17] intel_iommu: Add a placeholder variable for
>scalable modern mode
>
>
>
>On 18/07/2024 10:16, Zhenzhong Duan wrote:
>> Caution: External email. Do not open attachments or click links, unless this
>email comes from a known sender and you know the content is safe.
>>
>>
>> Add a new element scalable_modern in IntelIOMMUState to mark scalable
>> modern mode; this element will finally be exposed as an intel_iommu
>> property.
>>
>> For now, it's only a placeholder, used for cap/ecap initialization,
>> compatibility checks, and blocking host device passthrough until nesting
>> is supported.
>>
>> Signed-off-by: Yi Liu 
>> Signed-off-by: Zhenzhong Duan 
>> ---
>>   hw/i386/intel_iommu_internal.h |  2 ++
>>   include/hw/i386/intel_iommu.h  |  1 +
>>   hw/i386/intel_iommu.c  | 34 +++---
>>   3 files changed, 26 insertions(+), 11 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index c0ca7b372f..4e0331caba 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -195,6 +195,7 @@
>>   #define VTD_ECAP_PASID  (1ULL << 40)
>>   #define VTD_ECAP_SMTS   (1ULL << 43)
>>   #define VTD_ECAP_SLTS   (1ULL << 46)
>> +#define VTD_ECAP_FLTS   (1ULL << 47)
>>
>>   /* CAP_REG */
>>   /* (offset >> 4) << 24 */
>> @@ -211,6 +212,7 @@
>>   #define VTD_CAP_SLLPS   ((1ULL << 34) | (1ULL << 35))
>>   #define VTD_CAP_DRAIN_WRITE (1ULL << 54)
>>   #define VTD_CAP_DRAIN_READ  (1ULL << 55)
>> +#define VTD_CAP_FS1GP   (1ULL << 56)
>>   #define VTD_CAP_DRAIN   (VTD_CAP_DRAIN_READ |
>VTD_CAP_DRAIN_WRITE)
>>   #define VTD_CAP_CM  (1ULL << 7)
>>   #define VTD_PASID_ID_SHIFT  20
>> diff --git a/include/hw/i386/intel_iommu.h
>b/include/hw/i386/intel_iommu.h
>> index 1eb05c29fc..788ed42477 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -262,6 +262,7 @@ struct IntelIOMMUState {
>>
>>   bool caching_mode;  /* RO - is cap CM enabled? */
>>   bool scalable_mode; /* RO - is Scalable Mode supported? */
>> +bool scalable_modern;   /* RO - is modern SM supported? */
>>   bool snoop_control; /* RO - is SNP filed supported? */
>>
>>   dma_addr_t root;/* Current root table pointer */
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 1cff8b00ae..40cbd4a0f4 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -755,16 +755,20 @@ static inline bool
>vtd_is_level_supported(IntelIOMMUState *s, uint32_t level)
>>   }
>>
>>   /* Return true if check passed, otherwise false */
>> -static inline bool vtd_pe_type_check(X86IOMMUState *x86_iommu,
>> - VTDPASIDEntry *pe)
>> +static inline bool vtd_pe_type_check(IntelIOMMUState *s,
>VTDPASIDEntry *pe)
>>   {
>What about using the cap/ecap registers to know if the translation types
>are supported or not.
>Otherwise, we could add a comment to explain why we expect
>s->scalable_modern to give us enough information.

How about the following:

/*
 * VTD_ECAP_FLTS in ecap is set if s->scalable_modern is true; otherwise
 * VTD_ECAP_SLTS may or may not be set, depending on s->scalable_mode.
 * So it's simpler to check s->scalable_modern directly for a PASID entry
 * type instead of checking the ecap bits.
 */
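
Put differently, the check in the patch reduces to the logic below. This is a standalone sketch with made-up enum and function names, not the actual intel_iommu.c code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the VTD_SM_PASID_ENTRY_* types. */
enum demo_pgtt {
    DEMO_PGTT_FLT,      /* first-level (stage-1) translation */
    DEMO_PGTT_SLT,      /* second-level (stage-2) translation */
    DEMO_PGTT_NESTED,   /* nested translation, not supported yet */
    DEMO_PGTT_PT,       /* pass-through */
};

static bool demo_pe_type_ok(enum demo_pgtt type, bool scalable_modern,
                            bool pt_supported)
{
    switch (type) {
    case DEMO_PGTT_FLT:
        return scalable_modern;     /* only valid in modern mode */
    case DEMO_PGTT_SLT:
        return !scalable_modern;    /* only valid in legacy mode */
    case DEMO_PGTT_PT:
        return pt_supported;
    case DEMO_PGTT_NESTED:
    default:
        return false;
    }
}

/* Small self-check that doubles as a usage example. */
static void demo_pe_type_selfcheck(void)
{
    assert(demo_pe_type_ok(DEMO_PGTT_FLT, true, false));
    assert(!demo_pe_type_ok(DEMO_PGTT_FLT, false, false));
    assert(demo_pe_type_ok(DEMO_PGTT_SLT, false, false));
    assert(!demo_pe_type_ok(DEMO_PGTT_SLT, true, false));
}
```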

Thanks
Zhenzhong

>> +X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
>> +
>>   switch (VTD_PE_GET_TYPE(pe)) {
>> +case VTD_SM_PASID_ENTRY_FLT:
>> +return s->scalable_modern;
>>   case VTD_SM_PASID_ENTRY_SLT:
>> -return true;
>> +return !s->scalable_modern;
>> +case VTD_SM_PASID_ENTRY_NESTED:
>> +/* Not support NESTED page table type yet */
>> +return false;
>>   case VTD_SM_PASID_ENTRY_PT:
>>   return x86_iommu->pt_supported;
>> -case VTD_SM_PASID_ENTRY_FLT:
>> -case VTD_SM_PASID_ENTRY_NESTED:
>>   default:
>>   /* Unknown type */
>>   return false;
>> @@ -813,7 +817,6 @@ static int
>vtd_get_pe_in_pasid_leaf_table(IntelIOMMUState *s,
>>   uint8_t pgtt;
>>   uint32_t index;
>>   dma_addr_t entry_size;
>> -X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
>>
>>   index = VTD_PASID_TABLE_INDEX(pasid);
>>   entry_size = VTD_PASID_ENTRY_SIZE;
>> @@ -827,7 +830,7 @@ static int
>vtd_get_pe_in_pasid_leaf_table(IntelIOMMUState *s,
>>   }
>>
>>   /* Do translation type check */
>> -if (!vtd_pe_type_check(x86_iommu, pe)) {
>> +if (!vtd_pe_type_check(s, pe)) {
>>   return -VTD_FR_PASID_TABLE_ENTRY_INV;
>>   }
>>
>> @@ -3861,7 +3864,13 @@ static bool vtd_check_hiod(IntelIOMMUState
>*s, HostIOMMUDevice *hiod,
>>   return false;
>>   }
>>
>> -return true;
>> +if (!s->scalable_mo

RE: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain creation

2024-07-18 Thread Duan, Zhenzhong



>-Original Message-
>From: Joao Martins 
>Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>creation
>
>On 18/07/2024 08:44, Duan, Zhenzhong wrote:
>>>>>> If existing hwpt doesn't support dirty tracking.
>>>>>> Another device supporting dirty tracking attaches to that hwpt, what
>>> will
>>>>> happen?
>>>>>>
>>>>>
>>>>> Hmm, it succeeds as there's no incompatibility. At the very least I plan
>on
>>>>> blocking migration if the device neither has VF dirty tracking, nor
>IOMMU
>>>>> dirty
>>>>> tracking (and patch 11 needs to be adjusted to check hwpt_flags
>instead
>>> of
>>>>> container).
>>>>
>>>> When bcontainer->dirty_pages_supported is true, I think that container
>>> should only contains hwpt list that support dirty tracking. All hwpt not
>>> supporting dirty tracking should be in other container.
>>>>
>>> Well but we are adopting this auto domains scheme and works for any
>>> device,
>>> dirty tracking or not. We already track hwpt flags so we know which ones
>>> support
>>> dirty tracking. This differentiation would (IMHO) complicate more and I
>am
>>> not
>>> sure the gain
>>
>> OK, I was trying to make bcontainer->dirty_pages_supported  accurate
>because it is used in many functions such as vfio_get_dirty_bitmap() which
>require an accurate value. If there is mix of hwpt in that container, that's
>impossible.
>>
>> But as you say you want to address the mix issue in a follow-up and
>presume all are homogeneous hw for now, then OK, there is no conflict.
>>
>
>Right
>
>>>
>>>> If device supports dirty tracking, it should bypass attaching container
>that
>>> doesn't support dirty tracking. Vice versa.
>>>> This way we can support the mixing environment.
>>>>
>>>
>>> It's not that easy as the whole flow doesn't handle this mixed mode (even
>>> excluding this series). We would have to have device-dirty-tracking start all
>>> non-disabled device trackers first [and stop them as well], and then we
>>> would
>>> always iterate those first (if device dirty trackers are active), and then
>defer
>>> to IOMMU tracker for those who don't.
>>
>> Why is device-dirty-tracking preferred over IOMMU dirty tracking?
>> Imagine if many devices attached to same domain.
>>
>
>The heuristic or expectation is that device dirty tracking doesn't involve a
>compromise for SW because it can a) perform lowest granularity of IOVA
>range
>being dirty with b) no DMA penalty. With IOMMU though, SW needs to
>worry about
>managing page tables to dictate the granularity and those take time to walk
>the
>deeper the level we descend into. I used to think that with IOMMU we have a
>DMA penalty (because of the IOTLB flushes to clear the dirty bit, and IOTLB
>cache misses) but I haven't yet seen that materialize in the field (at least
>for 100Gbit/s rates).
>
>TL;DR At the end of the day with device dirty tracking you have less to worry
>about, and it's the VF doing most of the heavy lifting. In theory with device
>dirty tracking you could even perform sub basepage tracking if the device
>allows
>it to do so.

Clear, thanks Joao.

BRs.
Zhenzhong
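
The preference order Joao describes (device dirty tracker first, IOMMU tracker as the fallback) can be sketched as follows. This is a hypothetical illustration only: the series under review does not implement the mixed mode, and the `sketch_*` types and function are invented here, not QEMU APIs.

```c
#include <stdbool.h>

/* Which dirty tracker a device would use in a hypothetical mixed mode. */
enum sketch_tracker {
    SKETCH_TRACKER_NONE,
    SKETCH_TRACKER_DEVICE,  /* VF tracks dirty IOVA ranges itself */
    SKETCH_TRACKER_IOMMU,   /* dirty bits read from IOMMU page tables */
};

/* Hypothetical per-device capability flags. */
struct sketch_dev {
    bool device_dirty_tracking;
    bool iommu_dirty_tracking;
};

/*
 * Prefer the device tracker: per the discussion it needs no page-table
 * management by software and carries no DMA penalty. Fall back to the
 * IOMMU tracker only for devices that cannot track on their own.
 */
static enum sketch_tracker sketch_pick_tracker(const struct sketch_dev *dev)
{
    if (dev->device_dirty_tracking) {
        return SKETCH_TRACKER_DEVICE;
    }
    if (dev->iommu_dirty_tracking) {
        return SKETCH_TRACKER_IOMMU;
    }
    return SKETCH_TRACKER_NONE;
}
```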

RE: [PATCH v4 00/12] hw/iommufd: IOMMUFD Dirty Tracking

2024-07-18 Thread Duan, Zhenzhong



>-Original Message-
>From: Joao Martins 
>Subject: Re: [PATCH v4 00/12] hw/iommufd: IOMMUFD Dirty Tracking
>
>On 16/07/2024 09:20, Duan, Zhenzhong wrote:
>>
>>
>>> -Original Message-
>>> From: Joao Martins 
>>> Subject: [PATCH v4 00/12] hw/iommufd: IOMMUFD Dirty Tracking
>>>
>>> This small series adds support for IOMMU dirty tracking support via the
>>> IOMMUFD backend. The hardware capability is available on most recent
>x86
>>> hardware. The series is organized as follows:
>>>
>>> * Patch 1-2: Fixes a regression into mdev support with IOMMUFD. This
>>> one is independent of the series but happened to cross it
>>> while testing mdev with this series
>>
>> I guess VFIO ap/ccw may need fixes too.
>> Will you help on that or I can take it if you want to focus on dirty 
>> tracking.
>> The fix may be trivial, just assign VFIODevice->mdev = true.
>>
>
>If you have something in mind already by all means go ahead.

OK, I will do that after your 'dirty tracking' v5, as there is a dependency.

>
>But from the code are we sure these are mdev bus devices? Certainly are
>grepping
>with 'mdev' but unclear if that's abbreviation for 'My Device' or actually bus
>mdev/mediated-device?

I think so, docs/system/s390x/vfio-[ap|ccw].rst shows /sys/bus/mdev

Thanks
Zhenzhong

RE: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain creation

2024-07-18 Thread Duan, Zhenzhong



>-Original Message-
>From: Joao Martins 
>Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>creation
>
>On 17/07/2024 11:05, Duan, Zhenzhong wrote:
>>> -Original Message-
>>> From: Joao Martins 
>>> Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>>> creation
>>>
>>> On 17/07/2024 03:18, Duan, Zhenzhong wrote:
>>>>
>>>>
>>>>> -Original Message-
>>>>> From: Joao Martins 
>>>>> Subject: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>>> creation
>>>>>
>>>>> There's generally two modes of operation for IOMMUFD:
>>>>>
>>>>> * The simple user API which intends to perform relatively simple things
>>>>> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to
>VFIO
>>>>> and mainly performs IOAS_MAP and UNMAP.
>>>>>
>>>>> * The native IOMMUFD API where you have fine grained control of the
>>>>> IOMMU domain and model it accordingly. This is where most new
>feature
>>>>> are being steered to.
>>>>>
>>>>> For dirty tracking 2) is required, as it needs to ensure that
>>>>> the stage-2/parent IOMMU domain will only attach devices
>>>>> that support dirty tracking (so far it is all homogeneous in x86, likely
>>>>> not the case for smmuv3). Such invariant on dirty tracking provides a
>>>>> useful guarantee to VMMs that will refuse incompatible device
>>>>> attachments for IOMMU domains.
>>>>>
>>>>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>>>>> responsible for creating an IOMMU domain. This is contrast to the
>>>>> 'simple API' where the IOMMU domain is created by IOMMUFD
>>>>> automatically
>>>>> when it attaches to VFIO (usually referred as autodomains) but it has
>>>>> the needed handling for mdevs.
>>>>>
>>>>> To support dirty tracking with the advanced IOMMUFD API, it needs
>>>>> similar logic, where IOMMU domains are created and devices attached
>to
>>>>> compatible domains. Essentially mimicking kernel
>>>>> iommufd_device_auto_get_domain(). With mdevs given there's no
>>> IOMMU
>>>>> domain
>>>>> it falls back to IOAS attach.
>>>>>
>>>>> The auto domain logic allows different IOMMU domains to be created
>>> when
>>>>> DMA dirty tracking is not desired (and VF can provide it), and others
>>> where
>>>>> it is. Here is not used in this way here given how VFIODevice migration
>>>>> state is initialized after the device attachment. But such mixed mode of
>>>>> IOMMU dirty tracking + device dirty tracking is an improvement that
>can
>>>>> be added on. Keep the 'all or nothing' of type1 approach that we have
>>>>> been using so far between container vs device dirty tracking.
>>>>>
>>>>> Signed-off-by: Joao Martins 
>>>>> ---
>>>>> include/hw/vfio/vfio-common.h |  9 
>>>>> include/sysemu/iommufd.h  |  5 +++
>>>>> backends/iommufd.c| 30 +
>>>>> hw/vfio/iommufd.c | 82
>>>>> +++
>>>>> backends/trace-events |  1 +
>>>>> 5 files changed, 127 insertions(+)
>>>>>
>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>>>>> common.h
>>>>> index 7419466bca92..2dd468ce3c02 100644
>>>>> --- a/include/hw/vfio/vfio-common.h
>>>>> +++ b/include/hw/vfio/vfio-common.h
>>>>> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>>>>>
>>>>> typedef struct IOMMUFDBackend IOMMUFDBackend;
>>>>>
>>>>> +typedef struct VFIOIOASHwpt {
>>>>> +uint32_t hwpt_id;
>>>>> +QLIST_HEAD(, VFIODevice) device_list;
>>>>> +QLIST_ENTRY(VFIOIOASHwpt) next;
>>>>> +} VFIOIOASHwpt;
>>>>> +
>>>>> typedef struct VFIOIOMMUFDContainer {
>>>>> VFIOContainerBase bcontainer;
>>>>> IOMMUFDBackend *be;
>>>>> uint32_t ioas_id;
>>>>> +QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>>> } V

RE: [PATCH v4 11/12] vfio/migration: Don't block migration device dirty tracking is unsupported

2024-07-18 Thread Duan, Zhenzhong



>-Original Message-
>From: Joao Martins 
>Subject: Re: [PATCH v4 11/12] vfio/migration: Don't block migration device
>dirty tracking is unsupported
>
>On 17/07/2024 03:38, Duan, Zhenzhong wrote:
>>
>>
>>> -Original Message-
>>> From: Joao Martins 
>>> Subject: [PATCH v4 11/12] vfio/migration: Don't block migration device
>dirty
>>> tracking is unsupported
>>>
>>> By default VFIO migration is set to auto, which will support live
>>> migration if the migration capability is set *and* also dirty page
>>> tracking is supported.
>>>
>>> For testing purposes one can force enable without dirty page tracking
>>> via enable-migration=on, but that option is generally left for testing
>>> purposes.
>>>
>>> So starting with IOMMU dirty tracking it can be used to accommodate the lack of
>>> VF dirty page tracking allowing us to minimize the VF requirements for
>>> migration and thus enabling migration by default for those too.
>>>
>>> Signed-off-by: Joao Martins 
>>> ---
>>> hw/vfio/migration.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index 34d4be2ce1b1..ce3d1b6e9a25 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -1036,7 +1036,8 @@ bool vfio_migration_realize(VFIODevice
>>> *vbasedev, Error **errp)
>>> return !vfio_block_migration(vbasedev, err, errp);
>>> }
>>>
>>> -if (!vbasedev->dirty_pages_supported) {
>>> +if (!vbasedev->dirty_pages_supported &&
>>> +!vbasedev->bcontainer->dirty_pages_supported) {
>>> if (vbasedev->enable_migration == ON_OFF_AUTO_AUTO) {
>>> error_setg(&err,
>>>"%s: VFIO device doesn't support device dirty 
>>> tracking",
>>
>> I'm not sure if this message needs to be updated, " VFIO device doesn't
>support device and IOMMU dirty tracking"
>>
>> Same for the below:
>>
>> warn_report("%s: VFIO device doesn't support device dirty tracking"
>
>
>Ah yes, good catch. Additionally I think I should check device hwpt rather
>than
>container::dirty_pages_supported i.e.
>
>if (!vbasedev->dirty_pages_supported &&
>(vbasedev->hwpt && !iommufd_hwpt_dirty_tracking(vbasedev->hwpt)))
>
>This makes sure that migration is blocked with more accuracy

Yes, this is better. It looks like bcontainer->dirty_pages_supported is not as
accurate as it was in the legacy VFIO days.

Thanks
Zhenzhong
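
The condition Joao proposes above can be modelled in isolation to see which combinations take the "no dirty tracking" branch. This is a minimal sketch under assumptions: the `sketch_*` structs are hypothetical reductions of VFIODevice and its hwpt, and the `dirty_tracking` field stands in for whatever iommufd_hwpt_dirty_tracking() would report.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a hardware page table and its capability. */
struct sketch_hwpt {
    bool dirty_tracking;
};

/* Hypothetical, reduced view of a VFIO device. */
struct sketch_vdev {
    bool dirty_pages_supported;      /* device (VF) dirty tracking */
    const struct sketch_hwpt *hwpt;  /* NULL when no hwpt is attached */
};

/*
 * Mirror of the expression quoted in the thread:
 *   !dev->dirty_pages_supported &&
 *   (dev->hwpt && !hwpt_dirty_tracking(dev->hwpt))
 * Note that, as written, a device without an hwpt does not take
 * this branch.
 */
static bool sketch_lacks_dirty_tracking(const struct sketch_vdev *dev)
{
    return !dev->dirty_pages_supported &&
           (dev->hwpt != NULL && !dev->hwpt->dirty_tracking);
}
```

Checking the per-device hwpt rather than the container flag keeps the decision local to the device being realized, which is the accuracy gain mentioned in the reply.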

RE: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain creation

2024-07-17 Thread Duan, Zhenzhong



>-Original Message-
>From: Joao Martins 
>Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>creation
>
>On 17/07/2024 03:18, Duan, Zhenzhong wrote:
>>
>>
>>> -Original Message-
>>> From: Joao Martins 
>>> Subject: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>creation
>>>
>>> There's generally two modes of operation for IOMMUFD:
>>>
>>> * The simple user API which intends to perform relatively simple things
>>> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to VFIO
>>> and mainly performs IOAS_MAP and UNMAP.
>>>
>>> * The native IOMMUFD API where you have fine grained control of the
>>> IOMMU domain and model it accordingly. This is where most new feature
>>> are being steered to.
>>>
>>> For dirty tracking 2) is required, as it needs to ensure that
>>> the stage-2/parent IOMMU domain will only attach devices
>>> that support dirty tracking (so far it is all homogeneous in x86, likely
>>> not the case for smmuv3). Such invariant on dirty tracking provides a
>>> useful guarantee to VMMs that will refuse incompatible device
>>> attachments for IOMMU domains.
>>>
>>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>>> responsible for creating an IOMMU domain. This is contrast to the
>>> 'simple API' where the IOMMU domain is created by IOMMUFD
>>> automatically
>>> when it attaches to VFIO (usually referred as autodomains) but it has
>>> the needed handling for mdevs.
>>>
>>> To support dirty tracking with the advanced IOMMUFD API, it needs
>>> similar logic, where IOMMU domains are created and devices attached to
>>> compatible domains. Essentially mimicking kernel
>>> iommufd_device_auto_get_domain(). With mdevs given there's no
>IOMMU
>>> domain
>>> it falls back to IOAS attach.
>>>
>>> The auto domain logic allows different IOMMU domains to be created
>when
>>> DMA dirty tracking is not desired (and VF can provide it), and others
>where
>>> it is. Here is not used in this way here given how VFIODevice migration
>>> state is initialized after the device attachment. But such mixed mode of
>>> IOMMU dirty tracking + device dirty tracking is an improvement that can
>>> be added on. Keep the 'all or nothing' of type1 approach that we have
>>> been using so far between container vs device dirty tracking.
>>>
>>> Signed-off-by: Joao Martins 
>>> ---
>>> include/hw/vfio/vfio-common.h |  9 
>>> include/sysemu/iommufd.h  |  5 +++
>>> backends/iommufd.c| 30 +
>>> hw/vfio/iommufd.c | 82
>>> +++
>>> backends/trace-events |  1 +
>>> 5 files changed, 127 insertions(+)
>>>
>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>>> common.h
>>> index 7419466bca92..2dd468ce3c02 100644
>>> --- a/include/hw/vfio/vfio-common.h
>>> +++ b/include/hw/vfio/vfio-common.h
>>> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>>>
>>> typedef struct IOMMUFDBackend IOMMUFDBackend;
>>>
>>> +typedef struct VFIOIOASHwpt {
>>> +uint32_t hwpt_id;
>>> +QLIST_HEAD(, VFIODevice) device_list;
>>> +QLIST_ENTRY(VFIOIOASHwpt) next;
>>> +} VFIOIOASHwpt;
>>> +
>>> typedef struct VFIOIOMMUFDContainer {
>>> VFIOContainerBase bcontainer;
>>> IOMMUFDBackend *be;
>>> uint32_t ioas_id;
>>> +QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>> } VFIOIOMMUFDContainer;
>>>
>>> OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer,
>>> VFIO_IOMMU_IOMMUFD);
>>> @@ -135,6 +142,8 @@ typedef struct VFIODevice {
>>> HostIOMMUDevice *hiod;
>>> int devid;
>>> IOMMUFDBackend *iommufd;
>>> +VFIOIOASHwpt *hwpt;
>>> +QLIST_ENTRY(VFIODevice) hwpt_next;
>>> } VFIODevice;
>>>
>>> struct VFIODeviceOps {
>>> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>>> index 57d502a1c79a..e917e7591d05 100644
>>> --- a/include/sysemu/iommufd.h
>>> +++ b/include/sysemu/iommufd.h
>>> @@ -50,6 +50,11 @@ int
>>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t
>ioas_id,
>>> bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
>uint32_t
>>> devid,
>

RE: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain creation

2024-07-17 Thread Duan, Zhenzhong



>-Original Message-
>From: Joao Martins 
>Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>creation
>
>On 17/07/2024 03:52, Duan, Zhenzhong wrote:
>>
>>
>>> -Original Message-
>>> From: Joao Martins 
>>> Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>>> creation
>>>
>>> On 16/07/2024 17:44, Joao Martins wrote:
>>>> On 16/07/2024 17:04, Eric Auger wrote:
>>>>> Hi Joao,
>>>>>
>>>>> On 7/12/24 13:46, Joao Martins wrote:
>>>>>> There's generally two modes of operation for IOMMUFD:
>>>>>>
>>>>>> * The simple user API which intends to perform relatively simple
>things
>>>>>> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to
>VFIO
>>>>>
>>>>> It generally creates? can you explicit what is "it"
>>>>>
>>>> 'It' here refers to the process/API-user
>>>>
>>>>> I am confused by this automatic terminology again (not your fault). the
>>> doc says:
>>>>> "
>>>>>
>>>>>   *
>>>>>
>>>>> Automatic domain - refers to an iommu domain created
>automatically
>>>>> when attaching a device to an IOAS object. This is compatible to the
>>>>> semantics of VFIO type1.
>>>>>
>>>>>   *
>>>>>
>>>>> Manual domain - refers to an iommu domain designated by the user
>as
>>>>> the target pagetable to be attached to by a device. Though currently
>>>>> there are no uAPIs to directly create such domain, the datastructure
>>>>> and algorithms are ready for handling that use case.
>>>>>
>>>>> "
>>>>>
>>>>>
>>>>> in 1) the device is attached to the ioas id (using the auto domain if I am
>>> not wrong)
>>>>> Here you attach to an hwpt id. Isn't it a manual domain?
>>>>>
>>>>
>>>> Correct.
>>>>
>>>> The 'auto domains' generally refers to the kernel-equivalent own
>>> automatic
>>>> attaching to a new pagetable.
>>>>
>>>> Here I call 'auto domains' in the userspace version too because we are
>>> doing the
>>>> exact same but from userspace, using the manual API in IOMMUFD.
>>>>
>>>>>> and mainly performs IOAS_MAP and UNMAP.
>>>>>>
>>>>>> * The native IOMMUFD API where you have fine grained control of
>the
>>>>>> IOMMU domain and model it accordingly. This is where most new
>>> feature
>>>>>> are being steered to.
>>>>>>
>>>>>> For dirty tracking 2) is required, as it needs to ensure that
>>>>>> the stage-2/parent IOMMU domain will only attach devices
>>>>>> that support dirty tracking (so far it is all homogeneous in x86, likely
>>>>>> not the case for smmuv3). Such invariant on dirty tracking provides a
>>>>>> useful guarantee to VMMs that will refuse incompatible device
>>>>>> attachments for IOMMU domains.
>>>>>>
>>>>>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>>>>>> responsible for creating an IOMMU domain. This is contrast to the
>>>>>> 'simple API' where the IOMMU domain is created by IOMMUFD
>>> automatically
>>>>>> when it attaches to VFIO (usually referred as autodomains) but it has
>>>>>> the needed handling for mdevs.
>>>>>>
>>>>>> To support dirty tracking with the advanced IOMMUFD API, it needs
>>>>>> similar logic, where IOMMU domains are created and devices
>attached
>>> to
>>>>>> compatible domains. Essentially mimicking kernel
>>>>>> iommufd_device_auto_get_domain(). With mdevs given there's no
>>> IOMMU domain
>>>>>> it falls back to IOAS attach.
>>>>>>
>>>>>> The auto domain logic allows different IOMMU domains to be created
>>> when
>>>>>> DMA dirty tracking is not desired (and VF can provide it), and others
>>> where
>>>>>> it is. Here is not used in this way here given how VFIODevice
>migration
>>>>>
>>

RE: [PATCH 3/6] virtio-iommu: Free [host_]resv_ranges on unset_iommu_devices

2024-07-17 Thread Duan, Zhenzhong



>-Original Message-
>From: Eric Auger 
>Subject: Re: [PATCH 3/6] virtio-iommu: Free [host_]resv_ranges on
>unset_iommu_devices
>
>Hi Zhenzhong,
>
>On 7/17/24 05:06, Duan, Zhenzhong wrote:
>>
>>> -Original Message-
>>> From: Eric Auger 
>>> Subject: [PATCH 3/6] virtio-iommu: Free [host_]resv_ranges on
>>> unset_iommu_devices
>>>
>>> We are currently missing the deallocation of the [host_]resv_regions
>>> in case of hot unplug. Also to make things more simple let's rule
>>> out the case where multiple HostIOMMUDevices would be aliased and
>>> attached to the same IOMMUDevice. This allows to remove the handling
>>> of conflicting Host reserved regions. Anyway this is not properly
>>> supported at guest kernel level. On hotunplug the reserved regions
>>> are reset to the ones set by virtio-iommu property.
>>>
>>> Signed-off-by: Eric Auger 
>>> ---
>>> hw/virtio/virtio-iommu.c | 62 ++--
>>> 1 file changed, 28 insertions(+), 34 deletions(-)
>>>
>>> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
>>> index 2c54c0d976..2de41ab412 100644
>>> --- a/hw/virtio/virtio-iommu.c
>>> +++ b/hw/virtio/virtio-iommu.c
>>> @@ -538,8 +538,6 @@ static int
>>> virtio_iommu_set_host_iova_ranges(VirtIOIOMMU *s, PCIBus *bus,
>>> {
>>> IOMMUPciBus *sbus = g_hash_table_lookup(s->as_by_busptr, bus);
>>> IOMMUDevice *sdev;
>>> -GList *current_ranges;
>>> -GList *l, *tmp, *new_ranges = NULL;
>>> int ret = -EINVAL;
>>>
>>> if (!sbus) {
>>> @@ -553,33 +551,10 @@ static int
>>> virtio_iommu_set_host_iova_ranges(VirtIOIOMMU *s, PCIBus *bus,
>>> return ret;
>>> }
>>>
>>> -current_ranges = sdev->host_resv_ranges;
>>> -
>>> -/* check that each new resv region is included in an existing one */
>>> if (sdev->host_resv_ranges) {
>>> -range_inverse_array(iova_ranges,
>>> -&new_ranges,
>>> -0, UINT64_MAX);
>>> -
>>> -for (tmp = new_ranges; tmp; tmp = tmp->next) {
>>> -Range *newr = (Range *)tmp->data;
>>> -bool included = false;
>>> -
>>> -for (l = current_ranges; l; l = l->next) {
>>> -Range * r = (Range *)l->data;
>>> -
>>> -if (range_contains_range(r, newr)) {
>>> -included = true;
>>> -break;
>>> -}
>>> -}
>>> -if (!included) {
>>> -goto error;
>>> -}
>>> -}
>>> -/* all new reserved ranges are included in existing ones */
>>> -ret = 0;
>>> -goto out;
>>> +error_setg(errp, "%s virtio-iommu does not support aliased BDF",
>>> +   __func__);
>>> +return ret;
>>> }
>>>
>>> range_inverse_array(iova_ranges,
>>> @@ -588,14 +563,31 @@ static int
>>> virtio_iommu_set_host_iova_ranges(VirtIOIOMMU *s, PCIBus *bus,
>>> rebuild_resv_regions(sdev);
>>>
>>> return 0;
>>> -error:
>>> -error_setg(errp, "%s Conflicting host reserved ranges set!",
>>> -   __func__);
>>> -out:
>>> -g_list_free_full(new_ranges, g_free);
>>> -return ret;
>>> }
>>>
>>> +static void virtio_iommu_unset_host_iova_ranges(VirtIOIOMMU *s,
>>> PCIBus *bus,
>>> +int devfn)
>>> +{
>>> +IOMMUPciBus *sbus = g_hash_table_lookup(s->as_by_busptr, bus);
>>> +IOMMUDevice *sdev;
>>> +
>>> +if (!sbus) {
>>> +return;
>>> +}
>>> +
>>> +sdev = sbus->pbdev[devfn];
>>> +if (!sdev) {
>>> +return;
>>> +}
>>> +
>>> +g_list_free_full(g_steal_pointer(&sdev->host_resv_ranges), g_free);
>>> +g_list_free_full(sdev->resv_regions, g_free);
>>> +sdev->host_resv_ranges = NULL;
>>> +sdev->resv_regions = NULL;
>>> +add_prop_resv_regions(sdev);
>> Is this necessary? rebuild_resv_regions() will do that a

RE: [PATCH] intel_iommu: Use the latest fault reasons defined by spec

2024-07-16 Thread Duan, Zhenzhong

Hi Michael, Jason,

Based on Yi's analysis, is keeping the current VERSION value acceptable to you?
We look forward to your comments; this open issue currently blocks us from
sending the next version.

Thanks
Zhenzhong

>-Original Message-
>From: Liu, Yi L 
>Subject: Re: [PATCH] intel_iommu: Use the latest fault reasons defined by
>spec
>
>Hi Michael, Jason,
>
>On 2024/5/28 11:03, Jason Wang wrote:
>> On Mon, May 27, 2024 at 2:50 PM Michael S. Tsirkin 
>wrote:
>>>
>>> On Mon, May 27, 2024 at 06:44:58AM +, Duan, Zhenzhong wrote:
>>>> Hi Jason,
>>>>
>>>>> -Original Message-
>>>>> From: Duan, Zhenzhong
>>>>> Subject: RE: [PATCH] intel_iommu: Use the latest fault reasons defined
>by
>>>>> spec
>>>>>
>>>>>
>>>>>
>>>>>> -Original Message-
>>>>>> From: Jason Wang 
>>>>>> Subject: Re: [PATCH] intel_iommu: Use the latest fault reasons
>defined by
>>>>>> spec
>>>>>>
>>>>>> On Fri, May 24, 2024 at 4:41 PM Duan, Zhenzhong
>>>>>>  wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> -Original Message-
>>>>>>>> From: Jason Wang 
>>>>>>>> Subject: Re: [PATCH] intel_iommu: Use the latest fault reasons
>defined
>>>>> by
>>>>>>>> spec
>>>>>>>>
>>>>>>>> On Tue, May 21, 2024 at 6:25 PM Duan, Zhenzhong
>>>>>>>>  wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> -Original Message-
>>>>>>>>>> From: Jason Wang 
>>>>>>>>>> Subject: Re: [PATCH] intel_iommu: Use the latest fault reasons
>>>>> defined
>>>>>> by
>>>>>>>>>> spec
>>>>>>>>>>
>>>>>>>>>> On Mon, May 20, 2024 at 12:15 PM Liu, Yi L 
>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> From: Duan, Zhenzhong 
>>>>>>>>>>>> Sent: Monday, May 20, 2024 11:41 AM
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> -Original Message-
>>>>>>>>>>>>> From: Jason Wang 
>>>>>>>>>>>>> Sent: Monday, May 20, 2024 8:44 AM
>>>>>>>>>>>>> To: Duan, Zhenzhong 
>>>>>>>>>>>>> Cc: qemu-devel@nongnu.org; Liu, Yi L; Peng, Chao P; Yu Zhang;
>>>>>>>>>>>>> Michael S. Tsirkin; Paolo Bonzini; Richard Henderson;
>>>>>>>>>>>>> Eduardo Habkost; Marcel Apfelbaum
>>>>>>>>>>>>> Subject: Re: [PATCH] intel_iommu: Use the latest fault
>reasons
>>>>>>>> defined
>>>>>>>>>> by
>>>>>>>>>>>>> spec
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, May 17, 2024 at 6:26 PM Zhenzhong Duan
>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> From: Yu Zhang 
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Currently we use only VTD_FR_PASID_TABLE_INV as fault
>>>>>> reason.
>>>>>>>>>>>>>> Update with more detailed fault reasons listed in VT-d spec
>>>>>> 7.2.3.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Yu Zhang 
>>>>>>>>>>>>>> Signed-off-by: Zhenzhong Duan
>
>>>>>>>>>>>>>> ---
>>

RE: [PATCH 3/6] virtio-iommu: Free [host_]resv_ranges on unset_iommu_devices

2024-07-16 Thread Duan, Zhenzhong




>-Original Message-
>From: Eric Auger 
>Subject: [PATCH 3/6] virtio-iommu: Free [host_]resv_ranges on
>unset_iommu_devices
>
>We are currently missing the deallocation of the [host_]resv_regions
>in case of hot unplug. Also to make things more simple let's rule
>out the case where multiple HostIOMMUDevices would be aliased and
>attached to the same IOMMUDevice. This allows to remove the handling
>of conflicting Host reserved regions. Anyway this is not properly
>supported at guest kernel level. On hotunplug the reserved regions
>are reset to the ones set by virtio-iommu property.
>
>Signed-off-by: Eric Auger 
>---
> hw/virtio/virtio-iommu.c | 62 ++--
> 1 file changed, 28 insertions(+), 34 deletions(-)
>
>diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
>index 2c54c0d976..2de41ab412 100644
>--- a/hw/virtio/virtio-iommu.c
>+++ b/hw/virtio/virtio-iommu.c
>@@ -538,8 +538,6 @@ static int
>virtio_iommu_set_host_iova_ranges(VirtIOIOMMU *s, PCIBus *bus,
> {
> IOMMUPciBus *sbus = g_hash_table_lookup(s->as_by_busptr, bus);
> IOMMUDevice *sdev;
>-GList *current_ranges;
>-GList *l, *tmp, *new_ranges = NULL;
> int ret = -EINVAL;
>
> if (!sbus) {
>@@ -553,33 +551,10 @@ static int
>virtio_iommu_set_host_iova_ranges(VirtIOIOMMU *s, PCIBus *bus,
> return ret;
> }
>
>-current_ranges = sdev->host_resv_ranges;
>-
>-/* check that each new resv region is included in an existing one */
> if (sdev->host_resv_ranges) {
>-range_inverse_array(iova_ranges,
>-&new_ranges,
>-0, UINT64_MAX);
>-
>-for (tmp = new_ranges; tmp; tmp = tmp->next) {
>-Range *newr = (Range *)tmp->data;
>-bool included = false;
>-
>-for (l = current_ranges; l; l = l->next) {
>-Range * r = (Range *)l->data;
>-
>-if (range_contains_range(r, newr)) {
>-included = true;
>-break;
>-}
>-}
>-if (!included) {
>-goto error;
>-}
>-}
>-/* all new reserved ranges are included in existing ones */
>-ret = 0;
>-goto out;
>+error_setg(errp, "%s virtio-iommu does not support aliased BDF",
>+   __func__);
>+return ret;
> }
>
> range_inverse_array(iova_ranges,
>@@ -588,14 +563,31 @@ static int
>virtio_iommu_set_host_iova_ranges(VirtIOIOMMU *s, PCIBus *bus,
> rebuild_resv_regions(sdev);
>
> return 0;
>-error:
>-error_setg(errp, "%s Conflicting host reserved ranges set!",
>-   __func__);
>-out:
>-g_list_free_full(new_ranges, g_free);
>-return ret;
> }
>
>+static void virtio_iommu_unset_host_iova_ranges(VirtIOIOMMU *s,
>PCIBus *bus,
>+int devfn)
>+{
>+IOMMUPciBus *sbus = g_hash_table_lookup(s->as_by_busptr, bus);
>+IOMMUDevice *sdev;
>+
>+if (!sbus) {
>+return;
>+}
>+
>+sdev = sbus->pbdev[devfn];
>+if (!sdev) {
>+return;
>+}
>+
>+g_list_free_full(g_steal_pointer(&sdev->host_resv_ranges), g_free);
>+g_list_free_full(sdev->resv_regions, g_free);
>+sdev->host_resv_ranges = NULL;
>+sdev->resv_regions = NULL;
>+add_prop_resv_regions(sdev);

Is this necessary? rebuild_resv_regions() will do that again.

Other than that, for the whole series,

Reviewed-by: Zhenzhong Duan 

Thanks
Zhenzhong

>+}
>+
>+
> static bool check_page_size_mask(VirtIOIOMMU *viommu, uint64_t
>new_mask,
>  Error **errp)
> {
>@@ -704,6 +696,8 @@ virtio_iommu_unset_iommu_device(PCIBus *bus,
>void *opaque, int devfn)
> if (!hiod) {
> return;
> }
>+virtio_iommu_unset_host_iova_ranges(viommu, hiod->aliased_bus,
>+hiod->aliased_devfn);
>
> g_hash_table_remove(viommu->host_iommu_devices, &key);
> }
>--
>2.41.0

RE: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain creation

2024-07-16 Thread Duan, Zhenzhong



>-Original Message-
>From: Joao Martins 
>Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>creation
>
>On 16/07/2024 17:44, Joao Martins wrote:
>> On 16/07/2024 17:04, Eric Auger wrote:
>>> Hi Joao,
>>>
>>> On 7/12/24 13:46, Joao Martins wrote:
>>>> There's generally two modes of operation for IOMMUFD:
>>>>
>>>> * The simple user API which intends to perform relatively simple things
>>>> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to VFIO
>>>
>>> It generally creates? can you explicit what is "it"
>>>
>> 'It' here refers to the process/API-user
>>
>>> I am confused by this automatic terminology again (not your fault). the
>doc says:
>>> "
>>>
>>>   *
>>>
>>> Automatic domain - refers to an iommu domain created automatically
>>> when attaching a device to an IOAS object. This is compatible to the
>>> semantics of VFIO type1.
>>>
>>>   *
>>>
>>> Manual domain - refers to an iommu domain designated by the user as
>>> the target pagetable to be attached to by a device. Though currently
>>> there are no uAPIs to directly create such domain, the datastructure
>>> and algorithms are ready for handling that use case.
>>>
>>> "
>>>
>>>
>>> in 1) the device is attached to the ioas id (using the auto domain if I am
>not wrong)
>>> Here you attach to an hwpt id. Isn't it a manual domain?
>>>
>>
>> Correct.
>>
>> The 'auto domains' generally refers to the kernel-equivalent own
>automatic
>> attaching to a new pagetable.
>>
>> Here I call 'auto domains' in the userspace version too because we are
>doing the
>> exact same but from userspace, using the manual API in IOMMUFD.
>>
>>>> and mainly performs IOAS_MAP and UNMAP.
>>>>
>>>> * The native IOMMUFD API where you have fine grained control of the
>>>> IOMMU domain and model it accordingly. This is where most new
>feature
>>>> are being steered to.
>>>>
>>>> For dirty tracking 2) is required, as it needs to ensure that
>>>> the stage-2/parent IOMMU domain will only attach devices
>>>> that support dirty tracking (so far it is all homogeneous in x86, likely
>>>> not the case for smmuv3). Such invariant on dirty tracking provides a
>>>> useful guarantee to VMMs that will refuse incompatible device
>>>> attachments for IOMMU domains.
>>>>
>>>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>>>> responsible for creating an IOMMU domain. This is contrast to the
>>>> 'simple API' where the IOMMU domain is created by IOMMUFD
>automatically
>>>> when it attaches to VFIO (usually referred as autodomains) but it has
>>>> the needed handling for mdevs.
>>>>
>>>> To support dirty tracking with the advanced IOMMUFD API, it needs
>>>> similar logic, where IOMMU domains are created and devices attached
>to
>>>> compatible domains. Essentially mimicking kernel
>>>> iommufd_device_auto_get_domain(). With mdevs given there's no
>IOMMU domain
>>>> it falls back to IOAS attach.
>>>>
>>>> The auto domain logic allows different IOMMU domains to be created
>when
>>>> DMA dirty tracking is not desired (and VF can provide it), and others
>where
>>>> it is. Here is not used in this way here given how VFIODevice migration
>>>
>>> Here is not used in this way here ?
>>>
>>
>> I meant, 'Here it is not used in this way given (...)'
>>
>>>> state is initialized after the device attachment. But such mixed mode of
>>>> IOMMU dirty tracking + device dirty tracking is an improvement that
>can
>>>> be added on. Keep the 'all or nothing' of type1 approach that we have
>>>> been using so far between container vs device dirty tracking.
>>>>
>>>> Signed-off-by: Joao Martins 
>>>> ---
>>>>  include/hw/vfio/vfio-common.h |  9 
>>>>  include/sysemu/iommufd.h  |  5 +++
>>>>  backends/iommufd.c| 30 +
>>>>  hw/vfio/iommufd.c | 82
>+++
>>>>  backends/trace-events |  1 +
>>>>  5 files changed, 127 insertions(+)
>>>>
>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>common.h
>>>> index 7419466bca92..2dd468ce3c02 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>>>>
>>>>  typedef struct IOMMUFDBackend IOMMUFDBackend;
>>>>
>>>> +typedef struct VFIOIOASHwpt {
>>>> +uint32_t hwpt_id;
>>>> +QLIST_HEAD(, VFIODevice) device_list;
>>>> +QLIST_ENTRY(VFIOIOASHwpt) next;
>>>> +} VFIOIOASHwpt;
>>>> +
>>>>  typedef struct VFIOIOMMUFDContainer {
>>>>  VFIOContainerBase bcontainer;
>>>>  IOMMUFDBackend *be;
>>>>  uint32_t ioas_id;
>>>> +QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>>  } VFIOIOMMUFDContainer;
>>>>
>>>>  OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer,
>VFIO_IOMMU_IOMMUFD);
>>>> @@ -135,6 +142,8 @@ typedef struct VFIODevice {
>>>>  HostIOMMUDevice *hiod;
>>>>  int devid;
>>>>  IOMMUFDBackend *iommufd;
>>>> +VFIOIOASHwpt *hwpt;
>>>> +QLIST_ENTRY(VFIODevice) hwpt_next;
>>>>  } VFIODevice

RE: [PATCH v4 11/12] vfio/migration: Don't block migration device dirty tracking is unsupported

2024-07-16 Thread Duan, Zhenzhong




>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v4 11/12] vfio/migration: Don't block migration device dirty
>tracking is unsupported
>
>By default VFIO migration is set to auto, which will support live
>migration if the migration capability is set *and* also dirty page
>tracking is supported.
>
>For testing purposes one can force enable without dirty page tracking
>via enable-migration=on, but that option is generally left for testing
>purposes.
>
>So starting with IOMMU dirty tracking, it can be used to accommodate the lack
>of VF dirty page tracking, allowing us to minimize the VF requirements for
>migration and thus enable migration by default for those too.
>
>Signed-off-by: Joao Martins 
>---
> hw/vfio/migration.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>index 34d4be2ce1b1..ce3d1b6e9a25 100644
>--- a/hw/vfio/migration.c
>+++ b/hw/vfio/migration.c
>@@ -1036,7 +1036,8 @@ bool vfio_migration_realize(VFIODevice
>*vbasedev, Error **errp)
> return !vfio_block_migration(vbasedev, err, errp);
> }
>
>-if (!vbasedev->dirty_pages_supported) {
>+if (!vbasedev->dirty_pages_supported &&
>+!vbasedev->bcontainer->dirty_pages_supported) {
> if (vbasedev->enable_migration == ON_OFF_AUTO_AUTO) {
> error_setg(&err,
>"%s: VFIO device doesn't support device dirty 
> tracking",

I'm not sure if this message needs to be updated to "VFIO device doesn't
support device and IOMMU dirty tracking".

Same for the below:

warn_report("%s: VFIO device doesn't support device dirty tracking"
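
To illustrate the point about the message wording, here is a minimal sketch of
the combined check. It is not QEMU code: dirty_tracking_blocker() and its two
parameters are hypothetical names, and the returned string is only the wording
suggested above, not text taken from the tree.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns NULL when migration may proceed, or the
 * reason it has to be blocked. Migration is blocked only when neither
 * the device nor the IOMMU (container) can track dirty pages, which is
 * the condition the quoted hunk introduces.
 */
static const char *dirty_tracking_blocker(bool device_dirty, bool iommu_dirty)
{
    if (device_dirty || iommu_dirty) {
        return NULL;
    }
    /* Illustrative wording only. */
    return "VFIO device doesn't support device and IOMMU dirty tracking";
}
```

With this shape the message is only produced when both trackers are missing,
so mentioning both in the text matches the condition being tested.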

RE: [PATCH v4 09/12] vfio/iommufd: Implement VFIOIOMMUClass::set_dirty_tracking support

2024-07-16 Thread Duan, Zhenzhong




>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v4 09/12] vfio/iommufd: Implement
>VFIOIOMMUClass::set_dirty_tracking support
>
>ioctl(iommufd, IOMMU_HWPT_SET_DIRTY_TRACKING, arg) is the UAPI that
>enables or disables dirty page tracking. It is used if the hwpt
>has been created with a dirty-tracking-capable domain (stored in
>hwpt::flags), and it is called on the whole list of IOMMU domains
>it is tracking. On failure it rolls the change back.
>
>The checking of hwpt::flags gets a second user here, so
>consolidate that check into a helper function,
>iommufd_hwpt_dirty_tracking().
>
>Signed-off-by: Joao Martins 
>---
> include/sysemu/iommufd.h |  3 +++
> backends/iommufd.c   | 23 +++
> hw/vfio/iommufd.c| 39
>++-
> backends/trace-events|  1 +
> 4 files changed, 65 insertions(+), 1 deletion(-)
>
>diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>index e917e7591d05..7416d9219703 100644
>--- a/include/sysemu/iommufd.h
>+++ b/include/sysemu/iommufd.h
>@@ -55,6 +55,9 @@ bool
>iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
> uint32_t data_type, uint32_t data_len,
> void *data_ptr, uint32_t *out_hwpt,
> Error **errp);
>+bool iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be,
>uint32_t hwpt_id,
>+bool start, Error **errp);
>
> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>TYPE_HOST_IOMMU_DEVICE "-iommufd"
>+
> #endif
>diff --git a/backends/iommufd.c b/backends/iommufd.c
>index 41a9dec3b2c5..239f0976e0ad 100644
>--- a/backends/iommufd.c
>+++ b/backends/iommufd.c
>@@ -239,6 +239,29 @@ bool
>iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
> return true;
> }
>
>+bool iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be,
>+uint32_t hwpt_id, bool start,
>+Error **errp)
>+{
>+int ret;
>+struct iommu_hwpt_set_dirty_tracking set_dirty = {
>+.size = sizeof(set_dirty),
>+.hwpt_id = hwpt_id,
>+.flags = !start ? 0 : IOMMU_HWPT_DIRTY_TRACKING_ENABLE,
>+};
>+
>+ret = ioctl(be->fd, IOMMU_HWPT_SET_DIRTY_TRACKING, &set_dirty);
>+trace_iommufd_backend_set_dirty(be->fd, hwpt_id, start, ret ? errno :
>0);
>+if (ret) {
>+error_setg_errno(errp, errno,
>+ "IOMMU_HWPT_SET_DIRTY_TRACKING(hwpt_id %u) failed",
>+ hwpt_id);
>+return false;
>+}
>+
>+return true;
>+}
>+
> bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t
>devid,
>  uint32_t *type, void *data, uint32_t len,
>  uint64_t *caps, Error **errp)
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index edc8f97d8f3d..da678315faeb 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -110,6 +110,42 @@ static void
>iommufd_cdev_unbind_and_disconnect(VFIODevice *vbasedev)
> iommufd_backend_disconnect(vbasedev->iommufd);
> }
>
>+static bool iommufd_hwpt_dirty_tracking(VFIOIOASHwpt *hwpt)
>+{
>+return hwpt->hwpt_flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>+}
>+
>+static int iommufd_set_dirty_page_tracking(const VFIOContainerBase
>*bcontainer,
>+   bool start, Error **errp)
>+{
>+const VFIOIOMMUFDContainer *container =
>+container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
>+VFIOIOASHwpt *hwpt;
>+
>+QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>+if (!iommufd_hwpt_dirty_tracking(hwpt)) {
>+continue;
>+}

So the devices under an hwpt that doesn't support dirty tracking are bypassed.
Then how are dirty pages coming from those devices tracked?

Thanks
Zhenzhong

>+
>+if (!iommufd_backend_set_dirty_tracking(container->be,
>+hwpt->hwpt_id, start, errp)) {
>+goto err;
>+}
>+}
>+
>+return 0;
>+
>+err:
>+QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>+if (!iommufd_hwpt_dirty_tracking(hwpt)) {
>+continue;
>+}
>+iommufd_backend_set_dirty_tracking(container->be,
>+   hwpt->hwpt_id, !start, NULL);
>+}
>+return -EINVAL;
>+}
>+
> static int iommufd_cdev_getfd(const char *sysfs_path, Error **errp)
> {
> ERRP_GUARD();
>@@ -278,7 +314,7 @@ static bool
>iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
> QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
> QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
> container->bcontainer.dirty_pages_supported |=
>-  (flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING);
>+  iommufd_hwpt_dirty_tracking(hwpt);
> return true;
> }
>
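
For reference, the enable-then-roll-back flow in the quoted hunk can be
sketched in isolation. Assumptions: struct hwpt, hwpt_set_tracking() and
set_dirty_tracking_all() are hypothetical stand-ins, not QEMU or kernel
interfaces, and fail_on_set only exists to simulate an ioctl failure. Unlike
the patch, which walks the whole list again on error, this sketch reverts only
the entries it had already switched.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for VFIOIOASHwpt; not the QEMU definition. */
struct hwpt {
    unsigned id;
    bool dirty_capable;   /* mirrors the hwpt flags check in the patch */
    bool tracking;        /* current dirty tracking state */
    bool fail_on_set;     /* test hook: simulate a failing ioctl */
};

/* Stand-in for the backend call that toggles tracking on one hwpt. */
static bool hwpt_set_tracking(struct hwpt *h, bool start)
{
    if (h->fail_on_set) {
        return false;
    }
    h->tracking = start;
    return true;
}

/*
 * Toggle tracking on every dirty-capable hwpt. On failure, revert the
 * entries that were already switched and return -EINVAL.
 */
static int set_dirty_tracking_all(struct hwpt *list, size_t n, bool start)
{
    size_t i, j;

    for (i = 0; i < n; i++) {
        if (!list[i].dirty_capable) {
            continue;   /* skipped, as in the quoted patch */
        }
        if (!hwpt_set_tracking(&list[i], start)) {
            goto err;
        }
    }
    return 0;

err:
    for (j = 0; j < i; j++) {
        if (list[j].dirty_capable) {
            hwpt_set_tracking(&list[j], !start);
        }
    }
    return -EINVAL;
}
```

Entries without the capability are never touched in either loop, so their
tracking state stays off; that is the case the question above is about.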

RE: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain creation

2024-07-16 Thread Duan, Zhenzhong




>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain creation
>
>There are generally two modes of operation for IOMMUFD:
>
>1) The simple user API, which intends to perform relatively simple things
>with IOMMUs, e.g. DPDK. It generally creates an IOAS, attaches it to VFIO,
>and mainly performs IOAS_MAP and UNMAP.
>
>2) The native IOMMUFD API, where you have fine-grained control of the
>IOMMU domain and model it accordingly. This is where most new features
>are being steered to.
>
>For dirty tracking 2) is required, as it needs to ensure that
>the stage-2/parent IOMMU domain will only attach devices
>that support dirty tracking (so far it is all homogeneous in x86, likely
>not the case for smmuv3). Such invariant on dirty tracking provides a
>useful guarantee to VMMs that will refuse incompatible device
>attachments for IOMMU domains.
>
>The dirty tracking guarantee is enforced via HWPT_ALLOC, which is
>responsible for creating an IOMMU domain. This is in contrast to the
>'simple API', where the IOMMU domain is created by IOMMUFD automatically
>when it attaches to VFIO (usually referred to as autodomains), but it has
>the needed handling for mdevs.
>
>To support dirty tracking with the advanced IOMMUFD API, it needs
>similar logic, where IOMMU domains are created and devices attached to
>compatible domains. Essentially mimicking kernel
>iommufd_device_auto_get_domain(). With mdevs, given there's no IOMMU
>domain, it falls back to IOAS attach.
>
>The auto domain logic allows different IOMMU domains to be created when
>DMA dirty tracking is not desired (and VF can provide it), and others where
>it is. Here it is not used in this way, given how VFIODevice migration
>state is initialized after the device attachment. But such a mixed mode of
>IOMMU dirty tracking + device dirty tracking is an improvement that can
>be added on. Keep the 'all or nothing' type1 approach that we have
>been using so far between container vs device dirty tracking.
>
>Signed-off-by: Joao Martins 
>---
> include/hw/vfio/vfio-common.h |  9 
> include/sysemu/iommufd.h  |  5 +++
> backends/iommufd.c| 30 +
> hw/vfio/iommufd.c | 82
>+++
> backends/trace-events |  1 +
> 5 files changed, 127 insertions(+)
>
>diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>common.h
>index 7419466bca92..2dd468ce3c02 100644
>--- a/include/hw/vfio/vfio-common.h
>+++ b/include/hw/vfio/vfio-common.h
>@@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>
> typedef struct IOMMUFDBackend IOMMUFDBackend;
>
>+typedef struct VFIOIOASHwpt {
>+uint32_t hwpt_id;
>+QLIST_HEAD(, VFIODevice) device_list;
>+QLIST_ENTRY(VFIOIOASHwpt) next;
>+} VFIOIOASHwpt;
>+
> typedef struct VFIOIOMMUFDContainer {
> VFIOContainerBase bcontainer;
> IOMMUFDBackend *be;
> uint32_t ioas_id;
>+QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
> } VFIOIOMMUFDContainer;
>
> OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer,
>VFIO_IOMMU_IOMMUFD);
>@@ -135,6 +142,8 @@ typedef struct VFIODevice {
> HostIOMMUDevice *hiod;
> int devid;
> IOMMUFDBackend *iommufd;
>+VFIOIOASHwpt *hwpt;
>+QLIST_ENTRY(VFIODevice) hwpt_next;
> } VFIODevice;
>
> struct VFIODeviceOps {
>diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>index 57d502a1c79a..e917e7591d05 100644
>--- a/include/sysemu/iommufd.h
>+++ b/include/sysemu/iommufd.h
>@@ -50,6 +50,11 @@ int
>iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
> bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t
>devid,
>  uint32_t *type, void *data, uint32_t len,
>  uint64_t *caps, Error **errp);
>+bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t
>dev_id,
>+uint32_t pt_id, uint32_t flags,
>+uint32_t data_type, uint32_t data_len,
>+void *data_ptr, uint32_t *out_hwpt,
>+Error **errp);
>
> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>TYPE_HOST_IOMMU_DEVICE "-iommufd"
> #endif
>diff --git a/backends/iommufd.c b/backends/iommufd.c
>index 2b3d51af26d2..5d3dfa917415 100644
>--- a/backends/iommufd.c
>+++ b/backends/iommufd.c
>@@ -208,6 +208,36 @@ int
>iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
> return ret;
> }
>
>+bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t
>dev_id,
>+uint32_t pt_id, uint32_t flags,
>+uint32_t data_type, uint32_t data_len,
>+void *data_ptr, uint32_t *out_hwpt,
>+Error **errp)
>+{
>+int ret, fd = be->fd;
>+struct iommu_hwpt_alloc alloc_hwpt = {
>+.size = sizeof(struct iommu_hwpt_alloc),
>+.flags = flags,
>+.dev_id = dev_id,
>+.pt_id = pt

RE: [PATCH v4 07/12] vfio/{iommufd,container}: Initialize HostIOMMUDeviceCaps during attach_device()

2024-07-16 Thread Duan, Zhenzhong



>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v4 07/12] vfio/{iommufd,container}: Initialize
>HostIOMMUDeviceCaps during attach_device()
>
>Fetch IOMMU hw raw caps behind the device and thus move the
>HostIOMMUDevice::realize() to be done during the attach of the device. It
>allows it to cache the information obtained from IOMMU_GET_HW_INFO from
>iommufd early on. However, while legacy HostIOMMUDevice caps
>always return true and don't depend on other things, the IOMMUFD
>backend requires the iommufd FD to be connected and a devid to be
>available in order to query capabilities. Hence exactly when
>HostIOMMUDevice is initialized inside the backend ::attach_device()
>implementation is backend specific.
>
>This is in preparation to fetch and parse hw capabilities and understand if
>dirty tracking is supported by the device's backing IOMMU without
>necessarily duplicating the number of calls we do to IOMMU_GET_HW_INFO.
>
>Suggested-by: Cédric Le Goater 
>Signed-off-by: Joao Martins 
>---
> include/sysemu/host_iommu_device.h |  1 +
> hw/vfio/common.c   | 16 ++--
> hw/vfio/container.c|  6 ++
> hw/vfio/iommufd.c  |  7 +++
> 4 files changed, 20 insertions(+), 10 deletions(-)
>
>diff --git a/include/sysemu/host_iommu_device.h
>b/include/sysemu/host_iommu_device.h
>index 20e77cf54568..b1e5f4b8ac3e 100644
>--- a/include/sysemu/host_iommu_device.h
>+++ b/include/sysemu/host_iommu_device.h
>@@ -24,6 +24,7 @@
>  */
> typedef struct HostIOMMUDeviceCaps {
> uint32_t type;
>+uint64_t hw_caps;
> } HostIOMMUDeviceCaps;
>
> #define TYPE_HOST_IOMMU_DEVICE "host-iommu-device"
>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>index b0beed44116e..cc14f0e3fe24 100644
>--- a/hw/vfio/common.c
>+++ b/hw/vfio/common.c
>@@ -1544,7 +1544,7 @@ bool vfio_attach_device(char *name, VFIODevice
>*vbasedev,
> {
> const VFIOIOMMUClass *ops =
>
>VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_LEGACY));
>-HostIOMMUDevice *hiod;
>+HostIOMMUDevice *hiod = NULL;

No need to NULL it?

>
> if (vbasedev->iommufd) {
> ops =
>VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_IOMMUF
>D));
>@@ -1552,21 +1552,17 @@ bool vfio_attach_device(char *name,
>VFIODevice *vbasedev,
>
> assert(ops);
>
>-if (!ops->attach_device(name, vbasedev, as, errp)) {
>-return false;
>-}
>
>-if (vbasedev->mdev) {
>-return true;
>+if (!vbasedev->mdev) {
>+hiod = HOST_IOMMU_DEVICE(object_new(ops->hiod_typename));
>+vbasedev->hiod = hiod;
> }
>
>-hiod = HOST_IOMMU_DEVICE(object_new(ops->hiod_typename));
>-if (!HOST_IOMMU_DEVICE_GET_CLASS(hiod)->realize(hiod, vbasedev,
>errp)) {
>+if (!ops->attach_device(name, vbasedev, as, errp)) {
> object_unref(hiod);
>-ops->detach_device(vbasedev);
>+vbasedev->hiod = NULL;
> return false;
> }
>-vbasedev->hiod = hiod;
>
> return true;
> }
>diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>index c27f448ba26e..29da261bbf3e 100644
>--- a/hw/vfio/container.c
>+++ b/hw/vfio/container.c
>@@ -907,6 +907,7 @@ static bool vfio_legacy_attach_device(const char
>*name, VFIODevice *vbasedev,
>   AddressSpace *as, Error **errp)
> {
> int groupid = vfio_device_groupid(vbasedev, errp);
>+HostIOMMUDevice *hiod = vbasedev->hiod;

hiod is used only once in this function; maybe use vbasedev->hiod directly?


> VFIODevice *vbasedev_iter;
> VFIOGroup *group;
> VFIOContainerBase *bcontainer;
>@@ -917,6 +918,11 @@ static bool vfio_legacy_attach_device(const char
>*name, VFIODevice *vbasedev,
>
> trace_vfio_attach_device(vbasedev->name, groupid);
>
>+if (hiod &&
>+!HOST_IOMMU_DEVICE_GET_CLASS(hiod)->realize(hiod, vbasedev,
>errp)) {
>+return false;
>+}
>+
> group = vfio_get_group(groupid, as, errp);
> if (!group) {
> return false;
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index 873c919e319c..d34dc88231ec 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -384,6 +384,7 @@ static bool iommufd_cdev_attach(const char *name,
>VFIODevice *vbasedev,
> Error *err = NULL;
> const VFIOIOMMUClass *iommufd_vioc =
>
>VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_IOMMUF
>D));
>+HostIOMMUDevice *hiod = vbasedev->hiod;

Same here.

>
> if (vbasedev->fd < 0) {
> devfd = iommufd_cdev_getfd(vbasedev->sysfsdev, errp);
>@@ -401,6 +402,11 @@ static bool iommufd_cdev_attach(const char
>*name, VFIODevice *vbasedev,
>
> space = vfio_get_address_space(as);
>
>+if (hiod &&
>+!HOST_IOMMU_DEVICE_GET_CLASS(hiod)->realize(hiod, vbasedev,
>errp)) {
>+return false;
>+}
>+
> /* try to attach to an existing container in this space */
> QLIST_FOREACH(bcontainer, &space->containers, next) {
> container = container_of(bcontainer, VFIOIOMMUFDContainer,
>bcontainer);
>@@ -722,6 +72

RE: [PATCH v4 04/12] vfio/iommufd: Return errno in iommufd_cdev_attach_ioas_hwpt()

2024-07-16 Thread Duan, Zhenzhong




>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v4 04/12] vfio/iommufd: Return errno in
>iommufd_cdev_attach_ioas_hwpt()
>
>In preparation to implement auto domains, have the attach function
>return the errno it got during domain attach instead of a bool.
>
>-EINVAL is used to track domain incompatibilities, and to decide whether
>to create a new IOMMU domain.
>
>Signed-off-by: Joao Martins 

Reviewed-by: Zhenzhong Duan 

Thanks
Zhenzhong

>---
> hw/vfio/iommufd.c | 8 
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index 604eaa4d9a5d..077dea8f1b64 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -172,7 +172,7 @@ out:
> return ret;
> }
>
>-static bool iommufd_cdev_attach_ioas_hwpt(VFIODevice *vbasedev,
>uint32_t id,
>+static int iommufd_cdev_attach_ioas_hwpt(VFIODevice *vbasedev,
>uint32_t id,
>  Error **errp)
> {
> int iommufd = vbasedev->iommufd->fd;
>@@ -187,12 +187,12 @@ static bool
>iommufd_cdev_attach_ioas_hwpt(VFIODevice *vbasedev, uint32_t id,
> error_setg_errno(errp, errno,
>  "[iommufd=%d] error attach %s (%d) to id=%d",
>  iommufd, vbasedev->name, vbasedev->fd, id);
>-return false;
>+return -errno;
> }
>
> trace_iommufd_cdev_attach_ioas_hwpt(iommufd, vbasedev->name,
> vbasedev->fd, id);
>-return true;
>+return 0;
> }
>
> static bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error
>**errp)
>@@ -216,7 +216,7 @@ static bool
>iommufd_cdev_attach_container(VFIODevice *vbasedev,
>   VFIOIOMMUFDContainer *container,
>   Error **errp)
> {
>-return iommufd_cdev_attach_ioas_hwpt(vbasedev, container->ioas_id,
>errp);
>+return !iommufd_cdev_attach_ioas_hwpt(vbasedev, container->ioas_id,
>errp);
> }
>
> static void iommufd_cdev_detach_container(VFIODevice *vbasedev,
>--
>2.17.2
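
A small sketch of the return convention this patch adopts: 0 on success, a
negative errno on failure, with -EINVAL meaning "incompatible domain". All
names here (fake_attach(), attach_hwpt(), attach_is_incompatible(),
attach_container()) are hypothetical stand-ins for the functions in the diff.
Saving errno into a local before reporting is an extra precaution in this
sketch, not something the patch does.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the attach ioctl: returns -1 and sets errno on failure. */
static int fake_attach(int simulated_errno)
{
    if (simulated_errno) {
        errno = simulated_errno;
        return -1;
    }
    return 0;
}

/* Returns 0 on success or a negative errno value on failure. */
static int attach_hwpt(int simulated_errno)
{
    if (fake_attach(simulated_errno)) {
        int saved = errno;  /* capture before anything can overwrite it */
        /* error reporting would go here */
        return -saved;
    }
    return 0;
}

/* Caller side: -EINVAL signals an incompatible domain, so try another. */
static bool attach_is_incompatible(int ret)
{
    return ret == -EINVAL;
}

/* bool wrapper in the style of iommufd_cdev_attach_container(). */
static bool attach_container(int simulated_errno)
{
    return !attach_hwpt(simulated_errno);
}
```

The wrapper keeps existing bool callers unchanged, while new callers can tell
an incompatible domain apart from other failures.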

RE: [PATCH v4 02/12] vfio/iommufd: Don't initialize nor set a HOST_IOMMU_DEVICE with mdev

2024-07-16 Thread Duan, Zhenzhong

Hello Joao,

>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v4 02/12] vfio/iommufd: Don't initialize nor set a
>HOST_IOMMU_DEVICE with mdev
>
>mdevs aren't "physical" devices and when asking for backing IOMMU info, it
>fails the entire provisioning of the guest. Fix that by skipping
>HostIOMMUDevice initialization in the presence of mdevs, and skip setting
>an iommu device when it is known to be an mdev.
>
>Cc: Zhenzhong Duan 
>Fixes: 930589520128 ("vfio/iommufd: Implement
>HostIOMMUDeviceClass::realize() handler")
>Signed-off-by: Joao Martins 

Thanks for fixing.

Reviewed-by: Zhenzhong Duan 

 BRs.
Zhenzhong

>---
> hw/vfio/common.c |  4 
> hw/vfio/pci.c| 10 +++---
> 2 files changed, 11 insertions(+), 3 deletions(-)
>
>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>index 7cdb969fd396..b0beed44116e 100644
>--- a/hw/vfio/common.c
>+++ b/hw/vfio/common.c
>@@ -1556,6 +1556,10 @@ bool vfio_attach_device(char *name,
>VFIODevice *vbasedev,
> return false;
> }
>
>+if (vbasedev->mdev) {
>+return true;
>+}
>+
> hiod = HOST_IOMMU_DEVICE(object_new(ops->hiod_typename));
> if (!HOST_IOMMU_DEVICE_GET_CLASS(hiod)->realize(hiod, vbasedev,
>errp)) {
> object_unref(hiod);
>diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>index 585f23a18406..3fc72e898a25 100644
>--- a/hw/vfio/pci.c
>+++ b/hw/vfio/pci.c
>@@ -3116,7 +3116,7 @@ static void vfio_realize(PCIDevice *pdev, Error
>**errp)
>
> vfio_bars_register(vdev);
>
>-if (!pci_device_set_iommu_device(pdev, vbasedev->hiod, errp)) {
>+if (!is_mdev && !pci_device_set_iommu_device(pdev, vbasedev->hiod,
>errp)) {
> error_prepend(errp, "Failed to set iommu_device: ");
> goto out_teardown;
> }
>@@ -3239,7 +3239,9 @@ out_deregister:
> timer_free(vdev->intx.mmap_timer);
> }
> out_unset_idev:
>-pci_device_unset_iommu_device(pdev);
>+if (!is_mdev) {
>+pci_device_unset_iommu_device(pdev);
>+}
> out_teardown:
> vfio_teardown_msi(vdev);
> vfio_bars_exit(vdev);
>@@ -3284,7 +3286,9 @@ static void vfio_exitfn(PCIDevice *pdev)
> vfio_pci_disable_rp_atomics(vdev);
> vfio_bars_exit(vdev);
> vfio_migration_exit(vbasedev);
>-pci_device_unset_iommu_device(pdev);
>+if (!vbasedev->mdev) {
>+pci_device_unset_iommu_device(pdev);
>+}
> }
>
> static void vfio_pci_reset(DeviceState *dev)
>--
>2.17.2

RE: [PATCH v4 00/12] hw/iommufd: IOMMUFD Dirty Tracking

2024-07-16 Thread Duan, Zhenzhong



>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v4 00/12] hw/iommufd: IOMMUFD Dirty Tracking
>
>This small series adds support for IOMMU dirty tracking via the
>IOMMUFD backend. The hardware capability is available on most recent x86
>hardware. The series is organized as follows:
>
>* Patch 1-2: Fixes a regression in mdev support with IOMMUFD. This
> one is independent of the series but happened to cross it
> while testing mdev with this series

I guess VFIO ap/ccw may need fixes too.
Will you help with that, or I can take it if you'd rather focus on dirty tracking?
The fix may be trivial: just assign VFIODevice->mdev = true.

Thanks
Zhenzhong

>
>* Patch 3: Adds support to iommufd_get_device_info() for capabilities
>
>* Patches 4 - 10: IOMMUFD backend support for dirty tracking;
>
>Introduce auto domains -- Patch 5 goes into more detail, but the gist is that
>we will find and attach a device to a compatible IOMMU domain, or allocate
>a new hardware pagetable *or* rely on kernel IOAS attach (for mdevs).
>Afterwards the workflow is relatively simple:
>
>1) Probe device and allow dirty tracking in the HWPT
>2) Toggling dirty tracking on/off
>3) Read-and-clear of Dirty IOVAs
>
>The heuristics selected for (1) were to always request the HWPT for
>dirty tracking if supported, or rely on device dirty page tracking. This
>is a little simplistic and we aren't necessarily utilizing IOMMU dirty
>tracking even if we ask during hwpt allocation.
>
>The unmap case is deferred until further vIOMMU support with migration
>is added[3] which will then introduce the usage of
>IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR in GET_DIRTY_BITMAP ioctl
>in the
>dma unmap bitmap flow.
>
>* Patches 11-12: Don't block live migration where there's no VF dirty
>tracker, considering that we have IOMMU dirty tracking.
>
>Comments and feedback appreciated. Thanks for all the review thus far!
>
>Cheers,
>Joao
>
>P.S. Suggest linux-next (or future v6.11) as hypervisor kernel as there's
>some bugs fixed there with regards to IOMMU hugepage dirty tracking.
>
>Changes since v3[5]:
>* Skip HostIOMMUDevice::realize for mdev, and introduce a helper to check
>  if the VFIO device is mdev. (Zhenzhong)
>* Skip setting IOMMU device for mdev (Zhenzhong)
>* Add Zhenzhong review tag in patch 3
>* Utilize vbasedev::bcontainer::dirty_pages_supported instead of
>  introducing a new HostIOMMUDevice capability and thus remove the cap
>  patch from the series (Zhenzhong)
>* Move the HostIOMMUDevice::realize() to be part of VFIODevice
>  initialization in attach_device() while skipping it altogether for
>  mdev. (Cedric)
>* Due to the previous item, had to remove aw_bits because it depends on
>  device attach being finished; instead defer it to when get_cap() gets
>  called.
>* Skip auto domains for mdev instead of purposely erroring out
>  (Zhenzhong)
>* Pass errp in all cases, and instead just free the error in case of
>  -EINVAL in most of all patches, and also pass Error* in
>  iommufd_backend_alloc_hwpt() and set/query dirty. This is made better
>  thanks in part to skipping auto domains for mdev (Cedric)
>
>Changes since RFCv2[4]:
>* Always allocate hwpt with IOMMU_HWPT_ALLOC_DIRTY_TRACKING even if
>  we end up not actually toggling dirty tracking. (Avihai)
>* Fix error handling widely in auto domains logic and all patches (Avihai)
>* Reuse iommufd_backend_get_device_info() for capabilities (Zhenzhong)
>* New patches 1 and 2 taking into consideration previous comments.
>* Store hwpt::flags to know if we have dirty tracking (Avihai)
>* New patch 8, that allows to query dirty tracking support after
>  provisioning. This is a cleaner way to check IOMMU dirty tracking support
>  when vfio::migration is initialized, as opposed to RFCv2 via device caps.
>  The device caps way is still used because at vfio attach we aren't yet with
>  a fully initialized migration state.
>* Adopt error propagation in query,set dirty tracking
>* Misc improvements overall broadly and Avihai
>* Drop hugepages as it's a bit unrelated; I can pursue that patch
>  separately. The main motivation is to provide a way to test
>  without hugepages similar to what vfio_type1_iommu.disable_hugepages=1
>  does.
>
>Changes since RFCv1[2]:
>* Remove intel/amd dirty tracking emulation enabling
>* Remove the dirtyrate improvement for VF/IOMMU dirty tracking
>[Will pursue these two in separate series]
>* Introduce auto domains support
>* Enforce dirty tracking following the IOMMUFD UAPI for this
>* Add support for toggling hugepages in IOMMUFD
>* Auto enable support when VF supports migration to use IOMMU
>when it doesn't have VF dirty tracking
>* Add a parameter to toggle VF dirty tracking
>
>[0] https://lore.kernel.org/qemu-devel/20240201072818.327930-1-
>zhenzhong.d...@intel.com/
>[1] https://lore.kernel.org/qemu-devel/20240201072818.327930-10-
>zhenzhong.d...@intel.com/
>[2] https://lore.kernel.org/qemu-devel/20220428211351.3897-1-
>joao.m.mart...@oracle.com/
>

RE: [PATCH v3 00/10] hw/vfio: IOMMUFD Dirty Tracking

2024-07-11 Thread Duan, Zhenzhong



>-Original Message-
>From: Joao Martins 
>Subject: Re: [PATCH v3 00/10] hw/vfio: IOMMUFD Dirty Tracking
>
>On 11/07/2024 08:41, Cédric Le Goater wrote:
>> Hello Joao,
>>
>> On 7/8/24 4:34 PM, Joao Martins wrote:
>>> This small series adds support for IOMMU dirty tracking via the
>>> IOMMUFD backend. The hardware capability is available on most recent
>>> x86 hardware. The series is organized as follows:
>>>
>>> * Patch 1: Fixes a regression in mdev support with IOMMUFD. This
>>>     one is independent of the series but happened to cross it
>>>     while testing mdev with this series
>>>
>>> * Patch 2: Adds support to iommufd_get_device_info() for capabilities
>>>
>>> * Patches 3 - 7: IOMMUFD backend support for dirty tracking;
>>>
>>> Introduce auto domains -- Patch 3 goes into more detail, but the gist is
>that
>>> we will find and attach a device to a compatible IOMMU domain, or
>allocate a new
>>> hardware pagetable *or* rely on kernel IOAS attach (for mdevs).
>Afterwards the
>>> workflow is relatively simple:
>>>
>>> 1) Probe device and allow dirty tracking in the HWPT
>>> 2) Toggling dirty tracking on/off
>>> 3) Read-and-clear of Dirty IOVAs
>>>
>>> The heuristics selected for (1) were to always request the HWPT for
>>> dirty tracking if supported, or rely on device dirty page tracking. This
>>> is a little simplistic and we aren't necessarily utilizing IOMMU dirty
>>> tracking even if we ask during hwpt allocation.
>>>
>>> The unmap case is deferred until further vIOMMU support with migration
>>> is added[3] which will then introduce the usage of
>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR in GET_DIRTY_BITMAP
>ioctl in the
>>> dma unmap bitmap flow.
>>>
>>> * Patches 8-10: Don't block live migration where there's no VF dirty
>>> tracker, considering that we have IOMMU dirty tracking.
>>>
>>> Comments and feedback appreciated.
>>>
>>> Cheers,
>>>  Joao
>>>
>>> P.S. Suggest linux-next (or future v6.11) as hypervisor kernel as there's
>>> some bugs fixed there with regards to IOMMU hugepage dirty tracking.
>>>
>>> Changes since RFCv2[4]:
>>> * Always allocate hwpt with IOMMU_HWPT_ALLOC_DIRTY_TRACKING
>even if
>>> we end up not actually toggling dirty tracking. (Avihai)
>>> * Fix error handling widely in auto domains logic and all patches (Avihai)
>>> * Reuse iommufd_backend_get_device_info() for capabilities (Zhenzhong)
>>> * New patches 1 and 2 taking into consideration previous comments.
>>> * Store hwpt::flags to know if we have dirty tracking (Avihai)
>>> * New patch 8, that allows to query dirty tracking support after
>>>   provisioning. This is a cleaner way to check IOMMU dirty tracking support
>>>   when vfio::migration is initialized, as opposed to RFCv2 via device caps.
>>>   The device caps way is still used because at vfio attach we aren't yet with
>>>   a fully initialized migration state.
>>> * Adopt error propagation in query,set dirty tracking
>>> * Misc improvements overall broadly and Avihai
>>> * Drop hugepages as it's a bit unrelated; I can pursue that patch
>>>   separately. The main motivation is to provide a way to test
>>>   without hugepages similar to what vfio_type1_iommu.disable_hugepages=1
>>>   does.
>>>
>>> Changes since RFCv1[2]:
>>> * Remove intel/amd dirty tracking emulation enabling
>>> * Remove the dirtyrate improvement for VF/IOMMU dirty tracking
>>> [Will pursue these two in separate series]
>>> * Introduce auto domains support
>>> * Enforce dirty tracking following the IOMMUFD UAPI for this
>>> * Add support for toggling hugepages in IOMMUFD
>>> * Auto enable support when VF supports migration to use IOMMU
>>> when it doesn't have VF dirty tracking
>>> * Add a parameter to toggle VF dirty tracking
>>>
>>> [0]
>>> https://lore.kernel.org/qemu-devel/20240201072818.327930-1-
>zhenzhong.d...@intel.com/
>>> [1]
>>> https://lore.kernel.org/qemu-devel/20240201072818.327930-10-
>zhenzhong.d...@intel.com/
>>> [2]
>>> https://lore.kernel.org/qemu-devel/20220428211351.3897-1-
>joao.m.mart...@oracle.com/
>>> [3]
>>> https://lore.kernel.org/qemu-devel/20230622214845.3980-1-
>joao.m.mart...@oracle.com/
>>> [4]
>>> https://lore.kernel.org/qemu-devel/20240212135643.5858-1-
>joao.m.mart...@oracle.com/
>>>
>>> Joao Martins (10):
>>>    vfio/iommufd: don't fail to realize on IOMMU_GET_HW_INFO failure
>>>    backends/iommufd: Extend iommufd_backend_get_device_info() to
>fetch HW
>>> capabilities
>>>    vfio/iommufd: Return errno in iommufd_cdev_attach_ioas_hwpt()
>>>    vfio/iommufd: Introduce auto domain creation
>>>    vfio/iommufd: Probe and request hwpt dirty tracking capability
>>>    vfio/iommufd: Implement VFIOIOMMUClass::set_dirty_tracking
>support
>>>    vfio/iommufd: Implement VFIOIOMMUClass::query_dirty_bitmap
>support
>>>    vfio/iommufd: Parse hw_caps and store dirty tracking support
>>>    vfio/migration: Don't block migration device dirty tracking is
>unsupported
>>>    vfio/common: Allow disabling device dirty page tracking
>>>
>>>   in

RE: [PATCH v3 10/10] vfio/common: Allow disabling device dirty page tracking

2024-07-10 Thread Duan, Zhenzhong




>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v3 10/10] vfio/common: Allow disabling device dirty page
>tracking
>
>The property 'x-pre-copy-dirty-page-tracking' allows disabling the whole
>tracking of VF pre-copy phase of dirty page tracking, though it means
>that it will only be used at the start of the switchover phase.
>
>Add an option that disables the VF dirty page tracking, and fall
>back into container-based dirty page tracking. This also allows to
>use IOMMU dirty tracking even on VFs with their own dirty
>tracker scheme.
>
>Signed-off-by: Joao Martins 
>---
> include/hw/vfio/vfio-common.h | 1 +
> hw/vfio/common.c  | 3 +++
> hw/vfio/migration.c   | 3 ++-
> hw/vfio/pci.c | 3 +++
> 4 files changed, 9 insertions(+), 1 deletion(-)
>
>diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>common.h
>index 7ce925cfab19..9db3fd31cfae 100644
>--- a/include/hw/vfio/vfio-common.h
>+++ b/include/hw/vfio/vfio-common.h
>@@ -137,6 +137,7 @@ typedef struct VFIODevice {
> VFIOMigration *migration;
> Error *migration_blocker;
> OnOffAuto pre_copy_dirty_page_tracking;
>+OnOffAuto device_dirty_page_tracking;
> bool dirty_pages_supported;
> bool dirty_tracking;
> HostIOMMUDevice *hiod;
>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>index 7cdb969fd396..eaa33d9ee037 100644
>--- a/hw/vfio/common.c
>+++ b/hw/vfio/common.c
>@@ -199,6 +199,9 @@ bool vfio_devices_all_device_dirty_tracking(const
>VFIOContainerBase *bcontainer)
> VFIODevice *vbasedev;
>
> QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
>+if (vbasedev->device_dirty_page_tracking == ON_OFF_AUTO_OFF) {
>+return false;

Maybe we can initialize vbasedev->dirty_pages_supported to false by checking 
vbasedev->device_dirty_page_tracking == ON_OFF_AUTO_OFF?
This way we can avoid extra check.

>+}
> if (!vbasedev->dirty_pages_supported) {
> return false;
> }
>diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>index 89195928666f..35d67332db20 100644
>--- a/hw/vfio/migration.c
>+++ b/hw/vfio/migration.c
>@@ -1037,7 +1037,8 @@ bool vfio_migration_realize(VFIODevice
>*vbasedev, Error **errp)
> return !vfio_block_migration(vbasedev, err, errp);
> }
>
>-if (!vbasedev->dirty_pages_supported &&
>+if ((!vbasedev->dirty_pages_supported ||
>+ vbasedev->device_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
> (vbasedev->iommufd &&
>  !hiodc->get_cap(vbasedev->hiod,
>  HOST_IOMMU_DEVICE_CAP_DIRTY_TRACKING, NULL))) {
>diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>index e03d9f3ba546..22819b2036b3 100644
>--- a/hw/vfio/pci.c
>+++ b/hw/vfio/pci.c
>@@ -3362,6 +3362,9 @@ static Property vfio_pci_dev_properties[] = {
> DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking",
>VFIOPCIDevice,
> vbasedev.pre_copy_dirty_page_tracking,
> ON_OFF_AUTO_ON),
>+DEFINE_PROP_ON_OFF_AUTO("x-device-dirty-page-tracking",
>VFIOPCIDevice,
>+vbasedev.device_dirty_page_tracking,
>+ON_OFF_AUTO_ON),
> DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
> display, ON_OFF_AUTO_OFF),
> DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
>--
>2.17.2

RE: [PATCH v3 09/10] vfio/migration: Don't block migration device dirty tracking is unsupported

2024-07-10 Thread Duan, Zhenzhong




>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v3 09/10] vfio/migration: Don't block migration device dirty
>tracking is unsupported
>
>By default VFIO migration is set to auto, which will support live
>migration if the migration capability is set *and* also dirty page
>tracking is supported.
>
>For testing purposes one can force enable without dirty page tracking
>via enable-migration=on, but that option is generally left for testing
>purposes.
>
>So starting with IOMMU dirty tracking it can use to accommodate the lack of
>VF dirty page tracking allowing us to minimize the VF requirements for
>migration and thus enabling migration by default for those.
>
>Signed-off-by: Joao Martins 
>---
> hw/vfio/migration.c | 6 +-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
>diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>index 34d4be2ce1b1..89195928666f 100644
>--- a/hw/vfio/migration.c
>+++ b/hw/vfio/migration.c
>@@ -1012,6 +1012,7 @@ void vfio_reset_bytes_transferred(void)
>  */
> bool vfio_migration_realize(VFIODevice *vbasedev, Error **errp)
> {
>+HostIOMMUDeviceClass *hiodc =
>HOST_IOMMU_DEVICE_GET_CLASS(vbasedev->hiod);
> Error *err = NULL;
> int ret;
>
>@@ -1036,7 +1037,10 @@ bool vfio_migration_realize(VFIODevice
>*vbasedev, Error **errp)
> return !vfio_block_migration(vbasedev, err, errp);
> }
>
>-if (!vbasedev->dirty_pages_supported) {
>+if (!vbasedev->dirty_pages_supported &&
>+(vbasedev->iommufd &&
>+ !hiodc->get_cap(vbasedev->hiod,
>+ HOST_IOMMU_DEVICE_CAP_DIRTY_TRACKING, NULL))) {

What about the change below? It avoids defining a new CAP.

--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -1036,7 +1036,7 @@ bool vfio_migration_realize(VFIODevice *vbasedev, Error **errp)
 return !vfio_block_migration(vbasedev, err, errp);
 }

-if (!vbasedev->dirty_pages_supported) {
+if (!vbasedev->dirty_pages_supported && !vbasedev->bcontainer->dirty_pages_supported) {
 if (vbasedev->enable_migration == ON_OFF_AUTO_AUTO) {
 error_setg(&err,
"%s: VFIO device doesn't support device dirty tracking",

Thanks
Zhenzhong
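
The condition can be sketched in isolation as follows. This is only a model
with stand-in structs (the Sketch* names are hypothetical); it assumes the
container flag already reflects whether the IOMMU backend can track dirty
pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the container-level capability flag. */
typedef struct {
    bool dirty_pages_supported;
} SketchContainer;

/* Stand-in for the device, pointing at its container. */
typedef struct {
    bool dirty_pages_supported;
    const SketchContainer *bcontainer;
} SketchMigDevice;

/*
 * Migration only needs to be blocked when there is no dirty tracker at all,
 * i.e. neither the device nor its container can track dirty pages.
 */
static bool sketch_lacks_dirty_tracking(const SketchMigDevice *dev)
{
    assert(dev != NULL && dev->bcontainer != NULL);
    return !dev->dirty_pages_supported &&
           !dev->bcontainer->dirty_pages_supported;
}
```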

> if (vbasedev->enable_migration == ON_OFF_AUTO_AUTO) {
> error_setg(&err,
>"%s: VFIO device doesn't support device dirty 
> tracking",
>--
>2.17.2

RE: [PATCH v3 01/10] vfio/iommufd: Don't fail to realize on IOMMU_GET_HW_INFO failure

2024-07-10 Thread Duan, Zhenzhong



>-Original Message-
>From: Joao Martins 
>Subject: Re: [PATCH v3 01/10] vfio/iommufd: Don't fail to realize on
>IOMMU_GET_HW_INFO failure
>
>On 10/07/2024 03:53, Duan, Zhenzhong wrote:
>>
>>
>>> -Original Message-
>>> From: Joao Martins 
>>> Subject: Re: [PATCH v3 01/10] vfio/iommufd: Don't fail to realize on
>>> IOMMU_GET_HW_INFO failure
>>>
>>> On 09/07/2024 12:45, Joao Martins wrote:
>>>> On 09/07/2024 09:56, Joao Martins wrote:
>>>>> On 09/07/2024 04:43, Duan, Zhenzhong wrote:
>>>>>> Hi Joao,
>>>>>>
>>>>>>> -Original Message-
>>>>>>> From: Joao Martins 
>>>>>>> Subject: [PATCH v3 01/10] vfio/iommufd: Don't fail to realize on
>>>>>>> IOMMU_GET_HW_INFO failure
>>>>>>>
>>>>>>> mdevs aren't "physical" devices and when asking for backing
>IOMMU
>>> info, it
>>>>>>> fails the entire provisioning of the guest. Fix that by filling caps 
>>>>>>> info
>>>>>>> when IOMMU_GET_HW_INFO succeeds plus discarding the error we
>>> would
>>>>>>> get into
>>>>>>> iommufd_backend_get_device_info().
>>>>>>>
>>>>>>> Cc: Zhenzhong Duan 
>>>>>>> Fixes: 930589520128 ("vfio/iommufd: Implement
>>>>>>> HostIOMMUDeviceClass::realize() handler")
>>>>>>> Signed-off-by: Joao Martins 
>>>>>>> ---
>>>>>>> hw/vfio/iommufd.c | 12 +---
>>>>>>> 1 file changed, 5 insertions(+), 7 deletions(-)
>>>>>>>
>>>>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>>>>> index c2f158e60386..a4d23f488b01 100644
>>>>>>> --- a/hw/vfio/iommufd.c
>>>>>>> +++ b/hw/vfio/iommufd.c
>>>>>>> @@ -631,15 +631,13 @@ static bool
>>>>>>> hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void
>*opaque,
>>>>>>>
>>>>>>> hiod->agent = opaque;
>>>>>>>
>>>>>>> -if (!iommufd_backend_get_device_info(vdev->iommufd, vdev-
>>>> devid,
>>>>>>> - &type, &data, sizeof(data), 
>>>>>>> errp)) {
>>>>>>> -return false;
>>>>>>> +if (iommufd_backend_get_device_info(vdev->iommufd, vdev-
>>>> devid,
>>>>>>> + &type, &data, sizeof(data), 
>>>>>>> NULL)) {
>>>>>>
>>>>>> This will make us miss the real error. What about bypassing host
>>> IOMMU device
>>>>>> creation for mdev as it's not "physical device", passing corresponding
>>> host IOMMU
>>>>>> device to vIOMMU make no sense.
>>>>>
>>>>> Yeap -- This was my second alternative.
>>>>>
>>>>> I can add an helper for vfio_is_mdev()) and just call
>>>>> iommufd_backend_get_device_info() if !vfio_is_mdev().  I am
>assuming
>>> you meant
>>>>> to skip the initialization of HostIOMMUDeviceCaps::caps as I think that
>>>>> initializing hiod still makes sense as we are still using a
>>>>> TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO somewhat?
>>>>>
>>>> Something like this is what I've done with this patch, see below. I think 
>>>> it
>>>> matches what you suggested? Naturally there's a precedent patch that
>>> introduces
>>>> vfio_is_mdev().
>>>>
>>>
>>> Sorry ignore the previous snip, it was the wrong version, see below
>instead.
>>>
>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>> index c2f158e60386..987dd9779f94 100644
>>> --- a/hw/vfio/iommufd.c
>>> +++ b/hw/vfio/iommufd.c
>>> @@ -631,6 +631,10 @@ static bool
>>> hiod_iommufd_vfio_realize(HostIOMMUDevice
>>> *hiod, void *opaque,
>>>
>>> hiod->agent = opaque;
>>>
>>> +if (vfio_is_mdev(vdev)) {
>>> +return true;
>>> +}
>>> +
>>
>> Not necessary to create a dummy object.
>> What about bypassing object_new(ops->hiod_typename) in
>vfio_attach_device()?
>>
>Not sure I am parsing this. What dummy object you refer to here if it's not
>vbasedev::hiod that remains unused? Also in a suggestion by Cedric, and
>pre-seeding vbasedev::hiod during attach_device()[0]. So I will sort of do that
>already, but your comments means we are allocating a dummy object
>anyways too?

Yes, with the change in your snippet, it is allocated by
object_new(ops->hiod_typename) but never realized and never used elsewhere.

>
>Or are you perhaps suggesting something like:
>
>@@ -1552,17 +1552,20 @@ bool vfio_attach_device(char *name,
>VFIODevice *vbasedev,
>
> assert(ops);
>
> if (!ops->attach_device(name, vbasedev, as, errp)) {
> return false;
> }
>
> if (!vfio_mdev(vbasedev) &&
>!HOST_IOMMU_DEVICE_GET_CLASS(hiod)->realize(hiod, vbasedev,
>errp)) {
>
>?

I mean bypassing the host IOMMU device entirely for mdev, like:

--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1548,6 +1548,10 @@ bool vfio_attach_device(char *name, VFIODevice *vbasedev,
 return false;
 }

+if (vfio_is_mdev(vbasedev)) {
+return true;
+}
+
 hiod = HOST_IOMMU_DEVICE(object_new(ops->hiod_typename));
 if (!HOST_IOMMU_DEVICE_GET_CLASS(hiod)->realize(hiod, vbasedev, errp)) {
 object_unref(hiod);
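
To make the intent concrete, here is a self-contained model of that flow. It
uses plain calloc()/free() and stand-in structs instead of QOM objects, and
every name in it is hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for the host IOMMU device object. */
typedef struct {
    int realized;
} SketchHiod;

/* Stand-in for the VFIO device state relevant here. */
typedef struct {
    bool is_mdev;
    SketchHiod *hiod;
} SketchVfioDevice;

/* Attach: an mdev returns early, so no host IOMMU device is ever allocated. */
static bool sketch_attach_device(SketchVfioDevice *dev)
{
    assert(dev != NULL);
    dev->hiod = NULL;
    if (dev->is_mdev) {
        return true;
    }
    dev->hiod = calloc(1, sizeof(*dev->hiod));
    if (!dev->hiod) {
        return false;
    }
    dev->hiod->realized = 1;
    return true;
}

/* Detach: free(NULL) is a no-op, so the mdev case needs no special handling. */
static void sketch_detach_device(SketchVfioDevice *dev)
{
    assert(dev != NULL);
    free(dev->hiod);
    dev->hiod = NULL;
}
```

Callers that later consume the object would then have to tolerate it being
absent for mdevs.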


>
>
>[0]
>https://lore.kernel.org/qemu-devel/4e85db04-fbaa-4a6b-b133-
>59170c471...@oracle.com/

RE: [PATCH v3 01/10] vfio/iommufd: Don't fail to realize on IOMMU_GET_HW_INFO failure

2024-07-09 Thread Duan, Zhenzhong



>-Original Message-
>From: Joao Martins 
>Subject: Re: [PATCH v3 01/10] vfio/iommufd: Don't fail to realize on
>IOMMU_GET_HW_INFO failure
>
>On 09/07/2024 12:45, Joao Martins wrote:
>> On 09/07/2024 09:56, Joao Martins wrote:
>>> On 09/07/2024 04:43, Duan, Zhenzhong wrote:
>>>> Hi Joao,
>>>>
>>>>> -Original Message-
>>>>> From: Joao Martins 
>>>>> Subject: [PATCH v3 01/10] vfio/iommufd: Don't fail to realize on
>>>>> IOMMU_GET_HW_INFO failure
>>>>>
>>>>> mdevs aren't "physical" devices and when asking for backing IOMMU
>info, it
>>>>> fails the entire provisioning of the guest. Fix that by filling caps info
>>>>> when IOMMU_GET_HW_INFO succeeds plus discarding the error we
>would
>>>>> get into
>>>>> iommufd_backend_get_device_info().
>>>>>
>>>>> Cc: Zhenzhong Duan 
>>>>> Fixes: 930589520128 ("vfio/iommufd: Implement
>>>>> HostIOMMUDeviceClass::realize() handler")
>>>>> Signed-off-by: Joao Martins 
>>>>> ---
>>>>> hw/vfio/iommufd.c | 12 +---
>>>>> 1 file changed, 5 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>>> index c2f158e60386..a4d23f488b01 100644
>>>>> --- a/hw/vfio/iommufd.c
>>>>> +++ b/hw/vfio/iommufd.c
>>>>> @@ -631,15 +631,13 @@ static bool
>>>>> hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
>>>>>
>>>>> hiod->agent = opaque;
>>>>>
>>>>> -if (!iommufd_backend_get_device_info(vdev->iommufd, vdev-
>>devid,
>>>>> - &type, &data, sizeof(data), 
>>>>> errp)) {
>>>>> -return false;
>>>>> +if (iommufd_backend_get_device_info(vdev->iommufd, vdev-
>>devid,
>>>>> + &type, &data, sizeof(data), 
>>>>> NULL)) {
>>>>
>>>> This will make us miss the real error. What about bypassing host
>IOMMU device
>>>> creation for mdev as it's not "physical device", passing corresponding
>host IOMMU
>>>> device to vIOMMU make no sense.
>>>
>>> Yeap -- This was my second alternative.
>>>
>>> I can add an helper for vfio_is_mdev()) and just call
>>> iommufd_backend_get_device_info() if !vfio_is_mdev().  I am assuming
>you meant
>>> to skip the initialization of HostIOMMUDeviceCaps::caps as I think that
>>> initializing hiod still makes sense as we are still using a
>>> TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO somewhat?
>>>
>> Something like this is what I've done with this patch, see below. I think it
>> matches what you suggested? Naturally there's a precedent patch that
>introduces
>> vfio_is_mdev().
>>
>
>Sorry ignore the previous snip, it was the wrong version, see below instead.
>
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index c2f158e60386..987dd9779f94 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -631,6 +631,10 @@ static bool
>hiod_iommufd_vfio_realize(HostIOMMUDevice
>*hiod, void *opaque,
>
> hiod->agent = opaque;
>
>+if (vfio_is_mdev(vdev)) {
>+return true;
>+}
>+

It isn't necessary to create a dummy object.
What about bypassing object_new(ops->hiod_typename) in vfio_attach_device()?

> if (!iommufd_backend_get_device_info(vdev->iommufd, vdev->devid,
>  &type, &data, sizeof(data), errp)) {
> return false;
>diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>index d95aa6b65788..f092c1537999 100644
>--- a/hw/vfio/pci.c
>+++ b/hw/vfio/pci.c
>@@ -3115,7 +3115,7 @@ static void vfio_realize(PCIDevice *pdev, Error
>**errp)
>
> vfio_bars_register(vdev);
>
>-if (!pci_device_set_iommu_device(pdev, vbasedev->hiod, errp)) {
>+if (!is_mdev && !pci_device_set_iommu_device(pdev, vbasedev->hiod,
>errp)) {

Yes.

> error_prepend(errp, "Failed to set iommu_device: ");
> goto out_teardown;
> }
>@@ -3238,7 +3238,9 @@ out_deregister:
> timer_free(vdev->intx.mmap_timer);
> }
> out_unset_idev:
>-pci_device_unset_iommu_device(pdev);
>+if (!is_mdev) {
>+pci_device_unset_iommu_device(pdev);
>+}
> out_teardown:
> vfio_teardown_msi(vdev);
> vfio_bars_exit(vdev);
>@@ -3268,6 +3270,7 @@ static void vfio_exitfn(PCIDevice *pdev)
> {
> VFIOPCIDevice *vdev = VFIO_PCI(pdev);
> VFIODevice *vbasedev = &vdev->vbasedev;
>+bool is_mdev = vfio_is_mdev(vbasedev);
>
> vfio_unregister_req_notifier(vdev);
> vfio_unregister_err_notifier(vdev);
>@@ -3283,7 +3286,9 @@ static void vfio_exitfn(PCIDevice *pdev)
> vfio_pci_disable_rp_atomics(vdev);
> vfio_bars_exit(vdev);
> vfio_migration_exit(vbasedev);
>-pci_device_unset_iommu_device(pdev);
>+if (!is_mdev) {
>+pci_device_unset_iommu_device(pdev);
>+}

Yes.

Thanks
Zhenzhong
> }

RE: [PATCH RFCv1 02/10] hw/arm/virt: Add iommufd link to virt-machine

2024-07-09 Thread Duan, Zhenzhong

Hi Nicolin,

>-Original Message-
>From: Nicolin Chen 
>Subject: Re: [PATCH RFCv1 02/10] hw/arm/virt: Add iommufd link to virt-
>machine
>
>On Tue, Jul 09, 2024 at 07:06:50PM +0200, Eric Auger wrote:
>> On 7/9/24 18:59, Nicolin Chen wrote:
>> > Hi Eric,
>> >
>> > Thanks for the comments!
>> >
>> > On Tue, Jul 09, 2024 at 11:11:56AM +0200, Eric Auger wrote:
>> >> On 6/26/24 02:28, Nicolin Chen wrote:
>> >>> A nested SMMU must use iommufd ioctls to communicate with the
>host-level
>> >>> SMMU instance for 2-stage translation support. Add an iommufd link
>to the
>> >>> ARM virt-machine, allowing QEMU command to pass in an iommufd
>object.
>> >> If I am not wrong vfio devices are allowed to use different iommufd's
>> >> (although there is no real benefice). So this command line wouldn't
>> >> match with that option.
>> > I think Jason's remarks highlighted that FD should be one per VM:
>> > https://lore.kernel.org/qemu-
>devel/20240503141024.ge3341...@nvidia.com/
>> OK I thought this was still envisioned although not really meaningful.
>> By the way, please add Yi and Zhenzhong in cc since the problematics
>> are connected I think.
>
>Yea.
>
>Yi/Zhenzhong, would you please shed some light on forwarding an
>iommufd handler to the intel_iommu code? IIRC, we did that at the
>beginning but removed it later?

The IOMMUFD/devid/ioas handler is packaged in HostIOMMUDeviceIOMMUFD and passed
to the vIOMMU, see
https://github.com/yiliu1765/qemu/commit/02892a5b452382866e804c3db3bb392c8f8f500f

The whole nesting series is at
https://github.com/yiliu1765/qemu/commits/zhenzhong/iommufd_nesting_rfcv2/

In the iommufd cdev series we gave the user the flexibility to use different
iommufd backends in one VM. We want to stay backward compatible in the nesting
series, and the code change to support that is trivial.

Thanks
Zhenzhong

>
>> >> Also while reading the commit msg it is not clear with the iommufd is
>> >> needed in the machine whereas the vfio iommufd BE generally calls
>those
>> >> ioctls.
>> > I think I forgot to revisit it. Both intel_iommu and smmu-common
>> > used to call iommufd_backend_connect() for counting, so there was
>> > a need to pass in the same iommufd handler to the viommu driver.
>> > For SMMU, since it is created in the virt code, we had to pass in
>> > with this patch.
>> >
>> > That being said, it looks like intel_iommu had removed that. So,
>> > likely we don't need an extra user counting for SMMU too.
>> OK at least it deserves some explanation about the "why"
>
>Yes, I agree that the commit message isn't good enough.
>
>Thanks
>Nicolin

RE: [PATCH v3 04/10] vfio/iommufd: Introduce auto domain creation

2024-07-08 Thread Duan, Zhenzhong




>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v3 04/10] vfio/iommufd: Introduce auto domain creation
>
>There's generally two modes of operation for IOMMUFD:
>
>* The simple user API which intends to perform relatively simple things
>with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to VFIO
>and mainly performs IOAS_MAP and UNMAP.
>
>* The native IOMMUFD API where you have fine grained control of the
>IOMMU domain and model it accordingly. This is where most new feature
>are being steered to.
>
>For dirty tracking 2) is required, as it needs to ensure that
>the stage-2/parent IOMMU domain will only attach devices
>that support dirty tracking (so far it is all homogeneous in x86, likely
>not the case for smmuv3). Such invariant on dirty tracking provides a
>useful guarantee to VMMs that will refuse incompatible device
>attachments for IOMMU domains.
>
>Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>responsible for creating an IOMMU domain. This is contrast to the
>'simple API' where the IOMMU domain is created by IOMMUFD
>automatically
>when it attaches to VFIO (usually referred as autodomains) but it has
>the needed handling for mdevs.
>
>To support dirty tracking with the advanced IOMMUFD API, it needs
>similar logic, where IOMMU domains are created and devices attached to
>compatible domains. Essentially mimicking kernel
>iommufd_device_auto_get_domain(). If this fails (i.e. mdevs) it falls back
>to IOAS attach (which again is always the case for mdevs).
>
>The auto domain logic allows different IOMMU domains to be created when
>DMA dirty tracking is not desired (and VF can provide it), and others where
>it is. Here is not used in this way here given how VFIODevice migration
>state is initialized after the device attachment. But such mixed mode of
>IOMMU dirty tracking + device dirty tracking is an improvement that can
>be added on. Keep the 'all of nothing' approach that we have been using
>so far between container vs device dirty tracking.
>
>Signed-off-by: Joao Martins 
>---
> include/hw/vfio/vfio-common.h |  9 
> include/sysemu/iommufd.h  |  4 ++
> backends/iommufd.c| 29 +++
> hw/vfio/iommufd.c | 91
>+++
> backends/trace-events |  1 +
> 5 files changed, 134 insertions(+)
>
>diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>common.h
>index e8ddf92bb185..82c5a4aaa61e 100644
>--- a/include/hw/vfio/vfio-common.h
>+++ b/include/hw/vfio/vfio-common.h
>@@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>
> typedef struct IOMMUFDBackend IOMMUFDBackend;
>
>+typedef struct VFIOIOASHwpt {
>+uint32_t hwpt_id;
>+QLIST_HEAD(, VFIODevice) device_list;
>+QLIST_ENTRY(VFIOIOASHwpt) next;
>+} VFIOIOASHwpt;
>+
> typedef struct VFIOIOMMUFDContainer {
> VFIOContainerBase bcontainer;
> IOMMUFDBackend *be;
> uint32_t ioas_id;
>+QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
> } VFIOIOMMUFDContainer;
>
> OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer,
>VFIO_IOMMU_IOMMUFD);
>@@ -134,6 +141,8 @@ typedef struct VFIODevice {
> HostIOMMUDevice *hiod;
> int devid;
> IOMMUFDBackend *iommufd;
>+VFIOIOASHwpt *hwpt;
>+QLIST_ENTRY(VFIODevice) hwpt_next;
> } VFIODevice;
>
> struct VFIODeviceOps {
>diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>index 57d502a1c79a..35a8cec9780f 100644
>--- a/include/sysemu/iommufd.h
>+++ b/include/sysemu/iommufd.h
>@@ -50,6 +50,10 @@ int
>iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
> bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t
>devid,
>  uint32_t *type, void *data, uint32_t len,
>  uint64_t *caps, Error **errp);
>+int iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
>+   uint32_t pt_id, uint32_t flags,
>+   uint32_t data_type, uint32_t data_len,
>+   void *data_ptr, uint32_t *out_hwpt);
>
> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>TYPE_HOST_IOMMU_DEVICE "-iommufd"
> #endif
>diff --git a/backends/iommufd.c b/backends/iommufd.c
>index 2b3d51af26d2..f5f73eaf4a1a 100644
>--- a/backends/iommufd.c
>+++ b/backends/iommufd.c
>@@ -208,6 +208,35 @@ int
>iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
> return ret;
> }
>
>+int iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
>+   uint32_t pt_id, uint32_t flags,
>+   uint32_t data_type, uint32_t data_len,
>+   void *data_ptr, uint32_t *out_hwpt)
>+{
>+int ret, fd = be->fd;
>+struct iommu_hwpt_alloc alloc_hwpt = {
>+.size = sizeof(struct iommu_hwpt_alloc),
>+.flags = flags,
>+.dev_id = dev_id,
>+.pt_id = pt_id,
>+.data_type = data_type,
>+.data_len = data_len,
>+.data_upt

RE: [PATCH v3 02/10] backends/iommufd: Extend iommufd_backend_get_device_info() to fetch HW capabilities

2024-07-08 Thread Duan, Zhenzhong




>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v3 02/10] backends/iommufd: Extend
>iommufd_backend_get_device_info() to fetch HW capabilities
>
>The helper will be able to fetch vendor agnostic IOMMU capabilities
>supported both by hardware and software. Right now it is only iommu dirty
>tracking.
>
>Signed-off-by: Joao Martins 

Reviewed-by: Zhenzhong Duan 

Thanks
Zhenzhong

>---
> include/sysemu/iommufd.h | 2 +-
> backends/iommufd.c   | 4 +++-
> hw/vfio/iommufd.c| 4 +++-
> 3 files changed, 7 insertions(+), 3 deletions(-)
>
>diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>index 9edfec604595..57d502a1c79a 100644
>--- a/include/sysemu/iommufd.h
>+++ b/include/sysemu/iommufd.h
>@@ -49,7 +49,7 @@ int
>iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>   hwaddr iova, ram_addr_t size);
> bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t
>devid,
>  uint32_t *type, void *data, uint32_t len,
>- Error **errp);
>+ uint64_t *caps, Error **errp);
>
> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>TYPE_HOST_IOMMU_DEVICE "-iommufd"
> #endif
>diff --git a/backends/iommufd.c b/backends/iommufd.c
>index 84fefbc9ee7a..2b3d51af26d2 100644
>--- a/backends/iommufd.c
>+++ b/backends/iommufd.c
>@@ -210,7 +210,7 @@ int
>iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>
> bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t
>devid,
>  uint32_t *type, void *data, uint32_t len,
>- Error **errp)
>+ uint64_t *caps, Error **errp)
> {
> struct iommu_hw_info info = {
> .size = sizeof(info),
>@@ -226,6 +226,8 @@ bool
>iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
>
> g_assert(type);
> *type = info.out_data_type;
>+g_assert(caps);
>+*caps = info.out_capabilities;
>
> return true;
> }
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index a4d23f488b01..9cee71659b1c 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -628,11 +628,13 @@ static bool
>hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
> union {
> struct iommu_hw_info_vtd vtd;
> } data;
>+uint64_t hw_caps;
>
> hiod->agent = opaque;
>
> if (iommufd_backend_get_device_info(vdev->iommufd, vdev->devid,
>- &type, &data, sizeof(data), NULL)) {
>+&type, &data, sizeof(data),
>+&hw_caps, NULL)) {
> hiod->name = g_strdup(vdev->name);
> caps->type = type;
> caps->aw_bits = vfio_device_get_aw_bits(vdev);
>--
>2.17.2

RE: [PATCH v3 01/10] vfio/iommufd: Don't fail to realize on IOMMU_GET_HW_INFO failure

2024-07-08 Thread Duan, Zhenzhong

Hi Joao,

>-Original Message-
>From: Joao Martins 
>Subject: [PATCH v3 01/10] vfio/iommufd: Don't fail to realize on
>IOMMU_GET_HW_INFO failure
>
>mdevs aren't "physical" devices and when asking for backing IOMMU info, it
>fails the entire provisioning of the guest. Fix that by filling caps info
>when IOMMU_GET_HW_INFO succeeds plus discarding the error we would
>get into
>iommufd_backend_get_device_info().
>
>Cc: Zhenzhong Duan 
>Fixes: 930589520128 ("vfio/iommufd: Implement
>HostIOMMUDeviceClass::realize() handler")
>Signed-off-by: Joao Martins 
>---
> hw/vfio/iommufd.c | 12 +---
> 1 file changed, 5 insertions(+), 7 deletions(-)
>
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index c2f158e60386..a4d23f488b01 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -631,15 +631,13 @@ static bool
>hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
>
> hiod->agent = opaque;
>
>-if (!iommufd_backend_get_device_info(vdev->iommufd, vdev->devid,
>- &type, &data, sizeof(data), errp)) {
>-return false;
>+if (iommufd_backend_get_device_info(vdev->iommufd, vdev->devid,
>+ &type, &data, sizeof(data), NULL)) {

This will make us miss the real error. What about bypassing host IOMMU device
creation for mdev? It's not a "physical device", so passing a corresponding
host IOMMU device to the vIOMMU makes no sense.

Thanks
Zhenzhong

>+hiod->name = g_strdup(vdev->name);
>+caps->type = type;
>+caps->aw_bits = vfio_device_get_aw_bits(vdev);
> }
>
>-hiod->name = g_strdup(vdev->name);
>-caps->type = type;
>-caps->aw_bits = vfio_device_get_aw_bits(vdev);
>-
> return true;
> }
>
>--
>2.17.2

RE: [PATCH v3 3/3] intel_iommu: make types match

2024-07-05 Thread Duan, Zhenzhong



>-Original Message-
>From: CLEMENT MATHIEU--DRIF 
>Subject: [PATCH v3 3/3] intel_iommu: make types match
>
>From: Clément Mathieu--Drif 
>
>The 'level' field in vtd_iotlb_key is an unsigned integer.
>We don't need to store level as an int in vtd_lookup_iotlb.
>
>Signed-off-by: Clément Mathieu--Drif 
>---
> hw/i386/intel_iommu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>index 37c21a0aec..be0cb39b5c 100644
>--- a/hw/i386/intel_iommu.c
>+++ b/hw/i386/intel_iommu.c
>@@ -358,7 +358,7 @@ static VTDIOTLBEntry
>*vtd_lookup_iotlb(IntelIOMMUState *s, uint16_t source_id,
> {
> struct vtd_iotlb_key key;
> VTDIOTLBEntry *entry;
>-int level;
>+unsigned level;

Would it cause any issue if int is used?

>
> for (level = VTD_SL_PT_LEVEL; level < VTD_SL_PML4_LEVEL; level++) {
> key.gfn = vtd_get_iotlb_gfn(addr, level);
>--
>2.45.2

RE: [PATCH v3 1/3] intel_iommu: fix FRCD construction macro.

2024-07-05 Thread Duan, Zhenzhong



>-Original Message-
>From: CLEMENT MATHIEU--DRIF 
>Subject: [PATCH v3 1/3] intel_iommu: fix FRCD construction macro.
>
>From: Clément Mathieu--Drif 
>
>The constant must be unsigned, otherwise the two's complement
>overrides the other fields when a PASID is present.
>
>Fixes: 1b2b12376c8a ("intel-iommu: PASID support")
>
>Signed-off-by: Clément Mathieu--Drif 
>Reviewed-by: Yi Liu 
>---
> hw/i386/intel_iommu_internal.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>index f8cf99bddf..cbc4030031 100644
>--- a/hw/i386/intel_iommu_internal.h
>+++ b/hw/i386/intel_iommu_internal.h
>@@ -267,7 +267,7 @@
> /* For the low 64-bit of 128-bit */
> #define VTD_FRCD_FI(val)((val) & ~0xfffULL)
> #define VTD_FRCD_PV(val)(((val) & 0xULL) << 40)
>-#define VTD_FRCD_PP(val)(((val) & 0x1) << 31)
>+#define VTD_FRCD_PP(val)(((val) & 0x1ULL) << 31)

Reviewed-by: Zhenzhong Duan 

VTD_FRCD_PV and VTD_FRCD_PP are macros for the high 64 bits.
While at it, maybe we can move them under:

/* For the high 64-bit of 128-bit */

Thanks
Zhenzhong
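
For reference, the two's complement effect described in the commit message can
be reproduced in isolation. This is a sketch only: it assumes the macro
argument is a plain int on the caller side, and it models the pre-patch result
with INT32_MIN (bit 31 set) because shifting a 1 into the sign bit of an int is
not well defined in C. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pre-patch model: a signed 32-bit value with only bit 31 set. */
static uint64_t frcd_pp_signed_model(void)
{
    int32_t pp = INT32_MIN;
    /* Widening to 64 bits sign-extends, setting bits 63..32 as well. */
    return (uint64_t)pp;
}

/* Post-patch form: the constant is 64-bit unsigned before the shift. */
static uint64_t frcd_pp_unsigned(uint64_t val)
{
    assert(val <= 1);
    return (val & 0x1ULL) << 31;
}
```

Once such a value is OR-ed into the 64-bit register word, the stray upper bits
land on the neighbouring fields, which is the overwrite the patch fixes.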

> #define VTD_FRCD_IR_IDX(val)(((val) & 0xULL) << 48)
>
> /* DMA Remapping Fault Conditions */
>--
>2.45.2

RE: [PATCH v3 2/3] intel_iommu: fix type of the mask field in VTDIOTLBPageInvInfo

2024-07-05 Thread Duan, Zhenzhong



>-Original Message-
>From: CLEMENT MATHIEU--DRIF 
>Subject: [PATCH v3 2/3] intel_iommu: fix type of the mask field in
>VTDIOTLBPageInvInfo
>
>From: Clément Mathieu--Drif 
>
>VTDIOTLBPageInvInfo.mask might not fit in an uint8_t.
>Moreover, this field is used in binary operations with 64-bit addresses.
>
>Signed-off-by: Clément Mathieu--Drif 

Reviewed-by: Zhenzhong Duan 

Thanks
Zhenzhong

>---
> hw/i386/intel_iommu_internal.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>index cbc4030031..5fcbe2744f 100644
>--- a/hw/i386/intel_iommu_internal.h
>+++ b/hw/i386/intel_iommu_internal.h
>@@ -436,7 +436,7 @@ struct VTDIOTLBPageInvInfo {
> uint16_t domain_id;
> uint32_t pasid;
> uint64_t addr;
>-uint8_t mask;
>+uint64_t mask;
> };
> typedef struct VTDIOTLBPageInvInfo VTDIOTLBPageInvInfo;
>
>--
>2.45.2

RE: [PATCH v2 5/7] virtio-iommu : Retrieve page size mask on virtio_iommu_set_iommu_device()

2024-07-01 Thread Duan, Zhenzhong




>-Original Message-
>From: Eric Auger 
>Subject: [PATCH v2 5/7] virtio-iommu : Retrieve page size mask on
>virtio_iommu_set_iommu_device()
>
>Retrieve the Host IOMMU Device page size mask when this latter is set.
>This allows to get the information much sooner than when relying on
>IOMMU MR set_page_size_mask() call, whcih happens when the IOMMU
s/whcih/which
Sorry, I missed it in last review.
>MR
>gets enabled. We introduce check_page_size_mask() helper whose code
>is inherited from current virtio_iommu_set_page_size_mask()
>implementation. This callback will be removed in a subsequent patch.
>
>Signed-off-by: Eric Auger 

Otherwise,
Reviewed-by: Zhenzhong Duan 

Thanks
Zhenzhong
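
For readers following along, the check in this patch boils down to the logic
below. This is a standalone model, not the QEMU code: error reporting is
dropped, ctz64() is replaced by a portable loop, and the sketch_* names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Portable count-trailing-zeros for a non-zero 64-bit value. */
static unsigned sketch_ctz64(uint64_t v)
{
    unsigned n = 0;

    assert(v != 0);
    while (!(v & 1)) {
        v >>= 1;
        n++;
    }
    return n;
}

/*
 * Accept new_mask only if it shares a page size with cur_mask and, once the
 * granule is frozen, only if it still contains the current granule.
 */
static bool sketch_check_page_size_mask(uint64_t cur_mask, uint64_t new_mask,
                                        bool granule_frozen)
{
    if ((cur_mask & new_mask) == 0) {
        return false;
    }
    if (granule_frozen) {
        uint64_t granule = 1ULL << sketch_ctz64(cur_mask);

        if (!(granule & new_mask)) {
            return false;
        }
    }
    return true;
}
```

With a 4k default mask (~0xfff), a host device that only supports 64k pages is
accepted before the granule is frozen and rejected afterwards.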

>
>---
>
>v1 -> v2:
>- do not update the mask if the granule is frozen (Zhenzhong)
>---
> hw/virtio/virtio-iommu.c | 57
>++--
> hw/virtio/trace-events   |  1 +
> 2 files changed, 56 insertions(+), 2 deletions(-)
>
>diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
>index b8f75d2b1a..7d5db554af 100644
>--- a/hw/virtio/virtio-iommu.c
>+++ b/hw/virtio/virtio-iommu.c
>@@ -598,9 +598,39 @@ out:
> return ret;
> }
>
>+static bool check_page_size_mask(VirtIOIOMMU *viommu, uint64_t
>new_mask,
>+ Error **errp)
>+{
>+uint64_t cur_mask = viommu->config.page_size_mask;
>+
>+if ((cur_mask & new_mask) == 0) {
>+error_setg(errp, "virtio-iommu reports a page size mask 0x%"PRIx64
>+   " incompatible with currently supported mask 0x%"PRIx64,
>+   new_mask, cur_mask);
>+return false;
>+}
>+/*
>+ * Once the granule is frozen we can't change the mask anymore. If by
>+ * chance the hotplugged device supports the same granule, we can still
>+ * accept it.
>+ */
>+if (viommu->granule_frozen) {
>+int cur_granule = ctz64(cur_mask);
>+
>+if (!(BIT_ULL(cur_granule) & new_mask)) {
>+error_setg(errp,
>+   "virtio-iommu does not support frozen granule 0x%llx",
>+   BIT_ULL(cur_granule));
>+return false;
>+}
>+}
>+return true;
>+}
>+
> static bool virtio_iommu_set_iommu_device(PCIBus *bus, void *opaque,
>int devfn,
>   HostIOMMUDevice *hiod, Error **errp)
> {
>+ERRP_GUARD();
> VirtIOIOMMU *viommu = opaque;
> HostIOMMUDeviceClass *hiodc =
>HOST_IOMMU_DEVICE_GET_CLASS(hiod);
> struct hiod_key *new_key;
>@@ -623,8 +653,28 @@ static bool
>virtio_iommu_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
> hiod->aliased_devfn,
> host_iova_ranges, errp);
> if (ret) {
>-g_list_free_full(host_iova_ranges, g_free);
>-return false;
>+goto error;
>+}
>+}
>+if (hiodc->get_page_size_mask) {
>+uint64_t new_mask = hiodc->get_page_size_mask(hiod);
>+
>+if (check_page_size_mask(viommu, new_mask, errp)) {
>+/*
>+ * The default mask depends on the "granule" property. For 
>example,
>+ * with 4k granule, it is -(4 * KiB). When an assigned device has
>+ * page size restrictions due to the hardware IOMMU configuration,
>+ * apply this restriction to the mask.
>+ */
>+trace_virtio_iommu_update_page_size_mask(hiod->name,
>+ 
>viommu->config.page_size_mask,
>+ new_mask);
>+if (!viommu->granule_frozen) {
>+viommu->config.page_size_mask &= new_mask;
>+}
>+} else {
>+error_prepend(errp, "%s: ", hiod->name);
>+goto error;
> }
> }
>
>@@ -637,6 +687,9 @@ static bool
>virtio_iommu_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
> g_list_free_full(host_iova_ranges, g_free);
>
> return true;
>+error:
>+g_list_free_full(host_iova_ranges, g_free);
>+return false;
> }
>
> static void
>diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
>index 3cf84e04a7..599d855ff6 100644
>--- a/hw/virtio/trace-events
>+++ b/hw/virtio/trace-events
>@@ -132,6 +132,7 @@ virtio_iommu_notify_map(const char *name,
>uint64_t virt_start, uint64_t virt_end
> virtio_iommu_notify_unmap(const char *name, uint64_t virt_start,
>uint64_t virt_end) "mr=%s virt_start=0x%"PRIx64" virt_end=0x%"PRIx64
> virtio_iommu_remap(const char *name, uint64_t virt_start, uint64_t
>virt_end, uint64_t phys_start) "mr=%s virt_start=0x%"PRIx64"
>virt_end=0x%"PRIx64" phys_start=0x%"PRIx64
> virtio_iommu_set_page_size_mask(const char *name, uint64_t old,
>uint64_t new) "mr=%s old_mask=0x%"PRIx64" new_mask=0x%"PRIx64
>+virtio_iommu_update_page_size_mask(const char *name, uint64_t old,
>uint64_t new) "host iommu device=%s old_mask=0x%"PRIx64"
>new_mask=0x

Re: [PATCH 1/2] vfio/display: Fix potential memleak of edid info

2024-06-30 Thread Duan, Zhenzhong


Hi,

On 6/29/2024 8:15 PM, Marc-André Lureau wrote:

Hi

On Fri, Jun 28, 2024 at 1:32 PM Zhenzhong Duan 
 wrote:


EDID related device region info is leaked in three paths:
1. In vfio_get_dev_region_info(), when edid info isn't found, the last
device region info is leaked.
2. In vfio_display_edid_init() error path, edid info is leaked.
3. In VFIODisplay destroying path, edid info is leaked.

Fixes: 08479114b0de ("vfio/display: add edid support.")
Signed-off-by: Zhenzhong Duan 
---
 hw/vfio/display.c | 2 ++
 hw/vfio/helpers.c | 1 +
 2 files changed, 3 insertions(+)

diff --git a/hw/vfio/display.c b/hw/vfio/display.c
index 661e921616..5926bd6628 100644
--- a/hw/vfio/display.c
+++ b/hw/vfio/display.c
@@ -171,6 +171,7 @@ static void
vfio_display_edid_init(VFIOPCIDevice *vdev)

 err:
     trace_vfio_display_edid_write_error();
+    g_free(dpy->edid_info);


It would be better to set it to NULL.

Will do.


     g_free(dpy->edid_regs);
     dpy->edid_regs = NULL;
     return;
@@ -182,6 +183,7 @@ static void vfio_display_edid_exit(VFIODisplay
*dpy)
         return;
     }

+    g_free(dpy->edid_info);
     g_free(dpy->edid_regs);
     g_free(dpy->edid_blob);
     timer_free(dpy->edid_link_timer);
diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
index b14edd46ed..3dd32b26a4 100644
--- a/hw/vfio/helpers.c
+++ b/hw/vfio/helpers.c
@@ -586,6 +586,7 @@ int vfio_get_dev_region_info(VFIODevice
*vbasedev, uint32_t type,
         g_free(*info);
     }

+    g_free(*info);


This seems incorrect: it is already freed at the end of the loop above if the 
function didn't return.


Good catch! Will remove it.

Thanks

Zhenzhong



     *info = NULL;
     return -ENODEV;
 }
-- 
2.34.1





--
Marc-André Lureau

RE: [PATCH 5/7] virtio-iommu : Retrieve page size mask on virtio_iommu_set_iommu_device()

2024-06-27 Thread Duan, Zhenzhong

Hi Eric,

>-Original Message-
>From: Eric Auger 
>Subject: [PATCH 5/7] virtio-iommu : Retrieve page size mask on
>virtio_iommu_set_iommu_device()
>
>Retrieve the Host IOMMU Device page size mask when the device is set.
>This makes the information available much sooner than relying on the
>IOMMU MR set_page_size_mask() call, which happens when the IOMMU MR
>gets enabled. We introduce a check_page_size_mask() helper whose code
>is inherited from current virtio_iommu_set_page_size_mask()
>implementation. This callback will be removed in a subsequent patch.
>
>Signed-off-by: Eric Auger 
>---
> hw/virtio/virtio-iommu.c | 55
>++--
> hw/virtio/trace-events   |  1 +
> 2 files changed, 54 insertions(+), 2 deletions(-)
>
>diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
>index b8f75d2b1a..631589735a 100644
>--- a/hw/virtio/virtio-iommu.c
>+++ b/hw/virtio/virtio-iommu.c
>@@ -598,9 +598,39 @@ out:
> return ret;
> }
>
>+static bool check_page_size_mask(VirtIOIOMMU *viommu, uint64_t
>new_mask,
>+ Error **errp)
>+{
>+uint64_t cur_mask = viommu->config.page_size_mask;
>+
>+if ((cur_mask & new_mask) == 0) {
>+error_setg(errp, "virtio-iommu reports a page size mask 0x%"PRIx64
>+   " incompatible with currently supported mask 0x%"PRIx64,
>+   new_mask, cur_mask);
>+return false;
>+}
>+/*
>+ * Once the granule is frozen we can't change the mask anymore. If by
>+ * chance the hotplugged device supports the same granule, we can still
>+ * accept it.
>+ */
>+if (viommu->granule_frozen) {
>+int cur_granule = ctz64(cur_mask);
>+
>+if (!(BIT_ULL(cur_granule) & new_mask)) {
>+error_setg(errp,
>+   "virtio-iommu does not support frozen granule 0x%llx",
>+   BIT_ULL(cur_granule));
>+return false;
>+}
>+}
>+return true;
>+}
>+
> static bool virtio_iommu_set_iommu_device(PCIBus *bus, void *opaque,
>int devfn,
>   HostIOMMUDevice *hiod, Error **errp)
> {
>+ERRP_GUARD();
> VirtIOIOMMU *viommu = opaque;
> HostIOMMUDeviceClass *hiodc =
>HOST_IOMMU_DEVICE_GET_CLASS(hiod);
> struct hiod_key *new_key;
>@@ -623,8 +653,26 @@ static bool
>virtio_iommu_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
> hiod->aliased_devfn,
> host_iova_ranges, errp);
> if (ret) {
>-g_list_free_full(host_iova_ranges, g_free);
>-return false;
>+goto error;
>+}
>+}
>+if (hiodc->get_page_size_mask) {
>+uint64_t new_mask = hiodc->get_page_size_mask(hiod);
>+
>+if (check_page_size_mask(viommu, new_mask, errp)) {
>+/*
>+ * The default mask depends on the "granule" property. For 
>example,
>+ * with 4k granule, it is -(4 * KiB). When an assigned device has
>+ * page size restrictions due to the hardware IOMMU configuration,
>+ * apply this restriction to the mask.
>+ */
>+trace_virtio_iommu_update_page_size_mask(hiod->name,
>+ 
>viommu->config.page_size_mask,
>+ new_mask);
>+viommu->config.page_size_mask &= new_mask;

This differs slightly from the original logic: it may update page_size_mask 
after the granule is frozen.
Could that cause an issue?

Apart from this question, for all the other patches:

Reviewed-by: Zhenzhong Duan 

Thanks
Zhenzhong
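
To make the question concrete, here is a minimal standalone sketch of the mask handling with the update skipped once the granule is frozen. The struct and helper names are hypothetical stand-ins for the VirtIOIOMMU fields and QEMU helpers (ctz64, BIT_ULL) used in the patch; the compatibility check mirrors check_page_size_mask() as quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant VirtIOIOMMU state. */
struct viommu_sketch {
    uint64_t page_size_mask;
    bool granule_frozen;
};

/* Index of the lowest set bit; the caller guarantees mask != 0. */
static int lowest_bit(uint64_t mask)
{
    int n = 0;

    while (!(mask & 1)) {
        mask >>= 1;
        n++;
    }
    return n;
}

/* Same two checks as check_page_size_mask() in the quoted patch. */
static bool mask_is_compatible(const struct viommu_sketch *v,
                               uint64_t new_mask)
{
    if ((v->page_size_mask & new_mask) == 0) {
        return false;           /* no page size in common */
    }
    if (v->granule_frozen) {
        uint64_t granule = 1ULL << lowest_bit(v->page_size_mask);

        if (!(granule & new_mask)) {
            return false;       /* host cannot map the frozen granule */
        }
    }
    return true;
}

/* Restrict the mask only while the granule is still negotiable. */
static bool apply_host_mask(struct viommu_sketch *v, uint64_t new_mask)
{
    if (!mask_is_compatible(v, new_mask)) {
        return false;
    }
    if (!v->granule_frozen) {
        v->page_size_mask &= new_mask;
    }
    return true;
}
```

With a default 4k mask, a host mask of 0x11000 narrows the exposed mask to 0x11000 before the freeze. After the freeze, a host device that supports only 64k is rejected and the exposed mask is left untouched.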

>+} else {
>+error_prepend(errp, "%s: ", hiod->name);
>+goto error;
> }
> }
>
>@@ -637,6 +685,9 @@ static bool
>virtio_iommu_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
> g_list_free_full(host_iova_ranges, g_free);
>
> return true;
>+error:
>+g_list_free_full(host_iova_ranges, g_free);
>+return false;
> }
>
> static void
>diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
>index 3cf84e04a7..599d855ff6 100644
>--- a/hw/virtio/trace-events
>+++ b/hw/virtio/trace-events
>@@ -132,6 +132,7 @@ virtio_iommu_notify_map(const char *name,
>uint64_t virt_start, uint64_t virt_end
> virtio_iommu_notify_unmap(const char *name, uint64_t virt_start,
>uint64_t virt_end) "mr=%s virt_start=0x%"PRIx64" virt_end=0x%"PRIx64
> virtio_iommu_remap(const char *name, uint64_t virt_start, uint64_t
>virt_end, uint64_t phys_start) "mr=%s virt_start=0x%"PRIx64"
>virt_end=0x%"PRIx64" phys_start=0x%"PRIx64
> virtio_iommu_set_page_size_mask(const char *name, uint64_t old,
>uint64_t new) "mr=%s old_mask=0x%"PRIx64" new_mask=0x%"PRIx64
>+virtio_iommu_update_page_size_mask(const char *name, uint64_t old,
>uint64_t new) "host iommu device=%s old_mask=0x%"PRIx64"
>new_mask=0x%"PRIx64
> virtio_iommu_notify_flag_add(const

RE: [PATCH 4/7] HostIOMMUDevice: Introduce get_page_size_mask() callback

2024-06-27 Thread Duan, Zhenzhong



>-Original Message-
>From: Eric Auger 
>Subject: Re: [PATCH 4/7] HostIOMMUDevice: Introduce
>get_page_size_mask() callback
>
>Hi Zhenzhong,
>
>On 6/27/24 05:06, Duan, Zhenzhong wrote:
>> Hi Eric,
>>
>>> -Original Message-
>>> From: Eric Auger 
>>> Subject: [PATCH 4/7] HostIOMMUDevice: Introduce get_page_size_mask()
>>> callback
>>>
>>> This callback will be used to retrieve the page size mask supported
>>> along a given Host IOMMU device.
>>>
>>> Signed-off-by: Eric Auger 
>>> ---
>>> include/hw/vfio/vfio-container-base.h |  7 +++
>>> include/sysemu/host_iommu_device.h|  8 
>>> hw/vfio/container.c   | 10 ++
>>> hw/vfio/iommufd.c | 11 +++
>>> 4 files changed, 36 insertions(+)
>>>
>>> diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-
>>> container-base.h
>>> index 45d7c40fce..62a8b60d87 100644
>>> --- a/include/hw/vfio/vfio-container-base.h
>>> +++ b/include/hw/vfio/vfio-container-base.h
>>> @@ -88,6 +88,13 @@ int vfio_container_query_dirty_bitmap(const
>>> VFIOContainerBase *bcontainer,
>>>
>>> GList *vfio_container_get_iova_ranges(const VFIOContainerBase
>>> *bcontainer);
>>>
>>> +static inline uint64_t
>>> +vfio_container_get_page_size_mask(const VFIOContainerBase
>*bcontainer)
>>> +{
>>> +assert(bcontainer);
>>> +return bcontainer->pgsizes;
>>> +}
>>> +
>>> #define TYPE_VFIO_IOMMU "vfio-iommu"
>>> #define TYPE_VFIO_IOMMU_LEGACY TYPE_VFIO_IOMMU "-legacy"
>>> #define TYPE_VFIO_IOMMU_SPAPR TYPE_VFIO_IOMMU "-spapr"
>>> diff --git a/include/sysemu/host_iommu_device.h
>>> b/include/sysemu/host_iommu_device.h
>>> index 05c7324a0d..c1bf74ae2c 100644
>>> --- a/include/sysemu/host_iommu_device.h
>>> +++ b/include/sysemu/host_iommu_device.h
>>> @@ -89,6 +89,14 @@ struct HostIOMMUDeviceClass {
>>>  * @hiod: handle to the host IOMMU device
>>>  */
>>> GList* (*get_iova_ranges)(HostIOMMUDevice *hiod);
>>> +/**
>>> + *
>>> + * @get_page_size_mask: Return the page size mask supported along
>>> this
>>> + * @hiod Host IOMMU device
>>> + *
>>> + * @hiod: handle to the host IOMMU device
>>> + */
>>> +uint64_t (*get_page_size_mask)(HostIOMMUDevice *hiod);
>> Not sure if it's simpler to utilize existing .get_cap() to get pgsizes.
>I chose to introduce a new callback because the page_mask can be
>U64_MAX
>and get_cap is likely to return a negative value. So we could not
>distinguish between an error and a full mask.

I see, you are right.

Thanks
Zhenzhong
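
Eric's point can be illustrated with a small sketch. It assumes, as his reply implies, that a get_cap()-style getter returns an int and reserves negative values for errors; the names below are hypothetical and not part of the QEMU API.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True if a page size mask could be returned through an int result
 * without colliding with the "negative means error" convention.
 */
static bool mask_fits_int_result(uint64_t mask)
{
    return mask <= (uint64_t)INT_MAX;
}

/* Hypothetical shape of a dedicated callback: full 64-bit result. */
typedef uint64_t (*get_page_size_mask_fn)(void *hiod);

/* Example implementation reporting that every page size is supported. */
static uint64_t full_mask(void *hiod)
{
    (void)hiod;
    return UINT64_MAX;
}
```

A full mask (UINT64_MAX) or the default 4k mask -(4 * KiB) both fail mask_fits_int_result(), whereas a dedicated uint64_t callback returns them unchanged.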

RE: [PATCH 4/7] HostIOMMUDevice: Introduce get_page_size_mask() callback

2024-06-26 Thread Duan, Zhenzhong

Hi Eric,

>-Original Message-
>From: Eric Auger 
>Subject: [PATCH 4/7] HostIOMMUDevice: Introduce get_page_size_mask()
>callback
>
>This callback will be used to retrieve the page size mask supported
>along a given Host IOMMU device.
>
>Signed-off-by: Eric Auger 
>---
> include/hw/vfio/vfio-container-base.h |  7 +++
> include/sysemu/host_iommu_device.h|  8 
> hw/vfio/container.c   | 10 ++
> hw/vfio/iommufd.c | 11 +++
> 4 files changed, 36 insertions(+)
>
>diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-
>container-base.h
>index 45d7c40fce..62a8b60d87 100644
>--- a/include/hw/vfio/vfio-container-base.h
>+++ b/include/hw/vfio/vfio-container-base.h
>@@ -88,6 +88,13 @@ int vfio_container_query_dirty_bitmap(const
>VFIOContainerBase *bcontainer,
>
> GList *vfio_container_get_iova_ranges(const VFIOContainerBase
>*bcontainer);
>
>+static inline uint64_t
>+vfio_container_get_page_size_mask(const VFIOContainerBase *bcontainer)
>+{
>+assert(bcontainer);
>+return bcontainer->pgsizes;
>+}
>+
> #define TYPE_VFIO_IOMMU "vfio-iommu"
> #define TYPE_VFIO_IOMMU_LEGACY TYPE_VFIO_IOMMU "-legacy"
> #define TYPE_VFIO_IOMMU_SPAPR TYPE_VFIO_IOMMU "-spapr"
>diff --git a/include/sysemu/host_iommu_device.h
>b/include/sysemu/host_iommu_device.h
>index 05c7324a0d..c1bf74ae2c 100644
>--- a/include/sysemu/host_iommu_device.h
>+++ b/include/sysemu/host_iommu_device.h
>@@ -89,6 +89,14 @@ struct HostIOMMUDeviceClass {
>  * @hiod: handle to the host IOMMU device
>  */
> GList* (*get_iova_ranges)(HostIOMMUDevice *hiod);
>+/**
>+ *
>+ * @get_page_size_mask: Return the page size mask supported along
>this
>+ * @hiod Host IOMMU device
>+ *
>+ * @hiod: handle to the host IOMMU device
>+ */
>+uint64_t (*get_page_size_mask)(HostIOMMUDevice *hiod);

Not sure if it's simpler to utilize existing .get_cap() to get pgsizes.

Thanks
Zhenzhong

> };
>
> /*
>diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>index adeab1ac89..b5ce559a0d 100644
>--- a/hw/vfio/container.c
>+++ b/hw/vfio/container.c
>@@ -1174,6 +1174,15 @@
>hiod_legacy_vfio_get_iova_ranges(HostIOMMUDevice *hiod)
> return vfio_container_get_iova_ranges(vdev->bcontainer);
> }
>
>+static uint64_t
>+hiod_legacy_vfio_get_page_size_mask(HostIOMMUDevice *hiod)
>+{
>+VFIODevice *vdev = hiod->agent;
>+
>+g_assert(vdev);
>+return vfio_container_get_page_size_mask(vdev->bcontainer);
>+}
>+
> static void vfio_iommu_legacy_instance_init(Object *obj)
> {
> VFIOContainer *container = VFIO_IOMMU_LEGACY(obj);
>@@ -1188,6 +1197,7 @@ static void
>hiod_legacy_vfio_class_init(ObjectClass *oc, void *data)
> hioc->realize = hiod_legacy_vfio_realize;
> hioc->get_cap = hiod_legacy_vfio_get_cap;
> hioc->get_iova_ranges = hiod_legacy_vfio_get_iova_ranges;
>+hioc->get_page_size_mask = hiod_legacy_vfio_get_page_size_mask;
> };
>
> static const TypeInfo types[] = {
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index 211e7223f1..7b5f87a148 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -652,12 +652,23 @@
>hiod_iommufd_vfio_get_iova_ranges(HostIOMMUDevice *hiod)
> return vfio_container_get_iova_ranges(vdev->bcontainer);
> }
>
>+static uint64_t
>+hiod_iommufd_vfio_get_page_size_mask(HostIOMMUDevice *hiod)
>+{
>+VFIODevice *vdev = hiod->agent;
>+
>+g_assert(vdev);
>+return vfio_container_get_page_size_mask(vdev->bcontainer);
>+}
>+
>+
> static void hiod_iommufd_vfio_class_init(ObjectClass *oc, void *data)
> {
> HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_CLASS(oc);
>
> hiodc->realize = hiod_iommufd_vfio_realize;
> hiodc->get_iova_ranges = hiod_iommufd_vfio_get_iova_ranges;
>+hiodc->get_page_size_mask = hiod_iommufd_vfio_get_page_size_mask;
> };
>
> static const TypeInfo types[] = {
>--
>2.41.0

RE: [PATCH v4 0/8] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices

2024-06-17 Thread Duan, Zhenzhong




>-Original Message-
>From: Eric Auger 
>Subject: [PATCH v4 0/8] VIRTIO-IOMMU/VFIO: Fix host iommu geometry
>handling for hotplugged devices
>
>This series is based on Zhenzhong HostIOMMUDevice:
>
>[PATCH v7 00/17] Add a host IOMMU device abstraction to check with
>vIOMMU
>https://lore.kernel.org/all/20240605083043.317831-1-
>zhenzhong.d...@intel.com/
>
>It allows to convey host IOVA reserved regions to the virtio-iommu and
>uses the HostIOMMUDevice infrastructure. This replaces the usage of
>IOMMU MR ops which fail to satisfy this need for hotplugged devices.
>
>See below for additional background.
>
>In [1] we attempted to fix a case where a VFIO-PCI device protected
>with a virtio-iommu was assigned to an x86 guest. On x86 the physical
>IOMMU may have an address width (gaw) of 39 or 48 bits whereas the
>virtio-iommu used to expose a 64b address space by default.
>Hence the guest was trying to use the full 64b space and we hit
>DMA MAP failures. To work around this issue we managed to pass
>usable IOVA regions (excluding the out of range space) from VFIO
>to the virtio-iommu device. This was made feasible by introducing
>a new IOMMU Memory Region callback dubbed iommu_set_iova_regions().
>This latter gets called when the IOMMU MR is enabled which
>causes the vfio_listener_region_add() to be called.
>
>For coldplugged devices the technique works because we make sure all
>the IOMMU MR are enabled once on the machine init done: 94df5b2180
>("virtio-iommu: Fix 64kB host page size VFIO device assignment")
>for granule freeze. But I would be keen to get rid of this trick.
>
>However with VFIO-PCI hotplug, this technique fails due to the
>race between the call to the callback in the add memory listener
>and the virtio-iommu probe request. Indeed the probe request gets
>called before the attach to the domain. So in that case the usable
>regions are communicated after the probe request and fail to be
>conveyed to the guest.
>
>Using an IOMMU MR Ops is impractical because this relies on the IOMMU
>MR to have been enabled and the corresponding vfio_listener_region_add()
>to be executed. Instead this series proposes to replace the usage of this
>API by the recently introduced PCIIOMMUOps: ba7d12eb8c  ("hw/pci:
>modify
>pci_setup_iommu() to set PCIIOMMUOps"). That way, the callback can be
>called earlier, once the usable IOVA regions have been collected by
>VFIO, without the need for the IOMMU MR to be enabled.
>
>This series also removes the spurious message:
>qemu-system-aarch64: warning: virtio-iommu-memory-region-7-0: Notified
>about new host reserved regions after probe
>
>In the short term this may also be used for passing the page size
>mask, which would allow to get rid of the hacky transient IOMMU
>MR enablement mentioned above.
>
>[1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space
>https://lore.kernel.org/all/20231019134651.842175-1-
>eric.au...@redhat.com/
>
>Extra Notes:
>With that series, the reserved memory regions are communicated on time
>so that the virtio-iommu probe request grabs them. However this is not
>sufficient. In some cases (my case), I still see some DMA MAP failures
>and the guest keeps on using IOVA ranges outside the geometry of the
>physical IOMMU. This is due to the fact the VFIO-PCI device is in the
>same iommu group as the pcie root port. Normally the kernel
>iova_reserve_iommu_regions (dma-iommu.c) is supposed to call
>reserve_iova()
>for each reserved IOVA, which carves them out of the allocator. When
>iommu_dma_init_domain() gets called for the hotplugged vfio-pci device
>the iova domain is already allocated and set and we don't call
>iova_reserve_iommu_regions() again for the vfio-pci device. So its
>corresponding reserved regions are not properly taken into account.
>
>This is not trivial to fix because theoretically the 1st attached
>devices could already have allocated IOVAs within the reserved regions
>of the second device. Also we are somehow hijacking the reserved
>memory regions to model the geometry of the physical IOMMU so not sure
>any attempt to fix that upstream will be accepted. At the moment one
>solution is to make sure assigned devices end up in singleton group.
>Another solution is to work on a different approach where the gaw
>can be passed as an option to the virtio-iommu device, similarly to
>what is done with the intel-iommu.
>
>This series can be found at:
>https://github.com/eauger/qemu/tree/iommufd_nesting_preq_v7_resv_re
>gions_v4

For the whole series,

Reviewed-by: Zhenzhong Duan 

Thanks
Zhenzhong
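
As a rough illustration of the address-width problem described in the cover letter, the sketch below computes the highest IOVA reachable with a given width and checks whether a mapping stays inside it. The function names are hypothetical; the mask formula is the usual (1 << aw) - 1.

```c
#include <stdbool.h>
#include <stdint.h>

/* Highest IOVA addressable with aw_bits address bits. */
static uint64_t max_iova_for_width(unsigned aw_bits)
{
    return aw_bits >= 64 ? UINT64_MAX : (1ULL << aw_bits) - 1;
}

/* True if [start, start + size) lies entirely below the width limit. */
static bool iova_range_usable(uint64_t start, uint64_t size,
                              unsigned aw_bits)
{
    uint64_t limit = max_iova_for_width(aw_bits);

    if (size == 0 || start > limit) {
        return false;
    }
    /* Written this way to avoid overflow in start + size. */
    return size - 1 <= limit - start;
}
```

A guest that assumes a 64-bit space may pick an IOVA at 1ULL << 39; that is usable with a 48-bit physical IOMMU but not with a 39-bit one, which is the kind of DMA MAP failure the series addresses by conveying the usable ranges to the guest.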

>
>History:
>v3 -> v4:
>- add one patch to add aliased pci bus and devfn in the HostIOMMUDevice
>- Use those for resv regions computation
>- Remove VirtioHostIOMMUDevice and simply use the base object
>
>v2 -> v3:
>- moved the series from RFC to patch
>- collected Zhenzhong's R-bs and took into account most of his comments
>  (see replies on v2)
>
>
>Eric Auger (8):
>  HostIOMMUDevice: Store the VFIO/VDPA agent
>  virtio-iommu: I
