Re: [Qemu-devel] [PATCH] x86: Skip check apic_id_limit for Xen
On 2017-10-26 22:27, Michael S. Tsirkin wrote:
> On Thu, Oct 26, 2017 at 02:19:43PM +0200, Eduardo Habkost wrote:
>> On Mon, Aug 21, 2017 at 10:22:15AM +0800, Lan Tianyu wrote:
>>> On 2017-08-19 00:38, Eduardo Habkost wrote:
>>>> On Thu, Aug 17, 2017 at 09:37:10AM +0800, Lan Tianyu wrote:
>>>>> On 2017-08-16 19:21, Paolo Bonzini wrote:
>>>>>> On 16/08/2017 02:22, Lan Tianyu wrote:
>>>>>>> Xen vIOMMU device model will be in Xen hypervisor. Skip vIOMMU
>>>>>>> check for Xen here when vcpu number is more than 255.
>>>>>>
>>>>>> I think you still need to do a check for vIOMMU being enabled.
>>>>>
>>>>> Yes, this will be done in the Xen tool stack and QEMU doesn't have such
>>>>> knowledge. Creating and destroying the Xen vIOMMU will be done in the
>>>>> Xen tool stack.
>>>>
>>>> Shouldn't we make QEMU have knowledge of the vIOMMU device, then?
>>>> Won't QEMU need to know about it eventually?
>>>
>>> Hi Eduardo:
>>> Thanks for your review.
>>> Xen has some guest modes which don't use QEMU, and we tried to make
>>> the Xen vIOMMU framework compatible with all guest modes. So far, we are
>>> adding the interrupt remapping function to the Xen vIOMMU and find that
>>> QEMU doesn't need to know about the Xen vIOMMU. The check for a vCPU
>>> number > 255 will be done on the Xen side, so skip the check in QEMU to
>>> avoid blocking Xen from creating >255 vCPUs.
>>> We may make QEMU aware of the vIOMMU device if that becomes
>>> necessary when adding new functions.
>>
>> I was expecting it to go through the PC tree, but I will queue it
>> on x86-next instead.
>
> I was waiting for an ack from you or Paolo as you participated in the
> discussion. But sure, go ahead.
>
> Acked-by: Michael S. Tsirkin

Great. Thanks.

--
Best regards
Tianyu Lan
Re: [Qemu-devel] [PATCH] x86: Skip check apic_id_limit for Xen
On 2017-08-19 00:38, Eduardo Habkost wrote:
> On Thu, Aug 17, 2017 at 09:37:10AM +0800, Lan Tianyu wrote:
>> On 2017-08-16 19:21, Paolo Bonzini wrote:
>>> On 16/08/2017 02:22, Lan Tianyu wrote:
>>>> Xen vIOMMU device model will be in Xen hypervisor. Skip vIOMMU
>>>> check for Xen here when vcpu number is more than 255.
>>>
>>> I think you still need to do a check for vIOMMU being enabled.
>>
>> Yes, this will be done in the Xen tool stack and QEMU doesn't have such
>> knowledge. Creating and destroying the Xen vIOMMU will be done in the
>> Xen tool stack.
>
> Shouldn't we make QEMU have knowledge of the vIOMMU device, then?
> Won't QEMU need to know about it eventually?

Hi Eduardo:
Thanks for your review.
Xen has some guest modes which don't use QEMU, and we tried to make the
Xen vIOMMU framework compatible with all guest modes. So far, we are
adding the interrupt remapping function to the Xen vIOMMU and find that
QEMU doesn't need to know about the Xen vIOMMU. The check for a vCPU
number > 255 will be done on the Xen side, so skip the check in QEMU to
avoid blocking Xen from creating >255 vCPUs.
We may make QEMU aware of the vIOMMU device if that becomes necessary
when adding new functions.

--
Best regards
Tianyu Lan
Re: [Qemu-devel] [PATCH] x86: Skip check apic_id_limit for Xen
On 2017-08-16 19:21, Paolo Bonzini wrote:
> On 16/08/2017 02:22, Lan Tianyu wrote:
>> Xen vIOMMU device model will be in Xen hypervisor. Skip vIOMMU
>> check for Xen here when vcpu number is more than 255.
>
> I think you still need to do a check for vIOMMU being enabled.

Yes, this will be done in the Xen tool stack and QEMU doesn't have such
knowledge. Creating and destroying the Xen vIOMMU will be done in the
Xen tool stack.

> Paolo
>
>> Signed-off-by: Lan Tianyu
>> ---
>>  hw/i386/pc.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index 5943539..fc17885 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -1260,7 +1260,7 @@ void pc_machine_done(Notifier *notifier, void *data)
>>          fw_cfg_modify_i16(pcms->fw_cfg, FW_CFG_NB_CPUS, pcms->boot_cpus);
>>      }
>>
>> -    if (pcms->apic_id_limit > 255) {
>> +    if (pcms->apic_id_limit > 255 && !xen_enabled()) {
>>          IntelIOMMUState *iommu =
>>              INTEL_IOMMU_DEVICE(x86_iommu_get_default());
>>
>>          if (!iommu || !iommu->x86_iommu.intr_supported ||
[Qemu-devel] [PATCH] x86: Skip check apic_id_limit for Xen
Xen vIOMMU device model will be in the Xen hypervisor. Skip the vIOMMU
check for Xen here when the vCPU number is more than 255.

Signed-off-by: Lan Tianyu
---
 hw/i386/pc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 5943539..fc17885 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1260,7 +1260,7 @@ void pc_machine_done(Notifier *notifier, void *data)
         fw_cfg_modify_i16(pcms->fw_cfg, FW_CFG_NB_CPUS, pcms->boot_cpus);
     }
 
-    if (pcms->apic_id_limit > 255) {
+    if (pcms->apic_id_limit > 255 && !xen_enabled()) {
         IntelIOMMUState *iommu =
             INTEL_IOMMU_DEVICE(x86_iommu_get_default());
 
         if (!iommu || !iommu->x86_iommu.intr_supported ||
--
1.8.3.1
[Qemu-devel] [PATCH v3 2/2] x86: Increase max vcpu number to 8192
For the HPC use case, users will create a huge VM with as many vCPUs as
host CPUs, and this requires support for more vCPUs in a single VM. This
patch increases the max vCPU number from 288 to 8192, which is the
current default maximum CPU number for the Linux kernel.

Signed-off-by: Lan Tianyu
---
 hw/i386/pc_q35.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index b469d05..3468f5a 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -299,7 +299,7 @@ static void pc_q35_machine_options(MachineClass *m)
     m->default_display = "std";
     m->no_floppy = 1;
     m->has_dynamic_sysbus = true;
-    m->max_cpus = 288;
+    m->max_cpus = 8192;
 }
 
 static void pc_q35_2_11_machine_options(MachineClass *m)
@@ -314,6 +314,7 @@ static void pc_q35_2_10_machine_options(MachineClass *m)
 {
     pc_q35_2_11_machine_options(m);
     m->alias = NULL;
+    m->max_cpus = 288;
     m->numa_auto_assign_ram = numa_legacy_auto_assign_ram;
     SET_MACHINE_COMPAT(m, PC_COMPAT_2_10);
 }
--
1.8.3.1
[Qemu-devel] [PATCH v3 1/2] pc: add 2.11 machine type
Signed-off-by: Lan Tianyu
---
 hw/i386/pc_piix.c    | 14 +++++++++++---
 hw/i386/pc_q35.c     | 14 +++++++++++---
 include/hw/compat.h  |  3 +++
 include/hw/i386/pc.h |  3 +++
 4 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 11b4336..e99ec8c 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -436,11 +436,21 @@ static void pc_i440fx_machine_options(MachineClass *m)
     m->default_display = "std";
 }
 
-static void pc_i440fx_2_10_machine_options(MachineClass *m)
+static void pc_i440fx_2_11_machine_options(MachineClass *m)
 {
     pc_i440fx_machine_options(m);
     m->alias = "pc";
     m->is_default = 1;
+}
+
+DEFINE_I440FX_MACHINE(v2_11, "pc-i440fx-2.11", NULL,
+                      pc_i440fx_2_11_machine_options);
+
+static void pc_i440fx_2_10_machine_options(MachineClass *m)
+{
+    pc_i440fx_machine_options(m);
+    m->is_default = 0;
+    m->alias = NULL;
     m->numa_auto_assign_ram = numa_legacy_auto_assign_ram;
 }
 
@@ -450,8 +460,6 @@ DEFINE_I440FX_MACHINE(v2_10, "pc-i440fx-2.10", NULL,
 static void pc_i440fx_2_9_machine_options(MachineClass *m)
 {
     pc_i440fx_2_10_machine_options(m);
-    m->is_default = 0;
-    m->alias = NULL;
     SET_MACHINE_COMPAT(m, PC_COMPAT_2_9);
 }
 
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 169a214..b469d05 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -302,11 +302,20 @@ static void pc_q35_machine_options(MachineClass *m)
     m->max_cpus = 288;
 }
 
+static void pc_q35_2_11_machine_options(MachineClass *m)
+{
+    pc_q35_machine_options(m);
+    m->alias = "q35";
+}
+
+DEFINE_Q35_MACHINE(v2_11, "pc-q35-2.11", NULL,
+                   pc_q35_2_11_machine_options);
+
 static void pc_q35_2_10_machine_options(MachineClass *m)
 {
-    pc_q35_machine_options(m);
-    m->alias = "q35";
+    pc_q35_2_11_machine_options(m);
+    m->alias = NULL;
     m->numa_auto_assign_ram = numa_legacy_auto_assign_ram;
+    SET_MACHINE_COMPAT(m, PC_COMPAT_2_10);
 }
 
 DEFINE_Q35_MACHINE(v2_10, "pc-q35-2.10", NULL,
@@ -315,7 +324,6 @@ DEFINE_Q35_MACHINE(v2_10, "pc-q35-2.10", NULL,
 static void pc_q35_2_9_machine_options(MachineClass *m)
 {
     pc_q35_2_10_machine_options(m);
-    m->alias = NULL;
     SET_MACHINE_COMPAT(m, PC_COMPAT_2_9);
 }
 
diff --git a/include/hw/compat.h b/include/hw/compat.h
index 08f3600..3e101f8 100644
--- a/include/hw/compat.h
+++ b/include/hw/compat.h
@@ -1,6 +1,9 @@
 #ifndef HW_COMPAT_H
 #define HW_COMPAT_H
 
+#define HW_COMPAT_2_10 \
+    /* empty */
+
 #define HW_COMPAT_2_9 \
     {\
         .driver   = "pci-bridge",\
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index d80859b..8226904 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -369,6 +369,9 @@ int e820_add_entry(uint64_t, uint64_t, uint32_t);
 int e820_get_num_entries(void);
 bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
 
+#define PC_COMPAT_2_10 \
+    HW_COMPAT_2_10 \
+
 #define PC_COMPAT_2_9 \
     HW_COMPAT_2_9 \
     {\
--
1.8.3.1
Re: [Qemu-devel] [PATCH V2 2/3] xen-pt: bind/unbind interrupt remapping format MSI
Hi Anthony:

On 2017-08-12 02:04, Anthony PERARD wrote:
> On Wed, Aug 09, 2017 at 04:51:21PM -0400, Lan Tianyu wrote:
>> From: Chao Gao
>>
>> If a vIOMMU is exposed to guest, guest will configure the msi to remapping
>> format. The original code isn't suitable to the new format. A new pair of
>> bind/unbind interfaces are added for this usage. This patch recognizes
>> this case and uses the new interfaces to bind/unbind msi.
>>
>> Signed-off-by: Chao Gao
>> Signed-off-by: Lan Tianyu
>
> Reviewed-by: Anthony PERARD
>
> That patch series can be applied once the Xen side patches are merged.

Great. Thanks for your review.

--
Best regards
Tianyu Lan
[Qemu-devel] [Update PATCH V2] x86: Increase max vcpu number to 8192
The Intel Xeon Phi chip will support 352 logical threads. For the HPC
use case, users will create a huge VM with as many vCPUs as host CPUs.
This patch increases the max vCPU number from 288 to 8192, which is the
current default maximum CPU number for the Linux kernel.

Signed-off-by: Lan Tianyu
---
Change since v1:
       * Bump max vcpu number from 352 to 8192
       * Add compat support for new max vcpu limitation

 hw/i386/pc_q35.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 169a214..e093601 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -299,13 +299,21 @@ static void pc_q35_machine_options(MachineClass *m)
     m->default_display = "std";
     m->no_floppy = 1;
     m->has_dynamic_sysbus = true;
-    m->max_cpus = 288;
+    m->max_cpus = 8192;
 }
 
+static void pc_q35_2_11_machine_options(MachineClass *m)
+{
+    pc_q35_machine_options(m);
+    m->alias = "q35";
+}
+
+DEFINE_Q35_MACHINE(v2_11, "pc-q35-2.11", NULL,
+                   pc_q35_2_11_machine_options);
+
 static void pc_q35_2_10_machine_options(MachineClass *m)
 {
-    pc_q35_machine_options(m);
-    m->alias = "q35";
+    pc_q35_2_11_machine_options(m);
+    m->max_cpus = 288;
     m->numa_auto_assign_ram = numa_legacy_auto_assign_ram;
 }
--
1.8.3.1
[Qemu-devel] [PATCH v2] x86: Increase max vcpu number to 8192
The Intel Xeon Phi chip will support 352 logical threads. For the HPC
use case, users will create a huge VM with as many vCPUs as host CPUs.
This patch increases the max vCPU number from 288 to 8192, which is the
current default maximum CPU number for the Linux kernel.

Signed-off-by: Lan Tianyu
---
Change since v1:
       * Bump max vcpu number from 352 to 8192
       * Add compat support for new max vcpu limitation

 hw/i386/pc_q35.c     | 15 ++++++++++++---
 include/hw/i386/pc.h |  2 ++
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 169a214..06fdc95 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -299,14 +299,23 @@ static void pc_q35_machine_options(MachineClass *m)
     m->default_display = "std";
     m->no_floppy = 1;
     m->has_dynamic_sysbus = true;
-    m->max_cpus = 288;
+    m->max_cpus = 8192;
 }
 
+static void pc_q35_2_11_machine_options(MachineClass *m)
+{
+    pc_q35_machine_options(m);
+    m->alias = "q35";
+}
+
+DEFINE_Q35_MACHINE(v2_11, "pc-q35-2.11", NULL,
+                   pc_q35_2_11_machine_options);
+
 static void pc_q35_2_10_machine_options(MachineClass *m)
 {
-    pc_q35_machine_options(m);
-    m->alias = "q35";
+    pc_q35_2_11_machine_options(m);
+    m->max_cpus = 288;
     m->numa_auto_assign_ram = numa_legacy_auto_assign_ram;
+    SET_MACHINE_COMPAT(m, PC_COMPAT_2_10);
 }
 
 DEFINE_Q35_MACHINE(v2_10, "pc-q35-2.10", NULL,
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index d80859b..634abc6 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -369,6 +369,8 @@ int e820_add_entry(uint64_t, uint64_t, uint32_t);
 int e820_get_num_entries(void);
 bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
 
+#define PC_COMPAT_2_10 \
+    HW_COMPAT_2_10 \
 
 #define PC_COMPAT_2_9 \
     HW_COMPAT_2_9 \
     {\
--
1.8.3.1
Re: [Qemu-devel] [PATCH] x86: Increase max vcpu number to 352
On 2017-08-11 03:22, Radim Krčmář wrote:
> 2017-08-10 15:16-0300, Eduardo Habkost:
>> On Thu, Aug 10, 2017 at 02:41:03PM +0200, Radim Krčmář wrote:
>>> 2017-08-10 19:02+0800, Lan Tianyu:
>>>> On 2017-08-10 18:26, Daniel P. Berrange wrote:
>>>>> On Thu, Aug 10, 2017 at 06:08:07PM +0800, Lan Tianyu wrote:
>>>>>> Intel Xeon phi chip will support 352 logical threads. For HPC
>>>>>> usage case, it will create a huge VM with vcpus number as same as host
>>>>>> cpus. This patch is to increase max vcpu number to 352.
>>>>>
>>>>> If we pick arbitrary limits based on the size of physical CPUs that happen
>>>>> to be shipping today, we'll continue the cat+mouse game forever, trailing
>>>>> the latest CPUs that vendors ship.
>>>>>
>>>>> IMHO we should pick a higher number influenced by technical constraints
>>>>> of the q35 impl instead. eg can we go straight to something like 512 or
>>>>> 1024?
>>>>
>>>> Sure. 512 should be enough, and some arrays are defined according to
>>>> the max vcpu number.
>>>
>>> Hm, which arrays are that? I was thinking it is safe to bump it to
>>> INT_MAX as the number is only used when setting global max_cpus.
>>
>> We had a MAX_CPUMASK_BITS macro, and bitmaps whose sizes were
>> defined at compile time based on it. But commit
>> cdda2018e3b9ce0c18938767dfdb1e05a05b67ca removed it. Probably
>> those arrays all use max_cpus by now (and the default for
>> max_cpus is smp_cpus, not MachineClass::max_cpus).
>
> Ah, thanks.
>
>> Anyway, if we set it to INT_MAX, there are some cases where more
>> appropriate error checking/reporting could be required because
>> they won't handle overflow very well:
>>  * pcms->apic_id_limit initialization at pc_cpus_init()
>>  * ACPI code that assumes possible_cpus->cpus[i].arch_id fits
>>    in a 32-bit integer
>>  * Other x86_cpu_apic_id_from_index() calls in PC code
>>    (especially the initialization of possible_cpus->cpus[i].arch_id).
>>    Note that x86_cpu_apic_id_from_index(cpu_index) might not fit
>>    in 32 bits even if cpu_index <= UINT32_MAX.
>
> Good point, looks like it all comes down to x86_cpu_apic_id_from_index().
> Each level of the topology has at most one underutilized bit, so
> 2^(32 - 3) would be safe.
>
> It is still needlessly large for the foreseeable future, but 512 is
> going to be surpassed pretty soon, so I think that jumping at least to
> 8k would be better.
> (8k is the current default maximum for Linux, and the resulting
> overcommit of ~20 is bearable for smoke testing on current hardware.)

Hi All:
Thanks for your input. I tried QEMU with 8192 as max_cpus and it works
normally. I will update the patches.

--
Best regards
Tianyu Lan
Re: [Qemu-devel] [PATCH] x86: Increase max vcpu number to 352
On 2017-08-10 18:26, Daniel P. Berrange wrote:
> On Thu, Aug 10, 2017 at 06:08:07PM +0800, Lan Tianyu wrote:
>> Intel Xeon phi chip will support 352 logical threads. For HPC
>> usage case, it will create a huge VM with vcpus number as same as host
>> cpus. This patch is to increase max vcpu number to 352.
>
> If we pick arbitrary limits based on the size of physical CPUs that happen
> to be shipping today, we'll continue the cat+mouse game forever, trailing
> the latest CPUs that vendors ship.
>
> IMHO we should pick a higher number influenced by technical constraints
> of the q35 impl instead. eg can we go straight to something like 512 or
> 1024?

Sure. 512 should be enough, and some arrays are defined according to the
max vcpu number.

>
>> Signed-off-by: Lan Tianyu
>> ---
>>  hw/i386/pc_q35.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
>> index 169a214..5e93749 100644
>> --- a/hw/i386/pc_q35.c
>> +++ b/hw/i386/pc_q35.c
>> @@ -299,7 +299,7 @@ static void pc_q35_machine_options(MachineClass *m)
>>      m->default_display = "std";
>>      m->no_floppy = 1;
>>      m->has_dynamic_sysbus = true;
>> -    m->max_cpus = 288;
>> +    m->max_cpus = 352;
>>  }
>
> You'll need to introduce machine type back compat support so that we
> avoid changing the 2.10 q35 machine type - only the 2.11 machine
> type should have the new limit.

How about the following change?

-static void pc_q35_2_10_machine_options(MachineClass *m)
+static void pc_q35_2_11_machine_options(MachineClass *m)
 {
     pc_q35_machine_options(m);
     m->alias = "q35";
+}
+
+static void pc_q35_2_10_machine_options(MachineClass *m)
+{
+    pc_q35_2_11_machine_options(m);
+    m->alias = "q35";
+    m->max_cpus = 288;
     m->numa_auto_assign_ram = numa_legacy_auto_assign_ram;
 }

>
> Regards,
> Daniel
>

--
Best regards
Tianyu Lan
[Qemu-devel] [PATCH] x86: Increase max vcpu number to 352
The Intel Xeon Phi chip will support 352 logical threads. For the HPC
use case, users will create a huge VM with as many vCPUs as host CPUs.
This patch increases the max vCPU number to 352.

Signed-off-by: Lan Tianyu
---
 hw/i386/pc_q35.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 169a214..5e93749 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -299,7 +299,7 @@ static void pc_q35_machine_options(MachineClass *m)
     m->default_display = "std";
     m->no_floppy = 1;
     m->has_dynamic_sysbus = true;
-    m->max_cpus = 288;
+    m->max_cpus = 352;
 }
 
 static void pc_q35_2_10_machine_options(MachineClass *m)
--
1.8.3.1
Re: [Qemu-devel] [PATCH v2 0/3] Qemu: Add Xen vIOMMU interrupt remapping function support
On 2017-08-10 17:04, Paolo Bonzini wrote:
> On 09/08/2017 22:51, Lan Tianyu wrote:
>> This patchset is to deal with MSI interrupt remapping request when guest
>> updates MSI registers.
>>
>> Chao Gao (3):
>>   i386/msi: Correct mask of destination ID in MSI address
>>   xen-pt: bind/unbind interrupt remapping format MSI
>>   msi: Handle remappable format interrupt request
>>
>>  configure                     |  4 +++-
>>  hw/i386/xen/xen-hvm.c         |  8 ++++++-
>>  hw/pci/msi.c                  |  5 +++--
>>  hw/pci/msix.c                 |  4 +++-
>>  hw/xen/xen_pt_msi.c           | 52 +++++++++++++++++++++++++++++++-----------
>>  include/hw/i386/apic-msidef.h |  3 ++-
>>  include/hw/xen/xen.h          |  2 +-
>>  include/hw/xen/xen_common.h   | 25 +++++++++++++++++++++
>>  stubs/xen-hvm.c               |  2 +-
>>  9 files changed, 83 insertions(+), 22 deletions(-)
>
> Non-Xen parts look good (though I cannot ack them).
>
> Paolo

Never mind. Thanks for your review.

--
Best regards
Tianyu Lan
[Qemu-devel] [PATCH V2 2/3] xen-pt: bind/unbind interrupt remapping format MSI
From: Chao Gao

If a vIOMMU is exposed to the guest, the guest will configure its MSIs
in remapping format. The original code isn't suitable for the new
format. A new pair of bind/unbind interfaces is added for this usage.
This patch recognizes this case and uses the new interfaces to
bind/unbind MSIs.

Signed-off-by: Chao Gao
Signed-off-by: Lan Tianyu
---
 configure                     |  4 +++-
 hw/xen/xen_pt_msi.c           | 50 ++++++++++++++++++++++++++-----------
 include/hw/i386/apic-msidef.h |  1 +
 include/hw/xen/xen_common.h   | 25 ++++++++++++++++++
 4 files changed, 66 insertions(+), 14 deletions(-)

diff --git a/configure b/configure
index dd73cce..92b56d3 100755
--- a/configure
+++ b/configure
@@ -2108,13 +2108,15 @@ EOF
 elif
     cat > $TMPC <<EOF
 #include
 int main(void) {
+  xc_interface *xc = NULL;
   xenforeignmemory_handle *xfmem;
   xfmem = xenforeignmemory_open(0, 0);
   xenforeignmemory_map2(xfmem, 0, 0, 0, 0, 0, 0, 0);
-
+  xc_domain_update_msi_irq_remapping(xc, 0, 0, 0, 0, 0, 0);
   return 0;
 }
 EOF
diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c
index ff9a79f..5c5d15a 100644
--- a/hw/xen/xen_pt_msi.c
+++ b/hw/xen/xen_pt_msi.c
@@ -163,16 +163,23 @@ static int msi_msix_update(XenPCIPassthroughState *s,
     int rc = 0;
     uint64_t table_addr = 0;
 
-    XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x gflags %#x"
-               " (entry: %#x)\n",
-               is_msix ? "-X" : "", pirq, gvec, gflags, msix_entry);
-
     if (is_msix) {
         table_addr = s->msix->mmio_base_addr;
     }
 
-    rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec,
-                                  pirq, gflags, table_addr);
+    if (addr & MSI_ADDR_IF_MASK) {
+        XEN_PT_LOG(d, "Updating MSI%s with addr %#" PRIx64 " data %#x\n",
+                   is_msix ? "-X" : "", addr, data);
+        rc = xc_domain_update_msi_irq_remapping(xen_xc, xen_domid, pirq,
+                                                d->devfn, data, addr,
+                                                table_addr);
+    } else {
+        XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x gflags %#x"
+                   " (entry: %#x)\n",
+                   is_msix ? "-X" : "", pirq, gvec, gflags, msix_entry);
+
+        rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec,
+                                      pirq, gflags, table_addr);
+    }
 
     if (rc) {
         XEN_PT_ERR(d, "Updating of MSI%s failed. (err: %d)\n",
@@ -204,13 +211,30 @@ static int msi_msix_disable(XenPCIPassthroughState *s,
     }
 
     if (is_binded) {
-        XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n",
-                   is_msix ? "-X" : "", pirq, gvec);
-        rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec, pirq, gflags);
-        if (rc) {
-            XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, gvec: %#x)\n",
-                       is_msix ? "-X" : "", errno, pirq, gvec);
-            return rc;
+        if (addr & MSI_ADDR_IF_MASK) {
+            XEN_PT_LOG(d, "Unbinding of MSI%s . ( pirq: %d, data: %x, "
+                       "addr: %#" PRIx64 ")\n",
+                       is_msix ? "-X" : "", pirq, data, addr);
+            rc = xc_domain_unbind_msi_irq_remapping(xen_xc, xen_domid, pirq,
+                                                    d->devfn, data, addr);
+            if (rc) {
+                XEN_PT_ERR(d, "Unbinding of MSI%s . (error: %d, pirq: %d, "
+                           "data: %x, addr: %#" PRIx64 ")\n",
+                           is_msix ? "-X" : "", rc, pirq, data, addr);
+                return rc;
+            }
+
+        } else {
+            XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n",
+                       is_msix ? "-X" : "", pirq, gvec);
+            rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec,
+                                          pirq, gflags);
+            if (rc) {
+                XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, "
+                           "gvec: %#x)\n",
+                           is_msix ? "-X" : "", errno, pirq, gvec);
+                return rc;
+            }
         }
     }
 
diff --git a/include/hw/i386/apic-msidef.h b/include/hw/i386/apic-msidef.h
index 420b411..a2b52d9 100644
--- a/include/hw/i386/apic-msidef.h
+++ b/include/hw/i386/apic-msidef.h
@@ -27,5 +27,6 @@
 #define MSI_ADDR_DEST_ID_SHIFT          12
 #define MSI_ADDR_DEST_IDX_SHIFT         4
 #define MSI_ADDR_DEST_ID_MASK           0x000ff000
+#define MSI_ADDR_IF_MASK                0x0010
 
 #endif /* HW_APIC_MSIDEF_H */
diff --git a/include/hw/xen/xen_c
[Qemu-devel] [PATCH V2 3/3] msi: Handle remappable format interrupt request
From: Chao Gao

According to VT-d spec Interrupt Remapping and Interrupt Posting ->
Interrupt Remapping -> Interrupt Request Formats On Intel 64 Platforms,
fields of the MSI data register have changed. This patch avoids wrongly
regarding a remappable format interrupt request as an interrupt bound
with an event channel.

Signed-off-by: Chao Gao
Signed-off-by: Lan Tianyu
Acked-by: Anthony PERARD
---
 hw/i386/xen/xen-hvm.c | 8 +++++++-
 hw/pci/msi.c          | 5 +++--
 hw/pci/msix.c         | 4 +++-
 hw/xen/xen_pt_msi.c   | 2 +-
 include/hw/xen/xen.h  | 2 +-
 stubs/xen-hvm.c       | 2 +-
 6 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/hw/i386/xen/xen-hvm.c b/hw/i386/xen/xen-hvm.c
index d9ccd5d..cbbce73 100644
--- a/hw/i386/xen/xen-hvm.c
+++ b/hw/i386/xen/xen-hvm.c
@@ -145,8 +145,14 @@ void xen_piix_pci_write_config_client(uint32_t address, uint32_t val, int len)
     }
 }
 
-int xen_is_pirq_msi(uint32_t msi_data)
+int xen_is_pirq_msi(uint32_t msi_addr_lo, uint32_t msi_data)
 {
+    /* If the MSI address is configured in remapping format, the MSI will not
+     * be remapped into a pirq.
+     */
+    if (msi_addr_lo & MSI_ADDR_IF_MASK) {
+        return 0;
+    }
     /* If vector is 0, the msi is remapped into a pirq, passed as
      * dest_id.
      */
diff --git a/hw/pci/msi.c b/hw/pci/msi.c
index 5e05ce5..d05c876 100644
--- a/hw/pci/msi.c
+++ b/hw/pci/msi.c
@@ -289,7 +289,7 @@ void msi_reset(PCIDevice *dev)
 static bool msi_is_masked(const PCIDevice *dev, unsigned int vector)
 {
     uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
-    uint32_t mask, data;
+    uint32_t mask, data, addr_lo;
     bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
     assert(vector < PCI_MSI_VECTORS_MAX);
 
@@ -298,7 +298,8 @@ static bool msi_is_masked(const PCIDevice *dev, unsigned int vector)
     }
 
     data = pci_get_word(dev->config + msi_data_off(dev, msi64bit));
-    if (xen_is_pirq_msi(data)) {
+    addr_lo = pci_get_long(dev->config + msi_address_lo_off(dev));
+    if (xen_is_pirq_msi(addr_lo, data)) {
         return false;
     }
 
diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index 5078d3d..65cf00c 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -83,9 +83,11 @@ static bool msix_vector_masked(PCIDevice *dev, unsigned int vector, bool fmask)
 {
     unsigned offset = vector * PCI_MSIX_ENTRY_SIZE;
     uint8_t *data = &dev->msix_table[offset + PCI_MSIX_ENTRY_DATA];
+    uint8_t *addr_lo = &dev->msix_table[offset + PCI_MSIX_ENTRY_LOWER_ADDR];
     /* MSIs on Xen can be remapped into pirqs. In those cases, masking
      * and unmasking go through the PV evtchn path.
      */
-    if (xen_enabled() && xen_is_pirq_msi(pci_get_long(data))) {
+    if (xen_enabled() && xen_is_pirq_msi(pci_get_long(addr_lo),
+                                         pci_get_long(data))) {
         return false;
     }
     return fmask || dev->msix_table[offset + PCI_MSIX_ENTRY_VECTOR_CTRL] &
diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c
index 5c5d15a..82e13b2 100644
--- a/hw/xen/xen_pt_msi.c
+++ b/hw/xen/xen_pt_msi.c
@@ -114,7 +114,7 @@ static int msi_msix_setup(XenPCIPassthroughState *s,
 
     assert((!is_msix && msix_entry == 0) || is_msix);
 
-    if (xen_is_pirq_msi(data)) {
+    if (xen_is_pirq_msi(addr, data)) {
         *ppirq = msi_ext_dest_id(addr >> 32) | msi_dest_id(addr);
         if (!*ppirq) {
             /* this probably identifies an misconfiguration of the guest,
diff --git a/include/hw/xen/xen.h b/include/hw/xen/xen.h
index 7efcdaa..0d6c83e 100644
--- a/include/hw/xen/xen.h
+++ b/include/hw/xen/xen.h
@@ -34,7 +34,7 @@ int xen_pci_slot_get_pirq(PCIDevice *pci_dev, int irq_num);
 void xen_piix3_set_irq(void *opaque, int irq_num, int level);
 void xen_piix_pci_write_config_client(uint32_t address, uint32_t val, int len);
 void xen_hvm_inject_msi(uint64_t addr, uint32_t data);
-int xen_is_pirq_msi(uint32_t msi_data);
+int xen_is_pirq_msi(uint32_t msi_addr_lo, uint32_t msi_data);
 
 qemu_irq *xen_interrupt_controller_init(void);
 
diff --git a/stubs/xen-hvm.c b/stubs/xen-hvm.c
index 3ca6c51..aeb1592 100644
--- a/stubs/xen-hvm.c
+++ b/stubs/xen-hvm.c
@@ -31,7 +31,7 @@ void xen_hvm_inject_msi(uint64_t addr, uint32_t data)
 {
 }
 
-int xen_is_pirq_msi(uint32_t msi_data)
+int xen_is_pirq_msi(uint32_t msi_addr_lo, uint32_t msi_data)
 {
     return 0;
 }
--
1.8.3.1
[Qemu-devel] [PATCH v2 0/3] Qemu: Add Xen vIOMMU interrupt remapping function support
This patchset is to deal with MSI interrupt remapping requests when the
guest updates MSI registers.

Chao Gao (3):
  i386/msi: Correct mask of destination ID in MSI address
  xen-pt: bind/unbind interrupt remapping format MSI
  msi: Handle remappable format interrupt request

 configure                     |  4 +++-
 hw/i386/xen/xen-hvm.c         |  8 ++++++-
 hw/pci/msi.c                  |  5 +++--
 hw/pci/msix.c                 |  4 +++-
 hw/xen/xen_pt_msi.c           | 52 +++++++++++++++++++++++++++++++-----------
 include/hw/i386/apic-msidef.h |  3 ++-
 include/hw/xen/xen.h          |  2 +-
 include/hw/xen/xen_common.h   | 25 +++++++++++++++++++++
 stubs/xen-hvm.c               |  2 +-
 9 files changed, 83 insertions(+), 22 deletions(-)

--
1.8.3.1
[Qemu-devel] [PATCH V2 1/3] i386/msi: Correct mask of destination ID in MSI address
From: Chao Gao

According to SDM 10.11.1, only bits [19:12] of the MSI address are the
Destination ID. Change the mask to avoid ambiguity, since the VT-d spec
uses bit 4 to indicate a remappable interrupt request.

Signed-off-by: Chao Gao
Signed-off-by: Lan Tianyu
Reviewed-by: Anthony PERARD
Reviewed-by: Peter Xu
---
 include/hw/i386/apic-msidef.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/hw/i386/apic-msidef.h b/include/hw/i386/apic-msidef.h
index 8b4d4cc..420b411 100644
--- a/include/hw/i386/apic-msidef.h
+++ b/include/hw/i386/apic-msidef.h
@@ -26,6 +26,6 @@
 
 #define MSI_ADDR_DEST_ID_SHIFT          12
 #define MSI_ADDR_DEST_IDX_SHIFT         4
-#define MSI_ADDR_DEST_ID_MASK           0x000
+#define MSI_ADDR_DEST_ID_MASK           0x000ff000
 
 #endif /* HW_APIC_MSIDEF_H */
--
1.8.3.1
Re: [Qemu-devel] [PATCH 2/3] xen-pt: bind/unbind interrupt remapping format MSI
Hi Anthony:

On 2017-06-30 23:48, Anthony PERARD wrote:
> On Thu, Jun 29, 2017 at 01:49:53AM -0400, Lan Tianyu wrote:
>> From: Chao Gao
>>
>> If a vIOMMU is exposed to guest, guest will configure the msi to remapping
>> format. The original code isn't suitable to the new format. A new pair of
>> bind/unbind interfaces are added for this usage. This patch recognizes
>> this case and uses the new interfaces to bind/unbind msi.
>>
>> Signed-off-by: Chao Gao
>> Signed-off-by: Lan Tianyu
>
> Hi,
>
> The patch series is going to need to be rebased on top of QEMU upstream.
> For starters, configure has changed a bit.

Thanks for your reminder. Will do that.

>
>> ---
>>  configure                     | 54 +++++++++++++++++++++++++++++++++++++
>>  hw/xen/xen_pt_msi.c           | 50 ++++++++++++++++++++++++-----------
>>  include/hw/i386/apic-msidef.h |  1 +
>>  include/hw/xen/xen_common.h   | 25 ++++++++++++++++
>>  4 files changed, 117 insertions(+), 13 deletions(-)
>>
>> diff --git a/configure b/configure
>> index 476210b..b3ac49f 100755
>> --- a/configure
>> +++ b/configure
>> @@ -1982,6 +1982,60 @@ EOF
>>  /*
>>   * If we have stable libs the we don't want the libxc compat
>>   * layers, regardless of what CFLAGS we may have been given.
>> + */
>> +#undef XC_WANT_COMPAT_EVTCHN_API
>> +#undef XC_WANT_COMPAT_GNTTAB_API
>> +#undef XC_WANT_COMPAT_MAP_FOREIGN_API
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#if !defined(HVM_MAX_VCPUS)
>> +# error HVM_MAX_VCPUS not defined
>> +#endif
>> +int main(void) {
>> +  xc_interface *xc = NULL;
>> +  xenforeignmemory_handle *xfmem;
>> +  xenevtchn_handle *xe;
>> +  xengnttab_handle *xg;
>> +  xen_domain_handle_t handle;
>> +  xengnttab_grant_copy_segment_t* seg = NULL;
>> +
>> +  xs_daemon_open();
>> +
>> +  xc = xc_interface_open(0, 0, 0);
>> +  xc_hvm_set_mem_type(0, 0, HVMMEM_ram_ro, 0, 0);
>> +  xc_domain_add_to_physmap(0, 0, XENMAPSPACE_gmfn, 0, 0);
>> +  xc_hvm_inject_msi(xc, 0, 0xf000, 0x);
>> +  xc_hvm_create_ioreq_server(xc, 0, HVM_IOREQSRV_BUFIOREQ_ATOMIC, NULL);
>> +  xc_domain_create(xc, 0, handle, 0, NULL, NULL);
>> +
>> +  xfmem = xenforeignmemory_open(0, 0);
>> +  xenforeignmemory_map(xfmem, 0, 0, 0, 0, 0);
>> +
>> +  xe = xenevtchn_open(0, 0);
>> +  xenevtchn_fd(xe);
>> +
>> +  xg = xengnttab_open(0, 0);
>> +  xengnttab_grant_copy(xg, 0, seg);
>> +
>> +  xc_domain_update_msi_irq_remapping(xc, 0, 0, 0, 0, 0 ,0);
>> +
>> +  return 0;
>> +}
>> +EOF
>> +  compile_prog "" "$xen_libs $xen_stable_libs"
>> +then
>> +  xen_ctrl_version=4100
>> +  xen=yes
>
> There have been some change/refactoring in configure, so this won't
> work. The xen_ctrl_version got one more digit.
>
> Can you try this patch instead? It is also simpler.

Sure. Thanks.

> diff --git a/configure b/configure
> index c571ad14e5..a06f2c0b92 100755
> --- a/configure
> +++ b/configure
> @@ -2021,6 +2021,24 @@ EOF
>  # Xen unstable
>  elif
>      cat > $TMPC <<EOF
> +#include
> +int main(void) {
> +  xc_interface *xc = NULL;
> +
> +  xc_domain_update_msi_irq_remapping(xc, 0, 0, 0, 0, 0 ,0);
> +
> +  return 0;
> +}
> +EOF
> +    compile_prog "" "$xen_libs -lxendevicemodel $xen_stable_libs"
> +  then
> +  xen_stable_libs="-lxendevicemodel $xen_stable_libs"
> +  xen_ctrl_version=41000
> +  xen=yes
> +
> +# Xen 4.9
> +elif
> +cat > $TMPC <<EOF
>  #undef XC_WANT_COMPAT_DEVICEMODEL_API
>  #define __XEN_TOOLS__
>  #include
>
>> index 8e1580d..4ba43a8 100644
>> --- a/include/hw/xen/xen_common.h
>> +++ b/include/hw/xen/xen_common.h
>> @@ -438,4 +438,29 @@ static inline int xengnttab_grant_copy(xengnttab_handle *xgt, uint32_t count,
>>  }
>>  #endif
>>
>> +/* Xen before 4.10 */
>> +#if CONFIG_XEN_CTRL_INTERFACE_VERSION < 4100
>
> This will need to be
> CONFIG_XEN_CTRL_INTERFACE_VERSION < 41000

Will update.

--
Best regards
Tianyu Lan
[Qemu-devel] [PATCH 1/3] i386/msi: Correct mask of destination ID in MSI address
From: Chao Gao According to SDM 10.11.1, only bits [19:12] of the MSI address are the Destination ID. Change the mask to avoid ambiguity, since the VT-d spec uses bit 4 to indicate a remappable interrupt request. Signed-off-by: Chao Gao Signed-off-by: Lan Tianyu --- include/hw/i386/apic-msidef.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/hw/i386/apic-msidef.h b/include/hw/i386/apic-msidef.h index 8b4d4cc..420b411 100644 --- a/include/hw/i386/apic-msidef.h +++ b/include/hw/i386/apic-msidef.h @@ -26,6 +26,6 @@ #define MSI_ADDR_DEST_ID_SHIFT 12 #define MSI_ADDR_DEST_IDX_SHIFT 4 -#define MSI_ADDR_DEST_ID_MASK 0x000 +#define MSI_ADDR_DEST_ID_MASK 0x000ff000 #endif /* HW_APIC_MSIDEF_H */ -- 1.8.3.1
[Qemu-devel] [PATCH 3/3] msi: Handle remappable format interrupt request
From: Chao Gao According to the VT-d spec (Interrupt Remapping and Interrupt Posting -> Interrupt Remapping -> Interrupt Request Formats on Intel 64 Platforms), the fields of the MSI data register have changed. This patch avoids wrongly treating a remappable format interrupt request as an interrupt bound to an event channel. Signed-off-by: Chao Gao Signed-off-by: Lan Tianyu --- hw/pci/msi.c | 5 +++-- hw/pci/msix.c| 4 +++- hw/xen/xen_pt_msi.c | 2 +- include/hw/xen/xen.h | 2 +- xen-hvm-stub.c | 2 +- xen-hvm.c| 8 +++1 6 files changed, 16 insertions(+), 7 deletions(-) diff --git a/hw/pci/msi.c b/hw/pci/msi.c index a87b227..3713ea6 100644 --- a/hw/pci/msi.c +++ b/hw/pci/msi.c @@ -289,7 +289,7 @@ void msi_reset(PCIDevice *dev) static bool msi_is_masked(const PCIDevice *dev, unsigned int vector) { uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev)); -uint32_t mask, data; +uint32_t mask, data, addr_lo; bool msi64bit = flags & PCI_MSI_FLAGS_64BIT; assert(vector < PCI_MSI_VECTORS_MAX); @@ -298,7 +298,8 @@ static bool msi_is_masked(const PCIDevice *dev, unsigned int vector) } data = pci_get_word(dev->config + msi_data_off(dev, msi64bit)); -if (xen_is_pirq_msi(data)) { +addr_lo = pci_get_long(dev->config + msi_address_lo_off(dev)); +if (xen_is_pirq_msi(addr_lo, data)) { return false; } diff --git a/hw/pci/msix.c b/hw/pci/msix.c index 0ec1cb1..6dda83c 100644 --- a/hw/pci/msix.c +++ b/hw/pci/msix.c @@ -81,9 +81,11 @@ static bool msix_vector_masked(PCIDevice *dev, unsigned int vector, bool fmask) { unsigned offset = vector * PCI_MSIX_ENTRY_SIZE; uint8_t *data = &dev->msix_table[offset + PCI_MSIX_ENTRY_DATA]; +uint8_t *addr_lo = &dev->msix_table[offset + PCI_MSIX_ENTRY_LOWER_ADDR]; /* MSIs on Xen can be remapped into pirqs. In those cases, masking * and unmasking go through the PV evtchn path. 
*/ -if (xen_enabled() && xen_is_pirq_msi(pci_get_long(data))) { +if (xen_enabled() && xen_is_pirq_msi(pci_get_long(addr_lo), + pci_get_long(data))) { return false; } return fmask || dev->msix_table[offset + PCI_MSIX_ENTRY_VECTOR_CTRL] & diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c index 26a69d9..68da0e4 100644 --- a/hw/xen/xen_pt_msi.c +++ b/hw/xen/xen_pt_msi.c @@ -114,7 +114,7 @@ static int msi_msix_setup(XenPCIPassthroughState *s, assert((!is_msix && msix_entry == 0) || is_msix); -if (xen_is_pirq_msi(data)) { +if (xen_is_pirq_msi(addr, data)) { *ppirq = msi_ext_dest_id(addr >> 32) | msi_dest_id(addr); if (!*ppirq) { /* this probably identifies an misconfiguration of the guest, diff --git a/include/hw/xen/xen.h b/include/hw/xen/xen.h index a8f3afb..b1ef41e 100644 --- a/include/hw/xen/xen.h +++ b/include/hw/xen/xen.h @@ -33,7 +33,7 @@ int xen_pci_slot_get_pirq(PCIDevice *pci_dev, int irq_num); void xen_piix3_set_irq(void *opaque, int irq_num, int level); void xen_piix_pci_write_config_client(uint32_t address, uint32_t val, int len); void xen_hvm_inject_msi(uint64_t addr, uint32_t data); -int xen_is_pirq_msi(uint32_t msi_data); +int xen_is_pirq_msi(uint32_t msi_addr_lo, uint32_t msi_data); qemu_irq *xen_interrupt_controller_init(void); diff --git a/xen-hvm-stub.c b/xen-hvm-stub.c index c500325..a3a3bb3 100644 --- a/xen-hvm-stub.c +++ b/xen-hvm-stub.c @@ -31,7 +31,7 @@ void xen_hvm_inject_msi(uint64_t addr, uint32_t data) { } -int xen_is_pirq_msi(uint32_t msi_data) +int xen_is_pirq_msi(uint32_t msi_addr_lo, uint32_t msi_data) { return 0; } diff --git a/xen-hvm.c b/xen-hvm.c index 0892361..9f1a7bd 100644 --- a/xen-hvm.c +++ b/xen-hvm.c @@ -146,8 +146,14 @@ void xen_piix_pci_write_config_client(uint32_t address, uint32_t val, int len) } } -int xen_is_pirq_msi(uint32_t msi_data) +int xen_is_pirq_msi(uint32_t msi_addr_lo, uint32_t msi_data) { +/* If the MSI address is configured in remapping format, the MSI will not + * be remapped into a pirq. 
+ */ +if (msi_addr_lo & MSI_ADDR_IF_MASK) { +return 0; +} /* If vector is 0, the msi is remapped into a pirq, passed as * dest_id. */ -- 1.8.3.1
[Qemu-devel] [PATCH 2/3] xen-pt: bind/unbind interrupt remapping format MSI
From: Chao Gao If a vIOMMU is exposed to guest, guest will configure the msi to remapping format. The original code isn't suitable to the new format. A new pair bind/unbind interfaces are added for this usage. This patch recognizes this case and uses new interfaces to bind/unbind msi. Signed-off-by: Chao Gao Signed-off-by: Lan Tianyu --- configure | 54 +++ hw/xen/xen_pt_msi.c | 50 --- include/hw/i386/apic-msidef.h | 1 + include/hw/xen/xen_common.h | 25 4 files changed, 117 insertions(+), 13 deletions(-) diff --git a/configure b/configure index 476210b..b3ac49f 100755 --- a/configure +++ b/configure @@ -1982,6 +1982,60 @@ EOF /* * If we have stable libs the we don't want the libxc compat * layers, regardless of what CFLAGS we may have been given. + */ +#undef XC_WANT_COMPAT_EVTCHN_API +#undef XC_WANT_COMPAT_GNTTAB_API +#undef XC_WANT_COMPAT_MAP_FOREIGN_API +#include +#include +#include +#include +#include +#include +#include +#if !defined(HVM_MAX_VCPUS) +# error HVM_MAX_VCPUS not defined +#endif +int main(void) { + xc_interface *xc = NULL; + xenforeignmemory_handle *xfmem; + xenevtchn_handle *xe; + xengnttab_handle *xg; + xen_domain_handle_t handle; + xengnttab_grant_copy_segment_t* seg = NULL; + + xs_daemon_open(); + + xc = xc_interface_open(0, 0, 0); + xc_hvm_set_mem_type(0, 0, HVMMEM_ram_ro, 0, 0); + xc_domain_add_to_physmap(0, 0, XENMAPSPACE_gmfn, 0, 0); + xc_hvm_inject_msi(xc, 0, 0xf000, 0x); + xc_hvm_create_ioreq_server(xc, 0, HVM_IOREQSRV_BUFIOREQ_ATOMIC, NULL); + xc_domain_create(xc, 0, handle, 0, NULL, NULL); + + xfmem = xenforeignmemory_open(0, 0); + xenforeignmemory_map(xfmem, 0, 0, 0, 0, 0); + + xe = xenevtchn_open(0, 0); + xenevtchn_fd(xe); + + xg = xengnttab_open(0, 0); + xengnttab_grant_copy(xg, 0, seg); + + xc_domain_update_msi_irq_remapping(xc, 0, 0, 0, 0, 0 ,0); + + return 0; +} +EOF + compile_prog "" "$xen_libs $xen_stable_libs" +then +xen_ctrl_version=4100 +xen=yes + elif + cat > $TMPC <msix->mmio_base_addr; } -rc = 
xc_domain_update_msi_irq(xen_xc, xen_domid, gvec, - pirq, gflags, table_addr); +if (addr & MSI_ADDR_IF_MASK) { +XEN_PT_LOG(d, "Updating MSI%s with addr %#" PRIx64 " data %#x\n", + is_msix ? "-X" : "", addr, data); +rc = xc_domain_update_msi_irq_remapping(xen_xc, xen_domid, pirq, + d->devfn, data, addr, table_addr); +} else { +XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x gflags %#x" + " (entry: %#x)\n", + is_msix ? "-X" : "", pirq, gvec, gflags, msix_entry); + +rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec, + pirq, gflags, table_addr); +} if (rc) { XEN_PT_ERR(d, "Updating of MSI%s failed. (err: %d)\n", @@ -204,13 +211,30 @@ static int msi_msix_disable(XenPCIPassthroughState *s, } if (is_binded) { -XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n", - is_msix ? "-X" : "", pirq, gvec); -rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec, pirq, gflags); -if (rc) { -XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, gvec: %#x)\n", - is_msix ? "-X" : "", errno, pirq, gvec); -return rc; +if (addr & MSI_ADDR_IF_MASK) { +XEN_PT_LOG(d, "Unbinding of MSI%s . ( pirq: %d, data: %x, " + "addr: %#" PRIx64 ")\n", + is_msix ? "-X" : "", pirq, data, addr); +rc = xc_domain_unbind_msi_irq_remapping(xen_xc, xen_domid, pirq, +d->devfn, data, addr); +if (rc) { +XEN_PT_ERR(d, "Unbinding of MSI%s . (error: %d, pirq: %d, " + "data: %x, addr: %#" PRIx64 ")\n", + is_msix ? "-X" : "", rc, pirq, data, addr); +return rc; +} + +} else { +XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n", + is_msix ? "-X" : "", pirq, gvec); +rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec, + pirq, gflags); +if (rc) { +XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, " + "gvec: %#x)\n", +
[Qemu-devel] [PATCH 0/3] Qemu: Add Xen vIOMMU interrupt remapping function support
This patchset is to deal with MSI interrupt remapping request when guest updates MSI registers. Chao Gao (3): i386/msi: Correct mask of destination ID in MSI address xen-pt: bind/unbind interrupt remapping format MSI msi: Handle remappable format interrupt request configure | 54 +++ hw/pci/msi.c | 5 ++-- hw/pci/msix.c | 4 +++- hw/xen/xen_pt_msi.c | 52 ++--- include/hw/i386/apic-msidef.h | 3 ++- include/hw/xen/xen.h | 2 +- include/hw/xen/xen_common.h | 25 xen-hvm-stub.c| 2 +- xen-hvm.c | 8 ++- 9 files changed, 134 insertions(+), 21 deletions(-) -- 1.8.3.1
Re: [Qemu-devel] [Xen-devel] [RFC PATCH V2 1/2] xen-pt: bind/unbind interrupt remapping format MSI
On 2017年05月24日 01:06, Anthony PERARD wrote: > On Tue, May 23, 2017 at 08:16:25PM +0800, Lan Tianyu wrote: >> On 2017年05月19日 20:04, Jan Beulich wrote: >>>>>> On 19.05.17 at 13:16, wrote: >>>> On Thu, May 18, 2017 at 01:32:59AM -0400, Lan Tianyu wrote: >>>>> --- a/include/hw/i386/apic-msidef.h >>>>> +++ b/include/hw/i386/apic-msidef.h >>>>> @@ -26,6 +26,7 @@ >>>>> >>>>> #define MSI_ADDR_DEST_ID_SHIFT 12 >>>>> #define MSI_ADDR_DEST_IDX_SHIFT 4 >>>>> -#define MSI_ADDR_DEST_ID_MASK 0x000 >>>>> +#define MSI_ADDR_DEST_ID_MASK 0x000fff00 >>>> The value of MSI_ADDR_DEST_ID_MASK is changed here. I think the patch >>>> should be: >>>> +#define MSI_ADDR_DEST_ID_MASK 0x0000 >>> Judging from other sources, rather the other way around - the >>> mask needs to have further bits removed (should be 0x000ff000 >>> afaict). Xen sources confirm this, and while Linux has the value >>> you suggest, that contradicts >> Agree. Defining the mask as "0x000ff000" makes more sense. >> Just check Qemu source code. Only apic_send_msi() and msi_dest_id() use >> the mask >> to get dest apic id. They mask MSI address field with >> MSI_ADDR_DEST_ID_MASK and >> then right-shift 12bit. The low 12bit won't be used. >> >> Anthony, does this make sense? > Yes, it does. > The change to MSI_ADDR_DEST_ID_MASK should probably go in its own patch. > OK. Will update. -- Best regards Tianyu Lan
Re: [Qemu-devel] [Xen-devel] [RFC PATCH V2 1/2] xen-pt: bind/unbind interrupt remapping format MSI
On 2017年05月19日 20:04, Jan Beulich wrote: >>>> On 19.05.17 at 13:16, wrote: >> On Thu, May 18, 2017 at 01:32:59AM -0400, Lan Tianyu wrote: >>> --- a/include/hw/i386/apic-msidef.h >>> +++ b/include/hw/i386/apic-msidef.h >>> @@ -26,6 +26,7 @@ >>> >>> #define MSI_ADDR_DEST_ID_SHIFT 12 >>> #define MSI_ADDR_DEST_IDX_SHIFT 4 >>> -#define MSI_ADDR_DEST_ID_MASK 0x000 >>> +#define MSI_ADDR_DEST_ID_MASK 0x000fff00 >> The value of MSI_ADDR_DEST_ID_MASK is changed here. I think the patch >> should be: >> +#define MSI_ADDR_DEST_ID_MASK 0x0000 > Judging from other sources, rather the other way around - the > mask needs to have further bits removed (should be 0x000ff000 > afaict). Xen sources confirm this, and while Linux has the value > you suggest, that contradicts Agreed. Defining the mask as "0x000ff000" makes more sense. I just checked the QEMU source code: only apic_send_msi() and msi_dest_id() use the mask to get the destination APIC ID. They mask the MSI address field with MSI_ADDR_DEST_ID_MASK and then right-shift by 12 bits, so the low 12 bits are never used. Anthony, does this make sense? -- Best regards Tianyu Lan
Re: [Qemu-devel] [RFC PATCH V2 2/2] msi: Handle remappable format interrupt request
Hi Anthony: Thanks for your review. On 2017年05月19日 19:57, Anthony PERARD wrote: > On Thu, May 18, 2017 at 01:33:00AM -0400, Lan Tianyu wrote: >> From: Chao Gao >> >> According to VT-d spec Interrupt Remapping and Interrupt Posting -> >> Interrupt Remapping -> Interrupt Request Formats On Intel 64 >> Platforms, fields of MSI data register have changed. This patch >> avoids wrongly regarding a remappable format interrupt request as >> an interrupt binded with an event channel. >> >> Signed-off-by: Chao Gao >> Signed-off-by: Lan Tianyu >> --- >> hw/pci/msi.c | 5 +++-- >> hw/pci/msix.c| 4 +++- >> hw/xen/xen_pt_msi.c | 2 +- >> include/hw/xen/xen.h | 2 +- >> xen-hvm-stub.c | 2 +- >> xen-hvm.c| 7 ++- >> 6 files changed, 15 insertions(+), 7 deletions(-) >> >> diff --git a/hw/pci/msi.c b/hw/pci/msi.c >> index a87b227..199cb47 100644 >> --- a/hw/pci/msi.c >> +++ b/hw/pci/msi.c >> @@ -289,7 +289,7 @@ void msi_reset(PCIDevice *dev) >> static bool msi_is_masked(const PCIDevice *dev, unsigned int vector) >> { >> uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev)); >> -uint32_t mask, data; >> +uint32_t mask, data, addr_lo; >> bool msi64bit = flags & PCI_MSI_FLAGS_64BIT; >> assert(vector < PCI_MSI_VECTORS_MAX); >> >> @@ -298,7 +298,8 @@ static bool msi_is_masked(const PCIDevice *dev, unsigned >> int vector) >> } >> >> data = pci_get_word(dev->config + msi_data_off(dev, msi64bit)); >> -if (xen_is_pirq_msi(data)) { >> +addr_lo = pci_get_long(dev->config + msi_address_lo_off(dev)); >> +if (xen_is_pirq_msi(data, addr_lo)) { >> return false; >> } >> >> diff --git a/hw/pci/msix.c b/hw/pci/msix.c >> index bb54e8b..efe2982 100644 >> --- a/hw/pci/msix.c >> +++ b/hw/pci/msix.c >> @@ -82,9 +82,11 @@ static bool msix_vector_masked(PCIDevice *dev, unsigned >> int vector, bool fmask) >> { >> unsigned offset = vector * PCI_MSIX_ENTRY_SIZE; >> uint8_t *data = &dev->msix_table[offset + PCI_MSIX_ENTRY_DATA]; >> +uint8_t *addr_lo = &dev->msix_table[offset + PCI_MSIX_ENTRY_LOWER_ADDR]; >> 
/* MSIs on Xen can be remapped into pirqs. In those cases, masking >> * and unmasking go through the PV evtchn path. */ >> -if (xen_enabled() && xen_is_pirq_msi(pci_get_long(data))) { >> +if (xen_enabled() && xen_is_pirq_msi(pci_get_long(data), >> + pci_get_long(addr_lo))) { >> return false; >> } >> return fmask || dev->msix_table[offset + PCI_MSIX_ENTRY_VECTOR_CTRL] & >> diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c >> index 5fab95e..45a9e9f 100644 >> --- a/hw/xen/xen_pt_msi.c >> +++ b/hw/xen/xen_pt_msi.c >> @@ -114,7 +114,7 @@ static int msi_msix_setup(XenPCIPassthroughState *s, >> >> assert((!is_msix && msix_entry == 0) || is_msix); >> >> -if (xen_is_pirq_msi(data)) { >> +if (xen_is_pirq_msi(data, addr)) { >> *ppirq = msi_ext_dest_id(addr >> 32) | msi_dest_id(addr); >> if (!*ppirq) { >> /* this probably identifies an misconfiguration of the guest, >> diff --git a/include/hw/xen/xen.h b/include/hw/xen/xen.h >> index 09c2ce5..af759bc 100644 >> --- a/include/hw/xen/xen.h >> +++ b/include/hw/xen/xen.h >> @@ -33,7 +33,7 @@ int xen_pci_slot_get_pirq(PCIDevice *pci_dev, int irq_num); >> void xen_piix3_set_irq(void *opaque, int irq_num, int level); >> void xen_piix_pci_write_config_client(uint32_t address, uint32_t val, int >> len); >> void xen_hvm_inject_msi(uint64_t addr, uint32_t data); >> -int xen_is_pirq_msi(uint32_t msi_data); >> +int xen_is_pirq_msi(uint32_t msi_data, uint32_t msi_addr_lo); > > Maybe inverting the arguments would be better, so the arguments would be > the address first, then the data, like I think it is often the case. > What do you think? That make sense. Will update. 
> >> >> qemu_irq *xen_interrupt_controller_init(void); >> >> diff --git a/xen-hvm-stub.c b/xen-hvm-stub.c >> index c500325..dae421c 100644 >> --- a/xen-hvm-stub.c >> +++ b/xen-hvm-stub.c >> @@ -31,7 +31,7 @@ void xen_hvm_inject_msi(uint64_t addr, uint32_t data) >> { >> } >> >> -int xen_is_pirq_msi(uint32_t msi_data) >> +int xen_is_pirq_msi(uint32_t msi_data, uint32_t msi_addr_lo) >> { >
[Qemu-devel] [RFC PATCH V2 1/2] xen-pt: bind/unbind interrupt remapping format MSI
From: Chao Gao If a vIOMMU is exposed to guest, guest will configure the msi to remapping format. The original code isn't suitable to the new format. A new pair bind/unbind interfaces are added for this usage. This patch recognizes this case and use new interfaces to bind/unbind msi. Signed-off-by: Chao Gao Signed-off-by: Lan Tianyu --- hw/xen/xen_pt_msi.c | 50 --- include/hw/i386/apic-msidef.h | 3 ++- 2 files changed, 39 insertions(+), 14 deletions(-) diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c index 62add06..5fab95e 100644 --- a/hw/xen/xen_pt_msi.c +++ b/hw/xen/xen_pt_msi.c @@ -163,16 +163,24 @@ static int msi_msix_update(XenPCIPassthroughState *s, int rc = 0; uint64_t table_addr = 0; -XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x gflags %#x" - " (entry: %#x)\n", - is_msix ? "-X" : "", pirq, gvec, gflags, msix_entry); - if (is_msix) { table_addr = s->msix->mmio_base_addr; } -rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec, - pirq, gflags, table_addr); +if (addr & MSI_ADDR_IF_MASK) { +XEN_PT_LOG(d, "Updating MSI%s with addr %#" PRIx64 "data %#x\n", + is_msix ? "-X": "", addr, data); +rc = xc_domain_update_msi_irq_remapping(xen_xc, xen_domid, pirq, + d->devfn, data, addr, table_addr); +} +else { +XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x gflags %#x" + " (entry: %#x)\n", + is_msix ? "-X" : "", pirq, gvec, gflags, msix_entry); + +rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec, + pirq, gflags, table_addr); +} if (rc) { XEN_PT_ERR(d, "Updating of MSI%s failed. (err: %d)\n", @@ -204,13 +212,29 @@ static int msi_msix_disable(XenPCIPassthroughState *s, } if (is_binded) { -XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n", - is_msix ? "-X" : "", pirq, gvec); -rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec, pirq, gflags); -if (rc) { -XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, gvec: %#x)\n", - is_msix ? 
"-X" : "", errno, pirq, gvec); -return rc; +if ( addr & MSI_ADDR_IF_MASK ) { +XEN_PT_LOG(d, "Unbinding of MSI%s . ( pirq: %d, data: %x, " + "addr: %#" PRIx64 ")\n", + is_msix ? "-X" : "", pirq, data, addr); +rc = xc_domain_unbind_msi_irq_remapping(xen_xc, xen_domid, pirq, +d->devfn, data, addr); +if (rc) { +XEN_PT_ERR(d, "Unbinding of MSI%s . (error: %d, pirq: %d, " + "data: %x, addr: %#" PRIx64 ")\n", + is_msix ? "-X" : "", rc, pirq, data, addr); +return rc; +} + +} else { +XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n", + is_msix ? "-X" : "", pirq, gvec); +rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec, pirq, gflags); +if (rc) { +XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, " + "gvec: %#x)\n", + is_msix ? "-X" : "", errno, pirq, gvec); +return rc; +} } } diff --git a/include/hw/i386/apic-msidef.h b/include/hw/i386/apic-msidef.h index 8b4d4cc..2c450f9 100644 --- a/include/hw/i386/apic-msidef.h +++ b/include/hw/i386/apic-msidef.h @@ -26,6 +26,7 @@ #define MSI_ADDR_DEST_ID_SHIFT 12 #define MSI_ADDR_DEST_IDX_SHIFT 4 -#define MSI_ADDR_DEST_ID_MASK 0x000 +#define MSI_ADDR_DEST_ID_MASK 0x000fff00 +#define MSI_ADDR_IF_MASK 0x0010 #endif /* HW_APIC_MSIDEF_H */ -- 1.8.3.1
[Qemu-devel] [RFC PATCH V2 0/2] Qemu: Add Xen vIOMMU interrupt remapping function.
Change since V1: 1) Move create/destroy vIOMMU and query vIOMMU capabilities to tool stack. 2) Fix some code style issue. This patchset is to deal with MSI interrupt remapping request when guest updates MSI registers. Repo: https://github.com/lantianyu/qemu/tree/xen_viommu_rfc_v2 Chao Gao (2): xen-pt: bind/unbind interrupt remapping format MSI msi: Handle remappable format interrupt request hw/pci/msi.c | 5 +++-- hw/pci/msix.c | 4 +++- hw/xen/xen_pt_msi.c | 52 +++ include/hw/i386/apic-msidef.h | 3 ++- include/hw/xen/xen.h | 2 +- xen-hvm-stub.c| 2 +- xen-hvm.c | 7 +- 7 files changed, 54 insertions(+), 21 deletions(-) -- 1.8.3.1
[Qemu-devel] [RFC PATCH V2 2/2] msi: Handle remappable format interrupt request
From: Chao Gao According to VT-d spec Interrupt Remapping and Interrupt Posting -> Interrupt Remapping -> Interrupt Request Formats On Intel 64 Platforms, fields of MSI data register have changed. This patch avoids wrongly regarding a remappable format interrupt request as an interrupt binded with an event channel. Signed-off-by: Chao Gao Signed-off-by: Lan Tianyu --- hw/pci/msi.c | 5 +++-- hw/pci/msix.c| 4 +++- hw/xen/xen_pt_msi.c | 2 +- include/hw/xen/xen.h | 2 +- xen-hvm-stub.c | 2 +- xen-hvm.c| 7 ++- 6 files changed, 15 insertions(+), 7 deletions(-) diff --git a/hw/pci/msi.c b/hw/pci/msi.c index a87b227..199cb47 100644 --- a/hw/pci/msi.c +++ b/hw/pci/msi.c @@ -289,7 +289,7 @@ void msi_reset(PCIDevice *dev) static bool msi_is_masked(const PCIDevice *dev, unsigned int vector) { uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev)); -uint32_t mask, data; +uint32_t mask, data, addr_lo; bool msi64bit = flags & PCI_MSI_FLAGS_64BIT; assert(vector < PCI_MSI_VECTORS_MAX); @@ -298,7 +298,8 @@ static bool msi_is_masked(const PCIDevice *dev, unsigned int vector) } data = pci_get_word(dev->config + msi_data_off(dev, msi64bit)); -if (xen_is_pirq_msi(data)) { +addr_lo = pci_get_long(dev->config + msi_address_lo_off(dev)); +if (xen_is_pirq_msi(data, addr_lo)) { return false; } diff --git a/hw/pci/msix.c b/hw/pci/msix.c index bb54e8b..efe2982 100644 --- a/hw/pci/msix.c +++ b/hw/pci/msix.c @@ -82,9 +82,11 @@ static bool msix_vector_masked(PCIDevice *dev, unsigned int vector, bool fmask) { unsigned offset = vector * PCI_MSIX_ENTRY_SIZE; uint8_t *data = &dev->msix_table[offset + PCI_MSIX_ENTRY_DATA]; +uint8_t *addr_lo = &dev->msix_table[offset + PCI_MSIX_ENTRY_LOWER_ADDR]; /* MSIs on Xen can be remapped into pirqs. In those cases, masking * and unmasking go through the PV evtchn path. 
*/ -if (xen_enabled() && xen_is_pirq_msi(pci_get_long(data))) { +if (xen_enabled() && xen_is_pirq_msi(pci_get_long(data), + pci_get_long(addr_lo))) { return false; } return fmask || dev->msix_table[offset + PCI_MSIX_ENTRY_VECTOR_CTRL] & diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c index 5fab95e..45a9e9f 100644 --- a/hw/xen/xen_pt_msi.c +++ b/hw/xen/xen_pt_msi.c @@ -114,7 +114,7 @@ static int msi_msix_setup(XenPCIPassthroughState *s, assert((!is_msix && msix_entry == 0) || is_msix); -if (xen_is_pirq_msi(data)) { +if (xen_is_pirq_msi(data, addr)) { *ppirq = msi_ext_dest_id(addr >> 32) | msi_dest_id(addr); if (!*ppirq) { /* this probably identifies an misconfiguration of the guest, diff --git a/include/hw/xen/xen.h b/include/hw/xen/xen.h index 09c2ce5..af759bc 100644 --- a/include/hw/xen/xen.h +++ b/include/hw/xen/xen.h @@ -33,7 +33,7 @@ int xen_pci_slot_get_pirq(PCIDevice *pci_dev, int irq_num); void xen_piix3_set_irq(void *opaque, int irq_num, int level); void xen_piix_pci_write_config_client(uint32_t address, uint32_t val, int len); void xen_hvm_inject_msi(uint64_t addr, uint32_t data); -int xen_is_pirq_msi(uint32_t msi_data); +int xen_is_pirq_msi(uint32_t msi_data, uint32_t msi_addr_lo); qemu_irq *xen_interrupt_controller_init(void); diff --git a/xen-hvm-stub.c b/xen-hvm-stub.c index c500325..dae421c 100644 --- a/xen-hvm-stub.c +++ b/xen-hvm-stub.c @@ -31,7 +31,7 @@ void xen_hvm_inject_msi(uint64_t addr, uint32_t data) { } -int xen_is_pirq_msi(uint32_t msi_data) +int xen_is_pirq_msi(uint32_t msi_data, uint32_t msi_addr_lo) { return 0; } diff --git a/xen-hvm.c b/xen-hvm.c index 5043beb..db29121 100644 --- a/xen-hvm.c +++ b/xen-hvm.c @@ -146,8 +146,13 @@ void xen_piix_pci_write_config_client(uint32_t address, uint32_t val, int len) } } -int xen_is_pirq_msi(uint32_t msi_data) +int xen_is_pirq_msi(uint32_t msi_data, uint32_t msi_addr_lo) { +/* If msi address is configurate to remapping format, the msi will not + * remapped into a pirq. 
+ */ +if (msi_addr_lo & MSI_ADDR_IF_MASK) +return 0; /* If vector is 0, the msi is remapped into a pirq, passed as * dest_id. */ -- 1.8.3.1
Re: [Qemu-devel] [RFC PATCH 14/20] intel_iommu: add FOR_EACH_ASSIGN_DEVICE macro
On 2017年04月26日 18:06, Liu, Yi L wrote: > Add FOR_EACH_ASSIGN_DEVICE. It would be used to loop all assigned > devices when processing guest pasid table linking and iommu cache > invalidate propagation. > > Signed-off-by: Liu, Yi L > --- > hw/i386/intel_iommu.c | 32 > hw/i386/intel_iommu_internal.h | 11 +++ > 2 files changed, 43 insertions(+) > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c > index 0c412d2..f291995 100644 > --- a/hw/i386/intel_iommu.c > +++ b/hw/i386/intel_iommu.c > @@ -55,6 +55,38 @@ static int vtd_dbgflags = VTD_DBGBIT(GENERAL) | > VTD_DBGBIT(CSR); > #define VTD_DPRINTF(what, fmt, ...) do {} while (0) > #endif > > +#define FOR_EACH_ASSIGN_DEVICE(__notify_info_type, \ > + __opaque_type, \ > + __hook_info, \ > + __hook_fn) \ > +do { \ > +IntelIOMMUNotifierNode *node; \ > +VTDNotifierIterator iterator; \ > +int ret = 0; \ > +__notify_info_type *notify_info; \ > +__opaque_type *opaq; \ > +int argsz; \ > +argsz = sizeof(*notify_info) + sizeof(*opaq); \ > +notify_info = g_malloc0(argsz); \ > +QLIST_FOREACH(node, &(s->notifiers_list), next) { \ > +VTDAddressSpace *vtd_as = node->vtd_as; \ > +VTDContextEntry ce[2]; \ > +iterator.bus = pci_bus_num(vtd_as->bus); \ > +ret = vtd_dev_to_context_entry(s, iterator.bus, \ > + vtd_as->devfn, &ce[0]); \ > +if (ret != 0) { \ > +continue; \ > +} \ > +iterator.sid = vtd_make_source_id(iterator.bus, vtd_as->devfn); \ > +iterator.did = VTD_CONTEXT_ENTRY_DID(ce[0].hi); \ > +iterator.host_sid = node->host_sid; \ > +iterator.vtd_as = vtd_as; \ > +iterator.ce = &ce[0]; \ > +__hook_fn(&iterator, __hook_info, notify_info); \ > +} \ > +g_free(notify_info); \ > +} while (0) > + > static void vtd_define_quad(IntelIOMMUState *s, hwaddr addr, uint64_t val, > uint64_t wmask, uint64_t w1cmask) > { > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h > index f2a7d12..5178398 100644 > --- a/hw/i386/intel_iommu_internal.h > +++ b/hw/i386/intel_iommu_internal.h > @@ -439,6 +439,17 @@ typedef struct 
VTDRootEntry VTDRootEntry; > #define VTD_EXT_CONTEXT_TT_NO_DEV_IOTLB (4ULL << 2) > #define VTD_EXT_CONTEXT_TT_DEV_IOTLB (5ULL << 2) > > +struct VTDNotifierIterator { > +VTDAddressSpace *vtd_as; > +VTDContextEntry *ce; > +uint16_t host_sid; > +uint16_t sid; > +uint16_t did; > +uint8_t bus; The "bus" seems to be redundant. It is already contained in the "sid", right? > +}; > + > +typedef struct VTDNotifierIterator VTDNotifierIterator; > + > /* Paging Structure common */ > #define VTD_SL_PT_PAGE_SIZE_MASK(1ULL << 7) > /* Bits to decide the offset for each level */ > -- Best regards Tianyu Lan
Re: [Qemu-devel] [RFC PATCH 09/20] Memory: introduce iommu_ops->record_device
On 2017年04月26日 18:06, Liu, Yi L wrote: > With vIOMMU exposed to guest, vIOMMU emulator needs to do translation > between host and guest. e.g. a device-selective TLB flush, vIOMMU > emulator needs to replace guest SID with host SID so that to limit > the invalidation. This patch introduces a new callback > iommu_ops->record_device() to notify vIOMMU emulator to record necessary > information about the assigned device. This patch is to prepare to translate guest sbdf to host sbdf. Alex: Could we add a new vfio API to do such translation? This will be more straight forward than storing host sbdf in the vIOMMU device model. > > Signed-off-by: Liu, Yi L > --- > include/exec/memory.h | 11 +++ > memory.c | 12 > 2 files changed, 23 insertions(+) > > diff --git a/include/exec/memory.h b/include/exec/memory.h > index 7bd13ab..49087ef 100644 > --- a/include/exec/memory.h > +++ b/include/exec/memory.h > @@ -203,6 +203,8 @@ struct MemoryRegionIOMMUOps { > IOMMUNotifierFlag new_flags); > /* Set this up to provide customized IOMMU replay function */ > void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier); > +void (*record_device)(MemoryRegion *iommu, > + void *device_info); > }; > > typedef struct CoalescedMemoryRange CoalescedMemoryRange; > @@ -708,6 +710,15 @@ void memory_region_notify_iommu(MemoryRegion *mr, > void memory_region_notify_one(IOMMUNotifier *notifier, >IOMMUTLBEntry *entry); > > +/* > + * memory_region_notify_device_record: notify IOMMU to record assign > + * device. > + * @mr: the memory region to notify > + * @ device_info: device information > + */ > +void memory_region_notify_device_record(MemoryRegion *mr, > +void *info); > + > /** > * memory_region_register_iommu_notifier: register a notifier for changes to > * IOMMU translation entries. 
> diff --git a/memory.c b/memory.c > index 0728e62..45ef069 100644 > --- a/memory.c > +++ b/memory.c > @@ -1600,6 +1600,18 @@ static void > memory_region_update_iommu_notify_flags(MemoryRegion *mr) > mr->iommu_notify_flags = flags; > } > > +void memory_region_notify_device_record(MemoryRegion *mr, > +void *info) > +{ > +assert(memory_region_is_iommu(mr)); > + > +if (mr->iommu_ops->record_device) { > +mr->iommu_ops->record_device(mr, info); > +} > + > +return; > +} > + > void memory_region_register_iommu_notifier(MemoryRegion *mr, > IOMMUNotifier *n) > { >
Re: [Qemu-devel] [RFC PATCH 02/20] intel_iommu: exposed extended-context mode to guest
On 2017年04月27日 18:32, Peter Xu wrote: > On Wed, Apr 26, 2017 at 06:06:32PM +0800, Liu, Yi L wrote: >> VT-d implementations reporting PASID or PRS fields as "Set", must also >> report ecap.ECS as "Set". Extended-Context is required for SVM. >> >> When ECS is reported, intel iommu driver would initiate extended root entry >> and extended context entry, and also PASID table if there is any SVM capable >> device. >> >> Signed-off-by: Liu, Yi L >> --- >> hw/i386/intel_iommu.c | 131 >> +++-- >> hw/i386/intel_iommu_internal.h | 9 +++ >> include/hw/i386/intel_iommu.h | 2 +- >> 3 files changed, 97 insertions(+), 45 deletions(-) >> >> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c >> index 400d0d1..bf98fa5 100644 >> --- a/hw/i386/intel_iommu.c >> +++ b/hw/i386/intel_iommu.c >> @@ -497,6 +497,11 @@ static inline bool vtd_root_entry_present(VTDRootEntry >> *root) >> return root->val & VTD_ROOT_ENTRY_P; >> } >> >> +static inline bool vtd_root_entry_upper_present(VTDRootEntry *root) >> +{ >> +return root->rsvd & VTD_ROOT_ENTRY_P; >> +} >> + >> static int vtd_get_root_entry(IntelIOMMUState *s, uint8_t index, >>VTDRootEntry *re) >> { >> @@ -509,6 +514,9 @@ static int vtd_get_root_entry(IntelIOMMUState *s, >> uint8_t index, >> return -VTD_FR_ROOT_TABLE_INV; >> } >> re->val = le64_to_cpu(re->val); >> +if (s->ecs) { >> +re->rsvd = le64_to_cpu(re->rsvd); >> +} > > I feel it slightly hacky to play with re->rsvd. How about: > > union VTDRootEntry { > struct { > uint64_t val; > uint64_t rsvd; > } base; > struct { > uint64_t ext_lo; > uint64_t ext_hi; > } extended; > }; > > (Or any better way that can get rid of rsvd...) > > Even: > > struct VTDRootEntry { > union { > struct { > uint64_t val; > uint64_t rsvd; > } base; > struct { > uint64_t ext_lo; > uint64_t ext_hi; > } extended; > } data; > bool extended; > }; > > Then we read the entry into data, and setup extended bit. 
A benefit of > it is that we may avoid passing around IntelIOMMUState everywhere to > know whether we are using extended context entries. > >> return 0; >> } >> >> @@ -517,19 +525,30 @@ static inline bool >> vtd_context_entry_present(VTDContextEntry *context) >> return context->lo & VTD_CONTEXT_ENTRY_P; >> } >> >> -static int vtd_get_context_entry_from_root(VTDRootEntry *root, uint8_t >> index, >> - VTDContextEntry *ce) >> +static int vtd_get_context_entry_from_root(IntelIOMMUState *s, >> + VTDRootEntry *root, uint8_t index, VTDContextEntry *ce) >> { >> -dma_addr_t addr; >> +dma_addr_t addr, ce_size; >> >> /* we have checked that root entry is present */ >> -addr = (root->val & VTD_ROOT_ENTRY_CTP) + index * sizeof(*ce); >> -if (dma_memory_read(&address_space_memory, addr, ce, sizeof(*ce))) { >> +ce_size = (s->ecs) ? (2 * sizeof(*ce)) : (sizeof(*ce)); >> +addr = (s->ecs && (index > 0x7f)) ? >> + ((root->rsvd & VTD_ROOT_ENTRY_CTP) + (index - 0x80) * ce_size) : >> + ((root->val & VTD_ROOT_ENTRY_CTP) + index * ce_size); >> + >> +if (dma_memory_read(&address_space_memory, addr, ce, ce_size)) { >> trace_vtd_re_invalid(root->rsvd, root->val); >> return -VTD_FR_CONTEXT_TABLE_INV; >> } >> -ce->lo = le64_to_cpu(ce->lo); >> -ce->hi = le64_to_cpu(ce->hi); >> + >> +ce[0].lo = le64_to_cpu(ce[0].lo); >> +ce[0].hi = le64_to_cpu(ce[0].hi); > > Again, I feel this even hackier. :) > > I would slightly prefer to play the same union trick to context > entries, just like what I proposed to the root entries above... 
> >> + >> +if (s->ecs) { >> +ce[1].lo = le64_to_cpu(ce[1].lo); >> +ce[1].hi = le64_to_cpu(ce[1].hi); >> +} >> + >> return 0; >> } >> >> @@ -595,9 +614,11 @@ static inline uint32_t >> vtd_get_agaw_from_context_entry(VTDContextEntry *ce) >> return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9; >> } >> >> -static inline uint32_t vtd_ce_get_type(VTDContextEntry *ce) >> +static inline uint32_t vtd_ce_get_type(IntelIOMMUState *s, >> + VTDContextEntry *ce) >> { >> -return ce->lo & VTD_CONTEXT_ENTRY_TT; >> +return s->ecs ? (ce->lo & VTD_CONTEXT_ENTRY_TT) : >> +(ce->lo & VTD_EXT_CONTEXT_ENTRY_TT); >> } >> >> static inline uint64_t vtd_iova_limit(VTDContextEntry *ce) >> @@ -842,16 +863,20 @@ static int vtd_dev_to_context_entry(IntelIOMMUState >> *s, uint8_t bus_num, >> return ret_fr; >> } >> >> -if (!vtd_root_entry_present(&re)) { >> +if (!vtd_root_entry_present(&re) || >> +(s->ecs && (devfn > 0x7f) && (!vtd_root_entry_upper_present(&re >> { >> /* Not
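Peter Xu's suggested layout from the review above can be sketched in plain C. This is a hedged illustration, not QEMU's eventual code: the names (`VTDRootEntrySketch`, the helper functions) and the devfn split across the two halves are assumptions based on the discussion and on the VT-d extended-root-entry format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VTD_ROOT_ENTRY_P 1ULL

/* Sketch of the union layout proposed in the review: one 128-bit slot
 * is either a classic root entry (val + reserved half) or an extended
 * root entry whose lower/upper halves each carry their own present bit
 * and context-table pointer. */
typedef struct VTDRootEntrySketch {
    union {
        struct {
            uint64_t val;
            uint64_t rsvd;
        } base;
        struct {
            uint64_t lo;   /* context entries for devfn 0x00-0x7f */
            uint64_t hi;   /* context entries for devfn 0x80-0xff */
        } extended;
    } data;
    bool extended;         /* filled once at read time from s->ecs */
} VTDRootEntrySketch;

static bool root_lower_present(const VTDRootEntrySketch *re)
{
    return re->extended ? (re->data.extended.lo & VTD_ROOT_ENTRY_P)
                        : (re->data.base.val & VTD_ROOT_ENTRY_P);
}

static bool root_upper_present(const VTDRootEntrySketch *re)
{
    /* Only meaningful in ECS mode, where devfn > 0x7f uses the upper half. */
    return re->extended && (re->data.extended.hi & VTD_ROOT_ENTRY_P);
}
```

With this shape, callers test `re->extended` instead of passing `IntelIOMMUState` around, which is the benefit Peter points out.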
Re: [Qemu-devel] [PATCH v2 7/7] intel_iommu: support passthrough (PT)
On 2017-04-20 15:04, Peter Xu wrote: > On Thu, Apr 20, 2017 at 06:36:16AM +, Liu, Yi L wrote: > > [...] > In my understanding, container->space->as->root cannot work here no matter whether passthru mode is enabled or not. The code here is aiming to check if a vIOMMU exists. After the vfio series, vtd_dev_as->root is not initialized to be an IOMMU MemoryRegion. Compared with checking if it is system_memory(), I think adding a mechanism to get the IOMMU >>> MemoryRegion may be a better choice. Just like the current >>> pci_device_iommu_address_space(). >>> >>> Isn't pci_device_iommu_address_space() used to get that IOMMU memory region? > > Again I should say s/memory region/address space/... > >> >> It actually returns the AddressSpace, and the AddressSpace includes a memory >> region. >> It is as->root. But after adding the vfio series, though the IOMMU memory >> region >> is got, it has no iommu_ops. Just as the following code shows. That's >> why I said >> even without passthru-mode, Tianyu's code snippet is not able to do >> the correct >> check. >> >> memory_region_init(&vtd_dev_as->root, OBJECT(s), >>"vtd_root", UINT64_MAX); >> address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name); > > The problem is, I am not sure whether there is always _only_ one IOMMU > region behind one device. E.g., IIUC ppc can have more than one IOMMU > memory region, translating different address ranges. > If we really want to check this via memory regions, we may check all subregions under the container's address space and register vIOMMU-related notifiers if one of them has iommu_ops. -- Best regards Tianyu Lan
Re: [Qemu-devel] [PATCH v2 7/7] intel_iommu: support passthrough (PT)
On 4/20/2017 1:40 PM, Peter Xu wrote: On Thu, Apr 20, 2017 at 04:55:24AM +, Liu, Yi L wrote: [...] In my previous RFC patchset of fault event reporting, I registered fault notifier when there is a VFIO group attached to VFIO container and used the address space to check whether vIOMMU is added. The address space is returned by vtd_host_dma_iommu(). vtd_find_add_as() initializes device's IOMMU memory region and put it into device's address space as root memory region. Does this make sense? @@ -1103,6 +1132,14 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, goto listener_release_exit; } +if (memory_region_is_iommu(container->space->as->root)) { I would suggest we don't play with as->root any more. After vtd vfio series, this may not be true if passthrough mode is enabled (then the root may be switched to an system memory alias). I don't know the best way to check this, one alternative might be that we check whether container->space->as == system_memory(), it should be workable, but in Sorry, I was meaning &address_space_memory. a slightly hackish way. In my understanding, container->space->as->root cannot work here no matter passthru-mode is enabled or not. The code here is aiming to check if vIOMMU exists. After the vfio series, the vtd_dev_as->root is not initialized to be a iommu MemoryRegion. Compared with checking if it is system_memory(), I think adding a mechanism to get the iommu MemoryRegion may be a better choice. Just like the current pci_device_iommu_address_space(). Isn't pci_device_iommu_address_space() used to get that IOMMU memory region? And, one thing to mention is that container->space->as is actually derived from pci_device_iommu_address_space() (when calling vfio_get_group()). I feel like that playing around with an IOMMU memory region is still not clear enough in many cases. I still feel like some day we would like an "IOMMU object". 
Then, we can register non-IOTLB notifiers against that IOMMU object, rather than memory regions... Our target is to check whether the assigned device is under a vIOMMU. We may check whether the iommu_fn pointer has been populated in its PCI bus's data structure (PCIBus). This is what pci_device_iommu_address_space() does.
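The check Tianyu describes can be sketched with simplified stand-in structs; in QEMU the `iommu_fn` hook lives on the bus, and `pci_device_iommu_address_space()` walks up toward the root bus looking for it (the real function also returns the address space the hook produces, which this sketch omits).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for QEMU's PCIBus/PCIDevice, just to illustrate
 * the walk up the bus hierarchy. */
typedef struct BusSketch {
    struct BusSketch *parent;     /* NULL at the root bus */
    void *iommu_fn;               /* non-NULL once a vIOMMU is wired up */
} BusSketch;

typedef struct {
    BusSketch *bus;
} DevSketch;

/* True when some bus above the device has an IOMMU hook populated,
 * i.e. the device's DMA would go through a (v)IOMMU. */
static bool device_under_viommu(const DevSketch *dev)
{
    for (const BusSketch *bus = dev->bus; bus; bus = bus->parent) {
        if (bus->iommu_fn) {
            return true;
        }
    }
    return false;
}
```

This avoids inspecting memory regions at all, which sidesteps both the passthrough address-space switch and the multiple-IOMMU-region problem Peter raises.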
Re: [Qemu-devel] [PATCH v2 7/7] intel_iommu: support passthrough (PT)
Hi All: Sorry for later response. On 2017年04月18日 17:04, Liu, Yi L wrote: >> -Original Message- >> From: Peter Xu [mailto:pet...@redhat.com] >> Sent: Tuesday, April 18, 2017 3:27 PM >> To: Liu, Yi L >> Cc: qemu-devel@nongnu.org; Lan, Tianyu ; Michael S . >> Tsirkin ; Jason Wang ; Marcel >> Apfelbaum ; David Gibson >> Subject: Re: [Qemu-devel] [PATCH v2 7/7] intel_iommu: support passthrough >> (PT) >> >> On Tue, Apr 18, 2017 at 06:02:44AM +, Liu, Yi L wrote: >>>>> Hi Peter, >>>>> >>>>> Skip address space switching is a good idea to support Passthru mode. >>>>> However, without the address space, the vfio notifier would not be >>>>> registered, thus vIOMMU emulator has no way to connect to host. It >>>>> is no harm if there is only map/unmap notifier. But if we have >>>>> more notifiers other than map/unmap, it may be a problem. >>>>> >>>>> I think we need to reconsider it here. >>>> >>>> For now I think as switching is good to us in general. Could I know >>>> more context about this? Would it be okay to work on top of this in the >>>> future? >>>> >>> >>> Let me explain. For vSVM enabling, it needs to add new notifier to >>> bind gPASID table to host. If an assigned device uses SVM in guest, as >>> designed guest IOMMU driver would set "pt" for it. This is to make >>> sure the host second-level page table stores GPA->HPA mapping. So that >>> pIOMMU can do nested translation to achieve gVA->GPA GPA->HPA mapping. >> >> That should mean that in the guest it will only enable first level >> translation, while in >> the host it'll be nested (first + second)? Then you should be trapping the >> guest >> extended context entry invalidation, then pass the PASID table pointer >> downward to >> the host IOMMU, am I correct? > > exactly. > >> >>> >>> So the situation is if we want to keep GPA->HPA mapping, we should >>> skip address space switch, but the vfio listener would not know vIOMMU >>> is added, so no vfio notifier would be registered. 
But if we do the >>> switch, the GPA->HPA mapping is unmapped. And need another way to build the >> GPA->HPA mapping. >> >> If my understanding above is correct, current IOMMU notifier may not satisfy >> your >> need. If you see the current notify interface, it's delivering >> IOMMUTLBEntry. I >> suspect it only suites for IOTLB notifications, but not if you want to pass >> some >> pointer (one GPA) downward somewhere. > > Yeah, you got it more than absolutely. One of my patch is to modify the para > to be > void * and let each notifiers to translate separately. I think it should be a > reasonable > change. > > Supposedly, I would send RFC for vSVM soon. We may talk more it at that > thread. > >>> >>> I think we may have two choice here. Pls let me know your ideas. >>> >>> 1. skip the switch for "pt" mode, find other way to register vfio >>> notifiers. not sure if this is workable. Just a quick thought. >>> >>> 2. do the switch, and rebuild GPA->HPA mapping if device use "pt" >>> mode. For this option, I also have two way to build the GPA->HPA mapping. >>> a) walk all the memory region sections in address_space_memory, and build >>> the >> mapping. >>> address_space_memory.dispatch->map.sections, sections stores all the mapped >> sections. >>> >>> b) let guest iommu driver mimics a 1:1 mapping for the devices use >>> "pt" mode. in this way, the GPA->HPA mapping could be setup by walking >>> the guest SL page table. This is consistent with your vIOVA replay solution. >> >> I'll prefer (1). Reason explained above. >> >>> >>> Also I'm curious if Tianyu's fault report framework needs to register new >>> notifiers. >> >> For Tianyu's case, I feel like we need other kind of notifier as well, but >> not the >> current IOTLB one. And, of course in this case it'll be in an reverted >> direction, which >> is from device to vIOMMU. 
>> >> (I am thinking whether it's a good time now to let any PCI device able to >> fetch its >> direct owner IOMMU object, then it can register anything it want onto that >> object >> maybe?) >> > Hmmm, let's see
Re: [Qemu-devel] [RFC PATCH 0/4] Qemu: Add Xen vIOMMU support
On 2017-03-20 19:38, Paolo Bonzini wrote: > Fair enough, though I'd be worried about increasing the attack surface > of the hypervisor. For KVM, for example, IOMMU emulation requires using > the "split irqchip" feature to move the PIC and IOAPIC out of the kernel > and back to QEMU. Yes, as Roger mentioned, we also need to support a no-QEMU mode on Xen, and this design is the result of that tradeoff. > > Also, I think this series is missing changes to support IOMMU > translation in the vIOMMU device model. Yes, this series only enables the vIOMMU's interrupt remapping function; once DMA translation is enabled, we will need to pass the virtual device's DMA requests to the Xen hypervisor for translation. -- Best regards Tianyu Lan
Re: [Qemu-devel] [RFC PATCH 1/4] I440: Allow adding sysbus devices with -device on I440
Hi Eduardo: Thanks for your review. On 2017-03-21 03:49, Eduardo Habkost wrote: > On Fri, Mar 17, 2017 at 07:29:14PM +0800, Lan Tianyu wrote: >> From: Chao Gao >> >> xen-viommu will be a sysbus device and the device model will >> be enabled via "-device" parameter. >> >> Signed-off-by: Chao Gao >> Signed-off-by: Lan Tianyu > > I'm worried about the bugs we may expose by accepting all the > other sysbus devices in the command-line in addition to > xen-viommu. > > I am working on a RFC to replace "has_dynamic_sysbus" with a > whitelist of sysbus device classes. This way we could enable only > xen-viommu on i440fx, instead of suddenly enabling all sysbus > devices just because of xen-viommu. That sounds reasonable. Thanks for working on that; we will rebase on your patches. > >> --- >> hw/i386/pc_piix.c | 1 + >> 1 file changed, 1 insertion(+) >> >> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c >> index a07dc81..3289593 100644 >> --- a/hw/i386/pc_piix.c >> +++ b/hw/i386/pc_piix.c >> @@ -436,6 +436,7 @@ static void pc_i440fx_machine_options(MachineClass *m) >> m->hot_add_cpu = pc_hot_add_cpu; >> m->default_machine_opts = "firmware=bios-256k.bin"; >> m->default_display = "std"; >> +m->has_dynamic_sysbus = true; >> } >> >> static void pc_i440fx_2_7_machine_options(MachineClass *m) >> -- >> 1.8.3.1 >> > -- Best regards Tianyu Lan
Re: [Qemu-devel] [RFC PATCH 0/4] Qemu: Add Xen vIOMMU support
On 2017-03-17 22:48, Paolo Bonzini wrote: > > > On 17/03/2017 12:29, Lan Tianyu wrote: >> This patchset is to add Xen vIOMMU device model and handle >> irq remapping stuffs. Xen vIOMMU emulation is in the Xen hypervisor >> and the new device module in Qemu works as hypercall wrappers to >> create and destroy vIOMMU in hypervisor. >> >> Xen only supports emulated I440 and so we enable vIOMMU with emulated >> I440 chipset. This works on Linux and Windows guest. > Any plans to change this? Why is Xen not able to use Q35 with Intel > IOMMU, with only special hooks for interrupt remapping? > > Paolo > Hi Paolo: Thanks for the review. On the Xen side, we won't reuse the Intel IOMMU device model in Qemu; instead we create a counterpart in the Xen hypervisor. The reasons are 1) avoiding round trips between Qemu and the Xen hypervisor, and 2) ease of integration with the rest of the hypervisor (vIOAPIC, vMSI, and so on). We have no plan to enable Q35 for Xen now. From our testing, there is no dependency between Q35 emulation and the vIOMMU implementation. As Stefano mentioned, Anthony worked on the Q35 support before. These two tasks can be done in parallel. -- Best regards Tianyu Lan
[Qemu-devel] [RFC PATCH 4/4] msi: take interrupt format into consideration when judging whether a pirq is bound to an event channel
From: Chao Gao As remapping format interrupt has been introduced, the vector in msi remapping format can also be 0, same as a interrupt is binded with a event channel. So we can't just use whether vector is 0 or not to judge a msi is binded to a event channel or not. This patch takes the msi interrupt format into consideration. Signed-off-by: Chao Gao Signed-off-by: Lan Tianyu --- hw/pci/msi.c | 5 +++-- hw/pci/msix.c| 4 +++- hw/xen/xen_pt_msi.c | 2 +- include/hw/xen/xen.h | 2 +- xen-hvm-stub.c | 2 +- xen-hvm.c| 7 ++- 6 files changed, 15 insertions(+), 7 deletions(-) diff --git a/hw/pci/msi.c b/hw/pci/msi.c index a87b227..8d1ac9e 100644 --- a/hw/pci/msi.c +++ b/hw/pci/msi.c @@ -289,7 +289,7 @@ void msi_reset(PCIDevice *dev) static bool msi_is_masked(const PCIDevice *dev, unsigned int vector) { uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev)); -uint32_t mask, data; +uint32_t mask, data, addr_lo; bool msi64bit = flags & PCI_MSI_FLAGS_64BIT; assert(vector < PCI_MSI_VECTORS_MAX); @@ -298,7 +298,8 @@ static bool msi_is_masked(const PCIDevice *dev, unsigned int vector) } data = pci_get_word(dev->config + msi_data_off(dev, msi64bit)); -if (xen_is_pirq_msi(data)) { +addr_lo = pci_get_word(dev->config + msi_address_lo_off(dev)); +if (xen_is_pirq_msi(data, addr_lo)) { return false; } diff --git a/hw/pci/msix.c b/hw/pci/msix.c index 0ec1cb1..6b8045a 100644 --- a/hw/pci/msix.c +++ b/hw/pci/msix.c @@ -81,9 +81,11 @@ static bool msix_vector_masked(PCIDevice *dev, unsigned int vector, bool fmask) { unsigned offset = vector * PCI_MSIX_ENTRY_SIZE; uint8_t *data = &dev->msix_table[offset + PCI_MSIX_ENTRY_DATA]; +uint8_t *addr_lo = &dev->msix_table[offset + PCI_MSIX_ENTRY_LOWER_ADDR]; /* MSIs on Xen can be remapped into pirqs. In those cases, masking * and unmasking go through the PV evtchn path. 
*/ -if (xen_enabled() && xen_is_pirq_msi(pci_get_long(data))) { +if (xen_enabled() && xen_is_pirq_msi(pci_get_long(data), + pci_get_long(addr_lo))) { return false; } return fmask || dev->msix_table[offset + PCI_MSIX_ENTRY_VECTOR_CTRL] & diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c index 8b0d7fc..f799fed 100644 --- a/hw/xen/xen_pt_msi.c +++ b/hw/xen/xen_pt_msi.c @@ -114,7 +114,7 @@ static int msi_msix_setup(XenPCIPassthroughState *s, assert((!is_msix && msix_entry == 0) || is_msix); -if (xen_is_pirq_msi(data)) { +if (xen_is_pirq_msi(data, (uint32_t)(addr & 0x))) { *ppirq = msi_ext_dest_id(addr >> 32) | msi_dest_id(addr); if (!*ppirq) { /* this probably identifies an misconfiguration of the guest, diff --git a/include/hw/xen/xen.h b/include/hw/xen/xen.h index a8f3afb..c15beb5 100644 --- a/include/hw/xen/xen.h +++ b/include/hw/xen/xen.h @@ -33,7 +33,7 @@ int xen_pci_slot_get_pirq(PCIDevice *pci_dev, int irq_num); void xen_piix3_set_irq(void *opaque, int irq_num, int level); void xen_piix_pci_write_config_client(uint32_t address, uint32_t val, int len); void xen_hvm_inject_msi(uint64_t addr, uint32_t data); -int xen_is_pirq_msi(uint32_t msi_data); +int xen_is_pirq_msi(uint32_t msi_data, uint32_t msi_addr_lo); qemu_irq *xen_interrupt_controller_init(void); diff --git a/xen-hvm-stub.c b/xen-hvm-stub.c index c500325..dae421c 100644 --- a/xen-hvm-stub.c +++ b/xen-hvm-stub.c @@ -31,7 +31,7 @@ void xen_hvm_inject_msi(uint64_t addr, uint32_t data) { } -int xen_is_pirq_msi(uint32_t msi_data) +int xen_is_pirq_msi(uint32_t msi_data, uint32_t msi_addr_lo) { return 0; } diff --git a/xen-hvm.c b/xen-hvm.c index 2f348ed..9e78b23 100644 --- a/xen-hvm.c +++ b/xen-hvm.c @@ -146,8 +146,13 @@ void xen_piix_pci_write_config_client(uint32_t address, uint32_t val, int len) } } -int xen_is_pirq_msi(uint32_t msi_data) +int xen_is_pirq_msi(uint32_t msi_data, uint32_t msi_addr_lo) { +/* If msi address is configurate to remapping format, the msi will not + * remapped into a pirq. 
+ */ +if ( msi_addr_lo & 0x10 ) +return 0; /* If vector is 0, the msi is remapped into a pirq, passed as * dest_id. */ -- 1.8.3.1
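The decision this patch implements can be condensed into a small standalone function. The constant 0x10 (bit 4 of the low address word, the interrupt-format bit) comes from the patch itself; treating "data/vector field == 0" as the legacy pirq marker is my reading of the comment in xen-hvm.c and is a sketch, not the exact QEMU code.

```c
#include <assert.h>
#include <stdint.h>

#define MSI_ADDR_IM_MASK 0x0010u  /* interrupt-format (remappable) bit */

/* A message in remapping format is never a pirq, even though its data
 * field may legitimately be 0; only for compatibility-format messages
 * does a zero vector mark a pirq bound to an event channel. */
static int is_pirq_msi(uint32_t msi_data, uint32_t msi_addr_lo)
{
    if (msi_addr_lo & MSI_ADDR_IM_MASK) {
        return 0;                  /* remapping format: not a pirq */
    }
    return (msi_data & 0xffu) == 0;
}
```

This is why the patch threads `msi_addr_lo` through `msi_is_masked()`, `msix_vector_masked()`, and `msi_msix_setup()`: the data word alone is no longer enough to disambiguate.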
[Qemu-devel] [RFC PATCH 2/4] Xen: add a dummy vIOMMU to create/destroy vIOMMU in Xen
From: Chao Gao Since adding a dynamic sysbus device is forbidden, so choose TYPE_DEVICE as parent class. Signed-off-by: Chao Gao Signed-off-by: Lan Tianyu --- hw/xen/Makefile.objs | 1 + hw/xen/xen_viommu.c | 116 +++ 2 files changed, 117 insertions(+) create mode 100644 hw/xen/xen_viommu.c diff --git a/hw/xen/Makefile.objs b/hw/xen/Makefile.objs index d367094..e37d808 100644 --- a/hw/xen/Makefile.objs +++ b/hw/xen/Makefile.objs @@ -3,3 +3,4 @@ common-obj-$(CONFIG_XEN_BACKEND) += xen_backend.o xen_devconfig.o obj-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen-host-pci-device.o obj-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pt.o xen_pt_config_init.o xen_pt_graphics.o xen_pt_msi.o +obj-$(CONFIG_XEN) += xen_viommu.o diff --git a/hw/xen/xen_viommu.c b/hw/xen/xen_viommu.c new file mode 100644 index 000..9bf9158 --- /dev/null +++ b/hw/xen/xen_viommu.c @@ -0,0 +1,116 @@ +/* + * Xen virtual IOMMU (virtual VT-D) + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + + * You should have received a copy of the GNU General Public License along + * with this program; if not, see <http://www.gnu.org/licenses/>. 
+ */ + +#include "qemu/osdep.h" +#include "qemu/error-report.h" + +#include "hw/qdev-core.h" +#include "hw/sysbus.h" +#include "hw/xen/xen.h" +#include "hw/xen/xen_backend.h" + +#define TYPE_XEN_VIOMMU_DEVICE "xen_viommu" +#define XEN_VIOMMU_DEVICE(obj) \ +OBJECT_CHECK(XenVIOMMUState, (obj), TYPE_XEN_VIOMMU_DEVICE) + +typedef struct XenVIOMMUState XenVIOMMUState; + +struct XenVIOMMUState { +DeviceState dev; +uint32_t id; +uint64_t cap; +uint64_t base_addr; +}; + +static void xen_viommu_realize(DeviceState *dev, Error **errp) +{ +int rc; +uint64_t cap; +char *dom; +char viommu_path[1024]; +XenVIOMMUState *s = XEN_VIOMMU_DEVICE(dev); + +s->id = -1; + +/* Read vIOMMU attributes from Xenstore. */ +dom = xs_get_domain_path(xenstore, xen_domid); +snprintf(viommu_path, sizeof(viommu_path), "%s/viommu", dom); +rc = xenstore_read_uint64(viommu_path, "base_addr", &s->base_addr); +if (rc) { +error_report("Can't get base address of vIOMMU"); +exit(1); +} + +rc = xenstore_read_uint64(viommu_path, "cap", &s->cap); +if (rc) { +error_report("Can't get capabilities of vIOMMU"); +exit(1); +} + +rc = xc_viommu_query_cap(xen_xc, xen_domid, &cap); +if (rc) { +exit(1); +} + + +if ((cap & s->cap) != cap) { +error_report("xen: Unsupported capability %lx", s->cap); +exit(1); +} + +rc = xc_viommu_create(xen_xc, xen_domid, s->base_addr, s->cap, &s->id); +if (rc) { +s->id = -1; +error_report("xen: failed(%d) to create viommu ", rc); +exit(1); +} +} + +static void xen_viommu_instance_finalize(Object *o) +{ +int rc; +XenVIOMMUState *s = XEN_VIOMMU_DEVICE(o); + +if (s->id != -1) { +rc = xc_viommu_destroy(xen_xc, xen_domid, s->id); +if (rc) { +error_report("xen: failed(%d) to destroy viommu ", rc); +} +} +} + +static void xen_viommu_class_init(ObjectClass *klass, void *data) +{ +DeviceClass *dc = DEVICE_CLASS(klass); +dc->hotpluggable = false; +dc->realize = xen_viommu_realize; +} + +static const TypeInfo xen_viommu_info = { +.name = TYPE_XEN_VIOMMU_DEVICE, +.parent= TYPE_SYS_BUS_DEVICE, 
+.instance_size = sizeof(XenVIOMMUState), +.instance_finalize = xen_viommu_instance_finalize, +.class_init= xen_viommu_class_init, +}; + +static void xen_viommu_register_types(void) +{ +type_register_static(&xen_viommu_info); +} + +type_init(xen_viommu_register_types); -- 1.8.3.1
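The capability negotiation in xen_viommu_realize() is worth a second look. As posted, the patch rejects when `(cap & s->cap) != cap`, i.e. when the hypervisor-offered `cap` is not a subset of the requested `s->cap`. The conventional direction — every requested capability must be offered — is the reverse; the sketch below shows that assumed intent, not the code as posted.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed intent of the check: creation may proceed only when every
 * capability the toolstack requested is offered by the hypervisor. */
static int caps_ok(uint64_t offered, uint64_t requested)
{
    return (offered & requested) == requested;
}
```

If the posted expression is intentional, it instead demands that the guest request a superset of everything the hypervisor offers, which would make any newly added hypervisor capability break old configurations.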
[Qemu-devel] [RFC PATCH 3/4] xen-pt: bind/unbind interrupt remapping format MSI
From: Chao Gao If a vIOMMU is exposed to guest, guest will configure the msi to remapping format. The original code isn't suitable to the new format. A new pair bind/unbind interfaces are added for this usage. This patch recognizes this case and use new interfaces to bind/unbind msi. Signed-off-by: Chao Gao Signed-off-by: Lan Tianyu --- hw/xen/xen_pt_msi.c | 36 include/hw/i386/apic-msidef.h | 1 + 2 files changed, 29 insertions(+), 8 deletions(-) diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c index 62add06..8b0d7fc 100644 --- a/hw/xen/xen_pt_msi.c +++ b/hw/xen/xen_pt_msi.c @@ -161,6 +161,7 @@ static int msi_msix_update(XenPCIPassthroughState *s, uint8_t gvec = msi_vector(data); uint32_t gflags = msi_gflags(data, addr); int rc = 0; +bool ir = !!(addr & MSI_ADDR_IM_MASK); uint64_t table_addr = 0; XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x gflags %#x" @@ -171,8 +172,14 @@ static int msi_msix_update(XenPCIPassthroughState *s, table_addr = s->msix->mmio_base_addr; } -rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec, +if (ir) { +rc = xc_domain_update_msi_irq_remapping(xen_xc, xen_domid, pirq, +d->devfn, data, addr, table_addr); +} +else { +rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec, pirq, gflags, table_addr); +} if (rc) { XEN_PT_ERR(d, "Updating of MSI%s failed. (err: %d)\n", @@ -204,13 +211,26 @@ static int msi_msix_disable(XenPCIPassthroughState *s, } if (is_binded) { -XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n", - is_msix ? "-X" : "", pirq, gvec); -rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec, pirq, gflags); -if (rc) { -XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, gvec: %#x)\n", - is_msix ? "-X" : "", errno, pirq, gvec); -return rc; +if ( addr & MSI_ADDR_IM_MASK ) { +XEN_PT_LOG(d, "Unbinding of MSI%s . ( pirq: %d, data: %x, addr: %lx)\n", + is_msix ? 
"-X" : "", pirq, data, addr); +rc = xc_domain_unbind_msi_irq_remapping(xen_xc, xen_domid, pirq, +d->devfn, data, addr); +if (rc) { +XEN_PT_ERR(d, "Unbinding of MSI%s . (error: %d, pirq: %d, data: %x, addr: %lx)\n", + is_msix ? "-X" : "", rc, pirq, data, addr); +return rc; +} + +} else { +XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n", + is_msix ? "-X" : "", pirq, gvec); +rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec, pirq, gflags); +if (rc) { +XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, gvec: %#x)\n", + is_msix ? "-X" : "", errno, pirq, gvec); +return rc; +} } } diff --git a/include/hw/i386/apic-msidef.h b/include/hw/i386/apic-msidef.h index 8b4d4cc..08b584f 100644 --- a/include/hw/i386/apic-msidef.h +++ b/include/hw/i386/apic-msidef.h @@ -27,5 +27,6 @@ #define MSI_ADDR_DEST_ID_SHIFT 12 #define MSI_ADDR_DEST_IDX_SHIFT 4 #define MSI_ADDR_DEST_ID_MASK 0x000 +#define MSI_ADDR_IM_MASK 0x0010 #endif /* HW_APIC_MSIDEF_H */ -- 1.8.3.1
[Qemu-devel] [RFC PATCH 1/4] I440: Allow adding sysbus devices with -device on I440
From: Chao Gao xen-viommu will be a sysbus device and the device model will be enabled via "-device" parameter. Signed-off-by: Chao Gao Signed-off-by: Lan Tianyu --- hw/i386/pc_piix.c | 1 + 1 file changed, 1 insertion(+) diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c index a07dc81..3289593 100644 --- a/hw/i386/pc_piix.c +++ b/hw/i386/pc_piix.c @@ -436,6 +436,7 @@ static void pc_i440fx_machine_options(MachineClass *m) m->hot_add_cpu = pc_hot_add_cpu; m->default_machine_opts = "firmware=bios-256k.bin"; m->default_display = "std"; +m->has_dynamic_sysbus = true; } static void pc_i440fx_2_7_machine_options(MachineClass *m) -- 1.8.3.1
[Qemu-devel] [RFC PATCH 0/4] Qemu: Add Xen vIOMMU support
This patchset adds the Xen vIOMMU device model and handles interrupt remapping. Xen vIOMMU emulation is in the Xen hypervisor, and the new device module in Qemu works as hypercall wrappers to create and destroy the vIOMMU in the hypervisor. Xen only supports the emulated I440 chipset, so we enable vIOMMU with emulated I440. This works on Linux and Windows guests. Chao Gao (4): I440: Allow adding sysbus devices with -device on I440 Xen: add a dummy vIOMMU to create/destroy vIOMMU in Xen xen-pt: bind/unbind interrupt remapping format MSI msi: take interrupt format into consideration when judging whether a pirq is bound to an event channel hw/i386/pc_piix.c | 1 + hw/pci/msi.c | 5 +- hw/pci/msix.c | 4 +- hw/xen/Makefile.objs | 1 + hw/xen/xen_pt_msi.c | 38 ++ hw/xen/xen_viommu.c | 116 ++ include/hw/i386/apic-msidef.h | 1 + include/hw/xen/xen.h | 2 +- xen-hvm-stub.c| 2 +- xen-hvm.c | 7 ++- 10 files changed, 162 insertions(+), 15 deletions(-) create mode 100644 hw/xen/xen_viommu.c -- 1.8.3.1
[Qemu-devel] [Resend RFC PATCH 3/4] Intel iommu: Add Intel IOMMU fault event callback
This patch is to deal with fault event reported from IOMMU driver. Signed-off-by: Lan Tianyu --- hw/i386/intel_iommu.c | 25 + 1 file changed, 25 insertions(+) diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index 9b1ba1b..79507d2 100644 --- a/hw/i386/intel_iommu.c +++ b/hw/i386/intel_iommu.c @@ -2286,6 +2286,30 @@ static void vtd_iommu_notify_flag_changed(MemoryRegion *iommu, } } +static void vtd_iommu_notify_fault_event(MemoryRegion *iommu, + IOMMUFaultInfo *info) +{ +VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu); +IntelIOMMUState *s = vtd_as->iommu_state; +bool is_fpd_set = false; +uint8_t bus_num = pci_bus_num(vtd_as->bus); +uint8_t devfn = vtd_as->devfn; +VTDContextEntry ce; + +/* Replace source id with device's vbdf */ +info->sid = vtd_make_source_id(bus_num, devfn); + +if (!vtd_dev_to_context_entry(s, bus_num, devfn, &ce)) { +is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD; +if (is_fpd_set) { +trace_vtd_fault_disabled(); +} else { +vtd_report_dmar_fault(s, info->sid, info->addr, + info->fault_reason, info->is_write); +} +} +} + static const VMStateDescription vtd_vmstate = { .name = "iommu-intel", .version_id = 1, @@ -2816,6 +2840,7 @@ static void vtd_init(IntelIOMMUState *s) s->iommu_ops.translate = vtd_iommu_translate; s->iommu_ops.notify_flag_changed = vtd_iommu_notify_flag_changed; +s->iommu_ops.notify_fault_event = vtd_iommu_notify_fault_event; s->iommu_ops.replay = vtd_iommu_replay; s->root = 0; s->root_extended = false; -- 1.8.3.1
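For reference, the source id that vtd_make_source_id() builds in the handler above is just the guest view of the device's BDF packed into 16 bits; a minimal sketch of that encoding:

```c
#include <assert.h>
#include <stdint.h>

/* VT-d source-id layout: bits 15:8 bus number, bits 7:0 devfn
 * (device in bits 7:3, function in bits 2:0). */
static uint16_t make_source_id(uint8_t bus, uint8_t devfn)
{
    return (uint16_t)((bus << 8) | devfn);
}
```

Replacing the hardware-reported sid with this value is what makes the injected fault point at the virtual device rather than the physical one.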
[Qemu-devel] [Resend RFC PATCH 0/4] VT-d: Inject fault event from IOMMU hardware
Resend patchset due to wrong Qemu devel mail address. Sorry for the noise. This patchset proposes a solution for the vIOMMU to get hardware IOMMU fault events and info. The motivation is to make the vIOMMU inject the associated fault event when the pIOMMU reports a fault; the vIOMMU is in charge of transforming the fault info and injecting it into the guest. The feature is also very important for supporting first-level translation (translation requests with PASID) in a VM, which requires the vIOMMU to inject device page requests into the VM. VFIO can get the notification and read fault info via new VFIO cmds. Add a fault event handler in the memory IOMMU ops; the Intel IOMMU device model registers its fault event callback there, and VFIO calls the callback via a memory wrapper function when it gets a fault notification. These patches are prototype code and have only passed a build test. The IOMMU driver's new interface is still in the design stage. These patches are to confirm that the interface between Qemu and the VFIO kernel driver is on the right track. Comments are very much appreciated. Lan Tianyu (4): VFIO: Set eventfd for IOMMU fault event via new vfio cmd Memory: Introduce IOMMU fault event callback Intel iommu: Add Intel IOMMU fault event callback VFIO: Read IOMMU fault info from kernel space when get fault event hw/i386/intel_iommu.c | 25 hw/vfio/common.c | 88 +++ include/exec/memory.h | 19 ++ include/hw/vfio/vfio-common.h | 3 ++ linux-headers/linux/vfio.h| 35 + memory.c | 8 6 files changed, 178 insertions(+) -- 1.8.3.1
[Qemu-devel] [Resend RFC PATCH 4/4] VFIO: Read IOMMU fault info from kernel space when get fault event
This patch is to implement fault event handler with new vfio cmd to get fault info and notify vIOMMU device model. Signed-off-by: Lan Tianyu --- hw/vfio/common.c | 51 ++ linux-headers/linux/vfio.h | 22 2 files changed, 73 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 628b424..4f76e26 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -297,6 +297,57 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section) static void vfio_iommu_fault(void *opaque) { +VFIOContainer *container = opaque; +struct vfio_iommu_type1_get_fault_info *info; +struct vfio_iommu_fault_info *fault_info; +MemoryRegion *mr = container->space->as->root; +int count = 0, i, ret; +IOMMUFaultInfo tmp; + +if (!event_notifier_test_and_clear(&container->fault_notifier)) { +return; +} + +info = g_malloc0(sizeof(*info)); +if (!info) { +error_report("vfio: Fail to allocate memory"); +return; +} + +info->argsz = sizeof(*info); + +ret = ioctl(container->fd, VFIO_IOMMU_GET_FAULT_INFO, info); +if (ret && ret != -ENOSPC) { +error_report("vfio: Can't get fault info"); +goto err_exit; +} + +count = info->count; +if (count <= 0) { +goto err_exit; +} + +info = g_realloc(info, sizeof(*info) + count * sizeof(*fault_info)); +info->argsz = sizeof(*info) + count * sizeof(*fault_info); +fault_info = info->fault_info; + +ret = ioctl(container->fd, VFIO_IOMMU_GET_FAULT_INFO, info); +if (ret) { +error_report("vfio: Can't get fault info"); +goto err_exit; +} + +for (i = 0; i < info->count; i++) { +tmp.addr = fault_info[i].addr; +tmp.sid = fault_info[i].sid; +tmp.fault_reason = fault_info[i].fault_reason; +tmp.is_write = fault_info[i].is_write; + +memory_region_iommu_fault_notify(mr, &tmp); +} + +err_exit: +g_free(info); } static int vfio_set_iommu_fault_notifier(struct VFIOContainer *container) diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h index ca890ee..8b172f5 100644 --- a/linux-headers/linux/vfio.h +++ b/linux-headers/linux/vfio.h @@ -550,6 +550,28 @@ 
struct vfio_iommu_type1_set_fault_eventfd { #define VFIO_IOMMU_SET_FAULT_EVENTFD _IO(VFIO_TYPE, VFIO_BASE + 17) +/* + * VFIO_IOMMU_GET_FAULT_INFO _IO(VFIO_TYPE, VFIO_BASE + 18) + * + * Return IOMMU fault info to userspace. + */ + +struct vfio_iommu_fault_info { + __u64 addr; + __u16 sid; + __u8fault_reason; + __u8is_write:1; +}; + +struct vfio_iommu_type1_get_fault_info { + __u32 argsz; + __u32 flags; + __u32 count; + struct vfio_iommu_fault_info fault_info[]; +}; + +#define VFIO_IOMMU_GET_FAULT_INFO _IO(VFIO_TYPE, VFIO_BASE + 18) + /* Additional API for SPAPR TCE (Server POWERPC) IOMMU */ /* -- 1.8.3.1
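The handler above uses a common variable-length ioctl pattern: call once with a bare header to learn the record count, grow the buffer, then call again to fetch the records. A self-contained sketch with a mocked-up kernel side — the struct names and `fake_get_fault_info()` are stand-ins for illustration, not the real VFIO ABI:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct fault_rec {
    uint64_t addr;
    uint16_t sid;
};

struct fault_hdr {
    uint32_t argsz;
    uint32_t count;
    struct fault_rec recs[];   /* flexible array, like vfio's fault_info[] */
};

/* Mock of the VFIO_IOMMU_GET_FAULT_INFO ioctl: always reports two
 * pending faults; copies them out only when the buffer is big enough. */
static int fake_get_fault_info(struct fault_hdr *h)
{
    static const struct fault_rec pending[2] = {
        { 0x1000, 0x0010 },
        { 0x2000, 0x0018 },
    };
    h->count = 2;
    if (h->argsz < sizeof(*h) + sizeof(pending)) {
        return -1;             /* analogous to the kernel's -ENOSPC */
    }
    memcpy(h->recs, pending, sizeof(pending));
    return 0;
}

/* Two-pass drain mirroring vfio_iommu_fault(): query count, realloc,
 * fetch, then hand each record to the consumer (here: copy out). */
static int drain_faults(struct fault_rec *out, int max)
{
    struct fault_hdr *h = malloc(sizeof(*h));
    int count;

    h->argsz = sizeof(*h);
    fake_get_fault_info(h);                    /* first pass: learn count */
    count = h->count;
    h = realloc(h, sizeof(*h) + count * sizeof(struct fault_rec));
    h->argsz = sizeof(*h) + count * sizeof(struct fault_rec);
    if (fake_get_fault_info(h) != 0) {         /* second pass: fetch */
        free(h);
        return -1;
    }
    if (count > max) {
        count = max;
    }
    memcpy(out, h->recs, count * sizeof(struct fault_rec));
    free(h);
    return count;
}
```

One caveat the real patch shares: the count can grow between the two calls, so a production version should loop while the second call keeps returning the too-small error.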
[Qemu-devel] [Resend RFC PATCH 1/4] VFIO: Set eventfd for IOMMU fault event via new vfio cmd
This patch is to assign an event fd to VFIO IOMMU type1 driver in order
to get notification when IOMMU driver reports fault event.

Signed-off-by: Lan Tianyu
---
 hw/vfio/common.c              | 37 +++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |  3 +++
 linux-headers/linux/vfio.h    | 13 +++++++++++++
 3 files changed, 53 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6b33b9f..628b424 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -33,6 +33,7 @@
 #include "qemu/error-report.h"
 #include "qemu/range.h"
 #include "sysemu/kvm.h"
+#include "sysemu/sysemu.h"
 #include "trace.h"
 #include "qapi/error.h"
 
@@ -294,6 +295,34 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
            section->offset_within_address_space & (1ULL << 63);
 }
 
+static void vfio_iommu_fault(void *opaque)
+{
+}
+
+static int vfio_set_iommu_fault_notifier(struct VFIOContainer *container)
+{
+    struct vfio_iommu_type1_set_fault_eventfd eventfd;
+    int ret;
+
+    ret = event_notifier_init(&container->fault_notifier, 0);
+    if (ret < 0) {
+        error_report("vfio: Failed to init notifier for IOMMU fault event");
+        return ret;
+    }
+
+    eventfd.fd = event_notifier_get_fd(&container->fault_notifier);
+    eventfd.argsz = sizeof(eventfd);
+
+    ret = ioctl(container->fd, VFIO_IOMMU_SET_FAULT_EVENTFD, &eventfd);
+    if (ret < 0) {
+        error_report("vfio: Failed to set notifier for IOMMU fault event");
+        return ret;
+    }
+
+    qemu_set_fd_handler(eventfd.fd, vfio_iommu_fault, NULL, container);
+    return 0;
+}
+
 /* Called with rcu_read_lock held.
  */
 static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
                            bool *read_only)
@@ -1103,6 +1132,14 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
         goto listener_release_exit;
     }
 
+    if (memory_region_is_iommu(container->space->as->root)) {
+        if (vfio_set_iommu_fault_notifier(container)) {
+            error_setg_errno(errp, -ret,
+                             "Fail to set IOMMU fault notifier");
+            goto listener_release_exit;
+        }
+    }
+
     container->initialized = true;
 
     QLIST_INIT(&container->group_list);

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c582de1..1b594c6 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -26,6 +26,7 @@
 #include "exec/memory.h"
 #include "qemu/queue.h"
 #include "qemu/notify.h"
+#include "qemu/event_notifier.h"
 #ifdef CONFIG_LINUX
 #include <linux/vfio.h>
 #endif
@@ -81,6 +82,8 @@ typedef struct VFIOContainer {
     unsigned iommu_type;
     int error;
     bool initialized;
+    EventNotifier fault_notifier;
+
     /*
      * This assumes the host IOMMU can support only a single
      * contiguous IOVA window. We may need to generalize that in

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 759b850..ca890ee 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -537,6 +537,19 @@ struct vfio_iommu_type1_dma_unmap {
 #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
 #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+/*
+ * VFIO_IOMMU_SET_FAULT_EVENT_FD _IO(VFIO_TYPE, VFIO_BASE + 17)
+ *
+ * Receive eventfd from userspace to notify fault event from IOMMU.
+ */
+struct vfio_iommu_type1_set_fault_eventfd {
+	__u32	argsz;
+	__u32	flags;
+	__u32	fd;
+};
+
+#define VFIO_IOMMU_SET_FAULT_EVENTFD	_IO(VFIO_TYPE, VFIO_BASE + 17)
+
 /* Additional API for SPAPR TCE (Server POWERPC) IOMMU */

 /*
-- 
1.8.3.1
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On 2016年12月08日 10:39, Peter Xu wrote:
> Btw, do you have time to help review my RFC series for "vt-d replay"?
> :) I'd like to receive any further review comments besides the issues
> mentioned above since IMHO they can be seperated from current series,
> I have noted them down in my pending list.

Sure. I have reviewed some patches. It looks good to me :)

-- 
Best regards
Tianyu Lan
Re: [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay() callback
On 2016年12月06日 18:36, Peter Xu wrote:
> +/**
> + * vtd_page_walk - walk specific IOVA range, and call the hook
> + *
> + * @ce: context entry to walk upon
> + * @start: IOVA address to start the walk
> + * @end: size of the IOVA range to walk over

 * @end: IOVA range end address (start <= addr < end)

Maybe this comment from vtd_page_walk_level() is more accurate here?

-- 
Best regards
Tianyu Lan
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On 2016年12月07日 14:43, Peter Xu wrote:
> On Wed, Dec 07, 2016 at 02:09:16PM +0800, Lan Tianyu wrote:
>> On 2016年12月06日 18:59, Peter Xu wrote:
>>> On Tue, Dec 06, 2016 at 04:27:39PM +0800, Lan Tianyu wrote:
>>>
>>> [...]
>>>
>>>>>> User space driver(E.G DPDK) also can enable/disable
>>>>>> IOVA for device dynamically.
>>>>>
>>>>> Could you provide more detailed (or any pointer) on how to do that? I
>>>>> did try to find it myself, I see an VFIO_IOMMU_ENABLE ioctl, but looks
>>>>> like it is for ppc only.
>>>>
>>>> No, I just give an example that user space may do that but no more
>>>> research. But since Qemu already can enable device's IOVA, other user
>>>> application also should can do that with the same VFIO interface, right?
>>>
>>> AFAIU we can't do that at least on x86. We can use vfio interface to
>>> bind group into container, but we should not be able to dynamically
>>> disable IOMMU protection. IIUC That needs to taint the kernel.
>>>
>>> The only way I know is that we probe vfio-pci with no-iommu mode, in
>>> that case, we disabled IOMMU, but we can never dynamically enable it
>>> as well.
>>>
>>> Please correct me if I am wrong.
>>
>> Actually, disabling a device's IOVA doesn't require disabling the
>> kernel's global DMA protection; it just clears the device's VT-d
>> context entry in the context table. Going through the IOMMU and VFIO
>> code, I find this happens when the VFIO_GROUP_UNSET_CONTAINER ioctl is
>> called, which happens when a VM is destroyed or an assigned device is
>> unplugged in Qemu. Please help to double check.
>>
>> Call trace:
>> __vfio_group_unset_container()
>>     vfio_iommu_type1_detach_group()
>>         iommu_detach_group()
>>             dmar_remove_one_dev_info()
>>                 __dmar_remove_one_dev_info()
>>                     domain_context_clear()
>>
>> The legacy KVM device assignment code will also call
>> iommu_detach_device() when deassigning a device.
>>
>> From the device emulation view, we need to ensure correct register
>> emulation regardless of guest behavior.
>
> Even if the context entry is cleared and invalidated, IMHO it does not
> mean that we should be using GPA address space, nor do we need to put
> it into guest physical address space. Instead, it simply means this
> device cannot do any IO at that time. If IO comes, IOMMU should do
> fault reporting to guest OS, which should be treated as error.

Yes, that looks right, and there will be a fault event reported by the
pIOMMU if the context entry is not present for an untranslated DMA
request. This goes back to the first gap: reporting pIOMMU fault events
to the vIOMMU.

For disabling DMA translation by clearing the gcmd.TE bit, the assigned
device should still work after the clearing operation, so it's still
necessary to restore the GPA->HPA mapping, since we can't assume the
guest won't clear the bit after enabling DMA translation. This may be a
low-priority gap since the Linux IOMMU driver doesn't disable DMA
translation frequently or dynamically, but we should still consider the
situation.

> So I think we are emulating the correct guest behavior here - we don't
> need to do anything if a device is detached from an existing IOMMU
> domain in guest. If we do (e.g., we replay the GPA address space on
> that device when it is detached, so the shadow page table for that
> device maps the whole guest memory), that is dangerous, because with
> that the device can DMA to anywhere it wants to guest memory.

If the guest wants to disable DMA translation, this is expected
behavior and the device model should follow the guest configuration.
This is just like most distributions not enabling VT-d DMA translation
by default; it's the OS's choice.

-- 
Best regards
Tianyu Lan
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On 2016年12月06日 18:59, Peter Xu wrote:
> On Tue, Dec 06, 2016 at 04:27:39PM +0800, Lan Tianyu wrote:
>
> [...]
>
>>>
>>>> User space driver(E.G DPDK) also can enable/disable
>>>> IOVA for device dynamically.
>>>
>>> Could you provide more detailed (or any pointer) on how to do that? I
>>> did try to find it myself, I see an VFIO_IOMMU_ENABLE ioctl, but looks
>>> like it is for ppc only.
>>
>> No, I just give an example that user space may do that but no more
>> research. But since Qemu already can enable device's IOVA, other user
>> application also should can do that with the same VFIO interface, right?
>
> AFAIU we can't do that at least on x86. We can use vfio interface to
> bind group into container, but we should not be able to dynamically
> disable IOMMU protection. IIUC That needs to taint the kernel.
>
> The only way I know is that we probe vfio-pci with no-iommu mode, in
> that case, we disabled IOMMU, but we can never dynamically enable it
> as well.
>
> Please correct me if I am wrong.

Actually, disabling a device's IOVA doesn't require disabling the
kernel's global DMA protection; it just clears the device's VT-d context
entry in the context table. Going through the IOMMU and VFIO code, I
find this happens when the VFIO_GROUP_UNSET_CONTAINER ioctl is called,
which happens when a VM is destroyed or an assigned device is unplugged
in Qemu. Please help to double check.

Call trace:
__vfio_group_unset_container()
    vfio_iommu_type1_detach_group()
        iommu_detach_group()
            dmar_remove_one_dev_info()
                __dmar_remove_one_dev_info()
                    domain_context_clear()

The legacy KVM device assignment code will also call
iommu_detach_device() when deassigning a device.

From the device emulation view, we need to ensure correct register
emulation regardless of guest behavior.

>
> Thanks,
>
> -- peterx
>

-- 
Best regards
Tianyu Lan
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On 2016年12月06日 15:22, Peter Xu wrote:
> On Tue, Dec 06, 2016 at 03:06:20PM +0800, Lan Tianyu wrote:
>> On 2016年12月06日 14:51, Peter Xu wrote:
>>> On Tue, Dec 06, 2016 at 02:30:24PM +0800, Lan Tianyu wrote:
>>>
>>> [...]
>>>
>>>>>> 2. How to restore GPA->HPA mapping when IOVA is disabled by guest.
>>>>>> When guest enables IOVA for device, vIOMMU will invalidate all previous
>>>>>> GPA->HPA mapping and update IOVA->HPA mapping to pIOMMU via iommu
>>>>>> notifier. But if IOVA is disabled, I think we should restore GPA->HPA
>>>>>> mapping for the device otherwise the device won't work again in the VM.
>>>>>
>>>>> If we can have a workable replay mechanism, this problem will be
>>>>> solved IMHO.
>>>>
>>>> Basic idea is to replay related memory regions to restore GPA->HPA
>>>> mapping when guest disables IOVA.
>>>
>>> Btw, could you help elaborate in what case we will trigger such a
>>> condition? Or, can we dynamically enable/disable an IOMMU?
>>
>> First case I think of is nest virtualization and we assign physical
>> device to l2 guest.
>
> If we assign physical device to l2 guest, then it will belongs to an
> IOMMU domain in either l1 guest and host, right? (assuming we are
> using vfio-pci to do device assignment)

Yes.

>
>> User space driver(E.G DPDK) also can enable/disable
>> IOVA for device dynamically.
>
> Could you provide more detailed (or any pointer) on how to do that? I
> did try to find it myself, I see an VFIO_IOMMU_ENABLE ioctl, but looks
> like it is for ppc only.

No, I just gave an example that a user space application might do that,
but I haven't researched it further. But since Qemu can already enable a
device's IOVA, other user applications should also be able to do that
with the same VFIO interface, right?

-- 
Best regards
Tianyu Lan
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On 2016年12月06日 14:51, Peter Xu wrote:
> On Tue, Dec 06, 2016 at 02:30:24PM +0800, Lan Tianyu wrote:
>
> [...]
>
>>>> 2. How to restore GPA->HPA mapping when IOVA is disabled by guest.
>>>> When guest enables IOVA for device, vIOMMU will invalidate all previous
>>>> GPA->HPA mapping and update IOVA->HPA mapping to pIOMMU via iommu
>>>> notifier. But if IOVA is disabled, I think we should restore GPA->HPA
>>>> mapping for the device otherwise the device won't work again in the VM.
>>>
>>> If we can have a workable replay mechanism, this problem will be
>>> solved IMHO.
>>
>> Basic idea is to replay related memory regions to restore GPA->HPA
>> mapping when guest disables IOVA.
>
> Btw, could you help elaborate in what case we will trigger such a
> condition? Or, can we dynamically enable/disable an IOMMU?

The first case I can think of is nested virtualization, where we assign
a physical device to an L2 guest. A user space driver (e.g. DPDK) can
also enable/disable IOVA for a device dynamically.

-- 
Best regards
Tianyu Lan
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On 2016年12月02日 14:52, Peter Xu wrote:
> On Thu, Dec 01, 2016 at 02:44:14PM +0800, Lan Tianyu wrote:
>
> [...]
>
>> Hi:
>> I think there are still other gaps to enable passthough device with
>> vIOMMU's DMA translation support.
>>
>> 1. Since this patchset is to shadow guest IO page table to
>> pIOMMU(physical IOMMU) vfio_dma_map/umap(), there will be some fault
>> events from pIOMMU if guest os does misconfigurations. We should report
>> these fault events to guest. This means we need to pass the fault event
>> from pIOMMU driver to vIOMMU in Qemu. I suppose a channel in VFIO should
>> be added to connect pIOMMU and vIOMMU.
>
> Thanks for raising this up - IMHO this is a good question.
>
>>
>> The task should be divided into three parts
>> 1) pIOMMU driver reports fault events for vIOMMU via new VFIO interface
>
> Here you mean "how host kernel capture the DMAR fault", right?

Yes, the VFIO kernel driver should know about the fault event when it's
triggered.

> IMHO We can have something like notifier/notifiee as well in DMAR
> fault reporting - people (like vfio driver in kernel) can register to
> fault reports related to specific device. When DMAR receives faults
> for those devices, it triggers the notification list, then vfio can be
> notified.

Yes, something like this.

>> 2) Add new channel in VFIO subsystem to connect pIOMMU driver and
>> vIOMMU in Qemu
>> 3) vIOMMU in Qemu get fault event from VFIO subsystem in Qemu and inject
>> virtual fault event to guest.
>>
>> Such VFIO channel is also required by device's PRS(Page Request
>> Services) support. This is also a part of SVM(Shared virtual memory)
>> support in VM. Here is SVM design doc link.
>> http://marc.info/?l=kvm&m=148049586514120&w=2
>>
>> 2. How to restore GPA->HPA mapping when IOVA is disabled by guest.
>> When guest enables IOVA for device, vIOMMU will invalidate all previous
>> GPA->HPA mapping and update IOVA->HPA mapping to pIOMMU via iommu
>> notifier. But if IOVA is disabled, I think we should restore GPA->HPA
>> mapping for the device otherwise the device won't work again in the VM.
>
> If we can have a workable replay mechanism, this problem will be
> solved IMHO.

The basic idea is to replay the related memory regions to restore the
GPA->HPA mapping when the guest disables IOVA.

-- 
Best regards
Tianyu Lan
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On 12/3/2016 1:30 AM, Alex Williamson wrote:
> On Fri, 2 Dec 2016 14:08:59 +0800
> Peter Xu wrote:
>
>> On Thu, Dec 01, 2016 at 04:27:52PM +0800, Lan Tianyu wrote:
>>> On 2016年11月30日 17:23, Peter Xu wrote:
>>>> On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote:
>>>>> * intel_iommu's replay op is not implemented yet (May come in
>>>>> different patch set).
>>>>> The replay function is required for hotplug vfio device and to move
>>>>> devices between existing domains.
>>>>
>>>> I am thinking about this replay thing recently and now I start to
>>>> doubt whether the whole vt-d vIOMMU framework suites this...
>>>>
>>>> Generally speaking, current work is throwing away the IOMMU "domain"
>>>> layer here. We maintain the mapping only per device, and we don't care
>>>> too much about which domain it belongs. This seems problematic.
>>>>
>>>> A simplest wrong case for this is (let's assume cache-mode is
>>>> enabled): if we have two assigned devices A and B, both belong to the
>>>> same domain 1. Meanwhile, in domain 1 assume we have one mapping which
>>>> is the first page (iova range 0-0xfff). Then, if guest wants to
>>>> invalidate the page, it'll notify VT-d vIOMMU with an invalidation
>>>> message. If we do this invalidation per-device, we'll need to UNMAP
>>>> the region twice - once for A, once for B (if we have more devices, we
>>>> will unmap more times), and we can never know we have done duplicated
>>>> work since we don't keep domain info, so we don't know they are using
>>>> the same address space. The first unmap will work, and then we'll
>>>> possibly get some errors on the rest of dma unmap failures.
>>>
>>> Hi Peter:
>>
>> Hi, Tianyu,
>>
>>> According VTD spec 6.2.2.1, "Software must ensure that, if multiple
>>> context-entries (or extended-context-entries) are programmed
>>> with the same Domain-id (DID), such entries must be programmed with same
>>> value for the secondlevel page-table pointer (SLPTPTR) field, and same
>>> value for the PASID Table Pointer (PASIDTPTR) field.".
>>>
>>> So if two assigned device may have different IO page table, they should
>>> be put into different domains.
>>
>> By default, devices will be put into different domains. However it
>> should be legal that we put two assigned devices into the same IOMMU
>> domain (in the guest), right? And we should handle both cases well
>> IMHO.
>>
>> Actually I just wrote a tool to do it based on vfio-pci:
>>
>> https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c
>>
>> If we run this tool in the guest with parameter like:
>>
>> ./vfio-bind-groups 00:03.0 00:04.0
>>
>> It'll create one single domain, and put PCI device 00:03.0, 00:04.0
>> into the same IOMMU domain.
>
> On the host though, I'd expect we still have separate IOMMU domains,
> one for each device and we do DMA_{UN}MAP ioctls separately per
> container. Thanks,

Agreed. The guest may use different IO page tables for multiple assigned
devices, which requires putting the assigned devices into different VT-d
domains on the host. I think we can't put assigned devices into the same
VT-d domain before enabling dynamic containers.
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On 2016年11月30日 17:23, Peter Xu wrote:
> On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote:
>> * intel_iommu's replay op is not implemented yet (May come in different
>> patch set).
>> The replay function is required for hotplug vfio device and to move
>> devices between existing domains.
>
> I am thinking about this replay thing recently and now I start to
> doubt whether the whole vt-d vIOMMU framework suites this...
>
> Generally speaking, current work is throwing away the IOMMU "domain"
> layer here. We maintain the mapping only per device, and we don't care
> too much about which domain it belongs. This seems problematic.
>
> A simplest wrong case for this is (let's assume cache-mode is
> enabled): if we have two assigned devices A and B, both belong to the
> same domain 1. Meanwhile, in domain 1 assume we have one mapping which
> is the first page (iova range 0-0xfff). Then, if guest wants to
> invalidate the page, it'll notify VT-d vIOMMU with an invalidation
> message. If we do this invalidation per-device, we'll need to UNMAP
> the region twice - once for A, once for B (if we have more devices, we
> will unmap more times), and we can never know we have done duplicated
> work since we don't keep domain info, so we don't know they are using
> the same address space. The first unmap will work, and then we'll
> possibly get some errors on the rest of dma unmap failures.

Hi Peter:
According to the VT-d spec 6.2.2.1, "Software must ensure that, if
multiple context-entries (or extended-context-entries) are programmed
with the same Domain-id (DID), such entries must be programmed with same
value for the secondlevel page-table pointer (SLPTPTR) field, and same
value for the PASID Table Pointer (PASIDTPTR) field.".

So if two assigned devices may have different IO page tables, they
should be put into different domains.

>
> Looks like we just cannot live without knowing this domain layer.
> Because the address space is binded to the domain. If we want to sync
> the address space (here to setup a correct shadow page table), we need
> to do it per-domain.
>
> What I can think of as a solution is that we introduce this "domain"
> layer - like a memory region per domain. When invalidation happens,
> it's per-domain, not per-device any more (actually I guess that's what
> current vt-d iommu driver in kernel is doing, we just ignored it - we
> fetch the devices that matches the domain ID). We can/need to maintain
> something different, like sid <-> domain mappings (we can do this as
> long as we are notified when context entries changed), per-domain
> mappings (just like per-device mappings that we are trying to build in
> this series, but what we really need is IMHO per domain one), etc.
> When device switches domain, we switch the IOMMU memory region
> accordingly.
>
> Does this make any sense? Comments are greatly welcomed (especially
> from AlexW and DavidG).
>
> Thanks,
>
> -- peterx
>

-- 
Best regards
Tianyu Lan
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On 2016年12月01日 12:21, Tian, Kevin wrote:
>> I am thinking about this replay thing recently and now I start to
>> doubt whether the whole vt-d vIOMMU framework suites this...
>>
>> Generally speaking, current work is throwing away the IOMMU "domain"
>> layer here. We maintain the mapping only per device, and we don't care
>> too much about which domain it belongs. This seems problematic.
>>
>> A simplest wrong case for this is (let's assume cache-mode is
>> enabled): if we have two assigned devices A and B, both belong to the
>> same domain 1. Meanwhile, in domain 1 assume we have one mapping which
>> is the first page (iova range 0-0xfff). Then, if guest wants to
>> invalidate the page, it'll notify VT-d vIOMMU with an invalidation
>> message. If we do this invalidation per-device, we'll need to UNMAP
>> the region twice - once for A, once for B (if we have more devices, we
>> will unmap more times), and we can never know we have done duplicated
>> work since we don't keep domain info, so we don't know they are using
>> the same address space. The first unmap will work, and then we'll
>> possibly get some errors on the rest of dma unmap failures.
>
> Tianyu and I discussed there is a bigger problem: today VFIO assumes
> only one address space per container, which is fine w/o vIOMMU (all
> devices in same container share same GPA->HPA translation). However
> it's not the case when vIOMMU is enabled, because guest Linux
> implements per-device IOVA space. If a VFIO container includes
> multiple devices, it means multiple address spaces required per
> container...

Hi All:
Some updates about the relationship between assigned devices and
containers.

If vIOMMU is disabled, all assigned devices will use the global
address_space_memory as their address space (for detail, please see
pci_device_iommu_address_space()). VFIO creates containers according to
address space, so all assigned devices will be put into one container.

If vIOMMU is enabled, the Intel vIOMMU will allocate a separate address
space for each assigned device, and VFIO will then create a different
container for each assigned device. In other words, it will be one
assigned device per container when vIOMMU is enabled, so the original
concern won't take place.

This is my understanding; please correct me if something is wrong.

-- 
Best regards
Tianyu Lan
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On 2016年11月28日 23:51, Aviv B.D wrote: > From: "Aviv Ben-David" > > * Advertize Cache Mode capability in iommu cap register. > This capability is controlled by "cache-mode" property of intel-iommu > device. > To enable this option call QEMU with "-device intel-iommu,cache-mode=true". > > * On page cache invalidation in intel vIOMMU, check if the domain belong to > registered notifier, and notify accordingly. > > Currently this patch still doesn't enabling VFIO devices support with vIOMMU > present. Current problems: > * vfio_iommu_map_notify is not aware about memory range belong to specific > VFIOGuestIOMMU. > * intel_iommu's replay op is not implemented yet (May come in different patch > set). > The replay function is required for hotplug vfio device and to move devices > between existing domains. > > Changes from v1 to v2: > * remove assumption that the cache do not clears > * fix lockup on high load. > > Changes from v2 to v3: > * remove debug leftovers > * split to sepearate commits > * change is_write to flags in vtd_do_iommu_translate, add IOMMU_NO_FAIL > to suppress error propagating to guest. > > Changes from v3 to v4: > * Add property to intel_iommu device to control the CM capability, > default to False. > * Use s->iommu_ops.notify_flag_changed to register notifiers. > > Changes from v4 to v4 RESEND: > * Fix codding style pointed by checkpatch.pl script. > > Changes from v4 to v5: > * Reduce the number of changes in patch 2 and make flags real bitfield. > * Revert deleted debug prints. > * Fix memory leak in patch 3. > > Changes from v5 to v6: > * fix prototype of iommu_translate function for more IOMMU types. > * VFIO will be notified only on the difference, without unmap > before change to maps. > > Changes from v6 to v7: > * Add replay op to iommu_ops, with current behavior as default implementation > (Patch 4). > * Add stub to disable VFIO with intel_iommu support (Patch 5). > * Cosmetic changes to other patches. 
>
> Aviv Ben-David (5):
>   IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to
>     guest
>   IOMMU: change iommu_op->translate's is_write to flags, add support to
>     NO_FAIL flag mode
>   IOMMU: enable intel_iommu map and unmap notifiers
>   IOMMU: add specific replay function with default implemenation
>   IOMMU: add specific null implementation of iommu_replay to intel_iommu
>
>  exec.c                         |   3 +-
>  hw/alpha/typhoon.c             |   2 +-
>  hw/i386/amd_iommu.c            |   4 +-
>  hw/i386/intel_iommu.c          | 165 ++++++++++++++++++++++++++-------
>  hw/i386/intel_iommu_internal.h |   3 +
>  hw/pci-host/apb.c              |   2 +-
>  hw/ppc/spapr_iommu.c           |   2 +-
>  hw/s390x/s390-pci-bus.c        |   2 +-
>  include/exec/memory.h          |  10 ++-
>  include/hw/i386/intel_iommu.h  |  11 +++
>  memory.c                       |  11 ++-
>  11 files changed, 177 insertions(+), 38 deletions(-)
>

Hi:
I think there are still other gaps to enable a passthrough device with
the vIOMMU's DMA translation support.

1. Since this patchset shadows the guest IO page table to the pIOMMU
(physical IOMMU) via vfio_dma_map/unmap(), there will be some fault
events from the pIOMMU if the guest OS does misconfigurations. We should
report these fault events to the guest. This means we need to pass the
fault event from the pIOMMU driver to the vIOMMU in Qemu. I suppose a
channel in VFIO should be added to connect the pIOMMU and vIOMMU.

The task should be divided into three parts:
1) pIOMMU driver reports fault events for vIOMMU via a new VFIO
   interface
2) Add a new channel in the VFIO subsystem to connect the pIOMMU driver
   and the vIOMMU in Qemu
3) The vIOMMU in Qemu gets the fault event from the VFIO subsystem in
   Qemu and injects a virtual fault event into the guest.

Such a VFIO channel is also required by a device's PRS (Page Request
Services) support. This is also a part of SVM (Shared Virtual Memory)
support in a VM. Here is the SVM design doc link:
http://marc.info/?l=kvm&m=148049586514120&w=2

2. How to restore the GPA->HPA mapping when IOVA is disabled by the
guest. When the guest enables IOVA for a device, the vIOMMU will
invalidate all previous GPA->HPA mappings and update the IOVA->HPA
mapping to the pIOMMU via the iommu notifier. But if IOVA is disabled, I
think we should restore the GPA->HPA mapping for the device, otherwise
the device won't work again in the VM.

Another option to enable a passthrough device with the Intel vIOMMU:
we may prevent the guest from enabling the DMA translation function via
gcmd, with the vIOMMU never setting the "Translation Enable Status" bit
in the gsts register when there is a passthrough device. There are the
kernel parameter "intel_iommu=on" and the Kconfig option
CONFIG_INTEL_IOMMU_DEFAULT_ON to enable DMA translation for the Intel
IOMMU driver, but they aren't set by default in the RHEL, SLES, Oracle
and Ubuntu distributions. The iommu driver will trigger a kernel panic
when it fails to enable DMA translation via gcmd, with the log "DMAR
hardware is malfunctioning".

-- 
Best regards
Tianyu Lan
Re: [Qemu-devel] [PATCH for-2.8 00/18] pc: q35: x2APIC support in kvm_apic mode
On 2016年09月23日 13:26, Peter Xu wrote: > On Thu, Sep 22, 2016 at 12:34:36PM +0800, Chao Gao wrote: >> Hi, we had 3 problems left here. >> 1. IRQremapping can't work with x2apic_cluster mode. >> 2. apic_id > 255 can't receive devices interrupts. >> 3. windows crash when present IRQremapping capability to it. > Hi Peter: Thanks for your information. We just want to make sure these issues aren't ignored. > For (3), I don't know whether it's urgent or not - I've put it into my > todo list (assuming this is for HPC and mostly we are using Linux > guests?). > > For (1-2), again we may need to wait for Radim's patches.
Re: [Qemu-devel] [PATCH] Qemu/KVM: Remove x2apic feature from CPU model when kernel_irqchip is off
On 3/1/2016 10:00 PM, Eduardo Habkost wrote:
> On Mon, Feb 29, 2016 at 10:56:04AM +0800, Lan Tianyu wrote:
>> On 2016年02月27日 03:54, Eduardo Habkost wrote:
>>> On Thu, Feb 25, 2016 at 11:15:12PM +0800, Lan Tianyu wrote:
>>>> x2apic feature is in the kvm_default_props and automatically added to
>>>> all CPU models when KVM is enabled. But userspace devices don't
>>>> support x2apic which can't be enabled without the in-kernel irqchip.
>>>> It will trigger warning of "host doesn't support requested feature:
>>>> CPUID.01H:ECX.x2apic [bit 21]" when kernel_irqchip is off. This patch
>>>> is to fix it via removing x2apic feature when kernel_irqchip is off.
>>>>
>>>> Signed-off-by: Lan Tianyu
>>>> ---
>>>>  target-i386/cpu.c | 4 ++++
>>>>  1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
>>>> index c78f824..298fb62 100644
>>>> --- a/target-i386/cpu.c
>>>> +++ b/target-i386/cpu.c
>>>> @@ -2125,6 +2125,10 @@ static void x86_cpu_load_def(X86CPU *cpu, X86CPUDefinition *def, Error **errp)
>>>>
>>>>      /* Special cases not set in the X86CPUDefinition structs: */
>>>>      if (kvm_enabled()) {
>>>> +        if (!kvm_irqchip_in_kernel()) {
>>>> +            x86_cpu_change_kvm_default("x2apic", "off");
>>>
>>> This should be NULL instead of "off".
>>
>> I tried "NULL" before. But some CPU models (e.g. SandyBridge,
>> IvyBridge, Haswell) already have the x2apic feature in their default
>> features in struct X86CPUDefinition; passing "NULL" only avoids adding
>> the x2apic feature to the CPU model and doesn't help to remove it for
>> these CPU models. So I changed "NULL" to "off".
>
> In this case, I suggest we remove x2apic from these CPU models to avoid
> confusion, as the presence of the flag in the table makes no difference
> at all (this can be done in a separate patch). If we do that, NULL and
> "off" would have the same results, but NULL is clearer, IMO. NULL
> simply disables the KVM-specific hack (so we don't touch the property
> anymore), but "off" adds a new TCG-specific hack. I will submit that as
> a follow-up patch.

Yes, that sounds reasonable. Thanks for your review.

> Reviewed-by: Eduardo Habkost
>
>>> Otherwise, the warning will be disabled if using "-cpu ...,+x2apic".
>>
>> kvm_arch_get_supported_cpuid() always returns no x2apic support when
>> kernel_irqchip is off and so it still triggers the warning with
>> "-cpu ...,+x2apic".
>>
>> #qemu-system-x86_64 -cpu qemu64,+x2apic -machine kernel-irqchip=off
>> warning: TCG doesn't support requested feature: CPUID.01H:ECX.x2apic
>> [bit 21]
>
> You are right. The +x2apic flag is applied after x86_cpu_load_def()
> runs. My mistake.
Re: [Qemu-devel] [PATCH] Qemu/KVM: Remove x2apic feature from CPU model when kernel_irqchip is off
On 2016年02月27日 03:54, Eduardo Habkost wrote: > On Thu, Feb 25, 2016 at 11:15:12PM +0800, Lan Tianyu wrote: >> x2apic feature is in the kvm_default_props and automatically added to all >> CPU models when KVM is enabled. But userspace devices don't support x2apic >> which can't be enabled without the in-kernel irqchip. It will trigger >> warning of "host doesn't support requested feature: CPUID.01H:ECX.x2apic >> [bit 21]" when kernel_irqchip is off. This patch is to fix it via removing >> x2apic feature when kernel_irqchip is off. >> >> Signed-off-by: Lan Tianyu >> --- >> target-i386/cpu.c | 4 >> 1 file changed, 4 insertions(+) >> >> diff --git a/target-i386/cpu.c b/target-i386/cpu.c >> index c78f824..298fb62 100644 >> --- a/target-i386/cpu.c >> +++ b/target-i386/cpu.c >> @@ -2125,6 +2125,10 @@ static void x86_cpu_load_def(X86CPU *cpu, >> X86CPUDefinition *def, Error **errp) >> >> /* Special cases not set in the X86CPUDefinition structs: */ >> if (kvm_enabled()) { >> +if (!kvm_irqchip_in_kernel()) { >> +x86_cpu_change_kvm_default("x2apic", "off"); > > This should be NULL instead of "off". I tried "NULL" before. But some cpus modules(E,G SandyBridge, IvyBridge, haswell) already have x2apic feature in their default features of struct X86CPUDefinition, passing "NULL" is not to add x2apic feature to the cpu module and will not help to remove x2apic feature for these cpu modules. So I changed "NULL" to "off". > Otherwise, the warning will > be disabled if using "-cpu ...,+x2apic". > kvm_arch_get_supported_cpuid() always returns no x2apic support when kernel_irqchip is off and so it still triggers warning with "-cpu ...,+x2apic". #qemu-system-x86_64 -cpu qemu64,+x2apic -machine kernel-irqchip=off warning: TCG doesn't support requested feature: CPUID.01H:ECX.x2apic [bit 21] >> +} >> + > > This function runs multiple times (once for each VCPU being > created). Should be harmless, though. 
> Eventually we could move the whole kvm_default_props logic outside
> cpu.c, to a KVM-x86 accel class or to common PC initialization code.

Good suggestion and we can try that later.

>>      x86_cpu_apply_props(cpu, kvm_default_props);
>>  }
>>
>> --
>> 1.9.3

--
Best regards
Tianyu Lan
[Qemu-devel] [PATCH] Qemu/KVM: Remove x2apic feature from CPU model when kernel_irqchip is off
x2apic feature is in the kvm_default_props and automatically added to all CPU models when KVM is enabled. But userspace devices don't support x2apic, which can't be enabled without the in-kernel irqchip. It will trigger the warning "host doesn't support requested feature: CPUID.01H:ECX.x2apic [bit 21]" when kernel_irqchip is off. This patch fixes it by removing the x2apic feature when kernel_irqchip is off.

Signed-off-by: Lan Tianyu
---
 target-i386/cpu.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index c78f824..298fb62 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -2125,6 +2125,10 @@ static void x86_cpu_load_def(X86CPU *cpu, X86CPUDefinition *def, Error **errp)
 
     /* Special cases not set in the X86CPUDefinition structs: */
     if (kvm_enabled()) {
+        if (!kvm_irqchip_in_kernel()) {
+            x86_cpu_change_kvm_default("x2apic", "off");
+        }
+
         x86_cpu_apply_props(cpu, kvm_default_props);
     }
 
-- 
1.9.3
Re: [Qemu-devel] kvm: "warning: host doesn't support requested feature: CPUID.01H:ECX.x2apic [bit 21]"
2016-02-25 16:39 GMT+08:00 Jan Kiszka : > On 2016-02-25 09:33, Lan Tianyu wrote: >> 2016-02-20 17:00 GMT+08:00 Paolo Bonzini : >>> >>> >>> - Original Message - >>>> From: "Jan Kiszka" >>>> To: "Eduardo Habkost" , "Paolo Bonzini" >>>> >>>> Cc: "qemu-devel" , "kvm" >>>> Sent: Saturday, February 20, 2016 9:09:32 AM >>>> Subject: kvm: "warning: host doesn't support requested feature: >>>> CPUID.01H:ECX.x2apic [bit 21]" >>>> >>>> Hi all, >>>> >>>> I suppose 5120901a37 introduced this: qemu with kernel_irqchip=off now >>>> generates these warnings, one per VCPU, during QEMU startup. Is the plan >>>> to live with them until we finally have x2APIC emulation in userspace >>>> (ie. also MSR vmexiting to there), or should we otherwise avoid it? >>> >>> I think it's a bug, x2apic should be auto-suppressed with >>> kernel_irqchip=off. >>> >> >> The patch is to fix the issue. >> ->8 >> From 58f2a3a94c8e7bf9f3474bcafb6c59cc4f8bcbd9 Mon Sep 17 00:00:00 2001 >> From: Lan Tianyu >> Date: Sun, 15 Jul 2001 01:40:17 -0400 >> Subject: [PATCH] Qemu/KVM: Remove x2apic feature from CPU model when >> kernel_irqchip=off >> >> x2apic feature is in the kvm_default_props and automatically added to all >> CPU models when KVM is enabled regardless of kernel_irqchip=off. This will >> trigger "warning: host doesn't support requested feature: CPUID.01H: >> ECX.x2apic [bit 21]" when kernel_irqchip=off. This patch is to remove x2apic >> feature when kernel_irqchip=off. > > We know this, but it's probably worth to mention the underlying reason > here: userspace devices don't support x2APIC. 
> >> >> Signed-off-by: Lan Tianyu >> --- >> target-i386/cpu.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/target-i386/cpu.c b/target-i386/cpu.c >> index 0d447b5..2ec7eb7 100644 >> --- a/target-i386/cpu.c >> +++ b/target-i386/cpu.c >> @@ -2105,6 +2105,9 @@ static void x86_cpu_load_def(X86CPU *cpu, >> X86CPUDefinition *def, Error **errp) >> >> /* Special cases not set in the X86CPUDefinition structs: */ >> if (kvm_enabled()) { >> + if (!kvm_irqchip_in_kernel()) >> + x86_cpu_change_kvm_default("x2apic", "off"); >> + >> x86_cpu_apply_props(cpu, kvm_default_props); >> } >> >> -- >> 1.9.3 >> > > Make sure to comply with the coding style (there is a checkpatch.pl also > in QEMU). And please post as a new thread with proper subject, otherwise > people (and tools) will not find your patch as such. > > Jan Thanks for comments. Will update soon.. -- Best regards Tianyu Lan
Re: [Qemu-devel] kvm: "warning: host doesn't support requested feature: CPUID.01H:ECX.x2apic [bit 21]"
2016-02-20 17:00 GMT+08:00 Paolo Bonzini :
>
> - Original Message -
>> From: "Jan Kiszka"
>> To: "Eduardo Habkost" , "Paolo Bonzini"
>> Cc: "qemu-devel" , "kvm"
>> Sent: Saturday, February 20, 2016 9:09:32 AM
>> Subject: kvm: "warning: host doesn't support requested feature:
>> CPUID.01H:ECX.x2apic [bit 21]"
>>
>> Hi all,
>>
>> I suppose 5120901a37 introduced this: qemu with kernel_irqchip=off now
>> generates these warnings, one per VCPU, during QEMU startup. Is the plan
>> to live with them until we finally have x2APIC emulation in userspace
>> (ie. also MSR vmexiting to there), or should we otherwise avoid it?
>
> I think it's a bug, x2apic should be auto-suppressed with kernel_irqchip=off.

The patch is to fix the issue.

->8
From 58f2a3a94c8e7bf9f3474bcafb6c59cc4f8bcbd9 Mon Sep 17 00:00:00 2001
From: Lan Tianyu
Date: Sun, 15 Jul 2001 01:40:17 -0400
Subject: [PATCH] Qemu/KVM: Remove x2apic feature from CPU model when
 kernel_irqchip=off

x2apic feature is in the kvm_default_props and automatically added to all CPU models when KVM is enabled, regardless of kernel_irqchip=off. This will trigger "warning: host doesn't support requested feature: CPUID.01H:ECX.x2apic [bit 21]" when kernel_irqchip=off. This patch is to remove the x2apic feature when kernel_irqchip=off.

Signed-off-by: Lan Tianyu
---
 target-i386/cpu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 0d447b5..2ec7eb7 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -2105,6 +2105,9 @@ static void x86_cpu_load_def(X86CPU *cpu, X86CPUDefinition *def, Error **errp)
 
     /* Special cases not set in the X86CPUDefinition structs: */
     if (kvm_enabled()) {
+        if (!kvm_irqchip_in_kernel())
+            x86_cpu_change_kvm_default("x2apic", "off");
+
         x86_cpu_apply_props(cpu, kvm_default_props);
     }
 
-- 
1.9.3
Re: [Qemu-devel] live migration vs device assignment (motivation)
On 2015年12月30日 00:46, Michael S. Tsirkin wrote:
> Interesting. So you are saying merely ifdown/ifup is 100ms?
> This does not sound reasonable.
> Is there a chance you are e.g. getting IP from dhcp?
>
> If so that is wrong - clearly should reconfigure the old IP
> back without playing with dhcp. For testing, just set up
> a static IP.

The MAC and IP are migrated with the VM to the target machine, so there is no need to reconfigure the IP after migration. From my test results, ixgbevf_down() consumes 35ms and ixgbevf_up() consumes 55ms during migration.

--
Best regards
Tianyu Lan
Re: [Qemu-devel] live migration vs device assignment (motivation)
On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote: As long as you keep up this vague talk about performance during migration, without even bothering with any measurements, this patchset will keep going nowhere. I measured network service downtime for "keep device alive" (RFC patch V1 presented) and "put down and up network interface" (RFC patch V2 presented) during migration with some optimizations. The former is around 140ms and the latter is around 240ms. My patchset relies on the mailbox irq, which doesn't work in the suspend state, so I can't get downtime for the suspend/resume cases. Will try to get the result later. There's Alex's patch that tracks memory changes during migration. It needs some simple enhancements to be useful in production (e.g. add a host/guest handshake to both enable tracking in guest and to detect the support in host), then it can allow starting migration with an assigned device, by invoking hot-unplug after most of memory have been migrated. Please implement this in qemu and measure the speed. Sure. Will do that. I will not be surprised if destroying/creating netdev in linux turns out to take too long, but before anyone bothered checking, it does not make sense to discuss further enhancements.
Re: [Qemu-devel] live migration vs device assignment (motivation)
Merry Christmas. Sorry for the late response due to personal affairs. On 2015年12月14日 03:30, Alexander Duyck wrote: >> > These sounds we need to add a faked bridge for migration and adding a >> > driver in the guest for it. It also needs to extend PCI bus/hotplug >> > driver to do pause/resume other devices, right? >> > >> > My concern is still that whether we can change PCI bus/hotplug like that >> > without spec change. >> > >> > IRQ should be general for any devices and we may extend it for >> > migration. Device driver also can make decision to support migration >> > or not. > The device should have no say in the matter. Either we are going to > migrate or we will not. This is why I have suggested my approach as > it allows for the least amount of driver intrusion while providing the > maximum number of ways to still perform migration even if the device > doesn't support it. Even if the device driver doesn't support migration, do you still want to migrate the VM? That may be risky and we should add the "bad path" for the driver at least. > > The solution I have proposed is simple: > > 1. Extend swiotlb to allow for a page dirtying functionality. > > This part is pretty straight forward. I'll submit a few patches > later today as RFC that can provide the minimal functionality needed > for this. I'd very much appreciate that. > > 2. Provide a vendor specific configuration space option on the QEMU > implementation of a PCI bridge to act as a bridge between direct > assigned devices and the host bridge. > > My thought was to add some vendor specific block that includes a > capabilities, status, and control register so you could go through and > synchronize things like the DMA page dirtying feature. The bridge > itself could manage the migration capable bit inside QEMU for all > devices assigned to it.
So if you added a VF to the bridge it would > flag that you can support migration in QEMU, while the bridge would > indicate you cannot until the DMA page dirtying control bit is set by > the guest. > > We could also go through and optimize the DMA page dirtying after > this is added so that we can narrow down the scope of use, and as a > result improve the performance for other devices that don't need to > support migration. It would then be a matter of adding an interrupt > in the device to handle an event such as the DMA page dirtying status > bit being set in the config space status register, while the bit is > not set in the control register. If it doesn't get set then we would > have to evict the devices before the warm-up phase of the migration, > otherwise we can defer it until the end of the warm-up phase. > > 3. Extend existing shpc driver to support the optional "pause" > functionality as called out in section 4.1.2 of the Revision 1.1 PCI > hot-plug specification. Since your solution has added a faked PCI bridge. Why not notify the bridge directly during migration via irq and call device driver's callback in the new bridge driver? Otherwise, the new bridge driver also can check whether the device driver provides migration callback or not and call them to improve the passthough device's performance during migration. > > Note I call out "extend" here instead of saying to add this. > Basically what we should do is provide a means of quiescing the device > without unloading the driver. This is called out as something the OS > vendor can optionally implement in the PCI hot-plug specification. On > OSes that wouldn't support this it would just be treated as a standard > hot-plug event. We could add a capability, status, and control bit > in the vendor specific configuration block for this as well and if we > set the status bit would indicate the host wants to pause instead of > remove and the control bit would indicate the guest supports "pause" > in the OS. 
We then could optionally disable guest migration while the > VF is present and pause is not supported. > > To support this we would need to add a timer and if a new device > is not inserted in some period of time (60 seconds for example), or if > a different device is inserted, > we need to unload the original driver > from the device. In addition we would need to verify if drivers can > call the remove function after having called suspend without resume. > If not, we could look at adding a recovery function to remove the > driver from the device in the case of a suspend with either a failed > resume or no resume call. Once again it would probably be useful to > have for those cases where power management suspend/resume runs into > an issue like somebody causing a surprise removal while a device was > suspended. -- Best regards Tianyu Lan
Re: [Qemu-devel] live migration vs device assignment (motivation)
On 12/11/2015 1:16 AM, Alexander Duyck wrote: On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu wrote: On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote: Ideally, it is able to leave the guest driver unmodified, but it requires the hypervisor or qemu to be aware of the device, which means we may need a driver in the hypervisor or qemu to handle the device on behalf of the guest driver. Can you answer the question of when do you use your code - at the start of migration or just before the end? Just before stopping the VCPU in this version, and inject the VF mailbox irq to notify the driver if the irq handler is installed. The Qemu side also will check this via the faked PCI migration capability, and the driver will set the status during its device open() or resume() callback. The VF mailbox interrupt is a very bad idea. Really the device should be in a reset state on the other side of a migration. It doesn't make sense to have the interrupt firing if the device is not configured. This is one of the things that is preventing you from being able to migrate the device while the interface is administratively down or the VF driver is not loaded. In my opinion, if the VF driver is not loaded and the hardware hasn't started working, the device state doesn't need to be migrated. We may add a flag for the driver to check whether migration happened while it was down, then reinitialize the hardware and clear the flag when the system tries to bring it up. We may add a migration core in the Linux kernel and provide some helper functions to facilitate adding migration support to drivers. The migration core is in charge of syncing status with Qemu. Example:

migration_register()
The driver provides
- Callbacks to be called before and after migration or for the bad path
- The irq with which it prefers to handle migration events.

migration_event_check()
The driver calls it in the irq handler. Migration core code will check the migration status and call the driver's callbacks when migration happens.
My thought on all this is that it might make sense to move this functionality into a PCI-to-PCI bridge device and make it a requirement that all direct-assigned devices have to exist behind that device in order to support migration. That way you would be working with a directly emulated device that would likely already be supporting hot-plug anyway. Then it would just be a matter of coming up with a few Qemu specific extensions that you would need to add to the device itself. The same approach would likely be portable enough that you could achieve it with PCIe as well via the same configuration space being present on the upstream side of a PCIe port or maybe a PCIe switch of some sort. It would then be possible to signal via your vendor-specific PCI capability on that device that all devices behind this bridge require DMA page dirtying, you could use the configuration in addition to the interrupt already provided for hot-plug to signal things like when you are starting migration, and possibly even just extend the shpc functionality so that if this capability is present you have the option to pause/resume instead of remove/probe the device in the case of certain hot-plug events. The fact is there may be some use for a pause/resume type approach for PCIe hot-plug in the near future anyway. From the sounds of it Apple has required it for all Thunderbolt device drivers so that they can halt the device in order to shuffle resources around, perhaps we should look at something similar for Linux. The other advantage behind grouping functions on one bridge is things like reset domains. The PCI error handling logic will want to be able to reset any devices that experienced an error in the event of something such as a surprise removal. By grouping all of the devices you could disable/reset/enable them as one logical group in the event of something such as the "bad path" approach Michael has mentioned. 
This sounds like we need to add a fake bridge for migration and add a driver in the guest for it. It also needs the PCI bus/hotplug driver to be extended to pause/resume other devices, right? My concern is still whether we can change the PCI bus/hotplug like that without a spec change. The IRQ should be generic for any device and we may extend it for migration. The device driver also can decide whether to support migration or not. It would be great if we could avoid changing the guest; but at least your guest driver changes don't actually seem to be that hardware specific; could your changes actually be moved to generic PCI level so they could be made to work for lots of drivers? It is impossible to use one common solution for all devices unless the PCIe spec documents it clearly, and I think one day it will be there. But before that, we need some workarounds in the guest driver to make it work even if it looks ugly. Yes, so far there is no hardware migration support and it's hard to modify bus-level code. It also will block implementation on Windows.
Re: [Qemu-devel] live migration vs device assignment (motivation)
On 12/11/2015 12:11 AM, Michael S. Tsirkin wrote: On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote: On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote: Ideally, it is able to leave guest driver unmodified but it requires the hypervisor or qemu to aware the device which means we may need a driver in hypervisor or qemu to handle the device on behalf of guest driver. Can you answer the question of when do you use your code - at the start of migration or just before the end? Just before stopping VCPU in this version and inject VF mailbox irq to notify the driver if the irq handler is installed. Qemu side also will check this via the faked PCI migration capability and driver will set the status during device open() or resume() callback. Right, this is the "good path" optimization. Whether this buys anything as compared to just sending reset to the device when VCPU is stopped needs to be measured. In any case, we probably do need a way to interrupt driver on destination to make it reconfigure the device - otherwise it might take seconds for it to notice. And a way to make sure driver can handle this surprise reset so we can block migration if it can't. Yes, we need such a way to notify driver about migration status and do reset or restore operation on the destination machine. My original design is to take advantage of device's irq to do that. Driver can tell Qemu that which irq it prefers to handle such task and whether the irq is enabled or bound with handler. We may discuss the detail in the other thread. It would be great if we could avoid changing the guest; but at least your guest driver changes don't actually seem to be that hardware specific; could your changes actually be moved to generic PCI level so they could be made to work for lots of drivers? It is impossible to use one common solution for all devices unless the PCIE spec documents it clearly and i think one day it will be there. 
But before that, we need some workarounds in the guest driver to make it work even if it looks ugly. Yes, so far there is no hardware migration support VT-D supports setting dirty bit in the PTE in hardware. Actually, this isn't supported by the current hardware. The VT-d spec documents the dirty bit for first-level translation, which requires devices to support DMA requests with PASID (process address space identifier). Most devices don't support the feature. and it's hard to modify bus level code. Why is it hard? As Yang said, the concern is that the PCI spec doesn't document how to do migration. It also will block implementation on Windows. Implementation of what? We are discussing motivation here, not implementation. E.g. windows drivers typically support surprise removal, should you use that, you get some working code for free. Just stop worrying about it. Make it work, worry about closed source software later. Dave
Re: [Qemu-devel] live migration vs device assignment (motivation)
On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote: Ideally, it is able to leave guest driver unmodified but it requires the >hypervisor or qemu to aware the device which means we may need a driver in >hypervisor or qemu to handle the device on behalf of guest driver. Can you answer the question of when do you use your code - at the start of migration or just before the end? Just before stopping VCPU in this version and inject VF mailbox irq to notify the driver if the irq handler is installed. Qemu side also will check this via the faked PCI migration capability and driver will set the status during device open() or resume() callback. > >It would be great if we could avoid changing the guest; but at least your guest > >driver changes don't actually seem to be that hardware specific; could your > >changes actually be moved to generic PCI level so they could be made > >to work for lots of drivers? > >It is impossible to use one common solution for all devices unless the PCIE >spec documents it clearly and i think one day it will be there. But before >that, we need some workarounds on guest driver to make it work even it looks >ugly. Yes, so far there is not hardware migration support and it's hard to modify bus level code. It also will block implementation on the Windows. Dave
Re: [Qemu-devel] live migration vs device assignment (motivation)
On 12/10/2015 4:38 PM, Michael S. Tsirkin wrote: Let's assume you do save state and do have a way to detect whether state matches a given hardware. For example, driver could store firmware and hardware versions in the state, and then on destination, retrieve them and compare. It will be pretty common that you have a mismatch, and you must not just fail migration. You need a way to recover, maybe with more downtime. Second, you can change the driver but you can not be sure it will have the chance to run at all. Host overload is a common reason to migrate out of the host. You also can not trust guest to do the right thing. So how long do you want to wait until you decide guest is not cooperating and kill it? Most people will probably experiment a bit and then add a bit of a buffer. This is not robust at all. Again, maybe you ask driver to save state, and if it does not respond for a while, then you still migrate, and driver has to recover on destination. With the above in mind, you need to support two paths:
1. "good path": driver stores state on source, checks it on destination, detects a match and restores state into the device
2. "bad path": driver does not store state, or detects a mismatch on destination. driver has to assume device was lost, and reset it
So what I am saying is, implement bad path first. Then good path is an optimization - measure whether it's faster, and by how much. These sound reasonable. The driver should have the ability to do such checks to ensure hardware/firmware coherence after migration, and to reset the device when migration happens at some unexpected point. Also, it would be nice if on the bad path there was a way to switch to another driver entirely, even if that means a bit more downtime. For example, have a way for driver to tell Linux it has to re-do probing for the device. Just glanced at the device core code. device_reprobe() does what you said.
/**
 * device_reprobe - remove driver for a device and probe for a new driver
 * @dev: the device to reprobe
 *
 * This function detaches the attached driver (if any) for the given
 * device and restarts the driver probing process. It is intended
 * to use if probing criteria changed during a devices lifetime and
 * driver attachment should change accordingly.
 */
int device_reprobe(struct device *dev)
Re: [Qemu-devel] live migration vs device assignment (motivation)
On 12/10/2015 1:14 AM, Alexander Duyck wrote: On Wed, Dec 9, 2015 at 8:26 AM, Lan, Tianyu wrote: For other kinds of devices, it's hard to make this work. We are also adding migration support for the QAT (QuickAssist Technology) device. QAT device use case introduction: Server, networking, big data, and storage applications use QuickAssist Technology to offload servers from handling compute-intensive operations, such as:
1) Symmetric cryptography functions including cipher operations and authentication operations
2) Public key functions including RSA, Diffie-Hellman, and elliptic curve cryptography
3) Compression and decompression functions including DEFLATE and LZS
PCI hotplug will not work for such devices during migration and these operations will fail when the device is unplugged. I assume the problem is that with a PCI hotplug event you are losing the state information for the device, do I have that right? Looking over the QAT drivers it doesn't seem like any of them support the suspend/resume PM calls. I would imagine it makes it difficult for a system with a QAT card in it to be able to drop the system to a low power state. You might want to try enabling suspend/resume support for the devices on bare metal before you attempt to take on migration as it would provide you with a good testing framework to see what needs to be saved/restored within the device and in what order before you attempt to do the same while migrating from one system to another. Sure. The suspend/resume work is under way. Actually, we have gotten QAT migration working internally; we are doing more testing and fixing bugs. - Alex
Re: [Qemu-devel] live migration vs device assignment (motivation)
On 12/10/2015 4:07 AM, Michael S. Tsirkin wrote: On Thu, Dec 10, 2015 at 12:26:25AM +0800, Lan, Tianyu wrote: On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote: I thought about what this is doing at the high level, and I do have some value in what you are trying to do, but I also think we need to clarify the motivation a bit more. What you are saying is not really what the patches are doing. And with that clearer understanding of the motivation in mind (assuming it actually captures a real need), I would also like to suggest some changes. Motivation: Most current solutions for migration with passthough device are based on the PCI hotplug but it has side affect and can't work for all device. For NIC device: PCI hotplug solution can work around Network device migration via switching VF and PF. This is just more confusion. hotplug is just a way to add and remove devices. switching VF and PF is up to guest and hypervisor. This is a combination. Because it's not able to migrate device state in the current world during migration(What we are doing), Exist solutions of migrating VM with passthough NIC relies on the PCI hotplug. Unplug VF before starting migration and then switch network from VF NIC to PV NIC in order to maintain the network connection. Plug VF again after migration and then switch from PV back to VF. Bond driver provides a way to switch between PV and VF NIC automatically with save IP and MAC and so bond driver is more preferred. But switching network interface will introduce service down time. I tested the service down time via putting VF and PV interface into a bonded interface and ping the bonded interface during plug and unplug VF. 1) About 100ms when add VF 2) About 30ms when del VF OK and what's the source of the downtime? I'm guessing that's just arp being repopulated. So simply save and re-populate it. There would be a much cleaner solution. Or maybe there's a timer there that just delays hotplug for no reason. Fix it, everyone will benefit. 
It also requires the guest to do switch configuration. That's just wrong. if you want a switch, you need to configure a switch. I meant the config for switching between PV and VF. These are hard to manage and deploy for our customers. So kernel want to remain flexible, and the stack is configurable. Downside: customers need to deploy userspace to configure it. Your solution: a hard-coded configuration within kernel and hypervisor. Sorry, this makes no sense. If kernel is easier for you to deploy than userspace, you need to rethink your deployment strategy. This is one factor. To maintain PV performance during migration, the host side also needs to assign a VF to the PV device. This affects scalability. No idea what this means. These factors block SR-IOV NIC passthrough usage in cloud services and OPNFV, which require high network performance and stability. Everyone needs performance and scalability. For other kinds of devices, it's hard to make this work. We are also adding migration support for the QAT (QuickAssist Technology) device. QAT device use case introduction: Server, networking, big data, and storage applications use QuickAssist Technology to offload servers from handling compute-intensive operations, such as:
1) Symmetric cryptography functions including cipher operations and authentication operations
2) Public key functions including RSA, Diffie-Hellman, and elliptic curve cryptography
3) Compression and decompression functions including DEFLATE and LZS
PCI hotplug will not work for such devices during migration and these operations will fail when the device is unplugged. So we are trying to implement a new solution that really migrates device state to the target machine and won't affect users during migration, with low service downtime. Let's assume for the sake of the argument that there's a lot going on and removing the device is just too slow (though you should figure out what's going on before giving up and just building something new from scratch).
No, we can find a PV NIC as backup for a VF NIC during migration, but it doesn't work for other kinds of devices since there is no backup for them. E.g. when migration happens while a user is compressing files via QAT, it's impossible to remove the QAT device at that point. If we do that, the compression operation will fail and affect the user experience. I still don't think you should be migrating state. That's just too fragile, and it also means you depend on driver to be nice and shut down device on source, so you can not migrate at will. Instead, reset device on destination and re-initialize it. Yes, saving and restoring device state relies on the driver, so we rework the driver and make it more migration-friendly.
Re: [Qemu-devel] live migration vs device assignment (motivation)
On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote: I thought about what this is doing at the high level, and I do have some value in what you are trying to do, but I also think we need to clarify the motivation a bit more. What you are saying is not really what the patches are doing. And with that clearer understanding of the motivation in mind (assuming it actually captures a real need), I would also like to suggest some changes. Motivation: Most current solutions for migration with a passthrough device are based on PCI hotplug, but it has side effects and can't work for all devices. For NIC devices: the PCI hotplug solution can work around network device migration by switching between VF and PF. But switching the network interface will introduce service downtime. I tested the service downtime by putting the VF and PV interfaces into a bonded interface and pinging the bonded interface while plugging and unplugging the VF:
1) About 100ms when adding the VF
2) About 30ms when deleting the VF
It also requires the guest to do switch configuration. These are hard to manage and deploy for our customers. To maintain PV performance during migration, the host side also needs to assign a VF to the PV device. This affects scalability. These factors block SR-IOV NIC passthrough usage in cloud services and OPNFV, which require high network performance and stability. For other kinds of devices, it's hard to make this work. We are also adding migration support for the QAT (QuickAssist Technology) device. QAT device use case introduction: Server, networking, big data, and storage applications use QuickAssist Technology to offload servers from handling compute-intensive operations, such as:
1) Symmetric cryptography functions including cipher operations and authentication operations
2) Public key functions including RSA, Diffie-Hellman, and elliptic curve cryptography
3) Compression and decompression functions including DEFLATE and LZS
PCI hotplug will not work for such devices during migration and these operations will fail when the device is unplugged.
So we are trying to implement a new solution that really migrates the device state to the target machine, doesn't disturb the user during migration, and keeps service downtime low.
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 12/9/2015 7:28 PM, Michael S. Tsirkin wrote: I remember reading that it's possible to implement a bus driver on windows if required. But basically I don't see how windows can be relevant to discussing guest driver patches. That discussion probably belongs on the qemu mailing list, not on lkml. I am not sure whether we can write a bus driver for Windows to support migration. But I think device vendors who want to support migration will improve their drivers if we provide a framework in the hypervisor that only requires them to change their driver.
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 12/9/2015 6:37 PM, Michael S. Tsirkin wrote: On Sat, Dec 05, 2015 at 12:32:00AM +0800, Lan, Tianyu wrote: Hi Michael & Alexander: Thanks a lot for your comments and suggestions. It's nice that it's appreciated, but you then go on and ignore all that I have written here: https://www.mail-archive.com/kvm@vger.kernel.org/msg123826.html No, I will reply to it separately and, following your suggestion, split it into 3 threads. We still need to support Windows guests for migration, and this is why our patches keep all changes in the driver, since it's impossible to change the Windows kernel. This is not a reasonable argument. It makes no sense to duplicate code on Linux because you must duplicate code on Windows. Let's assume you must do it in the driver on windows because windows has closed source drivers. What does it matter? Linux can still do it as part of the DMA API and have it apply to all drivers. Sure. Duplicated code should be encapsulated and made reusable by other drivers, just like the dummy-write part you mentioned. I meant that the framework should not require changes to Windows kernel code (such as the PM core or the PCI bus driver), since that would block the implementation on Windows. Duplicating code in the Windows drivers is not a problem. Following is my idea for DMA tracking: inject an event to the VF driver after the memory-iterate stage and before stopping the VCPU, and then have the VF driver mark dirty all memory in use for DMA. Newly allocated pages also need to be marked dirty before stopping the VCPU. All memory dirtied in this time slot will be migrated up to the stop-and-copy stage. We also need to make sure the VF is disabled, by clearing its bus master enable bit, before migrating this memory. The DMA pages allocated by the VF driver also need to reserve space for the dummy write. I suggested ways to do it all in the hypervisor without driver hacks, or hide it within the DMA API without the need to reserve extra space. Both approaches seem much cleaner. This sounds reasonable.
We may discuss the details in a separate thread.
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 12/8/2015 1:12 AM, Alexander Duyck wrote: On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu wrote: On 12/5/2015 1:07 AM, Alexander Duyck wrote: We still need to support Windows guests for migration, and this is why our patches keep all changes in the driver, since it's impossible to change the Windows kernel. That is a poor argument. I highly doubt Microsoft is interested in having to modify all of the drivers that will support direct assignment in order to support migration. They would likely request something similar to what I have, in that they will want a way to do DMA tracking with minimal modification required to the drivers. This totally depends on the NIC or other device vendors, and they should make the decision whether to support migration or not. If yes, they would modify the driver. Having to modify every driver that wants to support live migration is a bit much. In addition I don't see this being limited only to NIC devices. You can direct assign a number of different devices; your solution cannot be specific to NICs. We are also adding such migration support for the QAT device, so our solution will not be limited to NICs. This is just the beginning. We can't limit users to Linux guests only, so the migration feature should work for both Windows and Linux guests. If the target is just to call suspend/resume during migration, the feature will be meaningless. In most cases users don't want to be affected much during migration, so the service downtime is vital. Our target is to apply SR-IOV NIC passthrough to cloud service and NFV (network functions virtualization) projects, which are sensitive to network performance and stability. In my opinion, we should give the device driver a chance to implement its own migration job, and call the suspend and resume callbacks in the driver if it doesn't care about performance during migration. The suspend/resume callbacks should be efficient in terms of time.
After all we don't want the system to stall for a long period of time when it should be either running or asleep. Having it burn cycles in a power state limbo doesn't do anyone any good. If nothing else maybe it will help to push the vendors to speed up those functions, which then benefits migration and the system sleep states. If we can benefit both migration and suspend, that would be wonderful. But migration and system PM are still different. Just as an example, the driver doesn't need to put the device into a deep D-state during migration (the host can do this after migration), while that is essential for system sleep. PCI config space and interrupt configuration are emulated by Qemu, and Qemu can migrate these to the new machine; the driver doesn't need to deal with such things. So I think migration still needs a different callback or a different code path than device suspend/resume. Another concern is that we would have to rework the PM core or the PCI bus driver to call suspend/resume for passthrough devices during migration. This also blocks the feature from working on Windows. Also you keep assuming you can keep the device running while you do the migration and you can't. You are going to corrupt the memory if you do, and you have yet to provide any means to explain how you are going to solve that. The main problem is the DMA tracking issue, and I will repost my solution in a new thread for discussion. If there is no way to mark a DMA page dirty while DMA is enabled, we have to stop DMA for a short time to do that at the last stage. Following is my idea for DMA tracking: inject an event to the VF driver after the memory-iterate stage and before stopping the VCPU, and then have the VF driver mark dirty all memory in use for DMA. Newly allocated pages also need to be marked dirty before stopping the VCPU. All memory dirtied in this time slot will be migrated up to the stop-and-copy stage. We also need to make sure the VF is disabled, by clearing its bus master enable bit, before migrating this memory.
The ordering of your explanation here doesn't quite work. What needs to happen is that you have to disable DMA and then mark the pages as dirty. What the disabling of the BME does is signal to the hypervisor that the device is now stopped. The ixgbevf_suspend call already supported by the driver is almost exactly what is needed to take care of something like this. This is why I hope to reserve a piece of space in the DMA page to do a dummy write. This can help mark the page dirty while not requiring DMA to stop and without racing with the DMA data. You can't, and it will still race. What concerns me is that your patches and the document you referenced earlier show a considerable lack of understanding about how DMA and device drivers work. There is a reason why device drivers have so many memory barriers and the like in them. The fact is when you have a CPU and a device both accessing memory, things have to be done in a very specific order and you cannot violate that. If you have a contiguous block of memory you expect the dev
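The dummy-write idea debated above can be sketched in userspace C. This is only an illustration of the mechanism under discussion, not the actual patch code: the struct layout and names are hypothetical, and in a real driver the store would be trapped by the hypervisor's dirty-page logging rather than observed directly.

```c
#include <stdint.h>

#define PAGE_SIZE 4096

/* Hypothetical layout: the driver reserves one byte per DMA page for
 * dummy writes (this reservation is exactly what the 9K-buffer
 * objection later in the thread is about). */
struct tracked_page {
    uint8_t data[PAGE_SIZE - 1];   /* RX payload area */
    volatile uint8_t dummy;        /* reserved slot, never carries RX data */
};

/* Mark every tracked DMA page dirty by rewriting its reserved byte.
 * The volatile self-store forces a real write, so the hypervisor's
 * dirty log sees the page, while the payload bytes are untouched. */
static void mark_dma_pages_dirty(struct tracked_page *pages, int n)
{
    for (int i = 0; i < n; i++)
        pages[i].dummy = pages[i].dummy;   /* write without changing content */
}
```

The race Alexander points out is not solved by this: if the device DMAs into the payload after the dummy write, the logged copy is stale.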
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 12/5/2015 1:07 AM, Alexander Duyck wrote: We still need to support Windows guests for migration, and this is why our patches keep all changes in the driver, since it's impossible to change the Windows kernel. That is a poor argument. I highly doubt Microsoft is interested in having to modify all of the drivers that will support direct assignment in order to support migration. They would likely request something similar to what I have, in that they will want a way to do DMA tracking with minimal modification required to the drivers. This totally depends on the NIC or other device vendors, and they should make the decision whether to support migration or not. If yes, they would modify the driver. If the target is just to call suspend/resume during migration, the feature will be meaningless. In most cases users don't want to be affected much during migration, so the service downtime is vital. Our target is to apply SR-IOV NIC passthrough to cloud service and NFV (network functions virtualization) projects, which are sensitive to network performance and stability. In my opinion, we should give the device driver a chance to implement its own migration job, and call the suspend and resume callbacks in the driver if it doesn't care about performance during migration. Following is my idea for DMA tracking: inject an event to the VF driver after the memory-iterate stage and before stopping the VCPU, and then have the VF driver mark dirty all memory in use for DMA. Newly allocated pages also need to be marked dirty before stopping the VCPU. All memory dirtied in this time slot will be migrated up to the stop-and-copy stage. We also need to make sure the VF is disabled, by clearing its bus master enable bit, before migrating this memory. The ordering of your explanation here doesn't quite work. What needs to happen is that you have to disable DMA and then mark the pages as dirty. What the disabling of the BME does is signal to the hypervisor that the device is now stopped.
The ixgbevf_suspend call already supported by the driver is almost exactly what is needed to take care of something like this. This is why I hope to reserve a piece of space in the DMA page to do a dummy write. This can help mark the page dirty while not requiring DMA to stop and without racing with the DMA data. If we can't do that, we have to stop DMA for a short time to mark all DMA pages dirty and then re-enable it. I am not sure how much we can get this way, tracking all DMA memory with the device running during migration. I need to do some tests and compare the results with stopping DMA directly at the last stage of migration. The question is how we would go about triggering it. I really don't think the PCI configuration space approach is the right idea. I wonder if we couldn't get away with some sort of ACPI event instead. We already require ACPI support in order to shut down the system gracefully; I wonder if we couldn't get away with something similar in order to suspend/resume the direct assigned devices gracefully. I don't think there are such events in the current spec. Otherwise, there are two kinds of suspend/resume callbacks: 1) system suspend/resume, called during S2RAM and S2DISK, and 2) runtime suspend/resume, called by the PM core when the device is idle. If you want to do what you mentioned, you have to change the PM core and the ACPI spec. The DMA pages allocated by the VF driver also need to reserve space to do the dummy write. No, this will not work. If for example you have a VF driver allocating memory for a 9K receive how will that work? It isn't as if you can poke a hole in the contiguous memory.
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
Hi Michael & Alexander: Thanks a lot for your comments and suggestions. We still need to support Windows guests for migration, and this is why our patches keep all changes in the driver, since it's impossible to change the Windows kernel. Following is my idea for DMA tracking: inject an event to the VF driver after the memory-iterate stage and before stopping the VCPU, and then have the VF driver mark dirty all memory in use for DMA. Newly allocated pages also need to be marked dirty before stopping the VCPU. All memory dirtied in this time slot will be migrated up to the stop-and-copy stage. We also need to make sure the VF is disabled, by clearing its bus master enable bit, before migrating this memory. The DMA pages allocated by the VF driver also need to reserve space to do the dummy write. On 12/2/2015 7:44 PM, Michael S. Tsirkin wrote: On Tue, Dec 01, 2015 at 10:36:33AM -0800, Alexander Duyck wrote: On Tue, Dec 1, 2015 at 9:37 AM, Michael S. Tsirkin wrote: On Tue, Dec 01, 2015 at 09:04:32AM -0800, Alexander Duyck wrote: On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin wrote: There are several components to this: - dma_map_* needs to prevent the page from being migrated while the device is running. For example, expose some kind of bitmap from guest to host, and set a bit there while the page is mapped. What happens if we stop the guest and some bits are still set? See dma_alloc_coherent below for some ideas. Yeah, I could see something like this working. Maybe we could do something like what was done for the NX bit and make use of the upper order bits beyond the limits of the memory range to mark pages as non-migratable? I'm curious. What we have with a DMA mapped region is essentially shared memory between the guest and the device. How would we resolve something like this with IVSHMEM, or are we blocked there as well in terms of migration? I have some ideas. Will post later. I look forward to it. - dma_unmap_* needs to mark the page as dirty. This can be done by writing into the page.
- dma_sync_* needs to mark the page as dirty. This is trickier as we can not change the data. One solution is using atomics. For example: int x = ACCESS_ONCE(*p); cmpxchg(p, x, x); seems to do a write without changing the page contents. Like I said we can probably kill 2 birds with one stone by just implementing our own dma_mark_clean() for x86 virtualized environments. I'd say we could take your solution one step further and just use 0 instead of bothering to read the value. After all it won't write the area if the value at the offset is not 0. Really almost any atomic that has no side effect will do: an atomic OR with 0, an atomic AND with all-ones. It's just that cmpxchg already happens to have a portable wrapper. I was originally thinking maybe an atomic_add with 0 would be the way to go. cmpxchg with any value will do too. Either way though we still are using a locked prefix and having to dirty a cache line per page, which is going to come at some cost. I agree. It's likely not necessary for everyone to be doing this: only people that both run within the VM and want migration to work need to do this logging. So set some module option to have the driver tell the hypervisor that it supports logging. If bus mastering is enabled before this, migration is blocked. Or even pass some flag from the hypervisor so the driver can detect it needs to log writes. I guess this could be put in device config somewhere, though in practice it's a global thing, not a per-device one, so maybe we need some new channel to pass this flag to the guest. CPUID? Or maybe we can put some kind of agent in the initrd and use the existing guest agent channel after all. An agent in the initrd could open up a lot of new possibilities. - dma_alloc_coherent memory (e.g. device rings) must be migrated after the device has stopped modifying it. Just stopping the VCPU is not enough: you must make sure the device is not changing it. Or maybe the device has some kind of ring flush operation, if there was a reasonably portable way to do this (e.g.
a flush capability could maybe be added to SRIOV) then the hypervisor could do this. This is where things start to get messy. I was suggesting the suspend/resume to resolve this bit, but it might be possible to also deal with this via something like clearing the bus master enable bit for the VF. If I am not mistaken that should disable MSI-X interrupts and halt any DMA. That should work as long as you have some mechanism that is tracking the pages in use for DMA. A bigger issue is recovering afterwards. Agreed. In case you need to resume on source, you really need to follow the same path as on destination, preferably detecting device reset and restoring the device state. The problem with detecting the reset is that you would likely have to be polling to do something like that. We could send some event to the guest to notify it about this through a new or existing channel. Or we could make i
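Michael's cmpxchg trick for dma_sync_* can be shown in a userspace C sketch using the GCC builtin (the in-kernel version would use the kernel's cmpxchg() wrapper; the function name here is made up):

```c
#include <stdint.h>

/* Dirty a page for the hypervisor's write-log without changing its
 * contents: an atomic compare-and-swap of a word with itself performs
 * a real store (so the page gets logged as dirty) while the value is
 * left untouched. Userspace sketch of cmpxchg(p, x, x). */
static void dma_mark_dirty_word(volatile long *p)
{
    long x = *p;                           /* ACCESS_ONCE-style read */
    __sync_val_compare_and_swap(p, x, x);  /* write with no data change */
}
```

As noted in the thread, this still pays for a locked prefix and dirties a cache line per page, so it is not free even though it is side-effect-free.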
Re: [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
On 12/4/2015 4:05 PM, Michael S. Tsirkin wrote: I haven't read it, but I would like to note you can't rely on research papers. If you propose a patch to be merged you need to measure what is its actual effect on modern linux at the end of 2015. Sure. Will do that.
Re: [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
On 12/2/2015 10:31 PM, Michael S. Tsirkin wrote: >We hope >to find a better way to make SRIOV NIC work in these cases and this is >worth doing since a SRIOV NIC provides better network performance compared >with a PV NIC. If this is a performance optimization as the above implies, you need to include some numbers, and document how you implemented the switch and how you measured the performance. OK. Some ideas in my patches come from the paper "CompSC: Live Migration with Pass-through Devices". http://www.cl.cam.ac.uk/research/srg/netos/vee_2012/papers/p109.pdf It compares performance data between the solution of switching between PV and VF and VF migration (Chapter 7: Discussion). >Current patches have some issues. I think we can find >solutions for them and improve them step by step.
Re: [Qemu-devel] [RFC PATCH V2 06/10] Qemu/PCI: Add macros for faked PCI migration capability
On 12/3/2015 6:25 AM, Alex Williamson wrote: This will of course break if the PCI SIG defines that capability index. Couldn't this be done within a vendor defined capability? Thanks, Yes, it should work, and thanks for the suggestion.
Re: [Qemu-devel] [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support
On 12/3/2015 6:25 AM, Alex Williamson wrote: On Tue, 2015-11-24 at 21:35 +0800, Lan Tianyu wrote: This patch adds SRIOV VF migration support. It creates a new device type "vfio-sriov" and adds a faked PCI migration capability to that device type. The purposes of the new capability are to 1) sync migration status with the VF driver in the VM, 2) get the mailbox irq vector to notify the VF driver during migration, and 3) provide a way to control whether to inject the irq or not. Qemu will migrate the PCI config space regs and MSI-X config for the VF. It injects the mailbox irq at the last stage of migration to notify the VF about the migration event and waits for the VF driver to become ready for migration. The VF driver writes the PCI config reg PCI_VF_MIGRATION_VF_STATUS in the new cap table to tell Qemu. What makes this sr-iov specific? Why wouldn't we simply extend vfio-pci with a migration=on feature? Thanks, Alex Sounds reasonable and will update it.
Re: [Qemu-devel] [RFC PATCH V2 02/10] Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition
On 12/3/2015 6:25 AM, Alex Williamson wrote: I haven't seen a matching kernel patch series for this, but why is the kernel more capable of doing this than userspace is already? The following link is the kernel patch. http://marc.info/?l=kvm&m=144837328920989&w=2 These seem like pointless ioctls; we're creating a purely virtual PCI capability, and the kernel doesn't really need to participate in that. The VFIO kernel driver has pci_config_map, which indicates the PCI capability positions and lengths and helps to find free PCI config regs. The Qemu side doesn't have such info and can't get the exact table size of a PCI capability. If we wanted to add such support in Qemu, we would need to duplicate a lot of code from vfio_pci_configs.c in Qemu. Also, why are we restricting ourselves to standard capabilities? This version is to check whether we are on the right track, and we can extend this to PCI extended capabilities later. That's often a crowded space and we can't always know whether an area is free or not based only on it being covered by a capability. Some capabilities can also appear more than once, so there's context that isn't being passed to the kernel here. Thanks, The regions outside of PCI capabilities are not passed to the kernel or used by Qemu for MSI/MSI-X, so it's possible to use those places for the new capability. One concern is that the guest driver may abuse them, so a quirk masking some special regs outside of the capabilities may be helpful.
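Alex's vendor-defined-capability suggestion could look roughly like the following C sketch. Only the capability ID byte (0x09, the standard vendor-specific capability) comes from the PCI spec; the register layout, field names, and offsets are hypothetical stand-ins for the handshake regs discussed in the thread (the actual patches define names like PCI_VF_MIGRATION_VF_STATUS):

```c
#include <stdint.h>
#include <stddef.h>

#define PCI_CAP_ID_VNDR 0x09   /* standard vendor-specific capability ID */

/* Hypothetical vendor capability carrying the migration handshake. */
struct vf_migration_cap {
    uint8_t cap_id;      /* PCI_CAP_ID_VNDR */
    uint8_t cap_next;    /* config-space offset of the next capability */
    uint8_t cap_len;     /* vendor caps carry their own length byte */
    uint8_t vf_status;   /* guest driver writes "ready for migration" here */
    uint8_t irq_enable;  /* controls whether the mailbox irq is injected */
    uint8_t reserved[3];
} __attribute__((packed));
```

Because the capability carries its own length byte, QEMU can place it anywhere in free config space without colliding with indices the PCI SIG may define later, which addresses the breakage Alex points out.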
Re: [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
On 12/1/2015 11:02 PM, Michael S. Tsirkin wrote: But it requires the guest OS to do specific configuration inside and rely on the bonding driver, which blocks it from working on Windows. From the performance side, putting VF and virtio NICs under a bonded interface will affect their performance even when not doing migration. These factors block using VF NIC passthrough in some use cases (especially in the cloud) which require migration. That's really up to the guest. You don't need to do bonding, you can just move the IP and mac from userspace, that's possible on most OS-es. Or write something in the guest kernel that is more lightweight if you are so inclined. What we are discussing here is the host-guest interface, not the in-guest interface. The current solution we proposed changes the NIC driver and Qemu. The guest OS doesn't need to do anything special for migration, so it's easy to deploy. Except of course these patches don't even work properly yet. And when they do, even minor changes in host side NIC hardware across migration will break guests in hard to predict ways. Switching between PV and VF NICs will introduce a network stop, and the latency of hotplugging the VF is measurable. For some use cases (cloud service and OPNFV) which are sensitive to network stability and performance, these are not friendly and block SRIOV NIC usage in those cases. We hope to find a better way to make SRIOV NICs work in these cases, and this is worth doing since a SRIOV NIC provides better network performance compared with a PV NIC. The current patches have some issues. I think we can find solutions for them and improve them step by step.
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 12/1/2015 12:07 AM, Alexander Duyck wrote: They can only be corrected if the underlying assumptions are correct and they aren't. Your solution would have never worked correctly. The problem is you assume you can keep the device running when you are migrating and you simply cannot. At some point you will always have to stop the device in order to complete the migration, and you cannot stop it before you have stopped your page tracking mechanism. So unless the platform has an IOMMU that is somehow taking part in the dirty page tracking, you will not be able to stop the guest and then the device; it will have to be the device and then the guest. >Doing suspend and resume() may help to do migration easily but some >devices require low service downtime. Especially network, and I heard >that some cloud companies promised less than 500ms network service downtime. Honestly focusing on the downtime is getting the cart ahead of the horse. First you need to be able to do this without corrupting system memory and regardless of the state of the device. You haven't even gotten to that state yet. Last I knew the device had to be up in order for your migration to even work. I think the issue is that the content of an RX packet delivered to the stack may be changed during migration, because that piece of memory won't be migrated to the new machine. This may confuse applications or the stack. The current dummy-write solution can ensure the content of the packet won't change after doing the dummy write, while the content may not be the received data if migration happens before that point. We can recheck the content via the checksum or CRC in the protocol after the dummy write to ensure the content is what the VF received. I think the stack already does such checks and the packet will be dropped if it fails to pass them. Another way is to report all memory the driver is using to Qemu and let Qemu migrate that memory after stopping the VCPU and the device. This seems safe but the implementation may be complex.
Re: [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
On 11/30/2015 4:01 PM, Michael S. Tsirkin wrote: It is still not very clear what it is you are trying to achieve, and whether your patchset achieves it. You merely say "adding live migration" but it seems pretty clear this isn't about being able to migrate a guest transparently, since you are adding a host/guest handshake. This isn't about functionality either: I think that on KVM, it isn't hard to live migrate if you can do a host/guest handshake, even today, with no kernel changes: 1. before migration, expose a pv nic to guest (can be done directly on boot) 2. use e.g. a serial connection to move IP from an assigned device to pv nic 3. maybe move the mac as well 4. eject the assigned device 5. detect eject on host (QEMU generates a DEVICE_DELETED event when this happens) and start migration This looks like the bonding driver solution, which puts the PV NIC and VF into one bonded interface under active-backup mode. The bonding driver will switch from VF to PV NIC automatically when the VF is unplugged during migration. This is the only available solution for VF NIC migration today. But it requires the guest OS to do specific configuration inside and rely on the bonding driver, which prevents it from working on Windows. From the performance side, putting VF and virtio NICs under a bonded interface will affect their performance even when not doing migration. These factors block using VF NIC passthrough in some use cases (especially in the cloud) which require migration. The current solution we proposed changes the NIC driver and Qemu. The guest OS doesn't need to do anything special for migration. It's easy to deploy, and since all changes are in the NIC driver, a NIC vendor can implement migration support just in their driver.
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 11/26/2015 11:56 AM, Alexander Duyck wrote: > I am not saying you cannot modify the drivers, however what you are doing is far too invasive. Do you seriously plan on modifying all of the PCI device drivers out there in order to allow any device that might be direct assigned to a port to support migration? I certainly hope not. That is why I have said that this solution will not scale. Current drivers are not migration friendly. If a driver wants to support migration, it's necessary to change it. RFC PATCH V1 presented our ideas about how to deal with MMIO, ring and DMA tracking during migration. These are common for most drivers; they may be problematic in the previous version but can be corrected later. Doing suspend and resume() may help to do migration easily, but some devices require low service downtime, especially network devices, and I heard that some cloud companies promised less than 500ms network service downtime. So I think the performance effect should also be taken into account when we design the framework. What I am counter proposing seems like a very simple proposition. It can be implemented in two steps. 1. Look at modifying dma_mark_clean(). It is a function called in the sync and unmap paths of lib/swiotlb.c. If you could somehow modify it to take care of marking the pages you unmap for Rx as being dirty it will get you a good way towards your goal as it will allow you to continue to do DMA while you are migrating the VM. 2. Look at making use of the existing PCI suspend/resume calls that are there to support PCI power management. They have everything needed to allow you to pause and resume DMA for the device before and after the migration while retaining the driver state. If you can implement something that allows you to trigger these calls from the PCI subsystem such as hot-plug then you would have a generic solution that can be easily reproduced for multiple drivers beyond those supported by ixgbevf. I glanced at the PCI hotplug code.
The hotplug events are triggered by the PCI hotplug controller, and these events are defined in the controller spec. It's hard to extend them with more events. Otherwise, we would also need to add some specific code in the PCI hotplug core, since it only adds and removes PCI devices when it gets events. It's also a challenge to modify the Windows hotplug code. So we may need to find another way. Thanks. - Alex
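Step 1 of Alexander's counter-proposal can be sketched in userspace C as follows. The real dma_mark_clean() in lib/swiotlb.c takes only an address and size; the extra base parameter and the fixed-size bitmap here are hypothetical, added so the sketch is self-contained and testable:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define NPAGES     64

/* Hypothetical guest-side dirty bitmap the hypervisor would consult:
 * one bit per tracked page. */
static uint64_t dma_dirty_bitmap;

/* Sketch of a migration-aware dma_mark_clean(): called on the swiotlb
 * sync/unmap paths, it flags every page touched by the buffer as dirty
 * so the pages get re-sent while DMA keeps running. 'base' anchors the
 * page-frame numbering for this userspace model. */
static void dma_mark_clean(void *addr, size_t size, uintptr_t base)
{
    uintptr_t start = ((uintptr_t)addr - base) >> PAGE_SHIFT;
    uintptr_t end   = ((uintptr_t)addr - base + size - 1) >> PAGE_SHIFT;

    for (uintptr_t pfn = start; pfn <= end && pfn < NPAGES; pfn++)
        dma_dirty_bitmap |= 1ULL << pfn;
}
```

The appeal of hooking this path is that Rx buffers are dirtied exactly when the driver gives them back, so no per-driver change is needed beyond what already routes through the DMA API.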
Re: [Qemu-devel] [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver
On 11/25/2015 8:28 PM, Michael S. Tsirkin wrote: Frankly, I don't really see what this short term hack buys us, and if it goes in, we'll have to maintain it forever. The framework of how to notify the VF about migration status won't change regardless of whether the VF is stopped before doing migration or not. We hope to reach agreement on this first. Tracking dirty memory still needs more discussion and we will continue working on it. Stopping the VF may help to work around the issue and make tracking easier. Also, assuming you just want to do ifdown/ifup for some reason, it's easy enough to do using a guest agent, in a completely generic way. Just ifdown/ifup is not enough for migration. It needs to restore some PCI settings before doing ifup on the target machine.
Re: [Qemu-devel] [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support
On 11/25/2015 5:03 AM, Michael S. Tsirkin wrote:
>+void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
>+                               uint32_t val, int len)
>+{
>+    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>+
>+    if (addr == vdev->migration_cap + PCI_VF_MIGRATION_VF_STATUS
>+        && val == PCI_VF_READY_FOR_MIGRATION) {
>+        qemu_event_set(&migration_event);
This would wake migration so it can proceed - except it needs the QEMU lock to run, and that's taken by the migration thread. Sorry, I seem to miss something. Which lock may cause a deadlock when calling vfio_migration_cap_handle() while migration runs? The function is called when the VF accesses the faked PCI capability. It seems unlikely that this ever worked - how did you test this?
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 2015年11月25日 13:30, Alexander Duyck wrote: > No, what I am getting at is that you can't go around and modify the > configuration space for every possible device out there. This > solution won't scale. PCI config space regs are emulated by Qemu, so we can find free PCI config space regs for the faked PCI capability. Its position doesn't have to be permanent. > If you instead moved the logic for notifying > the device into a separate mechanism such as making it a part of the > hot-plug logic then you only have to write the code once per OS in > order to get the hot-plug capability to pause/resume the device. What > I am talking about is not full hot-plug, but rather to extend the > existing hot-plug in Qemu and the Linux kernel to support a > "pause/resume" functionality. The PCI hot-plug specification calls > out the option of implementing something like this, but we don't > currently have support for it. > Could you elaborate on the part of the PCI hot-plug specification you mentioned? My concern is whether it needs a change to the PCI spec or not. > I just feel doing it through PCI hot-plug messages will scale much > better as you could likely make use of the power management > suspend/resume calls to take care of most of the needed implementation > details. > > - Alex -- Best regards Tianyu Lan
Re: [Qemu-devel] [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver
On 2015年11月25日 05:20, Michael S. Tsirkin wrote: > I have to say, I was much more interested in the idea > of tracking dirty memory. I have some thoughts about > that one - did you give up on it then? No, our final target is to keep the VF active during migration, and tracking dirty memory is essential for that. But this seems hard to achieve upstream in the short term. As a starting point, we stop the VF before migration. On further thought, even the approach of stopping the VF still needs to track DMA-dirtied memory, to make sure the data buffers received before the VF is stopped are migrated. It's easier to do that via a dummy write to the data buffer when receiving a packet. -- Best regards Tianyu Lan
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 2015年11月24日 22:20, Alexander Duyck wrote: > I'm still not a fan of this approach. I really feel like this is > something that should be resolved by extending the existing PCI hot-plug > rather than trying to instrument this per driver. Then you will get the > goodness for multiple drivers and multiple OSes instead of just one. An > added advantage to dealing with this in the PCI hot-plug environment > would be that you could then still do a hot-plug even if the guest > didn't load a driver for the VF since you would be working with the PCI > slot instead of the device itself. > > - Alex Hi Alex: What you mentioned seems to be the bonding driver solution. The paper "Live Migration with Pass-through Device for Linux VM" describes it. It does VF hotplug during migration. In order to maintain the network connection while the VF is unplugged, it takes advantage of the Linux bonding driver to switch between the VF NIC and an emulated NIC. But it has side effects: it requires the VM to do additional configuration, and the performance while switching between the two NICs is not good. -- Best regards Tianyu Lan
[Qemu-devel] [RFC PATCH V2 10/10] Qemu/VFIO: Misc change for enable migration with VFIO
Signed-off-by: Lan Tianyu --- hw/vfio/pci.c | 6 -- 1 file changed, 6 deletions(-) diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index e7583b5..404a5cd 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -3625,11 +3625,6 @@ static Property vfio_pci_dev_properties[] = { DEFINE_PROP_END_OF_LIST(), }; -static const VMStateDescription vfio_pci_vmstate = { -.name = "vfio-pci", -.unmigratable = 1, -}; - static void vfio_pci_dev_class_init(ObjectClass *klass, void *data) { DeviceClass *dc = DEVICE_CLASS(klass); @@ -3637,7 +3632,6 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data) dc->reset = vfio_pci_reset; dc->props = vfio_pci_dev_properties; -dc->vmsd = &vfio_pci_vmstate; dc->desc = "VFIO-based PCI device assignment"; set_bit(DEVICE_CATEGORY_MISC, dc->categories); pdc->init = vfio_initfn; -- 1.9.3
[Qemu-devel] [RFC PATCH V2 08/10] Qemu: Add save_before_stop callback to run just before stopping VCPU during migration
This patch is to add a callback which is called just before stopping the VCPU. It's for VF migration to trigger the mailbox irq as late as possible to decrease service downtime. Signed-off-by: Lan Tianyu --- include/migration/vmstate.h | 3 +++ include/sysemu/sysemu.h | 1 + migration/migration.c | 3 ++- migration/savevm.c | 13 + 4 files changed, 19 insertions(+), 1 deletion(-) diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h index dc681a6..093faf1 100644 --- a/include/migration/vmstate.h +++ b/include/migration/vmstate.h @@ -58,6 +58,9 @@ typedef struct SaveVMHandlers { /* This runs after restoring CPU related state */ void (*post_load_state)(void *opaque); + +/* This runs before stopping VCPU */ +void (*save_before_stop)(QEMUFile *f, void *opaque); LoadStateHandler *load_state; } SaveVMHandlers; diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h index df80951..3d0d72c 100644 --- a/include/sysemu/sysemu.h +++ b/include/sysemu/sysemu.h @@ -84,6 +84,7 @@ void qemu_announce_self(void); bool qemu_savevm_state_blocked(Error **errp); void qemu_savevm_state_begin(QEMUFile *f, const MigrationParams *params); +void qemu_savevm_save_before_stop(QEMUFile *f); void qemu_savevm_state_header(QEMUFile *f); int qemu_savevm_state_iterate(QEMUFile *f); void qemu_savevm_state_complete(QEMUFile *f); diff --git a/migration/migration.c b/migration/migration.c index c6ac08a..fccadea 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -759,7 +759,6 @@ int64_t migrate_xbzrle_cache_size(void) } /* migration thread support */ - static void *migration_thread(void *opaque) { MigrationState *s = opaque; @@ -788,6 +787,8 @@ static void *migration_thread(void *opaque) } else { int ret; +qemu_savevm_save_before_stop(s->file); + qemu_mutex_lock_iothread(); start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER); diff --git a/migration/savevm.c b/migration/savevm.c index 48b6223..c2e4802 100644 ---
a/migration/savevm.c +++ b/migration/savevm.c @@ -715,6 +715,19 @@ void qemu_savevm_post_load(void) } } +void qemu_savevm_save_before_stop(QEMUFile *f) +{ +SaveStateEntry *se; + +QTAILQ_FOREACH(se, &savevm_state.handlers, entry) { +if (!se->ops || !se->ops->save_before_stop) { +continue; +} + +se->ops->save_before_stop(f, se->opaque); +} +} + void qemu_savevm_state_header(QEMUFile *f) { -- 1.9.3
[Qemu-devel] [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support
This patch is to add SRIOV VF migration support. Create a new device type "vfio-sriov" and add a faked PCI migration capability to it. The purposes of the new capability are: 1) sync migration status with the VF driver in the VM; 2) get the mailbox irq vector used to notify the VF driver during migration; 3) provide a way to control whether to inject the irq or not. Qemu will migrate PCI config space regs and MSIX config for the VF. It injects the mailbox irq at the last stage of migration to notify the VF about the migration event and waits until the VF driver is ready for migration. The VF driver writes the PCI config reg PCI_VF_MIGRATION_VF_STATUS in the new cap table to tell Qemu. Signed-off-by: Lan Tianyu --- hw/vfio/Makefile.objs | 2 +- hw/vfio/pci.c | 6 ++ hw/vfio/pci.h | 4 ++ hw/vfio/sriov.c | 178 ++ 4 files changed, 189 insertions(+), 1 deletion(-) create mode 100644 hw/vfio/sriov.c diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs index d540c9d..9cf0178 100644 --- a/hw/vfio/Makefile.objs +++ b/hw/vfio/Makefile.objs @@ -1,6 +1,6 @@ ifeq ($(CONFIG_LINUX), y) obj-$(CONFIG_SOFTMMU) += common.o -obj-$(CONFIG_PCI) += pci.o +obj-$(CONFIG_PCI) += pci.o sriov.o obj-$(CONFIG_SOFTMMU) += platform.o obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o endif diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 7c43fc1..e7583b5 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -2013,6 +2013,11 @@ void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr, } else if (was_enabled && !is_enabled) { vfio_disable_msix(vdev); } +} else if (vdev->migration_cap && +ranges_overlap(addr, len, vdev->migration_cap, 0x10)) { +/* Write everything to QEMU to keep emulated bits correct */ +pci_default_write_config(pdev, addr, val, len); +vfio_migration_cap_handle(pdev, addr, val, len); } else { /* Write everything to QEMU to keep emulated bits correct */ pci_default_write_config(pdev, addr, val, len); @@ -3517,6 +3522,7 @@ static int vfio_initfn(PCIDevice *pdev) vfio_register_err_notifier(vdev); vfio_register_req_notifier(vdev); vfio_setup_resetfn(vdev);
+vfio_add_migration_capability(vdev); return 0; diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h index 6c00575..ee6ca5e 100644 --- a/hw/vfio/pci.h +++ b/hw/vfio/pci.h @@ -134,6 +134,7 @@ typedef struct VFIOPCIDevice { PCIHostDeviceAddress host; EventNotifier err_notifier; EventNotifier req_notifier; +uint16_tmigration_cap; int (*resetfn)(struct VFIOPCIDevice *); uint32_t features; #define VFIO_FEATURE_ENABLE_VGA_BIT 0 @@ -162,3 +163,6 @@ uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len); void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr, uint32_t val, int len); void vfio_enable_msix(VFIOPCIDevice *vdev); +void vfio_add_migration_capability(VFIOPCIDevice *vdev); +void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr, + uint32_t val, int len); diff --git a/hw/vfio/sriov.c b/hw/vfio/sriov.c new file mode 100644 index 000..3109538 --- /dev/null +++ b/hw/vfio/sriov.c @@ -0,0 +1,178 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "hw/hw.h" +#include "hw/vfio/pci.h" +#include "hw/vfio/vfio.h" +#include "hw/vfio/vfio-common.h" + +#define TYPE_VFIO_SRIOV "vfio-sriov" + +#define SRIOV_LM_SETUP 0x01 +#define SRIOV_LM_COMPLETE 0x02 + +QemuEvent migration_event; + +static void vfio_dev_post_load(void *opaque) +{ +struct PCIDevice *pdev = (struct PCIDevice *)opaque; +VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev); +MSIMessage msg; +int vector; + +if (vfio_pci_read_config(pdev, +vdev->migration_cap + PCI_VF_MIGRATION_CAP, 1) +!= PCI_VF_MIGRATION_ENABLE) +return; + +vector = vfio_pci_read_config(pdev, +vdev->migration_cap + PCI_VF_MIGRATION_IRQ, 1); + +msg = msix_get_message(pdev, vector); +kvm_irqchip_send_msi(kvm_state, msg); +} + +static int vfio_dev_load(QEMUFile *f, void *opaque, int version_id) +{ +struct PCIDevice *pdev = (struct PCIDevice *)opaque; +VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev); +int ret; + +if(qemu_get_byte(f)!= SRIOV_LM_COMPLETE) +return 0; + 
+ret = pci_device_load(pdev, f); +if (ret) { +error_report("Failed to load PCI config space."); +return ret; +} + +if (msix_enabled(pdev)) { +vfio_enable_msix(vdev); +msix_load(pdev, f); +} + +vfio_pci_write_config(pdev, vdev->migration_cap + +PCI_VF_MIGRATION_VMM_STATUS, VMM_MIGRATION_END, 1); +vfio_pci_write_config(pdev, vdev->migration_cap + +PCI_VF_MIGRATION_VF_STATUS, PCI_VF_WAIT_FOR_MIGRATION, 1); +
[Qemu-devel] [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver
This patch is to add migration support for the ixgbevf driver. A faked PCI migration capability table is used to communicate with Qemu, sharing migration status and the mailbox irq vector index. Qemu notifies the VF by sending an MSIX msg to trigger the mailbox vector during migration, and stores migration status in the PCI_VF_MIGRATION_VMM_STATUS reg in the new capability table. The mailbox irq is triggered just before the stop-and-copy stage and after migration on the target machine. The VF driver puts the net down when it detects migration and tells Qemu it's ready for migration by writing the PCI_VF_MIGRATION_VF_STATUS reg. After migration, it puts the net up again. Qemu is in charge of migrating PCI config space regs and MSIX config. The patch focuses on the normal case where net traffic works and the mailbox irq is enabled. For other cases (such as the driver isn't loaded, or the adapter is suspended or closed), the mailbox irq won't be triggered and the VF driver will disable the mechanism via the PCI_VF_MIGRATION_CAP reg. These cases will be resolved later.
Signed-off-by: Lan Tianyu --- drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 5 ++ drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 102 ++ 2 files changed, 107 insertions(+) diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h index 775d089..4b8ba2f 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h @@ -438,6 +438,11 @@ struct ixgbevf_adapter { u64 bp_tx_missed; #endif + u8 migration_cap; + u8 last_migration_reg; + unsigned long migration_status; + struct work_struct migration_task; + u8 __iomem *io_addr; /* Mainly for iounmap use */ u32 link_speed; bool link_up; diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index a16d267..95860c2 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -96,6 +96,8 @@ static int debug = -1; module_param(debug, int, 0); MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)"); +#define MIGRATION_IN_PROGRESS 0 + static void ixgbevf_service_event_schedule(struct ixgbevf_adapter *adapter) { if (!test_bit(__IXGBEVF_DOWN, &adapter->state) && @@ -1262,6 +1264,22 @@ static void ixgbevf_set_itr(struct ixgbevf_q_vector *q_vector) } } +static void ixgbevf_migration_check(struct ixgbevf_adapter *adapter) +{ + struct pci_dev *pdev = adapter->pdev; + u8 val; + + pci_read_config_byte(pdev, +adapter->migration_cap + PCI_VF_MIGRATION_VMM_STATUS, +&val); + + if (val != adapter->last_migration_reg) { + schedule_work(&adapter->migration_task); + adapter->last_migration_reg = val; + } + +} + static irqreturn_t ixgbevf_msix_other(int irq, void *data) { struct ixgbevf_adapter *adapter = data; @@ -1269,6 +1287,7 @@ static irqreturn_t ixgbevf_msix_other(int irq, void *data) hw->mac.get_link_status = 1; + ixgbevf_migration_check(adapter); ixgbevf_service_event_schedule(adapter); IXGBE_WRITE_REG(hw, IXGBE_VTEIMS, 
adapter->eims_other); @@ -1383,6 +1402,7 @@ out: static int ixgbevf_request_msix_irqs(struct ixgbevf_adapter *adapter) { struct net_device *netdev = adapter->netdev; + struct pci_dev *pdev = adapter->pdev; int q_vectors = adapter->num_msix_vectors - NON_Q_VECTORS; int vector, err; int ri = 0, ti = 0; @@ -1423,6 +1443,12 @@ static int ixgbevf_request_msix_irqs(struct ixgbevf_adapter *adapter) goto free_queue_irqs; } + if (adapter->migration_cap) { + pci_write_config_byte(pdev, + adapter->migration_cap + PCI_VF_MIGRATION_IRQ, + vector); + } + return 0; free_queue_irqs: @@ -2891,6 +2917,59 @@ static void ixgbevf_watchdog_subtask(struct ixgbevf_adapter *adapter) ixgbevf_update_stats(adapter); } +static void ixgbevf_migration_task(struct work_struct *work) +{ + struct ixgbevf_adapter *adapter = container_of(work, + struct ixgbevf_adapter, + migration_task); + struct pci_dev *pdev = adapter->pdev; + struct net_device *netdev = adapter->netdev; + u8 val; + + if (!test_bit(MIGRATION_IN_PROGRESS, &adapter->migration_status)) { + pci_read_config_byte(pdev, +adapter->migration_cap + PCI_VF_MIGRATION_VMM_STATUS, +&val); + if (val != VMM_MIGRATION_START) + return; + + pr_info("migration start\n"); + set_bit(MIGRATION_IN_PROGRESS, &a
[Qemu-devel] [RFC PATCH V2 07/10] Qemu: Add post_load_state() to run after restoring CPU state
After migration, Qemu needs to trigger the mailbox irq to notify the VF driver in the guest about the status change. Irq delivery starts working again once the CPU state has been restored. This patch adds a new callback that runs after restoring CPU state, providing a way to trigger the mailbox irq at that point. Signed-off-by: Lan Tianyu --- include/migration/vmstate.h | 2 ++ migration/savevm.c | 15 +++ 2 files changed, 17 insertions(+) diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h index 0695d7c..dc681a6 100644 --- a/include/migration/vmstate.h +++ b/include/migration/vmstate.h @@ -56,6 +56,8 @@ typedef struct SaveVMHandlers { int (*save_live_setup)(QEMUFile *f, void *opaque); uint64_t (*save_live_pending)(QEMUFile *f, void *opaque, uint64_t max_size); +/* This runs after restoring CPU related state */ +void (*post_load_state)(void *opaque); LoadStateHandler *load_state; } SaveVMHandlers; diff --git a/migration/savevm.c b/migration/savevm.c index 9e0e286..48b6223 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -702,6 +702,20 @@ bool qemu_savevm_state_blocked(Error **errp) return false; } +void qemu_savevm_post_load(void) +{ +SaveStateEntry *se; + +QTAILQ_FOREACH(se, &savevm_state.handlers, entry) { +if (!se->ops || !se->ops->post_load_state) { +continue; +} + +se->ops->post_load_state(se->opaque); +} +} + + void qemu_savevm_state_header(QEMUFile *f) { trace_savevm_state_header(); @@ -1140,6 +1154,7 @@ int qemu_loadvm_state(QEMUFile *f) } cpu_synchronize_all_post_init(); +qemu_savevm_post_load(); ret = 0; -- 1.9.3
[Qemu-devel] [RFC PATCH V2 1/3] VFIO: Add new ioctl cmd VFIO_GET_PCI_CAP_INFO
This patch is to add a new ioctl cmd VFIO_GET_PCI_CAP_INFO to get the PCI cap table size and find free PCI config space regs according to the given pos and size. Qemu will add a faked PCI capability for migration and needs such info. Signed-off-by: Lan Tianyu --- drivers/vfio/pci/vfio_pci.c | 21 drivers/vfio/pci/vfio_pci_config.c | 38 +++-- drivers/vfio/pci/vfio_pci_private.h | 5 + include/uapi/linux/vfio.h | 12 4 files changed, 70 insertions(+), 6 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index 69fab0f..2e42de0 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -784,6 +784,27 @@ hot_reset_release: kfree(groups); return ret; + } else if (cmd == VFIO_GET_PCI_CAP_INFO) { + struct vfio_pci_cap_info info; + int offset; + + if (copy_from_user(&info, (void __user *)arg, sizeof(info))) + return -EFAULT; + + switch (info.index) { + case VFIO_PCI_CAP_GET_SIZE: + info.size = vfio_get_cap_size(vdev, info.cap, info.offset); + break; + case VFIO_PCI_CAP_GET_FREE_REGION: + offset = vfio_find_free_pci_config_reg(vdev, + info.offset, info.size); + info.offset = offset; + break; + default: + return -EINVAL; + } + + return copy_to_user((void __user *)arg, &info, sizeof(info)) ? -EFAULT : 0; } return -ENOTTY; diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index ff75ca3..8afbda4 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -841,6 +841,21 @@ static int vfio_find_cap_start(struct vfio_pci_device *vdev, int pos) return pos; } +int vfio_find_free_pci_config_reg(struct vfio_pci_device *vdev, + int pos, int size) +{ + int i, offset = pos; + + for (i = pos; i < PCI_CFG_SPACE_SIZE; i++) { + if (vdev->pci_config_map[i] != PCI_CAP_ID_INVALID) + offset = i + 1; + else if (i - offset + 1 == size) + return offset; + } + + return 0; +} + static int vfio_msi_config_read(struct vfio_pci_device *vdev, int pos, int count, struct perm_bits *perm, int offset, __le32 *val) @@ -1199,6
+1214,20 @@ static int vfio_fill_vconfig_bytes(struct vfio_pci_device *vdev, return ret; } +int vfio_get_cap_size(struct vfio_pci_device *vdev, u8 cap, int pos) +{ + int len; + + len = pci_cap_length[cap]; + if (len == 0xFF) { /* Variable length */ + len = vfio_cap_len(vdev, cap, pos); + if (len < 0) + return len; + } + + return len; +} + static int vfio_cap_init(struct vfio_pci_device *vdev) { struct pci_dev *pdev = vdev->pdev; @@ -1238,12 +1267,9 @@ static int vfio_cap_init(struct vfio_pci_device *vdev) return ret; if (cap <= PCI_CAP_ID_MAX) { - len = pci_cap_length[cap]; - if (len == 0xFF) { /* Variable length */ - len = vfio_cap_len(vdev, cap, pos); - if (len < 0) - return len; - } + len = vfio_get_cap_size(vdev, cap, pos); + if (len < 0) + return len; } if (!len) { diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h index ae0e1b4..91b4f9b 100644 --- a/drivers/vfio/pci/vfio_pci_private.h +++ b/drivers/vfio/pci/vfio_pci_private.h @@ -89,4 +89,9 @@ extern void vfio_pci_uninit_perm_bits(void); extern int vfio_config_init(struct vfio_pci_device *vdev); extern void vfio_config_free(struct vfio_pci_device *vdev); +extern int vfio_find_free_pci_config_reg(struct vfio_pci_device *vdev, + int pos, int size); +extern int vfio_get_cap_size(struct vfio_pci_device *vdev, + u8 cap, int pos); + #endif /* VFIO_PCI_PRIVATE_H */ diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index b57b750..dfa7023 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -495,6 +495,18 @@ struct vfio_eeh_pe_op { #define VFIO_EEH_PE_OP _IO(VFIO_TYPE, VFIO_BASE + 21) +#define VFIO_GET_PCI_CAP_INFO _IO(VFIO_TYPE, VFIO_BASE + 22) +struct vfio_pci_cap_info { + __u32 argsz; + __u32 flags; +#define VFIO_PCI_CAP_GET_SIZE (1 << 0) +#define VFI