Re: [Qemu-devel] [PATCH] x86: Skip check apic_id_limit for Xen

2017-10-26 Thread Lan Tianyu
On 2017年10月26日 22:27, Michael S. Tsirkin wrote:
> On Thu, Oct 26, 2017 at 02:19:43PM +0200, Eduardo Habkost wrote:
>> On Mon, Aug 21, 2017 at 10:22:15AM +0800, Lan Tianyu wrote:
>>> On 2017年08月19日 00:38, Eduardo Habkost wrote:
>>>> On Thu, Aug 17, 2017 at 09:37:10AM +0800, Lan Tianyu wrote:
>>>>> On 2017年08月16日 19:21, Paolo Bonzini wrote:
>>>>>> On 16/08/2017 02:22, Lan Tianyu wrote:
>>>>>>> Xen vIOMMU device model will be in Xen hypervisor. Skip vIOMMU
>>>>>>> check for Xen here when vcpu number is more than 255.
>>>>>>
>>>>>> I think you still need to do a check for vIOMMU being enabled.
>>>>>
>>>>> Yes, this will be done in the Xen tool stack and Qemu doesn't have such
>>>>> knowledge. Operations of create, destroy Xen vIOMMU will be done in the
>>>>> Xen tool stack.
>>>>
>>>> Shouldn't we make QEMU have knowledge of the vIOMMU device, then?
>>>> Won't QEMU need to know about it eventually?
>>>>
>>>
>>> Hi Eduardo:
>>>  Thanks for your review.
>>>  Xen has some guest modes which doesn't use Qemu and we tried to
>>> make Xen vIOMMU framework compatible with all guest modes. So far, we
>>> are adding interrupt remapping function for Xen vIOMMU and find qemu
>>> doesn't need to know Xen vIOMMU. The check of vcpu number > 255 here
>>> will be done in Xen side and so skip the check in Qemu to avoid blocking
>>> Xen creating >255 vcpus.
>>>  We may make Qemu have knowledge of the vIOMMU device if it's
>>> necessary when adding new function.
>>
>> I was expecting it to go through the PC tree, but I will queue it
>> on x86-next instead.
> 
> I was waiting for an ack from you or Paolo as you participated in the
> discussion. But sure, go ahead
> 
> Acked-by: Michael S. Tsirkin <m...@redhat.com>
> 

Great. Thanks.
-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH] x86: Skip check apic_id_limit for Xen

2017-08-20 Thread Lan Tianyu
On 2017年08月19日 00:38, Eduardo Habkost wrote:
> On Thu, Aug 17, 2017 at 09:37:10AM +0800, Lan Tianyu wrote:
>> On 2017年08月16日 19:21, Paolo Bonzini wrote:
>>> On 16/08/2017 02:22, Lan Tianyu wrote:
>>>> Xen vIOMMU device model will be in Xen hypervisor. Skip vIOMMU
>>>> check for Xen here when vcpu number is more than 255.
>>>
>>> I think you still need to do a check for vIOMMU being enabled.
>>
>> Yes, this will be done in the Xen tool stack and Qemu doesn't have such
>> knowledge. Operations of create, destroy Xen vIOMMU will be done in the
>> Xen tool stack.
> 
> Shouldn't we make QEMU have knowledge of the vIOMMU device, then?
> Won't QEMU need to know about it eventually?
> 

Hi Eduardo:
 Thanks for your review.
 Xen has some guest modes which don't use QEMU, and we tried to
make the Xen vIOMMU framework compatible with all guest modes. So far, we
are adding the interrupt remapping function for the Xen vIOMMU and have
found that QEMU doesn't need to know about the Xen vIOMMU. The check for
a vCPU count > 255 will be done on the Xen side, so the check is skipped
in QEMU to avoid blocking Xen from creating more than 255 vCPUs.
 We may make QEMU aware of the vIOMMU device if that becomes
necessary when adding new functionality.
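
For readers skimming the thread, the change itself is tiny. A condensed sketch
of the resulting check in pc_machine_done() is shown below; it is not a
verbatim excerpt, and the tail of the condition plus the error text are
truncated in the quoted diffs further down, so they are reconstructed here
from memory and may differ in wording:

    /* Condensed sketch of pc_machine_done() after this patch. */
    if (pcms->apic_id_limit > 255 && !xen_enabled()) {
        /* On KVM/TCG, more than 255 APIC IDs only work with an emulated
         * Intel IOMMU doing interrupt remapping with EIM (x2APIC-sized
         * destination IDs).  Under Xen the vIOMMU lives in the hypervisor,
         * so the equivalent check is done by the Xen tool stack instead. */
        IntelIOMMUState *iommu = INTEL_IOMMU_DEVICE(x86_iommu_get_default());

        if (!iommu || !iommu->x86_iommu.intr_supported ||
            iommu->intr_eim != ON_OFF_AUTO_ON) {
            error_report("current -smp configuration requires an IOMMU "
                         "with extended interrupt mode enabled"); /* paraphrased */
            exit(EXIT_FAILURE);
        }
    }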

-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH] x86: Skip check apic_id_limit for Xen

2017-08-16 Thread Lan Tianyu
On 2017年08月16日 19:21, Paolo Bonzini wrote:
> On 16/08/2017 02:22, Lan Tianyu wrote:
>> Xen vIOMMU device model will be in Xen hypervisor. Skip vIOMMU
>> check for Xen here when vcpu number is more than 255.
> 
> I think you still need to do a check for vIOMMU being enabled.

Yes, this will be done in the Xen tool stack; QEMU doesn't have that
knowledge. The operations to create and destroy the Xen vIOMMU will be
done in the Xen tool stack.

> 
> Paolo
> 
>> Signed-off-by: Lan Tianyu <tianyu@intel.com>
>> ---
>>  hw/i386/pc.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index 5943539..fc17885 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -1260,7 +1260,7 @@ void pc_machine_done(Notifier *notifier, void *data)
>>  fw_cfg_modify_i16(pcms->fw_cfg, FW_CFG_NB_CPUS, pcms->boot_cpus);
>>  }
>>  
>> -if (pcms->apic_id_limit > 255) {
>> +if (pcms->apic_id_limit > 255 && !xen_enabled()) {
>>  IntelIOMMUState *iommu = 
>> INTEL_IOMMU_DEVICE(x86_iommu_get_default());
>>  
>>  if (!iommu || !iommu->x86_iommu.intr_supported ||
>>
> 




[Qemu-devel] [PATCH] x86: Skip check apic_id_limit for Xen

2017-08-16 Thread Lan Tianyu
The Xen vIOMMU device model will live in the Xen hypervisor. Skip the
vIOMMU check for Xen here when the vCPU number is more than 255.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/i386/pc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 5943539..fc17885 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1260,7 +1260,7 @@ void pc_machine_done(Notifier *notifier, void *data)
 fw_cfg_modify_i16(pcms->fw_cfg, FW_CFG_NB_CPUS, pcms->boot_cpus);
 }
 
-if (pcms->apic_id_limit > 255) {
+if (pcms->apic_id_limit > 255 && !xen_enabled()) {
 IntelIOMMUState *iommu = INTEL_IOMMU_DEVICE(x86_iommu_get_default());
 
 if (!iommu || !iommu->x86_iommu.intr_supported ||
-- 
1.8.3.1




[Qemu-devel] [PATCH v3 2/2] x86: Increase max vcpu number to 8192

2017-08-16 Thread Lan Tianyu
For the HPC use case, users create huge VMs with as many vCPUs as the host
has CPUs, which requires support for more vCPUs in a single VM. This patch
increases the max vCPU number from 288 to 8192, the current default maximum
CPU number for the Linux kernel.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/i386/pc_q35.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index b469d05..3468f5a 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -299,7 +299,7 @@ static void pc_q35_machine_options(MachineClass *m)
 m->default_display = "std";
 m->no_floppy = 1;
 m->has_dynamic_sysbus = true;
-m->max_cpus = 288;
+m->max_cpus = 8192;
 }
 
 static void pc_q35_2_11_machine_options(MachineClass *m)
@@ -314,6 +314,7 @@ static void pc_q35_2_10_machine_options(MachineClass *m)
 {
 pc_q35_2_11_machine_options(m);
 m->alias = NULL;
+m->max_cpus = 288;
 m->numa_auto_assign_ram = numa_legacy_auto_assign_ram;
 SET_MACHINE_COMPAT(m, PC_COMPAT_2_10);
 }
-- 
1.8.3.1




[Qemu-devel] [PATCH v3 1/2] pc: add 2.11 machine type

2017-08-16 Thread Lan Tianyu
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/i386/pc_piix.c| 14 +++---
 hw/i386/pc_q35.c | 14 +++---
 include/hw/compat.h  |  3 +++
 include/hw/i386/pc.h |  3 +++
 4 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 11b4336..e99ec8c 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -436,11 +436,21 @@ static void pc_i440fx_machine_options(MachineClass *m)
 m->default_display = "std";
 }
 
-static void pc_i440fx_2_10_machine_options(MachineClass *m)
+static void pc_i440fx_2_11_machine_options(MachineClass *m)
 {
 pc_i440fx_machine_options(m);
 m->alias = "pc";
 m->is_default = 1;
+}
+
+DEFINE_I440FX_MACHINE(v2_11, "pc-i440fx-2.11", NULL,
+  pc_i440fx_2_11_machine_options);
+
+static void pc_i440fx_2_10_machine_options(MachineClass *m)
+{
+pc_i440fx_machine_options(m);
+m->is_default = 0;
+m->alias = NULL;
 m->numa_auto_assign_ram = numa_legacy_auto_assign_ram;
 }
 
@@ -450,8 +460,6 @@ DEFINE_I440FX_MACHINE(v2_10, "pc-i440fx-2.10", NULL,
 static void pc_i440fx_2_9_machine_options(MachineClass *m)
 {
 pc_i440fx_2_10_machine_options(m);
-m->is_default = 0;
-m->alias = NULL;
 SET_MACHINE_COMPAT(m, PC_COMPAT_2_9);
 }
 
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 169a214..b469d05 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -302,11 +302,20 @@ static void pc_q35_machine_options(MachineClass *m)
 m->max_cpus = 288;
 }
 
+static void pc_q35_2_11_machine_options(MachineClass *m)
+{
+ pc_q35_machine_options(m);
+ m->alias = "q35";
+}
+DEFINE_Q35_MACHINE(v2_11, "pc-q35-2.11", NULL,
+   pc_q35_2_11_machine_options);
+
 static void pc_q35_2_10_machine_options(MachineClass *m)
 {
-pc_q35_machine_options(m);
-m->alias = "q35";
+pc_q35_2_11_machine_options(m);
+m->alias = NULL;
 m->numa_auto_assign_ram = numa_legacy_auto_assign_ram;
+SET_MACHINE_COMPAT(m, PC_COMPAT_2_10);
 }
 
 DEFINE_Q35_MACHINE(v2_10, "pc-q35-2.10", NULL,
@@ -315,7 +324,6 @@ DEFINE_Q35_MACHINE(v2_10, "pc-q35-2.10", NULL,
 static void pc_q35_2_9_machine_options(MachineClass *m)
 {
 pc_q35_2_10_machine_options(m);
-m->alias = NULL;
 SET_MACHINE_COMPAT(m, PC_COMPAT_2_9);
 }
 
diff --git a/include/hw/compat.h b/include/hw/compat.h
index 08f3600..3e101f8 100644
--- a/include/hw/compat.h
+++ b/include/hw/compat.h
@@ -1,6 +1,9 @@
 #ifndef HW_COMPAT_H
 #define HW_COMPAT_H
 
+#define HW_COMPAT_2_10 \
+/* empty */
+
 #define HW_COMPAT_2_9 \
 {\
 .driver   = "pci-bridge",\
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index d80859b..8226904 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -369,6 +369,9 @@ int e820_add_entry(uint64_t, uint64_t, uint32_t);
 int e820_get_num_entries(void);
 bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
 
+#define PC_COMPAT_2_10 \
+HW_COMPAT_2_10 \
+
 #define PC_COMPAT_2_9 \
 HW_COMPAT_2_9 \
 {\
-- 
1.8.3.1
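
The HW_COMPAT_2_10/PC_COMPAT_2_10 additions above are empty on purpose: they
are the hook where property overrides for the 2.10-era machines will
accumulate once 2.11 development changes device defaults. A minimal sketch of
how these lists are meant to be used follows; the "some-device" entry is
purely hypothetical, just to show the shape of a GlobalProperty override
(the lists added by this patch contain none):

    #define HW_COMPAT_2_10 \
        /* empty */

    #define PC_COMPAT_2_10 \
        HW_COMPAT_2_10 \
        {\
            .driver   = "some-device", /* hypothetical entry for illustration */\
            .property = "some-prop",\
            .value    = "off",\
        },

    /* pc_q35_2_10_machine_options() calls SET_MACHINE_COMPAT(m, PC_COMPAT_2_10),
     * and pc_q35_2_9_machine_options() chains through it, so pc-q35-2.10 and
     * every older machine type keep the 2.10-era defaults even as the 2.11
     * machine moves on. */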




Re: [Qemu-devel] [PATCH V2 2/3] xen-pt: bind/unbind interrupt remapping format MSI

2017-08-14 Thread Lan Tianyu
Hi Anthony:

On 2017年08月12日 02:04, Anthony PERARD wrote:
> On Wed, Aug 09, 2017 at 04:51:21PM -0400, Lan Tianyu wrote:
>> From: Chao Gao <chao@intel.com>
>>
>> If a vIOMMU is exposed to guest, guest will configure the msi to remapping
>> format. The original code isn't suitable to the new format. A new pair
>> bind/unbind interfaces are added for this usage. This patch recognizes
>> this case and uses new interfaces to bind/unbind msi.
>>
>> Signed-off-by: Chao Gao <chao@intel.com>
>> Signed-off-by: Lan Tianyu <tianyu@intel.com>
> 
> Reviewed-by: Anthony PERARD <anthony.per...@citrix.com>
> 
> That patch series can be applied once the Xen side patches are merged.
> 

Great. Thanks for your review.

-- 
Best regards
Tianyu Lan



[Qemu-devel] [Update PATCH V2] x86: Increase max vcpu number to 8192

2017-08-11 Thread Lan Tianyu
The Intel Xeon Phi chip will support 352 logical threads. For the HPC use
case, users create huge VMs with as many vCPUs as the host has CPUs. This
patch increases the max vCPU number from 288 to 8192, the current default
maximum CPU number for the Linux kernel.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
Change since v1:
* Bump max vcpu number from 352 to 8192
* Add compat support for new max vcpu limitation

 hw/i386/pc_q35.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 169a214..e093601 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -299,13 +299,21 @@ static void pc_q35_machine_options(MachineClass *m)
 m->default_display = "std";
 m->no_floppy = 1;
 m->has_dynamic_sysbus = true;
-m->max_cpus = 288;
+m->max_cpus = 8192;
 }
 
+static void pc_q35_2_11_machine_options(MachineClass *m)
+{
+ pc_q35_machine_options(m);
+ m->alias = "q35";
+}
+DEFINE_Q35_MACHINE(v2_11, "pc-q35-2.11", NULL,
+   pc_q35_2_11_machine_options);
+
 static void pc_q35_2_10_machine_options(MachineClass *m)
 {
-pc_q35_machine_options(m);
-m->alias = "q35";
+pc_q35_2_11_machine_options(m);
+m->max_cpus = 288;
 m->numa_auto_assign_ram = numa_legacy_auto_assign_ram;
 }
 
-- 
1.8.3.1




[Qemu-devel] [PATCH v2] x86: Increase max vcpu number to 8192

2017-08-11 Thread Lan Tianyu
The Intel Xeon Phi chip will support 352 logical threads. For the HPC use
case, users create huge VMs with as many vCPUs as the host has CPUs. This
patch increases the max vCPU number from 288 to 8192, the current default
maximum CPU number for the Linux kernel.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
Change since v1:
* Bump max vcpu number from 352 to 8192
* Add compat support for new max vcpu limitation

 hw/i386/pc_q35.c | 15 ---
 include/hw/i386/pc.h |  2 ++
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 169a214..06fdc95 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -299,14 +299,23 @@ static void pc_q35_machine_options(MachineClass *m)
 m->default_display = "std";
 m->no_floppy = 1;
 m->has_dynamic_sysbus = true;
-m->max_cpus = 288;
+m->max_cpus = 8192;
 }
 
+static void pc_q35_2_11_machine_options(MachineClass *m)
+{
+ pc_q35_machine_options(m);
+ m->alias = "q35";
+}
+DEFINE_Q35_MACHINE(v2_11, "pc-q35-2.11", NULL,
+   pc_q35_2_11_machine_options);
+
 static void pc_q35_2_10_machine_options(MachineClass *m)
 {
-pc_q35_machine_options(m);
-m->alias = "q35";
+pc_q35_2_11_machine_options(m);
+m->max_cpus = 288;
 m->numa_auto_assign_ram = numa_legacy_auto_assign_ram;
+SET_MACHINE_COMPAT(m, PC_COMPAT_2_10);
 }
 
 DEFINE_Q35_MACHINE(v2_10, "pc-q35-2.10", NULL,
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index d80859b..634abc6 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -369,6 +369,8 @@ int e820_add_entry(uint64_t, uint64_t, uint32_t);
 int e820_get_num_entries(void);
 bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
 
+#define PC_COMPAT_2_10 \
+HW_COMPAT_2_10 \
 #define PC_COMPAT_2_9 \
 HW_COMPAT_2_9 \
 {\
-- 
1.8.3.1




Re: [Qemu-devel] [PATCH] x86: Increase max vcpu number to 352

2017-08-11 Thread Lan Tianyu
On 2017年08月11日 03:22, Radim Krčmář wrote:
> 2017-08-10 15:16-0300, Eduardo Habkost:
>> On Thu, Aug 10, 2017 at 02:41:03PM +0200, Radim Krčmář wrote:
>>> 2017-08-10 19:02+0800, Lan Tianyu:
>>>> On 2017年08月10日 18:26, Daniel P. Berrange wrote:
>>>>> On Thu, Aug 10, 2017 at 06:08:07PM +0800, Lan Tianyu wrote:
>>>>>> Intel Xeon phi chip will support 352 logical threads. For HPC
>>>>>> usage case, it will create a huge VM with vcpus number as same as host
>>>>>> cpus. This patch is to increase max vcpu number to 352.
>>>>>
>>>>> If we pick arbitray limits based on size of physical CPUs that happen
>>>>> to be shipping today, we'll continue the cat+mouse game forever trailing
>>>>> latest CPUs that vendors ship.
>>>>>
>>>>> IMHO we should pick a higher number influenced by technical constraints
>>>>> of the q35 impl instead. eg can we go straight to something like 512 or
>>>>> 1024  ?
>>>>
>>>> Sure. 512 should be enough and some arrays is defined according to max
>>>> vcpu number.
>>>
>>> Hm, which arrays are that?  I was thinking it is safe to bump it to
>>> INT_MAX as the number is only used when setting global max_cpus.
>>
>> We had a MAX_CPUMASK_BITS macro, and bitmaps whose sizes were
>> defined at compile time based on it.  But commit
>> cdda2018e3b9ce0c18938767dfdb1e05a05b67ca removed it.  Probably
>> those arrays all use max_cpus, by now (and the default for
>> max_cpus is smp_cpus, not MachineClass::max_cpus).
> 
> Ah, thanks.
> 
>> Anyway, if we set it to INT_MAX, there are some cases where more
>> appropriate error checking/reporting could be required because
>> they won't handle overflow very well:
>> * pcms->apic_id_limit initialization at pc_cpus_init()
>> * ACPI code that assumes possible_cpus->cpus[i].arch_id fits
>>   in a 32-bit integer
>> * Other x86_cpu_apic_id_from_index() calls in PC code
>>   (especially the initialization of possible_cpus->cpus[i].arch_id).
>>   Note that x86_cpu_apic_id_from_index(cpu_index) might not fit
>>   in 32 bits even if cpu_index <= UINT32_MAX.
> 
> Good point, looks like it all comes to x86_cpu_apic_id_from_index().
> Each level of the topology has at most one underutilized bit, so
> 2^(32 - 3) would be safe.
> 
> It is still needlessly large for the foreseeable future, but 512 is
> going to be surpassed pretty soon, so I think that jumping at least to
> 8k would be better.
> (8k the current default maximum for Linux and the resulting overcommit
>  of ~20 is bearable for smoke testing on current hardware.)
> 

Hi All:
Thanks for your input. I tried QEMU with 8192 as the max vCPU number and it
works normally. I will update the patches.
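
To make Radim's arithmetic concrete: APIC IDs are built by packing the
per-level topology indices into bit fields whose widths are rounded up to the
next power of two, so each packed level can waste up to one bit, which is
where the 2^(32-3) safety margin comes from. A standalone sketch of the idea
(QEMU's real helpers, in include/hw/i386/topology.h if I remember the path
right, differ in detail):

    #include <inttypes.h>
    #include <stdio.h>

    /* Bits needed to encode 'count' values, rounded up to a power of two;
     * this rounding is where each topology level can lose up to one bit. */
    static unsigned field_width(unsigned count)
    {
        unsigned w = 0;

        while ((1u << w) < count) {
            w++;
        }
        return w;
    }

    /* Simplified APIC-ID packing: [ package | core | thread ].  QEMU's real
     * helpers follow the same idea but handle more levels and corner cases. */
    static uint64_t apic_id(unsigned pkg, unsigned core, unsigned thread,
                            unsigned cores_per_pkg, unsigned threads_per_core)
    {
        unsigned t_bits = field_width(threads_per_core);
        unsigned c_bits = field_width(cores_per_pkg);

        return ((uint64_t)pkg << (t_bits + c_bits)) |
               ((uint64_t)core << t_bits) |
               thread;
    }

    int main(void)
    {
        /* 3 threads/core and 3 cores/package both round up to 2-bit fields,
         * so only 9 of every 16 APIC IDs per package are used.  A large
         * cpu_index therefore yields an APIC ID noticeably larger than
         * itself, which is why a huge max_cpus could overflow the 32-bit
         * APIC ID even when the index itself fits. */
        printf("APIC ID of pkg 999, core 2, thread 2: %" PRIu64 "\n",
               apic_id(999, 2, 2, 3, 3));
        return 0;
    }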

-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH] x86: Increase max vcpu number to 352

2017-08-10 Thread Lan Tianyu
On 2017年08月10日 18:26, Daniel P. Berrange wrote:
> On Thu, Aug 10, 2017 at 06:08:07PM +0800, Lan Tianyu wrote:
>> Intel Xeon phi chip will support 352 logical threads. For HPC
>> usage case, it will create a huge VM with vcpus number as same as host
>> cpus. This patch is to increase max vcpu number to 352.
> 
> If we pick arbitray limits based on size of physical CPUs that happen
> to be shipping today, we'll continue the cat+mouse game forever trailing
> latest CPUs that vendors ship.
> 
> IMHO we should pick a higher number influenced by technical constraints
> of the q35 impl instead. eg can we go straight to something like 512 or
> 1024  ?

Sure. 512 should be enough, and some arrays are defined according to the max
vCPU number.

> 
>> Signed-off-by: Lan Tianyu <tianyu@intel.com>
>> ---
>>  hw/i386/pc_q35.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
>> index 169a214..5e93749 100644
>> --- a/hw/i386/pc_q35.c
>> +++ b/hw/i386/pc_q35.c
>> @@ -299,7 +299,7 @@ static void pc_q35_machine_options(MachineClass *m)
>>  m->default_display = "std";
>>  m->no_floppy = 1;
>>  m->has_dynamic_sysbus = true;
>> -m->max_cpus = 288;
>> +m->max_cpus = 352;
>>  }
> 
> You'll need to introduce machine type back compat support so that we
> avoid changing the 2.10 q35 machine type - only the 2.11 machine
> type should have the new limit.

How about the following change?

-static void pc_q35_2_10_machine_options(MachineClass *m)
+static void pc_q35_2_11_machine_options(MachineClass *m)
 {
  pc_q35_machine_options(m);
  m->alias = "q35";
+}
+
+static void pc_q35_2_10_machine_options(MachineClass *m)
+{
+pc_q35_2_11_options(m);
+m->alias = "q35";
+m->max_cpus = 288;
 m->numa_auto_assign_ram = numa_legacy_auto_assign_ram;
 }
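
For reference, the shape this settles into in the v2/v3 postings earlier in
this archive (sketch assembled from those patches, not a verbatim quote): the
2.10 options call pc_q35_2_11_machine_options(), clear the alias so only the
newest machine owns "q35", and pin the old vCPU limit.

static void pc_q35_2_11_machine_options(MachineClass *m)
{
    pc_q35_machine_options(m);
    m->alias = "q35";                 /* "q35" now points at the 2.11 machine */
}

DEFINE_Q35_MACHINE(v2_11, "pc-q35-2.11", NULL,
                   pc_q35_2_11_machine_options);

static void pc_q35_2_10_machine_options(MachineClass *m)
{
    pc_q35_2_11_machine_options(m);   /* inherit, then undo 2.11-only changes */
    m->alias = NULL;
    m->max_cpus = 288;                /* keep the old limit for pc-q35-2.10 */
    m->numa_auto_assign_ram = numa_legacy_auto_assign_ram;
    SET_MACHINE_COMPAT(m, PC_COMPAT_2_10);
}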



> 
> Regards,
> Daniel
> 


-- 
Best regards
Tianyu Lan



[Qemu-devel] [PATCH] x86: Increase max vcpu number to 352

2017-08-10 Thread Lan Tianyu
The Intel Xeon Phi chip will support 352 logical threads. For the HPC use
case, users create huge VMs with as many vCPUs as the host has CPUs. This
patch increases the max vCPU number to 352.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/i386/pc_q35.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 169a214..5e93749 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -299,7 +299,7 @@ static void pc_q35_machine_options(MachineClass *m)
 m->default_display = "std";
 m->no_floppy = 1;
 m->has_dynamic_sysbus = true;
-m->max_cpus = 288;
+m->max_cpus = 352;
 }
 
 static void pc_q35_2_10_machine_options(MachineClass *m)
-- 
1.8.3.1




[Qemu-devel] [PATCH] x86: Increase max vcpu number to 352

2017-08-10 Thread Lan Tianyu
The Intel Xeon Phi chip will support 352 logical threads. For the HPC use
case, users create huge VMs with as many vCPUs as the host has CPUs. This
patch increases the max vCPU number to 352.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/i386/pc_q35.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 169a214..5e93749 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -299,7 +299,7 @@ static void pc_q35_machine_options(MachineClass *m)
 m->default_display = "std";
 m->no_floppy = 1;
 m->has_dynamic_sysbus = true;
-m->max_cpus = 288;
+m->max_cpus = 352;
 }
 
 static void pc_q35_2_10_machine_options(MachineClass *m)
-- 
1.8.3.1




Re: [Qemu-devel] [PATCH v2 0/3] Qemu: Add Xen vIOMMU interrupt remapping function support

2017-08-10 Thread Lan Tianyu
On 2017年08月10日 17:04, Paolo Bonzini wrote:
> On 09/08/2017 22:51, Lan Tianyu wrote:
>> This patchset is to deal with MSI interrupt remapping request when guest
>> updates MSI registers.
>>
>> Chao Gao (3):
>>   i386/msi: Correct mask of destination ID in MSI address
>>   xen-pt: bind/unbind interrupt remapping format MSI
>>   msi: Handle remappable format interrupt request
>>
>>  configure |  4 +++-
>>  hw/i386/xen/xen-hvm.c |  8 ++-
>>  hw/pci/msi.c  |  5 +++--
>>  hw/pci/msix.c |  4 +++-
>>  hw/xen/xen_pt_msi.c   | 52 
>> +++
>>  include/hw/i386/apic-msidef.h |  3 ++-
>>  include/hw/xen/xen.h  |  2 +-
>>  include/hw/xen/xen_common.h   | 25 +
>>  stubs/xen-hvm.c   |  2 +-
>>  9 files changed, 83 insertions(+), 22 deletions(-)
>>
> 
> Non-Xen parts look good (though I cannot ack them).
> 
> Paolo
> 

Never mind. Thanks for your review.

-- 
Best regards
Tianyu Lan



[Qemu-devel] [PATCH V2 2/3] xen-pt: bind/unbind interrupt remapping format MSI

2017-08-09 Thread Lan Tianyu
From: Chao Gao <chao@intel.com>

If a vIOMMU is exposed to the guest, the guest will configure the MSI in
remapping format. The original code isn't suitable for the new format. A new
pair of bind/unbind interfaces is added for this purpose. This patch
recognizes this case and uses the new interfaces to bind/unbind the MSI.

Signed-off-by: Chao Gao <chao@intel.com>
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 configure |  4 +++-
 hw/xen/xen_pt_msi.c   | 50 ---
 include/hw/i386/apic-msidef.h |  1 +
 include/hw/xen/xen_common.h   | 25 ++
 4 files changed, 66 insertions(+), 14 deletions(-)

diff --git a/configure b/configure
index dd73cce..92b56d3 100755
--- a/configure
+++ b/configure
@@ -2108,13 +2108,15 @@ EOF
 elif
 cat > $TMPC <
 #include 
 int main(void) {
+  xc_interface *xc = NULL;
   xenforeignmemory_handle *xfmem;
 
   xfmem = xenforeignmemory_open(0, 0);
   xenforeignmemory_map2(xfmem, 0, 0, 0, 0, 0, 0, 0);
-
+  xc_domain_update_msi_irq_remapping(xc, 0, 0, 0, 0, 0, 0);
   return 0;
 }
 EOF
diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c
index ff9a79f..5c5d15a 100644
--- a/hw/xen/xen_pt_msi.c
+++ b/hw/xen/xen_pt_msi.c
@@ -163,16 +163,23 @@ static int msi_msix_update(XenPCIPassthroughState *s,
 int rc = 0;
 uint64_t table_addr = 0;
 
-XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x gflags %#x"
-   " (entry: %#x)\n",
-   is_msix ? "-X" : "", pirq, gvec, gflags, msix_entry);
-
 if (is_msix) {
 table_addr = s->msix->mmio_base_addr;
 }
 
-rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec,
-  pirq, gflags, table_addr);
+if (addr & MSI_ADDR_IF_MASK) {
+XEN_PT_LOG(d, "Updating MSI%s with addr %#" PRIx64 " data %#x\n",
+   is_msix ? "-X" : "", addr, data);
+rc = xc_domain_update_msi_irq_remapping(xen_xc, xen_domid, pirq,
+ d->devfn, data, addr, table_addr);
+} else {
+XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x gflags %#x"
+   " (entry: %#x)\n",
+   is_msix ? "-X" : "", pirq, gvec, gflags, msix_entry);
+
+rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec,
+  pirq, gflags, table_addr);
+}
 
 if (rc) {
 XEN_PT_ERR(d, "Updating of MSI%s failed. (err: %d)\n",
@@ -204,13 +211,30 @@ static int msi_msix_disable(XenPCIPassthroughState *s,
 }
 
 if (is_binded) {
-XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n",
-   is_msix ? "-X" : "", pirq, gvec);
-rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec, pirq, gflags);
-if (rc) {
-XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, 
gvec: %#x)\n",
-   is_msix ? "-X" : "", errno, pirq, gvec);
-return rc;
+if (addr & MSI_ADDR_IF_MASK) {
+XEN_PT_LOG(d, "Unbinding of MSI%s . ( pirq: %d, data: %x, "
+   "addr: %#" PRIx64 ")\n",
+   is_msix ? "-X" : "", pirq, data, addr);
+rc = xc_domain_unbind_msi_irq_remapping(xen_xc, xen_domid, pirq,
+d->devfn, data, addr);
+if (rc) {
+XEN_PT_ERR(d, "Unbinding of MSI%s . (error: %d, pirq: %d, "
+   "data: %x, addr: %#" PRIx64 ")\n",
+   is_msix ? "-X" : "", rc, pirq, data, addr);
+return rc;
+}
+
+} else {
+XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n",
+   is_msix ? "-X" : "", pirq, gvec);
+rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec,
+  pirq, gflags);
+if (rc) {
+XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, "
+   "gvec: %#x)\n",
+   is_msix ? "-X" : "", errno, pirq, gvec);
+return rc;
+}
 }
 }
 
diff --git a/include/hw/i386/apic-msidef.h b/include/hw/i386/apic-msidef.h
index 420b411..a2b52d9 100644
--- a/include/hw/i386/apic-msidef.h
+++ b/include/hw/i386/apic-msidef.h
@@ -27,5 +27,6 @@
 #define MSI_ADDR_DEST_ID_SHIFT  12
 #define MSI_ADDR_DEST_IDX_SHIFT 4
 #define  MSI_ADDR_DEST_ID_MASK  0x000ff000
+#define  MSI_ADDR_IF_MASK   0x0010
 

[Qemu-devel] [PATCH V2 3/3] msi: Handle remappable format interrupt request

2017-08-09 Thread Lan Tianyu
From: Chao Gao <chao@intel.com>

According to the VT-d spec (Interrupt Remapping and Interrupt Posting ->
Interrupt Remapping -> Interrupt Request Formats on Intel 64 Platforms),
the fields of the MSI data register have changed. This patch avoids wrongly
regarding a remappable-format interrupt request as an interrupt bound to an
event channel.

Signed-off-by: Chao Gao <chao@intel.com>
Signed-off-by: Lan Tianyu <tianyu@intel.com>
Acked-by: Anthony PERARD <anthony.per...@citrix.com>
---
 hw/i386/xen/xen-hvm.c | 8 +++-
 hw/pci/msi.c  | 5 +++--
 hw/pci/msix.c | 4 +++-
 hw/xen/xen_pt_msi.c   | 2 +-
 include/hw/xen/xen.h  | 2 +-
 stubs/xen-hvm.c   | 2 +-
 6 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/hw/i386/xen/xen-hvm.c b/hw/i386/xen/xen-hvm.c
index d9ccd5d..cbbce73 100644
--- a/hw/i386/xen/xen-hvm.c
+++ b/hw/i386/xen/xen-hvm.c
@@ -145,8 +145,14 @@ void xen_piix_pci_write_config_client(uint32_t address, 
uint32_t val, int len)
 }
 }
 
-int xen_is_pirq_msi(uint32_t msi_data)
+int xen_is_pirq_msi(uint32_t msi_addr_lo, uint32_t msi_data)
 {
+/* If the MSI address is configured in remapping format, the MSI will not
+ * be remapped into a pirq.
+ */
+if (msi_addr_lo & MSI_ADDR_IF_MASK) {
+return 0;
+}
 /* If vector is 0, the msi is remapped into a pirq, passed as
  * dest_id.
  */
diff --git a/hw/pci/msi.c b/hw/pci/msi.c
index 5e05ce5..d05c876 100644
--- a/hw/pci/msi.c
+++ b/hw/pci/msi.c
@@ -289,7 +289,7 @@ void msi_reset(PCIDevice *dev)
 static bool msi_is_masked(const PCIDevice *dev, unsigned int vector)
 {
 uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
-uint32_t mask, data;
+uint32_t mask, data, addr_lo;
 bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
 assert(vector < PCI_MSI_VECTORS_MAX);
 
@@ -298,7 +298,8 @@ static bool msi_is_masked(const PCIDevice *dev, unsigned 
int vector)
 }
 
 data = pci_get_word(dev->config + msi_data_off(dev, msi64bit));
-if (xen_is_pirq_msi(data)) {
+addr_lo = pci_get_long(dev->config + msi_address_lo_off(dev));
+if (xen_is_pirq_msi(addr_lo, data)) {
 return false;
 }
 
diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index 5078d3d..65cf00c 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -83,9 +83,11 @@ static bool msix_vector_masked(PCIDevice *dev, unsigned int 
vector, bool fmask)
 {
 unsigned offset = vector * PCI_MSIX_ENTRY_SIZE;
 uint8_t *data = &dev->msix_table[offset + PCI_MSIX_ENTRY_DATA];
+uint8_t *addr_lo = &dev->msix_table[offset + PCI_MSIX_ENTRY_LOWER_ADDR];
 /* MSIs on Xen can be remapped into pirqs. In those cases, masking
  * and unmasking go through the PV evtchn path. */
-if (xen_enabled() && xen_is_pirq_msi(pci_get_long(data))) {
+if (xen_enabled() && xen_is_pirq_msi(pci_get_long(addr_lo),
+ pci_get_long(data))) {
 return false;
 }
 return fmask || dev->msix_table[offset + PCI_MSIX_ENTRY_VECTOR_CTRL] &
diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c
index 5c5d15a..82e13b2 100644
--- a/hw/xen/xen_pt_msi.c
+++ b/hw/xen/xen_pt_msi.c
@@ -114,7 +114,7 @@ static int msi_msix_setup(XenPCIPassthroughState *s,
 
 assert((!is_msix && msix_entry == 0) || is_msix);
 
-if (xen_is_pirq_msi(data)) {
+if (xen_is_pirq_msi(addr, data)) {
 *ppirq = msi_ext_dest_id(addr >> 32) | msi_dest_id(addr);
 if (!*ppirq) {
 /* this probably identifies an misconfiguration of the guest,
diff --git a/include/hw/xen/xen.h b/include/hw/xen/xen.h
index 7efcdaa..0d6c83e 100644
--- a/include/hw/xen/xen.h
+++ b/include/hw/xen/xen.h
@@ -34,7 +34,7 @@ int xen_pci_slot_get_pirq(PCIDevice *pci_dev, int irq_num);
 void xen_piix3_set_irq(void *opaque, int irq_num, int level);
 void xen_piix_pci_write_config_client(uint32_t address, uint32_t val, int len);
 void xen_hvm_inject_msi(uint64_t addr, uint32_t data);
-int xen_is_pirq_msi(uint32_t msi_data);
+int xen_is_pirq_msi(uint32_t msi_addr_lo, uint32_t msi_data);
 
 qemu_irq *xen_interrupt_controller_init(void);
 
diff --git a/stubs/xen-hvm.c b/stubs/xen-hvm.c
index 3ca6c51..aeb1592 100644
--- a/stubs/xen-hvm.c
+++ b/stubs/xen-hvm.c
@@ -31,7 +31,7 @@ void xen_hvm_inject_msi(uint64_t addr, uint32_t data)
 {
 }
 
-int xen_is_pirq_msi(uint32_t msi_data)
+int xen_is_pirq_msi(uint32_t msi_addr_lo, uint32_t msi_data)
 {
 return 0;
 }
-- 
1.8.3.1
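
Stepping back, what the whole series keys on is one bit of the MSI address:
in VT-d remappable format, bit 4 (MSI_ADDR_IF_MASK) is set and bits 19:12 no
longer hold a destination ID, so they cannot be carrying a Xen pirq either.
A small sketch of the resulting decision, using the constants from the
patches above; the compatibility-format branch paraphrases the existing
xen-hvm.c comment about vector 0:

    #include <stdint.h>

    #define MSI_ADDR_IF_MASK  0x00000010u  /* bit 4: interrupt format (VT-d) */

    /* Sketch of the decision this series makes xen_is_pirq_msi() take. */
    static int is_pirq_msi(uint32_t msi_addr_lo, uint32_t msi_data)
    {
        if (msi_addr_lo & MSI_ADDR_IF_MASK) {
            /* Remappable format: bits 19:12 are part of the remapping
             * handle, not a destination ID, so they cannot be smuggling a
             * pirq.  Let Xen's vIOMMU deal with it. */
            return 0;
        }
        /* Compatibility format: per the xen-hvm.c comment, a zero vector
         * marks an MSI that is remapped into a pirq passed as dest_id. */
        return (msi_data & 0xff) == 0;
    }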




[Qemu-devel] [PATCH v2 0/3] Qemu: Add Xen vIOMMU interrupt remapping function support

2017-08-09 Thread Lan Tianyu
This patchset deals with MSI interrupt remapping requests when the guest
updates MSI registers.

Chao Gao (3):
  i386/msi: Correct mask of destination ID in MSI address
  xen-pt: bind/unbind interrupt remapping format MSI
  msi: Handle remappable format interrupt request

 configure |  4 +++-
 hw/i386/xen/xen-hvm.c |  8 ++-
 hw/pci/msi.c  |  5 +++--
 hw/pci/msix.c |  4 +++-
 hw/xen/xen_pt_msi.c   | 52 +++
 include/hw/i386/apic-msidef.h |  3 ++-
 include/hw/xen/xen.h  |  2 +-
 include/hw/xen/xen_common.h   | 25 +
 stubs/xen-hvm.c   |  2 +-
 9 files changed, 83 insertions(+), 22 deletions(-)

-- 
1.8.3.1




[Qemu-devel] [PATCH V2 1/3] i386/msi: Correct mask of destination ID in MSI address

2017-08-09 Thread Lan Tianyu
From: Chao Gao <chao@intel.com>

According to SDM 10.11.1, only bits [19:12] of the MSI address are the
Destination ID. Change the mask to avoid ambiguity, since the VT-d spec
uses bit 4 to indicate a remappable interrupt request.

Signed-off-by: Chao Gao <chao@intel.com>
Signed-off-by: Lan Tianyu <tianyu@intel.com>
Reviewed-by: Anthony PERARD <anthony.per...@citrix.com>
Reviewed-by: Peter Xu <pet...@redhat.com>
---
 include/hw/i386/apic-msidef.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/hw/i386/apic-msidef.h b/include/hw/i386/apic-msidef.h
index 8b4d4cc..420b411 100644
--- a/include/hw/i386/apic-msidef.h
+++ b/include/hw/i386/apic-msidef.h
@@ -26,6 +26,6 @@
 
 #define MSI_ADDR_DEST_ID_SHIFT  12
 #define MSI_ADDR_DEST_IDX_SHIFT 4
-#define  MSI_ADDR_DEST_ID_MASK  0x000
+#define  MSI_ADDR_DEST_ID_MASK  0x000ff000
 
 #endif /* HW_APIC_MSIDEF_H */
-- 
1.8.3.1




Re: [Qemu-devel] [PATCH 2/3] xen-pt: bind/unbind interrupt remapping format MSI

2017-07-03 Thread Lan Tianyu
Hi Anthony:

On 2017年06月30日 23:48, Anthony PERARD wrote:
> On Thu, Jun 29, 2017 at 01:49:53AM -0400, Lan Tianyu wrote:
>> From: Chao Gao <chao@intel.com>
>>
>> If a vIOMMU is exposed to guest, guest will configure the msi to remapping
>> format. The original code isn't suitable to the new format. A new pair
>> bind/unbind interfaces are added for this usage. This patch recognizes
>> this case and uses new interfaces to bind/unbind msi.
>>
>> Signed-off-by: Chao Gao <chao@intel.com>
>> Signed-off-by: Lan Tianyu <tianyu@intel.com>
> 
> Hi,
> 
> The patch series is going to need to be rebased on top of QEMU upstream.
> For starter, configure have changed a bit.

Thanks for your reminder. Will do that.


> 
>> ---
>>  configure | 54 
>> +++
>>  hw/xen/xen_pt_msi.c   | 50 ---
>>  include/hw/i386/apic-msidef.h |  1 +
>>  include/hw/xen/xen_common.h   | 25 
>>  4 files changed, 117 insertions(+), 13 deletions(-)
>>
>> diff --git a/configure b/configure
>> index 476210b..b3ac49f 100755
>> --- a/configure
>> +++ b/configure
>> @@ -1982,6 +1982,60 @@ EOF
>>  /*
>>   * If we have stable libs the we don't want the libxc compat
>>   * layers, regardless of what CFLAGS we may have been given.
>> + */
>> +#undef XC_WANT_COMPAT_EVTCHN_API
>> +#undef XC_WANT_COMPAT_GNTTAB_API
>> +#undef XC_WANT_COMPAT_MAP_FOREIGN_API
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#if !defined(HVM_MAX_VCPUS)
>> +# error HVM_MAX_VCPUS not defined
>> +#endif
>> +int main(void) {
>> +  xc_interface *xc = NULL;
>> +  xenforeignmemory_handle *xfmem;
>> +  xenevtchn_handle *xe;
>> +  xengnttab_handle *xg;
>> +  xen_domain_handle_t handle;
>> +  xengnttab_grant_copy_segment_t* seg = NULL;
>> +
>> +  xs_daemon_open();
>> +
>> +  xc = xc_interface_open(0, 0, 0);
>> +  xc_hvm_set_mem_type(0, 0, HVMMEM_ram_ro, 0, 0);
>> +  xc_domain_add_to_physmap(0, 0, XENMAPSPACE_gmfn, 0, 0);
>> +  xc_hvm_inject_msi(xc, 0, 0xf000, 0x);
>> +  xc_hvm_create_ioreq_server(xc, 0, HVM_IOREQSRV_BUFIOREQ_ATOMIC, NULL);
>> +  xc_domain_create(xc, 0, handle, 0, NULL, NULL);
>> +
>> +  xfmem = xenforeignmemory_open(0, 0);
>> +  xenforeignmemory_map(xfmem, 0, 0, 0, 0, 0);
>> +
>> +  xe = xenevtchn_open(0, 0);
>> +  xenevtchn_fd(xe);
>> +
>> +  xg = xengnttab_open(0, 0);
>> +  xengnttab_grant_copy(xg, 0, seg);
>> +
>> +  xc_domain_update_msi_irq_remapping(xc, 0, 0, 0, 0, 0 ,0);
>> +
>> +  return 0;
>> +}
>> +EOF
>> +  compile_prog "" "$xen_libs $xen_stable_libs"
>> +then
>> +xen_ctrl_version=4100
>> +xen=yes
> 
> There have been some change/refactoring in configure, so this won't
> work. The xen_ctrl_version got one more digit.
> 
> Can you try with this patch? Which is also simpler.

Sure. Thanks.

> diff --git a/configure b/configure
> index c571ad14e5..a06f2c0b92 100755
> --- a/configure
> +++ b/configure
> @@ -2021,6 +2021,24 @@ EOF
>  # Xen unstable
>  elif
>  cat > $TMPC < +#include 
> +int main(void) {
> +  xc_interface *xc = NULL;
> +
> +  xc_domain_update_msi_irq_remapping(xc, 0, 0, 0, 0, 0 ,0);
> +
> +  return 0;
> +}
> +EOF
> +compile_prog "" "$xen_libs -lxendevicemodel $xen_stable_libs"
> +  then
> +  xen_stable_libs="-lxendevicemodel $xen_stable_libs"
> +  xen_ctrl_version=41000
> +  xen=yes
> +
> +# Xen 4.9
> +elif
> +cat > $TMPC <  #undef XC_WANT_COMPAT_DEVICEMODEL_API
>  #define __XEN_TOOLS__
>  #include 
> 
> 
>> index 8e1580d..4ba43a8 100644
>> --- a/include/hw/xen/xen_common.h
>> +++ b/include/hw/xen/xen_common.h
>> @@ -438,4 +438,29 @@ static inline int xengnttab_grant_copy(xengnttab_handle 
>> *xgt, uint32_t count,
>>  }
>>  #endif
>>  
>> +/* Xen before 4.10 */
>> +#if CONFIG_XEN_CTRL_INTERFACE_VERSION < 4100
> 
> This will needs to be
> CONFIG_XEN_CTRL_INTERFACE_VERSION < 41000
> 

Will update.
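
The probe program inside Anthony's suggested configure hunk, which the list
archive has mangled above, boils down to a few lines of C: compile-test a
call to the new libxc wrapper and, if it links against -lxendevicemodel plus
the stable Xen libs, set xen_ctrl_version=41000. Roughly (the stripped
#include is presumably <xenctrl.h>, where the xc_* wrappers are declared):

    /* Reconstruction of the configure compile test quoted above; it only
     * needs to see the xc_domain_update_msi_irq_remapping() prototype,
     * which this series adds on the Xen side. */
    #include <xenctrl.h>

    int main(void)
    {
        xc_interface *xc = NULL;

        xc_domain_update_msi_irq_remapping(xc, 0, 0, 0, 0, 0, 0);

        return 0;
    }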

-- 
Best regards
Tianyu Lan



[Qemu-devel] [PATCH 1/3] i386/msi: Correct mask of destination ID in MSI address

2017-06-29 Thread Lan Tianyu
From: Chao Gao <chao@intel.com>

According to SDM 10.11.1, only bits [19:12] of the MSI address are the
Destination ID. Change the mask to avoid ambiguity, since the VT-d spec
uses bit 4 to indicate a remappable interrupt request.

Signed-off-by: Chao Gao <chao@intel.com>
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 include/hw/i386/apic-msidef.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/hw/i386/apic-msidef.h b/include/hw/i386/apic-msidef.h
index 8b4d4cc..420b411 100644
--- a/include/hw/i386/apic-msidef.h
+++ b/include/hw/i386/apic-msidef.h
@@ -26,6 +26,6 @@
 
 #define MSI_ADDR_DEST_ID_SHIFT  12
 #define MSI_ADDR_DEST_IDX_SHIFT 4
-#define  MSI_ADDR_DEST_ID_MASK  0x000
+#define  MSI_ADDR_DEST_ID_MASK  0x000ff000
 
 #endif /* HW_APIC_MSIDEF_H */
-- 
1.8.3.1




[Qemu-devel] [PATCH 3/3] msi: Handle remappable format interrupt request

2017-06-29 Thread Lan Tianyu
From: Chao Gao <chao@intel.com>

According to the VT-d spec (Interrupt Remapping and Interrupt Posting ->
Interrupt Remapping -> Interrupt Request Formats on Intel 64 Platforms),
the fields of the MSI data register have changed. This patch avoids wrongly
regarding a remappable-format interrupt request as an interrupt bound to an
event channel.

Signed-off-by: Chao Gao <chao@intel.com>
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/pci/msi.c | 5 +++--
 hw/pci/msix.c| 4 +++-
 hw/xen/xen_pt_msi.c  | 2 +-
 include/hw/xen/xen.h | 2 +-
 xen-hvm-stub.c   | 2 +-
 xen-hvm.c| 8 +++-
 6 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/hw/pci/msi.c b/hw/pci/msi.c
index a87b227..3713ea6 100644
--- a/hw/pci/msi.c
+++ b/hw/pci/msi.c
@@ -289,7 +289,7 @@ void msi_reset(PCIDevice *dev)
 static bool msi_is_masked(const PCIDevice *dev, unsigned int vector)
 {
 uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
-uint32_t mask, data;
+uint32_t mask, data, addr_lo;
 bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
 assert(vector < PCI_MSI_VECTORS_MAX);
 
@@ -298,7 +298,8 @@ static bool msi_is_masked(const PCIDevice *dev, unsigned 
int vector)
 }
 
 data = pci_get_word(dev->config + msi_data_off(dev, msi64bit));
-if (xen_is_pirq_msi(data)) {
+addr_lo = pci_get_long(dev->config + msi_address_lo_off(dev));
+if (xen_is_pirq_msi(addr_lo, data)) {
 return false;
 }
 
diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index 0ec1cb1..6dda83c 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -81,9 +81,11 @@ static bool msix_vector_masked(PCIDevice *dev, unsigned int 
vector, bool fmask)
 {
 unsigned offset = vector * PCI_MSIX_ENTRY_SIZE;
 uint8_t *data = &dev->msix_table[offset + PCI_MSIX_ENTRY_DATA];
+uint8_t *addr_lo = &dev->msix_table[offset + PCI_MSIX_ENTRY_LOWER_ADDR];
 /* MSIs on Xen can be remapped into pirqs. In those cases, masking
  * and unmasking go through the PV evtchn path. */
-if (xen_enabled() && xen_is_pirq_msi(pci_get_long(data))) {
+if (xen_enabled() && xen_is_pirq_msi(pci_get_long(addr_lo),
+ pci_get_long(data))) {
 return false;
 }
 return fmask || dev->msix_table[offset + PCI_MSIX_ENTRY_VECTOR_CTRL] &
diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c
index 26a69d9..68da0e4 100644
--- a/hw/xen/xen_pt_msi.c
+++ b/hw/xen/xen_pt_msi.c
@@ -114,7 +114,7 @@ static int msi_msix_setup(XenPCIPassthroughState *s,
 
 assert((!is_msix && msix_entry == 0) || is_msix);
 
-if (xen_is_pirq_msi(data)) {
+if (xen_is_pirq_msi(addr, data)) {
 *ppirq = msi_ext_dest_id(addr >> 32) | msi_dest_id(addr);
 if (!*ppirq) {
 /* this probably identifies an misconfiguration of the guest,
diff --git a/include/hw/xen/xen.h b/include/hw/xen/xen.h
index a8f3afb..b1ef41e 100644
--- a/include/hw/xen/xen.h
+++ b/include/hw/xen/xen.h
@@ -33,7 +33,7 @@ int xen_pci_slot_get_pirq(PCIDevice *pci_dev, int irq_num);
 void xen_piix3_set_irq(void *opaque, int irq_num, int level);
 void xen_piix_pci_write_config_client(uint32_t address, uint32_t val, int len);
 void xen_hvm_inject_msi(uint64_t addr, uint32_t data);
-int xen_is_pirq_msi(uint32_t msi_data);
+int xen_is_pirq_msi(uint32_t msi_addr_lo, uint32_t msi_data);
 
 qemu_irq *xen_interrupt_controller_init(void);
 
diff --git a/xen-hvm-stub.c b/xen-hvm-stub.c
index c500325..a3a3bb3 100644
--- a/xen-hvm-stub.c
+++ b/xen-hvm-stub.c
@@ -31,7 +31,7 @@ void xen_hvm_inject_msi(uint64_t addr, uint32_t data)
 {
 }
 
-int xen_is_pirq_msi(uint32_t msi_data)
+int xen_is_pirq_msi(uint32_t msi_addr_lo, uint32_t msi_data)
 {
 return 0;
 }
diff --git a/xen-hvm.c b/xen-hvm.c
index 0892361..9f1a7bd 100644
--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -146,8 +146,14 @@ void xen_piix_pci_write_config_client(uint32_t address, 
uint32_t val, int len)
 }
 }
 
-int xen_is_pirq_msi(uint32_t msi_data)
+int xen_is_pirq_msi(uint32_t msi_addr_lo, uint32_t msi_data)
 {
+/* If the MSI address is configured in remapping format, the MSI will not
+ * be remapped into a pirq.
+ */
+if (msi_addr_lo & MSI_ADDR_IF_MASK) {
+return 0;
+}
 /* If vector is 0, the msi is remapped into a pirq, passed as
  * dest_id.
  */
-- 
1.8.3.1




[Qemu-devel] [PATCH 2/3] xen-pt: bind/unbind interrupt remapping format MSI

2017-06-29 Thread Lan Tianyu
From: Chao Gao <chao@intel.com>

If a vIOMMU is exposed to the guest, the guest will configure the MSI in
remapping format. The original code isn't suitable for the new format. A new
pair of bind/unbind interfaces is added for this purpose. This patch
recognizes this case and uses the new interfaces to bind/unbind the MSI.

Signed-off-by: Chao Gao <chao@intel.com>
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 configure | 54 +++
 hw/xen/xen_pt_msi.c   | 50 ---
 include/hw/i386/apic-msidef.h |  1 +
 include/hw/xen/xen_common.h   | 25 
 4 files changed, 117 insertions(+), 13 deletions(-)

diff --git a/configure b/configure
index 476210b..b3ac49f 100755
--- a/configure
+++ b/configure
@@ -1982,6 +1982,60 @@ EOF
 /*
  * If we have stable libs the we don't want the libxc compat
  * layers, regardless of what CFLAGS we may have been given.
+ */
+#undef XC_WANT_COMPAT_EVTCHN_API
+#undef XC_WANT_COMPAT_GNTTAB_API
+#undef XC_WANT_COMPAT_MAP_FOREIGN_API
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#if !defined(HVM_MAX_VCPUS)
+# error HVM_MAX_VCPUS not defined
+#endif
+int main(void) {
+  xc_interface *xc = NULL;
+  xenforeignmemory_handle *xfmem;
+  xenevtchn_handle *xe;
+  xengnttab_handle *xg;
+  xen_domain_handle_t handle;
+  xengnttab_grant_copy_segment_t* seg = NULL;
+
+  xs_daemon_open();
+
+  xc = xc_interface_open(0, 0, 0);
+  xc_hvm_set_mem_type(0, 0, HVMMEM_ram_ro, 0, 0);
+  xc_domain_add_to_physmap(0, 0, XENMAPSPACE_gmfn, 0, 0);
+  xc_hvm_inject_msi(xc, 0, 0xf000, 0x);
+  xc_hvm_create_ioreq_server(xc, 0, HVM_IOREQSRV_BUFIOREQ_ATOMIC, NULL);
+  xc_domain_create(xc, 0, handle, 0, NULL, NULL);
+
+  xfmem = xenforeignmemory_open(0, 0);
+  xenforeignmemory_map(xfmem, 0, 0, 0, 0, 0);
+
+  xe = xenevtchn_open(0, 0);
+  xenevtchn_fd(xe);
+
+  xg = xengnttab_open(0, 0);
+  xengnttab_grant_copy(xg, 0, seg);
+
+  xc_domain_update_msi_irq_remapping(xc, 0, 0, 0, 0, 0 ,0);
+
+  return 0;
+}
+EOF
+  compile_prog "" "$xen_libs $xen_stable_libs"
+then
+xen_ctrl_version=4100
+xen=yes
+  elif
+  cat > $TMPC <msix->mmio_base_addr;
 }
 
-rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec,
-  pirq, gflags, table_addr);
+if (addr & MSI_ADDR_IF_MASK) {
+XEN_PT_LOG(d, "Updating MSI%s with addr %#" PRIx64 " data %#x\n",
+   is_msix ? "-X" : "", addr, data);
+rc = xc_domain_update_msi_irq_remapping(xen_xc, xen_domid, pirq,
+ d->devfn, data, addr, table_addr);
+} else {
+XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x gflags %#x"
+   " (entry: %#x)\n",
+   is_msix ? "-X" : "", pirq, gvec, gflags, msix_entry);
+
+rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec,
+  pirq, gflags, table_addr);
+}
 
 if (rc) {
 XEN_PT_ERR(d, "Updating of MSI%s failed. (err: %d)\n",
@@ -204,13 +211,30 @@ static int msi_msix_disable(XenPCIPassthroughState *s,
 }
 
 if (is_binded) {
-XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n",
-   is_msix ? "-X" : "", pirq, gvec);
-rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec, pirq, gflags);
-if (rc) {
-XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, 
gvec: %#x)\n",
-   is_msix ? "-X" : "", errno, pirq, gvec);
-return rc;
+if (addr & MSI_ADDR_IF_MASK) {
+XEN_PT_LOG(d, "Unbinding of MSI%s . ( pirq: %d, data: %x, "
+   "addr: %#" PRIx64 ")\n",
+   is_msix ? "-X" : "", pirq, data, addr);
+rc = xc_domain_unbind_msi_irq_remapping(xen_xc, xen_domid, pirq,
+d->devfn, data, addr);
+if (rc) {
+XEN_PT_ERR(d, "Unbinding of MSI%s . (error: %d, pirq: %d, "
+   "data: %x, addr: %#" PRIx64 ")\n",
+   is_msix ? "-X" : "", rc, pirq, data, addr);
+return rc;
+}
+
+} else {
+XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n",
+   is_msix ? "-X" : "", pirq, gvec);
+rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec,
+  pirq, gflags);
+if (rc) {
+XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, "
+

[Qemu-devel] [PATCH 0/3] Qemu: Add Xen vIOMMU interrupt remapping function support

2017-06-29 Thread Lan Tianyu
This patchset deals with MSI interrupt remapping requests when the guest
updates MSI registers.

Chao Gao (3):
  i386/msi: Correct mask of destination ID in MSI address
  xen-pt: bind/unbind interrupt remapping format MSI
  msi: Handle remappable format interrupt request

 configure | 54 +++
 hw/pci/msi.c  |  5 ++--
 hw/pci/msix.c |  4 +++-
 hw/xen/xen_pt_msi.c   | 52 ++---
 include/hw/i386/apic-msidef.h |  3 ++-
 include/hw/xen/xen.h  |  2 +-
 include/hw/xen/xen_common.h   | 25 
 xen-hvm-stub.c|  2 +-
 xen-hvm.c |  8 ++-
 9 files changed, 134 insertions(+), 21 deletions(-)

-- 
1.8.3.1




Re: [Qemu-devel] [Xen-devel] [RFC PATCH V2 1/2] xen-pt: bind/unbind interrupt remapping format MSI

2017-05-23 Thread Lan Tianyu
On 2017年05月24日 01:06, Anthony PERARD wrote:
> On Tue, May 23, 2017 at 08:16:25PM +0800, Lan Tianyu wrote:
>> On 2017年05月19日 20:04, Jan Beulich wrote:
>>>>>> On 19.05.17 at 13:16, <anthony.per...@citrix.com> wrote:
>>>> On Thu, May 18, 2017 at 01:32:59AM -0400, Lan Tianyu wrote:
>>>>> --- a/include/hw/i386/apic-msidef.h
>>>>> +++ b/include/hw/i386/apic-msidef.h
>>>>> @@ -26,6 +26,7 @@
>>>>>  
>>>>>  #define MSI_ADDR_DEST_ID_SHIFT  12
>>>>>  #define MSI_ADDR_DEST_IDX_SHIFT 4
>>>>> -#define  MSI_ADDR_DEST_ID_MASK  0x000
>>>>> +#define  MSI_ADDR_DEST_ID_MASK  0x000fff00
>>>> The value of MSI_ADDR_DEST_ID_MASK is changed here. I think the patch
>>>> should be:
>>>> +#define  MSI_ADDR_DEST_ID_MASK  0x0000
>>> Judging from other sources, rather the other way around - the
>>> mask needs to have further bits removed (should be 0x000ff000
>>> afaict). Xen sources confirm this, and while Linux has the value
>>> you suggest, that contradicts
>> Agree. Defining the mask as "0x000ff000" makes more sense.
>> Just check Qemu source code. Only apic_send_msi() and msi_dest_id() use
>> the mask
>> to get dest apic id. They mask MSI address field with 
>> MSI_ADDR_DEST_ID_MASK and
>> then right-shift 12bit. The low 12bit won't be used.
>>
>> Anthony, does this make sense?
> Yes, it does.
> The change to MSI_ADDR_DEST_ID_MASK should probably go in its own patch.
>
OK. Will update.


-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [Xen-devel] [RFC PATCH V2 1/2] xen-pt: bind/unbind interrupt remapping format MSI

2017-05-23 Thread Lan Tianyu
On 2017年05月19日 20:04, Jan Beulich wrote:
>>>> On 19.05.17 at 13:16, <anthony.per...@citrix.com> wrote:
>> On Thu, May 18, 2017 at 01:32:59AM -0400, Lan Tianyu wrote:
>>> --- a/include/hw/i386/apic-msidef.h
>>> +++ b/include/hw/i386/apic-msidef.h
>>> @@ -26,6 +26,7 @@
>>>  
>>>  #define MSI_ADDR_DEST_ID_SHIFT  12
>>>  #define MSI_ADDR_DEST_IDX_SHIFT 4
>>> -#define  MSI_ADDR_DEST_ID_MASK  0x000
>>> +#define  MSI_ADDR_DEST_ID_MASK  0x000fff00
>> The value of MSI_ADDR_DEST_ID_MASK is changed here. I think the patch
>> should be:
>> +#define  MSI_ADDR_DEST_ID_MASK  0x0000
> Judging from other sources, rather the other way around - the
> mask needs to have further bits removed (should be 0x000ff000
> afaict). Xen sources confirm this, and while Linux has the value
> you suggest, that contradicts
Agreed. Defining the mask as "0x000ff000" makes more sense.
I just checked the QEMU source code: only apic_send_msi() and msi_dest_id()
use the mask to get the destination APIC ID. They mask the MSI address field
with MSI_ADDR_DEST_ID_MASK and then right-shift by 12 bits, so the low 12
bits won't be used.

Anthony, does this make sense?
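
In code form, the point being agreed on here: with MSI_ADDR_DEST_ID_SHIFT of
12, only bits 19:12 belong to the destination ID, so the mask wants to be
0x000ff000; the RFC's 0x000fff00 would also pick up bits 11:8, which are not
part of the destination ID per SDM 10.11.1. A small standalone check:

    #include <stdint.h>
    #include <stdio.h>

    #define MSI_ADDR_DEST_ID_SHIFT  12
    #define MSI_ADDR_DEST_ID_MASK   0x000ff000u   /* bits 19:12 only */

    /* Same extraction apic_send_msi()/msi_dest_id() perform: mask, shift. */
    static uint8_t msi_dest_id(uint32_t addr_lo)
    {
        return (addr_lo & MSI_ADDR_DEST_ID_MASK) >> MSI_ADDR_DEST_ID_SHIFT;
    }

    int main(void)
    {
        uint32_t addr = 0xfee1f010;   /* dest id 0x1f, bit 4 (IF) also set */

        /* With the 0x000ff000 mask, the low 12 bits (including the VT-d
         * interrupt-format bit) never leak into the destination ID. */
        printf("dest id = %#x\n", (unsigned)msi_dest_id(addr));  /* 0x1f */
        return 0;
    }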

-- 
Best regards
Tianyu Lan




Re: [Qemu-devel] [RFC PATCH V2 2/2] msi: Handle remappable format interrupt request

2017-05-21 Thread Lan Tianyu
Hi Anthony:
Thanks for your review.

On 2017年05月19日 19:57, Anthony PERARD wrote:
> On Thu, May 18, 2017 at 01:33:00AM -0400, Lan Tianyu wrote:
>> From: Chao Gao <chao@intel.com>
>>
>> According to VT-d spec Interrupt Remapping and Interrupt Posting ->
>> Interrupt Remapping -> Interrupt Request Formats On Intel 64
>> Platforms, fields of MSI data register have changed. This patch
>> avoids wrongly regarding a remappable format interrupt request as
>> an interrupt binded with an event channel.
>>
>> Signed-off-by: Chao Gao <chao@intel.com>
>> Signed-off-by: Lan Tianyu <tianyu@intel.com>
>> ---
>>  hw/pci/msi.c | 5 +++--
>>  hw/pci/msix.c| 4 +++-
>>  hw/xen/xen_pt_msi.c  | 2 +-
>>  include/hw/xen/xen.h | 2 +-
>>  xen-hvm-stub.c   | 2 +-
>>  xen-hvm.c| 7 ++-
>>  6 files changed, 15 insertions(+), 7 deletions(-)
>>
>> diff --git a/hw/pci/msi.c b/hw/pci/msi.c
>> index a87b227..199cb47 100644
>> --- a/hw/pci/msi.c
>> +++ b/hw/pci/msi.c
>> @@ -289,7 +289,7 @@ void msi_reset(PCIDevice *dev)
>>  static bool msi_is_masked(const PCIDevice *dev, unsigned int vector)
>>  {
>>  uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
>> -uint32_t mask, data;
>> +uint32_t mask, data, addr_lo;
>>  bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
>>  assert(vector < PCI_MSI_VECTORS_MAX);
>>  
>> @@ -298,7 +298,8 @@ static bool msi_is_masked(const PCIDevice *dev, unsigned 
>> int vector)
>>  }
>>  
>>  data = pci_get_word(dev->config + msi_data_off(dev, msi64bit));
>> -if (xen_is_pirq_msi(data)) {
>> +addr_lo = pci_get_long(dev->config + msi_address_lo_off(dev));
>> +if (xen_is_pirq_msi(data, addr_lo)) {
>>  return false;
>>  }
>>  
>> diff --git a/hw/pci/msix.c b/hw/pci/msix.c
>> index bb54e8b..efe2982 100644
>> --- a/hw/pci/msix.c
>> +++ b/hw/pci/msix.c
>> @@ -82,9 +82,11 @@ static bool msix_vector_masked(PCIDevice *dev, unsigned 
>> int vector, bool fmask)
>>  {
>>  unsigned offset = vector * PCI_MSIX_ENTRY_SIZE;
>>  uint8_t *data = &dev->msix_table[offset + PCI_MSIX_ENTRY_DATA];
>> +uint8_t *addr_lo = &dev->msix_table[offset + PCI_MSIX_ENTRY_LOWER_ADDR];
>>  /* MSIs on Xen can be remapped into pirqs. In those cases, masking
>>   * and unmasking go through the PV evtchn path. */
>> -if (xen_enabled() && xen_is_pirq_msi(pci_get_long(data))) {
>> +if (xen_enabled() && xen_is_pirq_msi(pci_get_long(data),
>> + pci_get_long(addr_lo))) {
>>  return false;
>>  }
>>  return fmask || dev->msix_table[offset + PCI_MSIX_ENTRY_VECTOR_CTRL] &
>> diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c
>> index 5fab95e..45a9e9f 100644
>> --- a/hw/xen/xen_pt_msi.c
>> +++ b/hw/xen/xen_pt_msi.c
>> @@ -114,7 +114,7 @@ static int msi_msix_setup(XenPCIPassthroughState *s,
>>  
>>  assert((!is_msix && msix_entry == 0) || is_msix);
>>  
>> -if (xen_is_pirq_msi(data)) {
>> +if (xen_is_pirq_msi(data, addr)) {
>>  *ppirq = msi_ext_dest_id(addr >> 32) | msi_dest_id(addr);
>>  if (!*ppirq) {
>>  /* this probably identifies an misconfiguration of the guest,
>> diff --git a/include/hw/xen/xen.h b/include/hw/xen/xen.h
>> index 09c2ce5..af759bc 100644
>> --- a/include/hw/xen/xen.h
>> +++ b/include/hw/xen/xen.h
>> @@ -33,7 +33,7 @@ int xen_pci_slot_get_pirq(PCIDevice *pci_dev, int irq_num);
>>  void xen_piix3_set_irq(void *opaque, int irq_num, int level);
>>  void xen_piix_pci_write_config_client(uint32_t address, uint32_t val, int 
>> len);
>>  void xen_hvm_inject_msi(uint64_t addr, uint32_t data);
>> -int xen_is_pirq_msi(uint32_t msi_data);
>> +int xen_is_pirq_msi(uint32_t msi_data, uint32_t msi_addr_lo);
> 
> Maybe inverting the arguments would be better, so the arguments would be
> the address first, then the data, like I think it is often the case.
> What do you think?


That makes sense. Will update.

> 
>>  
>>  qemu_irq *xen_interrupt_controller_init(void);
>>  
>> diff --git a/xen-hvm-stub.c b/xen-hvm-stub.c
>> index c500325..dae421c 100644
>> --- a/xen-hvm-stub.c
>> +++ b/xen-hvm-stub.c
>> @@ -31,7 +31,7 @@ void xen_hvm_inject_msi(uint64_t addr, uint32_t data)
>>  {
>>  }
>>  
>> -int xen_is_pirq_msi(uint32_t msi_data)
>> +int xen

[Qemu-devel] [RFC PATCH V2 1/2] xen-pt: bind/unbind interrupt remapping format MSI

2017-05-18 Thread Lan Tianyu
From: Chao Gao <chao@intel.com>

If a vIOMMU is exposed to the guest, the guest will configure the MSI in
remapping format. The original code isn't suitable for the new format. A new
pair of bind/unbind interfaces is added for this purpose. This patch
recognizes this case and uses the new interfaces to bind/unbind the MSI.

Signed-off-by: Chao Gao <chao@intel.com>
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/xen/xen_pt_msi.c   | 50 ---
 include/hw/i386/apic-msidef.h |  3 ++-
 2 files changed, 39 insertions(+), 14 deletions(-)

diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c
index 62add06..5fab95e 100644
--- a/hw/xen/xen_pt_msi.c
+++ b/hw/xen/xen_pt_msi.c
@@ -163,16 +163,24 @@ static int msi_msix_update(XenPCIPassthroughState *s,
 int rc = 0;
 uint64_t table_addr = 0;
 
-XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x gflags %#x"
-   " (entry: %#x)\n",
-   is_msix ? "-X" : "", pirq, gvec, gflags, msix_entry);
-
 if (is_msix) {
 table_addr = s->msix->mmio_base_addr;
 }
 
-rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec,
-  pirq, gflags, table_addr);
+if (addr & MSI_ADDR_IF_MASK) {
+XEN_PT_LOG(d, "Updating MSI%s with addr %#" PRIx64 "data %#x\n",
+   is_msix ? "-X": "", addr, data);
+rc = xc_domain_update_msi_irq_remapping(xen_xc, xen_domid, pirq,
+ d->devfn, data, addr, table_addr);
+}
+else {
+XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x gflags %#x"
+   " (entry: %#x)\n",
+   is_msix ? "-X" : "", pirq, gvec, gflags, msix_entry);
+
+rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec,
+  pirq, gflags, table_addr);
+}
 
 if (rc) {
 XEN_PT_ERR(d, "Updating of MSI%s failed. (err: %d)\n",
@@ -204,13 +212,29 @@ static int msi_msix_disable(XenPCIPassthroughState *s,
 }
 
 if (is_binded) {
-XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n",
-   is_msix ? "-X" : "", pirq, gvec);
-rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec, pirq, gflags);
-if (rc) {
-XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, 
gvec: %#x)\n",
-   is_msix ? "-X" : "", errno, pirq, gvec);
-return rc;
+if ( addr & MSI_ADDR_IF_MASK ) {
+XEN_PT_LOG(d, "Unbinding of MSI%s . ( pirq: %d, data: %x, "
+   "addr: %#" PRIx64 ")\n",
+   is_msix ? "-X" : "", pirq, data, addr);
+rc = xc_domain_unbind_msi_irq_remapping(xen_xc, xen_domid, pirq,
+d->devfn, data, addr);
+if (rc) {
+XEN_PT_ERR(d, "Unbinding of MSI%s . (error: %d, pirq: %d, "
+   "data: %x, addr: %#" PRIx64 ")\n",
+   is_msix ? "-X" : "", rc, pirq, data, addr);
+return rc;
+}
+
+} else {
+XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n",
+   is_msix ? "-X" : "", pirq, gvec);
+rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec, pirq, 
gflags);
+if (rc) {
+XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, "
+   "gvec: %#x)\n",
+   is_msix ? "-X" : "", errno, pirq, gvec);
+return rc;
+}
 }
 }
 
diff --git a/include/hw/i386/apic-msidef.h b/include/hw/i386/apic-msidef.h
index 8b4d4cc..2c450f9 100644
--- a/include/hw/i386/apic-msidef.h
+++ b/include/hw/i386/apic-msidef.h
@@ -26,6 +26,7 @@
 
 #define MSI_ADDR_DEST_ID_SHIFT  12
 #define MSI_ADDR_DEST_IDX_SHIFT 4
-#define  MSI_ADDR_DEST_ID_MASK  0x000
+#define  MSI_ADDR_DEST_ID_MASK  0x000fff00
+#define  MSI_ADDR_IF_MASK   0x0010
 
 #endif /* HW_APIC_MSIDEF_H */
-- 
1.8.3.1
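
Stripped of logging and error handling, the change is a single dispatch on
the interrupt-format bit; condensed from the diff above, and note that
xc_domain_update_msi_irq_remapping() and its unbind counterpart are the new
Xen-side interfaces this series depends on, not existing upstream libxc
calls:

    /* Condensed excerpt of msi_msix_update(): the IF bit in the MSI address
     * decides which Xen binding interface to use. */
    if (addr & MSI_ADDR_IF_MASK) {
        /* Remappable format: pass the raw addr/data so Xen's vIOMMU can
         * look up the interrupt-remapping table entry itself. */
        rc = xc_domain_update_msi_irq_remapping(xen_xc, xen_domid, pirq,
                                                d->devfn, data, addr,
                                                table_addr);
    } else {
        /* Compatibility format: keep the existing gvec/gflags binding. */
        rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec,
                                      pirq, gflags, table_addr);
    }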




[Qemu-devel] [RFC PATCH V2 0/2] Qemu: Add Xen vIOMMU interrupt remapping function.

2017-05-18 Thread Lan Tianyu
Change since V1:
   1) Move create/destroy vIOMMU and query vIOMMU capabilities to the tool stack.
   2) Fix some code style issues.

This patchset deals with MSI interrupt remapping requests when the guest
updates MSI registers.

Repo:
https://github.com/lantianyu/qemu/tree/xen_viommu_rfc_v2

Chao Gao (2):
  xen-pt: bind/unbind interrupt remapping format MSI
  msi: Handle remappable format interrupt request

 hw/pci/msi.c  |  5 +++--
 hw/pci/msix.c |  4 +++-
 hw/xen/xen_pt_msi.c   | 52 +++
 include/hw/i386/apic-msidef.h |  3 ++-
 include/hw/xen/xen.h  |  2 +-
 xen-hvm-stub.c|  2 +-
 xen-hvm.c |  7 +-
 7 files changed, 54 insertions(+), 21 deletions(-)

-- 
1.8.3.1




[Qemu-devel] [RFC PATCH V2 2/2] msi: Handle remappable format interrupt request

2017-05-18 Thread Lan Tianyu
From: Chao Gao <chao@intel.com>

According to the VT-d spec (Interrupt Remapping and Interrupt Posting ->
Interrupt Remapping -> Interrupt Request Formats on Intel 64 Platforms),
the fields of the MSI data register have changed. This patch avoids wrongly
regarding a remappable-format interrupt request as an interrupt bound to an
event channel.

Signed-off-by: Chao Gao <chao@intel.com>
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/pci/msi.c | 5 +++--
 hw/pci/msix.c| 4 +++-
 hw/xen/xen_pt_msi.c  | 2 +-
 include/hw/xen/xen.h | 2 +-
 xen-hvm-stub.c   | 2 +-
 xen-hvm.c| 7 ++-
 6 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/hw/pci/msi.c b/hw/pci/msi.c
index a87b227..199cb47 100644
--- a/hw/pci/msi.c
+++ b/hw/pci/msi.c
@@ -289,7 +289,7 @@ void msi_reset(PCIDevice *dev)
 static bool msi_is_masked(const PCIDevice *dev, unsigned int vector)
 {
 uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
-uint32_t mask, data;
+uint32_t mask, data, addr_lo;
 bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
 assert(vector < PCI_MSI_VECTORS_MAX);
 
@@ -298,7 +298,8 @@ static bool msi_is_masked(const PCIDevice *dev, unsigned 
int vector)
 }
 
 data = pci_get_word(dev->config + msi_data_off(dev, msi64bit));
-if (xen_is_pirq_msi(data)) {
+addr_lo = pci_get_long(dev->config + msi_address_lo_off(dev));
+if (xen_is_pirq_msi(data, addr_lo)) {
 return false;
 }
 
diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index bb54e8b..efe2982 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -82,9 +82,11 @@ static bool msix_vector_masked(PCIDevice *dev, unsigned int 
vector, bool fmask)
 {
 unsigned offset = vector * PCI_MSIX_ENTRY_SIZE;
 uint8_t *data = &dev->msix_table[offset + PCI_MSIX_ENTRY_DATA];
+uint8_t *addr_lo = &dev->msix_table[offset + PCI_MSIX_ENTRY_LOWER_ADDR];
 /* MSIs on Xen can be remapped into pirqs. In those cases, masking
  * and unmasking go through the PV evtchn path. */
-if (xen_enabled() && xen_is_pirq_msi(pci_get_long(data))) {
+if (xen_enabled() && xen_is_pirq_msi(pci_get_long(data),
+ pci_get_long(addr_lo))) {
 return false;
 }
 return fmask || dev->msix_table[offset + PCI_MSIX_ENTRY_VECTOR_CTRL] &
diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c
index 5fab95e..45a9e9f 100644
--- a/hw/xen/xen_pt_msi.c
+++ b/hw/xen/xen_pt_msi.c
@@ -114,7 +114,7 @@ static int msi_msix_setup(XenPCIPassthroughState *s,
 
 assert((!is_msix && msix_entry == 0) || is_msix);
 
-if (xen_is_pirq_msi(data)) {
+if (xen_is_pirq_msi(data, addr)) {
 *ppirq = msi_ext_dest_id(addr >> 32) | msi_dest_id(addr);
 if (!*ppirq) {
 /* this probably identifies an misconfiguration of the guest,
diff --git a/include/hw/xen/xen.h b/include/hw/xen/xen.h
index 09c2ce5..af759bc 100644
--- a/include/hw/xen/xen.h
+++ b/include/hw/xen/xen.h
@@ -33,7 +33,7 @@ int xen_pci_slot_get_pirq(PCIDevice *pci_dev, int irq_num);
 void xen_piix3_set_irq(void *opaque, int irq_num, int level);
 void xen_piix_pci_write_config_client(uint32_t address, uint32_t val, int len);
 void xen_hvm_inject_msi(uint64_t addr, uint32_t data);
-int xen_is_pirq_msi(uint32_t msi_data);
+int xen_is_pirq_msi(uint32_t msi_data, uint32_t msi_addr_lo);
 
 qemu_irq *xen_interrupt_controller_init(void);
 
diff --git a/xen-hvm-stub.c b/xen-hvm-stub.c
index c500325..dae421c 100644
--- a/xen-hvm-stub.c
+++ b/xen-hvm-stub.c
@@ -31,7 +31,7 @@ void xen_hvm_inject_msi(uint64_t addr, uint32_t data)
 {
 }
 
-int xen_is_pirq_msi(uint32_t msi_data)
+int xen_is_pirq_msi(uint32_t msi_data, uint32_t msi_addr_lo)
 {
 return 0;
 }
diff --git a/xen-hvm.c b/xen-hvm.c
index 5043beb..db29121 100644
--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -146,8 +146,13 @@ void xen_piix_pci_write_config_client(uint32_t address, 
uint32_t val, int len)
 }
 }
 
-int xen_is_pirq_msi(uint32_t msi_data)
+int xen_is_pirq_msi(uint32_t msi_data, uint32_t msi_addr_lo)
 {
+/* If the MSI address is configured in remappable format, the MSI will
+ * not be remapped into a pirq.
+ */
+if (msi_addr_lo & MSI_ADDR_IF_MASK)
+return 0;
 /* If vector is 0, the msi is remapped into a pirq, passed as
  * dest_id.
  */
-- 
1.8.3.1




Re: [Qemu-devel] [RFC PATCH 14/20] intel_iommu: add FOR_EACH_ASSIGN_DEVICE macro

2017-04-28 Thread Lan Tianyu
On 2017年04月26日 18:06, Liu, Yi L wrote:
> Add FOR_EACH_ASSIGN_DEVICE. It would be used to loop all assigned
> devices when processing guest pasid table linking and iommu cache
> invalidate propagation.
> 
> Signed-off-by: Liu, Yi L 
> ---
>  hw/i386/intel_iommu.c  | 32 
>  hw/i386/intel_iommu_internal.h | 11 +++
>  2 files changed, 43 insertions(+)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 0c412d2..f291995 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -55,6 +55,38 @@ static int vtd_dbgflags = VTD_DBGBIT(GENERAL) | 
> VTD_DBGBIT(CSR);
>  #define VTD_DPRINTF(what, fmt, ...) do {} while (0)
>  #endif
>  
> +#define FOR_EACH_ASSIGN_DEVICE(__notify_info_type, \
> +   __opaque_type, \
> +   __hook_info, \
> +   __hook_fn) \
> +do { \
> +IntelIOMMUNotifierNode *node; \
> +VTDNotifierIterator iterator; \
> +int ret = 0; \
> +__notify_info_type *notify_info; \
> +__opaque_type *opaq; \
> +int argsz; \
> +argsz = sizeof(*notify_info) + sizeof(*opaq); \
> +notify_info = g_malloc0(argsz); \
> +QLIST_FOREACH(node, &(s->notifiers_list), next) { \
> +VTDAddressSpace *vtd_as = node->vtd_as; \
> +VTDContextEntry ce[2]; \
> +iterator.bus = pci_bus_num(vtd_as->bus); \
> +ret = vtd_dev_to_context_entry(s, iterator.bus, \
> +   vtd_as->devfn, &ce[0]); \
> +if (ret != 0) { \
> +continue; \
> +} \
> +iterator.sid = vtd_make_source_id(iterator.bus, vtd_as->devfn); \
> +iterator.did =  VTD_CONTEXT_ENTRY_DID(ce[0].hi); \
> +iterator.host_sid = node->host_sid; \
> +iterator.vtd_as = vtd_as; \
> +iterator.ce = &ce[0]; \
> +__hook_fn(&iterator, __hook_info, notify_info); \
> +} \
> +g_free(notify_info); \
> +} while (0)
> +
>  static void vtd_define_quad(IntelIOMMUState *s, hwaddr addr, uint64_t val,
>  uint64_t wmask, uint64_t w1cmask)
>  {
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index f2a7d12..5178398 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -439,6 +439,17 @@ typedef struct VTDRootEntry VTDRootEntry;
>  #define VTD_EXT_CONTEXT_TT_NO_DEV_IOTLB   (4ULL << 2)
>  #define VTD_EXT_CONTEXT_TT_DEV_IOTLB  (5ULL << 2)
>  
> +struct VTDNotifierIterator {
> +VTDAddressSpace *vtd_as;
> +VTDContextEntry *ce;
> +uint16_t host_sid;
> +uint16_t sid;
> +uint16_t did;
> +uint8_t  bus;

The "bus" seems to be redundant.
It is already contained in the "sid", right?
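
For reference, a small sketch of the relation assumed here (the helper and
macro below mirror what intel_iommu.c / intel_iommu_internal.h already
provide, so treat the exact names as an assumption):

    /* The source-id is just bus:devfn packed into 16 bits, so the bus
     * number can always be recovered from the sid. */
    static inline uint16_t vtd_make_source_id(uint8_t bus, uint8_t devfn)
    {
        return ((bus & 0xffUL) << 8) | (devfn & 0xffUL);
    }
    #define VTD_SID_TO_BUS(sid)    (((sid) >> 8) & 0xff)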

> +};
> +
> +typedef struct VTDNotifierIterator VTDNotifierIterator;
> +
>  /* Paging Structure common */
>  #define VTD_SL_PT_PAGE_SIZE_MASK(1ULL << 7)
>  /* Bits to decide the offset for each level */
> 


-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [RFC PATCH 09/20] Memory: introduce iommu_ops->record_device

2017-04-28 Thread Lan Tianyu
On 2017年04月26日 18:06, Liu, Yi L wrote:
> With vIOMMU exposed to guest, vIOMMU emulator needs to do translation
> between host and guest. e.g. a device-selective TLB flush, vIOMMU
> emulator needs to replace guest SID with host SID so that to limit
> the invalidation. This patch introduces a new callback
> iommu_ops->record_device() to notify vIOMMU emulator to record necessary
> information about the assigned device.

This patch prepares for translating a guest SBDF to a host SBDF.

Alex:
Could we add a new VFIO API to do such a translation? This would be more
straightforward than storing the host SBDF in the vIOMMU device model.
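
Purely as an illustration of the idea (nothing below exists in the real
VFIO UAPI; the ioctl name, number and struct are hypothetical), such a
translation interface could look roughly like:

    struct vfio_device_sid_translate {
        __u32 argsz;
        __u32 flags;
        __u16 guest_sid;   /* guest bus:devfn as programmed in the vIOMMU */
        __u16 host_sid;    /* filled in by the kernel                     */
    };
    /* hypothetical: #define VFIO_DEVICE_TRANSLATE_SID \
     *                   _IO(VFIO_TYPE, VFIO_BASE + N) */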

> 
> Signed-off-by: Liu, Yi L 
> ---
>  include/exec/memory.h | 11 +++
>  memory.c  | 12 
>  2 files changed, 23 insertions(+)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 7bd13ab..49087ef 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -203,6 +203,8 @@ struct MemoryRegionIOMMUOps {
>  IOMMUNotifierFlag new_flags);
>  /* Set this up to provide customized IOMMU replay function */
>  void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
> +void (*record_device)(MemoryRegion *iommu,
> +  void *device_info);
>  };
>  
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> @@ -708,6 +710,15 @@ void memory_region_notify_iommu(MemoryRegion *mr,
>  void memory_region_notify_one(IOMMUNotifier *notifier,
>IOMMUTLBEntry *entry);
>  
> +/*
> + * memory_region_notify_device_record: notify IOMMU to record assign
> + * device.
> + * @mr: the memory region to notify
> + * @ device_info: device information
> + */
> +void memory_region_notify_device_record(MemoryRegion *mr,
> +void *info);
> +
>  /**
>   * memory_region_register_iommu_notifier: register a notifier for changes to
>   * IOMMU translation entries.
> diff --git a/memory.c b/memory.c
> index 0728e62..45ef069 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1600,6 +1600,18 @@ static void 
> memory_region_update_iommu_notify_flags(MemoryRegion *mr)
>  mr->iommu_notify_flags = flags;
>  }
>  
> +void memory_region_notify_device_record(MemoryRegion *mr,
> +void *info)
> +{
> +assert(memory_region_is_iommu(mr));
> +
> +if (mr->iommu_ops->record_device) {
> +mr->iommu_ops->record_device(mr, info);
> +}
> +
> +return;
> +}
> +
>  void memory_region_register_iommu_notifier(MemoryRegion *mr,
> IOMMUNotifier *n)
>  {
> 




Re: [Qemu-devel] [RFC PATCH 02/20] intel_iommu: exposed extended-context mode to guest

2017-04-28 Thread Lan Tianyu
On 2017年04月27日 18:32, Peter Xu wrote:
> On Wed, Apr 26, 2017 at 06:06:32PM +0800, Liu, Yi L wrote:
>> VT-d implementations reporting PASID or PRS fields as "Set", must also
>> report ecap.ECS as "Set". Extended-Context is required for SVM.
>>
>> When ECS is reported, intel iommu driver would initiate extended root entry
>> and extended context entry, and also PASID table if there is any SVM capable
>> device.
>>
>> Signed-off-by: Liu, Yi L 
>> ---
>>  hw/i386/intel_iommu.c  | 131 
>> +++--
>>  hw/i386/intel_iommu_internal.h |   9 +++
>>  include/hw/i386/intel_iommu.h  |   2 +-
>>  3 files changed, 97 insertions(+), 45 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 400d0d1..bf98fa5 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -497,6 +497,11 @@ static inline bool vtd_root_entry_present(VTDRootEntry 
>> *root)
>>  return root->val & VTD_ROOT_ENTRY_P;
>>  }
>>  
>> +static inline bool vtd_root_entry_upper_present(VTDRootEntry *root)
>> +{
>> +return root->rsvd & VTD_ROOT_ENTRY_P;
>> +}
>> +
>>  static int vtd_get_root_entry(IntelIOMMUState *s, uint8_t index,
>>VTDRootEntry *re)
>>  {
>> @@ -509,6 +514,9 @@ static int vtd_get_root_entry(IntelIOMMUState *s, 
>> uint8_t index,
>>  return -VTD_FR_ROOT_TABLE_INV;
>>  }
>>  re->val = le64_to_cpu(re->val);
>> +if (s->ecs) {
>> +re->rsvd = le64_to_cpu(re->rsvd);
>> +}
> 
> I feel it slightly hacky to play with re->rsvd. How about:
> 
> union VTDRootEntry {
> struct {
> uint64_t val;
> uint64_t rsvd;
> } base;
> struct {
> uint64_t ext_lo;
> uint64_t ext_hi;
> } extended;
> };
> 
> (Or any better way that can get rid of rsvd...)
> 
> Even:
> 
> struct VTDRootEntry {
> union {
> struct {
> uint64_t val;
> uint64_t rsvd;
> } base;
> struct {
> uint64_t ext_lo;
> uint64_t ext_hi;
> } extended;
> } data;
> bool extended;
> };
> 
> Then we read the entry into data, and setup extended bit. A benefit of
> it is that we may avoid passing around IntelIOMMUState everywhere to
> know whether we are using extended context entries.
> 
>>  return 0;
>>  }
>>  
>> @@ -517,19 +525,30 @@ static inline bool 
>> vtd_context_entry_present(VTDContextEntry *context)
>>  return context->lo & VTD_CONTEXT_ENTRY_P;
>>  }
>>  
>> -static int vtd_get_context_entry_from_root(VTDRootEntry *root, uint8_t 
>> index,
>> -   VTDContextEntry *ce)
>> +static int vtd_get_context_entry_from_root(IntelIOMMUState *s,
>> + VTDRootEntry *root, uint8_t index, VTDContextEntry *ce)
>>  {
>> -dma_addr_t addr;
>> +dma_addr_t addr, ce_size;
>>  
>>  /* we have checked that root entry is present */
>> -addr = (root->val & VTD_ROOT_ENTRY_CTP) + index * sizeof(*ce);
>> -if (dma_memory_read(&address_space_memory, addr, ce, sizeof(*ce))) {
>> +ce_size = (s->ecs) ? (2 * sizeof(*ce)) : (sizeof(*ce));
>> +addr = (s->ecs && (index > 0x7f)) ?
>> +   ((root->rsvd & VTD_ROOT_ENTRY_CTP) + (index - 0x80) * ce_size) :
>> +   ((root->val & VTD_ROOT_ENTRY_CTP) + index * ce_size);
>> +
>> +if (dma_memory_read(&address_space_memory, addr, ce, ce_size)) {
>>  trace_vtd_re_invalid(root->rsvd, root->val);
>>  return -VTD_FR_CONTEXT_TABLE_INV;
>>  }
>> -ce->lo = le64_to_cpu(ce->lo);
>> -ce->hi = le64_to_cpu(ce->hi);
>> +
>> +ce[0].lo = le64_to_cpu(ce[0].lo);
>> +ce[0].hi = le64_to_cpu(ce[0].hi);
> 
> Again, I feel this even hackier. :)
> 
> I would slightly prefer to play the same union trick to context
> entries, just like what I proposed to the root entries above...
> 
>> +
>> +if (s->ecs) {
>> +ce[1].lo = le64_to_cpu(ce[1].lo);
>> +ce[1].hi = le64_to_cpu(ce[1].hi);
>> +}
>> +
>>  return 0;
>>  }
>>  
>> @@ -595,9 +614,11 @@ static inline uint32_t 
>> vtd_get_agaw_from_context_entry(VTDContextEntry *ce)
>>  return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
>>  }
>>  
>> -static inline uint32_t vtd_ce_get_type(VTDContextEntry *ce)
>> +static inline uint32_t vtd_ce_get_type(IntelIOMMUState *s,
>> +   VTDContextEntry *ce)
>>  {
>> -return ce->lo & VTD_CONTEXT_ENTRY_TT;
>> +return s->ecs ? (ce->lo & VTD_CONTEXT_ENTRY_TT) :
>> +(ce->lo & VTD_EXT_CONTEXT_ENTRY_TT);
>>  }
>>  
>>  static inline uint64_t vtd_iova_limit(VTDContextEntry *ce)
>> @@ -842,16 +863,20 @@ static int vtd_dev_to_context_entry(IntelIOMMUState 
>> *s, uint8_t bus_num,
>>  return ret_fr;
>>  }
>>  
>> -if (!vtd_root_entry_present(&re)) {
>> +if (!vtd_root_entry_present(&re) ||
>> +(s->ecs && (devfn > 0x7f) && (!vtd_root_entry_upper_present(&re)))) {
>>  /* 

Re: [Qemu-devel] [PATCH v2 7/7] intel_iommu: support passthrough (PT)

2017-04-20 Thread Lan Tianyu
On 2017年04月20日 15:04, Peter Xu wrote:
> On Thu, Apr 20, 2017 at 06:36:16AM +, Liu, Yi L wrote:
> 
> [...]
> 
 In my understanding, container->space->as->root cannot work here no
 matter passthru-mode is enabled or not. The code here is aiming to
 check if vIOMMU exists. After the vfio series, the vtd_dev_as->root is
 not initialized to be a iommu MemoryRegion. Compared with checking if
 it is system_memory(), I think adding a mechanism to get the iommu
>>> MemoryRegion may be a better choice. Just like the current
>>> pci_device_iommu_address_space().
>>>
>>> Isn't pci_device_iommu_address_space() used to get that IOMMU memory region?
> 
> Again I should say s/memory region/address space/...
> 
>>
>> It actually returns the AddressSpace, and the AddressSpace includes a memory 
>> region.
>> It is as->root. But after adding the vfio series, through the IOMMU memory 
>> region
>> is got, but it has no iommu_ops. Just as the following code shows. That's 
>> why I said
>> even without passthru-mode, Tianyu's that code snippet is not able to get 
>> the correct
>> check.
>>
>> memory_region_init(_dev_as->root, OBJECT(s),
>>"vtd_root", UINT64_MAX);
>> address_space_init(_dev_as->as, _dev_as->root, name);
> 
> The problem is, I am not sure whether there is always _only_ one IOMMU
> region behind one device. E.g., IIUC ppc can have more than one IOMMU
> memory regions, but translate for different address ranges.
> 

If we really want to check this via memory regions, we could check all
the subregions under the container's address space and register the
vIOMMU-related notifiers if one of them has iommu_ops.
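
A rough sketch of such a walk, assuming QEMU's MemoryRegion internals
(the subregions list and memory_region_is_iommu()); the helper name is
purely illustrative:

    /* Return true if any direct subregion of the container's root
     * memory region is an IOMMU region (i.e. has iommu_ops set). */
    static bool container_root_has_iommu(VFIOContainer *container)
    {
        MemoryRegion *root = container->space->as->root;
        MemoryRegion *sub;

        QTAILQ_FOREACH(sub, &root->subregions, subregions_link) {
            if (memory_region_is_iommu(sub)) {
                return true;
            }
        }
        return false;
    }

This only looks one level deep, so aliases or nested regions would need a
recursive walk.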


-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH v2 7/7] intel_iommu: support passthrough (PT)

2017-04-20 Thread Lan, Tianyu

On 4/20/2017 1:40 PM, Peter Xu wrote:

On Thu, Apr 20, 2017 at 04:55:24AM +, Liu, Yi L wrote:

[...]


In my previous RFC patchset of fault event reporting, I registered
fault notifier when there is a VFIO group attached to VFIO container
and used the address space to check whether vIOMMU is added. The
address space is returned by vtd_host_dma_iommu(). vtd_find_add_as()
initializes device's IOMMU memory region and put it into device's
address space as root memory region.
Does this make sense?

@@ -1103,6 +1132,14 @@ static int vfio_connect_container(VFIOGroup
*group, AddressSpace *as,
 goto listener_release_exit;
 }

+if (memory_region_is_iommu(container->space->as->root)) {


I would suggest we don't play with as->root any more. After vtd vfio series, 
this may
not be true if passthrough mode is enabled (then the root may be switched to an
system memory alias). I don't know the best way to check this, one alternative 
might
be that we check whether
container->space->as == system_memory(), it should be workable, but in


Sorry, I was meaning &address_space_memory.


a slightly hackish way.


In my understanding, container->space->as->root cannot work here no matter 
passthru-mode
is enabled or not. The code here is aiming to check if vIOMMU exists. After the 
vfio series,
the vtd_dev_as->root is not initialized to be a iommu MemoryRegion. Compared 
with checking
if it is system_memory(), I think adding a mechanism to get the iommu 
MemoryRegion may
be a better choice. Just like the current pci_device_iommu_address_space().


Isn't pci_device_iommu_address_space() used to get that IOMMU memory
region? And, one thing to mention is that container->space->as is
actually derived from pci_device_iommu_address_space() (when calling
vfio_get_group()).

I feel like that playing around with an IOMMU memory region is still
not clear enough in many cases. I still feel like some day we would
like an "IOMMU object". Then, we can register non-iotlb notifiers
against that IOMMU object, rather than memory regions...


Our target is to check whether the assigned device is under a vIOMMU.
We may check whether the iommu_fn pointer has been populated in the
device's PCI bus structure; this is what pci_device_iommu_address_space()
does.
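
A minimal sketch of that check (the helper name is hypothetical;
pci_device_iommu_address_space() falls back to address_space_memory when
no iommu_fn has been installed on the bus via pci_setup_iommu()):

    static bool device_behind_viommu(PCIDevice *pdev)
    {
        /* Walks up the bus hierarchy; returns the bus's IOMMU address
         * space if an iommu_fn is set, otherwise &address_space_memory. */
        return pci_device_iommu_address_space(pdev) != &address_space_memory;
    }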




Re: [Qemu-devel] [PATCH v2 7/7] intel_iommu: support passthrough (PT)

2017-04-19 Thread Lan Tianyu
Hi All:
Sorry for the late response.
On 2017年04月18日 17:04, Liu, Yi L wrote:
>> -Original Message-
>> From: Peter Xu [mailto:pet...@redhat.com]
>> Sent: Tuesday, April 18, 2017 3:27 PM
>> To: Liu, Yi L <yi.l@intel.com>
>> Cc: qemu-devel@nongnu.org; Lan, Tianyu <tianyu@intel.com>; Michael S .
>> Tsirkin <m...@redhat.com>; Jason Wang <jasow...@redhat.com>; Marcel
>> Apfelbaum <mar...@redhat.com>; David Gibson <da...@gibson.dropbear.id.au>
>> Subject: Re: [Qemu-devel] [PATCH v2 7/7] intel_iommu: support passthrough 
>> (PT)
>>
>> On Tue, Apr 18, 2017 at 06:02:44AM +, Liu, Yi L wrote:
>>>>> Hi Peter,
>>>>>
>>>>> Skip address space switching is a good idea to support Passthru mode.
>>>>> However, without the address space, the vfio notifier would not be
>>>>> registered, thus vIOMMU emulator has no way to connect to host. It
>>>>> is no harm if there is only map/unmap notifier. But if we have
>>>>> more notifiers other than map/unmap, it may be a problem.
>>>>>
>>>>> I think we need to reconsider it here.
>>>>
>>>> For now I think as switching is good to us in general. Could I know
>>>> more context about this? Would it be okay to work on top of this in the 
>>>> future?
>>>>
>>>
>>> Let me explain. For vSVM enabling, it needs to add new notifier to
>>> bind gPASID table to host. If an assigned device uses SVM in guest, as
>>> designed guest IOMMU driver would set "pt" for it. This is to make
>>> sure the host second-level page table stores GPA->HPA mapping. So that
>>> pIOMMU can do nested translation to achieve gVA->GPA GPA->HPA mapping.
>>
>> That should mean that in the guest it will only enable first level 
>> translation, while in
>> the host it'll be nested (first + second)? Then you should be trapping the 
>> guest
>> extended context entry invalidation, then pass the PASID table pointer 
>> downward to
>> the host IOMMU, am I correct?
> 
> exactly.
> 
>>
>>>
>>> So the situation is if we want to keep GPA->HPA mapping, we should
>>> skip address space switch, but the vfio listener would not know vIOMMU
>>> is added, so no vfio notifier would be registered. But if we do the
>>> switch, the GPA->HPA mapping is unmapped. And need another way to build the
>> GPA->HPA mapping.
>>
>> If my understanding above is correct, current IOMMU notifier may not satisfy 
>> your
>> need. If you see the current notify interface, it's delivering 
>> IOMMUTLBEntry. I
>> suspect it only suites for IOTLB notifications, but not if you want to pass 
>> some
>> pointer (one GPA) downward somewhere.
> 
> Yeah, you got it more than absolutely. One of my patch is to modify the para 
> to be
> void * and let each notifiers to translate separately. I think it should be a 
> reasonable
> change.
> 
> Supposedly, I would send RFC for vSVM soon. We may talk more it at that 
> thread.
> 
>>>
>>> I think we may have two choice here. Pls let me know your ideas.
>>>
>>> 1. skip the switch for "pt" mode, find other way to register vfio
>>> notifiers. not sure if this is workable. Just a quick thought.
>>>
>>> 2. do the switch, and rebuild GPA->HPA mapping if device use "pt"
>>> mode. For this option, I also have two way to build the GPA->HPA mapping.
>>> a) walk all the memory region sections in address_space_memory, and build 
>>> the
>> mapping.
>>> address_space_memory.dispatch->map.sections, sections stores all the mapped
>> sections.
>>>
>>> b) let guest iommu driver mimics a 1:1 mapping for the devices use
>>> "pt" mode. in this way, the GPA->HPA mapping could be setup by walking
>>> the guest SL page table. This is consistent with your vIOVA replay solution.
>>
>> I'll prefer (1). Reason explained above.
>>
>>>
>>> Also I'm curious if Tianyu's fault report framework needs to register new 
>>> notifiers.
>>
>> For Tianyu's case, I feel like we need other kind of notifier as well, but 
>> not the
>> current IOTLB one. And, of course in this case it'll be in an reverted 
>> direction, which
>> is from device to vIOMMU.
>>
>> (I am thinking whether it's a good time now to let any PCI device able  to 
>> fetch its
>> direct owner IOMMU object, then it can register

Re: [Qemu-devel] [RFC PATCH 0/4] Qemu: Add Xen vIOMMU support

2017-03-20 Thread Lan Tianyu
On 2017年03月20日 19:38, Paolo Bonzini wrote:
> Fair enough, though I'd be worried about increasing the attack surface
> of the hypervisor.  For KVM, for example, IOMMU emulation requires using
> the "split irqchip" feature to move the PIC and IOAPIC out of the kernel
> and back to QEMU.

Yes, as Roger mentioned, we also need to support a no-Qemu mode on
Xen, and this is the result of that tradeoff.

> 
> Also, I think this series is missing changes to support IOMMU
> translation in the vIOMMU device model.

Yes, this series only enables the vIOMMU's irq remapping function; we
need to pass virtual devices' DMA requests to the Xen hypervisor for
translation when DMA translation is enabled.

-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [RFC PATCH 1/4] I440: Allow adding sysbus devices with -device on I440

2017-03-20 Thread Lan Tianyu
Hi Eduardo:
Thanks for your review.
On 2017年03月21日 03:49, Eduardo Habkost wrote:
> On Fri, Mar 17, 2017 at 07:29:14PM +0800, Lan Tianyu wrote:
>> From: Chao Gao <chao@intel.com>
>>
>> xen-viommu will be a sysbus device and the device model will
>> be enabled via "-device" parameter.
>>
>> Signed-off-by: Chao Gao <chao@intel.com>
>> Signed-off-by: Lan Tianyu <tianyu@intel.com>
> 
> I'm worried about the bugs we may expose by accepting all the
> other sysbus devices in the command-line in addition to
> xen-viommu.
> 
> I am working on a RFC to replace "has_dynamic_sysbus" with a
> whitelist of sysbus device classes. This way we could enable only
> xen-viommu on i440fx, instead of suddenly enabling all sysbus
> devices just because of xen-viommu.

That sounds reasonable. Thanks for taking this on; we will rebase on your patches.



> 
>> ---
>>  hw/i386/pc_piix.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
>> index a07dc81..3289593 100644
>> --- a/hw/i386/pc_piix.c
>> +++ b/hw/i386/pc_piix.c
>> @@ -436,6 +436,7 @@ static void pc_i440fx_machine_options(MachineClass *m)
>>  m->hot_add_cpu = pc_hot_add_cpu;
>>  m->default_machine_opts = "firmware=bios-256k.bin";
>>  m->default_display = "std";
>> +m->has_dynamic_sysbus = true;
>>  }
>>  
>>  static void pc_i440fx_2_7_machine_options(MachineClass *m)
>> -- 
>> 1.8.3.1
>>
> 


-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [RFC PATCH 0/4] Qemu: Add Xen vIOMMU support

2017-03-19 Thread Lan Tianyu
On 2017年03月17日 22:48, Paolo Bonzini wrote:
> 
> 
> On 17/03/2017 12:29, Lan Tianyu wrote:
>> This patchset is to add Xen vIOMMU device model and handle
>> irq remapping stuffs. Xen vIOMMU emulation is in the Xen hypervisor
>> and the new device module in Qemu works as hypercall wrappers to
>> create and destroy vIOMMU in hypervisor.
>>
>> Xen only supports emulated I440 and so we enable vIOMMU with emulated
>> I440 chipset. This works on Linux and Windows guest.
> 
> Any plans to change this?  Why is Xen not able to use Q35 with Intel
> IOMMU, with only special hooks for interrupt remapping?
> 
> Paolo
> 

Hi Paolo:
Thanks for the review. On the Xen side, we won't reuse the Intel IOMMU
device model in Qemu; instead we create a counterpart in the Xen
hypervisor. The reasons are
 1) Avoid round trips between Qemu and the Xen hypervisor
 2) Ease of integration with the rest of the hypervisor (vIOAPIC,
vMSI and so on).

We have no plan to enable Q35 for Xen now. From our tests there is no
dependency between Q35 emulation and the vIOMMU implementation. As
Stefano mentioned, Anthony worked on Q35 support before. These two
tasks can be done in parallel.

-- 
Best regards
Tianyu Lan



[Qemu-devel] [RFC PATCH 4/4] msi: taking interrupt format into consideration during judging a pirq is binded with a event channel

2017-03-17 Thread Lan Tianyu
From: Chao Gao <chao@intel.com>

As the remappable format interrupt has been introduced, the vector in a
remappable-format MSI can also be 0, the same as for an interrupt bound to
an event channel. So we can't just use whether the vector is 0 or not to
judge whether an MSI is bound to an event channel.

This patch takes the msi interrupt format into consideration.

Signed-off-by: Chao Gao <chao@intel.com>
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/pci/msi.c | 5 +++--
 hw/pci/msix.c| 4 +++-
 hw/xen/xen_pt_msi.c  | 2 +-
 include/hw/xen/xen.h | 2 +-
 xen-hvm-stub.c   | 2 +-
 xen-hvm.c| 7 ++-
 6 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/hw/pci/msi.c b/hw/pci/msi.c
index a87b227..8d1ac9e 100644
--- a/hw/pci/msi.c
+++ b/hw/pci/msi.c
@@ -289,7 +289,7 @@ void msi_reset(PCIDevice *dev)
 static bool msi_is_masked(const PCIDevice *dev, unsigned int vector)
 {
 uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
-uint32_t mask, data;
+uint32_t mask, data, addr_lo;
 bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
 assert(vector < PCI_MSI_VECTORS_MAX);
 
@@ -298,7 +298,8 @@ static bool msi_is_masked(const PCIDevice *dev, unsigned 
int vector)
 }
 
 data = pci_get_word(dev->config + msi_data_off(dev, msi64bit));
-if (xen_is_pirq_msi(data)) {
+addr_lo = pci_get_word(dev->config + msi_address_lo_off(dev));
+if (xen_is_pirq_msi(data, addr_lo)) {
 return false;
 }
 
diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index 0ec1cb1..6b8045a 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -81,9 +81,11 @@ static bool msix_vector_masked(PCIDevice *dev, unsigned int 
vector, bool fmask)
 {
 unsigned offset = vector * PCI_MSIX_ENTRY_SIZE;
 uint8_t *data = &dev->msix_table[offset + PCI_MSIX_ENTRY_DATA];
+uint8_t *addr_lo = &dev->msix_table[offset + PCI_MSIX_ENTRY_LOWER_ADDR];
 /* MSIs on Xen can be remapped into pirqs. In those cases, masking
  * and unmasking go through the PV evtchn path. */
-if (xen_enabled() && xen_is_pirq_msi(pci_get_long(data))) {
+if (xen_enabled() && xen_is_pirq_msi(pci_get_long(data),
+ pci_get_long(addr_lo))) {
 return false;
 }
 return fmask || dev->msix_table[offset + PCI_MSIX_ENTRY_VECTOR_CTRL] &
diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c
index 8b0d7fc..f799fed 100644
--- a/hw/xen/xen_pt_msi.c
+++ b/hw/xen/xen_pt_msi.c
@@ -114,7 +114,7 @@ static int msi_msix_setup(XenPCIPassthroughState *s,
 
 assert((!is_msix && msix_entry == 0) || is_msix);
 
-if (xen_is_pirq_msi(data)) {
+if (xen_is_pirq_msi(data, (uint32_t)(addr & 0xffffffff))) {
 *ppirq = msi_ext_dest_id(addr >> 32) | msi_dest_id(addr);
 if (!*ppirq) {
 /* this probably identifies an misconfiguration of the guest,
diff --git a/include/hw/xen/xen.h b/include/hw/xen/xen.h
index a8f3afb..c15beb5 100644
--- a/include/hw/xen/xen.h
+++ b/include/hw/xen/xen.h
@@ -33,7 +33,7 @@ int xen_pci_slot_get_pirq(PCIDevice *pci_dev, int irq_num);
 void xen_piix3_set_irq(void *opaque, int irq_num, int level);
 void xen_piix_pci_write_config_client(uint32_t address, uint32_t val, int len);
 void xen_hvm_inject_msi(uint64_t addr, uint32_t data);
-int xen_is_pirq_msi(uint32_t msi_data);
+int xen_is_pirq_msi(uint32_t msi_data, uint32_t msi_addr_lo);
 
 qemu_irq *xen_interrupt_controller_init(void);
 
diff --git a/xen-hvm-stub.c b/xen-hvm-stub.c
index c500325..dae421c 100644
--- a/xen-hvm-stub.c
+++ b/xen-hvm-stub.c
@@ -31,7 +31,7 @@ void xen_hvm_inject_msi(uint64_t addr, uint32_t data)
 {
 }
 
-int xen_is_pirq_msi(uint32_t msi_data)
+int xen_is_pirq_msi(uint32_t msi_data, uint32_t msi_addr_lo)
 {
 return 0;
 }
diff --git a/xen-hvm.c b/xen-hvm.c
index 2f348ed..9e78b23 100644
--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -146,8 +146,13 @@ void xen_piix_pci_write_config_client(uint32_t address, 
uint32_t val, int len)
 }
 }
 
-int xen_is_pirq_msi(uint32_t msi_data)
+int xen_is_pirq_msi(uint32_t msi_data, uint32_t msi_addr_lo)
 {
+/* If the MSI address is configured in remappable format, the MSI will
+ * not be remapped into a pirq.
+ */
+if ( msi_addr_lo & 0x10 )
+return 0;
 /* If vector is 0, the msi is remapped into a pirq, passed as
  * dest_id.
  */
-- 
1.8.3.1




[Qemu-devel] [RFC PATCH 2/4] Xen: add a dummy vIOMMU to create/destroy vIOMMU in Xen

2017-03-17 Thread Lan Tianyu
From: Chao Gao <chao@intel.com>

Since adding a dynamic sysbus device is forbidden, choose TYPE_DEVICE
as the parent class.

Signed-off-by: Chao Gao <chao@intel.com>
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/xen/Makefile.objs |   1 +
 hw/xen/xen_viommu.c  | 116 +++
 2 files changed, 117 insertions(+)
 create mode 100644 hw/xen/xen_viommu.c

diff --git a/hw/xen/Makefile.objs b/hw/xen/Makefile.objs
index d367094..e37d808 100644
--- a/hw/xen/Makefile.objs
+++ b/hw/xen/Makefile.objs
@@ -3,3 +3,4 @@ common-obj-$(CONFIG_XEN_BACKEND) += xen_backend.o 
xen_devconfig.o
 
 obj-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen-host-pci-device.o
 obj-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pt.o xen_pt_config_init.o 
xen_pt_graphics.o xen_pt_msi.o
+obj-$(CONFIG_XEN) += xen_viommu.o
diff --git a/hw/xen/xen_viommu.c b/hw/xen/xen_viommu.c
new file mode 100644
index 000..9bf9158
--- /dev/null
+++ b/hw/xen/xen_viommu.c
@@ -0,0 +1,116 @@
+/*
+ * Xen virtual IOMMU (virtual VT-D)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+
+#include "hw/qdev-core.h"
+#include "hw/sysbus.h"
+#include "hw/xen/xen.h"
+#include "hw/xen/xen_backend.h"
+
+#define TYPE_XEN_VIOMMU_DEVICE "xen_viommu"
+#define  XEN_VIOMMU_DEVICE(obj) \
+OBJECT_CHECK(XenVIOMMUState, (obj), TYPE_XEN_VIOMMU_DEVICE)
+
+typedef struct XenVIOMMUState XenVIOMMUState;
+
+struct XenVIOMMUState {
+DeviceState dev;
+uint32_t id;
+uint64_t cap;
+uint64_t base_addr;
+};
+
+static void xen_viommu_realize(DeviceState *dev, Error **errp)
+{
+int rc;
+uint64_t cap;
+char *dom;
+char viommu_path[1024];
+XenVIOMMUState *s = XEN_VIOMMU_DEVICE(dev);
+
+s->id = -1;
+
+/* Read vIOMMU attributes from Xenstore. */
+dom = xs_get_domain_path(xenstore, xen_domid);
+snprintf(viommu_path, sizeof(viommu_path), "%s/viommu", dom);
+rc = xenstore_read_uint64(viommu_path, "base_addr", >base_addr);  
+if (rc) {
+error_report("Can't get base address of vIOMMU");
+exit(1);
+}
+
+rc = xenstore_read_uint64(viommu_path, "cap", >cap);
+if (rc) {
+error_report("Can't get capabilities of vIOMMU");
+exit(1);
+}
+
+rc = xc_viommu_query_cap(xen_xc, xen_domid, &cap);
+if (rc) {
+exit(1);
+}
+
+
+if ((cap & s->cap) != cap) {
+error_report("xen: Unsupported capability %lx", s->cap);
+exit(1);
+}
+
+rc = xc_viommu_create(xen_xc, xen_domid, s->base_addr, s->cap, &s->id);
+if (rc) {
+s->id = -1;
+error_report("xen: failed(%d) to create viommu ", rc);
+exit(1);
+}
+}
+
+static void xen_viommu_instance_finalize(Object *o)
+{
+int rc;
+XenVIOMMUState *s = XEN_VIOMMU_DEVICE(o);
+
+if (s->id != -1) {
+rc = xc_viommu_destroy(xen_xc, xen_domid, s->id); 
+if (rc) {
+error_report("xen: failed(%d) to destroy viommu ", rc);
+}
+}
+}
+
+static void xen_viommu_class_init(ObjectClass *klass, void *data)
+{
+DeviceClass *dc = DEVICE_CLASS(klass);
+dc->hotpluggable = false;
+dc->realize = xen_viommu_realize;
+}
+
+static const TypeInfo xen_viommu_info = {
+.name  = TYPE_XEN_VIOMMU_DEVICE,
+.parent= TYPE_SYS_BUS_DEVICE,
+.instance_size = sizeof(XenVIOMMUState),
+.instance_finalize = xen_viommu_instance_finalize,
+.class_init= xen_viommu_class_init,
+};
+
+static void xen_viommu_register_types(void)
+{
+type_register_static(&xen_viommu_info);
+}
+
+type_init(xen_viommu_register_types);
-- 
1.8.3.1




[Qemu-devel] [RFC PATCH 3/4] xen-pt: bind/unbind interrupt remapping format MSI

2017-03-17 Thread Lan Tianyu
From: Chao Gao <chao@intel.com>

If a vIOMMU is exposed to the guest, the guest will configure the MSI in
remapping format. The original code isn't suitable for the new format, so
a new pair of bind/unbind interfaces is added for this usage. This patch
recognizes this case and uses the new interfaces to bind/unbind the MSI.
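
The xc_domain_update_msi_irq_remapping() / xc_domain_unbind_msi_irq_remapping()
wrappers used below come from the companion Xen/libxc series; their prototypes,
inferred here only from the call sites in this patch (a sketch, not the
authoritative declarations), would look roughly like:

    int xc_domain_update_msi_irq_remapping(xc_interface *xch, uint32_t domid,
                                           uint32_t pirq, uint32_t source_id,
                                           uint32_t data, uint64_t addr,
                                           uint64_t gtable);
    int xc_domain_unbind_msi_irq_remapping(xc_interface *xch, uint32_t domid,
                                           uint32_t pirq, uint32_t source_id,
                                           uint32_t data, uint64_t addr);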

Signed-off-by: Chao Gao <chao@intel.com>
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/xen/xen_pt_msi.c   | 36 
 include/hw/i386/apic-msidef.h |  1 +
 2 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c
index 62add06..8b0d7fc 100644
--- a/hw/xen/xen_pt_msi.c
+++ b/hw/xen/xen_pt_msi.c
@@ -161,6 +161,7 @@ static int msi_msix_update(XenPCIPassthroughState *s,
 uint8_t gvec = msi_vector(data);
 uint32_t gflags = msi_gflags(data, addr);
 int rc = 0;
+bool ir = !!(addr & MSI_ADDR_IM_MASK);
 uint64_t table_addr = 0;
 
 XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x gflags %#x"
@@ -171,8 +172,14 @@ static int msi_msix_update(XenPCIPassthroughState *s,
 table_addr = s->msix->mmio_base_addr;
 }
 
-rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec,
+if (ir) {
+rc = xc_domain_update_msi_irq_remapping(xen_xc, xen_domid, pirq,
+d->devfn, data, addr, table_addr);
+}
+else {
+rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec,
   pirq, gflags, table_addr);
+}
 
 if (rc) {
 XEN_PT_ERR(d, "Updating of MSI%s failed. (err: %d)\n",
@@ -204,13 +211,26 @@ static int msi_msix_disable(XenPCIPassthroughState *s,
 }
 
 if (is_binded) {
-XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n",
-   is_msix ? "-X" : "", pirq, gvec);
-rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec, pirq, gflags);
-if (rc) {
-XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, 
gvec: %#x)\n",
-   is_msix ? "-X" : "", errno, pirq, gvec);
-return rc;
+if ( addr & MSI_ADDR_IM_MASK ) {
+XEN_PT_LOG(d, "Unbinding of MSI%s . ( pirq: %d, data: %x, addr: 
%lx)\n",
+   is_msix ? "-X" : "", pirq, data, addr);
+rc = xc_domain_unbind_msi_irq_remapping(xen_xc, xen_domid, pirq,
+d->devfn, data, addr);
+if (rc) {
+XEN_PT_ERR(d, "Unbinding of MSI%s . (error: %d, pirq: %d, 
data: %x, addr: %lx)\n",
+   is_msix ? "-X" : "", rc, pirq, data, addr);
+return rc;
+}
+
+} else {
+XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n",
+   is_msix ? "-X" : "", pirq, gvec);
+rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec, pirq, 
gflags);
+if (rc) {
+XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, 
gvec: %#x)\n",
+   is_msix ? "-X" : "", errno, pirq, gvec);
+return rc;
+}
 }
 }
 
diff --git a/include/hw/i386/apic-msidef.h b/include/hw/i386/apic-msidef.h
index 8b4d4cc..08b584f 100644
--- a/include/hw/i386/apic-msidef.h
+++ b/include/hw/i386/apic-msidef.h
@@ -27,5 +27,6 @@
 #define MSI_ADDR_DEST_ID_SHIFT  12
 #define MSI_ADDR_DEST_IDX_SHIFT 4
 #define  MSI_ADDR_DEST_ID_MASK  0x000
+#define  MSI_ADDR_IM_MASK   0x0010
 
 #endif /* HW_APIC_MSIDEF_H */
-- 
1.8.3.1




[Qemu-devel] [RFC PATCH 1/4] I440: Allow adding sysbus devices with -device on I440

2017-03-17 Thread Lan Tianyu
From: Chao Gao <chao@intel.com>

xen-viommu will be a sysbus device and the device model will
be enabled via "-device" parameter.

Signed-off-by: Chao Gao <chao@intel.com>
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/i386/pc_piix.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index a07dc81..3289593 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -436,6 +436,7 @@ static void pc_i440fx_machine_options(MachineClass *m)
 m->hot_add_cpu = pc_hot_add_cpu;
 m->default_machine_opts = "firmware=bios-256k.bin";
 m->default_display = "std";
+m->has_dynamic_sysbus = true;
 }
 
 static void pc_i440fx_2_7_machine_options(MachineClass *m)
-- 
1.8.3.1




[Qemu-devel] [RFC PATCH 0/4] Qemu: Add Xen vIOMMU support

2017-03-17 Thread Lan Tianyu
This patchset is to add Xen vIOMMU device model and handle
irq remapping stuffs. Xen vIOMMU emulation is in the Xen hypervisor
and the new device module in Qemu works as hypercall wrappers to
create and destroy vIOMMU in hypervisor.

Xen only supports emulated I440 and so we enable vIOMMU with emulated
I440 chipset. This works on Linux and Windows guest.


Chao Gao (4):
  I440: Allow adding sysbus devices with -device on I440
  Xen: add a dummy vIOMMU to create/destroy vIOMMU in Xen
  xen-pt: bind/unbind interrupt remapping format MSI
  msi: taking interrupt format into consideration during judging a pirq
is binded with a event channel

 hw/i386/pc_piix.c |   1 +
 hw/pci/msi.c  |   5 +-
 hw/pci/msix.c |   4 +-
 hw/xen/Makefile.objs  |   1 +
 hw/xen/xen_pt_msi.c   |  38 ++
 hw/xen/xen_viommu.c   | 116 ++
 include/hw/i386/apic-msidef.h |   1 +
 include/hw/xen/xen.h  |   2 +-
 xen-hvm-stub.c|   2 +-
 xen-hvm.c |   7 ++-
 10 files changed, 162 insertions(+), 15 deletions(-)
 create mode 100644 hw/xen/xen_viommu.c

-- 
1.8.3.1




[Qemu-devel] [Resend RFC PATCH 3/4] Intel iommu: Add Intel IOMMU fault event callback

2017-02-19 Thread Lan Tianyu
This patch deals with fault events reported from the IOMMU driver.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/i386/intel_iommu.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 9b1ba1b..79507d2 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2286,6 +2286,30 @@ static void vtd_iommu_notify_flag_changed(MemoryRegion 
*iommu,
 }
 }
 
+static void vtd_iommu_notify_fault_event(MemoryRegion *iommu,
+ IOMMUFaultInfo *info)
+{
+VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
+IntelIOMMUState *s = vtd_as->iommu_state;
+bool is_fpd_set = false;
+uint8_t bus_num = pci_bus_num(vtd_as->bus);
+uint8_t devfn = vtd_as->devfn;
+VTDContextEntry ce;
+
+/* Replace source id with device's vbdf */
+info->sid = vtd_make_source_id(bus_num, devfn);
+
+if (!vtd_dev_to_context_entry(s, bus_num, devfn, &ce)) {
+is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
+if (is_fpd_set) {
+trace_vtd_fault_disabled();
+} else {
+vtd_report_dmar_fault(s, info->sid, info->addr,
+  info->fault_reason, info->is_write);
+}
+}
+}
+
 static const VMStateDescription vtd_vmstate = {
 .name = "iommu-intel",
 .version_id = 1,
@@ -2816,6 +2840,7 @@ static void vtd_init(IntelIOMMUState *s)
 
 s->iommu_ops.translate = vtd_iommu_translate;
 s->iommu_ops.notify_flag_changed = vtd_iommu_notify_flag_changed;
+s->iommu_ops.notify_fault_event = vtd_iommu_notify_fault_event;
 s->iommu_ops.replay = vtd_iommu_replay;
 s->root = 0;
 s->root_extended = false;
-- 
1.8.3.1




[Qemu-devel] [Resend RFC PATCH 0/4] VT-d: Inject fault event from IOMMU hardware

2017-02-19 Thread Lan Tianyu
Resending the patchset due to a wrong qemu-devel mail address. Sorry for the noise.

This patchset proposes a solution for the vIOMMU to get hardware IOMMU
fault events and fault info. The motivation is to make the vIOMMU inject
the associated fault event when the pIOMMU reports a fault; the vIOMMU is
in charge of transforming the fault info and injecting it into the guest.
The feature is also very important for supporting first-level translation
(translation requests with PASID) in a VM, which requires the vIOMMU to
inject device page requests into the VM.

VFIO can get the notification and read the fault info via new VFIO cmds.
A fault event handler is added to the memory IOMMU ops, and the Intel
IOMMU device model needs to register its fault event callback. VFIO
will call the callback via a memory wrapper function when it gets a
fault notification.

This patchset is prototype code and only passes a build test. The new
IOMMU driver interface is still in the design stage. The purpose of these
patches is to confirm that the interface between Qemu and the VFIO kernel
driver is on the right track.

Comments are very much appreciated.

Lan Tianyu (4):
  VFIO: Set eventfd for IOMMU fault event via new vfio cmd
  Memory: Introduce IOMMU fault event callback
  Intel iommu: Add Intel IOMMU fault event callback
  VFIO: Read IOMMU fault info from kernel space when get fault event

 hw/i386/intel_iommu.c | 25 
 hw/vfio/common.c  | 88 +++
 include/exec/memory.h | 19 ++
 include/hw/vfio/vfio-common.h |  3 ++
 linux-headers/linux/vfio.h| 35 +
 memory.c  |  8 
 6 files changed, 178 insertions(+)

-- 
1.8.3.1




[Qemu-devel] [Resend RFC PATCH 4/4] VFIO: Read IOMMU fault info from kernel space when get fault event

2017-02-19 Thread Lan Tianyu
This patch implements the fault event handler, using the new VFIO cmd to
get the fault info and notify the vIOMMU device model.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/vfio/common.c   | 51 ++
 linux-headers/linux/vfio.h | 22 
 2 files changed, 73 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 628b424..4f76e26 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -297,6 +297,57 @@ static bool 
vfio_listener_skipped_section(MemoryRegionSection *section)
 
 static void vfio_iommu_fault(void *opaque)
 {
+VFIOContainer *container = opaque;
+struct vfio_iommu_type1_get_fault_info *info;
+struct vfio_iommu_fault_info *fault_info;
+MemoryRegion *mr = container->space->as->root;
+int count = 0, i, ret;
+IOMMUFaultInfo tmp;
+
+if (!event_notifier_test_and_clear(&container->fault_notifier)) {
+return;
+}
+
+info = g_malloc0(sizeof(*info));
+if (!info) {
+error_report("vfio: Fail to allocate memory");
+return;
+}
+
+info->argsz = sizeof(*info);
+
+ret = ioctl(container->fd, VFIO_IOMMU_GET_FAULT_INFO, info);
+if (ret && ret != -ENOSPC) {
+error_report("vfio: Can't get fault info");
+goto err_exit;
+}
+
+count = info->count;
+if (count <= 0) {
+goto err_exit;
+}
+
+info = g_realloc(info, sizeof(*info) + count * sizeof(*fault_info));
+info->argsz = sizeof(*info) + count * sizeof(*fault_info);
+fault_info = info->fault_info;
+
+ret = ioctl(container->fd, VFIO_IOMMU_GET_FAULT_INFO, info);
+if (ret) {
+error_report("vfio: Can't get fault info");
+goto err_exit;
+}
+
+for (i = 0; i < info->count; i++) {
+tmp.addr = fault_info[i].addr;
+tmp.sid = fault_info[i].sid;
+tmp.fault_reason = fault_info[i].fault_reason;
+tmp.is_write = fault_info[i].is_write;
+
+memory_region_iommu_fault_notify(mr, &tmp);
+}
+
+err_exit:
+g_free(info);
 }
 
 static int vfio_set_iommu_fault_notifier(struct VFIOContainer *container)
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index ca890ee..8b172f5 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -550,6 +550,28 @@ struct vfio_iommu_type1_set_fault_eventfd {
 
 #define VFIO_IOMMU_SET_FAULT_EVENTFD   _IO(VFIO_TYPE, VFIO_BASE + 17)
 
+/*
+ * VFIO_IOMMU_GET_FAULT_INFO   _IO(VFIO_TYPE, VFIO_BASE + 18)
+ *
+ * Return IOMMU fault info to userspace.
+ */
+
+struct vfio_iommu_fault_info {
+   __u64   addr;
+   __u16   sid;
+   __u8fault_reason;
+   __u8is_write:1;
+};
+
+struct vfio_iommu_type1_get_fault_info {
+   __u32   argsz;
+   __u32   flags;
+   __u32   count;
+   struct vfio_iommu_fault_info fault_info[];
+};
+
+#define VFIO_IOMMU_GET_FAULT_INFO  _IO(VFIO_TYPE, VFIO_BASE + 18)
+
 /*  Additional API for SPAPR TCE (Server POWERPC) IOMMU  */
 
 /*
-- 
1.8.3.1




[Qemu-devel] [Resend RFC PATCH 1/4] VFIO: Set eventfd for IOMMU fault event via new vfio cmd

2017-02-19 Thread Lan Tianyu

This patch assigns an eventfd to the VFIO IOMMU type1 driver
in order to get a notification when the IOMMU driver reports a fault event.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/vfio/common.c  | 37 +
 include/hw/vfio/vfio-common.h |  3 +++
 linux-headers/linux/vfio.h| 13 +
 3 files changed, 53 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6b33b9f..628b424 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -33,6 +33,7 @@
 #include "qemu/error-report.h"
 #include "qemu/range.h"
 #include "sysemu/kvm.h"
+#include "sysemu/sysemu.h"
 #include "trace.h"
 #include "qapi/error.h"
 
@@ -294,6 +295,34 @@ static bool 
vfio_listener_skipped_section(MemoryRegionSection *section)
section->offset_within_address_space & (1ULL << 63);
 }
 
+static void vfio_iommu_fault(void *opaque)
+{
+}
+
+static int vfio_set_iommu_fault_notifier(struct VFIOContainer *container)
+{
+struct vfio_iommu_type1_set_fault_eventfd eventfd;
+int ret;
+
+ret = event_notifier_init(&container->fault_notifier, 0);
+if (ret < 0) {
+error_report("vfio: Failed to init notifier for IOMMU fault event");
+return ret;
+}
+
+eventfd.fd = event_notifier_get_fd(&container->fault_notifier);
+eventfd.argsz = sizeof(eventfd);
+
+ret = ioctl(container->fd, VFIO_IOMMU_SET_FAULT_EVENTFD, &eventfd);
+if (ret < 0) {
+error_report("vfio: Failed to set notifier for IOMMU fault event");
+return ret;
+}
+
+qemu_set_fd_handler(eventfd.fd, vfio_iommu_fault, NULL, container);
+return 0;
+}
+
 /* Called with rcu_read_lock held.  */
 static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
bool *read_only)
@@ -1103,6 +1132,14 @@ static int vfio_connect_container(VFIOGroup *group, 
AddressSpace *as,
 goto listener_release_exit;
 }
 
+if (memory_region_is_iommu(container->space->as->root)) {
+if (vfio_set_iommu_fault_notifier(container)) {
+error_setg_errno(errp, -ret,
+"Fail to set IOMMU fault notifier");
+goto listener_release_exit;
+}
+}
+
 container->initialized = true;
 
 QLIST_INIT(&container->group_list);
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c582de1..1b594c6 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -26,6 +26,7 @@
 #include "exec/memory.h"
 #include "qemu/queue.h"
 #include "qemu/notify.h"
+#include "qemu/event_notifier.h"
 #ifdef CONFIG_LINUX
#include <linux/vfio.h>
 #endif
@@ -81,6 +82,8 @@ typedef struct VFIOContainer {
 unsigned iommu_type;
 int error;
 bool initialized;
+EventNotifier fault_notifier;
+
 /*
  * This assumes the host IOMMU can support only a single
  * contiguous IOVA window.  We may need to generalize that in
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 759b850..ca890ee 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -537,6 +537,19 @@ struct vfio_iommu_type1_dma_unmap {
 #define VFIO_IOMMU_ENABLE  _IO(VFIO_TYPE, VFIO_BASE + 15)
 #define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16)
 
+/*
+ * VFIO_IOMMU_SET_FAULT_EVENT_FD   _IO(VFIO_TYPE, VFIO_BASE + 17)
+ * 
+ * Receive eventfd from userspace to notify fault event from IOMMU.
+ */
+struct vfio_iommu_type1_set_fault_eventfd {
+   __u32   argsz;
+   __u32   flags;
+   __u32   fd;
+};
+
+#define VFIO_IOMMU_SET_FAULT_EVENTFD   _IO(VFIO_TYPE, VFIO_BASE + 17)
+
 /*  Additional API for SPAPR TCE (Server POWERPC) IOMMU  */
 
 /*
-- 
1.8.3.1




Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-07 Thread Lan Tianyu
On 2016年12月08日 10:39, Peter Xu wrote:
> Btw, do you have time to help review my RFC series for "vt-d replay"?
> :) I'd like to receive any further review comments besides the issues
> mentioned above since IMHO they can be seperated from current series,
> I have noted them down in my pending list.
Sure. I have reviewed some of the patches. They look good to me :)

-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay() callback

2016-12-07 Thread Lan Tianyu
On 2016年12月06日 18:36, Peter Xu wrote:
> +/**
> + * vtd_page_walk - walk specific IOVA range, and call the hook
> + *
> + * @ce: context entry to walk upon
> + * @start: IOVA address to start the walk
> + * @end: size of the IOVA range to walk over


* @end: IOVA range end address (start <= addr < end)
This comment from vtd_page_walk_level() may be more accurate here?


-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-07 Thread Lan Tianyu

On 2016年12月07日 14:43, Peter Xu wrote:

On Wed, Dec 07, 2016 at 02:09:16PM +0800, Lan Tianyu wrote:

On 2016年12月06日 18:59, Peter Xu wrote:

On Tue, Dec 06, 2016 at 04:27:39PM +0800, Lan Tianyu wrote:

[...]




User space driver(E.G DPDK) also can enable/disable
IOVA for device dynamically.


Could you provide more detailed (or any pointer) on how to do that? I
did try to find it myself, I see an VFIO_IOMMU_ENABLE ioctl, but looks
like it is for ppc only.


No, I just give an example that user space may do that but no more
research. But since Qemu already can enable device's IOVA, other user
application also should can do that with the same VFIO interface, right?


AFAIU we can't do that at least on x86. We can use vfio interface to
bind group into container, but we should not be able to dynamically
disable IOMMU protection. IIUC That needs to taint the kernel.

The only way I know is that we probe vfio-pci with no-iommu mode, in
that case, we disabled IOMMU, but we can never dynamically enable it
as well.

Please correct me if I am wrong.



Actually, disabling device's IOVA doesn't require to disable kernel
global DMA protect and just clear device's VTD context entry in the
context table. Go though IOMMU and VFIO code, find this will happen when
call VFIO_GROUP_UNSET_CONTAINER ioctl and it will be called when destroy
VM or unplug assigned device in Qemu. Please help to double check.

Call trace:
__vfio_group_unset_container()
vfio_iommu_type1_detach_group()
iommu_detach_group()
dmar_remove_one_dev_info()
__dmar_remove_one_dev_info()
domain_context_clear()


The legacy KVM device assign code also will call iommu_detach_device()
when deassign a device.

From device emulation view, we need to make sure correct register
emulation regardless of guest behavior.


Even if the context entry is cleared and invalidated, IMHO it does not
mean that we should be using GPA address space, nor do we need to put
it into guest physical address space. Instead, it simply means this
device cannot do any IO at that time. If IO comes, IOMMU should do
fault reporting to guest OS, which should be treated as error.


Yes, that looks right, and there will be a fault event reported by the
pIOMMU if the context entry is not present for an untranslated DMA
request. This goes back to the first gap: reporting pIOMMU fault events
to the vIOMMU.


As for disabling DMA translation by clearing the gcmd.TE bit, the
assigned device should keep working after the clearing operation, so
it's still necessary to restore the GPA->HPA mapping, since we can't
assume the guest won't clear the bit after enabling DMA translation.

This may be a low-priority gap, since the Linux IOMMU driver doesn't
disable DMA translation frequently or dynamically, but we should still
consider the situation.



So I think we are emulating the correct guest behavior here - we don't
need to do anything if a device is detached from an existing IOMMU
domain in guest. If we do (e.g., we replay the GPA address space on
that device when it is detached, so the shadow page table for that
device maps the whole guest memory), that is dangerous, because with
that the device can DMA to anywhere it wants to guest memory.


If the guest wants to disable DMA translation, that is expected behavior
and the device model should follow the guest configuration. This is just
like most distributions not enabling VT-d DMA translation by default;
it's the OS's choice.

--
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-06 Thread Lan Tianyu
On 2016年12月06日 18:59, Peter Xu wrote:
> On Tue, Dec 06, 2016 at 04:27:39PM +0800, Lan Tianyu wrote:
> 
> [...]
> 
>>>
>>>> User space driver(E.G DPDK) also can enable/disable
>>>> IOVA for device dynamically.
>>>
>>> Could you provide more detailed (or any pointer) on how to do that? I
>>> did try to find it myself, I see an VFIO_IOMMU_ENABLE ioctl, but looks
>>> like it is for ppc only.
>>
>> No, I just give an example that user space may do that but no more
>> research. But since Qemu already can enable device's IOVA, other user
>> application also should can do that with the same VFIO interface, right?
> 
> AFAIU we can't do that at least on x86. We can use vfio interface to
> bind group into container, but we should not be able to dynamically
> disable IOMMU protection. IIUC That needs to taint the kernel.
> 
> The only way I know is that we probe vfio-pci with no-iommu mode, in
> that case, we disabled IOMMU, but we can never dynamically enable it
> as well.
> 
> Please correct me if I am wrong.


Actually, disabling a device's IOVA doesn't require disabling the
kernel's global DMA protection; it just clears the device's VT-d context
entry in the context table. Going through the IOMMU and VFIO code, I
find this happens when the VFIO_GROUP_UNSET_CONTAINER ioctl is called,
which Qemu does when destroying a VM or unplugging an assigned device.
Please help to double check.

Call trace:
__vfio_group_unset_container()
vfio_iommu_type1_detach_group()
iommu_detach_group()
dmar_remove_one_dev_info()
__dmar_remove_one_dev_info()
domain_context_clear()
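
For reference, the kernel path above is reached from user space simply by
detaching the group from its container; a minimal sketch of what Qemu's
vfio_disconnect_container() effectively does on VM teardown or device
unplug (error handling omitted):

    if (ioctl(group_fd, VFIO_GROUP_UNSET_CONTAINER, &container_fd)) {
        perror("VFIO_GROUP_UNSET_CONTAINER");
    }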


The legacy KVM device assignment code will also call iommu_detach_device()
when deassigning a device.

From the device emulation point of view, we need to ensure correct
register emulation regardless of guest behavior.

> 
> Thanks,
> 
> -- peterx
> 


-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-06 Thread Lan Tianyu
On 2016年12月06日 15:22, Peter Xu wrote:
> On Tue, Dec 06, 2016 at 03:06:20PM +0800, Lan Tianyu wrote:
>> On 2016年12月06日 14:51, Peter Xu wrote:
>>> On Tue, Dec 06, 2016 at 02:30:24PM +0800, Lan Tianyu wrote:
>>>
>>> [...]
>>>
>>>>>> 2. How to restore GPA->HPA mapping when IOVA is disabled by guest.
>>>>>> When guest enables IOVA for device, vIOMMU will invalidate all previous
>>>>>> GPA->HPA mapping and update IOVA->HPA mapping to pIOMMU via iommu
>>>>>> notifier. But if IOVA is disabled, I think we should restore GPA->HPA
>>>>>> mapping for the device otherwise the device won't work again in the VM.
>>>>>
>>>>> If we can have a workable replay mechanism, this problem will be
>>>>> solved IMHO.
>>>>
>>>> Basic idea is to replay related memory regions to restore GPA->HPA
>>>> mapping when guest disables IOVA.
>>>
>>> Btw, could you help elaborate in what case we will trigger such a
>>> condition? Or, can we dynamically enable/disable an IOMMU?
>>
>> First case I think of is nest virtualization and we assign physical
>> device to l2 guest.
> 
> If we assign physical device to l2 guest, then it will belongs to an
> IOMMU domain in either l1 guest and host, right? (assuming we are
> using vfio-pci to do device assignment)

Yes.

> 
>> User space driver(E.G DPDK) also can enable/disable
>> IOVA for device dynamically.
> 
> Could you provide more detailed (or any pointer) on how to do that? I
> did try to find it myself, I see an VFIO_IOMMU_ENABLE ioctl, but looks
> like it is for ppc only.

No, I just gave an example that user space may do that, without doing
more research. But since Qemu already can enable a device's IOVA, other
user-space applications should also be able to do that with the same
VFIO interface, right?
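
For what it's worth, a minimal sketch of how a user-space driver programs
IOVA mappings through the same type1 VFIO interface Qemu uses (the paths,
group number and 1 MiB size are just examples; group viability checks and
error handling omitted):

    /* needs <fcntl.h>, <sys/ioctl.h>, <sys/mman.h>, <linux/vfio.h> */
    int container = open("/dev/vfio/vfio", O_RDWR);
    int group = open("/dev/vfio/26", O_RDWR);
    void *buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uintptr_t)buf,
        .iova  = 0,
        .size  = 1 << 20,
    };

    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
    ioctl(container, VFIO_IOMMU_MAP_DMA, &map);   /* IOVA -> process VA */
    /* later, VFIO_IOMMU_UNMAP_DMA / VFIO_GROUP_UNSET_CONTAINER undo it */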

-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-05 Thread Lan Tianyu
On 2016年12月06日 14:51, Peter Xu wrote:
> On Tue, Dec 06, 2016 at 02:30:24PM +0800, Lan Tianyu wrote:
> 
> [...]
> 
>>>> 2. How to restore GPA->HPA mapping when IOVA is disabled by guest.
>>>> When guest enables IOVA for device, vIOMMU will invalidate all previous
>>>> GPA->HPA mapping and update IOVA->HPA mapping to pIOMMU via iommu
>>>> notifier. But if IOVA is disabled, I think we should restore GPA->HPA
>>>> mapping for the device otherwise the device won't work again in the VM.
>>>
>>> If we can have a workable replay mechanism, this problem will be
>>> solved IMHO.
>>
>> Basic idea is to replay related memory regions to restore GPA->HPA
>> mapping when guest disables IOVA.
> 
> Btw, could you help elaborate in what case we will trigger such a
> condition? Or, can we dynamically enable/disable an IOMMU?

The first case I think of is nested virtualization, where we assign a
physical device to an L2 guest. A user-space driver (e.g. DPDK) can also
enable/disable IOVA for a device dynamically.

-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-05 Thread Lan Tianyu
On 2016年12月02日 14:52, Peter Xu wrote:
> On Thu, Dec 01, 2016 at 02:44:14PM +0800, Lan Tianyu wrote:
> 
> [...]
> 
>> Hi:
>> I think there are still other gaps to enable passthough device with
>> vIOMMU's DMA translation support.
>>
>> 1. Since this patchset is to shadow guest IO page table to
>> pIOMMU(physical IOMMU) vfio_dma_map/umap(), there will be some fault
>> events from pIOMMU if guest os does misconfigurations. We should report
>> these fault events to guest. This means we need to pass the fault event
>> from pIOMMU driver to vIOMMU in Qemu. I suppose a channel in VFIO should
>> be added to connect pIOMMU and vIOMMU.
> 
> Thanks for raising this up - IMHO this is a good question.
> 
>>
>> The task should be divided into three parts
>> 1) pIOMMU driver reports fault events for vIOMMU via new VFIO interface
> 
> Here you mean "how host kernel capture the DMAR fault", right?

Yes, the VFIO kernel driver should know about the fault event when it's triggered.

> 
> IMHO We can have something like notifier/notifiee as well in DMAR
> fault reporting - people (like vfio driver in kernel) can register to
> fault reports related to specific device. When DMAR receives faults
> for those devices, it triggers the notification list, then vfio can be
> notified.

Yes, something like this.
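As a purely illustrative sketch of the notifier/notifiee idea (no such DMAR
fault chain exists in the kernel today; every name below is hypothetical):

#include <linux/notifier.h>
#include <linux/types.h>

struct dmar_fault_event {
    u16 source_id;   /* PCI requester id (bus/devfn) of the faulting device */
    u64 addr;        /* faulting DMA address */
    u8  reason;      /* fault reason code from the fault recording register */
};

/* Atomic chain because DMAR faults are reported from the irq handler. */
static ATOMIC_NOTIFIER_HEAD(dmar_fault_notifier_list);

/* VFIO (or another consumer) registers here and filters by source_id
 * in its callback for the devices it owns. */
int dmar_register_fault_notifier(struct notifier_block *nb)
{
    return atomic_notifier_chain_register(&dmar_fault_notifier_list, nb);
}

/* Called from the DMAR fault interrupt handler for each recorded fault. */
void dmar_report_fault(struct dmar_fault_event *evt)
{
    atomic_notifier_call_chain(&dmar_fault_notifier_list, 0, evt);
}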


> 
>> 2) Add new channel in VFIO subsystem to connect pIOMMU driver and
>> vIOMMU in Qemu
>> 3) vIOMMU in Qemu get fault event from VFIO subsystem in Qemu and inject
>> virtual fault event to guest.
>>
>> Such VFIO channel is also required by device's PRS(Page Request
>> Services) support. This is also a part of SVM(Shared virtual memory)
>> support in VM. Here is SVM design doc link.
>> http://marc.info/?l=kvm=148049586514120=2
>>
>> 2. How to restore GPA->HPA mapping when IOVA is disabled by guest.
>> When guest enables IOVA for device, vIOMMU will invalidate all previous
>> GPA->HPA mapping and update IOVA->HPA mapping to pIOMMU via iommu
>> notifier. But if IOVA is disabled, I think we should restore GPA->HPA
>> mapping for the device otherwise the device won't work again in the VM.
> 
> If we can have a workable replay mechanism, this problem will be
> solved IMHO.

Basic idea is to replay related memory regions to restore GPA->HPA
mapping when guest disables IOVA.


-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-05 Thread Lan, Tianyu
On 12/3/2016 1:30 AM, Alex Williamson wrote:
> On Fri, 2 Dec 2016 14:08:59 +0800
> Peter Xu <pet...@redhat.com> wrote:
>
>> On Thu, Dec 01, 2016 at 04:27:52PM +0800, Lan Tianyu wrote:
>>> On 2016年11月30日 17:23, Peter Xu wrote:
>>>> On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote:
>>>>> * intel_iommu's replay op is not implemented yet (May come in different 
>>>>> patch
>>>>>   set).
>>>>>   The replay function is required for hotplug vfio device and to move 
>>>>> devices
>>>>>   between existing domains.
>>>>
>>>> I am thinking about this replay thing recently and now I start to
>>>> doubt whether the whole vt-d vIOMMU framework suites this...
>>>>
>>>> Generally speaking, current work is throwing away the IOMMU "domain"
>>>> layer here. We maintain the mapping only per device, and we don't care
>>>> too much about which domain it belongs. This seems problematic.
>>>>
>>>> A simplest wrong case for this is (let's assume cache-mode is
>>>> enabled): if we have two assigned devices A and B, both belong to the
>>>> same domain 1. Meanwhile, in domain 1 assume we have one mapping which
>>>> is the first page (iova range 0-0xfff). Then, if guest wants to
>>>> invalidate the page, it'll notify VT-d vIOMMU with an invalidation
>>>> message. If we do this invalidation per-device, we'll need to UNMAP
>>>> the region twice - once for A, once for B (if we have more devices, we
>>>> will unmap more times), and we can never know we have done duplicated
>>>> work since we don't keep domain info, so we don't know they are using
>>>> the same address space. The first unmap will work, and then we'll
>>>> possibly get some errors on the rest of dma unmap failures.
>>>
>>>
>>> Hi Peter:
>>
>> Hi, Tianyu,
>>
>>> According VTD spec 6.2.2.1, "Software must ensure that, if multiple
>>> context-entries (or extended-context-entries) are programmed
>>> with the same Domain-id (DID), such entries must be programmed with same
>>> value for the secondlevel page-table pointer (SLPTPTR) field, and same
>>> value for the PASID Table Pointer (PASIDTPTR) field.".
>>>
>>> So if two assigned device may have different IO page table, they should
>>> be put into different domains.
>>
>>
>> By default, devices will be put into different domains. However it
>> should be legal that we put two assigned devices into the same IOMMU
>> domain (in the guest), right? And we should handle both cases well
>> IMHO.
>>
>> Actually I just wrote a tool to do it based on vfio-pci:
>>
>>   
>> https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c
>>
>> If we run this tool in the guest with parameter like:
>>
>>   ./vfio-bind-groups 00:03.0 00:04.0
>>
>> It'll create one single domain, and put PCI device 00:03.0, 00:04.0
>> into the same IOMMU domain.
>
> On the host though, I'd expect we still have separate IOMMU domains,
> one for each device and we do DMA_{UN}MAP ioctls separately per
> container.  Thanks,

Agree. The guest may use different IO page tables for multiple assigned
devices, and this requires putting the assigned devices into different VT-d
domains on the host. I think we can't put assigned devices into the same
VT-d domain before enabling dynamic containers.









Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-01 Thread Lan Tianyu
On 2016年11月30日 17:23, Peter Xu wrote:
> On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote:
>> * intel_iommu's replay op is not implemented yet (May come in different 
>> patch 
>>   set).
>>   The replay function is required for hotplug vfio device and to move 
>> devices 
>>   between existing domains.
> 
> I am thinking about this replay thing recently and now I start to
> doubt whether the whole vt-d vIOMMU framework suites this...
> 
> Generally speaking, current work is throwing away the IOMMU "domain"
> layer here. We maintain the mapping only per device, and we don't care
> too much about which domain it belongs. This seems problematic.
> 
> A simplest wrong case for this is (let's assume cache-mode is
> enabled): if we have two assigned devices A and B, both belong to the
> same domain 1. Meanwhile, in domain 1 assume we have one mapping which
> is the first page (iova range 0-0xfff). Then, if guest wants to
> invalidate the page, it'll notify VT-d vIOMMU with an invalidation
> message. If we do this invalidation per-device, we'll need to UNMAP
> the region twice - once for A, once for B (if we have more devices, we
> will unmap more times), and we can never know we have done duplicated
> work since we don't keep domain info, so we don't know they are using
> the same address space. The first unmap will work, and then we'll
> possibly get some errors on the rest of dma unmap failures.


Hi Peter:
According to VT-d spec 6.2.2.1, "Software must ensure that, if multiple
context-entries (or extended-context-entries) are programmed
with the same Domain-id (DID), such entries must be programmed with same
value for the secondlevel page-table pointer (SLPTPTR) field, and same
value for the PASID Table Pointer (PASIDTPTR) field.".

So if two assigned devices may have different IO page tables, they should
be put into different domains.


> 
> Looks like we just cannot live without knowing this domain layer.
> Because the address space is binded to the domain. If we want to sync
> the address space (here to setup a correct shadow page table), we need
> to do it per-domain.
> 
> What I can think of as a solution is that we introduce this "domain"
> layer - like a memory region per domain. When invalidation happens,
> it's per-domain, not per-device any more (actually I guess that's what
> current vt-d iommu driver in kernel is doing, we just ignored it - we
> fetch the devices that matches the domain ID). We can/need to maintain
> something different, like sid <-> domain mappings (we can do this as
> long as we are notified when context entries changed), per-domain
> mappings (just like per-device mappings that we are trying to build in
> this series, but what we really need is IMHO per domain one), etc.
> When device switches domain, we switch the IOMMU memory region
> accordingly.
> 
> Does this make any sense? Comments are greatly welcomed (especially
> from AlexW and DavidG).
> 
> Thanks,
> 
> -- peterx
> 


-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-01 Thread Lan Tianyu
On 2016年12月01日 12:21, Tian, Kevin wrote:
>> I am thinking about this replay thing recently and now I start to
>> > doubt whether the whole vt-d vIOMMU framework suites this...
>> > 
>> > Generally speaking, current work is throwing away the IOMMU "domain"
>> > layer here. We maintain the mapping only per device, and we don't care
>> > too much about which domain it belongs. This seems problematic.
>> > 
>> > A simplest wrong case for this is (let's assume cache-mode is
>> > enabled): if we have two assigned devices A and B, both belong to the
>> > same domain 1. Meanwhile, in domain 1 assume we have one mapping which
>> > is the first page (iova range 0-0xfff). Then, if guest wants to
>> > invalidate the page, it'll notify VT-d vIOMMU with an invalidation
>> > message. If we do this invalidation per-device, we'll need to UNMAP
>> > the region twice - once for A, once for B (if we have more devices, we
>> > will unmap more times), and we can never know we have done duplicated
>> > work since we don't keep domain info, so we don't know they are using
>> > the same address space. The first unmap will work, and then we'll
>> > possibly get some errors on the rest of dma unmap failures.
> Tianyu and I discussed there is a bigger problem: today VFIO assumes 
> only one address space per container, which is fine w/o vIOMMU (all devices 
> in 
> same container share same GPA->HPA translation). However it's not the case
> when vIOMMU is enabled, because guest Linux implements per-device 
> IOVA space. If a VFIO container includes multiple devices, it means 
> multiple address spaces required per container...
> 

Hi All:
Some updates about the relationship between assigned devices and containers.

If vIOMMU is disabled, all assigned devices will use the global
address_space_memory as their address space (for details please see
pci_device_iommu_address_space()). VFIO creates containers according to
address space, so all assigned devices will be put into one container.

If vIOMMU is enabled, the Intel vIOMMU will allocate a separate address space
for each assigned device and VFIO will then create a different container
for each assigned device. In other words, it will be one assigned device
per container when vIOMMU is enabled. The original concern won't arise.
This is my understanding; please correct me if something is wrong.
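Roughly, the helper mentioned above decides like this (a paraphrase only;
helper and field names are approximate, not the exact QEMU source):

/* Simplified paraphrase of pci_device_iommu_address_space() - not the
 * exact QEMU code; field/helper names are approximate. */
AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
{
    PCIBus *bus = pci_get_bus(dev);

    /* Walk towards the root looking for a bus with an IOMMU hook
     * registered (the Intel vIOMMU registers one when enabled). */
    while (bus && !bus->iommu_fn && bus->parent_dev) {
        bus = pci_get_bus(bus->parent_dev);
    }

    if (bus && bus->iommu_fn) {
        /* vIOMMU present: a separate DMA address space per device. */
        return bus->iommu_fn(bus, bus->iommu_opaque, dev->devfn);
    }

    /* No vIOMMU: every device DMAs straight into guest RAM. */
    return &address_space_memory;
}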

-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-11-30 Thread Lan Tianyu
On 2016年11月28日 23:51, Aviv B.D wrote:
> From: "Aviv Ben-David" 
>
> * Advertize Cache Mode capability in iommu cap register.
>   This capability is controlled by "cache-mode" property of intel-iommu 
> device.
>   To enable this option call QEMU with "-device intel-iommu,cache-mode=true".
>
> * On page cache invalidation in intel vIOMMU, check if the domain belong to
>   registered notifier, and notify accordingly.
>
> Currently this patch still doesn't enabling VFIO devices support with vIOMMU
> present. Current problems:
> * vfio_iommu_map_notify is not aware about memory range belong to specific
>   VFIOGuestIOMMU.
> * intel_iommu's replay op is not implemented yet (May come in different patch
>   set).
>   The replay function is required for hotplug vfio device and to move devices
>   between existing domains.
>
> Changes from v1 to v2:
> * remove assumption that the cache do not clears
> * fix lockup on high load.
>
> Changes from v2 to v3:
> * remove debug leftovers
> * split to sepearate commits
> * change is_write to flags in vtd_do_iommu_translate, add IOMMU_NO_FAIL
>   to suppress error propagating to guest.
>
> Changes from v3 to v4:
> * Add property to intel_iommu device to control the CM capability,
>   default to False.
> * Use s->iommu_ops.notify_flag_changed to register notifiers.
>
> Changes from v4 to v4 RESEND:
> * Fix codding style pointed by checkpatch.pl script.
>
> Changes from v4 to v5:
> * Reduce the number of changes in patch 2 and make flags real bitfield.
> * Revert deleted debug prints.
> * Fix memory leak in patch 3.
>
> Changes from v5 to v6:
> * fix prototype of iommu_translate function for more IOMMU types.
> * VFIO will be notified only on the difference, without unmap
>   before change to maps.
>
> Changes from v6 to v7:
> * Add replay op to iommu_ops, with current behavior as default implementation
>   (Patch 4).
> * Add stub to disable VFIO with intel_iommu support (Patch 5).
> * Cosmetic changes to other patches.
>
> Aviv Ben-David (5):
>   IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to
> guest
>   IOMMU: change iommu_op->translate's is_write to flags, add support to
> NO_FAIL flag mode
>   IOMMU: enable intel_iommu map and unmap notifiers
>   IOMMU: add specific replay function with default implemenation
>   IOMMU: add specific null implementation of iommu_replay to intel_iommu
>
>  exec.c                         |   3 +-
>  hw/alpha/typhoon.c             |   2 +-
>  hw/i386/amd_iommu.c            |   4 +-
>  hw/i386/intel_iommu.c          | 165 ++---
>  hw/i386/intel_iommu_internal.h |   3 +
>  hw/pci-host/apb.c              |   2 +-
>  hw/ppc/spapr_iommu.c           |   2 +-
>  hw/s390x/s390-pci-bus.c        |   2 +-
>  include/exec/memory.h          |  10 ++-
>  include/hw/i386/intel_iommu.h  |  11 +++
>  memory.c                       |  11 ++-
>  11 files changed, 177 insertions(+), 38 deletions(-)
>

Hi:
I think there are still other gaps to enable a passthrough device with the
vIOMMU's DMA translation support.

1. Since this patchset shadows the guest IO page table to the pIOMMU
(physical IOMMU) via vfio_dma_map/unmap(), there will be some fault events
from the pIOMMU if the guest OS misconfigures things. We should report
these fault events to the guest. This means we need to pass the fault event
from the pIOMMU driver to the vIOMMU in Qemu. I suppose a channel should be
added in VFIO to connect the pIOMMU and vIOMMU.

The task should be divided into three parts
1) pIOMMU driver reports fault events for vIOMMU via new VFIO interface
2) Add new channel in VFIO subsystem to connect pIOMMU driver and
vIOMMU in Qemu
3) vIOMMU in Qemu get fault event from VFIO subsystem in Qemu and inject
virtual fault event to guest.

Such a VFIO channel is also required by the device's PRS (Page Request
Services) support. This is also part of SVM (Shared Virtual Memory)
support in a VM. Here is the SVM design doc link:
http://marc.info/?l=kvm&m=148049586514120&w=2

2. How to restore GPA->HPA mapping when IOVA is disabled by guest.
When guest enables IOVA for device, vIOMMU will invalidate all previous
GPA->HPA mapping and update IOVA->HPA mapping to pIOMMU via iommu
notifier. But if IOVA is disabled, I think we should restore GPA->HPA
mapping for the device otherwise the device won't work again in the VM.
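Conceptually, such a replay walks every guest RAM section and re-installs
the 1:1 GPA mapping into the physical IOMMU. A rough, self-contained sketch
of the idea (the types and the dma_map callback below are hypothetical
simplifications, not the real QEMU/VFIO API):

#include <stdint.h>

/* Hypothetical description of one guest RAM section. */
typedef struct RamSection {
    uint64_t gpa;    /* guest physical start address */
    uint64_t size;   /* length of the section */
    void    *hva;    /* host virtual address backing it */
} RamSection;

/* Re-install the identity GPA->HPA mapping for every RAM section so the
 * device can DMA again once the guest turns vIOMMU translation off. */
static int replay_gpa_mappings(RamSection *sections, int nr_sections,
                               int (*dma_map)(uint64_t iova, uint64_t size,
                                              void *vaddr))
{
    for (int i = 0; i < nr_sections; i++) {
        /* With translation disabled, IOVA == GPA for all of guest RAM. */
        int ret = dma_map(sections[i].gpa, sections[i].size, sections[i].hva);
        if (ret) {
            return ret;
        }
    }
    return 0;
}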


Another option to enable a passthrough device with the Intel vIOMMU:
We may prevent the guest from enabling the DMA translation function via
gcmd, i.e. the vIOMMU never sets the "Translation Enable Status" bit in the
gsts register when there is a passthrough device.

There is the kernel parameter "intel_iommu=on" or the Kconfig option
CONFIG_INTEL_IOMMU_DEFAULT_ON to enable DMA translation for the Intel IOMMU
driver. But they aren't set by default in the RHEL, SLES, Oracle and Ubuntu
distributions.


The iommu driver will trigger a kernel panic when it fails to enable DMA
translation via gcmd, with the log "DMAR hardware is malfunctioning".

-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH for-2.8 00/18] pc: q35: x2APIC support in kvm_apic mode

2016-09-23 Thread Lan Tianyu
On 2016年09月23日 13:26, Peter Xu wrote:
> On Thu, Sep 22, 2016 at 12:34:36PM +0800, Chao Gao wrote:
>> Hi, we had 3 problems left here.
>> 1. IRQremapping can't work with x2apic_cluster mode.
>> 2. apic_id > 255 can't receive devices interrupts.
>> 3. windows crash when present IRQremapping capability to it.
> 

Hi Peter:

Thanks for your information. We just want to make sure these issues
aren't ignored.

> For (3), I don't know whether it's urgent or not - I've put it into my
> todo list (assuming this is for HPC and mostly we are using Linux
> guests?).
> 
> For (1-2), again we may need to wait for Radim's patches.





Re: [Qemu-devel] [PATCH] Qemu/KVM: Remove x2apic feature from CPU model when kernel_irqchip is off

2016-03-01 Thread Lan, Tianyu

On 3/1/2016 10:00 PM, Eduardo Habkost wrote:

On Mon, Feb 29, 2016 at 10:56:04AM +0800, Lan Tianyu wrote:

On 2016年02月27日 03:54, Eduardo Habkost wrote:

On Thu, Feb 25, 2016 at 11:15:12PM +0800, Lan Tianyu wrote:

x2apic feature is in the kvm_default_props and automatically added to all
CPU models when KVM is enabled. But userspace devices don't support x2apic
which can't be enabled without the in-kernel irqchip. It will trigger
warning of "host doesn't support requested feature: CPUID.01H:ECX.x2apic
[bit 21]" when kernel_irqchip is off. This patch is to fix it via removing
x2apic feature when kernel_irqchip is off.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
  target-i386/cpu.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index c78f824..298fb62 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -2125,6 +2125,10 @@ static void x86_cpu_load_def(X86CPU *cpu, 
X86CPUDefinition *def, Error **errp)

  /* Special cases not set in the X86CPUDefinition structs: */
  if (kvm_enabled()) {
+if (!kvm_irqchip_in_kernel()) {
+x86_cpu_change_kvm_default("x2apic", "off");


This should be NULL instead of "off".


I tried "NULL" before. But some CPU models (e.g. SandyBridge,
IvyBridge, Haswell) already have the x2apic feature in their default
features in struct X86CPUDefinition; passing "NULL" only avoids adding the
x2apic feature to the CPU model and will not help remove the x2apic feature
for these CPU models. So I changed "NULL" to "off".


In this case, I suggest we remove x2apic from these CPU models to
avoid confusion, as the presence of the flag in the table makes
no difference at all (this can be done in a separate patch).

If we do that, NULL and "off" would have the same results, but
NULL is clearer, IMO. NULL simply disables the KVM-specific hack
(so we don't touch the property anymore), but "off" adds a new
TCG-specific hack. I will submit that as a follow-up patch.


Yes, that sounds reasonable. Thanks for your review.



Reviewed-by: Eduardo Habkost <ehabk...@redhat.com>




Otherwise, the warning will
be disabled if using "-cpu ...,+x2apic".



kvm_arch_get_supported_cpuid() always returns no x2apic support when
kernel_irqchip is off and so it still triggers warning with "-cpu
...,+x2apic".

#qemu-system-x86_64 -cpu qemu64,+x2apic -machine kernel-irqchip=off
warning: TCG doesn't support requested feature: CPUID.01H:ECX.x2apic
[bit 21]


You are right. The +x2apic flag is applied after
x86_cpu_load_def() runs. My mistake.





Re: [Qemu-devel] [PATCH] Qemu/KVM: Remove x2apic feature from CPU model when kernel_irqchip is off

2016-02-28 Thread Lan Tianyu
On 2016年02月27日 03:54, Eduardo Habkost wrote:
> On Thu, Feb 25, 2016 at 11:15:12PM +0800, Lan Tianyu wrote:
>> x2apic feature is in the kvm_default_props and automatically added to all
>> CPU models when KVM is enabled. But userspace devices don't support x2apic
>> which can't be enabled without the in-kernel irqchip. It will trigger
>> warning of "host doesn't support requested feature: CPUID.01H:ECX.x2apic
>> [bit 21]" when kernel_irqchip is off. This patch is to fix it via removing
>> x2apic feature when kernel_irqchip is off.
>>
>> Signed-off-by: Lan Tianyu <tianyu@intel.com>
>> ---
>>  target-i386/cpu.c | 4 
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
>> index c78f824..298fb62 100644
>> --- a/target-i386/cpu.c
>> +++ b/target-i386/cpu.c
>> @@ -2125,6 +2125,10 @@ static void x86_cpu_load_def(X86CPU *cpu, 
>> X86CPUDefinition *def, Error **errp)
>>  
>>  /* Special cases not set in the X86CPUDefinition structs: */
>>  if (kvm_enabled()) {
>> +if (!kvm_irqchip_in_kernel()) {
>> +x86_cpu_change_kvm_default("x2apic", "off");
> 
> This should be NULL instead of "off".

I tried "NULL" before. But some CPU models (e.g. SandyBridge,
IvyBridge, Haswell) already have the x2apic feature in their default
features in struct X86CPUDefinition; passing "NULL" only avoids adding the
x2apic feature to the CPU model and will not help remove the x2apic feature
for these CPU models. So I changed "NULL" to "off".

> Otherwise, the warning will
> be disabled if using "-cpu ...,+x2apic".
> 

kvm_arch_get_supported_cpuid() always returns no x2apic support when
kernel_irqchip is off and so it still triggers warning with "-cpu
...,+x2apic".

#qemu-system-x86_64 -cpu qemu64,+x2apic -machine kernel-irqchip=off
warning: TCG doesn't support requested feature: CPUID.01H:ECX.x2apic
[bit 21]


>> +}
>> +
> 
> This function runs multiple times (once for each VCPU being
> created). Should be harmless, though. Eventually we could move
> the whole kvm_default_props logic outside cpu.c, to a KVM-x86
> accel class or to common PC initialization code.

Good suggestion and we can try that later.

> 
> 
>>  x86_cpu_apply_props(cpu, kvm_default_props);
>>  }
>>  
>> -- 
>> 1.9.3
>>
> 


-- 
Best regards
Tianyu Lan



[Qemu-devel] [PATCH] Qemu/KVM: Remove x2apic feature from CPU model when kernel_irqchip is off

2016-02-25 Thread Lan Tianyu
x2apic feature is in the kvm_default_props and automatically added to all
CPU models when KVM is enabled. But userspace devices don't support x2apic
which can't be enabled without the in-kernel irqchip. It will trigger
warning of "host doesn't support requested feature: CPUID.01H:ECX.x2apic
[bit 21]" when kernel_irqchip is off. This patch is to fix it via removing
x2apic feature when kernel_irqchip is off.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 target-i386/cpu.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index c78f824..298fb62 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -2125,6 +2125,10 @@ static void x86_cpu_load_def(X86CPU *cpu, 
X86CPUDefinition *def, Error **errp)
 
 /* Special cases not set in the X86CPUDefinition structs: */
 if (kvm_enabled()) {
+if (!kvm_irqchip_in_kernel()) {
+x86_cpu_change_kvm_default("x2apic", "off");
+}
+
 x86_cpu_apply_props(cpu, kvm_default_props);
 }
 
-- 
1.9.3




Re: [Qemu-devel] kvm: "warning: host doesn't support requested feature: CPUID.01H:ECX.x2apic [bit 21]"

2016-02-25 Thread Lan Tianyu
2016-02-25 16:39 GMT+08:00 Jan Kiszka <jan.kis...@web.de>:
> On 2016-02-25 09:33, Lan Tianyu wrote:
>> 2016-02-20 17:00 GMT+08:00 Paolo Bonzini <pbonz...@redhat.com>:
>>>
>>>
>>> - Original Message -
>>>> From: "Jan Kiszka" <jan.kis...@web.de>
>>>> To: "Eduardo Habkost" <ehabk...@redhat.com>, "Paolo Bonzini" 
>>>> <pbonz...@redhat.com>
>>>> Cc: "qemu-devel" <qemu-devel@nongnu.org>, "kvm" <k...@vger.kernel.org>
>>>> Sent: Saturday, February 20, 2016 9:09:32 AM
>>>> Subject: kvm: "warning: host doesn't support requested feature: 
>>>> CPUID.01H:ECX.x2apic [bit 21]"
>>>>
>>>> Hi all,
>>>>
>>>> I suppose 5120901a37 introduced this: qemu with kernel_irqchip=off now
>>>> generates these warnings, one per VCPU, during QEMU startup. Is the plan
>>>> to live with them until we finally have x2APIC emulation in userspace
>>>> (ie. also MSR vmexiting to there), or should we otherwise avoid it?
>>>
>>> I think it's a bug, x2apic should be auto-suppressed with 
>>> kernel_irqchip=off.
>>>
>>
>> The patch is to fix the issue.
>> ->8
>> From 58f2a3a94c8e7bf9f3474bcafb6c59cc4f8bcbd9 Mon Sep 17 00:00:00 2001
>> From: Lan Tianyu <tianyu@intel.com>
>> Date: Sun, 15 Jul 2001 01:40:17 -0400
>> Subject: [PATCH] Qemu/KVM: Remove x2apic feature from CPU model when
>>  kernel_irqchip=off
>>
>> x2apic feature is in the kvm_default_props and automatically added to all
>> CPU models when KVM is enabled regardless of kernel_irqchip=off. This will
>> trigger "warning: host doesn't support requested feature: CPUID.01H:
>> ECX.x2apic [bit 21]" when kernel_irqchip=off. This patch is to remove x2apic
>> feature when kernel_irqchip=off.
>
> We know this, but it's probably worth to mention the underlying reason
> here: userspace devices don't support x2APIC.
>
>>
>> Signed-off-by: Lan Tianyu <tianyu@intel.com>
>> ---
>>  target-i386/cpu.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
>> index 0d447b5..2ec7eb7 100644
>> --- a/target-i386/cpu.c
>> +++ b/target-i386/cpu.c
>> @@ -2105,6 +2105,9 @@ static void x86_cpu_load_def(X86CPU *cpu,
>> X86CPUDefinition *def, Error **errp)
>>
>>  /* Special cases not set in the X86CPUDefinition structs: */
>>  if (kvm_enabled()) {
>> +   if (!kvm_irqchip_in_kernel())
>> +   x86_cpu_change_kvm_default("x2apic", "off");
>> +
>>  x86_cpu_apply_props(cpu, kvm_default_props);
>>  }
>>
>> --
>> 1.9.3
>>
>
> Make sure to comply with the coding style (there is a checkpatch.pl also
> in QEMU). And please post as a new thread with proper subject, otherwise
> people (and tools) will not find your patch as such.
>
> Jan

Thanks for the comments. Will update soon.



-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] kvm: "warning: host doesn't support requested feature: CPUID.01H:ECX.x2apic [bit 21]"

2016-02-25 Thread Lan Tianyu
2016-02-20 17:00 GMT+08:00 Paolo Bonzini <pbonz...@redhat.com>:
>
>
> - Original Message -
>> From: "Jan Kiszka" <jan.kis...@web.de>
>> To: "Eduardo Habkost" <ehabk...@redhat.com>, "Paolo Bonzini" 
>> <pbonz...@redhat.com>
>> Cc: "qemu-devel" <qemu-devel@nongnu.org>, "kvm" <k...@vger.kernel.org>
>> Sent: Saturday, February 20, 2016 9:09:32 AM
>> Subject: kvm: "warning: host doesn't support requested feature: 
>> CPUID.01H:ECX.x2apic [bit 21]"
>>
>> Hi all,
>>
>> I suppose 5120901a37 introduced this: qemu with kernel_irqchip=off now
>> generates these warnings, one per VCPU, during QEMU startup. Is the plan
>> to live with them until we finally have x2APIC emulation in userspace
>> (ie. also MSR vmexiting to there), or should we otherwise avoid it?
>
> I think it's a bug, x2apic should be auto-suppressed with kernel_irqchip=off.
>

The patch is to fix the issue.
->8
From 58f2a3a94c8e7bf9f3474bcafb6c59cc4f8bcbd9 Mon Sep 17 00:00:00 2001
From: Lan Tianyu <tianyu@intel.com>
Date: Sun, 15 Jul 2001 01:40:17 -0400
Subject: [PATCH] Qemu/KVM: Remove x2apic feature from CPU model when
 kernel_irqchip=off

x2apic feature is in the kvm_default_props and automatically added to all
CPU models when KVM is enabled regardless of kernel_irqchip=off. This will
trigger "warning: host doesn't support requested feature: CPUID.01H:
ECX.x2apic [bit 21]" when kernel_irqchip=off. This patch is to remove x2apic
feature when kernel_irqchip=off.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 target-i386/cpu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 0d447b5..2ec7eb7 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -2105,6 +2105,9 @@ static void x86_cpu_load_def(X86CPU *cpu,
X86CPUDefinition *def, Error **errp)

 /* Special cases not set in the X86CPUDefinition structs: */
 if (kvm_enabled()) {
+   if (!kvm_irqchip_in_kernel())
+   x86_cpu_change_kvm_default("x2apic", "off");
+
 x86_cpu_apply_props(cpu, kvm_default_props);
 }

--
1.9.3



Re: [Qemu-devel] live migration vs device assignment (motivation)

2016-01-03 Thread Lan Tianyu
On 2015年12月30日 00:46, Michael S. Tsirkin wrote:
> Interesting. So you sare saying merely ifdown/ifup is 100ms?
> This does not sound reasonable.
> Is there a chance you are e.g. getting IP from dhcp?
> 
> If so that is wrong - clearly should reconfigure the old IP
> back without playing with dhcp. For testing, just set up
> a static IP.

The MAC and IP are migrated with the VM to the target machine, so there is
no need to reconfigure the IP after migration.

From my test results, ixgbevf_down() consumes 35ms and ixgbevf_up()
consumes 55ms during migration.

-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-28 Thread Lan, Tianyu



On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:

As long as you keep up this vague talk about performance during
migration, without even bothering with any measurements, this patchset
will keep going nowhere.



I measured network service downtime for "keep device alive" (presented in
RFC patch V1) and "put down and up network interface" (presented in RFC
patch V2) during migration with some optimizations.


The former is around 140ms and the latter is around 240ms.

My patchset relies on the mailbox irq which doesn't work in the suspend
state, so I can't get downtime numbers for the suspend/resume cases. Will
try to get the result later.






There's Alex's patch that tracks memory changes during migration.  It
needs some simple enhancements to be useful in production (e.g. add a
host/guest handshake to both enable tracking in guest and to detect the
support in host), then it can allow starting migration with an assigned
device, by invoking hot-unplug after most of memory have been migrated.

Please implement this in qemu and measure the speed.


Sure. Will do that.


I will not be surprised if destroying/creating netdev in linux
turns out to take too long, but before anyone bothered
checking, it does not make sense to discuss further enhancements.




Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-24 Thread Lan Tianyu
Merry Christmas.
Sorry for the late response due to personal affairs.

On 2015年12月14日 03:30, Alexander Duyck wrote:
>> > These sounds we need to add a faked bridge for migration and adding a
>> > driver in the guest for it. It also needs to extend PCI bus/hotplug
>> > driver to do pause/resume other devices, right?
>> >
>> > My concern is still that whether we can change PCI bus/hotplug like that
>> > without spec change.
>> >
>> > IRQ should be general for any devices and we may extend it for
>> > migration. Device driver also can make decision to support migration
>> > or not.
> The device should have no say in the matter.  Either we are going to
> migrate or we will not.  This is why I have suggested my approach as
> it allows for the least amount of driver intrusion while providing the
> maximum number of ways to still perform migration even if the device
> doesn't support it.

Even if the device driver doesn't support migration, do you still want to
migrate the VM? That may be risky, and we should add the "bad path" for the
driver at least.

> 
> The solution I have proposed is simple:
> 
> 1.  Extend swiotlb to allow for a page dirtying functionality.
> 
>  This part is pretty straight forward.  I'll submit a few patches
> later today as RFC that can provided the minimal functionality needed
> for this.

I would really appreciate that.

> 
> 2.  Provide a vendor specific configuration space option on the QEMU
> implementation of a PCI bridge to act as a bridge between direct
> assigned devices and the host bridge.
> 
>  My thought was to add some vendor specific block that includes a
> capabilities, status, and control register so you could go through and
> synchronize things like the DMA page dirtying feature.  The bridge
> itself could manage the migration capable bit inside QEMU for all
> devices assigned to it.  So if you added a VF to the bridge it would
> flag that you can support migration in QEMU, while the bridge would
> indicate you cannot until the DMA page dirtying control bit is set by
> the guest.
> 
>  We could also go through and optimize the DMA page dirtying after
> this is added so that we can narrow down the scope of use, and as a
> result improve the performance for other devices that don't need to
> support migration.  It would then be a matter of adding an interrupt
> in the device to handle an event such as the DMA page dirtying status
> bit being set in the config space status register, while the bit is
> not set in the control register.  If it doesn't get set then we would
> have to evict the devices before the warm-up phase of the migration,
> otherwise we can defer it until the end of the warm-up phase.
> 
> 3.  Extend existing shpc driver to support the optional "pause"
> functionality as called out in section 4.1.2 of the Revision 1.1 PCI
> hot-plug specification.

Since your solution has added a faked PCI bridge, why not notify the
bridge directly during migration via an irq and call the device driver's
callbacks in the new bridge driver?

Alternatively, the new bridge driver could also check whether the device
driver provides migration callbacks or not and call them to improve the
passthrough device's performance during migration.
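For concreteness, a purely illustrative layout of the kind of
vendor-specific capability/status/control block proposed above (every
offset and bit name here is made up, not defined by any spec or device):

/* Hypothetical vendor-specific migration block on the emulated bridge. */
#define MIG_CAP_OFFSET        0x00  /* what the bridge/host supports          */
#define MIG_STATUS_OFFSET     0x04  /* written by the host/Qemu               */
#define MIG_CTRL_OFFSET       0x08  /* written by the guest                   */

#define MIG_CAP_DMA_DIRTY     (1 << 0)  /* host consumes DMA page dirtying    */
#define MIG_CAP_PAUSE_RESUME  (1 << 1)  /* "pause" hot-plug variant supported */

#define MIG_STATUS_START      (1 << 0)  /* host asks guest to prepare         */
#define MIG_CTRL_DMA_DIRTY_ON (1 << 0)  /* guest enabled DMA page dirtying    */
#define MIG_CTRL_PAUSE_OK     (1 << 1)  /* guest OS supports pause, not remove */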

> 
>  Note I call out "extend" here instead of saying to add this.
> Basically what we should do is provide a means of quiescing the device
> without unloading the driver.  This is called out as something the OS
> vendor can optionally implement in the PCI hot-plug specification.  On
> OSes that wouldn't support this it would just be treated as a standard
> hot-plug event.   We could add a capability, status, and control bit
> in the vendor specific configuration block for this as well and if we
> set the status bit would indicate the host wants to pause instead of
> remove and the control bit would indicate the guest supports "pause"
> in the OS.  We then could optionally disable guest migration while the
> VF is present and pause is not supported.
> 
>  To support this we would need to add a timer and if a new device
> is not inserted in some period of time (60 seconds for example), or if
> a different device is inserted,
> we need to unload the original driver
> from the device.  In addition we would need to verify if drivers can
> call the remove function after having called suspend without resume.
> If not, we could look at adding a recovery function to remove the
> driver from the device in the case of a suspend with either a failed
> resume or no resume call.  Once again it would probably be useful to
> have for those cases where power management suspend/resume runs into
> an issue like somebody causing a surprise removal while a device was
> suspended.


-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-13 Thread Lan, Tianyu



On 12/11/2015 1:16 AM, Alexander Duyck wrote:

On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu <tianyu@intel.com> wrote:



On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:


Ideally, it would be able to leave the guest driver unmodified, but it
requires the hypervisor or qemu to be aware of the device, which means we
may need a driver in the hypervisor or qemu to handle the device on behalf
of the guest driver.


Can you answer the question of when do you use your code -
 at the start of migration or
 just before the end?



Just before stopping VCPU in this version and inject VF mailbox irq to
notify the driver if the irq handler is installed.
Qemu side also will check this via the faked PCI migration capability
and driver will set the status during device open() or resume() callback.


The VF mailbox interrupt is a very bad idea.  Really the device should
be in a reset state on the other side of a migration.  It doesn't make
sense to have the interrupt firing if the device is not configured.
This is one of the things that is preventing you from being able to
migrate the device while the interface is administratively down or the
VF driver is not loaded.


In my opinion, if the VF driver is not loaded and the hardware hasn't
started to work, the device state doesn't need to be migrated.

We may add a flag for the driver to check whether migration happened while
it was down, reinitialize the hardware, and clear the flag when the system
tries to bring it up.


We may add a migration core to the Linux kernel and provide some helper
functions to make it easier to add migration support to drivers.

The migration core is in charge of syncing status with Qemu.

Example (see the sketch below):
migration_register()
The driver provides
- Callbacks to be called before and after migration, or for the bad path
- The irq it prefers to use for handling migration events.

migration_event_check()
The driver calls it in the irq handler. The migration core code will check
the migration status and call the driver's callbacks when migration happens.
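A sketch of what that proposed interface could look like (nothing below
exists in the kernel; it only mirrors the description above):

#include <linux/device.h>

struct migration_ops {
    int (*before_migration)(struct device *dev);  /* quiesce / save state   */
    int (*after_migration)(struct device *dev);   /* restore on the target  */
    int (*bad_path)(struct device *dev);          /* state lost: full reset */
};

/* The driver registers its callbacks and the irq it wants migration
 * events delivered through. */
int migration_register(struct device *dev, const struct migration_ops *ops,
                       unsigned int irq);

/* Called by the driver from its irq handler; the migration core checks the
 * status shared with Qemu and invokes the callbacks when a migration is in
 * progress. */
void migration_event_check(struct device *dev);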




My thought on all this is that it might make sense to move this
functionality into a PCI-to-PCI bridge device and make it a
requirement that all direct-assigned devices have to exist behind that
device in order to support migration.  That way you would be working
with a directly emulated device that would likely already be
supporting hot-plug anyway.  Then it would just be a matter of coming
up with a few Qemu specific extensions that you would need to add to
the device itself.  The same approach would likely be portable enough
that you could achieve it with PCIe as well via the same configuration
space being present on the upstream side of a PCIe port or maybe a
PCIe switch of some sort.

It would then be possible to signal via your vendor-specific PCI
capability on that device that all devices behind this bridge require
DMA page dirtying, you could use the configuration in addition to the
interrupt already provided for hot-plug to signal things like when you
are starting migration, and possibly even just extend the shpc
functionality so that if this capability is present you have the
option to pause/resume instead of remove/probe the device in the case
of certain hot-plug events.  The fact is there may be some use for a
pause/resume type approach for PCIe hot-plug in the near future
anyway.  From the sounds of it Apple has required it for all
Thunderbolt device drivers so that they can halt the device in order
to shuffle resources around, perhaps we should look at something
similar for Linux.

The other advantage behind grouping functions on one bridge is things
like reset domains.  The PCI error handling logic will want to be able
to reset any devices that experienced an error in the event of
something such as a surprise removal.  By grouping all of the devices
you could disable/reset/enable them as one logical group in the event
of something such as the "bad path" approach Michael has mentioned.



This sounds like we need to add a faked bridge for migration and a driver
in the guest for it. It also needs the PCI bus/hotplug driver to be
extended to pause/resume other devices, right?

My concern is still whether we can change the PCI bus/hotplug like that
without a spec change.

An IRQ should be generic for any device and we may extend it for
migration. The device driver can also make the decision to support
migration or not.






It would be great if we could avoid changing the guest; but at least your
guest driver changes don't actually seem to be that hardware specific;
could your changes actually be moved to generic PCI level so they could be
made to work for lots of drivers?




It is impossible to use one common solution for all devices unless the PCIe
spec documents it clearly, and I think one day it will be there. But before
that, we need some workarounds in the guest driver to make it work, even if
it looks ugly.



Yes, so far there is no hardware migration support and it's hard to
modify bus level code. It will also block the implementation on Windows.

Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Lan, Tianyu



On 12/11/2015 12:11 AM, Michael S. Tsirkin wrote:

On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:



On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:

Ideally, it would be able to leave the guest driver unmodified, but it
requires the hypervisor or qemu to be aware of the device, which means we
may need a driver in the hypervisor or qemu to handle the device on behalf
of the guest driver.

Can you answer the question of when do you use your code -
at the start of migration or
just before the end?


Just before stopping VCPU in this version and inject VF mailbox irq to
notify the driver if the irq handler is installed.
Qemu side also will check this via the faked PCI migration capability
and driver will set the status during device open() or resume() callback.


Right, this is the "good path" optimization. Whether this buys anything
as compared to just sending reset to the device when VCPU is stopped
needs to be measured. In any case, we probably do need a way to
interrupt driver on destination to make it reconfigure the device -
otherwise it might take seconds for it to notice.  And a way to make
sure driver can handle this surprise reset so we can block migration if
it can't.



Yes, we need such a way to notify the driver about the migration status and
do the reset or restore operation on the destination machine. My original
design is to take advantage of the device's irq to do that. The driver can
tell Qemu which irq it prefers to use to handle such a task and whether the
irq is enabled or bound to a handler. We may discuss the details in the
other thread.




It would be great if we could avoid changing the guest; but at least your guest
driver changes don't actually seem to be that hardware specific; could your
changes actually be moved to generic PCI level so they could be made
to work for lots of drivers?


It is impossible to use one common solution for all devices unless the PCIe
spec documents it clearly, and I think one day it will be there. But before
that, we need some workarounds in the guest driver to make it work, even if
it looks ugly.


Yes, so far there is not hardware migration support


VT-D supports setting dirty bit in the PTE in hardware.


Actually, this isn't supported by current hardware.
The VT-d spec documents the dirty bit for first-level translation, which
requires devices to support DMA requests with a PASID (process
address space identifier). Most devices don't support the feature.




and it's hard to modify
bus level code.


Why is it hard?


As Yang said, the concern is that the PCI spec doesn't document how to
do migration.





It also will block implementation on the Windows.


Implementation of what?  We are discussing motivation here, not
implementation.  E.g. windows drivers typically support surprise
removal, should you use that, you get some working code for free.  Just
stop worrying about it.  Make it work, worry about closed source
software later.


Dave





Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Lan, Tianyu

On 12/10/2015 4:38 PM, Michael S. Tsirkin wrote:

Let's assume you do save state and do have a way to detect
whether state matches a given hardware. For example,
driver could store firmware and hardware versions
in the state, and then on destination, retrieve them
and compare. It will be pretty common that you have a mismatch,
and you must not just fail migration. You need a way to recover,
maybe with more downtime.


Second, you can change the driver but you can not be sure it will have
the chance to run at all. Host overload is a common reason to migrate
out of the host.  You also can not trust guest to do the right thing.
So how long do you want to wait until you decide guest is not
cooperating and kill it?  Most people will probably experiment a bit and
then add a bit of a buffer. This is not robust at all.

Again, maybe you ask driver to save state, and if it does
not respond for a while, then you still migrate,
and driver has to recover on destination.


With the above in mind, you need to support two paths:
1. "good path": driver stores state on source, checks it on destination
detects a match and restores state into the device
2. "bad path": driver does not store state, or detects a mismatch
on destination. driver has to assume device was lost,
and reset it

So what I am saying is, implement bad path first. Then good path
is an optimization - measure whether it's faster, and by how much.



These sound reasonable. The driver should have the ability to do such
checks to ensure hardware or firmware coherence after migration, and to
reset the device when migration happens at some unexpected point.




Also, it would be nice if on the bad path there was a way
to switch to another driver entirely, even if that means
a bit more downtime. For example, have a way for driver to
tell Linux it has to re-do probing for the device.


I just glanced at the device core code. device_reprobe() does what you said.

/**
 * device_reprobe - remove driver for a device and probe for a new  driver
 * @dev: the device to reprobe
 *
 * This function detaches the attached driver (if any) for the given
 * device and restarts the driver probing process.  It is intended
 * to use if probing criteria changed during a devices lifetime and
 * driver attachment should change accordingly.
 */
int device_reprobe(struct device *dev)








Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Lan, Tianyu



On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:

Ideally, it is able to leave guest driver unmodified but it requires the
>hypervisor or qemu to aware the device which means we may need a driver in
>hypervisor or qemu to handle the device on behalf of guest driver.

Can you answer the question of when do you use your code -
at the start of migration or
just before the end?


Just before stopping VCPU in this version and inject VF mailbox irq to
notify the driver if the irq handler is installed.
Qemu side also will check this via the faked PCI migration capability
and driver will set the status during device open() or resume() callback.




> >It would be great if we could avoid changing the guest; but at least your 
guest
> >driver changes don't actually seem to be that hardware specific; could your
> >changes actually be moved to generic PCI level so they could be made
> >to work for lots of drivers?

>
>It is impossible to use one common solution for all devices unless the PCIE
>spec documents it clearly and i think one day it will be there. But before
>that, we need some workarounds on guest driver to make it work even it looks
>ugly.


Yes, so far there is no hardware migration support and it's hard to
modify bus level code. It will also block the implementation on Windows.



Dave





Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Lan, Tianyu



On 12/8/2015 1:12 AM, Alexander Duyck wrote:

On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu <tianyu@intel.com> wrote:

On 12/5/2015 1:07 AM, Alexander Duyck wrote:



We still need to support Windows guest for migration and this is why our
patches keep all changes in the driver since it's impossible to change
Windows kernel.



That is a poor argument.  I highly doubt Microsoft is interested in
having to modify all of the drivers that will support direct assignment
in order to support migration.  They would likely request something
similar to what I have in that they will want a way to do DMA tracking
with minimal modification required to the drivers.



This totally depends on the NIC or other device vendors, and they
should make the decision whether to support migration or not. If yes, they
would modify the driver.


Having to modify every driver that wants to support live migration is
a bit much.  In addition I don't see this being limited only to NIC
devices.  You can direct assign a number of different devices, your
solution cannot be specific to NICs.


We are also adding such migration support for the QAT device, so our
solution will not be limited to NICs. This is just the beginning.

We can't limit users to only using Linux guests. So the migration feature
should work for both Windows and Linux guests.




If the only target is to call suspend/resume during migration, the feature
will be meaningless. In most cases we don't want to affect the user much
during migration, so the service downtime is vital. Our target is to apply
SR-IOV NIC passthrough to cloud service and NFV (network functions
virtualization) projects, which are sensitive to network performance
and stability. In my opinion, we should give the device driver a chance to
implement its own migration job, and call the suspend and resume callbacks
in the driver if it doesn't care about performance during migration.


The suspend/resume callback should be efficient in terms of time.
After all we don't want the system to stall for a long period of time
when it should be either running or asleep.  Having it burn cycles in
a power state limbo doesn't do anyone any good.  If nothing else maybe
it will help to push the vendors to speed up those functions which
then benefit migration and the system sleep states.


If we can benefit both migration and suspend, that would be wonderful.
But migration and system PM are still different. Just for example, the
driver doesn't need to put the device into a deep D-state during migration
(the host can do this after migration), while that's essential for
system sleep. The PCI config space and interrupt config are emulated by
Qemu, and Qemu can migrate these configurations to the new machine; the
driver doesn't need to deal with such things. So I think migration still
needs a different callback or a different code path than device
suspend/resume.

Another concern is that we would have to rework the PM core or PCI bus
driver to call suspend/resume for passthrough devices during migration.
This also blocks the new feature from working on Windows.



Also you keep assuming you can keep the device running while you do
the migration and you can't.  You are going to corrupt the memory if
you do, and you have yet to provide any means to explain how you are
going to solve that.



The main problem is the DMA tracking issue. I will repost my solution in a
new thread for discussion. If there is no way to mark DMA pages dirty while
DMA is enabled, we have to stop DMA for a short time to do that at the
last stage.








Following is my idea to do DMA tracking.

Inject event to VF driver after memory iterate stage
and before stop VCPU and then VF driver marks dirty all
using DMA memory. The new allocated pages also need to
be marked dirty before stopping VCPU. All dirty memory
in this time slot will be migrated until stop-and-copy
stage. We also need to make sure to disable VF via clearing the
bus master enable bit for VF before migrating these memory.



The ordering of your explanation here doesn't quite work.  What needs to
happen is that you have to disable DMA and then mark the pages as dirty.
   What the disabling of the BME does is signal to the hypervisor that
the device is now stopped.  The ixgbevf_suspend call already supported
by the driver is almost exactly what is needed to take care of something
like this.



This is why I hope to reserve a piece of space in the DMA page to do a
dummy write. This can help mark the page dirty without requiring DMA to
stop and without racing with the DMA data.
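An illustrative sketch of that dummy-write idea (the bookkeeping structures
are hypothetical, and as the reply below argues, this can still race with
in-flight device DMA):

#include <linux/list.h>
#include <linux/types.h>

/* Hypothetical bookkeeping the VF driver would keep per DMA buffer. */
struct dma_page {
    struct list_head node;
    u8 *cpu_addr;            /* kernel mapping of the DMA page             */
    size_t reserved_offset;  /* byte reserved in the page for dummy writes */
};

struct vf_priv {
    struct list_head dma_page_list;
};

/* Touch the reserved byte of every DMA page so the hypervisor's dirty
 * logging marks the page dirty without stopping DMA. */
static void mark_dma_pages_dirty(struct vf_priv *priv)
{
    struct dma_page *p;

    list_for_each_entry(p, &priv->dma_page_list, node) {
        volatile u8 *slot = p->cpu_addr + p->reserved_offset;

        *slot = *slot;   /* value unchanged; the store dirties the page */
    }
}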


You can't and it will still race.  What concerns me is that your
patches and the document you referenced earlier show a considerable
lack of understanding about how DMA and device drivers work.  There is
a reason why device drivers have so many memory barriers and the like
in them.  The fact is when you have CPU and a device both accessing
memory things have to be done in a very specific order and you cannot
violate that.

If you have a contiguous block of memory you expect the device to
write into you cannot just poke 

Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Lan, Tianyu

On 12/9/2015 6:37 PM, Michael S. Tsirkin wrote:

On Sat, Dec 05, 2015 at 12:32:00AM +0800, Lan, Tianyu wrote:

Hi Michael & Alexander:
Thanks a lot for your comments and suggestions.


It's nice that it's appreciated, but you then go on and ignore
all that I have written here:
https://www.mail-archive.com/kvm@vger.kernel.org/msg123826.html



No, I will reply to it separately and, according to your suggestion, split
it into 3 threads.



We still need to support Windows guest for migration and this is why our
patches keep all changes in the driver since it's impossible to change
Windows kernel.


This is not a reasonable argument.  It makes no sense to duplicate code
on Linux because you must duplicate code on Windows.  Let's assume you
must do it in the driver on windows because windows has closed source
drivers.  What does it matter? Linux can still do it as part of DMA API
and have it apply to all drivers.



Sure. Duplicated code should be encapsulated and made reusable by other
drivers, just like the dummy write part you mentioned.

I meant the framework should not require changing Windows kernel code
(such as the PM core or PCI bus driver), since that would block the
implementation on Windows.

I think it's not a problem to duplicate code in the Windows drivers.


Following is my idea to do DMA tracking.

Inject event to VF driver after memory iterate stage
and before stop VCPU and then VF driver marks dirty all
using DMA memory. The new allocated pages also need to
be marked dirty before stopping VCPU. All dirty memory
in this time slot will be migrated until stop-and-copy
stage. We also need to make sure to disable VF via clearing the
bus master enable bit for VF before migrating these memory.

The dma page allocated by VF driver also needs to reserve space
to do dummy write.


I suggested ways to do it all in the hypervisor without driver hacks, or
hide it within DMA API without need to reserve extra space. Both
approaches seem much cleaner.



This sounds reasonable. We may discuss it in detail in a separate thread.




Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Lan, Tianyu

On 12/9/2015 7:28 PM, Michael S. Tsirkin wrote:

I remember reading that it's possible to implement a bus driver
on windows if required.  But basically I don't see how windows can be
relevant to discussing guest driver patches. That discussion
probably belongs on the qemu maling list, not on lkml.


I am not sure whether we can write a bus driver for Windows to support
migration. But I think device vendors who want to support migration will
improve their drivers if we provide such a framework in the hypervisor that
only requires them to change their drivers.



Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-09 Thread Lan, Tianyu

On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:

I thought about what this is doing at the high level, and I do have some
value in what you are trying to do, but I also think we need to clarify
the motivation a bit more.  What you are saying is not really what the
patches are doing.

And with that clearer understanding of the motivation in mind (assuming
it actually captures a real need), I would also like to suggest some
changes.


Motivation:
Most current solutions for migration with a passthrough device are based on
PCI hotplug, but that has side effects and can't work for all devices.

For NIC devices:
The PCI hotplug solution can work around network device migration
by switching between the VF and PF.

But switching the network interface will introduce service downtime.

I tested the service downtime by putting the VF and PV interfaces
into a bonded interface and pinging the bonded interface while plugging
and unplugging the VF.
1) About 100ms when adding the VF
2) About 30ms when deleting the VF

It also requires the guest to do the switching configuration. These are
hard to manage and deploy for our customers. To maintain PV performance
during migration, the host side also needs to assign a VF to the PV device.
This affects scalability.

These factors block SR-IOV NIC passthrough usage in cloud services and
OPNFV, which require high network performance and stability.


For other kinds of devices, it's hard to make this work.
We are also adding migration support for the QAT (QuickAssist Technology) device.

QAT device use case introduction.
Server, networking, big data, and storage applications use QuickAssist 
Technology to offload servers from handling compute-intensive 
operations, such as:
1) Symmetric cryptography functions including cipher operations and 
authentication operations
2) Public key functions including RSA, Diffie-Hellman, and elliptic 
curve cryptography

3) Compression and decompression functions including DEFLATE and LZS

PCI hotplug will not work for such devices during migration, and these
operations will fail when the device is unplugged.


So we are trying to implement a new solution that really migrates the
device state to the target machine and won't affect the user during
migration, with low service downtime.



Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-09 Thread Lan, Tianyu



On 12/10/2015 1:14 AM, Alexander Duyck wrote:

On Wed, Dec 9, 2015 at 8:26 AM, Lan, Tianyu <tianyu@intel.com> wrote:


For other kind of devices, it's hard to work.
We are also adding migration support for QAT(QuickAssist Technology) device.

QAT device user case introduction.
Server, networking, big data, and storage applications use QuickAssist
Technology to offload servers from handling compute-intensive operations,
such as:
1) Symmetric cryptography functions including cipher operations and
authentication operations
2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
cryptography
3) Compression and decompression functions including DEFLATE and LZS

PCI hotplug will not work for such devices during migration and these
operations will fail when unplug device.


I assume the problem is that with a PCI hotplug event you are losing
the state information for the device, do I have that right?

Looking over the QAT drivers it doesn't seem like any of them support
the suspend/resume PM calls.  I would imagine it makes it difficult
for a system with a QAT card in it to be able to drop the system to a
low power state.  You might want to try enabling suspend/resume
support for the devices on bare metal before you attempt to take on
migration as it would provide you with a good testing framework to see
what needs to be saved/restored within the device and in what order
before you attempt to do the same while migrating from one system to
another.


Sure. The suspend/resume job is under way.
Actually, we have already made QAT migration work internally; we are doing
more testing and fixing bugs.




- Alex





Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-09 Thread Lan, Tianyu


On 12/10/2015 4:07 AM, Michael S. Tsirkin wrote:

On Thu, Dec 10, 2015 at 12:26:25AM +0800, Lan, Tianyu wrote:

On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:

I thought about what this is doing at the high level, and I do have some
value in what you are trying to do, but I also think we need to clarify
the motivation a bit more.  What you are saying is not really what the
patches are doing.

And with that clearer understanding of the motivation in mind (assuming
it actually captures a real need), I would also like to suggest some
changes.


Motivation:
Most current solutions for migration with a passthrough device are based on
PCI hotplug, but that has side effects and can't work for all devices.

For NIC devices:
The PCI hotplug solution can work around network device migration
by switching between VF and PF.


This is just more confusion. hotplug is just a way to add and remove
devices. switching VF and PF is up to guest and hypervisor.


This is a combination. Because it's not currently possible to migrate
device state (which is what we are doing), existing solutions for
migrating a VM with a passthrough NIC rely on PCI hotplug: unplug the VF
before starting migration and switch the network from the VF NIC to the
PV NIC in order to maintain the network connection, then plug the VF
again after migration and switch from PV back to VF. The bonding driver
provides a way to switch between the PV and VF NICs automatically while
keeping the same IP and MAC, so it is generally preferred.





But switching the network interface will introduce service downtime.

I tested the service downtime by putting the VF and PV interfaces
into a bonded interface and pinging the bonded interface while plugging
and unplugging the VF.
1) About 100ms when adding the VF
2) About 30ms when deleting the VF


OK and what's the source of the downtime?
I'm guessing that's just arp being repopulated.  So simply save and
re-populate it.

There would be a much cleaner solution.

Or maybe there's a timer there that just delays hotplug
for no reason. Fix it, everyone will benefit.


It also requires guest to do switch configuration.


That's just wrong. if you want a switch, you need to
configure a switch.


I meant the configuration of the switching operation between PV and VF.




These are hard to
manage and deploy from our customers.


So kernel want to remain flexible, and the stack is
configurable. Downside: customers need to deploy userspace
to configure it. Your solution: a hard-coded configuration
within kernel and hypervisor.  Sorry, this makes no sense.
If kernel is easier for you to deploy than userspace,
you need to rethink your deployment strategy.


This is one factor.




To maintain PV performance during
migration, host side also needs to assign a VF to PV device. This
affects scalability.


No idea what this means.


These factors block SRIOV NIC passthough usage in the cloud service and
OPNFV which require network high performance and stability a lot.


Everyone needs performance and scalability.



For other kind of devices, it's hard to work.
We are also adding migration support for QAT(QuickAssist Technology) device.

QAT device user case introduction.
Server, networking, big data, and storage applications use QuickAssist
Technology to offload servers from handling compute-intensive operations,
such as:
1) Symmetric cryptography functions including cipher operations and
authentication operations
2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
cryptography
3) Compression and decompression functions including DEFLATE and LZS

PCI hotplug will not work for such devices during migration and these
operations will fail when unplug device.

So we are trying implementing a new solution which really migrates
device state to target machine and won't affect user during migration
with low service down time.


Let's assume for the sake of the argument that there's a lot going on
and removing the device is just too slow (though you should figure out
what's going on before giving up and just building something new from
scratch).


No, we can use a PV NIC as a backup for the VF NIC during migration, but
that doesn't work for other kinds of devices since there is no backup for
them. E.g., when migration happens while the user is compressing files via
QAT, it's impossible to remove the QAT device at that point. If we did,
the compression operation would fail and affect the user experience.




I still don't think you should be migrating state.  That's just too
fragile, and it also means you depend on driver to be nice and shut down
device on source, so you can not migrate at will.  Instead, reset device
on destination and re-initialize it.



Yes, saving and restoring device state relies on the driver, so we
rework the driver and make it more migration-friendly.






Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-07 Thread Lan, Tianyu

On 12/5/2015 1:07 AM, Alexander Duyck wrote:


We still need to support Windows guests for migration, and this is why our
patches keep all changes in the driver, since it's impossible to change
the Windows kernel.


That is a poor argument.  I highly doubt Microsoft is interested in
having to modify all of the drivers that will support direct assignment
in order to support migration.  They would likely request something
similar to what I have in that they will want a way to do DMA tracking
with minimal modification required to the drivers.


This totally depends on the NIC or other device vendors; they should make
the decision whether to support migration or not. If yes, they would
modify the driver.

If the only target is to call suspend/resume during migration, the feature
will be meaningless. Most use cases don't want migration to affect the
user much, so the service downtime is vital. Our target is to apply
SR-IOV NIC passthrough to cloud service and NFV (network functions
virtualization) projects, which are sensitive to network performance
and stability. In my opinion, we should give the device driver a chance
to implement its own migration job, and fall back to calling the suspend
and resume callbacks in the driver if performance during migration doesn't matter.




Following is my idea to do DMA tracking.

Inject an event to the VF driver after the memory-iterate stage and
before stopping the VCPU, and then the VF driver marks dirty all memory
it is using for DMA. Newly allocated pages also need to be marked dirty
before stopping the VCPU. All memory dirtied in this time slot will be
migrated during the stop-and-copy stage. We also need to make sure to
disable the VF, by clearing its bus master enable bit, before migrating
this memory.


The ordering of your explanation here doesn't quite work.  What needs to
happen is that you have to disable DMA and then mark the pages as dirty.
  What the disabling of the BME does is signal to the hypervisor that
the device is now stopped.  The ixgbevf_suspend call already supported
by the driver is almost exactly what is needed to take care of something
like this.


This is why I hope to reserve a piece of space in the DMA page for the
dummy write. This helps to mark the page dirty without requiring DMA to
be stopped and without racing with DMA data.


If we can't do that, we have to stop DMA for a short time, mark all DMA
pages dirty, and then re-enable it. I am not sure how much we gain by
tracking all DMA memory this way with the device running during migration.
I need to do some tests and compare the results with stopping DMA directly
at the last stage of migration.
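
To make the first idea concrete, here is a minimal sketch of the reserved-space
dummy write, assuming each RX buffer can be over-allocated by one byte and that
buffer plus pad fit in a single page (the multi-page case raised later in this
thread is not handled); the structure and function names are hypothetical, not
part of any existing driver:

#include <linux/gfp.h>
#include <linux/slab.h>
#include <linux/types.h>

/* Hypothetical RX buffer with a pad byte the device never writes to. */
struct mig_rx_buf {
	void *data;		/* area handed to the device for DMA */
	u8 *dirty_pad;		/* reserved byte in the same page    */
};

static int mig_rx_buf_alloc(struct mig_rx_buf *buf, size_t len)
{
	/* One extra byte at the tail, never posted to the device. */
	buf->data = kmalloc(len + 1, GFP_ATOMIC);
	if (!buf->data)
		return -ENOMEM;
	buf->dirty_pad = (u8 *)buf->data + len;
	return 0;
}

/*
 * Mark the backing page dirty without stopping DMA and without racing
 * with DMA data: the pad lives in the same page as the buffer but
 * outside the region the device uses.
 */
static void mig_rx_buf_mark_dirty(struct mig_rx_buf *buf)
{
	*(volatile u8 *)buf->dirty_pad = 0;
}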




The question is how we would go about triggering it.  I really don't
think the PCI configuration space approach is the right idea.
 I wonder
if we couldn't get away with some sort of ACPI event instead.  We
already require ACPI support in order to shut down the system
gracefully, I wonder if we couldn't get away with something similar in
order to suspend/resume the direct assigned devices gracefully.



I don't think there are such events in the current spec.
Also, there are two kinds of suspend/resume callbacks:
1) System suspend/resume, called during S2RAM and S2DISK.
2) Runtime suspend/resume, called by the PM core when the device is idle.
If you want to do what you mentioned, you have to change the PM core and
the ACPI spec.


The dma page allocated by VF driver also needs to reserve space
to do dummy write.


No, this will not work.  If for example you have a VF driver allocating
memory for a 9K receive how will that work?  It isn't as if you can poke
a hole in the contiguous memory.




Re: [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC

2015-12-04 Thread Lan, Tianyu


On 12/4/2015 4:05 PM, Michael S. Tsirkin wrote:

I haven't read it, but I would like to note you can't rely on research
papers.  If you propose a patch to be merged you need to measure what is
its actual effect on modern linux at the end of 2015.


Sure. Will do that.



Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-04 Thread Lan, Tianyu

Hi Michael & Alexander:
Thanks a lot for your comments and suggestions.

We still need to support Windows guests for migration, and this is why our
patches keep all changes in the driver, since it's impossible to change
the Windows kernel.


Following is my idea to do DMA tracking.

Inject an event to the VF driver after the memory-iterate stage and
before stopping the VCPU, and then the VF driver marks dirty all memory
it is using for DMA. Newly allocated pages also need to be marked dirty
before stopping the VCPU. All memory dirtied in this time slot will be
migrated during the stop-and-copy stage. We also need to make sure to
disable the VF, by clearing its bus master enable bit, before migrating
this memory.

The DMA pages allocated by the VF driver also need to reserve space for
the dummy write.
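
As a rough sketch of the "disable VF via clearing the bus master enable bit"
step, assuming it runs in the guest VF driver once the migration event
arrives (the function names are illustrative, not part of any posted patch):

#include <linux/pci.h>

/*
 * Quiesce the VF before the final memory copy: clearing Bus Master
 * Enable stops the device from issuing DMA (and MSI/MSI-X writes),
 * so from this point on only the CPU dirties guest memory.
 */
static void vf_quiesce_for_migration(struct pci_dev *pdev)
{
	pci_clear_master(pdev);		/* clears PCI_COMMAND_MASTER */
}

/* On the destination, or if migration aborts, re-enable DMA. */
static void vf_resume_after_migration(struct pci_dev *pdev)
{
	pci_set_master(pdev);
}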


On 12/2/2015 7:44 PM, Michael S. Tsirkin wrote:

On Tue, Dec 01, 2015 at 10:36:33AM -0800, Alexander Duyck wrote:

On Tue, Dec 1, 2015 at 9:37 AM, Michael S. Tsirkin  wrote:

On Tue, Dec 01, 2015 at 09:04:32AM -0800, Alexander Duyck wrote:

On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin  wrote:



There are several components to this:
- dma_map_* needs to prevent page from
   being migrated while device is running.
   For example, expose some kind of bitmap from guest
   to host, set bit there while page is mapped.
   What happens if we stop the guest and some
   bits are still set? See dma_alloc_coherent below
   for some ideas.


Yeah, I could see something like this working.  Maybe we could do
something like what was done for the NX bit and make use of the upper
order bits beyond the limits of the memory range to mark pages as
non-migratable?

I'm curious.  What we have with a DMA mapped region is essentially
shared memory between the guest and the device.  How would we resolve
something like this with IVSHMEM, or are we blocked there as well in
terms of migration?


I have some ideas. Will post later.


I look forward to it.
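
As an illustration of the dma_map_* idea above (a guest-exposed bitmap with a
bit set while a page is mapped), here is a minimal sketch; the shared bitmap,
how it is published to the host, and the helper names are all assumptions:

#include <linux/bitops.h>
#include <linux/mm.h>
#include <linux/types.h>

/*
 * Hypothetical bitmap shared with the hypervisor: one bit per guest
 * page frame, set while the page is mapped for DMA so the hypervisor
 * knows the device may still write to it.
 */
static unsigned long *dma_inflight_bitmap;	/* lives in a shared region */

static void dma_track_map(dma_addr_t dma_addr)
{
	set_bit(dma_addr >> PAGE_SHIFT, dma_inflight_bitmap);
}

static void dma_track_unmap(dma_addr_t dma_addr)
{
	/* dma_unmap_* would also dirty the page before clearing the bit. */
	clear_bit(dma_addr >> PAGE_SHIFT, dma_inflight_bitmap);
}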


- dma_unmap_* needs to mark page as dirty
   This can be done by writing into a page.

- dma_sync_* needs to mark page as dirty
   This is trickier as we can not change the data.
   One solution is using atomics.
   For example:
 int x = ACCESS_ONCE(*p);
 cmpxchg(p, x, x);
   Seems to do a write without changing page
   contents.
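
A minimal sketch of the dma_sync_* trick just described, assuming a
Linux-kernel context; the helper name is hypothetical and the buffer is
assumed to live in memory where touching the enclosing pages is safe:

#include <linux/atomic.h>
#include <linux/compiler.h>
#include <linux/mm.h>

/*
 * Touch one word per page of a DMA buffer with a value-preserving
 * atomic, so dirty logging in the hypervisor sees a write while the
 * buffer contents stay unchanged.
 */
static void dma_mark_range_dirty(void *addr, size_t size)
{
	unsigned long cur = (unsigned long)addr & ~(sizeof(long) - 1);
	unsigned long end = (unsigned long)addr + size;

	for (; cur < end; cur = (cur & PAGE_MASK) + PAGE_SIZE) {
		unsigned long *p = (unsigned long *)cur;
		unsigned long x = ACCESS_ONCE(*p);

		cmpxchg(p, x, x);	/* writes back the same value */
	}
}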


Like I said we can probably kill 2 birds with one stone by just
implementing our own dma_mark_clean() for x86 virtualized
environments.

I'd say we could take your solution one step further and just use 0
instead of bothering to read the value.  After all it won't write the
area if the value at the offset is not 0.


Really almost any atomic that has no side effect will do.
atomic or with 0
atomic and with 

It's just that cmpxchg already happens to have a portable
wrapper.


I was originally thinking maybe an atomic_add with 0 would be the way
to go.


cmpxchg with any value too.


  Either way though we still are using a locked prefix and
having to dirty a cache line per page which is going to come at some
cost.


I agree. It's likely not necessary for everyone
to be doing this: only people that both
run within the VM and want migration to work
need to do this logging.

So set some module option to have driver tell hypervisor that it
supports logging.  If bus mastering is enabled before this, migration is
blocked.  Or even pass some flag from hypervisor so
driver can detect it needs to log writes.
I guess this could be put in device config somewhere,
though in practice it's a global thing, not a per device one, so
maybe we need some new channel to
pass this flag to guest. CPUID?
Or maybe we can put some kind of agent in the initrd
and use the existing guest agent channel after all.
agent in initrd could open up a lot of new possibilities.



- dma_alloc_coherent memory (e.g. device rings)
   must be migrated after device stopped modifying it.
   Just stopping the VCPU is not enough:
   you must make sure device is not changing it.

   Or maybe the device has some kind of ring flush operation,
   if there was a reasonably portable way to do this
   (e.g. a flush capability could maybe be added to SRIOV)
   then hypervisor could do this.


This is where things start to get messy. I was suggesting the
suspend/resume to resolve this bit, but it might be possible to also
deal with this via something like this via clearing the bus master
enable bit for the VF.  If I am not mistaken that should disable MSI-X
interrupts and halt any DMA.  That should work as long as you have
some mechanism that is tracking the pages in use for DMA.


A bigger issue is recovering afterwards.


Agreed.


   In case you need to resume on source, you
   really need to follow the same path
   as on destination, preferably detecting
   device reset and restoring the device
   state.


The problem with detecting the reset is that you would likely have to
be polling to do something like that.


We could some event to guest to notify it about this event
through a new or 

Re: [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC

2015-12-03 Thread Lan, Tianyu


On 12/2/2015 10:31 PM, Michael S. Tsirkin wrote:

>We hope
>to find a better way to make SRIOV NIC work in these cases and this is
>worth to do since SRIOV NIC provides better network performance compared
>with PV NIC.

If this is a performance optimization as the above implies,
you need to include some numbers, and document how did
you implement the switch and how did you measure the performance.



OK. Some ideas in my patches come from the paper "CompSC: Live Migration with
Pass-through Devices".
http://www.cl.cam.ac.uk/research/srg/netos/vee_2012/papers/p109.pdf

It compares performance data between the PV/VF switching solution and VF
migration (Chapter 7: Discussion).




>Current patches have some issues. I think we can find
>solutions for them and improve them step by step.




Re: [Qemu-devel] [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support

2015-12-03 Thread Lan, Tianyu



On 12/3/2015 6:25 AM, Alex Williamson wrote:

On Tue, 2015-11-24 at 21:35 +0800, Lan Tianyu wrote:

This patch is to add SRIOV VF migration support.
Create new device type "vfio-sriov" and add faked PCI migration capability
to the type device.

The purpose of the new capability
1) sync migration status with VF driver in the VM
2) Get mailbox irq vector to notify VF driver during migration.
3) Provide a way to control injecting irq or not.

Qemu will migrate the PCI config space regs and MSI-X config for the VF.
It injects the mailbox irq at the last stage of migration to notify the VF
about the migration event and waits for the VF driver to become ready for
migration. The VF driver writes the PCI config reg
PCI_VF_MIGRATION_VF_STATUS in the new cap table to tell Qemu.


What makes this sr-iov specific?  Why wouldn't we simply extend vfio-pci
with a migration=on feature?  Thanks,


Sounds reasonable and will update it.



Alex





Re: [Qemu-devel] [RFC PATCH V2 02/10] Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition

2015-12-03 Thread Lan, Tianyu


On 12/3/2015 6:25 AM, Alex Williamson wrote:

I didn't seen a matching kernel patch series for this, but why is the
kernel more capable of doing this than userspace is already?

The following link is the kernel patch.
http://marc.info/?l=kvm&m=144837328920989&w=2


These seem
like pointless ioctls, we're creating a purely virtual PCI capability,
the kernel doesn't really need to participate in that.


The VFIO kernel driver has pci_config_map, which records the position and
length of each PCI capability and helps to find free PCI config regs. The
Qemu side doesn't have such info and can't get the exact table size of a
PCI capability. If we want to add such support in Qemu, we would need to
duplicate a lot of code from vfio_pci_config.c in Qemu.



Also, why are we
restricting ourselves to standard capabilities?


This version is to check whether we are on the right track, and we can
extend this to PCI extended capabilities later.


That's often a crowded
space and we can't always know whether an area is free or not based only
on it being covered by a capability.  Some capabilities can also appear
more than once, so there's context that isn't being passed to the kernel
here.  Thanks,


The regions outside of PCI capabilities are not passed to the kernel or
used by Qemu for MSI/MSI-X. It's possible to use these places for a new
capability. One concern is that the guest driver may abuse them, so a
quirk masking some special regs outside of the capabilities may be helpful.






Re: [Qemu-devel] [RFC PATCH V2 06/10] Qemu/PCI: Add macros for faked PCI migration capability

2015-12-03 Thread Lan, Tianyu



On 12/3/2015 6:25 AM, Alex Williamson wrote:

This will of course break if the PCI SIG defines that capability index.
Couldn't this be done within a vendor defined capability?  Thanks,


Yes, it should work; thanks for the suggestion.



Re: [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC

2015-12-02 Thread Lan, Tianyu

On 12/1/2015 11:02 PM, Michael S. Tsirkin wrote:

But
it requires the guest OS to do specific configuration inside and to rely on
the bonding driver, which prevents it from working on Windows.
From the performance side,
putting the VF and virtio NICs under a bonded interface will affect their
performance even when not migrating. These factors block the use of VF
NIC passthrough in some use cases (especially in the cloud) which require
migration.


That's really up to guest. You don't need to do bonding,
you can just move the IP and mac from userspace, that's
possible on most OS-es.

Or write something in guest kernel that is more lightweight if you are
so inclined. What we are discussing here is the host-guest interface,
not the in-guest interface.


The solution we proposed changes the NIC driver and Qemu. The guest OS
doesn't need to do anything special for migration.
It's easy to deploy.


Except of course these patches don't even work properly yet.

And when they do, even minor changes in host side NIC hardware across
migration will break guests in hard to predict ways.


Switching between the PV and VF NICs will interrupt the network, and the
latency of hotplugging the VF is measurable. For some use cases (cloud
service and OPNFV) which are sensitive to network stability and
performance, these effects are not friendly and block SR-IOV NIC usage.
We hope to find a better way to make SR-IOV NICs work in these cases, and
this is worth doing since an SR-IOV NIC provides better network
performance than a PV NIC. The current patches have some issues; I think
we can find solutions for them and improve them step by step.




Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-01 Thread Lan, Tianyu



On 12/1/2015 12:07 AM, Alexander Duyck wrote:

They can only be corrected if the underlying assumptions are correct
and they aren't.  Your solution would have never worked correctly.
The problem is you assume you can keep the device running when you are
migrating and you simply cannot.  At some point you will always have
to stop the device in order to complete the migration, and you cannot
stop it before you have stopped your page tracking mechanism.  So
unless the platform has an IOMMU that is somehow taking part in the
dirty page tracking you will not be able to stop the guest and then
the device, it will have to be the device and then the guest.


>Doing suspend and resume() may help to do migration easily but some
>devices requires low service down time. Especially network and I got
>that some cloud company promised less than 500ms network service downtime.

Honestly focusing on the downtime is getting the cart ahead of the
horse.  First you need to be able to do this without corrupting system
memory and regardless of the state of the device.  You haven't even
gotten to that state yet.  Last I knew the device had to be up in
order for your migration to even work.


I think the issue is that the content of an RX packet delivered to the
stack may change during migration because that piece of memory won't be
migrated to the new machine. This may confuse applications or the stack.
The current dummy-write solution can ensure the content of a packet won't
change after the dummy write, but the content may not be the received data
if migration happens before that point. We can recheck the content via the
checksum or CRC in the protocol after the dummy write to ensure the content
is what the VF received. I think the stack already performs such checks and
the packet will be dropped if it fails them.
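
A rough sketch of that recheck, using a plain RFC 1071-style ones'-complement
sum rather than any specific stack helper; whether the protocol actually
carries a checksum over the region, and where this check would sit in the
driver, are assumptions:

#include <linux/types.h>

/* Ones'-complement sum over a buffer, folded to 16 bits. */
static u16 csum16(const void *data, size_t len)
{
	const u8 *p = data;
	u32 sum = 0;

	while (len > 1) {
		sum += (p[0] << 8) | p[1];
		p += 2;
		len -= 2;
	}
	if (len)
		sum += p[0] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (u16)~sum;
}

/*
 * For a header that carries its own checksum (e.g. an IPv4 header),
 * csum16() over the whole header returns 0 only if the buffer still
 * holds the bytes the VF actually received; otherwise the packet is
 * dropped and the sender retransmits.
 */
static bool rx_content_still_valid(const void *hdr, size_t len)
{
	return csum16(hdr, len) == 0;
}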


Another way is to tell Qemu about all the memory the driver is using and
let Qemu migrate that memory after stopping the VCPU and the device. This
seems safe but the implementation may be complex.




Re: [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC

2015-11-30 Thread Lan, Tianyu



On 11/30/2015 4:01 PM, Michael S. Tsirkin wrote:

It is still not very clear what it is you are trying to achieve, and
whether your patchset achieves it.  You merely say "adding live
migration" but it seems pretty clear this isn't about being able to
migrate a guest transparently, since you are adding a host/guest
handshake.

This isn't about functionality either: I think that on KVM, it isn't
hard to live migrate if you can do a host/guest handshake, even today,
with no kernel changes:
1. before migration, expose a pv nic to guest (can be done directly on
   boot)
2. use e.g. a serial connection to move IP from an assigned device to pv nic
3. maybe move the mac as well
4. eject the assigned device
5. detect eject on host (QEMU generates a DEVICE_DELETED event when this
happens) and start migration



This looks like the bonding driver solution, which puts the PV NIC and the
VF in one bonded interface under active-backup mode. The bonding driver
will switch from the VF to the PV NIC automatically when the VF is
unplugged during migration. This is the only solution available today for
VF NIC migration. But it requires the guest OS to do specific
configuration inside and to rely on the bonding driver, which prevents it
from working on Windows. From the performance side, putting the VF and
virtio NICs under a bonded interface will affect their performance even
when not migrating. These factors block the use of VF NIC passthrough in
some use cases (especially in the cloud) which require migration.

The solution we proposed changes the NIC driver and Qemu. The guest OS
doesn't need to do anything special for migration. It's easy to deploy,
all changes are in the NIC driver, and a NIC vendor can implement
migration support entirely in their driver.




Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-29 Thread Lan, Tianyu

On 11/26/2015 11:56 AM, Alexander Duyck wrote:

> I am not saying you cannot modify the drivers, however what you are
doing is far too invasive.  Do you seriously plan on modifying all of
the PCI device drivers out there in order to allow any device that
might be direct assigned to a port to support migration?  I certainly
hope not.  That is why I have said that this solution will not scale.


Current drivers are not migration friendly. If a driver wants to support
migration, it needs to be changed.

RFC PATCH V1 presented our ideas about how to deal with MMIO, ring and
DMA tracking during migration. These are common to most drivers; they may
be problematic in the previous version but can be corrected later.

Doing suspend and resume() may make migration easy, but some devices
require low service downtime. This is especially true for networking, and
I heard that some cloud companies promise less than 500ms of network
service downtime.

So I think the performance effect should also be taken into account when
we design the framework.





What I am counter proposing seems like a very simple proposition.  It
can be implemented in two steps.

1.  Look at modifying dma_mark_clean().  It is a function called in
the sync and unmap paths of the lib/swiotlb.c.  If you could somehow
modify it to take care of marking the pages you unmap for Rx as being
dirty it will get you a good way towards your goal as it will allow
you to continue to do DMA while you are migrating the VM.

2.  Look at making use of the existing PCI suspend/resume calls that
are there to support PCI power management.  They have everything
needed to allow you to pause and resume DMA for the device before and
after the migration while retaining the driver state.  If you can
implement something that allows you to trigger these calls from the
PCI subsystem such as hot-plug then you would have a generic solution
that can be easily reproduced for multiple drivers beyond those
supported by ixgbevf.
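
A minimal sketch of step 2, reusing the legacy PCI PM entry points of that
era; the hook names are illustrative, and a real driver such as ixgbevf
would also stop its queues and release interrupts in these paths:

#include <linux/pci.h>

static int vf_suspend(struct pci_dev *pdev, pm_message_t state)
{
	/* Driver-specific quiesce (stop queues, disable irqs) goes first. */
	pci_save_state(pdev);			/* snapshot config space */
	pci_disable_device(pdev);		/* also stops bus mastering */
	pci_set_power_state(pdev, pci_choose_state(pdev, state));
	return 0;
}

static int vf_resume(struct pci_dev *pdev)
{
	pci_set_power_state(pdev, PCI_D0);
	pci_restore_state(pdev);
	if (pci_enable_device(pdev))
		return -EIO;
	pci_set_master(pdev);
	/* Driver-specific re-init (rings, irqs, link) follows. */
	return 0;
}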


I glanced at the PCI hotplug code. The hotplug events are triggered by the
PCI hotplug controller, and these events are defined in the controller
spec, so it's hard to add more events. We would also need to add some
specific code in the PCI hotplug core, since today it only adds and
removes PCI devices when it gets events. It's also a challenge to modify
the Windows hotplug code. So we may need to find another way.






Thanks.

- Alex




Re: [Qemu-devel] [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-25 Thread Lan, Tianyu

On 11/25/2015 8:28 PM, Michael S. Tsirkin wrote:

Frankly, I don't really see what this short term hack buys us,
and if it goes in, we'll have to maintain it forever.



The framework for notifying the VF about migration status won't change
regardless of whether the VF is stopped before migration.
We hope to reach agreement on this first. Tracking dirty memory still
needs more discussion and we will continue working on it. Stopping the VF
may help to work around the issue and make tracking easier.


Also, assuming you just want to do ifdown/ifup for some reason, it's
easy enough to do using a guest agent, in a completely generic way.



Just ifdown/ifup is not enough for migration. We need to restore some
PCI settings before doing ifup on the target machine.




Re: [Qemu-devel] [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support

2015-11-25 Thread Lan, Tianyu


On 11/25/2015 5:03 AM, Michael S. Tsirkin wrote:

>+void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
>+  uint32_t val, int len)
>+{
>+VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>+
>+if (addr == vdev->migration_cap + PCI_VF_MIGRATION_VF_STATUS
>+&& val == PCI_VF_READY_FOR_MIGRATION) {
>+qemu_event_set(&migration_event);

This would wake migration so it can proceed -
except it needs QEMU lock to run, and that's
taken by the migration thread.


Sorry, I seem to be missing something.
Which lock may cause a deadlock when vfio_migration_cap_handle() is called
while migration runs?
The function is called when the VF accesses the faked PCI capability.



It seems unlikely that this ever worked - how
did you test this?





Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-25 Thread Lan Tianyu
On 2015年11月25日 13:30, Alexander Duyck wrote:
> No, what I am getting at is that you can't go around and modify the
> configuration space for every possible device out there.  This
> solution won't scale.


PCI config space regs are emulated by Qemu, so we can find free PCI config
space regs for the faked PCI capability. Its position doesn't have to be
permanent.


>  If you instead moved the logic for notifying
> the device into a separate mechanism such as making it a part of the
> hot-plug logic then you only have to write the code once per OS in
> order to get the hot-plug capability to pause/resume the device.  What
> I am talking about is not full hot-plug, but rather to extend the
> existing hot-plug in Qemu and the Linux kernel to support a
> "pause/resume" functionality.  The PCI hot-plug specification calls
> out the option of implementing something like this, but we don't
> currently have support for it.
>

Could you elaborate on the part of the PCI hot-plug specification you mentioned?

My concern is whether it requires a change to the PCI spec or not.



> I just feel doing it through PCI hot-plug messages will scale much
> better as you could likely make use of the power management
> suspend/resume calls to take care of most of the needed implementation
> details.
> 
> - Alex
-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-24 Thread Lan Tianyu
On 2015年11月24日 22:20, Alexander Duyck wrote:
> I'm still not a fan of this approach.  I really feel like this is
> something that should be resolved by extending the existing PCI hot-plug
> rather than trying to instrument this per driver.  Then you will get the
> goodness for multiple drivers and multiple OSes instead of just one.  An
> added advantage to dealing with this in the PCI hot-plug environment
> would be that you could then still do a hot-plug even if the guest
> didn't load a driver for the VF since you would be working with the PCI
> slot instead of the device itself.
> 
> - Alex

Hi Alex:
What you mentioned seems to be the bonding driver solution. The paper
"Live Migration with Pass-through Device for Linux VM" describes it. It
does VF hotplug during migration. In order to maintain the network
connection while the VF is out, it takes advantage of the Linux bonding
driver to switch between the VF NIC and the emulated NIC. But it has side
effects: it requires the VM to do additional configuration, and the
performance while switching between the two NICs is not good.

-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-24 Thread Lan Tianyu
On 2015年11月25日 05:20, Michael S. Tsirkin wrote:
> I have to say, I was much more interested in the idea
> of tracking dirty memory. I have some thoughts about
> that one - did you give up on it then?

No, our final target is to keep the VF active during migration, and
tracking dirty memory is essential for that. But this doesn't seem easy
to do upstream in the short term. As a starter, we stop the VF before
migration.

After more thought, the stop-VF approach still needs to track DMA-dirtied
memory to make sure the data buffers received before stopping the VF get
migrated. It's easier to do that by dummy-writing the data buffer when a
packet is received.
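
A minimal sketch of that dummy write, walking the buffers still posted to
the RX ring when the VF is being stopped; the ring layout is hypothetical,
and a real driver (e.g. ixgbevf) would iterate its own descriptor
structures and touch every page of multi-page buffers:

#include <linux/types.h>

/* Hypothetical RX ring bookkeeping; real drivers differ. */
struct mig_rx_ring {
	unsigned int count;
	struct {
		void *buf;	/* CPU address of the posted buffer */
	} *bufs;
};

/*
 * Before the VF is stopped for migration, touch every posted buffer so
 * the hypervisor's dirty log covers data the device already wrote.
 * The write stores back the byte it read, so received data is preserved.
 */
static void mig_rx_ring_mark_dirty(struct mig_rx_ring *ring)
{
	unsigned int i;

	for (i = 0; i < ring->count; i++) {
		volatile u8 *p = ring->bufs[i].buf;

		if (p)
			p[0] = p[0];
	}
}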


-- 
Best regards
Tianyu Lan



[Qemu-devel] [RFC PATCH V2 01/10] Qemu/VFIO: Create head file pci.h to share data struct.

2015-11-24 Thread Lan Tianyu
Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/vfio/pci.c | 137 +-
 hw/vfio/pci.h | 158 ++
 2 files changed, 159 insertions(+), 136 deletions(-)
 create mode 100644 hw/vfio/pci.h

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e0e339a..5c3f8a7 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -42,138 +42,7 @@
 #include "trace.h"
 #include "hw/vfio/vfio.h"
 #include "hw/vfio/vfio-common.h"
-
-struct VFIOPCIDevice;
-
-typedef struct VFIOQuirk {
-MemoryRegion mem;
-struct VFIOPCIDevice *vdev;
-QLIST_ENTRY(VFIOQuirk) next;
-struct {
-uint32_t base_offset:TARGET_PAGE_BITS;
-uint32_t address_offset:TARGET_PAGE_BITS;
-uint32_t address_size:3;
-uint32_t bar:3;
-
-uint32_t address_match;
-uint32_t address_mask;
-
-uint32_t address_val:TARGET_PAGE_BITS;
-uint32_t data_offset:TARGET_PAGE_BITS;
-uint32_t data_size:3;
-
-uint8_t flags;
-uint8_t read_flags;
-uint8_t write_flags;
-} data;
-} VFIOQuirk;
-
-typedef struct VFIOBAR {
-VFIORegion region;
-bool ioport;
-bool mem64;
-QLIST_HEAD(, VFIOQuirk) quirks;
-} VFIOBAR;
-
-typedef struct VFIOVGARegion {
-MemoryRegion mem;
-off_t offset;
-int nr;
-QLIST_HEAD(, VFIOQuirk) quirks;
-} VFIOVGARegion;
-
-typedef struct VFIOVGA {
-off_t fd_offset;
-int fd;
-VFIOVGARegion region[QEMU_PCI_VGA_NUM_REGIONS];
-} VFIOVGA;
-
-typedef struct VFIOINTx {
-bool pending; /* interrupt pending */
-bool kvm_accel; /* set when QEMU bypass through KVM enabled */
-uint8_t pin; /* which pin to pull for qemu_set_irq */
-EventNotifier interrupt; /* eventfd triggered on interrupt */
-EventNotifier unmask; /* eventfd for unmask on QEMU bypass */
-PCIINTxRoute route; /* routing info for QEMU bypass */
-uint32_t mmap_timeout; /* delay to re-enable mmaps after interrupt */
-QEMUTimer *mmap_timer; /* enable mmaps after periods w/o interrupts */
-} VFIOINTx;
-
-typedef struct VFIOMSIVector {
-/*
- * Two interrupt paths are configured per vector.  The first, is only used
- * for interrupts injected via QEMU.  This is typically the non-accel path,
- * but may also be used when we want QEMU to handle masking and pending
- * bits.  The KVM path bypasses QEMU and is therefore higher performance,
- * but requires masking at the device.  virq is used to track the MSI route
- * through KVM, thus kvm_interrupt is only available when virq is set to a
- * valid (>= 0) value.
- */
-EventNotifier interrupt;
-EventNotifier kvm_interrupt;
-struct VFIOPCIDevice *vdev; /* back pointer to device */
-int virq;
-bool use;
-} VFIOMSIVector;
-
-enum {
-VFIO_INT_NONE = 0,
-VFIO_INT_INTx = 1,
-VFIO_INT_MSI  = 2,
-VFIO_INT_MSIX = 3,
-};
-
-/* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
-typedef struct VFIOMSIXInfo {
-uint8_t table_bar;
-uint8_t pba_bar;
-uint16_t entries;
-uint32_t table_offset;
-uint32_t pba_offset;
-MemoryRegion mmap_mem;
-void *mmap;
-} VFIOMSIXInfo;
-
-typedef struct VFIOPCIDevice {
-PCIDevice pdev;
-VFIODevice vbasedev;
-VFIOINTx intx;
-unsigned int config_size;
-uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */
-off_t config_offset; /* Offset of config space region within device fd */
-unsigned int rom_size;
-off_t rom_offset; /* Offset of ROM region within device fd */
-void *rom;
-int msi_cap_size;
-VFIOMSIVector *msi_vectors;
-VFIOMSIXInfo *msix;
-int nr_vectors; /* Number of MSI/MSIX vectors currently in use */
-int interrupt; /* Current interrupt type */
-VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
-VFIOVGA vga; /* 0xa, 0x3b0, 0x3c0 */
-PCIHostDeviceAddress host;
-EventNotifier err_notifier;
-EventNotifier req_notifier;
-int (*resetfn)(struct VFIOPCIDevice *);
-uint32_t features;
-#define VFIO_FEATURE_ENABLE_VGA_BIT 0
-#define VFIO_FEATURE_ENABLE_VGA (1 << VFIO_FEATURE_ENABLE_VGA_BIT)
-#define VFIO_FEATURE_ENABLE_REQ_BIT 1
-#define VFIO_FEATURE_ENABLE_REQ (1 << VFIO_FEATURE_ENABLE_REQ_BIT)
-int32_t bootindex;
-uint8_t pm_cap;
-bool has_vga;
-bool pci_aer;
-bool req_enabled;
-bool has_flr;
-bool has_pm_reset;
-bool rom_read_failed;
-} VFIOPCIDevice;
-
-typedef struct VFIORomBlacklistEntry {
-uint16_t vendor_id;
-uint16_t device_id;
-} VFIORomBlacklistEntry;
+#include "hw/vfio/pci.h"
 
 /*
  * List of device ids/vendor ids for which to disable
@@ -193,12 +62,8 @@ static const VFIORomBlacklistEntry romblacklist[] = {
 { 0x14e4, 0x168e }
 };
 
-#define MSIX_CAP_LENGTH 12
 
 static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
-static uint32_t vfio_pci_

[Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC

2015-11-24 Thread Lan Tianyu
This patchset proposes a solution for adding live migration support
for SR-IOV NICs.

During migration, Qemu needs to let the VF driver in the VM know about
migration start and end. Qemu adds a faked PCI migration capability
to help sync status between the two sides during migration.

Qemu triggers the VF's mailbox irq by sending an MSI-X message when the
migration status changes. The VF driver tells Qemu its mailbox vector
index via the new PCI capability. In some cases (NIC is suspended or
closed), the VF mailbox irq is freed and the VF driver can disable irq
injection via the new capability.

The VF driver will put the NIC down before migration and bring it up
again on the target machine.
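
For reference, a hedged sketch of the VF-driver side of this handshake,
using the PCI_CAP_ID_MIGRATION and PCI_VF_MIGRATION_* definitions from this
series; where these hooks are called from (probe/open and the mailbox irq
handler) and their names are assumptions, not part of the posted patches:

#include <linux/pci.h>

/* At probe/open time: publish the mailbox vector and allow injection. */
static void vf_migration_setup(struct pci_dev *pdev, u8 mbox_vector)
{
	int cap = pci_find_capability(pdev, PCI_CAP_ID_MIGRATION);

	if (!cap)
		return;
	pci_write_config_byte(pdev, cap + PCI_VF_MIGRATION_IRQ, mbox_vector);
	pci_write_config_byte(pdev, cap + PCI_VF_MIGRATION_CAP,
			      PCI_VF_MIGRATION_ENABLE);
}

/* From the mailbox irq: check why Qemu kicked us and acknowledge. */
static void vf_migration_event(struct pci_dev *pdev)
{
	int cap = pci_find_capability(pdev, PCI_CAP_ID_MIGRATION);
	u8 status;

	if (!cap)
		return;
	pci_read_config_byte(pdev, cap + PCI_VF_MIGRATION_VMM_STATUS, &status);
	if (status == VMM_MIGRATION_START) {
		/* ...put the NIC down, dirty outstanding RX buffers... */
		pci_write_config_byte(pdev, cap + PCI_VF_MIGRATION_VF_STATUS,
				      PCI_VF_READY_FOR_MIGRATION);
	}
}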

Lan Tianyu (10):
  Qemu/VFIO: Create head file pci.h to share data struct.
  Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition
  Qemu/VFIO: Rework vfio_std_cap_max_size() function
  Qemu/VFIO: Add vfio_find_free_cfg_reg() to find free PCI config space
regs
  Qemu/VFIO: Expose PCI config space read/write and msix functions
  Qemu/PCI: Add macros for faked PCI migration capability
  Qemu: Add post_load_state() to run after restoring CPU state
  Qemu: Add save_before_stop callback to run just before stopping VCPU
during migration
  Qemu/VFIO: Add SRIOV VF migration support
  Qemu/VFIO: Misc change for enable migration with VFIO

 hw/vfio/Makefile.objs   |   2 +-
 hw/vfio/pci.c   | 196 +---
 hw/vfio/pci.h   | 168 +
 hw/vfio/sriov.c | 178 
 include/hw/pci/pci_regs.h   |  19 +
 include/migration/vmstate.h |   5 ++
 include/sysemu/sysemu.h |   1 +
 linux-headers/linux/vfio.h  |  16 
 migration/migration.c   |   3 +-
 migration/savevm.c  |  28 +++
 10 files changed, 459 insertions(+), 157 deletions(-)
 create mode 100644 hw/vfio/pci.h
 create mode 100644 hw/vfio/sriov.c

-- 
1.9.3




[Qemu-devel] [RFC PATCH V2 04/10] Qemu/VFIO: Add vfio_find_free_cfg_reg() to find free PCI config space regs

2015-11-24 Thread Lan Tianyu
This patch adds an ioctl wrapper to find free PCI config space regs.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/vfio/pci.c | 19 +++
 hw/vfio/pci.h |  2 ++
 2 files changed, 21 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 29845e3..d0354a0 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2508,6 +2508,25 @@ static void vfio_check_af_flr(VFIOPCIDevice *vdev, 
uint8_t pos)
 }
 }
 
+uint8_t vfio_find_free_cfg_reg(VFIOPCIDevice *vdev, int pos, uint8_t size)
+{
+struct vfio_pci_cap_info reg_info = {
+.argsz = sizeof(reg_info),
+.offset = pos,
+.index = VFIO_PCI_CAP_GET_FREE_REGION,
+.size = size,
+};
+int ret;
+
+ret = ioctl(vdev->vbasedev.fd, VFIO_GET_PCI_CAP_INFO, &reg_info);
+if (ret || reg_info.offset == 0) { 
+error_report("vfio: Failed to find free PCI config reg: %m\n");
+return -EFAULT;
+}
+
+return reg_info.offset; 
+}
+
 static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
 {
PCIDevice *pdev = &vdev->pdev;
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 9f360bf..6083300 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -156,3 +156,5 @@ typedef struct VFIORomBlacklistEntry {
 } VFIORomBlacklistEntry;
 
 #define MSIX_CAP_LENGTH 12
+
+uint8_t vfio_find_free_cfg_reg(VFIOPCIDevice *vdev, int pos, uint8_t size);
-- 
1.9.3




[Qemu-devel] [RFC PATCH V2 1/3] VFIO: Add new ioctl cmd VFIO_GET_PCI_CAP_INFO

2015-11-24 Thread Lan Tianyu
This patch adds a new ioctl cmd, VFIO_GET_PCI_CAP_INFO, to get the
PCI cap table size and to find free PCI config space regs according
to pos and size.

Qemu will add a faked PCI capability for migration and needs such
info.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 drivers/vfio/pci/vfio_pci.c | 21 
 drivers/vfio/pci/vfio_pci_config.c  | 38 +++--
 drivers/vfio/pci/vfio_pci_private.h |  5 +
 include/uapi/linux/vfio.h   | 12 
 4 files changed, 70 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 69fab0f..2e42de0 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -784,6 +784,27 @@ hot_reset_release:
 
kfree(groups);
return ret;
+   } else if (cmd == VFIO_GET_PCI_CAP_INFO) {
+   struct vfio_pci_cap_info info;
+   int offset;
+
+   if (copy_from_user(&info, (void __user *)arg, sizeof(info)))
+   return -EFAULT;
+
+   switch (info.index) {
+   case VFIO_PCI_CAP_GET_SIZE:
+   info.size = vfio_get_cap_size(vdev, info.cap, 
info.offset);
+   break;
+   case VFIO_PCI_CAP_GET_FREE_REGION:
+   offset = vfio_find_free_pci_config_reg(vdev,
+   info.offset, info.size);
+   info.offset = offset;
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   return copy_to_user((void __user *)arg, &info, sizeof(info));
}
 
return -ENOTTY;
diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index ff75ca3..8afbda4 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -841,6 +841,21 @@ static int vfio_find_cap_start(struct vfio_pci_device 
*vdev, int pos)
return pos;
 }
 
+int vfio_find_free_pci_config_reg(struct vfio_pci_device *vdev,
+   int pos, int size)
+{
+   int i, offset = pos;
+
+   for (i = pos; i < PCI_CFG_SPACE_SIZE; i++) {
+   if (vdev->pci_config_map[i] != PCI_CAP_ID_INVALID)
+   offset = i + 1;
+   else if (i - offset + 1 == size)
+   return offset;
+   }
+
+   return 0;
+}
+
 static int vfio_msi_config_read(struct vfio_pci_device *vdev, int pos,
int count, struct perm_bits *perm,
int offset, __le32 *val)
@@ -1199,6 +1214,20 @@ static int vfio_fill_vconfig_bytes(struct 
vfio_pci_device *vdev,
return ret;
 }
 
+int vfio_get_cap_size(struct vfio_pci_device *vdev, u8 cap, int pos)
+{
+   int len;
+
+   len = pci_cap_length[cap];
+   if (len == 0xFF) { /* Variable length */
+   len = vfio_cap_len(vdev, cap, pos);
+   if (len < 0)
+   return len;
+   }
+
+   return len;
+}
+
 static int vfio_cap_init(struct vfio_pci_device *vdev)
 {
struct pci_dev *pdev = vdev->pdev;
@@ -1238,12 +1267,9 @@ static int vfio_cap_init(struct vfio_pci_device *vdev)
return ret;
 
if (cap <= PCI_CAP_ID_MAX) {
-   len = pci_cap_length[cap];
-   if (len == 0xFF) { /* Variable length */
-   len = vfio_cap_len(vdev, cap, pos);
-   if (len < 0)
-   return len;
-   }
+   len = vfio_get_cap_size(vdev, cap, pos);
+   if (len < 0)
+   return len;
}
 
if (!len) {
diff --git a/drivers/vfio/pci/vfio_pci_private.h 
b/drivers/vfio/pci/vfio_pci_private.h
index ae0e1b4..91b4f9b 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -89,4 +89,9 @@ extern void vfio_pci_uninit_perm_bits(void);
 
 extern int vfio_config_init(struct vfio_pci_device *vdev);
 extern void vfio_config_free(struct vfio_pci_device *vdev);
+extern int vfio_find_free_pci_config_reg(struct vfio_pci_device *vdev,
+   int pos, int size);
+extern int vfio_get_cap_size(struct vfio_pci_device *vdev,
+   u8 cap, int pos);
+
 #endif /* VFIO_PCI_PRIVATE_H */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index b57b750..dfa7023 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -495,6 +495,18 @@ struct vfio_eeh_pe_op {
 
 #define VFIO_EEH_PE_OP _IO(VFIO_TYPE, VFIO_BASE + 21)
 
+#define VFIO_GET_PCI_CAP_INFO  _IO(VFIO_TYPE, VFIO_BASE + 22)
+struct vfio_pci_cap_info {
+   __u32   argsz;
+   __u32   flags;
+#define VFIO_PCI_CAP_GET_SIZE  (1 << 0)
+#define VFIO_

[Qemu-devel] [RFC PATCH V2 2/3] PCI: Add macros for faked PCI migration capability

2015-11-24 Thread Lan Tianyu
This patch extends the PCI CAP IDs with a migration cap and adds
reg macros. The CAP ID is tentative and we may find a better one if the
solution is feasible.

*PCI_VF_MIGRATION_CAP
For the VF driver to control whether the mailbox irq is triggered during migration.

*PCI_VF_MIGRATION_VMM_STATUS
Qemu stores the migration status in this reg.

*PCI_VF_MIGRATION_VF_STATUS
The VF driver tells Qemu it is ready for migration via this reg.

*PCI_VF_MIGRATION_IRQ
The VF driver stores the mailbox interrupt vector in this reg for Qemu to
trigger during migration.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 include/uapi/linux/pci_regs.h | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index efe3443..9defb6f 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -216,7 +216,8 @@
 #define  PCI_CAP_ID_MSIX   0x11/* MSI-X */
 #define  PCI_CAP_ID_SATA   0x12/* SATA Data/Index Conf. */
 #define  PCI_CAP_ID_AF 0x13/* PCI Advanced Features */
-#define  PCI_CAP_ID_MAXPCI_CAP_ID_AF
+#define  PCI_CAP_ID_MIGRATION  0X14   
+#define  PCI_CAP_ID_MAXPCI_CAP_ID_MIGRATION
 #define PCI_CAP_LIST_NEXT  1   /* Next capability in the list */
 #define PCI_CAP_FLAGS  2   /* Capability defined flags (16 bits) */
 #define PCI_CAP_SIZEOF 4
@@ -904,4 +905,19 @@
 #define PCI_TPH_CAP_ST_SHIFT   16  /* st table shift */
 #define PCI_TPH_BASE_SIZEOF12  /* size with no st table */
 
+/* Migration*/
+#define PCI_VF_MIGRATION_CAP   0x04
+#define PCI_VF_MIGRATION_VMM_STATUS0x05
+#define PCI_VF_MIGRATION_VF_STATUS 0x06
+#define PCI_VF_MIGRATION_IRQ   0x07
+
+#define PCI_VF_MIGRATION_DISABLE   0x00
+#define PCI_VF_MIGRATION_ENABLE0x01
+
+#define VMM_MIGRATION_END0x00
+#define VMM_MIGRATION_START  0x01
+
+#define PCI_VF_WAIT_FOR_MIGRATION   0x00
+#define PCI_VF_READY_FOR_MIGRATION  0x01
+
 #endif /* LINUX_PCI_REGS_H */
-- 
1.8.4.rc0.1.g8f6a3e5.dirty




[Qemu-devel] [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support

2015-11-24 Thread Lan Tianyu
This patch is to add SRIOV VF migration support.
Create new device type "vfio-sriov" and add faked PCI migration capability
to the type device.

The purpose of the new capability
1) sync migration status with VF driver in the VM
2) Get mailbox irq vector to notify VF driver during migration.
3) Provide a way to control injecting irq or not.

Qemu will migrate the PCI config space regs and MSI-X config for the VF.
It injects the mailbox irq at the last stage of migration to notify the VF
about the migration event and waits for the VF driver to become ready for
migration. The VF driver writes the PCI config reg
PCI_VF_MIGRATION_VF_STATUS in the new cap table to tell Qemu.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 hw/vfio/Makefile.objs |   2 +-
 hw/vfio/pci.c |   6 ++
 hw/vfio/pci.h |   4 ++
 hw/vfio/sriov.c   | 178 ++
 4 files changed, 189 insertions(+), 1 deletion(-)
 create mode 100644 hw/vfio/sriov.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index d540c9d..9cf0178 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,6 +1,6 @@
 ifeq ($(CONFIG_LINUX), y)
 obj-$(CONFIG_SOFTMMU) += common.o
-obj-$(CONFIG_PCI) += pci.o
+obj-$(CONFIG_PCI) += pci.o sriov.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
 endif
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 7c43fc1..e7583b5 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2013,6 +2013,11 @@ void vfio_pci_write_config(PCIDevice *pdev, uint32_t 
addr,
 } else if (was_enabled && !is_enabled) {
 vfio_disable_msix(vdev);
 }
+} else if (vdev->migration_cap &&
+ranges_overlap(addr, len, vdev->migration_cap, 0x10)) {
+/* Write everything to QEMU to keep emulated bits correct */
+pci_default_write_config(pdev, addr, val, len);
+vfio_migration_cap_handle(pdev, addr, val, len);
 } else {
 /* Write everything to QEMU to keep emulated bits correct */
 pci_default_write_config(pdev, addr, val, len);
@@ -3517,6 +3522,7 @@ static int vfio_initfn(PCIDevice *pdev)
 vfio_register_err_notifier(vdev);
 vfio_register_req_notifier(vdev);
 vfio_setup_resetfn(vdev);
+vfio_add_migration_capability(vdev);
 
 return 0;
 
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 6c00575..ee6ca5e 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -134,6 +134,7 @@ typedef struct VFIOPCIDevice {
 PCIHostDeviceAddress host;
 EventNotifier err_notifier;
 EventNotifier req_notifier;
+uint16_tmigration_cap;
 int (*resetfn)(struct VFIOPCIDevice *);
 uint32_t features;
 #define VFIO_FEATURE_ENABLE_VGA_BIT 0
@@ -162,3 +163,6 @@ uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t 
addr, int len);
 void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
uint32_t val, int len);
 void vfio_enable_msix(VFIOPCIDevice *vdev);
+void vfio_add_migration_capability(VFIOPCIDevice *vdev);
+void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
+   uint32_t val, int len);
diff --git a/hw/vfio/sriov.c b/hw/vfio/sriov.c
new file mode 100644
index 000..3109538
--- /dev/null
+++ b/hw/vfio/sriov.c
@@ -0,0 +1,178 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "hw/hw.h"
+#include "hw/vfio/pci.h"
+#include "hw/vfio/vfio.h"
+#include "hw/vfio/vfio-common.h"
+
+#define TYPE_VFIO_SRIOV "vfio-sriov"
+
+#define SRIOV_LM_SETUP 0x01
+#define SRIOV_LM_COMPLETE 0x02
+
+QemuEvent migration_event;
+
+static void vfio_dev_post_load(void *opaque)
+{
+struct PCIDevice *pdev = (struct PCIDevice *)opaque;
+VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+MSIMessage msg;
+int vector;
+
+if (vfio_pci_read_config(pdev,
+vdev->migration_cap + PCI_VF_MIGRATION_CAP, 1)
+!= PCI_VF_MIGRATION_ENABLE)
+return;
+
+vector = vfio_pci_read_config(pdev,
+vdev->migration_cap + PCI_VF_MIGRATION_IRQ, 1);
+
+msg = msix_get_message(pdev, vector);
+kvm_irqchip_send_msi(kvm_state, msg);
+}
+
+static int vfio_dev_load(QEMUFile *f, void *opaque, int version_id)
+{
+struct PCIDevice *pdev = (struct PCIDevice *)opaque;
+VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+int ret;
+
+if(qemu_get_byte(f)!= SRIOV_LM_COMPLETE)
+return 0;
+
+ret = pci_device_load(pdev, f);
+if (ret) {
+error_report("Faild to load PCI config space.\n");
+return ret;
+}
+
+if (msix_enabled(pdev)) {
+vfio_enable_msix(vdev);
+msix_load(pdev, f);
+}
+
+vfio_pci_write_config(pdev,vdev->migration_cap +
+PCI_VF_MIGRATION_VMM_STATUS, VMM_MIGRATION_END, 1);
+vfio_pci_write_config(pdev,vdev->migration_cap +
+PCI_VF_MIGRATION_VF_STATUS, P
