Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-16 Thread Alex Williamson
The discussion on this patch seems to have fizzled, with no clear short
term solution.  Jan gave a Reviewed-by and Michael an Acked-by for this
patch.  There's work left to do, but I request that we pull in this fix
now.  Thanks,

Alex

On Wed, 2012-04-04 at 21:42 -0600, Alex Williamson wrote:
 We've hit a kernel host panic, when issuing a 'system_reset' with an
 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815.
 
 [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 
 32993
 [Hardware Error]: APEI generic hardware error status
 [Hardware Error]: severity: 1, fatal
 [Hardware Error]: section: 0, severity: 1, fatal
 [Hardware Error]: flags: 0x01
 [Hardware Error]: primary
 [Hardware Error]: section_type: PCIe error
 [Hardware Error]: port_type: 0, PCIe end point
 [Hardware Error]: version: 1.0
 [Hardware Error]: command: 0x, status: 0x0010
 [Hardware Error]: device_id: :08:00.0
 [Hardware Error]: slot: 1
 [Hardware Error]: secondary_bus: 0x00
 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
 [Hardware Error]: class_code: 02
 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
 [Hardware Error]: Unsupported Request
 [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
 [Hardware Error]: aer_uncor_severity: 0x00067011
 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
 [Hardware Error]: section: 1, severity: 1, fatal
 [Hardware Error]: flags: 0x01
 [Hardware Error]: primary
 [Hardware Error]: section_type: PCIe error
 [Hardware Error]: port_type: 0, PCIe end point
 [Hardware Error]: version: 1.0
 [Hardware Error]: command: 0x, status: 0x0010
 [Hardware Error]: device_id: :08:00.0
 [Hardware Error]: slot: 1
 [Hardware Error]: secondary_bus: 0x00
 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
 [Hardware Error]: class_code: 02
 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
 [Hardware Error]: Unsupported Request
 [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
 [Hardware Error]: aer_uncor_severity: 0x00067011
 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
 Kernel panic - not syncing: Fatal hardware error!
 Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
 Call Trace:
  NMI  [814f2fe5] ? panic+0xa0/0x168
  [812f919c] ? ghes_notify_nmi+0x17c/0x180
  [814f91d5] ? notifier_call_chain+0x55/0x80
  [814f923a] ? atomic_notifier_call_chain+0x1a/0x20
  [8109667e] ? notify_die+0x2e/0x30
  [814f6e81] ? do_nmi+0x1a1/0x2b0
  [814f6760] ? nmi+0x20/0x30
  [8103762b] ? native_safe_halt+0xb/0x10
  EOE  [8101495d] ? default_idle+0x4d/0xb0
  [81009e06] ? cpu_idle+0xb6/0x110
  [814da63a] ? rest_init+0x7a/0x80
  [81c1ff7b] ? start_kernel+0x424/0x430
  [81c1f33a] ? x86_64_start_reservations+0x125/0x129
  [81c1f438] ? x86_64_start_kernel+0xfa/0x109
 
 The root cause of the problem is that the 'reset_assigned_device()' code
 first writes a 0 to the command register. Then, when qemu subsequently does
 a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
 the kernel ends up calling '__msix_mask_irq()', which performs a write to
 the memory mapped msi vector space. Since, we've explicitly told the device
 to disallow mmio access (via the 0 write to the command register), we end
 up with the above 'Unsupported Request'.
 
 The fix here is to first disable MSI-X, before doing the reset.  We also
 disable MSI, leaving the device in INTx mode.  In this way, the device is
 a known state after reset, and we avoid touching msi memory mapped space
 on any subsequent 'kvm_deassign_irq()'.
 
 Thanks to Michael S. Tsirkin for help in understanding what was going on
 here and Jason Baron, the original debugger of this problem.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 ---
 
 Jason is out of the office for a couple weeks, so I'll try to resolve
 this while he's away.  Somehow the emulated config updates were lost
 in Jason's original posting, so I've fixed that and taken Jan's suggestion
 to simply call into the update functions instead of open coding the
 interrupt disable.  I think there still may be some disagreements about
 how to handle guest generated errors in the host, but that's a large
 project whereas this is something we should be doing at reset anyway,
 and even if only a workaround, resolves the problem above.
 
  hw/device-assignment.c |   23 +++
  1 files changed, 23 insertions(+), 0 deletions(-)
 
 diff --git a/hw/device-assignment.c b/hw/device-assignment.c
 index 89823f1..2e6b93e 100644
 --- a/hw/device-assignment.c
 +++ b/hw/device-assignment.c
 @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev)
  const char reset[] = 1;
  int fd, ret;
  
 +/*
 + * If a guest is reset without being shutdown, MSI/MSI-X can still
 + * be 

Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-16 Thread Avi Kivity
On 04/16/2012 05:03 PM, Alex Williamson wrote:
 The discussion on this patch seems to have fizzled, with no clear short
 term solution.  Jan gave a Reviewed-by and Michael an Acked-by for this
 patch.  There's work left to do, but I request that we pull in this fix
 now.  Thanks,


I agree (but am not committing this week).

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-16 Thread Michael S. Tsirkin
On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
 The discussion on this patch seems to have fizzled, with no clear short
 term solution.

I think we are in concensus, it's just that there are
multiple bugs still left to fix.

First, we need to prevent guest from touching command
register except for the bus master bit. Something like
the below? Compiled only.

device-assignment: don't touch pci command register

Real command register is under kernel control:
it includes bits for triggering SERR, marking
BARs as invalid and such which are under host
kernel control. Don't touch any except bus master
which is ok to put under guest control.

Signed-off-by: Michael S. Tsirkin m...@redhat.com

---
diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 89823f1..9ebce49 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, 
uint16_t r_seg,
 FILE *f;
 unsigned long long start, end, size, flags;
 uint16_t id;
-struct stat statbuf;
 PCIRegion *rp;
 PCIDevRegions *dev = pci_dev-real_device;
 
@@ -610,12 +609,8 @@ again:
 pci_dev-dev.config[2] = id  0xff;
 pci_dev-dev.config[3] = (id  0xff00)  8;
 
-/* dealing with virtual function device */
-snprintf(name, sizeof(name), %sphysfn/, dir);
-if (!stat(name, statbuf)) {
-/* always provide the written value on readout */
-assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2);
-}
+/* Pass bus master writes to device. */
+pci_dev-emulate_config_write[PCI_COMMAND] = ~PCI_COMMAND_MASTER;
 
 dev-region_number = r;
 return 0;
@@ -782,14 +777,6 @@ static int assign_device(AssignedDevice *dev)
 cause host memory corruption if the device issues DMA write 
 requests!\n);
 }
-if (dev-features  ASSIGNED_DEVICE_SHARE_INTX_MASK 
-kvm_has_intx_set_mask()) {
-assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3;
-
-/* hide host-side INTx masking from the guest */
-dev-emulate_config_read[PCI_COMMAND + 1] |=
-PCI_COMMAND_INTX_DISABLE  8;
-}
 
 r = kvm_assign_pci_device(kvm_state, assigned_dev_data);
 if (r  0) {
@@ -1631,10 +1618,10 @@ static void reset_assigned_device(DeviceState *dev)
 }
 
 /*
- * When a 0 is written to the command register, the device is logically
+ * When a 0 is written to the bus master register, the device is logically
  * disconnected from the PCI bus. This avoids further DMA transfers.
  */
-assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2);
+assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1);
 }
 
 static int assigned_initfn(struct PCIDevice *pci_dev)
@@ -1658,7 +1645,6 @@ static int assigned_initfn(struct PCIDevice *pci_dev)
  * device initialization.
  */
 assigned_dev_emulate_config_read(dev, 0, PCI_CONFIG_SPACE_SIZE);
-assigned_dev_direct_config_read(dev, PCI_COMMAND, 2);
 assigned_dev_direct_config_read(dev, PCI_STATUS, 2);
 assigned_dev_direct_config_read(dev, PCI_REVISION_ID, 1);
 assigned_dev_direct_config_read(dev, PCI_CLASS_PROG, 3);
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-16 Thread Jan Kiszka
On 2012-04-16 17:06, Michael S. Tsirkin wrote:
 On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
 The discussion on this patch seems to have fizzled, with no clear short
 term solution.
 
 I think we are in concensus, it's just that there are
 multiple bugs still left to fix.
 
 First, we need to prevent guest from touching command
 register except for the bus master bit. Something like

INTx is harmless as well. Denying access breaks masking for the guest.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-16 Thread Michael S. Tsirkin
On Mon, Apr 16, 2012 at 05:10:07PM +0200, Jan Kiszka wrote:
 On 2012-04-16 17:06, Michael S. Tsirkin wrote:
  On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
  The discussion on this patch seems to have fizzled, with no clear short
  term solution.
  
  I think we are in concensus, it's just that there are
  multiple bugs still left to fix.
  
  First, we need to prevent guest from touching command
  register except for the bus master bit. Something like
 
 INTx is harmless as well. Denying access breaks masking for the guest.

We currently mark intx as emulated, no?
Anyway - care fixing it properly?

 Jan
 
 -- 
 Siemens AG, Corporate Technology, CT T DE IT 1
 Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-16 Thread Jason Baron
On Mon, Apr 16, 2012 at 06:06:40PM +0300, Michael S. Tsirkin wrote:
 On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
  The discussion on this patch seems to have fizzled, with no clear short
  term solution.
 
 I think we are in concensus, it's just that there are
 multiple bugs still left to fix.
 
 First, we need to prevent guest from touching command
 register except for the bus master bit. Something like
 the below? Compiled only.
 
 device-assignment: don't touch pci command register
 
 Real command register is under kernel control:
 it includes bits for triggering SERR, marking
 BARs as invalid and such which are under host
 kernel control. Don't touch any except bus master
 which is ok to put under guest control.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 ---
 diff --git a/hw/device-assignment.c b/hw/device-assignment.c
 index 89823f1..9ebce49 100644
 --- a/hw/device-assignment.c
 +++ b/hw/device-assignment.c
 @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, 
 uint16_t r_seg,
  FILE *f;
  unsigned long long start, end, size, flags;
  uint16_t id;
 -struct stat statbuf;
  PCIRegion *rp;
  PCIDevRegions *dev = pci_dev-real_device;
  
 @@ -610,12 +609,8 @@ again:
  pci_dev-dev.config[2] = id  0xff;
  pci_dev-dev.config[3] = (id  0xff00)  8;
  
 -/* dealing with virtual function device */
 -snprintf(name, sizeof(name), %sphysfn/, dir);
 -if (!stat(name, statbuf)) {
 -/* always provide the written value on readout */
 -assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2);
 -}
 +/* Pass bus master writes to device. */
 +pci_dev-emulate_config_write[PCI_COMMAND] = ~PCI_COMMAND_MASTER;
  
  dev-region_number = r;
  return 0;
 @@ -782,14 +777,6 @@ static int assign_device(AssignedDevice *dev)
  cause host memory corruption if the device issues DMA write 
 
  requests!\n);
  }
 -if (dev-features  ASSIGNED_DEVICE_SHARE_INTX_MASK 
 -kvm_has_intx_set_mask()) {
 -assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3;
 -
 -/* hide host-side INTx masking from the guest */
 -dev-emulate_config_read[PCI_COMMAND + 1] |=
 -PCI_COMMAND_INTX_DISABLE  8;
 -}
  
  r = kvm_assign_pci_device(kvm_state, assigned_dev_data);
  if (r  0) {
 @@ -1631,10 +1618,10 @@ static void reset_assigned_device(DeviceState *dev)
  }
  
  /*
 - * When a 0 is written to the command register, the device is logically
 + * When a 0 is written to the bus master register, the device is 
 logically
   * disconnected from the PCI bus. This avoids further DMA transfers.
   */
 -assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2);
 +assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1);
  }

This is still going to disable mmio, is the intent to just clear the bus
master bit, ie bit 2? Or is this patch meant to be in addition to the
one Alex posted?

Thanks,

-Jason
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-16 Thread Jan Kiszka
On 2012-04-16 18:08, Michael S. Tsirkin wrote:
 On Mon, Apr 16, 2012 at 05:10:07PM +0200, Jan Kiszka wrote:
 On 2012-04-16 17:06, Michael S. Tsirkin wrote:
 On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
 The discussion on this patch seems to have fizzled, with no clear short
 term solution.

 I think we are in concensus, it's just that there are
 multiple bugs still left to fix.

 First, we need to prevent guest from touching command
 register except for the bus master bit. Something like

 INTx is harmless as well. Denying access breaks masking for the guest.
 
 We currently mark intx as emulated, no?

If we use the mask also in the kernel for IRQ sharing, we *monitor* it.

 Anyway - care fixing it properly?

There is nothing to fix.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-16 Thread Michael S. Tsirkin
On Mon, Apr 16, 2012 at 06:13:03PM +0200, Jan Kiszka wrote:
 On 2012-04-16 18:08, Michael S. Tsirkin wrote:
  On Mon, Apr 16, 2012 at 05:10:07PM +0200, Jan Kiszka wrote:
  On 2012-04-16 17:06, Michael S. Tsirkin wrote:
  On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
  The discussion on this patch seems to have fizzled, with no clear short
  term solution.
 
  I think we are in concensus, it's just that there are
  multiple bugs still left to fix.
 
  First, we need to prevent guest from touching command
  register except for the bus master bit. Something like
 
  INTx is harmless as well. Denying access breaks masking for the guest.
  
  We currently mark intx as emulated, no?
 
 If we use the mask also in the kernel for IRQ sharing, we *monitor* it.
 
  Anyway - care fixing it properly?
 
 There is nothing to fix.
 
 Jan

Guest access  to command register is a bug.
E.g. it has no business enabling SERR as SERR crashes
host when triggered.

 -- 
 Siemens AG, Corporate Technology, CT T DE IT 1
 Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-16 Thread Jan Kiszka
On 2012-04-16 18:36, Michael S. Tsirkin wrote:
 On Mon, Apr 16, 2012 at 06:13:03PM +0200, Jan Kiszka wrote:
 On 2012-04-16 18:08, Michael S. Tsirkin wrote:
 On Mon, Apr 16, 2012 at 05:10:07PM +0200, Jan Kiszka wrote:
 On 2012-04-16 17:06, Michael S. Tsirkin wrote:
 On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
 The discussion on this patch seems to have fizzled, with no clear short
 term solution.

 I think we are in concensus, it's just that there are
 multiple bugs still left to fix.

 First, we need to prevent guest from touching command
 register except for the bus master bit. Something like

 INTx is harmless as well. Denying access breaks masking for the guest.

 We currently mark intx as emulated, no?

 If we use the mask also in the kernel for IRQ sharing, we *monitor* it.

 Anyway - care fixing it properly?

 There is nothing to fix.

 Jan
 
 Guest access  to command register is a bug.
 E.g. it has no business enabling SERR as SERR crashes
 host when triggered.

Maybe a misunderstanding. I agree that other bits are not guest
business. But INTx should be excluded from this filtering and handled as
is, i.e. passed through.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-16 Thread Michael S. Tsirkin
On Mon, Apr 16, 2012 at 06:38:22PM +0200, Jan Kiszka wrote:
 On 2012-04-16 18:36, Michael S. Tsirkin wrote:
  On Mon, Apr 16, 2012 at 06:13:03PM +0200, Jan Kiszka wrote:
  On 2012-04-16 18:08, Michael S. Tsirkin wrote:
  On Mon, Apr 16, 2012 at 05:10:07PM +0200, Jan Kiszka wrote:
  On 2012-04-16 17:06, Michael S. Tsirkin wrote:
  On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
  The discussion on this patch seems to have fizzled, with no clear short
  term solution.
 
  I think we are in concensus, it's just that there are
  multiple bugs still left to fix.
 
  First, we need to prevent guest from touching command
  register except for the bus master bit. Something like
 
  INTx is harmless as well. Denying access breaks masking for the guest.
 
  We currently mark intx as emulated, no?
 
  If we use the mask also in the kernel for IRQ sharing, we *monitor* it.
 
  Anyway - care fixing it properly?
 
  There is nothing to fix.
 
  Jan
  
  Guest access  to command register is a bug.
  E.g. it has no business enabling SERR as SERR crashes
  host when triggered.
 
 Maybe a misunderstanding. I agree that other bits are not guest
 business. But INTx should be excluded from this filtering and handled as
 is, i.e. passed through.
 
 Jan

Ah, I see what you mean.
So let's add
pci_dev-emulate_config_write[PCI_COMMAND + 1] = 
~(PCI_COMMAND_INTX_DISABLE  8);
on top?


 -- 
 Siemens AG, Corporate Technology, CT T DE IT 1
 Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-16 Thread Michael S. Tsirkin
On Mon, Apr 16, 2012 at 12:12:52PM -0400, Jason Baron wrote:
 On Mon, Apr 16, 2012 at 06:06:40PM +0300, Michael S. Tsirkin wrote:
  On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
   The discussion on this patch seems to have fizzled, with no clear short
   term solution.
  
  I think we are in concensus, it's just that there are
  multiple bugs still left to fix.
  
  First, we need to prevent guest from touching command
  register except for the bus master bit. Something like
  the below? Compiled only.
  
  device-assignment: don't touch pci command register
  
  Real command register is under kernel control:
  it includes bits for triggering SERR, marking
  BARs as invalid and such which are under host
  kernel control. Don't touch any except bus master
  which is ok to put under guest control.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  
  ---
  diff --git a/hw/device-assignment.c b/hw/device-assignment.c
  index 89823f1..9ebce49 100644
  --- a/hw/device-assignment.c
  +++ b/hw/device-assignment.c
  @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, 
  uint16_t r_seg,
   FILE *f;
   unsigned long long start, end, size, flags;
   uint16_t id;
  -struct stat statbuf;
   PCIRegion *rp;
   PCIDevRegions *dev = pci_dev-real_device;
   
  @@ -610,12 +609,8 @@ again:
   pci_dev-dev.config[2] = id  0xff;
   pci_dev-dev.config[3] = (id  0xff00)  8;
   
  -/* dealing with virtual function device */
  -snprintf(name, sizeof(name), %sphysfn/, dir);
  -if (!stat(name, statbuf)) {
  -/* always provide the written value on readout */
  -assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2);
  -}
  +/* Pass bus master writes to device. */
  +pci_dev-emulate_config_write[PCI_COMMAND] = ~PCI_COMMAND_MASTER;
   
   dev-region_number = r;
   return 0;
  @@ -782,14 +777,6 @@ static int assign_device(AssignedDevice *dev)
   cause host memory corruption if the device issues DMA 
  write 
   requests!\n);
   }
  -if (dev-features  ASSIGNED_DEVICE_SHARE_INTX_MASK 
  -kvm_has_intx_set_mask()) {
  -assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3;
  -
  -/* hide host-side INTx masking from the guest */
  -dev-emulate_config_read[PCI_COMMAND + 1] |=
  -PCI_COMMAND_INTX_DISABLE  8;
  -}
   
   r = kvm_assign_pci_device(kvm_state, assigned_dev_data);
   if (r  0) {
  @@ -1631,10 +1618,10 @@ static void reset_assigned_device(DeviceState *dev)
   }
   
   /*
  - * When a 0 is written to the command register, the device is logically
  + * When a 0 is written to the bus master register, the device is 
  logically
* disconnected from the PCI bus. This avoids further DMA transfers.
*/
  -assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2);
  +assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1);
   }
 
 This is still going to disable mmio,

I think it won't since all other bits are marked as emulated now.

 is the intent to just clear the bus
 master bit, ie bit 2? Or is this patch meant to be in addition to the
 one Alex posted?
 
 Thanks,
 
 -Jason
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-16 Thread Jan Kiszka
On 2012-04-16 19:12, Michael S. Tsirkin wrote:
 On Mon, Apr 16, 2012 at 06:38:22PM +0200, Jan Kiszka wrote:
 On 2012-04-16 18:36, Michael S. Tsirkin wrote:
 On Mon, Apr 16, 2012 at 06:13:03PM +0200, Jan Kiszka wrote:
 On 2012-04-16 18:08, Michael S. Tsirkin wrote:
 On Mon, Apr 16, 2012 at 05:10:07PM +0200, Jan Kiszka wrote:
 On 2012-04-16 17:06, Michael S. Tsirkin wrote:
 On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
 The discussion on this patch seems to have fizzled, with no clear short
 term solution.

 I think we are in concensus, it's just that there are
 multiple bugs still left to fix.

 First, we need to prevent guest from touching command
 register except for the bus master bit. Something like

 INTx is harmless as well. Denying access breaks masking for the guest.

 We currently mark intx as emulated, no?

 If we use the mask also in the kernel for IRQ sharing, we *monitor* it.

 Anyway - care fixing it properly?

 There is nothing to fix.

 Jan

 Guest access  to command register is a bug.
 E.g. it has no business enabling SERR as SERR crashes
 host when triggered.

 Maybe a misunderstanding. I agree that other bits are not guest
 business. But INTx should be excluded from this filtering and handled as
 is, i.e. passed through.

 Jan
 
 Ah, I see what you mean.
 So let's add
 pci_dev-emulate_config_write[PCI_COMMAND + 1] = 
 ~(PCI_COMMAND_INTX_DISABLE  8);
 on top?

Looks good.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-16 Thread Alex Williamson
On Mon, 2012-04-16 at 18:06 +0300, Michael S. Tsirkin wrote:
 On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
  The discussion on this patch seems to have fizzled, with no clear short
  term solution.
 
 I think we are in concensus, it's just that there are
 multiple bugs still left to fix.
 
 First, we need to prevent guest from touching command
 register except for the bus master bit. Something like
 the below? Compiled only.
 
 device-assignment: don't touch pci command register
 
 Real command register is under kernel control:
 it includes bits for triggering SERR, marking
 BARs as invalid and such which are under host
 kernel control. Don't touch any except bus master
 which is ok to put under guest control.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 ---
 diff --git a/hw/device-assignment.c b/hw/device-assignment.c
 index 89823f1..9ebce49 100644
 --- a/hw/device-assignment.c
 +++ b/hw/device-assignment.c
 @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, 
 uint16_t r_seg,
  FILE *f;
  unsigned long long start, end, size, flags;
  uint16_t id;
 -struct stat statbuf;
  PCIRegion *rp;
  PCIDevRegions *dev = pci_dev-real_device;
  
 @@ -610,12 +609,8 @@ again:
  pci_dev-dev.config[2] = id  0xff;
  pci_dev-dev.config[3] = (id  0xff00)  8;
  
 -/* dealing with virtual function device */
 -snprintf(name, sizeof(name), %sphysfn/, dir);
 -if (!stat(name, statbuf)) {
 -/* always provide the written value on readout */
 -assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2);
 -}
 +/* Pass bus master writes to device. */
 +pci_dev-emulate_config_write[PCI_COMMAND] = ~PCI_COMMAND_MASTER;
  
  dev-region_number = r;
  return 0;
 @@ -782,14 +777,6 @@ static int assign_device(AssignedDevice *dev)
  cause host memory corruption if the device issues DMA write 
 
  requests!\n);
  }
 -if (dev-features  ASSIGNED_DEVICE_SHARE_INTX_MASK 
 -kvm_has_intx_set_mask()) {
 -assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3;

This doesn't look right ^^^, we'll never make use of host side INTx
disable support that way.  Thanks,

Alex

 -
 -/* hide host-side INTx masking from the guest */
 -dev-emulate_config_read[PCI_COMMAND + 1] |=
 -PCI_COMMAND_INTX_DISABLE  8;
 -}
  
  r = kvm_assign_pci_device(kvm_state, assigned_dev_data);
  if (r  0) {
 @@ -1631,10 +1618,10 @@ static void reset_assigned_device(DeviceState *dev)
  }
  
  /*
 - * When a 0 is written to the command register, the device is logically
 + * When a 0 is written to the bus master register, the device is 
 logically
   * disconnected from the PCI bus. This avoids further DMA transfers.
   */
 -assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2);
 +assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1);
  }
  
  static int assigned_initfn(struct PCIDevice *pci_dev)
 @@ -1658,7 +1645,6 @@ static int assigned_initfn(struct PCIDevice *pci_dev)
   * device initialization.
   */
  assigned_dev_emulate_config_read(dev, 0, PCI_CONFIG_SPACE_SIZE);
 -assigned_dev_direct_config_read(dev, PCI_COMMAND, 2);
  assigned_dev_direct_config_read(dev, PCI_STATUS, 2);
  assigned_dev_direct_config_read(dev, PCI_REVISION_ID, 1);
  assigned_dev_direct_config_read(dev, PCI_CLASS_PROG, 3);



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-16 Thread Michael S. Tsirkin
On Mon, Apr 16, 2012 at 01:07:56PM -0600, Alex Williamson wrote:
 On Mon, 2012-04-16 at 18:06 +0300, Michael S. Tsirkin wrote:
  On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote:
   The discussion on this patch seems to have fizzled, with no clear short
   term solution.
  
  I think we are in concensus, it's just that there are
  multiple bugs still left to fix.
  
  First, we need to prevent guest from touching command
  register except for the bus master bit. Something like
  the below? Compiled only.
  
  device-assignment: don't touch pci command register
  
  Real command register is under kernel control:
  it includes bits for triggering SERR, marking
  BARs as invalid and such which are under host
  kernel control. Don't touch any except bus master
  which is ok to put under guest control.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  
  ---
  diff --git a/hw/device-assignment.c b/hw/device-assignment.c
  index 89823f1..9ebce49 100644
  --- a/hw/device-assignment.c
  +++ b/hw/device-assignment.c
  @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, 
  uint16_t r_seg,
   FILE *f;
   unsigned long long start, end, size, flags;
   uint16_t id;
  -struct stat statbuf;
   PCIRegion *rp;
   PCIDevRegions *dev = pci_dev-real_device;
   
  @@ -610,12 +609,8 @@ again:
   pci_dev-dev.config[2] = id  0xff;
   pci_dev-dev.config[3] = (id  0xff00)  8;
   
  -/* dealing with virtual function device */
  -snprintf(name, sizeof(name), %sphysfn/, dir);
  -if (!stat(name, statbuf)) {
  -/* always provide the written value on readout */
  -assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2);
  -}
  +/* Pass bus master writes to device. */
  +pci_dev-emulate_config_write[PCI_COMMAND] = ~PCI_COMMAND_MASTER;
   
   dev-region_number = r;
   return 0;
  @@ -782,14 +777,6 @@ static int assign_device(AssignedDevice *dev)
   cause host memory corruption if the device issues DMA 
  write 
   requests!\n);
   }
  -if (dev-features  ASSIGNED_DEVICE_SHARE_INTX_MASK 
  -kvm_has_intx_set_mask()) {
  -assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3;
 
 This doesn't look right ^^^, we'll never make use of host side INTx
 disable support that way.  Thanks,
 
 Alex

Right.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-16 Thread Marcelo Tosatti
On Wed, Apr 04, 2012 at 09:42:32PM -0600, Alex Williamson wrote:
 We've hit a kernel host panic, when issuing a 'system_reset' with an
 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815.
 
 [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 
 32993
 [Hardware Error]: APEI generic hardware error status
 [Hardware Error]: severity: 1, fatal
 [Hardware Error]: section: 0, severity: 1, fatal
 [Hardware Error]: flags: 0x01
 [Hardware Error]: primary
 [Hardware Error]: section_type: PCIe error
 [Hardware Error]: port_type: 0, PCIe end point
 [Hardware Error]: version: 1.0
 [Hardware Error]: command: 0x, status: 0x0010
 [Hardware Error]: device_id: :08:00.0
 [Hardware Error]: slot: 1
 [Hardware Error]: secondary_bus: 0x00
 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
 [Hardware Error]: class_code: 02
 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
 [Hardware Error]: Unsupported Request
 [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
 [Hardware Error]: aer_uncor_severity: 0x00067011
 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
 [Hardware Error]: section: 1, severity: 1, fatal
 [Hardware Error]: flags: 0x01
 [Hardware Error]: primary
 [Hardware Error]: section_type: PCIe error
 [Hardware Error]: port_type: 0, PCIe end point
 [Hardware Error]: version: 1.0
 [Hardware Error]: command: 0x, status: 0x0010
 [Hardware Error]: device_id: :08:00.0
 [Hardware Error]: slot: 1
 [Hardware Error]: secondary_bus: 0x00
 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
 [Hardware Error]: class_code: 02
 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
 [Hardware Error]: Unsupported Request
 [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
 [Hardware Error]: aer_uncor_severity: 0x00067011
 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
 Kernel panic - not syncing: Fatal hardware error!
 Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
 Call Trace:
  NMI  [814f2fe5] ? panic+0xa0/0x168
  [812f919c] ? ghes_notify_nmi+0x17c/0x180
  [814f91d5] ? notifier_call_chain+0x55/0x80
  [814f923a] ? atomic_notifier_call_chain+0x1a/0x20
  [8109667e] ? notify_die+0x2e/0x30
  [814f6e81] ? do_nmi+0x1a1/0x2b0
  [814f6760] ? nmi+0x20/0x30
  [8103762b] ? native_safe_halt+0xb/0x10
  EOE  [8101495d] ? default_idle+0x4d/0xb0
  [81009e06] ? cpu_idle+0xb6/0x110
  [814da63a] ? rest_init+0x7a/0x80
  [81c1ff7b] ? start_kernel+0x424/0x430
  [81c1f33a] ? x86_64_start_reservations+0x125/0x129
  [81c1f438] ? x86_64_start_kernel+0xfa/0x109
 
 The root cause of the problem is that the 'reset_assigned_device()' code
 first writes a 0 to the command register. Then, when qemu subsequently does
 a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
 the kernel ends up calling '__msix_mask_irq()', which performs a write to
 the memory mapped msi vector space. Since, we've explicitly told the device
 to disallow mmio access (via the 0 write to the command register), we end
 up with the above 'Unsupported Request'.
 
 The fix here is to first disable MSI-X, before doing the reset.  We also
 disable MSI, leaving the device in INTx mode.  In this way, the device is
 a known state after reset, and we avoid touching msi memory mapped space
 on any subsequent 'kvm_deassign_irq()'.
 
 Thanks to Michael S. Tsirkin for help in understanding what was going on
 here and Jason Baron, the original debugger of this problem.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com

Applied, thanks.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-10 Thread Alex Williamson
On Mon, 2012-04-09 at 11:35 +0300, Avi Kivity wrote:
 On 04/08/2012 08:37 PM, Jan Kiszka wrote:
  The core problem is not the ordering. The problem is that the kernel is
  susceptible to ordering mistakes of userspace. And that is because the
  kernel panics on PCI errors of devices that are in user hands - a
  critical kernel bug IMHO. 
 
 Certainly.  But this userspace patch won't fix it.

No, it won't in general and I don't think it makes sense to mangle
pci-sysfs config space support to the nuances of a user space driver.
We really need a userspace driver interface that limits the config space
interactions and provides a channel to support error reporting and
userspace recovery.  This type of thing can be done with VFIO if we
could ever get off the ground and get some consensus around it.  Please
feel free to contribute to that discussion if you ever want to get away
from this clunky device assignment interface we have now.

  Proper reset of MSI or even the whole PCI
  config space is another issue, but one the kernel should not worry about
  - still, it should be fixed (therefore this patch).
 
 And I was asking what is the right way to do it.  Reset the device and
 read back the register values, or do an emulated reset and push down the
 register values.

Reading back the register values is currently a noop since the kernel
restores the config space to the incoming state after reset.  KVM does
stash away the original config space of the device to be restored prior
to releasing the device.  We could restore to that each time, but that
would mean implementing a device reset ioctl in kvm, and we'd still need
this patch for compatibility and we still have the issues Michael brings
up with the config restore updating things like MSI that we need to then
manually sync with kvm.  I fear suggesting it, but perhaps another way
to achieve this result would be to de-assign and re-assign the device in
reset.

  But even if we disallowed userland to disable MMIO and PIO access to the
  device, we would be be able to exclude that there are secrete channels
  in the device's interface having the same effect. So we likely need to
  enhance PCI error handling to catch and handle faults for certain
  devices differently - those we cannot trust to behave properly while
  they are under userland/guest control.
 
 Why not all of them?

I think Jan is probably suggesting that we do need user space error
handling for all userland/guest controlled devices, but some classes of
errors on certain devices may be handled automatically by the userspace
interface layer... which we could do with VFIO (well, assuming the APEI
spec let's us nak the bios reporting a fatal error).  So do we want to
invent new solutions for each of these or do we want to move to a new
interface?  Thanks,

Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-09 Thread Avi Kivity
On 04/08/2012 08:37 PM, Jan Kiszka wrote:
 The core problem is not the ordering. The problem is that the kernel is
 susceptible to ordering mistakes of userspace. And that is because the
 kernel panics on PCI errors of devices that are in user hands - a
 critical kernel bug IMHO. 

Certainly.  But this userspace patch won't fix it.

 Proper reset of MSI or even the whole PCI
 config space is another issue, but one the kernel should not worry about
 - still, it should be fixed (therefore this patch).

And I was asking what is the right way to do it.  Reset the device and
read back the register values, or do an emulated reset and push down the
register values.

 But even if we disallowed userland to disable MMIO and PIO access to the
 device, we would be be able to exclude that there are secrete channels
 in the device's interface having the same effect. So we likely need to
 enhance PCI error handling to catch and handle faults for certain
 devices differently - those we cannot trust to behave properly while
 they are under userland/guest control.

Why not all of them?

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Avi Kivity
On 04/05/2012 06:42 AM, Alex Williamson wrote:
 We've hit a kernel host panic, when issuing a 'system_reset' with an
 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815.

 [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 
 32993
 [Hardware Error]: APEI generic hardware error status
 [Hardware Error]: severity: 1, fatal
 [Hardware Error]: section: 0, severity: 1, fatal
 [Hardware Error]: flags: 0x01
 [Hardware Error]: primary
 [Hardware Error]: section_type: PCIe error
 [Hardware Error]: port_type: 0, PCIe end point
 [Hardware Error]: version: 1.0
 [Hardware Error]: command: 0x, status: 0x0010
 [Hardware Error]: device_id: :08:00.0
 [Hardware Error]: slot: 1
 [Hardware Error]: secondary_bus: 0x00
 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
 [Hardware Error]: class_code: 02
 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
 [Hardware Error]: Unsupported Request
 [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
 [Hardware Error]: aer_uncor_severity: 0x00067011
 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
 [Hardware Error]: section: 1, severity: 1, fatal
 [Hardware Error]: flags: 0x01
 [Hardware Error]: primary
 [Hardware Error]: section_type: PCIe error
 [Hardware Error]: port_type: 0, PCIe end point
 [Hardware Error]: version: 1.0
 [Hardware Error]: command: 0x, status: 0x0010
 [Hardware Error]: device_id: :08:00.0
 [Hardware Error]: slot: 1
 [Hardware Error]: secondary_bus: 0x00
 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
 [Hardware Error]: class_code: 02
 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
 [Hardware Error]: Unsupported Request
 [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
 [Hardware Error]: aer_uncor_severity: 0x00067011
 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
 Kernel panic - not syncing: Fatal hardware error!
 Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
 Call Trace:
  NMI  [814f2fe5] ? panic+0xa0/0x168
  [812f919c] ? ghes_notify_nmi+0x17c/0x180
  [814f91d5] ? notifier_call_chain+0x55/0x80
  [814f923a] ? atomic_notifier_call_chain+0x1a/0x20
  [8109667e] ? notify_die+0x2e/0x30
  [814f6e81] ? do_nmi+0x1a1/0x2b0
  [814f6760] ? nmi+0x20/0x30
  [8103762b] ? native_safe_halt+0xb/0x10
  EOE  [8101495d] ? default_idle+0x4d/0xb0
  [81009e06] ? cpu_idle+0xb6/0x110
  [814da63a] ? rest_init+0x7a/0x80
  [81c1ff7b] ? start_kernel+0x424/0x430
  [81c1f33a] ? x86_64_start_reservations+0x125/0x129
  [81c1f438] ? x86_64_start_kernel+0xfa/0x109

 The root cause of the problem is that the 'reset_assigned_device()' code
 first writes a 0 to the command register. Then, when qemu subsequently does
 a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
 the kernel ends up calling '__msix_mask_irq()', which performs a write to
 the memory mapped msi vector space. Since, we've explicitly told the device
 to disallow mmio access (via the 0 write to the command register), we end
 up with the above 'Unsupported Request'.

 The fix here is to first disable MSI-X, before doing the reset.  We also
 disable MSI, leaving the device in INTx mode.  In this way, the device is
 a known state after reset, and we avoid touching msi memory mapped space
 on any subsequent 'kvm_deassign_irq()'.

 Thanks to Michael S. Tsirkin for help in understanding what was going on
 here and Jason Baron, the original debugger of this problem.

 Signed-off-by: Alex Williamson alex.william...@redhat.com
 ---

 Jason is out of the office for a couple weeks, so I'll try to resolve
 this while he's away.  Somehow the emulated config updates were lost
 in Jason's original posting, so I've fixed that and taken Jan's suggestion
 to simply call into the update functions instead of open coding the
 interrupt disable.  I think there still may be some disagreements about
 how to handle guest generated errors in the host, but that's a large
 project whereas this is something we should be doing at reset anyway,
 and even if only a workaround, resolves the problem above.

  hw/device-assignment.c |   23 +++
  1 files changed, 23 insertions(+), 0 deletions(-)

 diff --git a/hw/device-assignment.c b/hw/device-assignment.c
 index 89823f1..2e6b93e 100644
 --- a/hw/device-assignment.c
 +++ b/hw/device-assignment.c
 @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev)
  const char reset[] = 1;
  int fd, ret;
  
 +/*
 + * If a guest is reset without being shutdown, MSI/MSI-X can still
 + * be running.  We want to return the device to a known state on
 + * reset, so disable those here.  We especially do not want MSI-X
 + * enabled since it lives in MMIO space, which is about to get
 + * disabled.
 + */
 +if 

Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Michael S. Tsirkin
On Sun, Apr 08, 2012 at 04:14:29PM +0300, Avi Kivity wrote:
 On 04/05/2012 06:42 AM, Alex Williamson wrote:
  We've hit a kernel host panic, when issuing a 'system_reset' with an
  82576 nic assigned and a Windows guest. Host system is a PowerEdge R815.
 
  [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 
  32993
  [Hardware Error]: APEI generic hardware error status
  [Hardware Error]: severity: 1, fatal
  [Hardware Error]: section: 0, severity: 1, fatal
  [Hardware Error]: flags: 0x01
  [Hardware Error]: primary
  [Hardware Error]: section_type: PCIe error
  [Hardware Error]: port_type: 0, PCIe end point
  [Hardware Error]: version: 1.0
  [Hardware Error]: command: 0x, status: 0x0010
  [Hardware Error]: device_id: :08:00.0
  [Hardware Error]: slot: 1
  [Hardware Error]: secondary_bus: 0x00
  [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
  [Hardware Error]: class_code: 02
  [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
  [Hardware Error]: Unsupported Request
  [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
  [Hardware Error]: aer_uncor_severity: 0x00067011
  [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
  [Hardware Error]: section: 1, severity: 1, fatal
  [Hardware Error]: flags: 0x01
  [Hardware Error]: primary
  [Hardware Error]: section_type: PCIe error
  [Hardware Error]: port_type: 0, PCIe end point
  [Hardware Error]: version: 1.0
  [Hardware Error]: command: 0x, status: 0x0010
  [Hardware Error]: device_id: :08:00.0
  [Hardware Error]: slot: 1
  [Hardware Error]: secondary_bus: 0x00
  [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
  [Hardware Error]: class_code: 02
  [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
  [Hardware Error]: Unsupported Request
  [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
  [Hardware Error]: aer_uncor_severity: 0x00067011
  [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
  Kernel panic - not syncing: Fatal hardware error!
  Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
  Call Trace:
   NMI  [814f2fe5] ? panic+0xa0/0x168
   [812f919c] ? ghes_notify_nmi+0x17c/0x180
   [814f91d5] ? notifier_call_chain+0x55/0x80
   [814f923a] ? atomic_notifier_call_chain+0x1a/0x20
   [8109667e] ? notify_die+0x2e/0x30
   [814f6e81] ? do_nmi+0x1a1/0x2b0
   [814f6760] ? nmi+0x20/0x30
   [8103762b] ? native_safe_halt+0xb/0x10
   EOE  [8101495d] ? default_idle+0x4d/0xb0
   [81009e06] ? cpu_idle+0xb6/0x110
   [814da63a] ? rest_init+0x7a/0x80
   [81c1ff7b] ? start_kernel+0x424/0x430
   [81c1f33a] ? x86_64_start_reservations+0x125/0x129
   [81c1f438] ? x86_64_start_kernel+0xfa/0x109
 
  The root cause of the problem is that the 'reset_assigned_device()' code
  first writes a 0 to the command register. Then, when qemu subsequently does
  a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
  the kernel ends up calling '__msix_mask_irq()', which performs a write to
  the memory mapped msi vector space. Since, we've explicitly told the device
  to disallow mmio access (via the 0 write to the command register), we end
  up with the above 'Unsupported Request'.
 
  The fix here is to first disable MSI-X, before doing the reset.  We also
  disable MSI, leaving the device in INTx mode.  In this way, the device is
  a known state after reset, and we avoid touching msi memory mapped space
  on any subsequent 'kvm_deassign_irq()'.
 
  Thanks to Michael S. Tsirkin for help in understanding what was going on
  here and Jason Baron, the original debugger of this problem.
 
  Signed-off-by: Alex Williamson alex.william...@redhat.com
  ---
 
  Jason is out of the office for a couple weeks, so I'll try to resolve
  this while he's away.  Somehow the emulated config updates were lost
  in Jason's original posting, so I've fixed that and taken Jan's suggestion
  to simply call into the update functions instead of open coding the
  interrupt disable.  I think there still may be some disagreements about
  how to handle guest generated errors in the host, but that's a large
  project whereas this is something we should be doing at reset anyway,
  and even if only a workaround, resolves the problem above.
 
   hw/device-assignment.c |   23 +++
   1 files changed, 23 insertions(+), 0 deletions(-)
 
  diff --git a/hw/device-assignment.c b/hw/device-assignment.c
  index 89823f1..2e6b93e 100644
  --- a/hw/device-assignment.c
  +++ b/hw/device-assignment.c
  @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev)
   const char reset[] = 1;
   int fd, ret;
   
  +/*
  + * If a guest is reset without being shutdown, MSI/MSI-X can still
  + * be running.  We want to return the device to a known state on
  + * reset, so disable those 

Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Avi Kivity
On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
  
  Don't we FLR the device, which ought to disable MSI on the real device?

 AFAIK we call pci_reset, which saves device state, does an FLR
 and then restores the state. I think this might include msi as well.



Then that is wrong as well, no?

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Michael S. Tsirkin
On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
 On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
   
   Don't we FLR the device, which ought to disable MSI on the real device?
 
  AFAIK we call pci_reset, which saves device state, does an FLR
  and then restores the state. I think this might include msi as well.
 
 
 
 Then that is wrong as well, no?

Not as such assuming we disable msi/msix first :)

 -- 
 error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Avi Kivity
On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
 On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
  On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:

Don't we FLR the device, which ought to disable MSI on the real device?
  
   AFAIK we call pci_reset, which saves device state, does an FLR
   and then restores the state. I think this might include msi as well.
  
  
  
  Then that is wrong as well, no?

 Not as such assuming we disable msi/msix first :)

I think we need to fix both, no?

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Michael S. Tsirkin
On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote:
 On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
  On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
   On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
 
 Don't we FLR the device, which ought to disable MSI on the real 
 device?
   
AFAIK we call pci_reset, which saves device state, does an FLR
and then restores the state. I think this might include msi as well.
   
   
   
   Then that is wrong as well, no?
 
  Not as such assuming we disable msi/msix first :)
 
 I think we need to fix both, no?

Isn't this what this patch does?

 -- 
 error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Avi Kivity
On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote:
 On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote:
  On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
   On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
  
  Don't we FLR the device, which ought to disable MSI on the real 
  device?

 AFAIK we call pci_reset, which saves device state, does an FLR
 and then restores the state. I think this might include msi as well.



Then that is wrong as well, no?
  
   Not as such assuming we disable msi/msix first :)
  
  I think we need to fix both, no?

 Isn't this what this patch does?

If we change pci_reset() (or a variant that we call) to reset MSI, and
update qemu to synchronize from the device after pci_reset(), then we
achieve the same result, in a different way.

Since reset can change other config space registers, we achieve
correctness for more of them.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Michael S. Tsirkin
On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote:
 On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote:
  On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote:
   On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
 On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
   
   Don't we FLR the device, which ought to disable MSI on the real 
   device?
 
  AFAIK we call pci_reset, which saves device state, does an FLR
  and then restores the state. I think this might include msi as well.
 
 
 
 Then that is wrong as well, no?
   
Not as such assuming we disable msi/msix first :)
   
   I think we need to fix both, no?
 
  Isn't this what this patch does?
 
 If we change pci_reset() (or a variant that we call) to reset MSI, and
 update qemu to synchronize from the device after pci_reset(), then we
 achieve the same result, in a different way.

MSI vectors are set up by kvm in the host. So we should not
abruptly drop that by a sysfs write: would need to
synchronise with kvm. Once we do, there's nothing left
for pci_reset to do.


 Since reset can change other config space registers, we achieve
 correctness for more of them.

Which other registers do you have in mind?

 -- 
 error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Avi Kivity
On 04/08/2012 04:53 PM, Michael S. Tsirkin wrote:
 On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote:
  On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote:
   On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote:
On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
 On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
  On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:

Don't we FLR the device, which ought to disable MSI on the real 
device?
  
   AFAIK we call pci_reset, which saves device state, does an FLR
   and then restores the state. I think this might include msi as 
   well.
  
  
  
  Then that is wrong as well, no?

 Not as such assuming we disable msi/msix first :)

I think we need to fix both, no?
  
   Isn't this what this patch does?
  
  If we change pci_reset() (or a variant that we call) to reset MSI, and
  update qemu to synchronize from the device after pci_reset(), then we
  achieve the same result, in a different way.

 MSI vectors are set up by kvm in the host. So we should not
 abruptly drop that by a sysfs write: would need to
 synchronise with kvm. Once we do, there's nothing left
 for pci_reset to do.

I'm thinking about this flow:

  FLR the device
  for each emulated register
 read it from the hardware
 if different from emulated register:
update the internal model (for example, disabling MSI in kvm if
needed)
set emulated register to hardware value

  Since reset can change other config space registers, we achieve
  correctness for more of them.

 Which other registers do you have in mind?

BARs for example.  We may have our own reset for this, but isn't copying
the hardware values more trustworthy?

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Michael S. Tsirkin
On Sun, Apr 08, 2012 at 05:01:36PM +0300, Avi Kivity wrote:
 On 04/08/2012 04:53 PM, Michael S. Tsirkin wrote:
  On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote:
   On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote:
On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote:
 On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
  On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
   On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
 
 Don't we FLR the device, which ought to disable MSI on the 
 real device?
   
AFAIK we call pci_reset, which saves device state, does an FLR
and then restores the state. I think this might include msi as 
well.
   
   
   
   Then that is wrong as well, no?
 
  Not as such assuming we disable msi/msix first :)
 
 I think we need to fix both, no?
   
Isn't this what this patch does?
   
   If we change pci_reset() (or a variant that we call) to reset MSI, and
   update qemu to synchronize from the device after pci_reset(), then we
   achieve the same result, in a different way.
 
  MSI vectors are set up by kvm in the host. So we should not
  abruptly drop that by a sysfs write: would need to
  synchronise with kvm. Once we do, there's nothing left
  for pci_reset to do.
 
 I'm thinking about this flow:
 
   FLR the device
   for each emulated register
  read it from the hardware
  if different from emulated register:
 update the internal model (for example, disabling MSI in kvm if
 needed)

If we do it this way we get back the problem this patch
is trying to solve: MSIX assigned while device
memory is disabled would cause unsupported request errors.

 set emulated register to hardware value

Yes, I see what you are trying to say now.
Unfortunately that's not enough: we also
need to restore the registers afterwards for
device to become useful again.
Doing this in kernel seems more robust, otherwise
we risk losing the device if qemu gets killed
before it has restored the registers.


   Since reset can change other config space registers, we achieve
   correctness for more of them.
 
  Which other registers do you have in mind?
 
 BARs for example.  We may have our own reset for this, but isn't copying
 the hardware values more trustworthy?

BAR values in host and guest are unrelated.
If pci_reset didn't restore BAR values we won't
be able to operate the device.


 -- 
 error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Avi Kivity
On 04/08/2012 05:42 PM, Michael S. Tsirkin wrote:
 On Sun, Apr 08, 2012 at 05:01:36PM +0300, Avi Kivity wrote:
  On 04/08/2012 04:53 PM, Michael S. Tsirkin wrote:
   On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote:
On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote:
 On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote:
  On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
   On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
  
  Don't we FLR the device, which ought to disable MSI on the 
  real device?

 AFAIK we call pci_reset, which saves device state, does an FLR
 and then restores the state. I think this might include msi 
 as well.



Then that is wrong as well, no?
  
   Not as such assuming we disable msi/msix first :)
  
  I think we need to fix both, no?

 Isn't this what this patch does?

If we change pci_reset() (or a variant that we call) to reset MSI, and
update qemu to synchronize from the device after pci_reset(), then we
achieve the same result, in a different way.
  
   MSI vectors are set up by kvm in the host. So we should not
   abruptly drop that by a sysfs write: would need to
   synchronise with kvm. Once we do, there's nothing left
   for pci_reset to do.
  
  I'm thinking about this flow:
  
FLR the device
for each emulated register
   read it from the hardware
   if different from emulated register:
  update the internal model (for example, disabling MSI in kvm if
  needed)

 If we do it this way we get back the problem this patch
 is trying to solve: MSIX assigned while device
 memory is disabled would cause unsupported request errors.

Why is that?  FLR would presumably disable MSI in the device, and this
line would disable it in kvm as well.

  set emulated register to hardware value

 Yes, I see what you are trying to say now.
 Unfortunately that's not enough: we also
 need to restore the registers afterwards for
 device to become useful again.

I guess this is correct for the MSIX BAR.  But is it correct for MSIX
enable/disable?

 Doing this in kernel seems more robust, otherwise
 we risk losing the device if qemu gets killed
 before it has restored the registers.

Doesn't the driver have to enable MSIX if it attaches to the device at
that point, anyway?


Since reset can change other config space registers, we achieve
correctness for more of them.
  
   Which other registers do you have in mind?
  
  BARs for example.  We may have our own reset for this, but isn't copying
  the hardware values more trustworthy?

 BAR values in host and guest are unrelated.
 If pci_reset didn't restore BAR values we won't
 be able to operate the device.


Right.

I guess candidates are those that are initialized with
assigned_dev_emulate_config_*()?  Hard to see which ones because they're
mass initialized.



-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Michael S. Tsirkin
On Sun, Apr 08, 2012 at 06:26:28PM +0300, Avi Kivity wrote:
 On 04/08/2012 05:42 PM, Michael S. Tsirkin wrote:
  On Sun, Apr 08, 2012 at 05:01:36PM +0300, Avi Kivity wrote:
   On 04/08/2012 04:53 PM, Michael S. Tsirkin wrote:
On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote:
 On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote:
  On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote:
   On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
 On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
   
   Don't we FLR the device, which ought to disable MSI on 
   the real device?
 
  AFAIK we call pci_reset, which saves device state, does an 
  FLR
  and then restores the state. I think this might include msi 
  as well.
 
 
 
 Then that is wrong as well, no?
   
Not as such assuming we disable msi/msix first :)
   
   I think we need to fix both, no?
 
  Isn't this what this patch does?
 
 If we change pci_reset() (or a variant that we call) to reset MSI, and
 update qemu to synchronize from the device after pci_reset(), then we
 achieve the same result, in a different way.
   
MSI vectors are set up by kvm in the host. So we should not
abruptly drop that by a sysfs write: would need to
synchronise with kvm. Once we do, there's nothing left
for pci_reset to do.
   
   I'm thinking about this flow:
   
 FLR the device
 for each emulated register
read it from the hardware
if different from emulated register:
   update the internal model (for example, disabling MSI in kvm if
   needed)
 
  If we do it this way we get back the problem this patch
  is trying to solve: MSIX assigned while device
  memory is disabled would cause unsupported request errors.
 
 Why is that?  FLR would presumably disable MSI in the device, and this
 line would disable it in kvm as well.

The bug is that device memory is disabled (FLR would do that)
while MSI is enabled in kvm. The fix is to
disable MSI in kvm first.

   set emulated register to hardware value
 
  Yes, I see what you are trying to say now.
  Unfortunately that's not enough: we also
  need to restore the registers afterwards for
  device to become useful again.
 
 I guess this is correct for the MSIX BAR.  But is it correct for MSIX
 enable/disable?

Probably not.

  Doing this in kernel seems more robust, otherwise
  we risk losing the device if qemu gets killed
  before it has restored the registers.
 
 Doesn't the driver have to enable MSIX if it attaches to the device at
 that point, anyway?

Yes. I'm talking about things like enabling memory, setting up irq register,
etc though. Most of this setup is done by bios.

 Since reset can change other config space registers, we achieve
 correctness for more of them.
   
Which other registers do you have in mind?
   
   BARs for example.  We may have our own reset for this, but isn't copying
   the hardware values more trustworthy?
 
  BAR values in host and guest are unrelated.
  If pci_reset didn't restore BAR values we won't
  be able to operate the device.
 
 
 Right.
 
 I guess candidates are those that are initialized with
 assigned_dev_emulate_config_*()?  Hard to see which ones because they're
 mass initialized.
 
 
 
 -- 
 error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Avi Kivity
On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:

I'm thinking about this flow:

  FLR the device
  for each emulated register
 read it from the hardware
 if different from emulated register:
update the internal model (for example, disabling MSI in kvm if
needed)
  
   If we do it this way we get back the problem this patch
   is trying to solve: MSIX assigned while device
   memory is disabled would cause unsupported request errors.
  
  Why is that?  FLR would presumably disable MSI in the device, and this
  line would disable it in kvm as well.

 The bug is that device memory is disabled (FLR would do that)
 while MSI is enabled in kvm. The fix is to
 disable MSI in kvm first.

Yes, no need to repeat.  My question is whether my pseudo-code does the
same and whether or not if it is better (when applied to all emulated
config space).

   Doing this in kernel seems more robust, otherwise
   we risk losing the device if qemu gets killed
   before it has restored the registers.
  
  Doesn't the driver have to enable MSIX if it attaches to the device at
  that point, anyway?

 Yes. I'm talking about things like enabling memory, setting up irq register,
 etc though. Most of this setup is done by bios.

I see.  So should we have a pci_reset_function() variant that limits
itself to restoring just those bits?



-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Michael S. Tsirkin
On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote:
 On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:
 
 I'm thinking about this flow:
 
   FLR the device
   for each emulated register
  read it from the hardware
  if different from emulated register:
 update the internal model (for example, disabling MSI in kvm 
 if
 needed)
   
If we do it this way we get back the problem this patch
is trying to solve: MSIX assigned while device
memory is disabled would cause unsupported request errors.
   
   Why is that?  FLR would presumably disable MSI in the device, and this
   line would disable it in kvm as well.
 
  The bug is that device memory is disabled (FLR would do that)
  while MSI is enabled in kvm. The fix is to
  disable MSI in kvm first.
 
 Yes, no need to repeat.  My question is whether my pseudo-code does the
 same

It doesn't seem to: FLR (disabling memory) is followed
by MSI disable in kvm instead of the reverse.

 and whether or not if it is better (when applied to all emulated
 config space).

I'm not sure.
I would like to see an example of a register that you have
in mind.

Doing this in kernel seems more robust, otherwise
we risk losing the device if qemu gets killed
before it has restored the registers.
   
   Doesn't the driver have to enable MSIX if it attaches to the device at
   that point, anyway?
 
  Yes. I'm talking about things like enabling memory, setting up irq register,
  etc though. Most of this setup is done by bios.
 
 I see.  So should we have a pci_reset_function() variant that limits
 itself to restoring just those bits?

We only need kernel to restore whatever qemu emulates, but
kernel doesn't know what that is.
What kind of interface do you have in mind?

 -- 
 error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Avi Kivity
On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote:
 On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote:
  On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:
  
  I'm thinking about this flow:
  
FLR the device
for each emulated register
   read it from the hardware
   if different from emulated register:
  update the internal model (for example, disabling MSI in 
  kvm if
  needed)

 If we do it this way we get back the problem this patch
 is trying to solve: MSIX assigned while device
 memory is disabled would cause unsupported request errors.

Why is that?  FLR would presumably disable MSI in the device, and this
line would disable it in kvm as well.
  
   The bug is that device memory is disabled (FLR would do that)
   while MSI is enabled in kvm. The fix is to
   disable MSI in kvm first.
  
  Yes, no need to repeat.  My question is whether my pseudo-code does the
  same

 It doesn't seem to: FLR (disabling memory) is followed
 by MSI disable in kvm instead of the reverse.

Ah, so the problem is the ordering?  I see.

  and whether or not if it is better (when applied to all emulated
  config space).

 I'm not sure.
 I would like to see an example of a register that you have
 in mind.

I went over the PCI registers and saw none that would be affected.

  
   Yes. I'm talking about things like enabling memory, setting up irq 
   register,
   etc though. Most of this setup is done by bios.
  
  I see.  So should we have a pci_reset_function() variant that limits
  itself to restoring just those bits?

 We only need kernel to restore whatever qemu emulates, but
 kernel doesn't know what that is.
 What kind of interface do you have in mind?


The same as pci_reset_function(), but leaves MSI clear.

I guess it's not worth it if the ordering problem is there.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Jan Kiszka
On 2012-04-08 18:08, Avi Kivity wrote:
 On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote:
 On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote:
 On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:

 I'm thinking about this flow:

   FLR the device
   for each emulated register
  read it from the hardware
  if different from emulated register:
 update the internal model (for example, disabling MSI in kvm if
 needed)

 If we do it this way we get back the problem this patch
 is trying to solve: MSIX assigned while device
 memory is disabled would cause unsupported request errors.

 Why is that?  FLR would presumably disable MSI in the device, and this
 line would disable it in kvm as well.

 The bug is that device memory is disabled (FLR would do that)
 while MSI is enabled in kvm. The fix is to
 disable MSI in kvm first.

 Yes, no need to repeat.  My question is whether my pseudo-code does the
 same

 It doesn't seem to: FLR (disabling memory) is followed
 by MSI disable in kvm instead of the reverse.
 
 Ah, so the problem is the ordering?  I see.
 
 and whether or not if it is better (when applied to all emulated
 config space).

 I'm not sure.
 I would like to see an example of a register that you have
 in mind.
 
 I went over the PCI registers and saw none that would be affected.
 

 Yes. I'm talking about things like enabling memory, setting up irq 
 register,
 etc though. Most of this setup is done by bios.

 I see.  So should we have a pci_reset_function() variant that limits
 itself to restoring just those bits?

 We only need kernel to restore whatever qemu emulates, but
 kernel doesn't know what that is.
 What kind of interface do you have in mind?

 
 The same as pci_reset_function(), but leaves MSI clear.
 
 I guess it's not worth it if the ordering problem is there.

The core problem is not the ordering. The problem is that the kernel is
susceptible to ordering mistakes of userspace. And that is because the
kernel panics on PCI errors of devices that are in user hands - a
critical kernel bug IMHO. Proper reset of MSI or even the whole PCI
config space is another issue, but one the kernel should not worry about
- still, it should be fixed (therefore this patch).

But even if we disallowed userland to disable MMIO and PIO access to the
device, we would be be able to exclude that there are secrete channels
in the device's interface having the same effect. So we likely need to
enhance PCI error handling to catch and handle faults for certain
devices differently - those we cannot trust to behave properly while
they are under userland/guest control.

Jan



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Michael S. Tsirkin
On Sun, Apr 08, 2012 at 07:37:57PM +0200, Jan Kiszka wrote:
 On 2012-04-08 18:08, Avi Kivity wrote:
  On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote:
  On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote:
  On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:
 
  I'm thinking about this flow:
 
FLR the device
for each emulated register
   read it from the hardware
   if different from emulated register:
  update the internal model (for example, disabling MSI in kvm 
  if
  needed)
 
  If we do it this way we get back the problem this patch
  is trying to solve: MSIX assigned while device
  memory is disabled would cause unsupported request errors.
 
  Why is that?  FLR would presumably disable MSI in the device, and this
  line would disable it in kvm as well.
 
  The bug is that device memory is disabled (FLR would do that)
  while MSI is enabled in kvm. The fix is to
  disable MSI in kvm first.
 
  Yes, no need to repeat.  My question is whether my pseudo-code does the
  same
 
  It doesn't seem to: FLR (disabling memory) is followed
  by MSI disable in kvm instead of the reverse.
  
  Ah, so the problem is the ordering?  I see.
  
  and whether or not if it is better (when applied to all emulated
  config space).
 
  I'm not sure.
  I would like to see an example of a register that you have
  in mind.
  
  I went over the PCI registers and saw none that would be affected.
  
 
  Yes. I'm talking about things like enabling memory, setting up irq 
  register,
  etc though. Most of this setup is done by bios.
 
  I see.  So should we have a pci_reset_function() variant that limits
  itself to restoring just those bits?
 
  We only need kernel to restore whatever qemu emulates, but
  kernel doesn't know what that is.
  What kind of interface do you have in mind?
 
  
  The same as pci_reset_function(), but leaves MSI clear.
  
  I guess it's not worth it if the ordering problem is there.
 
 The core problem is not the ordering. The problem is that the kernel is
 susceptible to ordering mistakes of userspace.
 And that is because the
 kernel panics on PCI errors of devices that are in user hands - a
 critical kernel bug IMHO.

I'm not sure. The pci sysfs interface
is by design not secured against malicious users,
isn't it?

 Proper reset of MSI or even the whole PCI
 config space is another issue, but one the kernel should not worry about
 - still, it should be fixed (therefore this patch).
 But even if we disallowed userland to disable MMIO and PIO access to the
 device, we would be be able to exclude that there are secrete channels
 in the device's interface having the same effect.

I'm not sure I agree here.  If there are secret channels to the device
that let it violate the PCI express spec, it can probably break the SRIOV
security model. And then you can do much more than just crash the host.

 So we likely need to
 enhance PCI error handling to catch and handle faults for certain
 devices differently - those we cannot trust to behave properly while
 they are under userland/guest control.
 
 Jan
 


I agree - forwarding errors to the guest would actually be very useful - but
I think we also need to analyse the problem carefully,
and prevent as many ways as we can for guest to cause trouble.

And there is another issue here: unsuppported request errors
should not cause kernel panics IMO.

There's also the issue that qemu let guest control the MMIO/PIO
bits in the command register.

So there are multiple bugs.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Jan Kiszka
On 2012-04-08 20:18, Michael S. Tsirkin wrote:
 On Sun, Apr 08, 2012 at 07:37:57PM +0200, Jan Kiszka wrote:
 On 2012-04-08 18:08, Avi Kivity wrote:
 On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote:
 On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote:
 On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:

 I'm thinking about this flow:

   FLR the device
   for each emulated register
  read it from the hardware
  if different from emulated register:
 update the internal model (for example, disabling MSI in kvm 
 if
 needed)

 If we do it this way we get back the problem this patch
 is trying to solve: MSIX assigned while device
 memory is disabled would cause unsupported request errors.

 Why is that?  FLR would presumably disable MSI in the device, and this
 line would disable it in kvm as well.

 The bug is that device memory is disabled (FLR would do that)
 while MSI is enabled in kvm. The fix is to
 disable MSI in kvm first.

 Yes, no need to repeat.  My question is whether my pseudo-code does the
 same

 It doesn't seem to: FLR (disabling memory) is followed
 by MSI disable in kvm instead of the reverse.

 Ah, so the problem is the ordering?  I see.

 and whether or not if it is better (when applied to all emulated
 config space).

 I'm not sure.
 I would like to see an example of a register that you have
 in mind.

 I went over the PCI registers and saw none that would be affected.


 Yes. I'm talking about things like enabling memory, setting up irq 
 register,
 etc though. Most of this setup is done by bios.

 I see.  So should we have a pci_reset_function() variant that limits
 itself to restoring just those bits?

 We only need kernel to restore whatever qemu emulates, but
 kernel doesn't know what that is.
 What kind of interface do you have in mind?


 The same as pci_reset_function(), but leaves MSI clear.

 I guess it's not worth it if the ordering problem is there.

 The core problem is not the ordering. The problem is that the kernel is
 susceptible to ordering mistakes of userspace.
 And that is because the
 kernel panics on PCI errors of devices that are in user hands - a
 critical kernel bug IMHO.
 
 I'm not sure. The pci sysfs interface
 is by design not secured against malicious users,
 isn't it?

That's surely true for devices outside of IOMMU protection. But do we
really have to give up when we encapsulate and isolate them that way?
Provided we moderate access to the sysfs resources via libvirt or some
other management service.

 
 Proper reset of MSI or even the whole PCI
 config space is another issue, but one the kernel should not worry about
 - still, it should be fixed (therefore this patch).
 But even if we disallowed userland to disable MMIO and PIO access to the
 device, we would be be able to exclude that there are secrete channels
 in the device's interface having the same effect.
 
 I'm not sure I agree here.  If there are secret channels to the device
 that let it violate the PCI express spec, it can probably break the SRIOV
 security model. And then you can do much more than just crash the host.

Maybe, but there are also other devices. And if a guest reprograms it
(firmware update...) and makes it stop reacting on requests, we may get
the same effect. That would also be some kind of a secrete channel.

 
 So we likely need to
 enhance PCI error handling to catch and handle faults for certain
 devices differently - those we cannot trust to behave properly while
 they are under userland/guest control.

 Jan

 
 
 I agree - forwarding errors to the guest would actually be very useful - but
 I think we also need to analyse the problem carefully,
 and prevent as many ways as we can for guest to cause trouble.

If possible, the protection should target userspace which would
automatically include guests. Only if that is not feasible with
reasonable effort, we have to rely on QEMU to save the host.

 
 And there is another issue here: unsuppported request errors
 should not cause kernel panics IMO.
 
 There's also the issue that qemu let guest control the MMIO/PIO
 bits in the command register.
 
 So there are multiple bugs.
 

Yep, that's true.

Jan



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Michael S. Tsirkin
On Sun, Apr 08, 2012 at 08:39:35PM +0200, Jan Kiszka wrote:
 On 2012-04-08 20:18, Michael S. Tsirkin wrote:
  On Sun, Apr 08, 2012 at 07:37:57PM +0200, Jan Kiszka wrote:
  On 2012-04-08 18:08, Avi Kivity wrote:
  On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote:
  On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote:
  On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:
 
  I'm thinking about this flow:
 
FLR the device
for each emulated register
   read it from the hardware
   if different from emulated register:
  update the internal model (for example, disabling MSI in 
  kvm if
  needed)
 
  If we do it this way we get back the problem this patch
  is trying to solve: MSIX assigned while device
  memory is disabled would cause unsupported request errors.
 
  Why is that?  FLR would presumably disable MSI in the device, and this
  line would disable it in kvm as well.
 
  The bug is that device memory is disabled (FLR would do that)
  while MSI is enabled in kvm. The fix is to
  disable MSI in kvm first.
 
  Yes, no need to repeat.  My question is whether my pseudo-code does the
  same
 
  It doesn't seem to: FLR (disabling memory) is followed
  by MSI disable in kvm instead of the reverse.
 
  Ah, so the problem is the ordering?  I see.
 
  and whether or not if it is better (when applied to all emulated
  config space).
 
  I'm not sure.
  I would like to see an example of a register that you have
  in mind.
 
  I went over the PCI registers and saw none that would be affected.
 
 
  Yes. I'm talking about things like enabling memory, setting up irq 
  register,
  etc though. Most of this setup is done by bios.
 
  I see.  So should we have a pci_reset_function() variant that limits
  itself to restoring just those bits?
 
  We only need kernel to restore whatever qemu emulates, but
  kernel doesn't know what that is.
  What kind of interface do you have in mind?
 
 
  The same as pci_reset_function(), but leaves MSI clear.
 
  I guess it's not worth it if the ordering problem is there.
 
  The core problem is not the ordering. The problem is that the kernel is
  susceptible to ordering mistakes of userspace.
  And that is because the
  kernel panics on PCI errors of devices that are in user hands - a
  critical kernel bug IMHO.
  
  I'm not sure. The pci sysfs interface
  is by design not secured against malicious users,
  isn't it?
 
 That's surely true for devices outside of IOMMU protection. But do we
 really have to give up when we encapsulate and isolate them that way?
 Provided we moderate access to the sysfs resources via libvirt or some
 other management service.

We don't have to give up but we'd have to build such an
interface: /config attribute is not it.

  
  Proper reset of MSI or even the whole PCI
  config space is another issue, but one the kernel should not worry about
  - still, it should be fixed (therefore this patch).
  But even if we disallowed userland to disable MMIO and PIO access to the
  device, we would be be able to exclude that there are secrete channels
  in the device's interface having the same effect.
  
  I'm not sure I agree here.  If there are secret channels to the device
  that let it violate the PCI express spec, it can probably break the SRIOV
  security model. And then you can do much more than just crash the host.
 
 Maybe, but there are also other devices. And if a guest reprograms it
 (firmware update...) and makes it stop reacting on requests, we may get
 the same effect. That would also be some kind of a secrete channel.

Right. So it looks like SRIOV VF is the only type of device that
is safe to assign to a guest:

Presumably, SRIOV VFs don't let driver program the firmware.
And I think SRIOV VFs don't have MMIO/PIO enable bits either,
and the BAR isn't programmable through the VF...

  
  So we likely need to
  enhance PCI error handling to catch and handle faults for certain
  devices differently - those we cannot trust to behave properly while
  they are under userland/guest control.
 
  Jan
 
  
  
  I agree - forwarding errors to the guest would actually be very useful - but
  I think we also need to analyse the problem carefully,
  and prevent as many ways as we can for guest to cause trouble.
 
 If possible, the protection should target userspace which would
 automatically include guests. Only if that is not feasible with
 reasonable effort, we have to rely on QEMU to save the host.

Defence in depth is best, right?

  
  And there is another issue here: unsuppported request errors
  should not cause kernel panics IMO.
  
  There's also the issue that qemu let guest control the MMIO/PIO
  bits in the command register.
  
  So there are multiple bugs.
  
 
 Yep, that's true.
 
 Jan
 


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-05 Thread Jan Kiszka
On 2012-04-05 05:42, Alex Williamson wrote:
 We've hit a kernel host panic, when issuing a 'system_reset' with an
 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815.
 
 [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 
 32993
 [Hardware Error]: APEI generic hardware error status
 [Hardware Error]: severity: 1, fatal
 [Hardware Error]: section: 0, severity: 1, fatal
 [Hardware Error]: flags: 0x01
 [Hardware Error]: primary
 [Hardware Error]: section_type: PCIe error
 [Hardware Error]: port_type: 0, PCIe end point
 [Hardware Error]: version: 1.0
 [Hardware Error]: command: 0x, status: 0x0010
 [Hardware Error]: device_id: :08:00.0
 [Hardware Error]: slot: 1
 [Hardware Error]: secondary_bus: 0x00
 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
 [Hardware Error]: class_code: 02
 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
 [Hardware Error]: Unsupported Request
 [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
 [Hardware Error]: aer_uncor_severity: 0x00067011
 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
 [Hardware Error]: section: 1, severity: 1, fatal
 [Hardware Error]: flags: 0x01
 [Hardware Error]: primary
 [Hardware Error]: section_type: PCIe error
 [Hardware Error]: port_type: 0, PCIe end point
 [Hardware Error]: version: 1.0
 [Hardware Error]: command: 0x, status: 0x0010
 [Hardware Error]: device_id: :08:00.0
 [Hardware Error]: slot: 1
 [Hardware Error]: secondary_bus: 0x00
 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
 [Hardware Error]: class_code: 02
 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
 [Hardware Error]: Unsupported Request
 [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
 [Hardware Error]: aer_uncor_severity: 0x00067011
 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
 Kernel panic - not syncing: Fatal hardware error!
 Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
 Call Trace:
  NMI  [814f2fe5] ? panic+0xa0/0x168
  [812f919c] ? ghes_notify_nmi+0x17c/0x180
  [814f91d5] ? notifier_call_chain+0x55/0x80
  [814f923a] ? atomic_notifier_call_chain+0x1a/0x20
  [8109667e] ? notify_die+0x2e/0x30
  [814f6e81] ? do_nmi+0x1a1/0x2b0
  [814f6760] ? nmi+0x20/0x30
  [8103762b] ? native_safe_halt+0xb/0x10
  EOE  [8101495d] ? default_idle+0x4d/0xb0
  [81009e06] ? cpu_idle+0xb6/0x110
  [814da63a] ? rest_init+0x7a/0x80
  [81c1ff7b] ? start_kernel+0x424/0x430
  [81c1f33a] ? x86_64_start_reservations+0x125/0x129
  [81c1f438] ? x86_64_start_kernel+0xfa/0x109
 
 The root cause of the problem is that the 'reset_assigned_device()' code
 first writes a 0 to the command register. Then, when qemu subsequently does
 a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
 the kernel ends up calling '__msix_mask_irq()', which performs a write to
 the memory mapped msi vector space. Since, we've explicitly told the device
 to disallow mmio access (via the 0 write to the command register), we end
 up with the above 'Unsupported Request'.
 
 The fix here is to first disable MSI-X, before doing the reset.  We also
 disable MSI, leaving the device in INTx mode.  In this way, the device is
 a known state after reset, and we avoid touching msi memory mapped space
 on any subsequent 'kvm_deassign_irq()'.
 
 Thanks to Michael S. Tsirkin for help in understanding what was going on
 here and Jason Baron, the original debugger of this problem.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 ---
 
 Jason is out of the office for a couple weeks, so I'll try to resolve
 this while he's away.  Somehow the emulated config updates were lost
 in Jason's original posting, so I've fixed that and taken Jan's suggestion
 to simply call into the update functions instead of open coding the
 interrupt disable.  I think there still may be some disagreements about
 how to handle guest generated errors in the host, but that's a large
 project whereas this is something we should be doing at reset anyway,
 and even if only a workaround, resolves the problem above.
 
  hw/device-assignment.c |   23 +++
  1 files changed, 23 insertions(+), 0 deletions(-)
 
 diff --git a/hw/device-assignment.c b/hw/device-assignment.c
 index 89823f1..2e6b93e 100644
 --- a/hw/device-assignment.c
 +++ b/hw/device-assignment.c
 @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev)
  const char reset[] = 1;
  int fd, ret;
  
 +/*
 + * If a guest is reset without being shutdown, MSI/MSI-X can still
 + * be running.  We want to return the device to a known state on
 + * reset, so disable those here.  We especially do not want MSI-X

To be precised: It is the specified state after reset which we fail to
restore so far.

 + * enabled since it lives 

Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-05 Thread Michael S. Tsirkin
On Wed, Apr 04, 2012 at 09:42:32PM -0600, Alex Williamson wrote:
 We've hit a kernel host panic, when issuing a 'system_reset' with an
 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815.
 
 [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 
 32993
 [Hardware Error]: APEI generic hardware error status
 [Hardware Error]: severity: 1, fatal
 [Hardware Error]: section: 0, severity: 1, fatal
 [Hardware Error]: flags: 0x01
 [Hardware Error]: primary
 [Hardware Error]: section_type: PCIe error
 [Hardware Error]: port_type: 0, PCIe end point
 [Hardware Error]: version: 1.0
 [Hardware Error]: command: 0x, status: 0x0010
 [Hardware Error]: device_id: :08:00.0
 [Hardware Error]: slot: 1
 [Hardware Error]: secondary_bus: 0x00
 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
 [Hardware Error]: class_code: 02
 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
 [Hardware Error]: Unsupported Request
 [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
 [Hardware Error]: aer_uncor_severity: 0x00067011
 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
 [Hardware Error]: section: 1, severity: 1, fatal
 [Hardware Error]: flags: 0x01
 [Hardware Error]: primary
 [Hardware Error]: section_type: PCIe error
 [Hardware Error]: port_type: 0, PCIe end point
 [Hardware Error]: version: 1.0
 [Hardware Error]: command: 0x, status: 0x0010
 [Hardware Error]: device_id: :08:00.0
 [Hardware Error]: slot: 1
 [Hardware Error]: secondary_bus: 0x00
 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
 [Hardware Error]: class_code: 02
 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
 [Hardware Error]: Unsupported Request
 [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
 [Hardware Error]: aer_uncor_severity: 0x00067011
 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
 Kernel panic - not syncing: Fatal hardware error!
 Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
 Call Trace:
  NMI  [814f2fe5] ? panic+0xa0/0x168
  [812f919c] ? ghes_notify_nmi+0x17c/0x180
  [814f91d5] ? notifier_call_chain+0x55/0x80
  [814f923a] ? atomic_notifier_call_chain+0x1a/0x20
  [8109667e] ? notify_die+0x2e/0x30
  [814f6e81] ? do_nmi+0x1a1/0x2b0
  [814f6760] ? nmi+0x20/0x30
  [8103762b] ? native_safe_halt+0xb/0x10
  EOE  [8101495d] ? default_idle+0x4d/0xb0
  [81009e06] ? cpu_idle+0xb6/0x110
  [814da63a] ? rest_init+0x7a/0x80
  [81c1ff7b] ? start_kernel+0x424/0x430
  [81c1f33a] ? x86_64_start_reservations+0x125/0x129
  [81c1f438] ? x86_64_start_kernel+0xfa/0x109
 
 The root cause of the problem is that the 'reset_assigned_device()' code
 first writes a 0 to the command register. Then, when qemu subsequently does
 a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
 the kernel ends up calling '__msix_mask_irq()', which performs a write to
 the memory mapped msi vector space. Since, we've explicitly told the device
 to disallow mmio access (via the 0 write to the command register), we end
 up with the above 'Unsupported Request'.
 
 The fix here is to first disable MSI-X, before doing the reset.  We also
 disable MSI, leaving the device in INTx mode.  In this way, the device is
 a known state after reset, and we avoid touching msi memory mapped space
 on any subsequent 'kvm_deassign_irq()'.
 
 Thanks to Michael S. Tsirkin for help in understanding what was going on
 here and Jason Baron, the original debugger of this problem.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 ---
 
 Jason is out of the office for a couple weeks, so I'll try to resolve
 this while he's away.  Somehow the emulated config updates were lost
 in Jason's original posting, so I've fixed that and taken Jan's suggestion
 to simply call into the update functions instead of open coding the
 interrupt disable.  I think there still may be some disagreements about
 how to handle guest generated errors in the host, but that's a large
 project whereas this is something we should be doing at reset anyway,
 and even if only a workaround, resolves the problem above.

I don't think there's a disagreement: don't we all agree
they should be forwarded to qemu and on the guest if possible,
ignored if not?

  hw/device-assignment.c |   23 +++
  1 files changed, 23 insertions(+), 0 deletions(-)
 
 diff --git a/hw/device-assignment.c b/hw/device-assignment.c
 index 89823f1..2e6b93e 100644
 --- a/hw/device-assignment.c
 +++ b/hw/device-assignment.c
 @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev)
  const char reset[] = 1;
  int fd, ret;
  
 +/*
 + * If a guest is reset without being shutdown, MSI/MSI-X can still
 + * be running.  We want to return the device to a known state on
 + * reset, so disable those 

Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-05 Thread Alex Williamson
On Thu, 2012-04-05 at 12:34 +0300, Michael S. Tsirkin wrote:
 On Wed, Apr 04, 2012 at 09:42:32PM -0600, Alex Williamson wrote:
  We've hit a kernel host panic, when issuing a 'system_reset' with an
  82576 nic assigned and a Windows guest. Host system is a PowerEdge R815.
  
  [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 
  32993
  [Hardware Error]: APEI generic hardware error status
  [Hardware Error]: severity: 1, fatal
  [Hardware Error]: section: 0, severity: 1, fatal
  [Hardware Error]: flags: 0x01
  [Hardware Error]: primary
  [Hardware Error]: section_type: PCIe error
  [Hardware Error]: port_type: 0, PCIe end point
  [Hardware Error]: version: 1.0
  [Hardware Error]: command: 0x, status: 0x0010
  [Hardware Error]: device_id: :08:00.0
  [Hardware Error]: slot: 1
  [Hardware Error]: secondary_bus: 0x00
  [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
  [Hardware Error]: class_code: 02
  [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
  [Hardware Error]: Unsupported Request
  [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
  [Hardware Error]: aer_uncor_severity: 0x00067011
  [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
  [Hardware Error]: section: 1, severity: 1, fatal
  [Hardware Error]: flags: 0x01
  [Hardware Error]: primary
  [Hardware Error]: section_type: PCIe error
  [Hardware Error]: port_type: 0, PCIe end point
  [Hardware Error]: version: 1.0
  [Hardware Error]: command: 0x, status: 0x0010
  [Hardware Error]: device_id: :08:00.0
  [Hardware Error]: slot: 1
  [Hardware Error]: secondary_bus: 0x00
  [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
  [Hardware Error]: class_code: 02
  [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
  [Hardware Error]: Unsupported Request
  [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
  [Hardware Error]: aer_uncor_severity: 0x00067011
  [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
  Kernel panic - not syncing: Fatal hardware error!
  Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
  Call Trace:
   NMI  [814f2fe5] ? panic+0xa0/0x168
   [812f919c] ? ghes_notify_nmi+0x17c/0x180
   [814f91d5] ? notifier_call_chain+0x55/0x80
   [814f923a] ? atomic_notifier_call_chain+0x1a/0x20
   [8109667e] ? notify_die+0x2e/0x30
   [814f6e81] ? do_nmi+0x1a1/0x2b0
   [814f6760] ? nmi+0x20/0x30
   [8103762b] ? native_safe_halt+0xb/0x10
   EOE  [8101495d] ? default_idle+0x4d/0xb0
   [81009e06] ? cpu_idle+0xb6/0x110
   [814da63a] ? rest_init+0x7a/0x80
   [81c1ff7b] ? start_kernel+0x424/0x430
   [81c1f33a] ? x86_64_start_reservations+0x125/0x129
   [81c1f438] ? x86_64_start_kernel+0xfa/0x109
  
  The root cause of the problem is that the 'reset_assigned_device()' code
  first writes a 0 to the command register. Then, when qemu subsequently does
  a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
  the kernel ends up calling '__msix_mask_irq()', which performs a write to
  the memory mapped msi vector space. Since, we've explicitly told the device
  to disallow mmio access (via the 0 write to the command register), we end
  up with the above 'Unsupported Request'.
  
  The fix here is to first disable MSI-X, before doing the reset.  We also
  disable MSI, leaving the device in INTx mode.  In this way, the device is
  a known state after reset, and we avoid touching msi memory mapped space
  on any subsequent 'kvm_deassign_irq()'.
  
  Thanks to Michael S. Tsirkin for help in understanding what was going on
  here and Jason Baron, the original debugger of this problem.
  
  Signed-off-by: Alex Williamson alex.william...@redhat.com
  ---
  
  Jason is out of the office for a couple weeks, so I'll try to resolve
  this while he's away.  Somehow the emulated config updates were lost
  in Jason's original posting, so I've fixed that and taken Jan's suggestion
  to simply call into the update functions instead of open coding the
  interrupt disable.  I think there still may be some disagreements about
  how to handle guest generated errors in the host, but that's a large
  project whereas this is something we should be doing at reset anyway,
  and even if only a workaround, resolves the problem above.
 
 I don't think there's a disagreement: don't we all agree
 they should be forwarded to qemu and on the guest if possible,
 ignored if not?

With AER perhaps, but I'm not sure we have that flexibility with APEI,
but I'd need to read up on the spec.  It seems APEI defines that the
BIOS gets first shot at handling the error and gets to dictate how the
OS handles it.  In this case, deciding that it's fatal.

   hw/device-assignment.c |   23 +++
   1 files changed, 23 insertions(+), 0 deletions(-)
  
  diff --git a/hw/device-assignment.c 

Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-05 Thread Michael S. Tsirkin
On Thu, Apr 05, 2012 at 08:42:03AM -0600, Alex Williamson wrote:
  So far so good.
  
   + * We especially do not want MSI-X
   + * enabled since it lives in MMIO space, which is about to get
   + * disabled.
  
  I think we are better off dropping the above, because it's
  a bug that we disable MMIO space on the physical device:
  we did not enable it (kvm did) let's not disable.
 
 I'd like to investigate removing the command register clearing, but it
 did solve a bug at the time it went in.  That will be a separate patch,
 so we can remove the above sentence when that's removed.  It may be a
 bug, but it's describing the current behavior.

Fair enough.
Acked-by: Michael S. Tsirkin m...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-04 Thread Alex Williamson
We've hit a kernel host panic, when issuing a 'system_reset' with an
82576 nic assigned and a Windows guest. Host system is a PowerEdge R815.

[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993
[Hardware Error]: APEI generic hardware error status
[Hardware Error]: severity: 1, fatal
[Hardware Error]: section: 0, severity: 1, fatal
[Hardware Error]: flags: 0x01
[Hardware Error]: primary
[Hardware Error]: section_type: PCIe error
[Hardware Error]: port_type: 0, PCIe end point
[Hardware Error]: version: 1.0
[Hardware Error]: command: 0x, status: 0x0010
[Hardware Error]: device_id: :08:00.0
[Hardware Error]: slot: 1
[Hardware Error]: secondary_bus: 0x00
[Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
[Hardware Error]: class_code: 02
[Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
[Hardware Error]: Unsupported Request
[Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
[Hardware Error]: aer_uncor_severity: 0x00067011
[Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
[Hardware Error]: section: 1, severity: 1, fatal
[Hardware Error]: flags: 0x01
[Hardware Error]: primary
[Hardware Error]: section_type: PCIe error
[Hardware Error]: port_type: 0, PCIe end point
[Hardware Error]: version: 1.0
[Hardware Error]: command: 0x, status: 0x0010
[Hardware Error]: device_id: :08:00.0
[Hardware Error]: slot: 1
[Hardware Error]: secondary_bus: 0x00
[Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
[Hardware Error]: class_code: 02
[Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
[Hardware Error]: Unsupported Request
[Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
[Hardware Error]: aer_uncor_severity: 0x00067011
[Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
Kernel panic - not syncing: Fatal hardware error!
Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
Call Trace:
 NMI  [814f2fe5] ? panic+0xa0/0x168
 [812f919c] ? ghes_notify_nmi+0x17c/0x180
 [814f91d5] ? notifier_call_chain+0x55/0x80
 [814f923a] ? atomic_notifier_call_chain+0x1a/0x20
 [8109667e] ? notify_die+0x2e/0x30
 [814f6e81] ? do_nmi+0x1a1/0x2b0
 [814f6760] ? nmi+0x20/0x30
 [8103762b] ? native_safe_halt+0xb/0x10
 EOE  [8101495d] ? default_idle+0x4d/0xb0
 [81009e06] ? cpu_idle+0xb6/0x110
 [814da63a] ? rest_init+0x7a/0x80
 [81c1ff7b] ? start_kernel+0x424/0x430
 [81c1f33a] ? x86_64_start_reservations+0x125/0x129
 [81c1f438] ? x86_64_start_kernel+0xfa/0x109

The root cause of the problem is that the 'reset_assigned_device()' code
first writes a 0 to the command register. Then, when qemu subsequently does
a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
the kernel ends up calling '__msix_mask_irq()', which performs a write to
the memory mapped msi vector space. Since, we've explicitly told the device
to disallow mmio access (via the 0 write to the command register), we end
up with the above 'Unsupported Request'.

The fix here is to first disable MSI-X, before doing the reset.  We also
disable MSI, leaving the device in INTx mode.  In this way, the device is
a known state after reset, and we avoid touching msi memory mapped space
on any subsequent 'kvm_deassign_irq()'.

Thanks to Michael S. Tsirkin for help in understanding what was going on
here and Jason Baron, the original debugger of this problem.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

Jason is out of the office for a couple weeks, so I'll try to resolve
this while he's away.  Somehow the emulated config updates were lost
in Jason's original posting, so I've fixed that and taken Jan's suggestion
to simply call into the update functions instead of open coding the
interrupt disable.  I think there still may be some disagreements about
how to handle guest generated errors in the host, but that's a large
project whereas this is something we should be doing at reset anyway,
and even if only a workaround, resolves the problem above.

 hw/device-assignment.c |   23 +++
 1 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 89823f1..2e6b93e 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev)
 const char reset[] = 1;
 int fd, ret;
 
+/*
+ * If a guest is reset without being shutdown, MSI/MSI-X can still
+ * be running.  We want to return the device to a known state on
+ * reset, so disable those here.  We especially do not want MSI-X
+ * enabled since it lives in MMIO space, which is about to get
+ * disabled.
+ */
+if (adev-irq_requested_type  KVM_DEV_IRQ_GUEST_MSIX) {
+uint16_t ctrl = pci_get_word(pci_dev-config +
+ pci_dev-msix_cap +