Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
The discussion on this patch seems to have fizzled, with no clear short term solution. Jan gave a Reviewed-by and Michael an Acked-by for this patch. There's work left to do, but I request that we pull in this fix now. Thanks, Alex On Wed, 2012-04-04 at 21:42 -0600, Alex Williamson wrote: We've hit a kernel host panic, when issuing a 'system_reset' with an 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815. [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993 [Hardware Error]: APEI generic hardware error status [Hardware Error]: severity: 1, fatal [Hardware Error]: section: 0, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 [Hardware Error]: section: 1, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 Kernel panic - not syncing: Fatal hardware error! Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1 Call Trace: NMI [814f2fe5] ? panic+0xa0/0x168 [812f919c] ? ghes_notify_nmi+0x17c/0x180 [814f91d5] ? notifier_call_chain+0x55/0x80 [814f923a] ? atomic_notifier_call_chain+0x1a/0x20 [8109667e] ? notify_die+0x2e/0x30 [814f6e81] ? do_nmi+0x1a1/0x2b0 [814f6760] ? nmi+0x20/0x30 [8103762b] ? native_safe_halt+0xb/0x10 EOE [8101495d] ? default_idle+0x4d/0xb0 [81009e06] ? cpu_idle+0xb6/0x110 [814da63a] ? rest_init+0x7a/0x80 [81c1ff7b] ? start_kernel+0x424/0x430 [81c1f33a] ? x86_64_start_reservations+0x125/0x129 [81c1f438] ? x86_64_start_kernel+0xfa/0x109 The root cause of the problem is that the 'reset_assigned_device()' code first writes a 0 to the command register. Then, when qemu subsequently does a kvm_deassign_irq() (called by assign_irq(), in the system_reset path), the kernel ends up calling '__msix_mask_irq()', which performs a write to the memory mapped msi vector space. Since, we've explicitly told the device to disallow mmio access (via the 0 write to the command register), we end up with the above 'Unsupported Request'. The fix here is to first disable MSI-X, before doing the reset. We also disable MSI, leaving the device in INTx mode. In this way, the device is a known state after reset, and we avoid touching msi memory mapped space on any subsequent 'kvm_deassign_irq()'. Thanks to Michael S. Tsirkin for help in understanding what was going on here and Jason Baron, the original debugger of this problem. Signed-off-by: Alex Williamson alex.william...@redhat.com --- Jason is out of the office for a couple weeks, so I'll try to resolve this while he's away. Somehow the emulated config updates were lost in Jason's original posting, so I've fixed that and taken Jan's suggestion to simply call into the update functions instead of open coding the interrupt disable. I think there still may be some disagreements about how to handle guest generated errors in the host, but that's a large project whereas this is something we should be doing at reset anyway, and even if only a workaround, resolves the problem above. hw/device-assignment.c | 23 +++ 1 files changed, 23 insertions(+), 0 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 89823f1..2e6b93e 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev) const char reset[] = 1; int fd, ret; +/* + * If a guest is reset without being shutdown, MSI/MSI-X can still + * be
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 04/16/2012 05:03 PM, Alex Williamson wrote: The discussion on this patch seems to have fizzled, with no clear short term solution. Jan gave a Reviewed-by and Michael an Acked-by for this patch. There's work left to do, but I request that we pull in this fix now. Thanks, I agree (but am not committing this week). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote: The discussion on this patch seems to have fizzled, with no clear short term solution. I think we are in concensus, it's just that there are multiple bugs still left to fix. First, we need to prevent guest from touching command register except for the bus master bit. Something like the below? Compiled only. device-assignment: don't touch pci command register Real command register is under kernel control: it includes bits for triggering SERR, marking BARs as invalid and such which are under host kernel control. Don't touch any except bus master which is ok to put under guest control. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 89823f1..9ebce49 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, uint16_t r_seg, FILE *f; unsigned long long start, end, size, flags; uint16_t id; -struct stat statbuf; PCIRegion *rp; PCIDevRegions *dev = pci_dev-real_device; @@ -610,12 +609,8 @@ again: pci_dev-dev.config[2] = id 0xff; pci_dev-dev.config[3] = (id 0xff00) 8; -/* dealing with virtual function device */ -snprintf(name, sizeof(name), %sphysfn/, dir); -if (!stat(name, statbuf)) { -/* always provide the written value on readout */ -assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2); -} +/* Pass bus master writes to device. */ +pci_dev-emulate_config_write[PCI_COMMAND] = ~PCI_COMMAND_MASTER; dev-region_number = r; return 0; @@ -782,14 +777,6 @@ static int assign_device(AssignedDevice *dev) cause host memory corruption if the device issues DMA write requests!\n); } -if (dev-features ASSIGNED_DEVICE_SHARE_INTX_MASK -kvm_has_intx_set_mask()) { -assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3; - -/* hide host-side INTx masking from the guest */ -dev-emulate_config_read[PCI_COMMAND + 1] |= -PCI_COMMAND_INTX_DISABLE 8; -} r = kvm_assign_pci_device(kvm_state, assigned_dev_data); if (r 0) { @@ -1631,10 +1618,10 @@ static void reset_assigned_device(DeviceState *dev) } /* - * When a 0 is written to the command register, the device is logically + * When a 0 is written to the bus master register, the device is logically * disconnected from the PCI bus. This avoids further DMA transfers. */ -assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2); +assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1); } static int assigned_initfn(struct PCIDevice *pci_dev) @@ -1658,7 +1645,6 @@ static int assigned_initfn(struct PCIDevice *pci_dev) * device initialization. */ assigned_dev_emulate_config_read(dev, 0, PCI_CONFIG_SPACE_SIZE); -assigned_dev_direct_config_read(dev, PCI_COMMAND, 2); assigned_dev_direct_config_read(dev, PCI_STATUS, 2); assigned_dev_direct_config_read(dev, PCI_REVISION_ID, 1); assigned_dev_direct_config_read(dev, PCI_CLASS_PROG, 3); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 2012-04-16 17:06, Michael S. Tsirkin wrote: On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote: The discussion on this patch seems to have fizzled, with no clear short term solution. I think we are in concensus, it's just that there are multiple bugs still left to fix. First, we need to prevent guest from touching command register except for the bus master bit. Something like INTx is harmless as well. Denying access breaks masking for the guest. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Mon, Apr 16, 2012 at 05:10:07PM +0200, Jan Kiszka wrote: On 2012-04-16 17:06, Michael S. Tsirkin wrote: On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote: The discussion on this patch seems to have fizzled, with no clear short term solution. I think we are in concensus, it's just that there are multiple bugs still left to fix. First, we need to prevent guest from touching command register except for the bus master bit. Something like INTx is harmless as well. Denying access breaks masking for the guest. We currently mark intx as emulated, no? Anyway - care fixing it properly? Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Mon, Apr 16, 2012 at 06:06:40PM +0300, Michael S. Tsirkin wrote: On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote: The discussion on this patch seems to have fizzled, with no clear short term solution. I think we are in concensus, it's just that there are multiple bugs still left to fix. First, we need to prevent guest from touching command register except for the bus master bit. Something like the below? Compiled only. device-assignment: don't touch pci command register Real command register is under kernel control: it includes bits for triggering SERR, marking BARs as invalid and such which are under host kernel control. Don't touch any except bus master which is ok to put under guest control. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 89823f1..9ebce49 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, uint16_t r_seg, FILE *f; unsigned long long start, end, size, flags; uint16_t id; -struct stat statbuf; PCIRegion *rp; PCIDevRegions *dev = pci_dev-real_device; @@ -610,12 +609,8 @@ again: pci_dev-dev.config[2] = id 0xff; pci_dev-dev.config[3] = (id 0xff00) 8; -/* dealing with virtual function device */ -snprintf(name, sizeof(name), %sphysfn/, dir); -if (!stat(name, statbuf)) { -/* always provide the written value on readout */ -assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2); -} +/* Pass bus master writes to device. */ +pci_dev-emulate_config_write[PCI_COMMAND] = ~PCI_COMMAND_MASTER; dev-region_number = r; return 0; @@ -782,14 +777,6 @@ static int assign_device(AssignedDevice *dev) cause host memory corruption if the device issues DMA write requests!\n); } -if (dev-features ASSIGNED_DEVICE_SHARE_INTX_MASK -kvm_has_intx_set_mask()) { -assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3; - -/* hide host-side INTx masking from the guest */ -dev-emulate_config_read[PCI_COMMAND + 1] |= -PCI_COMMAND_INTX_DISABLE 8; -} r = kvm_assign_pci_device(kvm_state, assigned_dev_data); if (r 0) { @@ -1631,10 +1618,10 @@ static void reset_assigned_device(DeviceState *dev) } /* - * When a 0 is written to the command register, the device is logically + * When a 0 is written to the bus master register, the device is logically * disconnected from the PCI bus. This avoids further DMA transfers. */ -assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2); +assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1); } This is still going to disable mmio, is the intent to just clear the bus master bit, ie bit 2? Or is this patch meant to be in addition to the one Alex posted? Thanks, -Jason -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 2012-04-16 18:08, Michael S. Tsirkin wrote: On Mon, Apr 16, 2012 at 05:10:07PM +0200, Jan Kiszka wrote: On 2012-04-16 17:06, Michael S. Tsirkin wrote: On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote: The discussion on this patch seems to have fizzled, with no clear short term solution. I think we are in concensus, it's just that there are multiple bugs still left to fix. First, we need to prevent guest from touching command register except for the bus master bit. Something like INTx is harmless as well. Denying access breaks masking for the guest. We currently mark intx as emulated, no? If we use the mask also in the kernel for IRQ sharing, we *monitor* it. Anyway - care fixing it properly? There is nothing to fix. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Mon, Apr 16, 2012 at 06:13:03PM +0200, Jan Kiszka wrote: On 2012-04-16 18:08, Michael S. Tsirkin wrote: On Mon, Apr 16, 2012 at 05:10:07PM +0200, Jan Kiszka wrote: On 2012-04-16 17:06, Michael S. Tsirkin wrote: On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote: The discussion on this patch seems to have fizzled, with no clear short term solution. I think we are in concensus, it's just that there are multiple bugs still left to fix. First, we need to prevent guest from touching command register except for the bus master bit. Something like INTx is harmless as well. Denying access breaks masking for the guest. We currently mark intx as emulated, no? If we use the mask also in the kernel for IRQ sharing, we *monitor* it. Anyway - care fixing it properly? There is nothing to fix. Jan Guest access to command register is a bug. E.g. it has no business enabling SERR as SERR crashes host when triggered. -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 2012-04-16 18:36, Michael S. Tsirkin wrote: On Mon, Apr 16, 2012 at 06:13:03PM +0200, Jan Kiszka wrote: On 2012-04-16 18:08, Michael S. Tsirkin wrote: On Mon, Apr 16, 2012 at 05:10:07PM +0200, Jan Kiszka wrote: On 2012-04-16 17:06, Michael S. Tsirkin wrote: On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote: The discussion on this patch seems to have fizzled, with no clear short term solution. I think we are in concensus, it's just that there are multiple bugs still left to fix. First, we need to prevent guest from touching command register except for the bus master bit. Something like INTx is harmless as well. Denying access breaks masking for the guest. We currently mark intx as emulated, no? If we use the mask also in the kernel for IRQ sharing, we *monitor* it. Anyway - care fixing it properly? There is nothing to fix. Jan Guest access to command register is a bug. E.g. it has no business enabling SERR as SERR crashes host when triggered. Maybe a misunderstanding. I agree that other bits are not guest business. But INTx should be excluded from this filtering and handled as is, i.e. passed through. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Mon, Apr 16, 2012 at 06:38:22PM +0200, Jan Kiszka wrote: On 2012-04-16 18:36, Michael S. Tsirkin wrote: On Mon, Apr 16, 2012 at 06:13:03PM +0200, Jan Kiszka wrote: On 2012-04-16 18:08, Michael S. Tsirkin wrote: On Mon, Apr 16, 2012 at 05:10:07PM +0200, Jan Kiszka wrote: On 2012-04-16 17:06, Michael S. Tsirkin wrote: On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote: The discussion on this patch seems to have fizzled, with no clear short term solution. I think we are in concensus, it's just that there are multiple bugs still left to fix. First, we need to prevent guest from touching command register except for the bus master bit. Something like INTx is harmless as well. Denying access breaks masking for the guest. We currently mark intx as emulated, no? If we use the mask also in the kernel for IRQ sharing, we *monitor* it. Anyway - care fixing it properly? There is nothing to fix. Jan Guest access to command register is a bug. E.g. it has no business enabling SERR as SERR crashes host when triggered. Maybe a misunderstanding. I agree that other bits are not guest business. But INTx should be excluded from this filtering and handled as is, i.e. passed through. Jan Ah, I see what you mean. So let's add pci_dev-emulate_config_write[PCI_COMMAND + 1] = ~(PCI_COMMAND_INTX_DISABLE 8); on top? -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Mon, Apr 16, 2012 at 12:12:52PM -0400, Jason Baron wrote: On Mon, Apr 16, 2012 at 06:06:40PM +0300, Michael S. Tsirkin wrote: On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote: The discussion on this patch seems to have fizzled, with no clear short term solution. I think we are in concensus, it's just that there are multiple bugs still left to fix. First, we need to prevent guest from touching command register except for the bus master bit. Something like the below? Compiled only. device-assignment: don't touch pci command register Real command register is under kernel control: it includes bits for triggering SERR, marking BARs as invalid and such which are under host kernel control. Don't touch any except bus master which is ok to put under guest control. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 89823f1..9ebce49 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, uint16_t r_seg, FILE *f; unsigned long long start, end, size, flags; uint16_t id; -struct stat statbuf; PCIRegion *rp; PCIDevRegions *dev = pci_dev-real_device; @@ -610,12 +609,8 @@ again: pci_dev-dev.config[2] = id 0xff; pci_dev-dev.config[3] = (id 0xff00) 8; -/* dealing with virtual function device */ -snprintf(name, sizeof(name), %sphysfn/, dir); -if (!stat(name, statbuf)) { -/* always provide the written value on readout */ -assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2); -} +/* Pass bus master writes to device. */ +pci_dev-emulate_config_write[PCI_COMMAND] = ~PCI_COMMAND_MASTER; dev-region_number = r; return 0; @@ -782,14 +777,6 @@ static int assign_device(AssignedDevice *dev) cause host memory corruption if the device issues DMA write requests!\n); } -if (dev-features ASSIGNED_DEVICE_SHARE_INTX_MASK -kvm_has_intx_set_mask()) { -assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3; - -/* hide host-side INTx masking from the guest */ -dev-emulate_config_read[PCI_COMMAND + 1] |= -PCI_COMMAND_INTX_DISABLE 8; -} r = kvm_assign_pci_device(kvm_state, assigned_dev_data); if (r 0) { @@ -1631,10 +1618,10 @@ static void reset_assigned_device(DeviceState *dev) } /* - * When a 0 is written to the command register, the device is logically + * When a 0 is written to the bus master register, the device is logically * disconnected from the PCI bus. This avoids further DMA transfers. */ -assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2); +assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1); } This is still going to disable mmio, I think it won't since all other bits are marked as emulated now. is the intent to just clear the bus master bit, ie bit 2? Or is this patch meant to be in addition to the one Alex posted? Thanks, -Jason -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 2012-04-16 19:12, Michael S. Tsirkin wrote: On Mon, Apr 16, 2012 at 06:38:22PM +0200, Jan Kiszka wrote: On 2012-04-16 18:36, Michael S. Tsirkin wrote: On Mon, Apr 16, 2012 at 06:13:03PM +0200, Jan Kiszka wrote: On 2012-04-16 18:08, Michael S. Tsirkin wrote: On Mon, Apr 16, 2012 at 05:10:07PM +0200, Jan Kiszka wrote: On 2012-04-16 17:06, Michael S. Tsirkin wrote: On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote: The discussion on this patch seems to have fizzled, with no clear short term solution. I think we are in concensus, it's just that there are multiple bugs still left to fix. First, we need to prevent guest from touching command register except for the bus master bit. Something like INTx is harmless as well. Denying access breaks masking for the guest. We currently mark intx as emulated, no? If we use the mask also in the kernel for IRQ sharing, we *monitor* it. Anyway - care fixing it properly? There is nothing to fix. Jan Guest access to command register is a bug. E.g. it has no business enabling SERR as SERR crashes host when triggered. Maybe a misunderstanding. I agree that other bits are not guest business. But INTx should be excluded from this filtering and handled as is, i.e. passed through. Jan Ah, I see what you mean. So let's add pci_dev-emulate_config_write[PCI_COMMAND + 1] = ~(PCI_COMMAND_INTX_DISABLE 8); on top? Looks good. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Mon, 2012-04-16 at 18:06 +0300, Michael S. Tsirkin wrote: On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote: The discussion on this patch seems to have fizzled, with no clear short term solution. I think we are in concensus, it's just that there are multiple bugs still left to fix. First, we need to prevent guest from touching command register except for the bus master bit. Something like the below? Compiled only. device-assignment: don't touch pci command register Real command register is under kernel control: it includes bits for triggering SERR, marking BARs as invalid and such which are under host kernel control. Don't touch any except bus master which is ok to put under guest control. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 89823f1..9ebce49 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, uint16_t r_seg, FILE *f; unsigned long long start, end, size, flags; uint16_t id; -struct stat statbuf; PCIRegion *rp; PCIDevRegions *dev = pci_dev-real_device; @@ -610,12 +609,8 @@ again: pci_dev-dev.config[2] = id 0xff; pci_dev-dev.config[3] = (id 0xff00) 8; -/* dealing with virtual function device */ -snprintf(name, sizeof(name), %sphysfn/, dir); -if (!stat(name, statbuf)) { -/* always provide the written value on readout */ -assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2); -} +/* Pass bus master writes to device. */ +pci_dev-emulate_config_write[PCI_COMMAND] = ~PCI_COMMAND_MASTER; dev-region_number = r; return 0; @@ -782,14 +777,6 @@ static int assign_device(AssignedDevice *dev) cause host memory corruption if the device issues DMA write requests!\n); } -if (dev-features ASSIGNED_DEVICE_SHARE_INTX_MASK -kvm_has_intx_set_mask()) { -assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3; This doesn't look right ^^^, we'll never make use of host side INTx disable support that way. Thanks, Alex - -/* hide host-side INTx masking from the guest */ -dev-emulate_config_read[PCI_COMMAND + 1] |= -PCI_COMMAND_INTX_DISABLE 8; -} r = kvm_assign_pci_device(kvm_state, assigned_dev_data); if (r 0) { @@ -1631,10 +1618,10 @@ static void reset_assigned_device(DeviceState *dev) } /* - * When a 0 is written to the command register, the device is logically + * When a 0 is written to the bus master register, the device is logically * disconnected from the PCI bus. This avoids further DMA transfers. */ -assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2); +assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1); } static int assigned_initfn(struct PCIDevice *pci_dev) @@ -1658,7 +1645,6 @@ static int assigned_initfn(struct PCIDevice *pci_dev) * device initialization. */ assigned_dev_emulate_config_read(dev, 0, PCI_CONFIG_SPACE_SIZE); -assigned_dev_direct_config_read(dev, PCI_COMMAND, 2); assigned_dev_direct_config_read(dev, PCI_STATUS, 2); assigned_dev_direct_config_read(dev, PCI_REVISION_ID, 1); assigned_dev_direct_config_read(dev, PCI_CLASS_PROG, 3); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Mon, Apr 16, 2012 at 01:07:56PM -0600, Alex Williamson wrote: On Mon, 2012-04-16 at 18:06 +0300, Michael S. Tsirkin wrote: On Mon, Apr 16, 2012 at 08:03:17AM -0600, Alex Williamson wrote: The discussion on this patch seems to have fizzled, with no clear short term solution. I think we are in concensus, it's just that there are multiple bugs still left to fix. First, we need to prevent guest from touching command register except for the bus master bit. Something like the below? Compiled only. device-assignment: don't touch pci command register Real command register is under kernel control: it includes bits for triggering SERR, marking BARs as invalid and such which are under host kernel control. Don't touch any except bus master which is ok to put under guest control. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 89823f1..9ebce49 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, uint16_t r_seg, FILE *f; unsigned long long start, end, size, flags; uint16_t id; -struct stat statbuf; PCIRegion *rp; PCIDevRegions *dev = pci_dev-real_device; @@ -610,12 +609,8 @@ again: pci_dev-dev.config[2] = id 0xff; pci_dev-dev.config[3] = (id 0xff00) 8; -/* dealing with virtual function device */ -snprintf(name, sizeof(name), %sphysfn/, dir); -if (!stat(name, statbuf)) { -/* always provide the written value on readout */ -assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2); -} +/* Pass bus master writes to device. */ +pci_dev-emulate_config_write[PCI_COMMAND] = ~PCI_COMMAND_MASTER; dev-region_number = r; return 0; @@ -782,14 +777,6 @@ static int assign_device(AssignedDevice *dev) cause host memory corruption if the device issues DMA write requests!\n); } -if (dev-features ASSIGNED_DEVICE_SHARE_INTX_MASK -kvm_has_intx_set_mask()) { -assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3; This doesn't look right ^^^, we'll never make use of host side INTx disable support that way. Thanks, Alex Right. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Wed, Apr 04, 2012 at 09:42:32PM -0600, Alex Williamson wrote: We've hit a kernel host panic, when issuing a 'system_reset' with an 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815. [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993 [Hardware Error]: APEI generic hardware error status [Hardware Error]: severity: 1, fatal [Hardware Error]: section: 0, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 [Hardware Error]: section: 1, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 Kernel panic - not syncing: Fatal hardware error! Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1 Call Trace: NMI [814f2fe5] ? panic+0xa0/0x168 [812f919c] ? ghes_notify_nmi+0x17c/0x180 [814f91d5] ? notifier_call_chain+0x55/0x80 [814f923a] ? atomic_notifier_call_chain+0x1a/0x20 [8109667e] ? notify_die+0x2e/0x30 [814f6e81] ? do_nmi+0x1a1/0x2b0 [814f6760] ? nmi+0x20/0x30 [8103762b] ? native_safe_halt+0xb/0x10 EOE [8101495d] ? default_idle+0x4d/0xb0 [81009e06] ? cpu_idle+0xb6/0x110 [814da63a] ? rest_init+0x7a/0x80 [81c1ff7b] ? start_kernel+0x424/0x430 [81c1f33a] ? x86_64_start_reservations+0x125/0x129 [81c1f438] ? x86_64_start_kernel+0xfa/0x109 The root cause of the problem is that the 'reset_assigned_device()' code first writes a 0 to the command register. Then, when qemu subsequently does a kvm_deassign_irq() (called by assign_irq(), in the system_reset path), the kernel ends up calling '__msix_mask_irq()', which performs a write to the memory mapped msi vector space. Since, we've explicitly told the device to disallow mmio access (via the 0 write to the command register), we end up with the above 'Unsupported Request'. The fix here is to first disable MSI-X, before doing the reset. We also disable MSI, leaving the device in INTx mode. In this way, the device is a known state after reset, and we avoid touching msi memory mapped space on any subsequent 'kvm_deassign_irq()'. Thanks to Michael S. Tsirkin for help in understanding what was going on here and Jason Baron, the original debugger of this problem. Signed-off-by: Alex Williamson alex.william...@redhat.com Applied, thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Mon, 2012-04-09 at 11:35 +0300, Avi Kivity wrote: On 04/08/2012 08:37 PM, Jan Kiszka wrote: The core problem is not the ordering. The problem is that the kernel is susceptible to ordering mistakes of userspace. And that is because the kernel panics on PCI errors of devices that are in user hands - a critical kernel bug IMHO. Certainly. But this userspace patch won't fix it. No, it won't in general and I don't think it makes sense to mangle pci-sysfs config space support to the nuances of a user space driver. We really need a userspace driver interface that limits the config space interactions and provides a channel to support error reporting and userspace recovery. This type of thing can be done with VFIO if we could ever get off the ground and get some consensus around it. Please feel free to contribute to that discussion if you ever want to get away from this clunky device assignment interface we have now. Proper reset of MSI or even the whole PCI config space is another issue, but one the kernel should not worry about - still, it should be fixed (therefore this patch). And I was asking what is the right way to do it. Reset the device and read back the register values, or do an emulated reset and push down the register values. Reading back the register values is currently a noop since the kernel restores the config space to the incoming state after reset. KVM does stash away the original config space of the device to be restored prior to releasing the device. We could restore to that each time, but that would mean implementing a device reset ioctl in kvm, and we'd still need this patch for compatibility and we still have the issues Michael brings up with the config restore updating things like MSI that we need to then manually sync with kvm. I fear suggesting it, but perhaps another way to achieve this result would be to de-assign and re-assign the device in reset. But even if we disallowed userland to disable MMIO and PIO access to the device, we would be be able to exclude that there are secrete channels in the device's interface having the same effect. So we likely need to enhance PCI error handling to catch and handle faults for certain devices differently - those we cannot trust to behave properly while they are under userland/guest control. Why not all of them? I think Jan is probably suggesting that we do need user space error handling for all userland/guest controlled devices, but some classes of errors on certain devices may be handled automatically by the userspace interface layer... which we could do with VFIO (well, assuming the APEI spec let's us nak the bios reporting a fatal error). So do we want to invent new solutions for each of these or do we want to move to a new interface? Thanks, Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 04/08/2012 08:37 PM, Jan Kiszka wrote: The core problem is not the ordering. The problem is that the kernel is susceptible to ordering mistakes of userspace. And that is because the kernel panics on PCI errors of devices that are in user hands - a critical kernel bug IMHO. Certainly. But this userspace patch won't fix it. Proper reset of MSI or even the whole PCI config space is another issue, but one the kernel should not worry about - still, it should be fixed (therefore this patch). And I was asking what is the right way to do it. Reset the device and read back the register values, or do an emulated reset and push down the register values. But even if we disallowed userland to disable MMIO and PIO access to the device, we would be be able to exclude that there are secrete channels in the device's interface having the same effect. So we likely need to enhance PCI error handling to catch and handle faults for certain devices differently - those we cannot trust to behave properly while they are under userland/guest control. Why not all of them? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 04/05/2012 06:42 AM, Alex Williamson wrote: We've hit a kernel host panic, when issuing a 'system_reset' with an 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815. [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993 [Hardware Error]: APEI generic hardware error status [Hardware Error]: severity: 1, fatal [Hardware Error]: section: 0, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 [Hardware Error]: section: 1, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 Kernel panic - not syncing: Fatal hardware error! Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1 Call Trace: NMI [814f2fe5] ? panic+0xa0/0x168 [812f919c] ? ghes_notify_nmi+0x17c/0x180 [814f91d5] ? notifier_call_chain+0x55/0x80 [814f923a] ? atomic_notifier_call_chain+0x1a/0x20 [8109667e] ? notify_die+0x2e/0x30 [814f6e81] ? do_nmi+0x1a1/0x2b0 [814f6760] ? nmi+0x20/0x30 [8103762b] ? native_safe_halt+0xb/0x10 EOE [8101495d] ? default_idle+0x4d/0xb0 [81009e06] ? cpu_idle+0xb6/0x110 [814da63a] ? rest_init+0x7a/0x80 [81c1ff7b] ? start_kernel+0x424/0x430 [81c1f33a] ? x86_64_start_reservations+0x125/0x129 [81c1f438] ? x86_64_start_kernel+0xfa/0x109 The root cause of the problem is that the 'reset_assigned_device()' code first writes a 0 to the command register. Then, when qemu subsequently does a kvm_deassign_irq() (called by assign_irq(), in the system_reset path), the kernel ends up calling '__msix_mask_irq()', which performs a write to the memory mapped msi vector space. Since, we've explicitly told the device to disallow mmio access (via the 0 write to the command register), we end up with the above 'Unsupported Request'. The fix here is to first disable MSI-X, before doing the reset. We also disable MSI, leaving the device in INTx mode. In this way, the device is a known state after reset, and we avoid touching msi memory mapped space on any subsequent 'kvm_deassign_irq()'. Thanks to Michael S. Tsirkin for help in understanding what was going on here and Jason Baron, the original debugger of this problem. Signed-off-by: Alex Williamson alex.william...@redhat.com --- Jason is out of the office for a couple weeks, so I'll try to resolve this while he's away. Somehow the emulated config updates were lost in Jason's original posting, so I've fixed that and taken Jan's suggestion to simply call into the update functions instead of open coding the interrupt disable. I think there still may be some disagreements about how to handle guest generated errors in the host, but that's a large project whereas this is something we should be doing at reset anyway, and even if only a workaround, resolves the problem above. hw/device-assignment.c | 23 +++ 1 files changed, 23 insertions(+), 0 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 89823f1..2e6b93e 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev) const char reset[] = 1; int fd, ret; +/* + * If a guest is reset without being shutdown, MSI/MSI-X can still + * be running. We want to return the device to a known state on + * reset, so disable those here. We especially do not want MSI-X + * enabled since it lives in MMIO space, which is about to get + * disabled. + */ +if
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Sun, Apr 08, 2012 at 04:14:29PM +0300, Avi Kivity wrote: On 04/05/2012 06:42 AM, Alex Williamson wrote: We've hit a kernel host panic, when issuing a 'system_reset' with an 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815. [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993 [Hardware Error]: APEI generic hardware error status [Hardware Error]: severity: 1, fatal [Hardware Error]: section: 0, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 [Hardware Error]: section: 1, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 Kernel panic - not syncing: Fatal hardware error! Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1 Call Trace: NMI [814f2fe5] ? panic+0xa0/0x168 [812f919c] ? ghes_notify_nmi+0x17c/0x180 [814f91d5] ? notifier_call_chain+0x55/0x80 [814f923a] ? atomic_notifier_call_chain+0x1a/0x20 [8109667e] ? notify_die+0x2e/0x30 [814f6e81] ? do_nmi+0x1a1/0x2b0 [814f6760] ? nmi+0x20/0x30 [8103762b] ? native_safe_halt+0xb/0x10 EOE [8101495d] ? default_idle+0x4d/0xb0 [81009e06] ? cpu_idle+0xb6/0x110 [814da63a] ? rest_init+0x7a/0x80 [81c1ff7b] ? start_kernel+0x424/0x430 [81c1f33a] ? x86_64_start_reservations+0x125/0x129 [81c1f438] ? x86_64_start_kernel+0xfa/0x109 The root cause of the problem is that the 'reset_assigned_device()' code first writes a 0 to the command register. Then, when qemu subsequently does a kvm_deassign_irq() (called by assign_irq(), in the system_reset path), the kernel ends up calling '__msix_mask_irq()', which performs a write to the memory mapped msi vector space. Since, we've explicitly told the device to disallow mmio access (via the 0 write to the command register), we end up with the above 'Unsupported Request'. The fix here is to first disable MSI-X, before doing the reset. We also disable MSI, leaving the device in INTx mode. In this way, the device is a known state after reset, and we avoid touching msi memory mapped space on any subsequent 'kvm_deassign_irq()'. Thanks to Michael S. Tsirkin for help in understanding what was going on here and Jason Baron, the original debugger of this problem. Signed-off-by: Alex Williamson alex.william...@redhat.com --- Jason is out of the office for a couple weeks, so I'll try to resolve this while he's away. Somehow the emulated config updates were lost in Jason's original posting, so I've fixed that and taken Jan's suggestion to simply call into the update functions instead of open coding the interrupt disable. I think there still may be some disagreements about how to handle guest generated errors in the host, but that's a large project whereas this is something we should be doing at reset anyway, and even if only a workaround, resolves the problem above. hw/device-assignment.c | 23 +++ 1 files changed, 23 insertions(+), 0 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 89823f1..2e6b93e 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev) const char reset[] = 1; int fd, ret; +/* + * If a guest is reset without being shutdown, MSI/MSI-X can still + * be running. We want to return the device to a known state on + * reset, so disable those
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote: Don't we FLR the device, which ought to disable MSI on the real device? AFAIK we call pci_reset, which saves device state, does an FLR and then restores the state. I think this might include msi as well. Then that is wrong as well, no? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote: On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote: Don't we FLR the device, which ought to disable MSI on the real device? AFAIK we call pci_reset, which saves device state, does an FLR and then restores the state. I think this might include msi as well. Then that is wrong as well, no? Not as such assuming we disable msi/msix first :) -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote: On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote: Don't we FLR the device, which ought to disable MSI on the real device? AFAIK we call pci_reset, which saves device state, does an FLR and then restores the state. I think this might include msi as well. Then that is wrong as well, no? Not as such assuming we disable msi/msix first :) I think we need to fix both, no? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote: On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote: On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote: Don't we FLR the device, which ought to disable MSI on the real device? AFAIK we call pci_reset, which saves device state, does an FLR and then restores the state. I think this might include msi as well. Then that is wrong as well, no? Not as such assuming we disable msi/msix first :) I think we need to fix both, no? Isn't this what this patch does? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote: On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote: On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote: Don't we FLR the device, which ought to disable MSI on the real device? AFAIK we call pci_reset, which saves device state, does an FLR and then restores the state. I think this might include msi as well. Then that is wrong as well, no? Not as such assuming we disable msi/msix first :) I think we need to fix both, no? Isn't this what this patch does? If we change pci_reset() (or a variant that we call) to reset MSI, and update qemu to synchronize from the device after pci_reset(), then we achieve the same result, in a different way. Since reset can change other config space registers, we achieve correctness for more of them. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote: On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote: On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote: On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote: Don't we FLR the device, which ought to disable MSI on the real device? AFAIK we call pci_reset, which saves device state, does an FLR and then restores the state. I think this might include msi as well. Then that is wrong as well, no? Not as such assuming we disable msi/msix first :) I think we need to fix both, no? Isn't this what this patch does? If we change pci_reset() (or a variant that we call) to reset MSI, and update qemu to synchronize from the device after pci_reset(), then we achieve the same result, in a different way. MSI vectors are set up by kvm in the host. So we should not abruptly drop that by a sysfs write: would need to synchronise with kvm. Once we do, there's nothing left for pci_reset to do. Since reset can change other config space registers, we achieve correctness for more of them. Which other registers do you have in mind? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 04/08/2012 04:53 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote: On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote: On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote: On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote: Don't we FLR the device, which ought to disable MSI on the real device? AFAIK we call pci_reset, which saves device state, does an FLR and then restores the state. I think this might include msi as well. Then that is wrong as well, no? Not as such assuming we disable msi/msix first :) I think we need to fix both, no? Isn't this what this patch does? If we change pci_reset() (or a variant that we call) to reset MSI, and update qemu to synchronize from the device after pci_reset(), then we achieve the same result, in a different way. MSI vectors are set up by kvm in the host. So we should not abruptly drop that by a sysfs write: would need to synchronise with kvm. Once we do, there's nothing left for pci_reset to do. I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) set emulated register to hardware value Since reset can change other config space registers, we achieve correctness for more of them. Which other registers do you have in mind? BARs for example. We may have our own reset for this, but isn't copying the hardware values more trustworthy? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Sun, Apr 08, 2012 at 05:01:36PM +0300, Avi Kivity wrote: On 04/08/2012 04:53 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote: On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote: On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote: On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote: Don't we FLR the device, which ought to disable MSI on the real device? AFAIK we call pci_reset, which saves device state, does an FLR and then restores the state. I think this might include msi as well. Then that is wrong as well, no? Not as such assuming we disable msi/msix first :) I think we need to fix both, no? Isn't this what this patch does? If we change pci_reset() (or a variant that we call) to reset MSI, and update qemu to synchronize from the device after pci_reset(), then we achieve the same result, in a different way. MSI vectors are set up by kvm in the host. So we should not abruptly drop that by a sysfs write: would need to synchronise with kvm. Once we do, there's nothing left for pci_reset to do. I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) If we do it this way we get back the problem this patch is trying to solve: MSIX assigned while device memory is disabled would cause unsupported request errors. set emulated register to hardware value Yes, I see what you are trying to say now. Unfortunately that's not enough: we also need to restore the registers afterwards for device to become useful again. Doing this in kernel seems more robust, otherwise we risk losing the device if qemu gets killed before it has restored the registers. Since reset can change other config space registers, we achieve correctness for more of them. Which other registers do you have in mind? BARs for example. We may have our own reset for this, but isn't copying the hardware values more trustworthy? BAR values in host and guest are unrelated. If pci_reset didn't restore BAR values we won't be able to operate the device. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 04/08/2012 05:42 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 05:01:36PM +0300, Avi Kivity wrote: On 04/08/2012 04:53 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote: On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote: On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote: On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote: Don't we FLR the device, which ought to disable MSI on the real device? AFAIK we call pci_reset, which saves device state, does an FLR and then restores the state. I think this might include msi as well. Then that is wrong as well, no? Not as such assuming we disable msi/msix first :) I think we need to fix both, no? Isn't this what this patch does? If we change pci_reset() (or a variant that we call) to reset MSI, and update qemu to synchronize from the device after pci_reset(), then we achieve the same result, in a different way. MSI vectors are set up by kvm in the host. So we should not abruptly drop that by a sysfs write: would need to synchronise with kvm. Once we do, there's nothing left for pci_reset to do. I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) If we do it this way we get back the problem this patch is trying to solve: MSIX assigned while device memory is disabled would cause unsupported request errors. Why is that? FLR would presumably disable MSI in the device, and this line would disable it in kvm as well. set emulated register to hardware value Yes, I see what you are trying to say now. Unfortunately that's not enough: we also need to restore the registers afterwards for device to become useful again. I guess this is correct for the MSIX BAR. But is it correct for MSIX enable/disable? Doing this in kernel seems more robust, otherwise we risk losing the device if qemu gets killed before it has restored the registers. Doesn't the driver have to enable MSIX if it attaches to the device at that point, anyway? Since reset can change other config space registers, we achieve correctness for more of them. Which other registers do you have in mind? BARs for example. We may have our own reset for this, but isn't copying the hardware values more trustworthy? BAR values in host and guest are unrelated. If pci_reset didn't restore BAR values we won't be able to operate the device. Right. I guess candidates are those that are initialized with assigned_dev_emulate_config_*()? Hard to see which ones because they're mass initialized. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Sun, Apr 08, 2012 at 06:26:28PM +0300, Avi Kivity wrote: On 04/08/2012 05:42 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 05:01:36PM +0300, Avi Kivity wrote: On 04/08/2012 04:53 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote: On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote: On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote: On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote: Don't we FLR the device, which ought to disable MSI on the real device? AFAIK we call pci_reset, which saves device state, does an FLR and then restores the state. I think this might include msi as well. Then that is wrong as well, no? Not as such assuming we disable msi/msix first :) I think we need to fix both, no? Isn't this what this patch does? If we change pci_reset() (or a variant that we call) to reset MSI, and update qemu to synchronize from the device after pci_reset(), then we achieve the same result, in a different way. MSI vectors are set up by kvm in the host. So we should not abruptly drop that by a sysfs write: would need to synchronise with kvm. Once we do, there's nothing left for pci_reset to do. I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) If we do it this way we get back the problem this patch is trying to solve: MSIX assigned while device memory is disabled would cause unsupported request errors. Why is that? FLR would presumably disable MSI in the device, and this line would disable it in kvm as well. The bug is that device memory is disabled (FLR would do that) while MSI is enabled in kvm. The fix is to disable MSI in kvm first. set emulated register to hardware value Yes, I see what you are trying to say now. Unfortunately that's not enough: we also need to restore the registers afterwards for device to become useful again. I guess this is correct for the MSIX BAR. But is it correct for MSIX enable/disable? Probably not. Doing this in kernel seems more robust, otherwise we risk losing the device if qemu gets killed before it has restored the registers. Doesn't the driver have to enable MSIX if it attaches to the device at that point, anyway? Yes. I'm talking about things like enabling memory, setting up irq register, etc though. Most of this setup is done by bios. Since reset can change other config space registers, we achieve correctness for more of them. Which other registers do you have in mind? BARs for example. We may have our own reset for this, but isn't copying the hardware values more trustworthy? BAR values in host and guest are unrelated. If pci_reset didn't restore BAR values we won't be able to operate the device. Right. I guess candidates are those that are initialized with assigned_dev_emulate_config_*()? Hard to see which ones because they're mass initialized. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote: I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) If we do it this way we get back the problem this patch is trying to solve: MSIX assigned while device memory is disabled would cause unsupported request errors. Why is that? FLR would presumably disable MSI in the device, and this line would disable it in kvm as well. The bug is that device memory is disabled (FLR would do that) while MSI is enabled in kvm. The fix is to disable MSI in kvm first. Yes, no need to repeat. My question is whether my pseudo-code does the same and whether or not if it is better (when applied to all emulated config space). Doing this in kernel seems more robust, otherwise we risk losing the device if qemu gets killed before it has restored the registers. Doesn't the driver have to enable MSIX if it attaches to the device at that point, anyway? Yes. I'm talking about things like enabling memory, setting up irq register, etc though. Most of this setup is done by bios. I see. So should we have a pci_reset_function() variant that limits itself to restoring just those bits? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote: On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote: I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) If we do it this way we get back the problem this patch is trying to solve: MSIX assigned while device memory is disabled would cause unsupported request errors. Why is that? FLR would presumably disable MSI in the device, and this line would disable it in kvm as well. The bug is that device memory is disabled (FLR would do that) while MSI is enabled in kvm. The fix is to disable MSI in kvm first. Yes, no need to repeat. My question is whether my pseudo-code does the same It doesn't seem to: FLR (disabling memory) is followed by MSI disable in kvm instead of the reverse. and whether or not if it is better (when applied to all emulated config space). I'm not sure. I would like to see an example of a register that you have in mind. Doing this in kernel seems more robust, otherwise we risk losing the device if qemu gets killed before it has restored the registers. Doesn't the driver have to enable MSIX if it attaches to the device at that point, anyway? Yes. I'm talking about things like enabling memory, setting up irq register, etc though. Most of this setup is done by bios. I see. So should we have a pci_reset_function() variant that limits itself to restoring just those bits? We only need kernel to restore whatever qemu emulates, but kernel doesn't know what that is. What kind of interface do you have in mind? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote: On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote: I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) If we do it this way we get back the problem this patch is trying to solve: MSIX assigned while device memory is disabled would cause unsupported request errors. Why is that? FLR would presumably disable MSI in the device, and this line would disable it in kvm as well. The bug is that device memory is disabled (FLR would do that) while MSI is enabled in kvm. The fix is to disable MSI in kvm first. Yes, no need to repeat. My question is whether my pseudo-code does the same It doesn't seem to: FLR (disabling memory) is followed by MSI disable in kvm instead of the reverse. Ah, so the problem is the ordering? I see. and whether or not if it is better (when applied to all emulated config space). I'm not sure. I would like to see an example of a register that you have in mind. I went over the PCI registers and saw none that would be affected. Yes. I'm talking about things like enabling memory, setting up irq register, etc though. Most of this setup is done by bios. I see. So should we have a pci_reset_function() variant that limits itself to restoring just those bits? We only need kernel to restore whatever qemu emulates, but kernel doesn't know what that is. What kind of interface do you have in mind? The same as pci_reset_function(), but leaves MSI clear. I guess it's not worth it if the ordering problem is there. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 2012-04-08 18:08, Avi Kivity wrote: On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote: On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote: I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) If we do it this way we get back the problem this patch is trying to solve: MSIX assigned while device memory is disabled would cause unsupported request errors. Why is that? FLR would presumably disable MSI in the device, and this line would disable it in kvm as well. The bug is that device memory is disabled (FLR would do that) while MSI is enabled in kvm. The fix is to disable MSI in kvm first. Yes, no need to repeat. My question is whether my pseudo-code does the same It doesn't seem to: FLR (disabling memory) is followed by MSI disable in kvm instead of the reverse. Ah, so the problem is the ordering? I see. and whether or not if it is better (when applied to all emulated config space). I'm not sure. I would like to see an example of a register that you have in mind. I went over the PCI registers and saw none that would be affected. Yes. I'm talking about things like enabling memory, setting up irq register, etc though. Most of this setup is done by bios. I see. So should we have a pci_reset_function() variant that limits itself to restoring just those bits? We only need kernel to restore whatever qemu emulates, but kernel doesn't know what that is. What kind of interface do you have in mind? The same as pci_reset_function(), but leaves MSI clear. I guess it's not worth it if the ordering problem is there. The core problem is not the ordering. The problem is that the kernel is susceptible to ordering mistakes of userspace. And that is because the kernel panics on PCI errors of devices that are in user hands - a critical kernel bug IMHO. Proper reset of MSI or even the whole PCI config space is another issue, but one the kernel should not worry about - still, it should be fixed (therefore this patch). But even if we disallowed userland to disable MMIO and PIO access to the device, we would be be able to exclude that there are secrete channels in the device's interface having the same effect. So we likely need to enhance PCI error handling to catch and handle faults for certain devices differently - those we cannot trust to behave properly while they are under userland/guest control. Jan signature.asc Description: OpenPGP digital signature
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Sun, Apr 08, 2012 at 07:37:57PM +0200, Jan Kiszka wrote: On 2012-04-08 18:08, Avi Kivity wrote: On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote: On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote: I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) If we do it this way we get back the problem this patch is trying to solve: MSIX assigned while device memory is disabled would cause unsupported request errors. Why is that? FLR would presumably disable MSI in the device, and this line would disable it in kvm as well. The bug is that device memory is disabled (FLR would do that) while MSI is enabled in kvm. The fix is to disable MSI in kvm first. Yes, no need to repeat. My question is whether my pseudo-code does the same It doesn't seem to: FLR (disabling memory) is followed by MSI disable in kvm instead of the reverse. Ah, so the problem is the ordering? I see. and whether or not if it is better (when applied to all emulated config space). I'm not sure. I would like to see an example of a register that you have in mind. I went over the PCI registers and saw none that would be affected. Yes. I'm talking about things like enabling memory, setting up irq register, etc though. Most of this setup is done by bios. I see. So should we have a pci_reset_function() variant that limits itself to restoring just those bits? We only need kernel to restore whatever qemu emulates, but kernel doesn't know what that is. What kind of interface do you have in mind? The same as pci_reset_function(), but leaves MSI clear. I guess it's not worth it if the ordering problem is there. The core problem is not the ordering. The problem is that the kernel is susceptible to ordering mistakes of userspace. And that is because the kernel panics on PCI errors of devices that are in user hands - a critical kernel bug IMHO. I'm not sure. The pci sysfs interface is by design not secured against malicious users, isn't it? Proper reset of MSI or even the whole PCI config space is another issue, but one the kernel should not worry about - still, it should be fixed (therefore this patch). But even if we disallowed userland to disable MMIO and PIO access to the device, we would be be able to exclude that there are secrete channels in the device's interface having the same effect. I'm not sure I agree here. If there are secret channels to the device that let it violate the PCI express spec, it can probably break the SRIOV security model. And then you can do much more than just crash the host. So we likely need to enhance PCI error handling to catch and handle faults for certain devices differently - those we cannot trust to behave properly while they are under userland/guest control. Jan I agree - forwarding errors to the guest would actually be very useful - but I think we also need to analyse the problem carefully, and prevent as many ways as we can for guest to cause trouble. And there is another issue here: unsuppported request errors should not cause kernel panics IMO. There's also the issue that qemu let guest control the MMIO/PIO bits in the command register. So there are multiple bugs. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 2012-04-08 20:18, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 07:37:57PM +0200, Jan Kiszka wrote: On 2012-04-08 18:08, Avi Kivity wrote: On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote: On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote: I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) If we do it this way we get back the problem this patch is trying to solve: MSIX assigned while device memory is disabled would cause unsupported request errors. Why is that? FLR would presumably disable MSI in the device, and this line would disable it in kvm as well. The bug is that device memory is disabled (FLR would do that) while MSI is enabled in kvm. The fix is to disable MSI in kvm first. Yes, no need to repeat. My question is whether my pseudo-code does the same It doesn't seem to: FLR (disabling memory) is followed by MSI disable in kvm instead of the reverse. Ah, so the problem is the ordering? I see. and whether or not if it is better (when applied to all emulated config space). I'm not sure. I would like to see an example of a register that you have in mind. I went over the PCI registers and saw none that would be affected. Yes. I'm talking about things like enabling memory, setting up irq register, etc though. Most of this setup is done by bios. I see. So should we have a pci_reset_function() variant that limits itself to restoring just those bits? We only need kernel to restore whatever qemu emulates, but kernel doesn't know what that is. What kind of interface do you have in mind? The same as pci_reset_function(), but leaves MSI clear. I guess it's not worth it if the ordering problem is there. The core problem is not the ordering. The problem is that the kernel is susceptible to ordering mistakes of userspace. And that is because the kernel panics on PCI errors of devices that are in user hands - a critical kernel bug IMHO. I'm not sure. The pci sysfs interface is by design not secured against malicious users, isn't it? That's surely true for devices outside of IOMMU protection. But do we really have to give up when we encapsulate and isolate them that way? Provided we moderate access to the sysfs resources via libvirt or some other management service. Proper reset of MSI or even the whole PCI config space is another issue, but one the kernel should not worry about - still, it should be fixed (therefore this patch). But even if we disallowed userland to disable MMIO and PIO access to the device, we would be be able to exclude that there are secrete channels in the device's interface having the same effect. I'm not sure I agree here. If there are secret channels to the device that let it violate the PCI express spec, it can probably break the SRIOV security model. And then you can do much more than just crash the host. Maybe, but there are also other devices. And if a guest reprograms it (firmware update...) and makes it stop reacting on requests, we may get the same effect. That would also be some kind of a secrete channel. So we likely need to enhance PCI error handling to catch and handle faults for certain devices differently - those we cannot trust to behave properly while they are under userland/guest control. Jan I agree - forwarding errors to the guest would actually be very useful - but I think we also need to analyse the problem carefully, and prevent as many ways as we can for guest to cause trouble. If possible, the protection should target userspace which would automatically include guests. Only if that is not feasible with reasonable effort, we have to rely on QEMU to save the host. And there is another issue here: unsuppported request errors should not cause kernel panics IMO. There's also the issue that qemu let guest control the MMIO/PIO bits in the command register. So there are multiple bugs. Yep, that's true. Jan signature.asc Description: OpenPGP digital signature
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Sun, Apr 08, 2012 at 08:39:35PM +0200, Jan Kiszka wrote: On 2012-04-08 20:18, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 07:37:57PM +0200, Jan Kiszka wrote: On 2012-04-08 18:08, Avi Kivity wrote: On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote: On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote: I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) If we do it this way we get back the problem this patch is trying to solve: MSIX assigned while device memory is disabled would cause unsupported request errors. Why is that? FLR would presumably disable MSI in the device, and this line would disable it in kvm as well. The bug is that device memory is disabled (FLR would do that) while MSI is enabled in kvm. The fix is to disable MSI in kvm first. Yes, no need to repeat. My question is whether my pseudo-code does the same It doesn't seem to: FLR (disabling memory) is followed by MSI disable in kvm instead of the reverse. Ah, so the problem is the ordering? I see. and whether or not if it is better (when applied to all emulated config space). I'm not sure. I would like to see an example of a register that you have in mind. I went over the PCI registers and saw none that would be affected. Yes. I'm talking about things like enabling memory, setting up irq register, etc though. Most of this setup is done by bios. I see. So should we have a pci_reset_function() variant that limits itself to restoring just those bits? We only need kernel to restore whatever qemu emulates, but kernel doesn't know what that is. What kind of interface do you have in mind? The same as pci_reset_function(), but leaves MSI clear. I guess it's not worth it if the ordering problem is there. The core problem is not the ordering. The problem is that the kernel is susceptible to ordering mistakes of userspace. And that is because the kernel panics on PCI errors of devices that are in user hands - a critical kernel bug IMHO. I'm not sure. The pci sysfs interface is by design not secured against malicious users, isn't it? That's surely true for devices outside of IOMMU protection. But do we really have to give up when we encapsulate and isolate them that way? Provided we moderate access to the sysfs resources via libvirt or some other management service. We don't have to give up but we'd have to build such an interface: /config attribute is not it. Proper reset of MSI or even the whole PCI config space is another issue, but one the kernel should not worry about - still, it should be fixed (therefore this patch). But even if we disallowed userland to disable MMIO and PIO access to the device, we would be be able to exclude that there are secrete channels in the device's interface having the same effect. I'm not sure I agree here. If there are secret channels to the device that let it violate the PCI express spec, it can probably break the SRIOV security model. And then you can do much more than just crash the host. Maybe, but there are also other devices. And if a guest reprograms it (firmware update...) and makes it stop reacting on requests, we may get the same effect. That would also be some kind of a secrete channel. Right. So it looks like SRIOV VF is the only type of device that is safe to assign to a guest: Presumably, SRIOV VFs don't let driver program the firmware. And I think SRIOV VFs don't have MMIO/PIO enable bits either, and the BAR isn't programmable through the VF... So we likely need to enhance PCI error handling to catch and handle faults for certain devices differently - those we cannot trust to behave properly while they are under userland/guest control. Jan I agree - forwarding errors to the guest would actually be very useful - but I think we also need to analyse the problem carefully, and prevent as many ways as we can for guest to cause trouble. If possible, the protection should target userspace which would automatically include guests. Only if that is not feasible with reasonable effort, we have to rely on QEMU to save the host. Defence in depth is best, right? And there is another issue here: unsuppported request errors should not cause kernel panics IMO. There's also the issue that qemu let guest control the MMIO/PIO bits in the command register. So there are multiple bugs. Yep, that's true. Jan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 2012-04-05 05:42, Alex Williamson wrote: We've hit a kernel host panic, when issuing a 'system_reset' with an 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815. [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993 [Hardware Error]: APEI generic hardware error status [Hardware Error]: severity: 1, fatal [Hardware Error]: section: 0, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 [Hardware Error]: section: 1, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 Kernel panic - not syncing: Fatal hardware error! Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1 Call Trace: NMI [814f2fe5] ? panic+0xa0/0x168 [812f919c] ? ghes_notify_nmi+0x17c/0x180 [814f91d5] ? notifier_call_chain+0x55/0x80 [814f923a] ? atomic_notifier_call_chain+0x1a/0x20 [8109667e] ? notify_die+0x2e/0x30 [814f6e81] ? do_nmi+0x1a1/0x2b0 [814f6760] ? nmi+0x20/0x30 [8103762b] ? native_safe_halt+0xb/0x10 EOE [8101495d] ? default_idle+0x4d/0xb0 [81009e06] ? cpu_idle+0xb6/0x110 [814da63a] ? rest_init+0x7a/0x80 [81c1ff7b] ? start_kernel+0x424/0x430 [81c1f33a] ? x86_64_start_reservations+0x125/0x129 [81c1f438] ? x86_64_start_kernel+0xfa/0x109 The root cause of the problem is that the 'reset_assigned_device()' code first writes a 0 to the command register. Then, when qemu subsequently does a kvm_deassign_irq() (called by assign_irq(), in the system_reset path), the kernel ends up calling '__msix_mask_irq()', which performs a write to the memory mapped msi vector space. Since, we've explicitly told the device to disallow mmio access (via the 0 write to the command register), we end up with the above 'Unsupported Request'. The fix here is to first disable MSI-X, before doing the reset. We also disable MSI, leaving the device in INTx mode. In this way, the device is a known state after reset, and we avoid touching msi memory mapped space on any subsequent 'kvm_deassign_irq()'. Thanks to Michael S. Tsirkin for help in understanding what was going on here and Jason Baron, the original debugger of this problem. Signed-off-by: Alex Williamson alex.william...@redhat.com --- Jason is out of the office for a couple weeks, so I'll try to resolve this while he's away. Somehow the emulated config updates were lost in Jason's original posting, so I've fixed that and taken Jan's suggestion to simply call into the update functions instead of open coding the interrupt disable. I think there still may be some disagreements about how to handle guest generated errors in the host, but that's a large project whereas this is something we should be doing at reset anyway, and even if only a workaround, resolves the problem above. hw/device-assignment.c | 23 +++ 1 files changed, 23 insertions(+), 0 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 89823f1..2e6b93e 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev) const char reset[] = 1; int fd, ret; +/* + * If a guest is reset without being shutdown, MSI/MSI-X can still + * be running. We want to return the device to a known state on + * reset, so disable those here. We especially do not want MSI-X To be precised: It is the specified state after reset which we fail to restore so far. + * enabled since it lives
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Wed, Apr 04, 2012 at 09:42:32PM -0600, Alex Williamson wrote: We've hit a kernel host panic, when issuing a 'system_reset' with an 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815. [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993 [Hardware Error]: APEI generic hardware error status [Hardware Error]: severity: 1, fatal [Hardware Error]: section: 0, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 [Hardware Error]: section: 1, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 Kernel panic - not syncing: Fatal hardware error! Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1 Call Trace: NMI [814f2fe5] ? panic+0xa0/0x168 [812f919c] ? ghes_notify_nmi+0x17c/0x180 [814f91d5] ? notifier_call_chain+0x55/0x80 [814f923a] ? atomic_notifier_call_chain+0x1a/0x20 [8109667e] ? notify_die+0x2e/0x30 [814f6e81] ? do_nmi+0x1a1/0x2b0 [814f6760] ? nmi+0x20/0x30 [8103762b] ? native_safe_halt+0xb/0x10 EOE [8101495d] ? default_idle+0x4d/0xb0 [81009e06] ? cpu_idle+0xb6/0x110 [814da63a] ? rest_init+0x7a/0x80 [81c1ff7b] ? start_kernel+0x424/0x430 [81c1f33a] ? x86_64_start_reservations+0x125/0x129 [81c1f438] ? x86_64_start_kernel+0xfa/0x109 The root cause of the problem is that the 'reset_assigned_device()' code first writes a 0 to the command register. Then, when qemu subsequently does a kvm_deassign_irq() (called by assign_irq(), in the system_reset path), the kernel ends up calling '__msix_mask_irq()', which performs a write to the memory mapped msi vector space. Since, we've explicitly told the device to disallow mmio access (via the 0 write to the command register), we end up with the above 'Unsupported Request'. The fix here is to first disable MSI-X, before doing the reset. We also disable MSI, leaving the device in INTx mode. In this way, the device is a known state after reset, and we avoid touching msi memory mapped space on any subsequent 'kvm_deassign_irq()'. Thanks to Michael S. Tsirkin for help in understanding what was going on here and Jason Baron, the original debugger of this problem. Signed-off-by: Alex Williamson alex.william...@redhat.com --- Jason is out of the office for a couple weeks, so I'll try to resolve this while he's away. Somehow the emulated config updates were lost in Jason's original posting, so I've fixed that and taken Jan's suggestion to simply call into the update functions instead of open coding the interrupt disable. I think there still may be some disagreements about how to handle guest generated errors in the host, but that's a large project whereas this is something we should be doing at reset anyway, and even if only a workaround, resolves the problem above. I don't think there's a disagreement: don't we all agree they should be forwarded to qemu and on the guest if possible, ignored if not? hw/device-assignment.c | 23 +++ 1 files changed, 23 insertions(+), 0 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 89823f1..2e6b93e 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev) const char reset[] = 1; int fd, ret; +/* + * If a guest is reset without being shutdown, MSI/MSI-X can still + * be running. We want to return the device to a known state on + * reset, so disable those
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Thu, 2012-04-05 at 12:34 +0300, Michael S. Tsirkin wrote: On Wed, Apr 04, 2012 at 09:42:32PM -0600, Alex Williamson wrote: We've hit a kernel host panic, when issuing a 'system_reset' with an 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815. [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993 [Hardware Error]: APEI generic hardware error status [Hardware Error]: severity: 1, fatal [Hardware Error]: section: 0, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 [Hardware Error]: section: 1, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 Kernel panic - not syncing: Fatal hardware error! Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1 Call Trace: NMI [814f2fe5] ? panic+0xa0/0x168 [812f919c] ? ghes_notify_nmi+0x17c/0x180 [814f91d5] ? notifier_call_chain+0x55/0x80 [814f923a] ? atomic_notifier_call_chain+0x1a/0x20 [8109667e] ? notify_die+0x2e/0x30 [814f6e81] ? do_nmi+0x1a1/0x2b0 [814f6760] ? nmi+0x20/0x30 [8103762b] ? native_safe_halt+0xb/0x10 EOE [8101495d] ? default_idle+0x4d/0xb0 [81009e06] ? cpu_idle+0xb6/0x110 [814da63a] ? rest_init+0x7a/0x80 [81c1ff7b] ? start_kernel+0x424/0x430 [81c1f33a] ? x86_64_start_reservations+0x125/0x129 [81c1f438] ? x86_64_start_kernel+0xfa/0x109 The root cause of the problem is that the 'reset_assigned_device()' code first writes a 0 to the command register. Then, when qemu subsequently does a kvm_deassign_irq() (called by assign_irq(), in the system_reset path), the kernel ends up calling '__msix_mask_irq()', which performs a write to the memory mapped msi vector space. Since, we've explicitly told the device to disallow mmio access (via the 0 write to the command register), we end up with the above 'Unsupported Request'. The fix here is to first disable MSI-X, before doing the reset. We also disable MSI, leaving the device in INTx mode. In this way, the device is a known state after reset, and we avoid touching msi memory mapped space on any subsequent 'kvm_deassign_irq()'. Thanks to Michael S. Tsirkin for help in understanding what was going on here and Jason Baron, the original debugger of this problem. Signed-off-by: Alex Williamson alex.william...@redhat.com --- Jason is out of the office for a couple weeks, so I'll try to resolve this while he's away. Somehow the emulated config updates were lost in Jason's original posting, so I've fixed that and taken Jan's suggestion to simply call into the update functions instead of open coding the interrupt disable. I think there still may be some disagreements about how to handle guest generated errors in the host, but that's a large project whereas this is something we should be doing at reset anyway, and even if only a workaround, resolves the problem above. I don't think there's a disagreement: don't we all agree they should be forwarded to qemu and on the guest if possible, ignored if not? With AER perhaps, but I'm not sure we have that flexibility with APEI, but I'd need to read up on the spec. It seems APEI defines that the BIOS gets first shot at handling the error and gets to dictate how the OS handles it. In this case, deciding that it's fatal. hw/device-assignment.c | 23 +++ 1 files changed, 23 insertions(+), 0 deletions(-) diff --git a/hw/device-assignment.c
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Thu, Apr 05, 2012 at 08:42:03AM -0600, Alex Williamson wrote: So far so good. + * We especially do not want MSI-X + * enabled since it lives in MMIO space, which is about to get + * disabled. I think we are better off dropping the above, because it's a bug that we disable MMIO space on the physical device: we did not enable it (kvm did) let's not disable. I'd like to investigate removing the command register clearing, but it did solve a bug at the time it went in. That will be a separate patch, so we can remove the above sentence when that's removed. It may be a bug, but it's describing the current behavior. Fair enough. Acked-by: Michael S. Tsirkin m...@redhat.com -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
We've hit a kernel host panic, when issuing a 'system_reset' with an 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815. [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993 [Hardware Error]: APEI generic hardware error status [Hardware Error]: severity: 1, fatal [Hardware Error]: section: 0, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 [Hardware Error]: section: 1, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 Kernel panic - not syncing: Fatal hardware error! Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1 Call Trace: NMI [814f2fe5] ? panic+0xa0/0x168 [812f919c] ? ghes_notify_nmi+0x17c/0x180 [814f91d5] ? notifier_call_chain+0x55/0x80 [814f923a] ? atomic_notifier_call_chain+0x1a/0x20 [8109667e] ? notify_die+0x2e/0x30 [814f6e81] ? do_nmi+0x1a1/0x2b0 [814f6760] ? nmi+0x20/0x30 [8103762b] ? native_safe_halt+0xb/0x10 EOE [8101495d] ? default_idle+0x4d/0xb0 [81009e06] ? cpu_idle+0xb6/0x110 [814da63a] ? rest_init+0x7a/0x80 [81c1ff7b] ? start_kernel+0x424/0x430 [81c1f33a] ? x86_64_start_reservations+0x125/0x129 [81c1f438] ? x86_64_start_kernel+0xfa/0x109 The root cause of the problem is that the 'reset_assigned_device()' code first writes a 0 to the command register. Then, when qemu subsequently does a kvm_deassign_irq() (called by assign_irq(), in the system_reset path), the kernel ends up calling '__msix_mask_irq()', which performs a write to the memory mapped msi vector space. Since, we've explicitly told the device to disallow mmio access (via the 0 write to the command register), we end up with the above 'Unsupported Request'. The fix here is to first disable MSI-X, before doing the reset. We also disable MSI, leaving the device in INTx mode. In this way, the device is a known state after reset, and we avoid touching msi memory mapped space on any subsequent 'kvm_deassign_irq()'. Thanks to Michael S. Tsirkin for help in understanding what was going on here and Jason Baron, the original debugger of this problem. Signed-off-by: Alex Williamson alex.william...@redhat.com --- Jason is out of the office for a couple weeks, so I'll try to resolve this while he's away. Somehow the emulated config updates were lost in Jason's original posting, so I've fixed that and taken Jan's suggestion to simply call into the update functions instead of open coding the interrupt disable. I think there still may be some disagreements about how to handle guest generated errors in the host, but that's a large project whereas this is something we should be doing at reset anyway, and even if only a workaround, resolves the problem above. hw/device-assignment.c | 23 +++ 1 files changed, 23 insertions(+), 0 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 89823f1..2e6b93e 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev) const char reset[] = 1; int fd, ret; +/* + * If a guest is reset without being shutdown, MSI/MSI-X can still + * be running. We want to return the device to a known state on + * reset, so disable those here. We especially do not want MSI-X + * enabled since it lives in MMIO space, which is about to get + * disabled. + */ +if (adev-irq_requested_type KVM_DEV_IRQ_GUEST_MSIX) { +uint16_t ctrl = pci_get_word(pci_dev-config + + pci_dev-msix_cap +