Re: source for virt io backend driver
On Sat, Apr 07, 2012, Steven wrote about source for virt io backend driver: Hi, I would like to know where I can find the source for backend driver of virtio device. For example, I know that the front-end net driver is ... Can anyone help to point out where the source of net virtio backend in the host kernel? If you use normal virtio, the backend is in qemu, i.e., host user space, *not* in the host kernel. So you need to look for it in qemu, not kvm. If you want the backend to be in the kernel, then you probably mean vhost-net. This you can find in drivers/vhost/net.c (and drivers/vhost/vhost.c for the generic vhost infrastructure for virtio drivers in the host kernel). Of course, also with vhost-net, qemu is involved in setting up this device. But qemu doesn't need to get involved in the data path (data, interrupts, etc.) which is done completely in the kernel, and therefore much more efficiently. -- Nadav Har'El| Sunday, Apr 8 2012, n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |Lottery: You need a dollar and a dream. http://nadav.harel.org.il |We'll take the dollar, you keep the dream. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Bug 43068] Operation restricted to levels L0-2 - kerneloops when booting L3
https://bugzilla.kernel.org/show_bug.cgi?id=43068 Nadav Har'El n...@math.technion.ac.il changed: What|Removed |Added CC||n...@math.technion.ac.il --- Comment #1 from Nadav Har'El n...@math.technion.ac.il 2012-04-08 08:18:32 --- Hi, indeed, theoretically L3 should work; And also in practice - in my tests it did work (albeit very very slowly). I'll need to look into this issue, and check why this bug is happening now. You should just be aware that even if it will work, L3 will be extremely slow, likely to the point of not being useful. One of reasons is the lack of nested EPT in the upstream version. An even bigger problem is the exponential explosion of exits described in the Turtles paper, which is made much worse by one particular part of the existing implementation: Right now, on every nested entry from L1 to L2, L0 recreates all the fields of the vmcs (see prepare_vmcs02()) doing a few dozen VMWRITEs. When L0 does it, it's fine - but when we are nested yet deeper, and there is an entry from L2 into L3 and *L1* needs to do prepare_vmcs02(), now L1 calls a lot of VMWRITEs and they all cause exits, and all of this is extremely slow. There are two things we could do in the future to solve this performance problem - if people really need to use L3 (currently, even L2 isn't very popular ;-)). One is to have some sort of exit-less VMREAD/VMWRITE, with or without hardware support for this feature. The second thing is to emulate VMWRITE differently, writing to vmcs02 immediately and doing far fewer VMWRITES on nested entries (doing this correctly is harder than it might seem, as I can explain in a separate thread). Nadav Har'El -- Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are watching the assignee of the bug. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] KVM fixes for 3.4-rc1
On 04/07/2012 07:55 PM, Linus Torvalds wrote: On Thu, Apr 5, 2012 at 9:40 AM, Avi Kivity a...@redhat.com wrote: Linus, please pull a few fixes from git://git.kernel.org/pub/scm/virt/kvm/kvm.git kvm-updates/3.4 A bunch of powerpc KVM fixes, a guest and a host RCU fix (unrelated), and a small build fix. Grr. static inline int pull_function(void) { return -ENODIFFSTAT; } Please use the proper scripting to do pull requests, like git request-pull I did, but I left the diffstat out accidentally, sorry. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: New git workflow
On 04/05/2012 08:02 PM, Avi Kivity wrote: I'll publish the new branches tomorrow, with any luck. There wasn't any luck, so it's only ready today. To allow chance for review, I'm publishing next as next-candidate. Paul/Alex, please review the powerpc bits. Specifically: system.h is gone, so I moved the prototype of load_up_fpu() to asm/switch_to.h and added a #include (8fae845f4956d). E6500 was added in upstream in parallel with the split of kvm E500/E500MC. I guessed which part of the #ifdef E6500 was to go into, but please verify (73196cd364a2, 06aae86799c1b). Please either post fixup patches atop next-candidate, if needed, or (preferably) post a new tree of the powerpc update based on b6d33834bd4e, to replace b6d33834bd4e..966cd0f3bdd. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] Improve iteration through sptes from rmap
On 03/21/2012 04:48 PM, Takuya Yoshikawa wrote: By removing sptep from rmap_iterator, I could achieve 15% performance improvement without inlining. Thanks, applied to next-candidate. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] KVM: MMU: Make pte_list_desc fit cache lines well
On 03/21/2012 04:49 PM, Takuya Yoshikawa wrote: From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp We have PTE_LIST_EXT + 1 pointers in this structure and these 40/20 bytes do not fit cache lines well. Furthermore, some allocators may use 64/32-byte objects for the pte_list_desc cache. This patch solves this problem by changing PTE_LIST_EXT from 4 to 3. For shadow paging, the new size is still large enough to hold both the kernel and process mappings for usual anonymous pages. For file mappings, there may be a slight change in the cache usage. Note: with EPT/NPT we almost always have a single spte in each reverse mapping and we will not see any change by this. @@ -135,8 +135,6 @@ module_param(dbg, bool, 0644); #define PT64_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | PT_USER_MASK \ | PT64_NX_MASK) -#define PTE_LIST_EXT 4 - #define ACC_EXEC_MASK1 #define ACC_WRITE_MASK PT_WRITABLE_MASK #define ACC_USER_MASKPT_USER_MASK @@ -151,6 +149,9 @@ module_param(dbg, bool, 0644); #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level) +/* make pte_list_desc fit well in cache line */ +#define PTE_LIST_EXT 3 + struct pte_list_desc { u64 *sptes[PTE_LIST_EXT]; struct pte_list_desc *more; We could go even further and have 4 pointers, and use bit 0 to decide whether it's a next pointer or an sptep. Not sure it's worth the extra complexity. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 04/05/2012 06:42 AM, Alex Williamson wrote: We've hit a kernel host panic, when issuing a 'system_reset' with an 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815. [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993 [Hardware Error]: APEI generic hardware error status [Hardware Error]: severity: 1, fatal [Hardware Error]: section: 0, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 [Hardware Error]: section: 1, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 Kernel panic - not syncing: Fatal hardware error! Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1 Call Trace: NMI [814f2fe5] ? panic+0xa0/0x168 [812f919c] ? ghes_notify_nmi+0x17c/0x180 [814f91d5] ? notifier_call_chain+0x55/0x80 [814f923a] ? atomic_notifier_call_chain+0x1a/0x20 [8109667e] ? notify_die+0x2e/0x30 [814f6e81] ? do_nmi+0x1a1/0x2b0 [814f6760] ? nmi+0x20/0x30 [8103762b] ? native_safe_halt+0xb/0x10 EOE [8101495d] ? default_idle+0x4d/0xb0 [81009e06] ? cpu_idle+0xb6/0x110 [814da63a] ? rest_init+0x7a/0x80 [81c1ff7b] ? start_kernel+0x424/0x430 [81c1f33a] ? x86_64_start_reservations+0x125/0x129 [81c1f438] ? x86_64_start_kernel+0xfa/0x109 The root cause of the problem is that the 'reset_assigned_device()' code first writes a 0 to the command register. Then, when qemu subsequently does a kvm_deassign_irq() (called by assign_irq(), in the system_reset path), the kernel ends up calling '__msix_mask_irq()', which performs a write to the memory mapped msi vector space. Since, we've explicitly told the device to disallow mmio access (via the 0 write to the command register), we end up with the above 'Unsupported Request'. The fix here is to first disable MSI-X, before doing the reset. We also disable MSI, leaving the device in INTx mode. In this way, the device is a known state after reset, and we avoid touching msi memory mapped space on any subsequent 'kvm_deassign_irq()'. Thanks to Michael S. Tsirkin for help in understanding what was going on here and Jason Baron, the original debugger of this problem. Signed-off-by: Alex Williamson alex.william...@redhat.com --- Jason is out of the office for a couple weeks, so I'll try to resolve this while he's away. Somehow the emulated config updates were lost in Jason's original posting, so I've fixed that and taken Jan's suggestion to simply call into the update functions instead of open coding the interrupt disable. I think there still may be some disagreements about how to handle guest generated errors in the host, but that's a large project whereas this is something we should be doing at reset anyway, and even if only a workaround, resolves the problem above. hw/device-assignment.c | 23 +++ 1 files changed, 23 insertions(+), 0 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 89823f1..2e6b93e 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev) const char reset[] = 1; int fd, ret; +/* + * If a guest is reset without being shutdown, MSI/MSI-X can still + * be running. We want to return the device to a known state on + * reset, so disable those here. We especially do not want MSI-X + * enabled since it lives in MMIO space, which is about to get + * disabled. + */ +if
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Sun, Apr 08, 2012 at 04:14:29PM +0300, Avi Kivity wrote: On 04/05/2012 06:42 AM, Alex Williamson wrote: We've hit a kernel host panic, when issuing a 'system_reset' with an 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815. [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993 [Hardware Error]: APEI generic hardware error status [Hardware Error]: severity: 1, fatal [Hardware Error]: section: 0, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 [Hardware Error]: section: 1, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 Kernel panic - not syncing: Fatal hardware error! Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1 Call Trace: NMI [814f2fe5] ? panic+0xa0/0x168 [812f919c] ? ghes_notify_nmi+0x17c/0x180 [814f91d5] ? notifier_call_chain+0x55/0x80 [814f923a] ? atomic_notifier_call_chain+0x1a/0x20 [8109667e] ? notify_die+0x2e/0x30 [814f6e81] ? do_nmi+0x1a1/0x2b0 [814f6760] ? nmi+0x20/0x30 [8103762b] ? native_safe_halt+0xb/0x10 EOE [8101495d] ? default_idle+0x4d/0xb0 [81009e06] ? cpu_idle+0xb6/0x110 [814da63a] ? rest_init+0x7a/0x80 [81c1ff7b] ? start_kernel+0x424/0x430 [81c1f33a] ? x86_64_start_reservations+0x125/0x129 [81c1f438] ? x86_64_start_kernel+0xfa/0x109 The root cause of the problem is that the 'reset_assigned_device()' code first writes a 0 to the command register. Then, when qemu subsequently does a kvm_deassign_irq() (called by assign_irq(), in the system_reset path), the kernel ends up calling '__msix_mask_irq()', which performs a write to the memory mapped msi vector space. Since, we've explicitly told the device to disallow mmio access (via the 0 write to the command register), we end up with the above 'Unsupported Request'. The fix here is to first disable MSI-X, before doing the reset. We also disable MSI, leaving the device in INTx mode. In this way, the device is a known state after reset, and we avoid touching msi memory mapped space on any subsequent 'kvm_deassign_irq()'. Thanks to Michael S. Tsirkin for help in understanding what was going on here and Jason Baron, the original debugger of this problem. Signed-off-by: Alex Williamson alex.william...@redhat.com --- Jason is out of the office for a couple weeks, so I'll try to resolve this while he's away. Somehow the emulated config updates were lost in Jason's original posting, so I've fixed that and taken Jan's suggestion to simply call into the update functions instead of open coding the interrupt disable. I think there still may be some disagreements about how to handle guest generated errors in the host, but that's a large project whereas this is something we should be doing at reset anyway, and even if only a workaround, resolves the problem above. hw/device-assignment.c | 23 +++ 1 files changed, 23 insertions(+), 0 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 89823f1..2e6b93e 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev) const char reset[] = 1; int fd, ret; +/* + * If a guest is reset without being shutdown, MSI/MSI-X can still + * be running. We want to return the device to a known state on + * reset, so disable those
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote: Don't we FLR the device, which ought to disable MSI on the real device? AFAIK we call pci_reset, which saves device state, does an FLR and then restores the state. I think this might include msi as well. Then that is wrong as well, no? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote: On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote: Don't we FLR the device, which ought to disable MSI on the real device? AFAIK we call pci_reset, which saves device state, does an FLR and then restores the state. I think this might include msi as well. Then that is wrong as well, no? Not as such assuming we disable msi/msix first :) -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote: On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote: Don't we FLR the device, which ought to disable MSI on the real device? AFAIK we call pci_reset, which saves device state, does an FLR and then restores the state. I think this might include msi as well. Then that is wrong as well, no? Not as such assuming we disable msi/msix first :) I think we need to fix both, no? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote: On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote: On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote: Don't we FLR the device, which ought to disable MSI on the real device? AFAIK we call pci_reset, which saves device state, does an FLR and then restores the state. I think this might include msi as well. Then that is wrong as well, no? Not as such assuming we disable msi/msix first :) I think we need to fix both, no? Isn't this what this patch does? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote: On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote: On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote: Don't we FLR the device, which ought to disable MSI on the real device? AFAIK we call pci_reset, which saves device state, does an FLR and then restores the state. I think this might include msi as well. Then that is wrong as well, no? Not as such assuming we disable msi/msix first :) I think we need to fix both, no? Isn't this what this patch does? If we change pci_reset() (or a variant that we call) to reset MSI, and update qemu to synchronize from the device after pci_reset(), then we achieve the same result, in a different way. Since reset can change other config space registers, we achieve correctness for more of them. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote: On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote: On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote: On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote: Don't we FLR the device, which ought to disable MSI on the real device? AFAIK we call pci_reset, which saves device state, does an FLR and then restores the state. I think this might include msi as well. Then that is wrong as well, no? Not as such assuming we disable msi/msix first :) I think we need to fix both, no? Isn't this what this patch does? If we change pci_reset() (or a variant that we call) to reset MSI, and update qemu to synchronize from the device after pci_reset(), then we achieve the same result, in a different way. MSI vectors are set up by kvm in the host. So we should not abruptly drop that by a sysfs write: would need to synchronise with kvm. Once we do, there's nothing left for pci_reset to do. Since reset can change other config space registers, we achieve correctness for more of them. Which other registers do you have in mind? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 04/08/2012 04:53 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote: On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote: On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote: On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote: Don't we FLR the device, which ought to disable MSI on the real device? AFAIK we call pci_reset, which saves device state, does an FLR and then restores the state. I think this might include msi as well. Then that is wrong as well, no? Not as such assuming we disable msi/msix first :) I think we need to fix both, no? Isn't this what this patch does? If we change pci_reset() (or a variant that we call) to reset MSI, and update qemu to synchronize from the device after pci_reset(), then we achieve the same result, in a different way. MSI vectors are set up by kvm in the host. So we should not abruptly drop that by a sysfs write: would need to synchronise with kvm. Once we do, there's nothing left for pci_reset to do. I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) set emulated register to hardware value Since reset can change other config space registers, we achieve correctness for more of them. Which other registers do you have in mind? BARs for example. We may have our own reset for this, but isn't copying the hardware values more trustworthy? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Sun, Apr 08, 2012 at 05:01:36PM +0300, Avi Kivity wrote: On 04/08/2012 04:53 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote: On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote: On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote: On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote: Don't we FLR the device, which ought to disable MSI on the real device? AFAIK we call pci_reset, which saves device state, does an FLR and then restores the state. I think this might include msi as well. Then that is wrong as well, no? Not as such assuming we disable msi/msix first :) I think we need to fix both, no? Isn't this what this patch does? If we change pci_reset() (or a variant that we call) to reset MSI, and update qemu to synchronize from the device after pci_reset(), then we achieve the same result, in a different way. MSI vectors are set up by kvm in the host. So we should not abruptly drop that by a sysfs write: would need to synchronise with kvm. Once we do, there's nothing left for pci_reset to do. I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) If we do it this way we get back the problem this patch is trying to solve: MSIX assigned while device memory is disabled would cause unsupported request errors. set emulated register to hardware value Yes, I see what you are trying to say now. Unfortunately that's not enough: we also need to restore the registers afterwards for device to become useful again. Doing this in kernel seems more robust, otherwise we risk losing the device if qemu gets killed before it has restored the registers. Since reset can change other config space registers, we achieve correctness for more of them. Which other registers do you have in mind? BARs for example. We may have our own reset for this, but isn't copying the hardware values more trustworthy? BAR values in host and guest are unrelated. If pci_reset didn't restore BAR values we won't be able to operate the device. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] KVM: MMU: Make pte_list_desc fit cache lines well
On Sun, 08 Apr 2012 16:09:58 +0300 Avi Kivity a...@redhat.com wrote: +/* make pte_list_desc fit well in cache line */ +#define PTE_LIST_EXT 3 + struct pte_list_desc { u64 *sptes[PTE_LIST_EXT]; struct pte_list_desc *more; We could go even further and have 4 pointers, and use bit 0 to decide whether it's a next pointer or an sptep. Not sure it's worth the extra complexity. May not be so complex if we reuse rmap encoding/decoding but it is for shadow paging only hack ... and mmu is already enough complex; so not sure - I need to think again later. My primary goal is to make the code saner and faster. Takuya -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 04/08/2012 05:42 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 05:01:36PM +0300, Avi Kivity wrote: On 04/08/2012 04:53 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote: On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote: On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote: On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote: Don't we FLR the device, which ought to disable MSI on the real device? AFAIK we call pci_reset, which saves device state, does an FLR and then restores the state. I think this might include msi as well. Then that is wrong as well, no? Not as such assuming we disable msi/msix first :) I think we need to fix both, no? Isn't this what this patch does? If we change pci_reset() (or a variant that we call) to reset MSI, and update qemu to synchronize from the device after pci_reset(), then we achieve the same result, in a different way. MSI vectors are set up by kvm in the host. So we should not abruptly drop that by a sysfs write: would need to synchronise with kvm. Once we do, there's nothing left for pci_reset to do. I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) If we do it this way we get back the problem this patch is trying to solve: MSIX assigned while device memory is disabled would cause unsupported request errors. Why is that? FLR would presumably disable MSI in the device, and this line would disable it in kvm as well. set emulated register to hardware value Yes, I see what you are trying to say now. Unfortunately that's not enough: we also need to restore the registers afterwards for device to become useful again. I guess this is correct for the MSIX BAR. But is it correct for MSIX enable/disable? Doing this in kernel seems more robust, otherwise we risk losing the device if qemu gets killed before it has restored the registers. Doesn't the driver have to enable MSIX if it attaches to the device at that point, anyway? Since reset can change other config space registers, we achieve correctness for more of them. Which other registers do you have in mind? BARs for example. We may have our own reset for this, but isn't copying the hardware values more trustworthy? BAR values in host and guest are unrelated. If pci_reset didn't restore BAR values we won't be able to operate the device. Right. I guess candidates are those that are initialized with assigned_dev_emulate_config_*()? Hard to see which ones because they're mass initialized. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Questing regarding KVM Guest PMU
On Fri, Apr 06, 2012 at 09:50:50AM +0300, Gleb Natapov wrote: On Fri, Apr 06, 2012 at 10:43:17AM +0530, shashank rachamalla wrote: On Thu, Apr 5, 2012 at 8:11 PM, Gleb Natapov g...@redhat.com wrote: On Thu, Apr 05, 2012 at 05:38:40PM +0300, Avi Kivity wrote: On 04/05/2012 04:57 PM, Gleb Natapov wrote: May be it used NMI based profiling. We should ask oprofile developers. As I said I am almost sure my inability to run it on a host is probably PEBKAC, although I ran the same script exactly on the host and the guest (the script is from the first email of this thread) After upgrading the kernel to latest git from whatever it was there the same script works and counts CPU_CLK_UNHALT events. This is even while it violates the Intel guidelines? Yes, but who says the result is correct :) It seems that we handle global ctrl msr wrong. That is counter can be enabled either in global ctrl or in eventsel. Trying to confirm that. if that becomes true then will global ctrl msr have any significance ? When it is in use yes. I was wrong. We do handle global ctrl msr correctly, I just ran my test incorrectly. If I disable global ctrl on all cpus (for i in `seq 0 15`; do wrmsr -p $i 0x38f 0; done) oprofile stops working. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Sun, Apr 08, 2012 at 06:26:28PM +0300, Avi Kivity wrote: On 04/08/2012 05:42 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 05:01:36PM +0300, Avi Kivity wrote: On 04/08/2012 04:53 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote: On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote: On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote: On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote: Don't we FLR the device, which ought to disable MSI on the real device? AFAIK we call pci_reset, which saves device state, does an FLR and then restores the state. I think this might include msi as well. Then that is wrong as well, no? Not as such assuming we disable msi/msix first :) I think we need to fix both, no? Isn't this what this patch does? If we change pci_reset() (or a variant that we call) to reset MSI, and update qemu to synchronize from the device after pci_reset(), then we achieve the same result, in a different way. MSI vectors are set up by kvm in the host. So we should not abruptly drop that by a sysfs write: would need to synchronise with kvm. Once we do, there's nothing left for pci_reset to do. I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) If we do it this way we get back the problem this patch is trying to solve: MSIX assigned while device memory is disabled would cause unsupported request errors. Why is that? FLR would presumably disable MSI in the device, and this line would disable it in kvm as well. The bug is that device memory is disabled (FLR would do that) while MSI is enabled in kvm. The fix is to disable MSI in kvm first. set emulated register to hardware value Yes, I see what you are trying to say now. Unfortunately that's not enough: we also need to restore the registers afterwards for device to become useful again. I guess this is correct for the MSIX BAR. But is it correct for MSIX enable/disable? Probably not. Doing this in kernel seems more robust, otherwise we risk losing the device if qemu gets killed before it has restored the registers. Doesn't the driver have to enable MSIX if it attaches to the device at that point, anyway? Yes. I'm talking about things like enabling memory, setting up irq register, etc though. Most of this setup is done by bios. Since reset can change other config space registers, we achieve correctness for more of them. Which other registers do you have in mind? BARs for example. We may have our own reset for this, but isn't copying the hardware values more trustworthy? BAR values in host and guest are unrelated. If pci_reset didn't restore BAR values we won't be able to operate the device. Right. I guess candidates are those that are initialized with assigned_dev_emulate_config_*()? Hard to see which ones because they're mass initialized. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote: I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) If we do it this way we get back the problem this patch is trying to solve: MSIX assigned while device memory is disabled would cause unsupported request errors. Why is that? FLR would presumably disable MSI in the device, and this line would disable it in kvm as well. The bug is that device memory is disabled (FLR would do that) while MSI is enabled in kvm. The fix is to disable MSI in kvm first. Yes, no need to repeat. My question is whether my pseudo-code does the same and whether or not if it is better (when applied to all emulated config space). Doing this in kernel seems more robust, otherwise we risk losing the device if qemu gets killed before it has restored the registers. Doesn't the driver have to enable MSIX if it attaches to the device at that point, anyway? Yes. I'm talking about things like enabling memory, setting up irq register, etc though. Most of this setup is done by bios. I see. So should we have a pci_reset_function() variant that limits itself to restoring just those bits? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote: On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote: I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) If we do it this way we get back the problem this patch is trying to solve: MSIX assigned while device memory is disabled would cause unsupported request errors. Why is that? FLR would presumably disable MSI in the device, and this line would disable it in kvm as well. The bug is that device memory is disabled (FLR would do that) while MSI is enabled in kvm. The fix is to disable MSI in kvm first. Yes, no need to repeat. My question is whether my pseudo-code does the same It doesn't seem to: FLR (disabling memory) is followed by MSI disable in kvm instead of the reverse. and whether or not if it is better (when applied to all emulated config space). I'm not sure. I would like to see an example of a register that you have in mind. Doing this in kernel seems more robust, otherwise we risk losing the device if qemu gets killed before it has restored the registers. Doesn't the driver have to enable MSIX if it attaches to the device at that point, anyway? Yes. I'm talking about things like enabling memory, setting up irq register, etc though. Most of this setup is done by bios. I see. So should we have a pci_reset_function() variant that limits itself to restoring just those bits? We only need kernel to restore whatever qemu emulates, but kernel doesn't know what that is. What kind of interface do you have in mind? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote: On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote: I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) If we do it this way we get back the problem this patch is trying to solve: MSIX assigned while device memory is disabled would cause unsupported request errors. Why is that? FLR would presumably disable MSI in the device, and this line would disable it in kvm as well. The bug is that device memory is disabled (FLR would do that) while MSI is enabled in kvm. The fix is to disable MSI in kvm first. Yes, no need to repeat. My question is whether my pseudo-code does the same It doesn't seem to: FLR (disabling memory) is followed by MSI disable in kvm instead of the reverse. Ah, so the problem is the ordering? I see. and whether or not if it is better (when applied to all emulated config space). I'm not sure. I would like to see an example of a register that you have in mind. I went over the PCI registers and saw none that would be affected. Yes. I'm talking about things like enabling memory, setting up irq register, etc though. Most of this setup is done by bios. I see. So should we have a pci_reset_function() variant that limits itself to restoring just those bits? We only need kernel to restore whatever qemu emulates, but kernel doesn't know what that is. What kind of interface do you have in mind? The same as pci_reset_function(), but leaves MSI clear. I guess it's not worth it if the ordering problem is there. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: source for virt io backend driver
On Sun, Apr 8, 2012 at 4:13 AM, Nadav Har'El n...@math.technion.ac.il wrote: On Sat, Apr 07, 2012, Steven wrote about source for virt io backend driver: Hi, I would like to know where I can find the source for backend driver of virtio device. For example, I know that the front-end net driver is ... Can anyone help to point out where the source of net virtio backend in the host kernel? If you use normal virtio, the backend is in qemu, i.e., host user space, *not* in the host kernel. So you need to look for it in qemu, not kvm. If you want the backend to be in the kernel, then you probably mean vhost-net. This you can find in drivers/vhost/net.c (and drivers/vhost/vhost.c for the generic vhost infrastructure for virtio drivers in the host kernel). Yes, I am looking for the backend code in the kernel. I found the file for net backend drivers/vhost/net.c. However, the backend code for blk is not there. Could you point out this? Thanks. By the way, I found the front-end blk code in drivers/block/virtio_blk.c Of course, also with vhost-net, qemu is involved in setting up this device. But qemu doesn't need to get involved in the data path (data, interrupts, etc.) which is done completely in the kernel, and therefore much more efficiently. -- Nadav Har'El | Sunday, Apr 8 2012, n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |Lottery: You need a dollar and a dream. http://nadav.harel.org.il |We'll take the dollar, you keep the dream. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 2012-04-08 18:08, Avi Kivity wrote: On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote: On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote: I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) If we do it this way we get back the problem this patch is trying to solve: MSIX assigned while device memory is disabled would cause unsupported request errors. Why is that? FLR would presumably disable MSI in the device, and this line would disable it in kvm as well. The bug is that device memory is disabled (FLR would do that) while MSI is enabled in kvm. The fix is to disable MSI in kvm first. Yes, no need to repeat. My question is whether my pseudo-code does the same It doesn't seem to: FLR (disabling memory) is followed by MSI disable in kvm instead of the reverse. Ah, so the problem is the ordering? I see. and whether or not if it is better (when applied to all emulated config space). I'm not sure. I would like to see an example of a register that you have in mind. I went over the PCI registers and saw none that would be affected. Yes. I'm talking about things like enabling memory, setting up irq register, etc though. Most of this setup is done by bios. I see. So should we have a pci_reset_function() variant that limits itself to restoring just those bits? We only need kernel to restore whatever qemu emulates, but kernel doesn't know what that is. What kind of interface do you have in mind? The same as pci_reset_function(), but leaves MSI clear. I guess it's not worth it if the ordering problem is there. The core problem is not the ordering. The problem is that the kernel is susceptible to ordering mistakes of userspace. And that is because the kernel panics on PCI errors of devices that are in user hands - a critical kernel bug IMHO. Proper reset of MSI or even the whole PCI config space is another issue, but one the kernel should not worry about - still, it should be fixed (therefore this patch). But even if we disallowed userland to disable MMIO and PIO access to the device, we would be be able to exclude that there are secrete channels in the device's interface having the same effect. So we likely need to enhance PCI error handling to catch and handle faults for certain devices differently - those we cannot trust to behave properly while they are under userland/guest control. Jan signature.asc Description: OpenPGP digital signature
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Sun, Apr 08, 2012 at 07:37:57PM +0200, Jan Kiszka wrote: On 2012-04-08 18:08, Avi Kivity wrote: On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote: On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote: I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) If we do it this way we get back the problem this patch is trying to solve: MSIX assigned while device memory is disabled would cause unsupported request errors. Why is that? FLR would presumably disable MSI in the device, and this line would disable it in kvm as well. The bug is that device memory is disabled (FLR would do that) while MSI is enabled in kvm. The fix is to disable MSI in kvm first. Yes, no need to repeat. My question is whether my pseudo-code does the same It doesn't seem to: FLR (disabling memory) is followed by MSI disable in kvm instead of the reverse. Ah, so the problem is the ordering? I see. and whether or not if it is better (when applied to all emulated config space). I'm not sure. I would like to see an example of a register that you have in mind. I went over the PCI registers and saw none that would be affected. Yes. I'm talking about things like enabling memory, setting up irq register, etc though. Most of this setup is done by bios. I see. So should we have a pci_reset_function() variant that limits itself to restoring just those bits? We only need kernel to restore whatever qemu emulates, but kernel doesn't know what that is. What kind of interface do you have in mind? The same as pci_reset_function(), but leaves MSI clear. I guess it's not worth it if the ordering problem is there. The core problem is not the ordering. The problem is that the kernel is susceptible to ordering mistakes of userspace. And that is because the kernel panics on PCI errors of devices that are in user hands - a critical kernel bug IMHO. I'm not sure. The pci sysfs interface is by design not secured against malicious users, isn't it? Proper reset of MSI or even the whole PCI config space is another issue, but one the kernel should not worry about - still, it should be fixed (therefore this patch). But even if we disallowed userland to disable MMIO and PIO access to the device, we would be be able to exclude that there are secrete channels in the device's interface having the same effect. I'm not sure I agree here. If there are secret channels to the device that let it violate the PCI express spec, it can probably break the SRIOV security model. And then you can do much more than just crash the host. So we likely need to enhance PCI error handling to catch and handle faults for certain devices differently - those we cannot trust to behave properly while they are under userland/guest control. Jan I agree - forwarding errors to the guest would actually be very useful - but I think we also need to analyse the problem carefully, and prevent as many ways as we can for guest to cause trouble. And there is another issue here: unsuppported request errors should not cause kernel panics IMO. There's also the issue that qemu let guest control the MMIO/PIO bits in the command register. So there are multiple bugs. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 2012-04-08 20:18, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 07:37:57PM +0200, Jan Kiszka wrote: On 2012-04-08 18:08, Avi Kivity wrote: On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote: On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote: I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) If we do it this way we get back the problem this patch is trying to solve: MSIX assigned while device memory is disabled would cause unsupported request errors. Why is that? FLR would presumably disable MSI in the device, and this line would disable it in kvm as well. The bug is that device memory is disabled (FLR would do that) while MSI is enabled in kvm. The fix is to disable MSI in kvm first. Yes, no need to repeat. My question is whether my pseudo-code does the same It doesn't seem to: FLR (disabling memory) is followed by MSI disable in kvm instead of the reverse. Ah, so the problem is the ordering? I see. and whether or not if it is better (when applied to all emulated config space). I'm not sure. I would like to see an example of a register that you have in mind. I went over the PCI registers and saw none that would be affected. Yes. I'm talking about things like enabling memory, setting up irq register, etc though. Most of this setup is done by bios. I see. So should we have a pci_reset_function() variant that limits itself to restoring just those bits? We only need kernel to restore whatever qemu emulates, but kernel doesn't know what that is. What kind of interface do you have in mind? The same as pci_reset_function(), but leaves MSI clear. I guess it's not worth it if the ordering problem is there. The core problem is not the ordering. The problem is that the kernel is susceptible to ordering mistakes of userspace. And that is because the kernel panics on PCI errors of devices that are in user hands - a critical kernel bug IMHO. I'm not sure. The pci sysfs interface is by design not secured against malicious users, isn't it? That's surely true for devices outside of IOMMU protection. But do we really have to give up when we encapsulate and isolate them that way? Provided we moderate access to the sysfs resources via libvirt or some other management service. Proper reset of MSI or even the whole PCI config space is another issue, but one the kernel should not worry about - still, it should be fixed (therefore this patch). But even if we disallowed userland to disable MMIO and PIO access to the device, we would be be able to exclude that there are secrete channels in the device's interface having the same effect. I'm not sure I agree here. If there are secret channels to the device that let it violate the PCI express spec, it can probably break the SRIOV security model. And then you can do much more than just crash the host. Maybe, but there are also other devices. And if a guest reprograms it (firmware update...) and makes it stop reacting on requests, we may get the same effect. That would also be some kind of a secrete channel. So we likely need to enhance PCI error handling to catch and handle faults for certain devices differently - those we cannot trust to behave properly while they are under userland/guest control. Jan I agree - forwarding errors to the guest would actually be very useful - but I think we also need to analyse the problem carefully, and prevent as many ways as we can for guest to cause trouble. If possible, the protection should target userspace which would automatically include guests. Only if that is not feasible with reasonable effort, we have to rely on QEMU to save the host. And there is another issue here: unsuppported request errors should not cause kernel panics IMO. There's also the issue that qemu let guest control the MMIO/PIO bits in the command register. So there are multiple bugs. Yep, that's true. Jan signature.asc Description: OpenPGP digital signature
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Sun, Apr 08, 2012 at 08:39:35PM +0200, Jan Kiszka wrote: On 2012-04-08 20:18, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 07:37:57PM +0200, Jan Kiszka wrote: On 2012-04-08 18:08, Avi Kivity wrote: On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote: On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote: On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote: I'm thinking about this flow: FLR the device for each emulated register read it from the hardware if different from emulated register: update the internal model (for example, disabling MSI in kvm if needed) If we do it this way we get back the problem this patch is trying to solve: MSIX assigned while device memory is disabled would cause unsupported request errors. Why is that? FLR would presumably disable MSI in the device, and this line would disable it in kvm as well. The bug is that device memory is disabled (FLR would do that) while MSI is enabled in kvm. The fix is to disable MSI in kvm first. Yes, no need to repeat. My question is whether my pseudo-code does the same It doesn't seem to: FLR (disabling memory) is followed by MSI disable in kvm instead of the reverse. Ah, so the problem is the ordering? I see. and whether or not if it is better (when applied to all emulated config space). I'm not sure. I would like to see an example of a register that you have in mind. I went over the PCI registers and saw none that would be affected. Yes. I'm talking about things like enabling memory, setting up irq register, etc though. Most of this setup is done by bios. I see. So should we have a pci_reset_function() variant that limits itself to restoring just those bits? We only need kernel to restore whatever qemu emulates, but kernel doesn't know what that is. What kind of interface do you have in mind? The same as pci_reset_function(), but leaves MSI clear. I guess it's not worth it if the ordering problem is there. The core problem is not the ordering. The problem is that the kernel is susceptible to ordering mistakes of userspace. And that is because the kernel panics on PCI errors of devices that are in user hands - a critical kernel bug IMHO. I'm not sure. The pci sysfs interface is by design not secured against malicious users, isn't it? That's surely true for devices outside of IOMMU protection. But do we really have to give up when we encapsulate and isolate them that way? Provided we moderate access to the sysfs resources via libvirt or some other management service. We don't have to give up but we'd have to build such an interface: /config attribute is not it. Proper reset of MSI or even the whole PCI config space is another issue, but one the kernel should not worry about - still, it should be fixed (therefore this patch). But even if we disallowed userland to disable MMIO and PIO access to the device, we would be be able to exclude that there are secrete channels in the device's interface having the same effect. I'm not sure I agree here. If there are secret channels to the device that let it violate the PCI express spec, it can probably break the SRIOV security model. And then you can do much more than just crash the host. Maybe, but there are also other devices. And if a guest reprograms it (firmware update...) and makes it stop reacting on requests, we may get the same effect. That would also be some kind of a secrete channel. Right. So it looks like SRIOV VF is the only type of device that is safe to assign to a guest: Presumably, SRIOV VFs don't let driver program the firmware. And I think SRIOV VFs don't have MMIO/PIO enable bits either, and the BAR isn't programmable through the VF... So we likely need to enhance PCI error handling to catch and handle faults for certain devices differently - those we cannot trust to behave properly while they are under userland/guest control. Jan I agree - forwarding errors to the guest would actually be very useful - but I think we also need to analyse the problem carefully, and prevent as many ways as we can for guest to cause trouble. If possible, the protection should target userspace which would automatically include guests. Only if that is not feasible with reasonable effort, we have to rely on QEMU to save the host. Defence in depth is best, right? And there is another issue here: unsuppported request errors should not cause kernel panics IMO. There's also the issue that qemu let guest control the MMIO/PIO bits in the command register. So there are multiple bugs. Yep, that's true. Jan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] virtio_blk: use disk_name_format() to support mass of disks naming
On 04/04/2012 04:01 PM, Michael S. Tsirkin wrote: On Mon, Apr 02, 2012 at 12:00:45PM -0700, Tejun Heo wrote: On Mon, Apr 02, 2012 at 11:56:18AM -0700, James Bottomley wrote: So if we're agreed no other devices going forwards should ever use this interface, is there any point unifying the interface? No matter how many caveats you hedge it round with, putting the API in a central place will be a bit like a honey trap for careless bears. It might be safer just to leave it buried in the three current drivers. Yeah, that was my hope but I think it would be easier to enforce to have a common function which is clearly marked legacy so that new driver writers can go look for the naming code in the existing ones, find out they're all using the same function which is marked legacy and explains what to do for newer drivers. I think I'm not the only one to be confused about the preferred direction here. James, do you agree to the approach above? It would be nice to fix virtio block for 3.4, so how about this: - I'll just apply the original bugfix patch for 3.4 - it only affects virtio Sorry, about only affects virtio, I'm not very clear here: 1) Just duplicate the disk name format function in virtio_blk like the original patch: https://lkml.org/lkml/2012/3/28/45 2) Move the disk name format function into block core like this patch series but only affects virtio(not affect mtip32xx). Do you mean the 2) one or something else? - Ren will repost the refactoring patch on top, and we can keep up the discussion Ren if you agree, can you make this a two patch series please? Sure. -- Thanks, Ren -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] KVM: PPC: Use clockevent multiplier and shifter for decrementer
Time for which the hrtimer is started for decrementer emulation is calculated using tb_ticks_per_usec. While hrtimer uses the clockevent for DEC reprogramming (if needed) and which calculate timebase ticks using the multiplier and shifter mechanism implemented within clockevent layer. It was observed that this conversion (timebase-time-timebase) are not correct because the mechanism are not consistent. In our setup it adds 2% jitter. With this patch clockevent multiplier and shifter mechanism are used when starting hrtimer for decrementer emulation. Now the jitter is 0.5%. Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com --- v2: - decrementer_clockevent is made non-static rather than a seprate API to get mult/shift arch/powerpc/include/asm/time.h |1 + arch/powerpc/kernel/time.c |2 +- arch/powerpc/kvm/emulate.c |5 +++-- 3 files changed, 5 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h index 7eb10fb..b3c7959 100644 --- a/arch/powerpc/include/asm/time.h +++ b/arch/powerpc/include/asm/time.h @@ -28,6 +28,7 @@ extern unsigned long tb_ticks_per_jiffy; extern unsigned long tb_ticks_per_usec; extern unsigned long tb_ticks_per_sec; +extern struct clock_event_device decrementer_clockevent; struct rtc_time; extern void to_tm(int tim, struct rtc_time * tm); diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 567dd7c..e237225 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -105,7 +105,7 @@ static int decrementer_set_next_event(unsigned long evt, static void decrementer_set_mode(enum clock_event_mode mode, struct clock_event_device *dev); -static struct clock_event_device decrementer_clockevent = { +struct clock_event_device decrementer_clockevent = { .name = decrementer, .rating = 200, .irq= 0, diff --git a/arch/powerpc/kvm/emulate.c b/arch/powerpc/kvm/emulate.c index afc9154..c8b5206 100644 --- a/arch/powerpc/kvm/emulate.c +++ b/arch/powerpc/kvm/emulate.c @@ -23,6 +23,7 @@ #include linux/types.h #include linux/string.h #include linux/kvm_host.h +#include linux/clockchips.h #include asm/reg.h #include asm/time.h @@ -104,8 +105,8 @@ void kvmppc_emulate_dec(struct kvm_vcpu *vcpu) */ dec_time = vcpu-arch.dec; - dec_time *= 1000; - do_div(dec_time, tb_ticks_per_usec); + dec_time = dec_time decrementer_clockevent.shift; + do_div(dec_time, decrementer_clockevent.mult); dec_nsec = do_div(dec_time, NSEC_PER_SEC); hrtimer_start(vcpu-arch.dec_timer, ktime_set(dec_time, dec_nsec), HRTIMER_MODE_REL); -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: New git workflow
On 04/05/2012 08:02 PM, Avi Kivity wrote: I'll publish the new branches tomorrow, with any luck. There wasn't any luck, so it's only ready today. To allow chance for review, I'm publishing next as next-candidate. Paul/Alex, please review the powerpc bits. Specifically: system.h is gone, so I moved the prototype of load_up_fpu() to asm/switch_to.h and added a #include (8fae845f4956d). E6500 was added in upstream in parallel with the split of kvm E500/E500MC. I guessed which part of the #ifdef E6500 was to go into, but please verify (73196cd364a2, 06aae86799c1b). Please either post fixup patches atop next-candidate, if needed, or (preferably) post a new tree of the powerpc update based on b6d33834bd4e, to replace b6d33834bd4e..966cd0f3bdd. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] KVM: PPC: Use clockevent multiplier and shifter for decrementer
Time for which the hrtimer is started for decrementer emulation is calculated using tb_ticks_per_usec. While hrtimer uses the clockevent for DEC reprogramming (if needed) and which calculate timebase ticks using the multiplier and shifter mechanism implemented within clockevent layer. It was observed that this conversion (timebase-time-timebase) are not correct because the mechanism are not consistent. In our setup it adds 2% jitter. With this patch clockevent multiplier and shifter mechanism are used when starting hrtimer for decrementer emulation. Now the jitter is 0.5%. Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com --- v2: - decrementer_clockevent is made non-static rather than a seprate API to get mult/shift arch/powerpc/include/asm/time.h |1 + arch/powerpc/kernel/time.c |2 +- arch/powerpc/kvm/emulate.c |5 +++-- 3 files changed, 5 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h index 7eb10fb..b3c7959 100644 --- a/arch/powerpc/include/asm/time.h +++ b/arch/powerpc/include/asm/time.h @@ -28,6 +28,7 @@ extern unsigned long tb_ticks_per_jiffy; extern unsigned long tb_ticks_per_usec; extern unsigned long tb_ticks_per_sec; +extern struct clock_event_device decrementer_clockevent; struct rtc_time; extern void to_tm(int tim, struct rtc_time * tm); diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 567dd7c..e237225 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -105,7 +105,7 @@ static int decrementer_set_next_event(unsigned long evt, static void decrementer_set_mode(enum clock_event_mode mode, struct clock_event_device *dev); -static struct clock_event_device decrementer_clockevent = { +struct clock_event_device decrementer_clockevent = { .name = decrementer, .rating = 200, .irq= 0, diff --git a/arch/powerpc/kvm/emulate.c b/arch/powerpc/kvm/emulate.c index afc9154..c8b5206 100644 --- a/arch/powerpc/kvm/emulate.c +++ b/arch/powerpc/kvm/emulate.c @@ -23,6 +23,7 @@ #include linux/types.h #include linux/string.h #include linux/kvm_host.h +#include linux/clockchips.h #include asm/reg.h #include asm/time.h @@ -104,8 +105,8 @@ void kvmppc_emulate_dec(struct kvm_vcpu *vcpu) */ dec_time = vcpu-arch.dec; - dec_time *= 1000; - do_div(dec_time, tb_ticks_per_usec); + dec_time = dec_time decrementer_clockevent.shift; + do_div(dec_time, decrementer_clockevent.mult); dec_nsec = do_div(dec_time, NSEC_PER_SEC); hrtimer_start(vcpu-arch.dec_timer, ktime_set(dec_time, dec_nsec), HRTIMER_MODE_REL); -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html