Re: source for virt io backend driver

2012-04-08 Thread Nadav Har'El
On Sat, Apr 07, 2012, Steven wrote about source for virt io backend driver:
 Hi,
 I would like to know where I can find the source for backend driver of
 virtio device. For example, I know that the front-end net driver is
...
 Can anyone help to point out where the source of net virtio backend in
 the host kernel?

If you use normal virtio, the backend is in qemu, i.e., host user
space, *not* in the host kernel. So you need to look for it in qemu,
not kvm.

If you want the backend to be in the kernel, then you probably mean
vhost-net. This you can find in drivers/vhost/net.c (and
drivers/vhost/vhost.c for the generic vhost infrastructure for virtio
drivers in the host kernel).

Of course, also with vhost-net, qemu is involved in setting up this
device. But qemu doesn't need to get involved in the data path (data,
interrupts, etc.) which is done completely in the kernel, and therefore
much more efficiently.

-- 
Nadav Har'El| Sunday, Apr 8 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Lottery: You need a dollar and a dream.
http://nadav.harel.org.il   |We'll take the dollar, you keep the dream.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 43068] Operation restricted to levels L0-2 - kerneloops when booting L3

2012-04-08 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=43068


Nadav Har'El n...@math.technion.ac.il changed:

   What|Removed |Added

 CC||n...@math.technion.ac.il




--- Comment #1 from Nadav Har'El n...@math.technion.ac.il  2012-04-08 
08:18:32 ---
Hi, indeed, theoretically L3 should work; And also in practice - in my tests it
did work (albeit very very slowly). I'll need to look into this issue, and
check why this bug is happening now.

You should just be aware that even if it will work, L3 will be extremely slow,
likely to the point of not being useful. One of reasons is the lack of nested
EPT in the upstream version. An even bigger problem is the exponential
explosion of exits described in the Turtles paper, which is made much worse by
one particular part of the existing implementation: Right now, on every nested
entry from L1 to L2, L0 recreates all the fields of the vmcs (see
prepare_vmcs02()) doing a few dozen VMWRITEs. When L0 does it, it's fine - but
when we are nested yet deeper, and there is an entry from L2 into L3 and *L1*
needs to do prepare_vmcs02(), now L1 calls a lot of VMWRITEs and they all cause
exits, and all of this is extremely slow.

There are two things we could do in the future to solve this performance
problem - if people really need to use L3 (currently, even L2 isn't very
popular ;-)). One is to have some sort of exit-less VMREAD/VMWRITE, with or
without hardware support for this feature. The second thing is to emulate
VMWRITE differently, writing to vmcs02 immediately and doing far fewer VMWRITES
on nested entries (doing this correctly is harder than it might seem, as I can
explain in a separate thread).

Nadav Har'El

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] KVM fixes for 3.4-rc1

2012-04-08 Thread Avi Kivity
On 04/07/2012 07:55 PM, Linus Torvalds wrote:
 On Thu, Apr 5, 2012 at 9:40 AM, Avi Kivity a...@redhat.com wrote:
  Linus, please pull a few fixes from
 
   git://git.kernel.org/pub/scm/virt/kvm/kvm.git kvm-updates/3.4
 
  A bunch of powerpc KVM fixes, a guest and a host RCU fix (unrelated),
  and a small build fix.

 Grr.

static inline int pull_function(void)
 {
 return -ENODIFFSTAT;
 }

 Please use the proper scripting to do pull requests, like git request-pull


I did, but I left the diffstat out accidentally, sorry.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: New git workflow

2012-04-08 Thread Avi Kivity
On 04/05/2012 08:02 PM, Avi Kivity wrote:
 I'll publish the new branches tomorrow, with any luck.

There wasn't any luck, so it's only ready today.  To allow chance for
review, I'm publishing next as next-candidate.

Paul/Alex, please review the powerpc bits.  Specifically:

  system.h is gone, so I moved the prototype of load_up_fpu() to
asm/switch_to.h and added a #include (8fae845f4956d).
  E6500 was added in upstream in parallel with the split of kvm
E500/E500MC.  I guessed which part of the #ifdef E6500 was to go into,
but please verify (73196cd364a2, 06aae86799c1b).

Please either post fixup patches atop next-candidate, if needed, or
(preferably) post a new tree of the powerpc update based on
b6d33834bd4e, to replace b6d33834bd4e..966cd0f3bdd.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] Improve iteration through sptes from rmap

2012-04-08 Thread Avi Kivity
On 03/21/2012 04:48 PM, Takuya Yoshikawa wrote:
 By removing sptep from rmap_iterator, I could achieve 15% performance
 improvement without inlining.



Thanks, applied to next-candidate.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] KVM: MMU: Make pte_list_desc fit cache lines well

2012-04-08 Thread Avi Kivity
On 03/21/2012 04:49 PM, Takuya Yoshikawa wrote:
 From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp

 We have PTE_LIST_EXT + 1 pointers in this structure and these 40/20
 bytes do not fit cache lines well.  Furthermore, some allocators may
 use 64/32-byte objects for the pte_list_desc cache.

 This patch solves this problem by changing PTE_LIST_EXT from 4 to 3.

 For shadow paging, the new size is still large enough to hold both the
 kernel and process mappings for usual anonymous pages.  For file
 mappings, there may be a slight change in the cache usage.

 Note: with EPT/NPT we almost always have a single spte in each reverse
 mapping and we will not see any change by this.

 @@ -135,8 +135,6 @@ module_param(dbg, bool, 0644);
  #define PT64_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | PT_USER_MASK \
   | PT64_NX_MASK)
  
 -#define PTE_LIST_EXT 4
 -
  #define ACC_EXEC_MASK1
  #define ACC_WRITE_MASK   PT_WRITABLE_MASK
  #define ACC_USER_MASKPT_USER_MASK
 @@ -151,6 +149,9 @@ module_param(dbg, bool, 0644);
  
  #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
  
 +/* make pte_list_desc fit well in cache line */
 +#define PTE_LIST_EXT 3
 +
  struct pte_list_desc {
   u64 *sptes[PTE_LIST_EXT];
   struct pte_list_desc *more;

We could go even further and have 4 pointers, and use bit 0 to decide
whether it's a next pointer or an sptep.

Not sure it's worth the extra complexity.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Avi Kivity
On 04/05/2012 06:42 AM, Alex Williamson wrote:
 We've hit a kernel host panic, when issuing a 'system_reset' with an
 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815.

 [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 
 32993
 [Hardware Error]: APEI generic hardware error status
 [Hardware Error]: severity: 1, fatal
 [Hardware Error]: section: 0, severity: 1, fatal
 [Hardware Error]: flags: 0x01
 [Hardware Error]: primary
 [Hardware Error]: section_type: PCIe error
 [Hardware Error]: port_type: 0, PCIe end point
 [Hardware Error]: version: 1.0
 [Hardware Error]: command: 0x, status: 0x0010
 [Hardware Error]: device_id: :08:00.0
 [Hardware Error]: slot: 1
 [Hardware Error]: secondary_bus: 0x00
 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
 [Hardware Error]: class_code: 02
 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
 [Hardware Error]: Unsupported Request
 [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
 [Hardware Error]: aer_uncor_severity: 0x00067011
 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
 [Hardware Error]: section: 1, severity: 1, fatal
 [Hardware Error]: flags: 0x01
 [Hardware Error]: primary
 [Hardware Error]: section_type: PCIe error
 [Hardware Error]: port_type: 0, PCIe end point
 [Hardware Error]: version: 1.0
 [Hardware Error]: command: 0x, status: 0x0010
 [Hardware Error]: device_id: :08:00.0
 [Hardware Error]: slot: 1
 [Hardware Error]: secondary_bus: 0x00
 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
 [Hardware Error]: class_code: 02
 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
 [Hardware Error]: Unsupported Request
 [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
 [Hardware Error]: aer_uncor_severity: 0x00067011
 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
 Kernel panic - not syncing: Fatal hardware error!
 Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
 Call Trace:
  NMI  [814f2fe5] ? panic+0xa0/0x168
  [812f919c] ? ghes_notify_nmi+0x17c/0x180
  [814f91d5] ? notifier_call_chain+0x55/0x80
  [814f923a] ? atomic_notifier_call_chain+0x1a/0x20
  [8109667e] ? notify_die+0x2e/0x30
  [814f6e81] ? do_nmi+0x1a1/0x2b0
  [814f6760] ? nmi+0x20/0x30
  [8103762b] ? native_safe_halt+0xb/0x10
  EOE  [8101495d] ? default_idle+0x4d/0xb0
  [81009e06] ? cpu_idle+0xb6/0x110
  [814da63a] ? rest_init+0x7a/0x80
  [81c1ff7b] ? start_kernel+0x424/0x430
  [81c1f33a] ? x86_64_start_reservations+0x125/0x129
  [81c1f438] ? x86_64_start_kernel+0xfa/0x109

 The root cause of the problem is that the 'reset_assigned_device()' code
 first writes a 0 to the command register. Then, when qemu subsequently does
 a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
 the kernel ends up calling '__msix_mask_irq()', which performs a write to
 the memory mapped msi vector space. Since, we've explicitly told the device
 to disallow mmio access (via the 0 write to the command register), we end
 up with the above 'Unsupported Request'.

 The fix here is to first disable MSI-X, before doing the reset.  We also
 disable MSI, leaving the device in INTx mode.  In this way, the device is
 a known state after reset, and we avoid touching msi memory mapped space
 on any subsequent 'kvm_deassign_irq()'.

 Thanks to Michael S. Tsirkin for help in understanding what was going on
 here and Jason Baron, the original debugger of this problem.

 Signed-off-by: Alex Williamson alex.william...@redhat.com
 ---

 Jason is out of the office for a couple weeks, so I'll try to resolve
 this while he's away.  Somehow the emulated config updates were lost
 in Jason's original posting, so I've fixed that and taken Jan's suggestion
 to simply call into the update functions instead of open coding the
 interrupt disable.  I think there still may be some disagreements about
 how to handle guest generated errors in the host, but that's a large
 project whereas this is something we should be doing at reset anyway,
 and even if only a workaround, resolves the problem above.

  hw/device-assignment.c |   23 +++
  1 files changed, 23 insertions(+), 0 deletions(-)

 diff --git a/hw/device-assignment.c b/hw/device-assignment.c
 index 89823f1..2e6b93e 100644
 --- a/hw/device-assignment.c
 +++ b/hw/device-assignment.c
 @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev)
  const char reset[] = 1;
  int fd, ret;
  
 +/*
 + * If a guest is reset without being shutdown, MSI/MSI-X can still
 + * be running.  We want to return the device to a known state on
 + * reset, so disable those here.  We especially do not want MSI-X
 + * enabled since it lives in MMIO space, which is about to get
 + * disabled.
 + */
 +if 

Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Michael S. Tsirkin
On Sun, Apr 08, 2012 at 04:14:29PM +0300, Avi Kivity wrote:
 On 04/05/2012 06:42 AM, Alex Williamson wrote:
  We've hit a kernel host panic, when issuing a 'system_reset' with an
  82576 nic assigned and a Windows guest. Host system is a PowerEdge R815.
 
  [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 
  32993
  [Hardware Error]: APEI generic hardware error status
  [Hardware Error]: severity: 1, fatal
  [Hardware Error]: section: 0, severity: 1, fatal
  [Hardware Error]: flags: 0x01
  [Hardware Error]: primary
  [Hardware Error]: section_type: PCIe error
  [Hardware Error]: port_type: 0, PCIe end point
  [Hardware Error]: version: 1.0
  [Hardware Error]: command: 0x, status: 0x0010
  [Hardware Error]: device_id: :08:00.0
  [Hardware Error]: slot: 1
  [Hardware Error]: secondary_bus: 0x00
  [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
  [Hardware Error]: class_code: 02
  [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
  [Hardware Error]: Unsupported Request
  [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
  [Hardware Error]: aer_uncor_severity: 0x00067011
  [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
  [Hardware Error]: section: 1, severity: 1, fatal
  [Hardware Error]: flags: 0x01
  [Hardware Error]: primary
  [Hardware Error]: section_type: PCIe error
  [Hardware Error]: port_type: 0, PCIe end point
  [Hardware Error]: version: 1.0
  [Hardware Error]: command: 0x, status: 0x0010
  [Hardware Error]: device_id: :08:00.0
  [Hardware Error]: slot: 1
  [Hardware Error]: secondary_bus: 0x00
  [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
  [Hardware Error]: class_code: 02
  [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
  [Hardware Error]: Unsupported Request
  [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
  [Hardware Error]: aer_uncor_severity: 0x00067011
  [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
  Kernel panic - not syncing: Fatal hardware error!
  Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
  Call Trace:
   NMI  [814f2fe5] ? panic+0xa0/0x168
   [812f919c] ? ghes_notify_nmi+0x17c/0x180
   [814f91d5] ? notifier_call_chain+0x55/0x80
   [814f923a] ? atomic_notifier_call_chain+0x1a/0x20
   [8109667e] ? notify_die+0x2e/0x30
   [814f6e81] ? do_nmi+0x1a1/0x2b0
   [814f6760] ? nmi+0x20/0x30
   [8103762b] ? native_safe_halt+0xb/0x10
   EOE  [8101495d] ? default_idle+0x4d/0xb0
   [81009e06] ? cpu_idle+0xb6/0x110
   [814da63a] ? rest_init+0x7a/0x80
   [81c1ff7b] ? start_kernel+0x424/0x430
   [81c1f33a] ? x86_64_start_reservations+0x125/0x129
   [81c1f438] ? x86_64_start_kernel+0xfa/0x109
 
  The root cause of the problem is that the 'reset_assigned_device()' code
  first writes a 0 to the command register. Then, when qemu subsequently does
  a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
  the kernel ends up calling '__msix_mask_irq()', which performs a write to
  the memory mapped msi vector space. Since, we've explicitly told the device
  to disallow mmio access (via the 0 write to the command register), we end
  up with the above 'Unsupported Request'.
 
  The fix here is to first disable MSI-X, before doing the reset.  We also
  disable MSI, leaving the device in INTx mode.  In this way, the device is
  a known state after reset, and we avoid touching msi memory mapped space
  on any subsequent 'kvm_deassign_irq()'.
 
  Thanks to Michael S. Tsirkin for help in understanding what was going on
  here and Jason Baron, the original debugger of this problem.
 
  Signed-off-by: Alex Williamson alex.william...@redhat.com
  ---
 
  Jason is out of the office for a couple weeks, so I'll try to resolve
  this while he's away.  Somehow the emulated config updates were lost
  in Jason's original posting, so I've fixed that and taken Jan's suggestion
  to simply call into the update functions instead of open coding the
  interrupt disable.  I think there still may be some disagreements about
  how to handle guest generated errors in the host, but that's a large
  project whereas this is something we should be doing at reset anyway,
  and even if only a workaround, resolves the problem above.
 
   hw/device-assignment.c |   23 +++
   1 files changed, 23 insertions(+), 0 deletions(-)
 
  diff --git a/hw/device-assignment.c b/hw/device-assignment.c
  index 89823f1..2e6b93e 100644
  --- a/hw/device-assignment.c
  +++ b/hw/device-assignment.c
  @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev)
   const char reset[] = 1;
   int fd, ret;
   
  +/*
  + * If a guest is reset without being shutdown, MSI/MSI-X can still
  + * be running.  We want to return the device to a known state on
  + * reset, so disable those 

Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Avi Kivity
On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
  
  Don't we FLR the device, which ought to disable MSI on the real device?

 AFAIK we call pci_reset, which saves device state, does an FLR
 and then restores the state. I think this might include msi as well.



Then that is wrong as well, no?

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Michael S. Tsirkin
On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
 On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
   
   Don't we FLR the device, which ought to disable MSI on the real device?
 
  AFAIK we call pci_reset, which saves device state, does an FLR
  and then restores the state. I think this might include msi as well.
 
 
 
 Then that is wrong as well, no?

Not as such assuming we disable msi/msix first :)

 -- 
 error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Avi Kivity
On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
 On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
  On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:

Don't we FLR the device, which ought to disable MSI on the real device?
  
   AFAIK we call pci_reset, which saves device state, does an FLR
   and then restores the state. I think this might include msi as well.
  
  
  
  Then that is wrong as well, no?

 Not as such assuming we disable msi/msix first :)

I think we need to fix both, no?

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Michael S. Tsirkin
On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote:
 On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
  On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
   On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
 
 Don't we FLR the device, which ought to disable MSI on the real 
 device?
   
AFAIK we call pci_reset, which saves device state, does an FLR
and then restores the state. I think this might include msi as well.
   
   
   
   Then that is wrong as well, no?
 
  Not as such assuming we disable msi/msix first :)
 
 I think we need to fix both, no?

Isn't this what this patch does?

 -- 
 error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Avi Kivity
On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote:
 On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote:
  On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
   On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
  
  Don't we FLR the device, which ought to disable MSI on the real 
  device?

 AFAIK we call pci_reset, which saves device state, does an FLR
 and then restores the state. I think this might include msi as well.



Then that is wrong as well, no?
  
   Not as such assuming we disable msi/msix first :)
  
  I think we need to fix both, no?

 Isn't this what this patch does?

If we change pci_reset() (or a variant that we call) to reset MSI, and
update qemu to synchronize from the device after pci_reset(), then we
achieve the same result, in a different way.

Since reset can change other config space registers, we achieve
correctness for more of them.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Michael S. Tsirkin
On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote:
 On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote:
  On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote:
   On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
 On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
   
   Don't we FLR the device, which ought to disable MSI on the real 
   device?
 
  AFAIK we call pci_reset, which saves device state, does an FLR
  and then restores the state. I think this might include msi as well.
 
 
 
 Then that is wrong as well, no?
   
Not as such assuming we disable msi/msix first :)
   
   I think we need to fix both, no?
 
  Isn't this what this patch does?
 
 If we change pci_reset() (or a variant that we call) to reset MSI, and
 update qemu to synchronize from the device after pci_reset(), then we
 achieve the same result, in a different way.

MSI vectors are set up by kvm in the host. So we should not
abruptly drop that by a sysfs write: would need to
synchronise with kvm. Once we do, there's nothing left
for pci_reset to do.


 Since reset can change other config space registers, we achieve
 correctness for more of them.

Which other registers do you have in mind?

 -- 
 error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Avi Kivity
On 04/08/2012 04:53 PM, Michael S. Tsirkin wrote:
 On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote:
  On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote:
   On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote:
On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
 On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
  On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:

Don't we FLR the device, which ought to disable MSI on the real 
device?
  
   AFAIK we call pci_reset, which saves device state, does an FLR
   and then restores the state. I think this might include msi as 
   well.
  
  
  
  Then that is wrong as well, no?

 Not as such assuming we disable msi/msix first :)

I think we need to fix both, no?
  
   Isn't this what this patch does?
  
  If we change pci_reset() (or a variant that we call) to reset MSI, and
  update qemu to synchronize from the device after pci_reset(), then we
  achieve the same result, in a different way.

 MSI vectors are set up by kvm in the host. So we should not
 abruptly drop that by a sysfs write: would need to
 synchronise with kvm. Once we do, there's nothing left
 for pci_reset to do.

I'm thinking about this flow:

  FLR the device
  for each emulated register
 read it from the hardware
 if different from emulated register:
update the internal model (for example, disabling MSI in kvm if
needed)
set emulated register to hardware value

  Since reset can change other config space registers, we achieve
  correctness for more of them.

 Which other registers do you have in mind?

BARs for example.  We may have our own reset for this, but isn't copying
the hardware values more trustworthy?

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Michael S. Tsirkin
On Sun, Apr 08, 2012 at 05:01:36PM +0300, Avi Kivity wrote:
 On 04/08/2012 04:53 PM, Michael S. Tsirkin wrote:
  On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote:
   On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote:
On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote:
 On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
  On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
   On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
 
 Don't we FLR the device, which ought to disable MSI on the 
 real device?
   
AFAIK we call pci_reset, which saves device state, does an FLR
and then restores the state. I think this might include msi as 
well.
   
   
   
   Then that is wrong as well, no?
 
  Not as such assuming we disable msi/msix first :)
 
 I think we need to fix both, no?
   
Isn't this what this patch does?
   
   If we change pci_reset() (or a variant that we call) to reset MSI, and
   update qemu to synchronize from the device after pci_reset(), then we
   achieve the same result, in a different way.
 
  MSI vectors are set up by kvm in the host. So we should not
  abruptly drop that by a sysfs write: would need to
  synchronise with kvm. Once we do, there's nothing left
  for pci_reset to do.
 
 I'm thinking about this flow:
 
   FLR the device
   for each emulated register
  read it from the hardware
  if different from emulated register:
 update the internal model (for example, disabling MSI in kvm if
 needed)

If we do it this way we get back the problem this patch
is trying to solve: MSIX assigned while device
memory is disabled would cause unsupported request errors.

 set emulated register to hardware value

Yes, I see what you are trying to say now.
Unfortunately that's not enough: we also
need to restore the registers afterwards for
device to become useful again.
Doing this in kernel seems more robust, otherwise
we risk losing the device if qemu gets killed
before it has restored the registers.


   Since reset can change other config space registers, we achieve
   correctness for more of them.
 
  Which other registers do you have in mind?
 
 BARs for example.  We may have our own reset for this, but isn't copying
 the hardware values more trustworthy?

BAR values in host and guest are unrelated.
If pci_reset didn't restore BAR values we won't
be able to operate the device.


 -- 
 error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] KVM: MMU: Make pte_list_desc fit cache lines well

2012-04-08 Thread Takuya Yoshikawa
On Sun, 08 Apr 2012 16:09:58 +0300
Avi Kivity a...@redhat.com wrote:

  +/* make pte_list_desc fit well in cache line */
  +#define PTE_LIST_EXT 3
  +
   struct pte_list_desc {
  u64 *sptes[PTE_LIST_EXT];
  struct pte_list_desc *more;
 
 We could go even further and have 4 pointers, and use bit 0 to decide
 whether it's a next pointer or an sptep.
 
 Not sure it's worth the extra complexity.
 

May not be so complex if we reuse rmap encoding/decoding but it is
for shadow paging only hack ... and mmu is already enough complex;
so not sure - I need to think again later.

My primary goal is to make the code saner and faster.

Takuya
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Avi Kivity
On 04/08/2012 05:42 PM, Michael S. Tsirkin wrote:
 On Sun, Apr 08, 2012 at 05:01:36PM +0300, Avi Kivity wrote:
  On 04/08/2012 04:53 PM, Michael S. Tsirkin wrote:
   On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote:
On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote:
 On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote:
  On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
   On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
  
  Don't we FLR the device, which ought to disable MSI on the 
  real device?

 AFAIK we call pci_reset, which saves device state, does an FLR
 and then restores the state. I think this might include msi 
 as well.



Then that is wrong as well, no?
  
   Not as such assuming we disable msi/msix first :)
  
  I think we need to fix both, no?

 Isn't this what this patch does?

If we change pci_reset() (or a variant that we call) to reset MSI, and
update qemu to synchronize from the device after pci_reset(), then we
achieve the same result, in a different way.
  
   MSI vectors are set up by kvm in the host. So we should not
   abruptly drop that by a sysfs write: would need to
   synchronise with kvm. Once we do, there's nothing left
   for pci_reset to do.
  
  I'm thinking about this flow:
  
FLR the device
for each emulated register
   read it from the hardware
   if different from emulated register:
  update the internal model (for example, disabling MSI in kvm if
  needed)

 If we do it this way we get back the problem this patch
 is trying to solve: MSIX assigned while device
 memory is disabled would cause unsupported request errors.

Why is that?  FLR would presumably disable MSI in the device, and this
line would disable it in kvm as well.

  set emulated register to hardware value

 Yes, I see what you are trying to say now.
 Unfortunately that's not enough: we also
 need to restore the registers afterwards for
 device to become useful again.

I guess this is correct for the MSIX BAR.  But is it correct for MSIX
enable/disable?

 Doing this in kernel seems more robust, otherwise
 we risk losing the device if qemu gets killed
 before it has restored the registers.

Doesn't the driver have to enable MSIX if it attaches to the device at
that point, anyway?


Since reset can change other config space registers, we achieve
correctness for more of them.
  
   Which other registers do you have in mind?
  
  BARs for example.  We may have our own reset for this, but isn't copying
  the hardware values more trustworthy?

 BAR values in host and guest are unrelated.
 If pci_reset didn't restore BAR values we won't
 be able to operate the device.


Right.

I guess candidates are those that are initialized with
assigned_dev_emulate_config_*()?  Hard to see which ones because they're
mass initialized.



-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Questing regarding KVM Guest PMU

2012-04-08 Thread Gleb Natapov
On Fri, Apr 06, 2012 at 09:50:50AM +0300, Gleb Natapov wrote:
 On Fri, Apr 06, 2012 at 10:43:17AM +0530, shashank rachamalla wrote:
  On Thu, Apr 5, 2012 at 8:11 PM, Gleb Natapov g...@redhat.com wrote:
   On Thu, Apr 05, 2012 at 05:38:40PM +0300, Avi Kivity wrote:
   On 04/05/2012 04:57 PM, Gleb Natapov wrote:
 
 May be it used NMI based profiling. We should ask oprofile 
 developers.
 As I said I am almost sure my inability to run it on a host is 
 probably
 PEBKAC, although I ran the same script exactly on the host and the
 guest (the script is from the first email of this thread)

After upgrading the kernel to latest git from whatever it was there the
same script works and counts CPU_CLK_UNHALT events.
   
  
   This is even while it violates the Intel guidelines?
  
   Yes, but who says the result is correct :) It seems that we handle
   global ctrl msr wrong. That is counter can be enabled either in global
   ctrl or in eventsel. Trying to confirm that.
  
  if that becomes true then will global ctrl msr have any significance ?
 When it is in use yes.
 
I was wrong. We do handle global ctrl msr correctly, I just ran my test
incorrectly. If I disable global ctrl on all cpus (for i in `seq 0 15`;
do wrmsr -p $i 0x38f 0; done) oprofile stops working.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Michael S. Tsirkin
On Sun, Apr 08, 2012 at 06:26:28PM +0300, Avi Kivity wrote:
 On 04/08/2012 05:42 PM, Michael S. Tsirkin wrote:
  On Sun, Apr 08, 2012 at 05:01:36PM +0300, Avi Kivity wrote:
   On 04/08/2012 04:53 PM, Michael S. Tsirkin wrote:
On Sun, Apr 08, 2012 at 04:41:27PM +0300, Avi Kivity wrote:
 On 04/08/2012 04:30 PM, Michael S. Tsirkin wrote:
  On Sun, Apr 08, 2012 at 04:24:11PM +0300, Avi Kivity wrote:
   On 04/08/2012 04:21 PM, Michael S. Tsirkin wrote:
On Sun, Apr 08, 2012 at 04:18:29PM +0300, Avi Kivity wrote:
 On 04/08/2012 04:17 PM, Michael S. Tsirkin wrote:
   
   Don't we FLR the device, which ought to disable MSI on 
   the real device?
 
  AFAIK we call pci_reset, which saves device state, does an 
  FLR
  and then restores the state. I think this might include msi 
  as well.
 
 
 
 Then that is wrong as well, no?
   
Not as such assuming we disable msi/msix first :)
   
   I think we need to fix both, no?
 
  Isn't this what this patch does?
 
 If we change pci_reset() (or a variant that we call) to reset MSI, and
 update qemu to synchronize from the device after pci_reset(), then we
 achieve the same result, in a different way.
   
MSI vectors are set up by kvm in the host. So we should not
abruptly drop that by a sysfs write: would need to
synchronise with kvm. Once we do, there's nothing left
for pci_reset to do.
   
   I'm thinking about this flow:
   
 FLR the device
 for each emulated register
read it from the hardware
if different from emulated register:
   update the internal model (for example, disabling MSI in kvm if
   needed)
 
  If we do it this way we get back the problem this patch
  is trying to solve: MSIX assigned while device
  memory is disabled would cause unsupported request errors.
 
 Why is that?  FLR would presumably disable MSI in the device, and this
 line would disable it in kvm as well.

The bug is that device memory is disabled (FLR would do that)
while MSI is enabled in kvm. The fix is to
disable MSI in kvm first.

   set emulated register to hardware value
 
  Yes, I see what you are trying to say now.
  Unfortunately that's not enough: we also
  need to restore the registers afterwards for
  device to become useful again.
 
 I guess this is correct for the MSIX BAR.  But is it correct for MSIX
 enable/disable?

Probably not.

  Doing this in kernel seems more robust, otherwise
  we risk losing the device if qemu gets killed
  before it has restored the registers.
 
 Doesn't the driver have to enable MSIX if it attaches to the device at
 that point, anyway?

Yes. I'm talking about things like enabling memory, setting up irq register,
etc though. Most of this setup is done by bios.

 Since reset can change other config space registers, we achieve
 correctness for more of them.
   
Which other registers do you have in mind?
   
   BARs for example.  We may have our own reset for this, but isn't copying
   the hardware values more trustworthy?
 
  BAR values in host and guest are unrelated.
  If pci_reset didn't restore BAR values we won't
  be able to operate the device.
 
 
 Right.
 
 I guess candidates are those that are initialized with
 assigned_dev_emulate_config_*()?  Hard to see which ones because they're
 mass initialized.
 
 
 
 -- 
 error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Avi Kivity
On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:

I'm thinking about this flow:

  FLR the device
  for each emulated register
 read it from the hardware
 if different from emulated register:
update the internal model (for example, disabling MSI in kvm if
needed)
  
   If we do it this way we get back the problem this patch
   is trying to solve: MSIX assigned while device
   memory is disabled would cause unsupported request errors.
  
  Why is that?  FLR would presumably disable MSI in the device, and this
  line would disable it in kvm as well.

 The bug is that device memory is disabled (FLR would do that)
 while MSI is enabled in kvm. The fix is to
 disable MSI in kvm first.

Yes, no need to repeat.  My question is whether my pseudo-code does the
same and whether or not if it is better (when applied to all emulated
config space).

   Doing this in kernel seems more robust, otherwise
   we risk losing the device if qemu gets killed
   before it has restored the registers.
  
  Doesn't the driver have to enable MSIX if it attaches to the device at
  that point, anyway?

 Yes. I'm talking about things like enabling memory, setting up irq register,
 etc though. Most of this setup is done by bios.

I see.  So should we have a pci_reset_function() variant that limits
itself to restoring just those bits?



-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Michael S. Tsirkin
On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote:
 On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:
 
 I'm thinking about this flow:
 
   FLR the device
   for each emulated register
  read it from the hardware
  if different from emulated register:
 update the internal model (for example, disabling MSI in kvm 
 if
 needed)
   
If we do it this way we get back the problem this patch
is trying to solve: MSIX assigned while device
memory is disabled would cause unsupported request errors.
   
   Why is that?  FLR would presumably disable MSI in the device, and this
   line would disable it in kvm as well.
 
  The bug is that device memory is disabled (FLR would do that)
  while MSI is enabled in kvm. The fix is to
  disable MSI in kvm first.
 
 Yes, no need to repeat.  My question is whether my pseudo-code does the
 same

It doesn't seem to: FLR (disabling memory) is followed
by MSI disable in kvm instead of the reverse.

 and whether or not if it is better (when applied to all emulated
 config space).

I'm not sure.
I would like to see an example of a register that you have
in mind.

Doing this in kernel seems more robust, otherwise
we risk losing the device if qemu gets killed
before it has restored the registers.
   
   Doesn't the driver have to enable MSIX if it attaches to the device at
   that point, anyway?
 
  Yes. I'm talking about things like enabling memory, setting up irq register,
  etc though. Most of this setup is done by bios.
 
 I see.  So should we have a pci_reset_function() variant that limits
 itself to restoring just those bits?

We only need kernel to restore whatever qemu emulates, but
kernel doesn't know what that is.
What kind of interface do you have in mind?

 -- 
 error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Avi Kivity
On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote:
 On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote:
  On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:
  
  I'm thinking about this flow:
  
FLR the device
for each emulated register
   read it from the hardware
   if different from emulated register:
  update the internal model (for example, disabling MSI in 
  kvm if
  needed)

 If we do it this way we get back the problem this patch
 is trying to solve: MSIX assigned while device
 memory is disabled would cause unsupported request errors.

Why is that?  FLR would presumably disable MSI in the device, and this
line would disable it in kvm as well.
  
   The bug is that device memory is disabled (FLR would do that)
   while MSI is enabled in kvm. The fix is to
   disable MSI in kvm first.
  
  Yes, no need to repeat.  My question is whether my pseudo-code does the
  same

 It doesn't seem to: FLR (disabling memory) is followed
 by MSI disable in kvm instead of the reverse.

Ah, so the problem is the ordering?  I see.

  and whether or not if it is better (when applied to all emulated
  config space).

 I'm not sure.
 I would like to see an example of a register that you have
 in mind.

I went over the PCI registers and saw none that would be affected.

  
   Yes. I'm talking about things like enabling memory, setting up irq 
   register,
   etc though. Most of this setup is done by bios.
  
  I see.  So should we have a pci_reset_function() variant that limits
  itself to restoring just those bits?

 We only need kernel to restore whatever qemu emulates, but
 kernel doesn't know what that is.
 What kind of interface do you have in mind?


The same as pci_reset_function(), but leaves MSI clear.

I guess it's not worth it if the ordering problem is there.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: source for virt io backend driver

2012-04-08 Thread Steven
On Sun, Apr 8, 2012 at 4:13 AM, Nadav Har'El n...@math.technion.ac.il wrote:
 On Sat, Apr 07, 2012, Steven wrote about source for virt io backend driver:
 Hi,
 I would like to know where I can find the source for backend driver of
 virtio device. For example, I know that the front-end net driver is
...
 Can anyone help to point out where the source of net virtio backend in
 the host kernel?

 If you use normal virtio, the backend is in qemu, i.e., host user
 space, *not* in the host kernel. So you need to look for it in qemu,
 not kvm.

 If you want the backend to be in the kernel, then you probably mean
 vhost-net. This you can find in drivers/vhost/net.c (and
 drivers/vhost/vhost.c for the generic vhost infrastructure for virtio
 drivers in the host kernel).

Yes, I am looking for the backend code in the kernel. I found the file
for net backend drivers/vhost/net.c. However, the backend code for blk
is not there.
Could you point out this? Thanks.
By the way, I found the front-end blk code in drivers/block/virtio_blk.c


 Of course, also with vhost-net, qemu is involved in setting up this
 device. But qemu doesn't need to get involved in the data path (data,
 interrupts, etc.) which is done completely in the kernel, and therefore
 much more efficiently.

 --
 Nadav Har'El                        |                     Sunday, Apr 8 2012,
 n...@math.technion.ac.il             
 |-
 Phone +972-523-790466, ICQ 13349191 |Lottery: You need a dollar and a dream.
 http://nadav.harel.org.il           |We'll take the dollar, you keep the 
 dream.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Jan Kiszka
On 2012-04-08 18:08, Avi Kivity wrote:
 On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote:
 On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote:
 On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:

 I'm thinking about this flow:

   FLR the device
   for each emulated register
  read it from the hardware
  if different from emulated register:
 update the internal model (for example, disabling MSI in kvm if
 needed)

 If we do it this way we get back the problem this patch
 is trying to solve: MSIX assigned while device
 memory is disabled would cause unsupported request errors.

 Why is that?  FLR would presumably disable MSI in the device, and this
 line would disable it in kvm as well.

 The bug is that device memory is disabled (FLR would do that)
 while MSI is enabled in kvm. The fix is to
 disable MSI in kvm first.

 Yes, no need to repeat.  My question is whether my pseudo-code does the
 same

 It doesn't seem to: FLR (disabling memory) is followed
 by MSI disable in kvm instead of the reverse.
 
 Ah, so the problem is the ordering?  I see.
 
 and whether or not if it is better (when applied to all emulated
 config space).

 I'm not sure.
 I would like to see an example of a register that you have
 in mind.
 
 I went over the PCI registers and saw none that would be affected.
 

 Yes. I'm talking about things like enabling memory, setting up irq 
 register,
 etc though. Most of this setup is done by bios.

 I see.  So should we have a pci_reset_function() variant that limits
 itself to restoring just those bits?

 We only need kernel to restore whatever qemu emulates, but
 kernel doesn't know what that is.
 What kind of interface do you have in mind?

 
 The same as pci_reset_function(), but leaves MSI clear.
 
 I guess it's not worth it if the ordering problem is there.

The core problem is not the ordering. The problem is that the kernel is
susceptible to ordering mistakes of userspace. And that is because the
kernel panics on PCI errors of devices that are in user hands - a
critical kernel bug IMHO. Proper reset of MSI or even the whole PCI
config space is another issue, but one the kernel should not worry about
- still, it should be fixed (therefore this patch).

But even if we disallowed userland to disable MMIO and PIO access to the
device, we would be be able to exclude that there are secrete channels
in the device's interface having the same effect. So we likely need to
enhance PCI error handling to catch and handle faults for certain
devices differently - those we cannot trust to behave properly while
they are under userland/guest control.

Jan



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Michael S. Tsirkin
On Sun, Apr 08, 2012 at 07:37:57PM +0200, Jan Kiszka wrote:
 On 2012-04-08 18:08, Avi Kivity wrote:
  On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote:
  On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote:
  On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:
 
  I'm thinking about this flow:
 
FLR the device
for each emulated register
   read it from the hardware
   if different from emulated register:
  update the internal model (for example, disabling MSI in kvm 
  if
  needed)
 
  If we do it this way we get back the problem this patch
  is trying to solve: MSIX assigned while device
  memory is disabled would cause unsupported request errors.
 
  Why is that?  FLR would presumably disable MSI in the device, and this
  line would disable it in kvm as well.
 
  The bug is that device memory is disabled (FLR would do that)
  while MSI is enabled in kvm. The fix is to
  disable MSI in kvm first.
 
  Yes, no need to repeat.  My question is whether my pseudo-code does the
  same
 
  It doesn't seem to: FLR (disabling memory) is followed
  by MSI disable in kvm instead of the reverse.
  
  Ah, so the problem is the ordering?  I see.
  
  and whether or not if it is better (when applied to all emulated
  config space).
 
  I'm not sure.
  I would like to see an example of a register that you have
  in mind.
  
  I went over the PCI registers and saw none that would be affected.
  
 
  Yes. I'm talking about things like enabling memory, setting up irq 
  register,
  etc though. Most of this setup is done by bios.
 
  I see.  So should we have a pci_reset_function() variant that limits
  itself to restoring just those bits?
 
  We only need kernel to restore whatever qemu emulates, but
  kernel doesn't know what that is.
  What kind of interface do you have in mind?
 
  
  The same as pci_reset_function(), but leaves MSI clear.
  
  I guess it's not worth it if the ordering problem is there.
 
 The core problem is not the ordering. The problem is that the kernel is
 susceptible to ordering mistakes of userspace.
 And that is because the
 kernel panics on PCI errors of devices that are in user hands - a
 critical kernel bug IMHO.

I'm not sure. The pci sysfs interface
is by design not secured against malicious users,
isn't it?

 Proper reset of MSI or even the whole PCI
 config space is another issue, but one the kernel should not worry about
 - still, it should be fixed (therefore this patch).
 But even if we disallowed userland to disable MMIO and PIO access to the
 device, we would be be able to exclude that there are secrete channels
 in the device's interface having the same effect.

I'm not sure I agree here.  If there are secret channels to the device
that let it violate the PCI express spec, it can probably break the SRIOV
security model. And then you can do much more than just crash the host.

 So we likely need to
 enhance PCI error handling to catch and handle faults for certain
 devices differently - those we cannot trust to behave properly while
 they are under userland/guest control.
 
 Jan
 


I agree - forwarding errors to the guest would actually be very useful - but
I think we also need to analyse the problem carefully,
and prevent as many ways as we can for guest to cause trouble.

And there is another issue here: unsuppported request errors
should not cause kernel panics IMO.

There's also the issue that qemu let guest control the MMIO/PIO
bits in the command register.

So there are multiple bugs.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Jan Kiszka
On 2012-04-08 20:18, Michael S. Tsirkin wrote:
 On Sun, Apr 08, 2012 at 07:37:57PM +0200, Jan Kiszka wrote:
 On 2012-04-08 18:08, Avi Kivity wrote:
 On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote:
 On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote:
 On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:

 I'm thinking about this flow:

   FLR the device
   for each emulated register
  read it from the hardware
  if different from emulated register:
 update the internal model (for example, disabling MSI in kvm 
 if
 needed)

 If we do it this way we get back the problem this patch
 is trying to solve: MSIX assigned while device
 memory is disabled would cause unsupported request errors.

 Why is that?  FLR would presumably disable MSI in the device, and this
 line would disable it in kvm as well.

 The bug is that device memory is disabled (FLR would do that)
 while MSI is enabled in kvm. The fix is to
 disable MSI in kvm first.

 Yes, no need to repeat.  My question is whether my pseudo-code does the
 same

 It doesn't seem to: FLR (disabling memory) is followed
 by MSI disable in kvm instead of the reverse.

 Ah, so the problem is the ordering?  I see.

 and whether or not if it is better (when applied to all emulated
 config space).

 I'm not sure.
 I would like to see an example of a register that you have
 in mind.

 I went over the PCI registers and saw none that would be affected.


 Yes. I'm talking about things like enabling memory, setting up irq 
 register,
 etc though. Most of this setup is done by bios.

 I see.  So should we have a pci_reset_function() variant that limits
 itself to restoring just those bits?

 We only need kernel to restore whatever qemu emulates, but
 kernel doesn't know what that is.
 What kind of interface do you have in mind?


 The same as pci_reset_function(), but leaves MSI clear.

 I guess it's not worth it if the ordering problem is there.

 The core problem is not the ordering. The problem is that the kernel is
 susceptible to ordering mistakes of userspace.
 And that is because the
 kernel panics on PCI errors of devices that are in user hands - a
 critical kernel bug IMHO.
 
 I'm not sure. The pci sysfs interface
 is by design not secured against malicious users,
 isn't it?

That's surely true for devices outside of IOMMU protection. But do we
really have to give up when we encapsulate and isolate them that way?
Provided we moderate access to the sysfs resources via libvirt or some
other management service.

 
 Proper reset of MSI or even the whole PCI
 config space is another issue, but one the kernel should not worry about
 - still, it should be fixed (therefore this patch).
 But even if we disallowed userland to disable MMIO and PIO access to the
 device, we would be be able to exclude that there are secrete channels
 in the device's interface having the same effect.
 
 I'm not sure I agree here.  If there are secret channels to the device
 that let it violate the PCI express spec, it can probably break the SRIOV
 security model. And then you can do much more than just crash the host.

Maybe, but there are also other devices. And if a guest reprograms it
(firmware update...) and makes it stop reacting on requests, we may get
the same effect. That would also be some kind of a secrete channel.

 
 So we likely need to
 enhance PCI error handling to catch and handle faults for certain
 devices differently - those we cannot trust to behave properly while
 they are under userland/guest control.

 Jan

 
 
 I agree - forwarding errors to the guest would actually be very useful - but
 I think we also need to analyse the problem carefully,
 and prevent as many ways as we can for guest to cause trouble.

If possible, the protection should target userspace which would
automatically include guests. Only if that is not feasible with
reasonable effort, we have to rely on QEMU to save the host.

 
 And there is another issue here: unsuppported request errors
 should not cause kernel panics IMO.
 
 There's also the issue that qemu let guest control the MMIO/PIO
 bits in the command register.
 
 So there are multiple bugs.
 

Yep, that's true.

Jan



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-08 Thread Michael S. Tsirkin
On Sun, Apr 08, 2012 at 08:39:35PM +0200, Jan Kiszka wrote:
 On 2012-04-08 20:18, Michael S. Tsirkin wrote:
  On Sun, Apr 08, 2012 at 07:37:57PM +0200, Jan Kiszka wrote:
  On 2012-04-08 18:08, Avi Kivity wrote:
  On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote:
  On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote:
  On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:
 
  I'm thinking about this flow:
 
FLR the device
for each emulated register
   read it from the hardware
   if different from emulated register:
  update the internal model (for example, disabling MSI in 
  kvm if
  needed)
 
  If we do it this way we get back the problem this patch
  is trying to solve: MSIX assigned while device
  memory is disabled would cause unsupported request errors.
 
  Why is that?  FLR would presumably disable MSI in the device, and this
  line would disable it in kvm as well.
 
  The bug is that device memory is disabled (FLR would do that)
  while MSI is enabled in kvm. The fix is to
  disable MSI in kvm first.
 
  Yes, no need to repeat.  My question is whether my pseudo-code does the
  same
 
  It doesn't seem to: FLR (disabling memory) is followed
  by MSI disable in kvm instead of the reverse.
 
  Ah, so the problem is the ordering?  I see.
 
  and whether or not if it is better (when applied to all emulated
  config space).
 
  I'm not sure.
  I would like to see an example of a register that you have
  in mind.
 
  I went over the PCI registers and saw none that would be affected.
 
 
  Yes. I'm talking about things like enabling memory, setting up irq 
  register,
  etc though. Most of this setup is done by bios.
 
  I see.  So should we have a pci_reset_function() variant that limits
  itself to restoring just those bits?
 
  We only need kernel to restore whatever qemu emulates, but
  kernel doesn't know what that is.
  What kind of interface do you have in mind?
 
 
  The same as pci_reset_function(), but leaves MSI clear.
 
  I guess it's not worth it if the ordering problem is there.
 
  The core problem is not the ordering. The problem is that the kernel is
  susceptible to ordering mistakes of userspace.
  And that is because the
  kernel panics on PCI errors of devices that are in user hands - a
  critical kernel bug IMHO.
  
  I'm not sure. The pci sysfs interface
  is by design not secured against malicious users,
  isn't it?
 
 That's surely true for devices outside of IOMMU protection. But do we
 really have to give up when we encapsulate and isolate them that way?
 Provided we moderate access to the sysfs resources via libvirt or some
 other management service.

We don't have to give up but we'd have to build such an
interface: /config attribute is not it.

  
  Proper reset of MSI or even the whole PCI
  config space is another issue, but one the kernel should not worry about
  - still, it should be fixed (therefore this patch).
  But even if we disallowed userland to disable MMIO and PIO access to the
  device, we would be be able to exclude that there are secrete channels
  in the device's interface having the same effect.
  
  I'm not sure I agree here.  If there are secret channels to the device
  that let it violate the PCI express spec, it can probably break the SRIOV
  security model. And then you can do much more than just crash the host.
 
 Maybe, but there are also other devices. And if a guest reprograms it
 (firmware update...) and makes it stop reacting on requests, we may get
 the same effect. That would also be some kind of a secrete channel.

Right. So it looks like SRIOV VF is the only type of device that
is safe to assign to a guest:

Presumably, SRIOV VFs don't let driver program the firmware.
And I think SRIOV VFs don't have MMIO/PIO enable bits either,
and the BAR isn't programmable through the VF...

  
  So we likely need to
  enhance PCI error handling to catch and handle faults for certain
  devices differently - those we cannot trust to behave properly while
  they are under userland/guest control.
 
  Jan
 
  
  
  I agree - forwarding errors to the guest would actually be very useful - but
  I think we also need to analyse the problem carefully,
  and prevent as many ways as we can for guest to cause trouble.
 
 If possible, the protection should target userspace which would
 automatically include guests. Only if that is not feasible with
 reasonable effort, we have to rely on QEMU to save the host.

Defence in depth is best, right?

  
  And there is another issue here: unsuppported request errors
  should not cause kernel panics IMO.
  
  There's also the issue that qemu let guest control the MMIO/PIO
  bits in the command register.
  
  So there are multiple bugs.
  
 
 Yep, that's true.
 
 Jan
 


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] virtio_blk: use disk_name_format() to support mass of disks naming

2012-04-08 Thread Ren Mingxin

 On 04/04/2012 04:01 PM, Michael S. Tsirkin wrote:

On Mon, Apr 02, 2012 at 12:00:45PM -0700, Tejun Heo wrote:

On Mon, Apr 02, 2012 at 11:56:18AM -0700, James Bottomley wrote:

So if we're agreed no other devices going forwards should ever use this
interface, is there any point unifying the interface?  No matter how
many caveats you hedge it round with, putting the API in a central place
will be a bit like a honey trap for careless bears.   It might be safer
just to leave it buried in the three current drivers.

Yeah, that was my hope but I think it would be easier to enforce to
have a common function which is clearly marked legacy so that new
driver writers can go look for the naming code in the existing ones,
find out they're all using the same function which is marked legacy
and explains what to do for newer drivers.

I think I'm not the only one to be confused about the
preferred direction here.
James, do you agree to the approach above?

It would be nice to fix virtio block for 3.4, so
how about this:
- I'll just apply the original bugfix patch for 3.4 -
   it only affects virtio


Sorry, about only affects virtio, I'm not very clear here:
1) Just duplicate the disk name format function in virtio_blk
like the original patch: https://lkml.org/lkml/2012/3/28/45
2) Move the disk name format function into block core like
this patch series but only affects virtio(not affect mtip32xx).
Do you mean the 2) one or something else?


- Ren will repost the refactoring patch on top, and we can
   keep up the discussion

Ren if you agree, can you make this a two patch series please?



Sure.

--
Thanks,
Ren

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] KVM: PPC: Use clockevent multiplier and shifter for decrementer

2012-04-08 Thread Bharat Bhushan
Time for which the hrtimer is started for decrementer emulation is calculated 
using tb_ticks_per_usec. While hrtimer uses the clockevent for DEC 
reprogramming (if needed) and which calculate timebase ticks using the 
multiplier and shifter mechanism implemented within clockevent layer. It was 
observed that this conversion (timebase-time-timebase) are not correct 
because the mechanism are not consistent. In our setup it adds 2% jitter.

With this patch clockevent multiplier and shifter mechanism are used when 
starting hrtimer for decrementer emulation. Now the jitter is  0.5%.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
v2:
 - decrementer_clockevent is made non-static rather than a seprate API to get 
mult/shift

 arch/powerpc/include/asm/time.h |1 +
 arch/powerpc/kernel/time.c  |2 +-
 arch/powerpc/kvm/emulate.c  |5 +++--
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index 7eb10fb..b3c7959 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -28,6 +28,7 @@
 extern unsigned long tb_ticks_per_jiffy;
 extern unsigned long tb_ticks_per_usec;
 extern unsigned long tb_ticks_per_sec;
+extern struct clock_event_device decrementer_clockevent;
 
 struct rtc_time;
 extern void to_tm(int tim, struct rtc_time * tm);
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 567dd7c..e237225 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -105,7 +105,7 @@ static int decrementer_set_next_event(unsigned long evt,
 static void decrementer_set_mode(enum clock_event_mode mode,
 struct clock_event_device *dev);
 
-static struct clock_event_device decrementer_clockevent = {
+struct clock_event_device decrementer_clockevent = {
.name   = decrementer,
.rating = 200,
.irq= 0,
diff --git a/arch/powerpc/kvm/emulate.c b/arch/powerpc/kvm/emulate.c
index afc9154..c8b5206 100644
--- a/arch/powerpc/kvm/emulate.c
+++ b/arch/powerpc/kvm/emulate.c
@@ -23,6 +23,7 @@
 #include linux/types.h
 #include linux/string.h
 #include linux/kvm_host.h
+#include linux/clockchips.h
 
 #include asm/reg.h
 #include asm/time.h
@@ -104,8 +105,8 @@ void kvmppc_emulate_dec(struct kvm_vcpu *vcpu)
 */
 
dec_time = vcpu-arch.dec;
-   dec_time *= 1000;
-   do_div(dec_time, tb_ticks_per_usec);
+   dec_time = dec_time  decrementer_clockevent.shift;
+   do_div(dec_time, decrementer_clockevent.mult);
dec_nsec = do_div(dec_time, NSEC_PER_SEC);
hrtimer_start(vcpu-arch.dec_timer,
ktime_set(dec_time, dec_nsec), HRTIMER_MODE_REL);
-- 
1.7.0.4


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: New git workflow

2012-04-08 Thread Avi Kivity
On 04/05/2012 08:02 PM, Avi Kivity wrote:
 I'll publish the new branches tomorrow, with any luck.

There wasn't any luck, so it's only ready today.  To allow chance for
review, I'm publishing next as next-candidate.

Paul/Alex, please review the powerpc bits.  Specifically:

  system.h is gone, so I moved the prototype of load_up_fpu() to
asm/switch_to.h and added a #include (8fae845f4956d).
  E6500 was added in upstream in parallel with the split of kvm
E500/E500MC.  I guessed which part of the #ifdef E6500 was to go into,
but please verify (73196cd364a2, 06aae86799c1b).

Please either post fixup patches atop next-candidate, if needed, or
(preferably) post a new tree of the powerpc update based on
b6d33834bd4e, to replace b6d33834bd4e..966cd0f3bdd.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] KVM: PPC: Use clockevent multiplier and shifter for decrementer

2012-04-08 Thread Bharat Bhushan
Time for which the hrtimer is started for decrementer emulation is calculated 
using tb_ticks_per_usec. While hrtimer uses the clockevent for DEC 
reprogramming (if needed) and which calculate timebase ticks using the 
multiplier and shifter mechanism implemented within clockevent layer. It was 
observed that this conversion (timebase-time-timebase) are not correct 
because the mechanism are not consistent. In our setup it adds 2% jitter.

With this patch clockevent multiplier and shifter mechanism are used when 
starting hrtimer for decrementer emulation. Now the jitter is  0.5%.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
v2:
 - decrementer_clockevent is made non-static rather than a seprate API to get 
mult/shift

 arch/powerpc/include/asm/time.h |1 +
 arch/powerpc/kernel/time.c  |2 +-
 arch/powerpc/kvm/emulate.c  |5 +++--
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index 7eb10fb..b3c7959 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -28,6 +28,7 @@
 extern unsigned long tb_ticks_per_jiffy;
 extern unsigned long tb_ticks_per_usec;
 extern unsigned long tb_ticks_per_sec;
+extern struct clock_event_device decrementer_clockevent;
 
 struct rtc_time;
 extern void to_tm(int tim, struct rtc_time * tm);
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 567dd7c..e237225 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -105,7 +105,7 @@ static int decrementer_set_next_event(unsigned long evt,
 static void decrementer_set_mode(enum clock_event_mode mode,
 struct clock_event_device *dev);
 
-static struct clock_event_device decrementer_clockevent = {
+struct clock_event_device decrementer_clockevent = {
.name   = decrementer,
.rating = 200,
.irq= 0,
diff --git a/arch/powerpc/kvm/emulate.c b/arch/powerpc/kvm/emulate.c
index afc9154..c8b5206 100644
--- a/arch/powerpc/kvm/emulate.c
+++ b/arch/powerpc/kvm/emulate.c
@@ -23,6 +23,7 @@
 #include linux/types.h
 #include linux/string.h
 #include linux/kvm_host.h
+#include linux/clockchips.h
 
 #include asm/reg.h
 #include asm/time.h
@@ -104,8 +105,8 @@ void kvmppc_emulate_dec(struct kvm_vcpu *vcpu)
 */
 
dec_time = vcpu-arch.dec;
-   dec_time *= 1000;
-   do_div(dec_time, tb_ticks_per_usec);
+   dec_time = dec_time  decrementer_clockevent.shift;
+   do_div(dec_time, decrementer_clockevent.mult);
dec_nsec = do_div(dec_time, NSEC_PER_SEC);
hrtimer_start(vcpu-arch.dec_timer,
ktime_set(dec_time, dec_nsec), HRTIMER_MODE_REL);
-- 
1.7.0.4


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html