Re: [PATCHv5] virtio-spec: virtio network device RFS support

2012-12-06 Thread Michael S. Tsirkin
On Wed, Dec 05, 2012 at 08:39:26PM +, Ben Hutchings wrote:
 On Mon, 2012-12-03 at 12:58 +0200, Michael S. Tsirkin wrote:
  Add RFS support to virtio network device.
  Add a new feature flag VIRTIO_NET_F_RFS for this feature, a new
  configuration field max_virtqueue_pairs to detect supported number of
  virtqueues as well as a new command VIRTIO_NET_CTRL_RFS to program
  packet steering for unidirectional protocols.
 [...]
  +Programming of the receive flow classifier is implicit.
  + Transmitting a packet of a specific flow on transmitqX will cause incoming
  + packets for this flow to be steered to receiveqX.
  + For uni-directional protocols, or where no packets have been transmitted
  + yet, the device will steer a packet to a random queue out of the specified
  + receiveq0..receiveqn.
 [...]
 
 It doesn't seem like this is usable to implement accelerated RFS in the
 guest, though perhaps that doesn't matter.

What is the issue? Could you be more explicit please?

It seems to work pretty well: if we have
# of queues = # of cpus, incoming TCP_STREAM into the
guest scales very nicely without manual tweaks in the guest.

The way it works is: when the guest sends a packet, the driver
selects the rx queue that we want to use for incoming
packets of this flow, and transmits on the matching tx queue.
This is exactly what the text above suggests, no?

  On the host side, presumably
 you'll want vhost_net to do the equivalent of sock_rps_record_flow() -
 only without a socket?  But in any case, that requires an rxhash, so I
 don't see how this is supposed to work.
 
 Ben.

The host should just do what the guest tells it to.
On the host side we build up the steering table as we get packets
to transmit. See the code in drivers/net/tun.c in recent
kernels.

Again this actually works fine - what are the problems that you see?
Could you give an example please?

 -- 
 Ben Hutchings, Staff Engineer, Solarflare
 Not speaking for my employer; that's the marketing department's job.
 They asked us to note that Solarflare product names are trademarked.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SOLVED: RE: Garbled audio - Windows 7 64 bit guest on Debian

2012-12-06 Thread Simon O'Riordan

Sound works badly with HDA.
There is a slight improvement with PulseAudio in the host;
the solution, however, is to use ALSA with AC97.

Since qemu-kvm 1.1, AC97 emulation has been reworked
and now works with Windows 7 x64.

The Windows update mechanism will install the drivers
automatically, but what can be seen is that the drivers
are actually only the Realtek 6305-series OEM
drivers, so if you have an isolated Win 7 system,
simply find the 6305 driver set (available as a zip,
sorry no link at the moment, google '6305_Vista_Win7_PG537.zip'),
run the setup programme and you should have perfect sound.
I know we do.

The same applies to qemu-kvm 1.2.
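For anyone trying to reproduce this, a command line along the following lines selected the ALSA backend with AC97 emulation on qemu-kvm of that era (the `-soundhw ac97` option and the `QEMU_AUDIO_DRV` environment variable existed in qemu-kvm 1.1/1.2; memory size and disk image name are just examples):

```sh
# host side: use ALSA for audio output, emulate an AC97 codec in the guest
export QEMU_AUDIO_DRV=alsa
qemu-kvm -m 2048 -drive file=win7.img,if=virtio -soundhw ac97
```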






Re: [PATCH v3 3/4] x86, apicv: add virtual interrupt delivery support

2012-12-06 Thread Gleb Natapov
On Wed, Dec 05, 2012 at 08:38:59PM -0200, Marcelo Tosatti wrote:
 On Wed, Dec 05, 2012 at 01:14:38PM +0200, Gleb Natapov wrote:
  On Wed, Dec 05, 2012 at 03:43:41AM +, Zhang, Yang Z wrote:
@@ -5657,12 +5673,20 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
   }
 
   if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
+  /* update architecture specific hints for APIC
+   * virtual interrupt delivery */
+  if (kvm_x86_ops->update_irq)
+  kvm_x86_ops->update_irq(vcpu);
+
   inject_pending_event(vcpu);
 
   /* enable NMI/IRQ window open exits if needed */
   if (vcpu->arch.nmi_pending)
   kvm_x86_ops->enable_nmi_window(vcpu);
-  else if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
+  else if (kvm_apic_vid_enabled(vcpu)) {
+  if (kvm_cpu_has_extint(vcpu))
+  kvm_x86_ops->enable_irq_window(vcpu);

If RVI is non-zero, then the interrupt window should not be enabled,
according to 29.2.2:

If a virtual interrupt has been recognized (see Section 29.2.1), it will
be delivered at an instruction boundary when the following conditions all
hold: (1) RFLAGS.IF = 1; (2) there is no blocking by STI; (3) there is no
blocking by MOV SS or by POP SS; and (4) the "interrupt-window exiting"
VM-execution control is 0.
   Right. Must check RVI here.
   
  Why? We request an interrupt window here because there is an ExtINT interrupt
  pending. ExtINT interrupts take precedence over APIC interrupts (our
  current code is incorrect!), so we want a vmexit as soon as interrupts are
  allowed, to inject the ExtINT, and we do not want the virtual interrupt to be
  delivered. I think the (4) there is exactly for this situation.
  
  --
  Gleb.
 
 Right. BTW, delivery of ExtINT has no EOI, so there is no evaluation
 of pending virtual interrupts. Therefore, shouldn't the interrupt window be
 enabled when injecting ExtINT so that evaluation of pending virtual
 interrupts is performed on the next vm-entry?
 
Good question and, I think, luckily for us, the answer is no. The spec uses
two different terms when it talks about virtual interrupts: "Evaluation
of Pending Virtual Interrupts" and "Virtual-Interrupt Delivery". As far
as my reading of the spec goes, they do not necessarily happen at the same
time. So during ExtINT injection, evaluation will happen (due to vmentry)
and the virtual interrupt will be recognized, but not delivered. It will
be delivered when the conditions described in section 29.2.2 are met, i.e.
when interrupts are enabled.

Yang, can you confirm this?

--
Gleb.


Re: [PATCH v10 0/2] x86: vmclear vmcss on all cpus when doing kdump if necessary

2012-12-06 Thread Gleb Natapov
Can you regenerate against current queue branch in
git://git.kernel.org/pub/scm/virt/kvm/kvm.git please?

On Wed, Dec 05, 2012 at 03:58:50PM +0800, Zhang Yanfei wrote:
 Currently, kdump just makes all the logical processors leave VMX operation by
 executing the VMXOFF instruction, so any VMCSs active on the logical processors
 may be corrupted. But, sometimes, we need the VMCSs to debug guest images
 contained in the host vmcore. To prevent the corruption, we should VMCLEAR the
 VMCSs before executing the VMXOFF instruction.
 
 The patch set provides a way to VMCLEAR VMCSs related to guests on all cpus
 before executing VMXOFF when doing kdump. This is used to ensure the VMCSs in
 the vmcore are updated and not corrupted.
 
 Changelog from v9 to v10:
 1. add rcu protect to the callback function
 
 Changelog from v8 to v9:
 1. KEXEC: use a callback function instead of a notifier.
 2. KVM-INTEL: use a new vmclear function instead of just calling 
vmclear_local_loaded_vmcss to make sure we just do the core vmclear
operation in kdump.
 
 Changelog from v7 to v8:
 1. KEXEC: regression for using name crash_notifier_list
and remove comments related to KVM
and just call function atomic_notifier_call_chain directly.
 
 Changelog from v6 to v7:
 1. KVM-INTEL: in hardware_disable, we needn't disable the
vmclear, so remove it.
 
 Changelog from v5 to v6:
 1. KEXEC: the atomic notifier list renamed:
crash_notifier_list --> vmclear_notifier_list
 2. KVM-INTEL: provide empty functions if CONFIG_KEXEC is
not defined and remove unnecessary #ifdef's.
 
 Changelog from v4 to v5:
 1. use an atomic notifier instead of function call, so
have all the vmclear codes in vmx.c.
 
 Changelog from v3 to v4:
 1. add a new percpu variable vmclear_skipped to skip
vmclear in kdump in some conditions.
 
 Changelog from v2 to v3:
 1. remove unnecessary conditions in function
cpu_emergency_clear_loaded_vmcss as Marcelo suggested.
 
 Changelog from v1 to v2:
 1. remove the sysctl and clear VMCSs unconditionally.
 
 Zhang Yanfei (2):
   x86/kexec: VMCLEAR VMCSs loaded on all cpus if necessary
   KVM-INTEL: provide the vmclear function and a bitmap to support
 VMCLEAR in kdump
 
  arch/x86/include/asm/kexec.h |2 +
  arch/x86/kernel/crash.c  |   32 
  arch/x86/kvm/vmx.c   |   67 
 ++
  3 files changed, 101 insertions(+), 0 deletions(-)

--
Gleb.


RE: [PATCH v3 3/4] x86, apicv: add virtual interrupt delivery support

2012-12-06 Thread Zhang, Yang Z
Gleb Natapov wrote on 2012-12-06:
 On Wed, Dec 05, 2012 at 08:38:59PM -0200, Marcelo Tosatti wrote:
 On Wed, Dec 05, 2012 at 01:14:38PM +0200, Gleb Natapov wrote:
 On Wed, Dec 05, 2012 at 03:43:41AM +, Zhang, Yang Z wrote:
 @@ -5657,12 +5673,20 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
  }
  
  if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win)
 {
 +/* update architecture specific hints for APIC
 + * virtual interrupt delivery */
 +if (kvm_x86_ops->update_irq)
 +kvm_x86_ops->update_irq(vcpu);
 +
  inject_pending_event(vcpu);
  
  /* enable NMI/IRQ window open exits if needed */
  if (vcpu->arch.nmi_pending)
  kvm_x86_ops->enable_nmi_window(vcpu);
 -else if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
 +else if (kvm_apic_vid_enabled(vcpu)) {
 +if (kvm_cpu_has_extint(vcpu))
 +kvm_x86_ops->enable_irq_window(vcpu);
 
 If RVI is non-zero, then the interrupt window should not be enabled,
 according to 29.2.2:
 
 If a virtual interrupt has been recognized (see Section 29.2.1), it will
 be delivered at an instruction boundary when the following conditions all
 hold: (1) RFLAGS.IF = 1; (2) there is no blocking by STI; (3) there is no
 blocking by MOV SS or by POP SS; and (4) the interrupt-window exiting
 VM-execution control is 0.
 Right. Must check RVI here.
 
 Why? We request an interrupt window here because there is an ExtINT interrupt
 pending. ExtINT interrupts take precedence over APIC interrupts (our
 current code is incorrect!), so we want a vmexit as soon as interrupts are
 allowed, to inject the ExtINT, and we do not want the virtual interrupt to be
 delivered. I think the (4) there is exactly for this situation.
 
 --
 Gleb.
 
 Right. BTW, delivery of ExtINT has no EOI, so there is no evaluation
 of pending virtual interrupts. Therefore, shouldn't the interrupt window be
 enabled when injecting ExtINT so that evaluation of pending virtual
 interrupts is performed on the next vm-entry?
 
 Good question and, I think, luckily for us, the answer is no. The spec uses
 two different terms when it talks about virtual interrupts: "Evaluation
 of Pending Virtual Interrupts" and "Virtual-Interrupt Delivery". As far
 as my reading of the spec goes, they do not necessarily happen at the same
 time. So during ExtINT injection, evaluation will happen (due to vmentry)
 and the virtual interrupt will be recognized, but not delivered. It will
 be delivered when the conditions described in section 29.2.2 are met, i.e.
 when interrupts are enabled.
 
 Yang, can you confirm this?
Right.
Vmentry causes the evaluation of pending virtual interrupts even during ExtINT
injection. If RVI[7:4] > VPPR[7:4], the logical processor recognizes a pending
virtual interrupt. It will then be delivered when the conditions are met.

Best regards,
Yang




Re: [PATCHv2] x86info: dump kvm cpuid's

2012-12-06 Thread Michael S. Tsirkin
On Wed, Sep 05, 2012 at 08:33:35PM +0300, Michael S. Tsirkin wrote:
 On Mon, Apr 30, 2012 at 05:38:35PM +0300, Michael S. Tsirkin wrote:
  The following makes 'x86info -r' dump hypervisor leaf cpu ids
  (for kvm this is signature+features) when running in a vm.
  
  On the guest we see the signature and the features:
   eax in: 0x40000000, eax = 00000000 ebx = 4b4d564b ecx = 564b4d56 edx = 0000004d
   eax in: 0x40000001, eax = 0000017b ebx = 00000000 ecx = 00000000 edx = 00000000
  
  Hypervisor flag is checked to avoid output changes when not
  running on a VM.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  
   Changes from v1:
   Make it work on non-KVM hypervisors (only KVM was tested).
   Avi Kivity said kvm will in the future report the
   max HV leaf in eax. For now it reports eax = 0,
   so add a workaround for that.
 
 Ping.
 Davej, any comments?
 Would be nice to have this in.

Is this the right address?
Davej do you maintain x86info?
Thanks,
MST

 
  ---
  
  diff --git a/identify.c b/identify.c
  index 33f35de..a4a3763 100644
  --- a/identify.c
  +++ b/identify.c
  @@ -9,8 +9,8 @@
   
   void get_cpu_info_basics(struct cpudata *cpu)
   {
  -   unsigned int maxi, maxei, vendor, address_bits;
  -   unsigned int eax;
  +   unsigned int maxi, maxei, maxhv, vendor, address_bits;
  +   unsigned int eax, ebx, ecx;
   
   cpuid(cpu->number, 0, &maxi, &vendor, NULL, NULL);
   maxi &= 0xffff; /* The high-order word is non-zero on some Cyrix CPUs */
  @@ -19,7 +19,7 @@ void get_cpu_info_basics(struct cpudata *cpu)
  return;
   
  /* Everything that supports cpuid supports these. */
   -   cpuid(cpu->number, 1, &eax, NULL, NULL, NULL);
   +   cpuid(cpu->number, 1, &eax, &ebx, &ecx, NULL);
   cpu->stepping = eax & 0xf;
   cpu->model = (eax >> 4) & 0xf;
   cpu->family = (eax >> 8) & 0xf;
  @@ -29,6 +29,19 @@ void get_cpu_info_basics(struct cpudata *cpu)
   
   cpuid(cpu->number, 0xC0000000, &maxei, NULL, NULL, NULL);
   cpu->maxei2 = maxei;
   +   if (ecx & 0x80000000) {
   +   cpuid(cpu->number, 0x40000000, &maxhv, NULL, NULL, NULL);
   +   /*
   +* KVM up to linux 3.4 reports 0 as the max hypervisor leaf,
   +* where it really means 0x40000001.
   +* Most (all?) hypervisors have at least one CPUID besides
   +* the vendor ID so assume that.
   +*/
   +   cpu->maxhv = maxhv ? maxhv : 0x40000001;
   +   } else {
   +   /* Suppress hypervisor cpuid unless running on a hypervisor */
   +   cpu->maxhv = 0;
   +   }
    
   cpuid(cpu->number, 0x80000008, &address_bits, NULL, NULL, NULL);
   cpu->phyaddr_bits = address_bits & 0xFF;
  diff --git a/x86info.c b/x86info.c
  index 22c4734..80cae36 100644
  --- a/x86info.c
  +++ b/x86info.c
  @@ -44,6 +44,10 @@ static void display_detailed_info(struct cpudata *cpu)
   
   if (cpu->maxei2 >= 0xC0000000)
   dump_raw_cpuid(cpu->number, 0xC0000000, cpu->maxei2);
   +
   +   if (cpu->maxhv >= 0x40000000)
   +   dump_raw_cpuid(cpu->number, 0x40000000, cpu->maxhv);
   +
  }
   
  if (show_cacheinfo) {
  diff --git a/x86info.h b/x86info.h
  index 7d2a455..c4f5d81 100644
  --- a/x86info.h
  +++ b/x86info.h
  @@ -84,7 +84,7 @@ struct cpudata {
  unsigned int cachesize_trace;
  unsigned int phyaddr_bits;
  unsigned int viraddr_bits;
  -   unsigned int cpuid_level, maxei, maxei2;
  +   unsigned int cpuid_level, maxei, maxei2, maxhv;
  char name[CPU_NAME_LEN];
  enum connector connector;
  unsigned int flags_ecx;


Re: [PATCH] vhost-blk: Add vhost-blk support v6

2012-12-06 Thread Michael S. Tsirkin
On Sun, Dec 02, 2012 at 09:33:53AM +0800, Asias He wrote:
 diff --git a/drivers/vhost/Kconfig.blk b/drivers/vhost/Kconfig.blk
 new file mode 100644
 index 000..ff8ab76
 --- /dev/null
 +++ b/drivers/vhost/Kconfig.blk
 @@ -0,0 +1,10 @@
 +config VHOST_BLK
  + tristate "Host kernel accelerator for virtio blk (EXPERIMENTAL)"
  + depends on BLOCK && EXPERIMENTAL && m


should depend on eventfd as well.
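Concretely, the dependency line might become something like the following (a sketch only; EVENTFD is the existing Kconfig symbol for eventfd support):

```
config VHOST_BLK
	tristate "Host kernel accelerator for virtio blk (EXPERIMENTAL)"
	depends on BLOCK && EVENTFD && EXPERIMENTAL && m
```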


Re: [PATCHv2] x86info: dump kvm cpuid's

2012-12-06 Thread Dave Jones
On Thu, Dec 06, 2012 at 01:41:12PM +0200, Michael S. Tsirkin wrote:
  On Wed, Sep 05, 2012 at 08:33:35PM +0300, Michael S. Tsirkin wrote:
   On Mon, Apr 30, 2012 at 05:38:35PM +0300, Michael S. Tsirkin wrote:
The following makes 'x86info -r' dump hypervisor leaf cpu ids
(for kvm this is signature+features) when running in a vm.

On the guest we see the signature and the features:
eax in: 0x40000000, eax = 00000000 ebx = 4b4d564b ecx = 564b4d56 edx = 0000004d
eax in: 0x40000001, eax = 0000017b ebx = 00000000 ecx = 00000000 edx = 00000000

Hypervisor flag is checked to avoid output changes when not
running on a VM.

Signed-off-by: Michael S. Tsirkin m...@redhat.com

Changes from v1:
 Make it work on non-KVM hypervisors (only KVM was tested).
 Avi Kivity said kvm will in the future report the
 max HV leaf in eax. For now it reports eax = 0,
so add a workaround for that.
   
   Ping.
   Davej, any comments?
   Would be nice to have this in.
  
  Is this the right address?
  Davej do you maintain x86info?
  Thanks,
  MST

It's effectively abandonware at this point, largely due to my own
lack of time. hwloc and similar tools seem to have taken its place.
If anyone is interested in taking over x86info, I'm happy to hand
over the reins to someone capable.

Dave



[PULL net-next] vhost: changes for 3.8

2012-12-06 Thread Michael S. Tsirkin
The following changes since commit b93196dc5af7729ff7cc50d3d322ab1a364aa14f:

  net: fix some compiler warning in net/core/neighbour.c (2012-12-05 21:50:37 
-0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net-next

for you to fetch changes up to 405d55c99da7a3045275fdb1a30614293a53c6e7:

  tcm_vhost: remove unused variable in vhost_scsi_allocate_cmd() (2012-12-06 
17:09:19 +0200)


Cong Ding (1):
  tools:virtio: fix compilation warning

Michael S. Tsirkin (4):
  vhost: avoid backend flush on vring ops
  vhost-net: flush outstanding DMAs on memory change
  vhost-net: skip head management if no outstanding
  vhost-net: enable zerocopy tx by default

Wei Yongjun (1):
  tcm_vhost: remove unused variable in vhost_scsi_allocate_cmd()

 drivers/vhost/net.c| 51 ++
 drivers/vhost/tcm_vhost.c  |  7 ---
 drivers/vhost/vhost.c  |  7 +++
 drivers/vhost/vhost.h  |  3 ++-
 tools/virtio/virtio_test.c |  2 +-
 5 files changed, 44 insertions(+), 26 deletions(-)


[PATCH v11 0/2] x86: vmclear vmcss on all cpus when doing kdump if necessary

2012-12-06 Thread Zhang Yanfei
Currently, kdump just makes all the logical processors leave VMX operation by
executing the VMXOFF instruction, so any VMCSs active on the logical processors
may be corrupted. But, sometimes, we need the VMCSs to debug guest images
contained in the host vmcore. To prevent the corruption, we should VMCLEAR the
VMCSs before executing the VMXOFF instruction.

The patch set provides a way to VMCLEAR VMCSs related to guests on all cpus
before executing VMXOFF when doing kdump. This is used to ensure the VMCSs in
the vmcore are updated and not corrupted.

Changelog from v10 to v11:
1. regenerate the patch set against current queue branch in
   git://git.kernel.org/pub/scm/virt/kvm/kvm.git

Changelog from v9 to v10:
1. add rcu protect to the callback function

Changelog from v8 to v9:
1. KEXEC: use a callback function instead of a notifier.
2. KVM-INTEL: use a new vmclear function instead of just calling 
   vmclear_local_loaded_vmcss to make sure we just do the core vmclear
   operation in kdump.

Changelog from v7 to v8:
1. KEXEC: regression for using name crash_notifier_list
   and remove comments related to KVM
   and just call function atomic_notifier_call_chain directly.

Changelog from v6 to v7:
1. KVM-INTEL: in hardware_disable, we needn't disable the
   vmclear, so remove it.

Changelog from v5 to v6:
1. KEXEC: the atomic notifier list renamed:
   crash_notifier_list --> vmclear_notifier_list
2. KVM-INTEL: provide empty functions if CONFIG_KEXEC is
   not defined and remove unnecessary #ifdef's.

Changelog from v4 to v5:
1. use an atomic notifier instead of function call, so
   have all the vmclear codes in vmx.c.

Changelog from v3 to v4:
1. add a new percpu variable vmclear_skipped to skip
   vmclear in kdump in some conditions.

Changelog from v2 to v3:
1. remove unnecessary conditions in function
   cpu_emergency_clear_loaded_vmcss as Marcelo suggested.

Changelog from v1 to v2:
1. remove the sysctl and clear VMCSs unconditionally.

Zhang Yanfei (2):
  x86/kexec: VMCLEAR VMCSs loaded on all cpus if necessary
  KVM-INTEL: provide the vmclear function and a bitmap to support
VMCLEAR in kdump

 arch/x86/include/asm/kexec.h |2 + 
 arch/x86/kernel/crash.c  |   32 
 arch/x86/kvm/vmx.c   |   67 ++
 3 files changed, 101 insertions(+), 0 deletions(-)


[PATCH v11 1/2] x86/kexec: VMCLEAR VMCSs loaded on all cpus if necessary

2012-12-06 Thread Zhang Yanfei
From: Zhang Yanfei zhangyan...@cn.fujitsu.com

This patch provides a way to VMCLEAR VMCSs related to guests
on all cpus before executing VMXOFF when doing kdump. This
is used to ensure the VMCSs in the vmcore are updated and
not corrupted.

Signed-off-by: Zhang Yanfei zhangyan...@cn.fujitsu.com
Acked-by: Eric W. Biederman ebied...@xmission.com
---
 arch/x86/include/asm/kexec.h |2 ++
 arch/x86/kernel/crash.c  |   32 
 2 files changed, 34 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 317ff17..28feeba 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -163,6 +163,8 @@ struct kimage_arch {
 };
 #endif
 
+extern void (*crash_vmclear_loaded_vmcss)(void);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_X86_KEXEC_H */
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 13ad899..b914b7f 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -16,6 +16,7 @@
 #include linux/delay.h
 #include linux/elf.h
 #include linux/elfcore.h
+#include linux/module.h
 
 #include asm/processor.h
 #include asm/hardirq.h
@@ -30,6 +31,27 @@
 
 int in_crash_kexec;
 
+/*
+ * This is used to VMCLEAR all VMCSs loaded on the
+ * processor. And when loading kvm_intel module, the
+ * callback function pointer will be assigned.
+ *
+ * protected by rcu.
+ */
+void (*crash_vmclear_loaded_vmcss)(void) = NULL;
+EXPORT_SYMBOL_GPL(crash_vmclear_loaded_vmcss);
+
+static inline void cpu_crash_vmclear_loaded_vmcss(void)
+{
+   void (*do_vmclear_operation)(void) = NULL;
+
+   rcu_read_lock();
+   do_vmclear_operation = rcu_dereference(crash_vmclear_loaded_vmcss);
+   if (do_vmclear_operation)
+   do_vmclear_operation();
+   rcu_read_unlock();
+}
+
 #if defined(CONFIG_SMP) && defined(CONFIG_X86_LOCAL_APIC)
 
 static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
@@ -46,6 +68,11 @@ static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
 #endif
crash_save_cpu(regs, cpu);
 
+   /*
+* VMCLEAR VMCSs loaded on all cpus if needed.
+*/
+   cpu_crash_vmclear_loaded_vmcss();
+
/* Disable VMX or SVM if needed.
 *
 * We need to disable virtualization on all CPUs.
@@ -88,6 +115,11 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 
kdump_nmi_shootdown_cpus();
 
+   /*
+* VMCLEAR VMCSs loaded on this cpu if needed.
+*/
+   cpu_crash_vmclear_loaded_vmcss();
+
/* Booting kdump kernel with VMX or SVM enabled won't work,
 * because (among other limitations) we can't disable paging
 * with the virt flags.
-- 
1.7.1



Re: [PATCH v11 2/2] KVM-INTEL: provide the vmclear function and a bitmap to support VMCLEAR in kdump

2012-12-06 Thread Zhang Yanfei
From: Zhang Yanfei zhangyan...@cn.fujitsu.com

The vmclear function will be assigned to the callback function pointer
when loading the kvm-intel module. And the bitmap indicates whether we
should do the VMCLEAR operation in kdump. The bits in the bitmap are
set/unset according to different conditions.

Signed-off-by: Zhang Yanfei zhangyan...@cn.fujitsu.com
---
 arch/x86/kvm/vmx.c |   67 
 1 files changed, 67 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 94833e2..1a30fd5 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -42,6 +42,7 @@
 #include asm/i387.h
 #include asm/xcr.h
 #include asm/perf_event.h
+#include asm/kexec.h
 
 #include trace.h
 
@@ -987,6 +988,46 @@ static void vmcs_load(struct vmcs *vmcs)
   vmcs, phys_addr);
 }
 
+#ifdef CONFIG_KEXEC
+/*
+ * This bitmap is used to indicate whether the vmclear
+ * operation is enabled on all cpus. All disabled by
+ * default.
+ */
+static cpumask_t crash_vmclear_enabled_bitmap = CPU_MASK_NONE;
+
+static inline void crash_enable_local_vmclear(int cpu)
+{
+   cpumask_set_cpu(cpu, &crash_vmclear_enabled_bitmap);
+}
+
+static inline void crash_disable_local_vmclear(int cpu)
+{
+   cpumask_clear_cpu(cpu, &crash_vmclear_enabled_bitmap);
+}
+
+static inline int crash_local_vmclear_enabled(int cpu)
+{
+   return cpumask_test_cpu(cpu, &crash_vmclear_enabled_bitmap);
+}
+
+static void crash_vmclear_local_loaded_vmcss(void)
+{
+   int cpu = raw_smp_processor_id();
+   struct loaded_vmcs *v;
+
+   if (!crash_local_vmclear_enabled(cpu))
+   return;
+
+   list_for_each_entry(v, &per_cpu(loaded_vmcss_on_cpu, cpu),
+   loaded_vmcss_on_cpu_link)
+   vmcs_clear(v->vmcs);
+}
+}
+#else
+static inline void crash_enable_local_vmclear(int cpu) { }
+static inline void crash_disable_local_vmclear(int cpu) { }
+#endif /* CONFIG_KEXEC */
+
 static void __loaded_vmcs_clear(void *arg)
 {
struct loaded_vmcs *loaded_vmcs = arg;
@@ -996,6 +1037,7 @@ static void __loaded_vmcs_clear(void *arg)
return; /* vcpu migration can race with cpu offline */
if (per_cpu(current_vmcs, cpu) == loaded_vmcs->vmcs)
per_cpu(current_vmcs, cpu) = NULL;
+   crash_disable_local_vmclear(cpu);
list_del(&loaded_vmcs->loaded_vmcss_on_cpu_link);
 
/*
@@ -1007,6 +1049,7 @@ static void __loaded_vmcs_clear(void *arg)
smp_wmb();
 
loaded_vmcs_init(loaded_vmcs);
+   crash_enable_local_vmclear(cpu);
 }
 
 static void loaded_vmcs_clear(struct loaded_vmcs *loaded_vmcs)
@@ -1530,6 +1573,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
local_irq_disable();
+   crash_disable_local_vmclear(cpu);
 
/*
 * Read loaded_vmcs-cpu should be before fetching
@@ -1540,6 +1584,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link,
 &per_cpu(loaded_vmcss_on_cpu, cpu));
+   crash_enable_local_vmclear(cpu);
local_irq_enable();
 
/*
@@ -2353,6 +2398,18 @@ static int hardware_enable(void *garbage)
return -EBUSY;
 
INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
+
+   /*
+* Now we can enable the vmclear operation in kdump
+* since the loaded_vmcss_on_cpu list on this cpu
+* has been initialized.
+*
+* Though the cpu is not in VMX operation now, there
+* is no problem to enable the vmclear operation
+* for the loaded_vmcss_on_cpu list is empty!
+*/
+   crash_enable_local_vmclear(cpu);
+
rdmsrl(MSR_IA32_FEATURE_CONTROL, old);
 
test_bits = FEATURE_CONTROL_LOCKED;
@@ -7383,6 +7440,11 @@ static int __init vmx_init(void)
if (r)
goto out3;
 
+#ifdef CONFIG_KEXEC
+   rcu_assign_pointer(crash_vmclear_loaded_vmcss,
+  crash_vmclear_local_loaded_vmcss);
+#endif
+
vmx_disable_intercept_for_msr(MSR_FS_BASE, false);
vmx_disable_intercept_for_msr(MSR_GS_BASE, false);
vmx_disable_intercept_for_msr(MSR_KERNEL_GS_BASE, true);
@@ -7420,6 +7482,11 @@ static void __exit vmx_exit(void)
free_page((unsigned long)vmx_io_bitmap_b);
free_page((unsigned long)vmx_io_bitmap_a);
 
+#ifdef CONFIG_KEXEC
+   rcu_assign_pointer(crash_vmclear_loaded_vmcss, NULL);
+   synchronize_rcu();
+#endif
+
kvm_exit();
 }
 
-- 
1.7.1



Re: [RFC PATCH 1/1] KVM: ARM: add vgic state save and restore support

2012-12-06 Thread Peter Maydell
On 4 December 2012 13:37, Dong Aisheng b29...@freescale.com wrote:
 On Tue, Dec 04, 2012 at 12:45:12PM +, Peter Maydell wrote:
 On 4 December 2012 12:27, Dong Aisheng b29...@freescale.com wrote:
  On Mon, Dec 03, 2012 at 01:22:07PM +, Peter Maydell wrote:
  What we're really providing the guest here is a hardware-accelerated
  software emulation of a no-virtualization GICv2. So it's better for
  the state we expose to userspace to be the state of that emulated
  hardware, and not to expose details of exactly what that hardware
  acceleration is.
 
  It looks like a good idea.
  Then in which format? User space qemu and kernel space vgic are using
  different data format to describe gic state.
  We definitely need a standard one to use with good compatibility.
  One simple way may be just registers value of no-virtualization GICv2.

 Values of registers and state of a device are not identical
 (though the internal state is often made visible via registers).
 We care about the latter, not the former.

 I agree your point, the problem is how to define a standard state of
 gic device?

Yes, indeed; that is exactly the major design question which
your patch needs to solve.

 The gic registers format is the exist one and both kernel or user space
 state code changes do not affect each other.

Just to be clear here, the gic registers format is not a
sufficient definition of the GIC internal state. To
take a simple example, the distributor registers include a set
of set-enable registers GICD_ISENABLERn and a set of clear
enable registers GICD_ICENABLERn. These are two views (and
ways to alter) a single set of underlying enable bits.
What we want from userspace is a way to read and write the
enable bits. We do not want an interface that has the
write 1 to set/write 1 to clear semantics of the hardware
registers. Obviously often registers are a simple read/write
view of underlying state, but this isn't always true.

 One concern is that I'm still not sure whether skipping the save of
 the virtual interface control registers causes any issue.
 It may need some time to research.

 Maybe we can convert it into the standard state and restore it back then.
 But some bits of them may not be exported.

Yes, confirming that the state as exposed via the virtual
interface control registers can be correctly converted into
our canonical state representation is a good cross-check
that we have got it right. We definitely mustn't lose
information, or migration won't work right in some edge case.

-- PMM


Re: [PATCH v11 0/2] x86: vmclear vmcss on all cpus when doing kdump if necessary

2012-12-06 Thread Gleb Natapov
On Thu, Dec 06, 2012 at 11:36:48PM +0800, Zhang Yanfei wrote:
 Currently, kdump just makes all the logical processors leave VMX operation by
 executing the VMXOFF instruction, so any VMCSs active on the logical processors
 may be corrupted. But, sometimes, we need the VMCSs to debug guest images
 contained in the host vmcore. To prevent the corruption, we should VMCLEAR the
 VMCSs before executing the VMXOFF instruction.
 
 The patch set provides a way to VMCLEAR VMCSs related to guests on all cpus
 before executing VMXOFF when doing kdump. This is used to ensure the VMCSs in
 the vmcore are updated and not corrupted.
 
Applied to queue. Thanks.

 Changelog from v10 to v11:
 1. regenerate the patch set against current queue branch in
git://git.kernel.org/pub/scm/virt/kvm/kvm.git
 
 Changelog from v9 to v10:
 1. add rcu protect to the callback function
 
 Changelog from v8 to v9:
 1. KEXEC: use a callback function instead of a notifier.
 2. KVM-INTEL: use a new vmclear function instead of just calling 
vmclear_local_loaded_vmcss to make sure we just do the core vmclear
operation in kdump.
 
 Changelog from v7 to v8:
 1. KEXEC: regression for using name crash_notifier_list
and remove comments related to KVM
and just call function atomic_notifier_call_chain directly.
 
 Changelog from v6 to v7:
 1. KVM-INTEL: in hardware_disable, we needn't disable the
vmclear, so remove it.
 
 Changelog from v5 to v6:
 1. KEXEC: the atomic notifier list renamed:
   crash_notifier_list -> vmclear_notifier_list
 2. KVM-INTEL: provide empty functions if CONFIG_KEXEC is
not defined and remove unnecessary #ifdef's.
 
 Changelog from v4 to v5:
 1. use an atomic notifier instead of function call, so
have all the vmclear codes in vmx.c.
 
 Changelog from v3 to v4:
 1. add a new percpu variable vmclear_skipped to skip
vmclear in kdump in some conditions.
 
 Changelog from v2 to v3:
 1. remove unnecessary conditions in function
cpu_emergency_clear_loaded_vmcss as Marcelo suggested.
 
 Changelog from v1 to v2:
 1. remove the sysctl and clear VMCSs unconditionally.
 
 Zhang Yanfei (2):
   x86/kexec: VMCLEAR VMCSs loaded on all cpus if necessary
   KVM-INTEL: provide the vmclear function and a bitmap to support
 VMCLEAR in kdump
 
  arch/x86/include/asm/kexec.h |    2 +
  arch/x86/kernel/crash.c      |   32 ++++++++++++++++++++++++++++++++
  arch/x86/kvm/vmx.c           |   67 +++++++++++++++++++++++++++++++++++++++
  3 files changed, 101 insertions(+), 0 deletions(-)

--
Gleb.


[kvm:queue 8/9] arch/x86/kernel/crash.c:49:32: sparse: incompatible types in comparison expression (different address spaces)

2012-12-06 Thread kbuild test robot
tree:   git://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
head:   8f536b7697a0d40ef6b5fd04cf2c04953d5ca06f
commit: f23d1f4a116038c68df224deae6718fde87d8f0d [8/9] x86/kexec: VMCLEAR VMCSs loaded on all cpus if necessary


sparse warnings:

+ arch/x86/kernel/crash.c:49:32: sparse: incompatible types in comparison expression (different address spaces)

vim +49 arch/x86/kernel/crash.c

5edd19af Cliff Wickman   2010-07-20  33  
f23d1f4a Zhang Yanfei    2012-12-06  34  /*
f23d1f4a Zhang Yanfei    2012-12-06  35   * This is used to VMCLEAR all VMCSs loaded on the
f23d1f4a Zhang Yanfei    2012-12-06  36   * processor. And when loading kvm_intel module, the
f23d1f4a Zhang Yanfei    2012-12-06  37   * callback function pointer will be assigned.
f23d1f4a Zhang Yanfei    2012-12-06  38   *
f23d1f4a Zhang Yanfei    2012-12-06  39   * protected by rcu.
f23d1f4a Zhang Yanfei    2012-12-06  40   */
f23d1f4a Zhang Yanfei    2012-12-06  41  void (*crash_vmclear_loaded_vmcss)(void) = NULL;
f23d1f4a Zhang Yanfei    2012-12-06  42  EXPORT_SYMBOL_GPL(crash_vmclear_loaded_vmcss);
f23d1f4a Zhang Yanfei    2012-12-06  43  
f23d1f4a Zhang Yanfei    2012-12-06  44  static inline void cpu_crash_vmclear_loaded_vmcss(void)
f23d1f4a Zhang Yanfei    2012-12-06  45  {
f23d1f4a Zhang Yanfei    2012-12-06  46  	void (*do_vmclear_operation)(void) = NULL;
f23d1f4a Zhang Yanfei    2012-12-06  47  
f23d1f4a Zhang Yanfei    2012-12-06  48  	rcu_read_lock();
f23d1f4a Zhang Yanfei    2012-12-06 @49  	do_vmclear_operation = rcu_dereference(crash_vmclear_loaded_vmcss);
f23d1f4a Zhang Yanfei    2012-12-06  50  	if (do_vmclear_operation)
f23d1f4a Zhang Yanfei    2012-12-06  51  		do_vmclear_operation();
f23d1f4a Zhang Yanfei    2012-12-06  52  	rcu_read_unlock();
f23d1f4a Zhang Yanfei    2012-12-06  53  }
f23d1f4a Zhang Yanfei    2012-12-06  54  
b2bbe71b Eduardo Habkost 2008-11-12  55  #if defined(CONFIG_SMP) && defined(CONFIG_X86_LOCAL_APIC)
b2bbe71b Eduardo Habkost 2008-11-12  56  
9c48f1c6 Don Zickus      2011-09-30  57  static void kdump_nmi_callback(int cpu, struct pt_regs *regs)

---
0-DAY kernel build testing backend Open Source Technology Center
Fengguang Wu, Yuanhan Liu  Intel Corporation


Re: [PATCHv5] virtio-spec: virtio network device RFS support

2012-12-06 Thread Ben Hutchings
On Thu, 2012-12-06 at 10:13 +0200, Michael S. Tsirkin wrote:
 On Wed, Dec 05, 2012 at 08:39:26PM +, Ben Hutchings wrote:
  On Mon, 2012-12-03 at 12:58 +0200, Michael S. Tsirkin wrote:
   Add RFS support to virtio network device.
   Add a new feature flag VIRTIO_NET_F_RFS for this feature, a new
   configuration field max_virtqueue_pairs to detect supported number of
   virtqueues as well as a new command VIRTIO_NET_CTRL_RFS to program
   packet steering for unidirectional protocols.
  [...]
   +Programming of the receive flow classificator is implicit.
   + Transmitting a packet of a specific flow on transmitqX will cause incoming
   + packets for this flow to be steered to receiveqX.
   + For uni-directional protocols, or where no packets have been transmitted
   + yet, device will steer a packet to a random queue out of the specified
   + receiveq0..receiveqn.
  [...]
  
  It doesn't seem like this is usable to implement accelerated RFS in the
  guest, though perhaps that doesn't matter.
 
 What is the issue? Could you be more explicit please?
 
 It seems to work pretty well: if we have
 # of queues = # of cpus, incoming TCP_STREAM into
 guest scales very nicely without manual tweaks in guest.
 
 The way it works is: when the guest sends a packet, the driver
 selects the rx queue that we want to use for incoming
 packets for this flow, and transmits on the matching tx queue.
 This is exactly what the text above suggests, no?

Yes, I get that.

   On the host side, presumably
  you'll want vhost_net to do the equivalent of sock_rps_record_flow() -
  only without a socket?  But in any case, that requires an rxhash, so I
  don't see how this is supposed to work.
  
  Ben.
 
 Host should just do what guest tells it to.
 On the host side we build up the steering table as we get packets
 to transmit. See the code in drivers/net/tun.c in recent
 kernels.
 
 Again this actually works fine - what are the problems that you see?
 Could you give an example please?

I'm not saying it doesn't work in its own way, I just don't see how you
would make it work with the existing RFS!

Since this doesn't seem to be intended to have *any* connection with the
existing core networking feature called RFS, perhaps you could find a
different name for it.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



Re: [PATCH v3 3/4] x86, apicv: add virtual interrupt delivery support

2012-12-06 Thread Marcelo Tosatti
On Thu, Dec 06, 2012 at 08:36:52AM +0200, Gleb Natapov wrote:
 On Thu, Dec 06, 2012 at 05:02:15AM +, Zhang, Yang Z wrote:
  Zhang, Yang Z wrote on 2012-12-06:
   Marcelo Tosatti wrote on 2012-12-06:
   On Mon, Dec 03, 2012 at 03:01:03PM +0800, Yang Zhang wrote:
    Virtual interrupt delivery avoids having KVM inject vAPIC interrupts
    manually; that is fully taken care of by the hardware. This needs
    some special awareness in the existing interrupt injection path:
    
    - For a pending interrupt, instead of direct injection, we may need to
      update architecture-specific indicators before resuming to the guest.
    - A pending interrupt that is masked by the ISR should also be
      considered in the above update action, since the hardware will decide
      when to inject it at the right time. The current has_interrupt and
      get_interrupt only return a valid vector from the injection p.o.v.
   Signed-off-by: Yang Zhang yang.z.zh...@intel.com
   Signed-off-by: Kevin Tian kevin.t...@intel.com
   ---
 arch/x86/include/asm/kvm_host.h |    4 +
 arch/x86/include/asm/vmx.h      |   11 +++
 arch/x86/kvm/irq.c              |   53 ++-
 arch/x86/kvm/lapic.c            |   56 +---
 arch/x86/kvm/lapic.h            |    6 ++
 arch/x86/kvm/svm.c              |   19 +
 arch/x86/kvm/vmx.c              |  140 ++-
 arch/x86/kvm/x86.c              |   34 --
 virt/kvm/ioapic.c               |    1 +
 9 files changed, 291 insertions(+), 33 deletions(-)
    diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
    index dc87b65..e5352c8 100644
    --- a/arch/x86/include/asm/kvm_host.h
    +++ b/arch/x86/include/asm/kvm_host.h
    @@ -697,6 +697,10 @@ struct kvm_x86_ops {
     	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
     	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
     	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
    +	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
    +	void (*update_irq)(struct kvm_vcpu *vcpu);
    +	void (*set_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector,
    +			int trig_mode, int always_set);
     	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
     	int (*get_tdp_level)(void);
     	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
    diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
    index 21101b6..1003341 100644
    --- a/arch/x86/include/asm/vmx.h
    +++ b/arch/x86/include/asm/vmx.h
    @@ -62,6 +62,7 @@
     #define EXIT_REASON_MCE_DURING_VMENTRY  41
     #define EXIT_REASON_TPR_BELOW_THRESHOLD 43
     #define EXIT_REASON_APIC_ACCESS         44
    +#define EXIT_REASON_EOI_INDUCED         45
     #define EXIT_REASON_EPT_VIOLATION       48
     #define EXIT_REASON_EPT_MISCONFIG       49
     #define EXIT_REASON_WBINVD              54
    @@ -143,6 +144,7 @@
     #define SECONDARY_EXEC_WBINVD_EXITING           0x00000040
     #define SECONDARY_EXEC_UNRESTRICTED_GUEST       0x00000080
     #define SECONDARY_EXEC_APIC_REGISTER_VIRT       0x00000100
    +#define SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY    0x00000200
     #define SECONDARY_EXEC_PAUSE_LOOP_EXITING       0x00000400
     #define SECONDARY_EXEC_ENABLE_INVPCID           0x00001000
    @@ -180,6 +182,7 @@ enum vmcs_field {
     	GUEST_GS_SELECTOR               = 0x080a,
     	GUEST_LDTR_SELECTOR             = 0x080c,
     	GUEST_TR_SELECTOR               = 0x080e,
    +	GUEST_INTR_STATUS               = 0x0810,
     	HOST_ES_SELECTOR                = 0x0c00,
     	HOST_CS_SELECTOR                = 0x0c02,
     	HOST_SS_SELECTOR                = 0x0c04,
    @@ -207,6 +210,14 @@ enum vmcs_field {
     	APIC_ACCESS_ADDR_HIGH           = 0x2015,
     	EPT_POINTER                     = 0x201a,
     	EPT_POINTER_HIGH                = 0x201b,
    +	EOI_EXIT_BITMAP0                = 0x201c,
    +	EOI_EXIT_BITMAP0_HIGH           = 0x201d,
    +	EOI_EXIT_BITMAP1                = 0x201e,
    +	EOI_EXIT_BITMAP1_HIGH           = 0x201f,
    +	EOI_EXIT_BITMAP2                = 0x2020,
    +	EOI_EXIT_BITMAP2_HIGH           = 0x2021,
    +	EOI_EXIT_BITMAP3                = 0x2022,
    +	EOI_EXIT_BITMAP3_HIGH           = 0x2023,
     	GUEST_PHYSICAL_ADDRESS          = 0x2400,
     	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x2401,
     	VMCS_LINK_POINTER               = 0x2800,
    diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
    index 7e06ba1..f782788 100644
    --- a/arch/x86/kvm/irq.c
    +++ b/arch/x86/kvm/irq.c
    @@ -43,45 +43,64 @@ EXPORT_SYMBOL(kvm_cpu_has_pending_timer);
      */
     int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
     {
    -	struct kvm_pic *s;
    -
     	if (!irqchip_in_kernel(v->kvm))
     		return v->arch.interrupt.pending;
    -	if 

Re: [PATCHv5] virtio-spec: virtio network device RFS support

2012-12-06 Thread Michael S. Tsirkin
On Thu, Dec 06, 2012 at 08:03:14PM +, Ben Hutchings wrote:
 On Thu, 2012-12-06 at 10:13 +0200, Michael S. Tsirkin wrote:
  On Wed, Dec 05, 2012 at 08:39:26PM +, Ben Hutchings wrote:
   On Mon, 2012-12-03 at 12:58 +0200, Michael S. Tsirkin wrote:
Add RFS support to virtio network device.
Add a new feature flag VIRTIO_NET_F_RFS for this feature, a new
configuration field max_virtqueue_pairs to detect supported number of
virtqueues as well as a new command VIRTIO_NET_CTRL_RFS to program
packet steering for unidirectional protocols.
   [...]
+Programming of the receive flow classificator is implicit.
+ Transmitting a packet of a specific flow on transmitqX will cause incoming
+ packets for this flow to be steered to receiveqX.
+ For uni-directional protocols, or where no packets have been transmitted
+ yet, device will steer a packet to a random queue out of the specified
+ yet, device will steer a packet to a random queue out of the specified
+ receiveq0..receiveqn.
   [...]
   
   It doesn't seem like this is usable to implement accelerated RFS in the
   guest, though perhaps that doesn't matter.
  
  What is the issue? Could you be more explicit please?
  
  It seems to work pretty well: if we have
  # of queues = # of cpus, incoming TCP_STREAM into
  guest scales very nicely without manual tweaks in guest.
  
  The way it works is: when the guest sends a packet, the driver
  selects the rx queue that we want to use for incoming
  packets for this flow, and transmits on the matching tx queue.
  This is exactly what the text above suggests, no?
 
 Yes, I get that.
 
On the host side, presumably
   you'll want vhost_net to do the equivalent of sock_rps_record_flow() -
   only without a socket?  But in any case, that requires an rxhash, so I
   don't see how this is supposed to work.
   
   Ben.
  
  Host should just do what guest tells it to.
  On the host side we build up the steering table as we get packets
  to transmit. See the code in drivers/net/tun.c in recent
  kernels.
  
  Again this actually works fine - what are the problems that you see?
  Could you give an example please?
 
 I'm not saying it doesn't work in its own way, I just don't see how you
 would make it work with the existing RFS!
 
 Since this doesn't seem to be intended to have *any* connection with the
 existing core networking feature called RFS, perhaps you could find a
 different name for it.
 
 Ben.


Ah, I see what you mean. We started out calling this feature multiqueue;
Rusty suggested RFS since it gives similar functionality to RFS, but in the
device: it has per-flow receive steering logic as part of the device.

Maybe simply adding a statement similar to the one above would be
sufficient to avoid confusion?


 -- 
 Ben Hutchings, Staff Engineer, Solarflare
 Not speaking for my employer; that's the marketing department's job.
 They asked us to note that Solarflare product names are trademarked.


Re: [PATCHv5] virtio-spec: virtio network device RFS support

2012-12-06 Thread Ben Hutchings
On Thu, 2012-12-06 at 22:29 +0200, Michael S. Tsirkin wrote:
 On Thu, Dec 06, 2012 at 08:03:14PM +, Ben Hutchings wrote:
[...]
  Since this doesn't seem to be intended to have *any* connection with the
  existing core networking feature called RFS, perhaps you could find a
  different name for it.
  
  Ben.
 
 
 Ah, I see what you mean. We started out calling this feature multiqueue;
 Rusty suggested RFS since it gives similar functionality to RFS, but in the
 device: it has per-flow receive steering logic as part of the device.

The name is quite generic, but in the context of Linux it has so far
been used for a specific software feature and not as a generic name for
flow steering by hardware (or drivers).  The existing documentation
(Documentation/networking/scaling.txt) states quite clearly that 'RFS'
means that specific software implementation (with optional driver
integration) and configuration interface.

 Maybe simply adding a statement similar to the one above would be
 sufficient to avoid confusion?

No, I don't think it's sufficient.  We have documentation that says how
to configure 'RFS', and you're proposing to add a very similar feature
called 'RFS' that is configured differently.  No matter how clearly you
distinguish them in new documentation, this will make the old
documentation confusing.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



Re: [PATCHv5] virtio-spec: virtio network device RFS support

2012-12-06 Thread Michael S. Tsirkin
On Thu, Dec 06, 2012 at 08:53:59PM +, Ben Hutchings wrote:
 On Thu, 2012-12-06 at 22:29 +0200, Michael S. Tsirkin wrote:
  On Thu, Dec 06, 2012 at 08:03:14PM +, Ben Hutchings wrote:
 [...]
   Since this doesn't seem to be intended to have *any* connection with the
   existing core networking feature called RFS, perhaps you could find a
   different name for it.
   
   Ben.
  
  
  Ah, I see what you mean. We started out calling this feature multiqueue;
  Rusty suggested RFS since it gives similar functionality to RFS, but in the
  device: it has per-flow receive steering logic as part of the device.
 
 The name is quite generic, but in the context of Linux it has so far
 been used for a specific software feature and not as a generic name for
 flow steering by hardware (or drivers).  The existing documentation
 (Documentation/networking/scaling.txt) states quite clearly that 'RFS'
 means that specific software implementation (with optional driver
 integration) and configuration interface.

  Maybe simply adding a statement similar to the one above would be
  sufficient to avoid confusion?
 
 No, I don't think it's sufficient.  We have documentation that says how
 to configure 'RFS', and you're proposing to add a very similar feature
 called 'RFS' that is configured differently.  No matter how clearly you
 distinguish them in new documentation, this will make the old
 documentation confusing.
 
 Ben.

I don't mind; renaming is just an s/RFS/whatever/ away.
What should hardware call this, in your opinion?

 -- 
 Ben Hutchings, Staff Engineer, Solarflare
 Not speaking for my employer; that's the marketing department's job.
 They asked us to note that Solarflare product names are trademarked.


[PATCH] kvm: Fix irqfd resampler list walk

2012-12-06 Thread Alex Williamson
Typo for the next pointer means we're walking random data here.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Cc: sta...@vger.kernel.org [3.7]
---

Not sure if this will make 3.7, so preemptively adding the stable flag

 virt/kvm/eventfd.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 9718e98..62e7bd6 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -332,7 +332,7 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 	mutex_lock(&kvm->irqfds.resampler_lock);
 
 	list_for_each_entry(resampler,
-			    &kvm->irqfds.resampler_list, list) {
+			    &kvm->irqfds.resampler_list, link) {
 		if (resampler->notifier.gsi == irqfd->gsi) {
 			irqfd->resampler = resampler;
 			break;



Re: [PATCHv5] virtio-spec: virtio network device RFS support

2012-12-06 Thread Ben Hutchings
On Thu, 2012-12-06 at 23:01 +0200, Michael S. Tsirkin wrote:
 On Thu, Dec 06, 2012 at 08:53:59PM +, Ben Hutchings wrote:
  On Thu, 2012-12-06 at 22:29 +0200, Michael S. Tsirkin wrote:
   On Thu, Dec 06, 2012 at 08:03:14PM +, Ben Hutchings wrote:
  [...]
Since this doesn't seem to be intended to have *any* connection with the
existing core networking feature called RFS, perhaps you could find a
different name for it.

Ben.
   
   
   Ah, I see what you mean. We started out calling this feature multiqueue;
   Rusty suggested RFS since it gives similar functionality to RFS, but in the
   device: it has per-flow receive steering logic as part of the device.
  
  The name is quite generic, but in the context of Linux it has so far
  been used for a specific software feature and not as a generic name for
  flow steering by hardware (or drivers).  The existing documentation
  (Documentation/networking/scaling.txt) states quite clearly that 'RFS'
  means that specific software implementation (with optional driver
  integration) and configuration interface.
 
   Maybe simply adding a statement similar to the one above would be
   sufficient to avoid confusion?
  
  No, I don't think it's sufficient.  We have documentation that says how
  to configure 'RFS', and you're proposing to add a very similar feature
  called 'RFS' that is configured differently.  No matter how clearly you
  distinguish them in new documentation, this will make the old
  documentation confusing.
  
  Ben.
 
 I don't mind; renaming is just an s/RFS/whatever/ away.
 What should hardware call this, in your opinion?

If by 'this' you mean the use of perfect filters or a large hash table
to select the RX queue per flow, then 'flow steering'.

But that is usually combined with the fall-back of a simple mapping from
hash to queue ('RSS' or 'flow hashing') in case there is no specific
queue selection yet, which I can see tun has.  And you're specifying
multiple transmit queues too.  If you want a name for the whole set of
features involved, I don't see any better name than 'multiqueue'/'MQ'.

If you want a name for this specific flow steering mechanism, add some
distinguishing adjective(s) like 'virtual' or 'automatic'.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



[PATCH 08/10] kvm: struct kvm_memory_slot.flags - u32

2012-12-06 Thread Alex Williamson
struct kvm_userspace_memory_region.flags is a u32 with a comment that
bits 0 ~ 15 are visible to userspace and the other bits are reserved
for kvm internal use.  KVM_MEMSLOT_INVALID is the only internal use
flag and it has a comment that bits 16 ~ 31 are internally used and
the other bits are visible to userspace.

Therefore, let's define this as a u32 so we don't waste bytes on LP64
systems.  Move to the end of the struct for alignment.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 include/linux/kvm_host.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9ff30f2..641f5fb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -251,10 +251,10 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
 struct kvm_memory_slot {
gfn_t base_gfn;
unsigned long npages;
-   unsigned long flags;
unsigned long *dirty_bitmap;
struct kvm_arch_memory_slot arch;
unsigned long userspace_addr;
+   u32 flags;
int id;
bool user_alloc;
 };



[PATCH 09/10] kvm: struct kvm_memory_slot.id - short

2012-12-06 Thread Alex Williamson
We're currently offering a whopping 32 memory slots to user space; an
int is a bit excessive for storing this.  We would like to increase
our memslots, but SHRT_MAX should be more than enough.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 include/linux/kvm_host.h |4 ++--
 virt/kvm/kvm_main.c  |2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 641f5fb..87089dd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -255,7 +255,7 @@ struct kvm_memory_slot {
struct kvm_arch_memory_slot arch;
unsigned long userspace_addr;
u32 flags;
-   int id;
+   short id;
bool user_alloc;
 };
 
@@ -315,7 +315,7 @@ struct kvm_memslots {
u64 generation;
struct kvm_memory_slot memslots[KVM_MEM_SLOTS_NUM];
/* The mapping table from slot id to the index in memslots[]. */
-   int id_to_index[KVM_MEM_SLOTS_NUM];
+   short id_to_index[KVM_MEM_SLOTS_NUM];
 };
 
 struct kvm {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6eeb101..6e4709f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -469,6 +469,8 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	INIT_HLIST_HEAD(&kvm->irq_ack_notifier_list);
 #endif
 
+	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
+
 	r = -ENOMEM;
 	kvm->memslots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
 	if (!kvm->memslots)



[PATCH 10/10] kvm: Increase user memory slots on x86 to 125

2012-12-06 Thread Alex Williamson
With the 3 private slots, this gives us a nice round 128 slots total.
The primary motivation for this is to support more assigned devices.
Each assigned device can theoretically use up to 8 slots (6 MMIO BARs,
1 ROM BAR, 1 spare for a split MSI-X table mapping) though it's far
more typical for a device to use 3-4 slots.  If we assume a typical VM
uses a dozen slots for non-assigned-device purposes, we should always
be able to support 14 worst-case assigned devices, or 28 to 37 typical
devices.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 arch/x86/include/asm/kvm_host.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ce8b037..9558a1e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -31,7 +31,7 @@
 
 #define KVM_MAX_VCPUS 254
 #define KVM_SOFT_MAX_VCPUS 160
-#define KVM_USER_MEM_SLOTS 32
+#define KVM_USER_MEM_SLOTS 125
 /* memory slots that are not exposed to userspace */
 #define KVM_PRIVATE_MEM_SLOTS 3
 #define KVM_MEM_SLOTS_NUM (KVM_USER_MEM_SLOTS + KVM_PRIVATE_MEM_SLOTS)



[PATCH 07/10] kvm: struct kvm_memory_slot.user_alloc - bool

2012-12-06 Thread Alex Williamson
There's no need for this to be an int; it holds a boolean.
Move to the end of the struct for alignment.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 arch/ia64/kvm/kvm-ia64.c   |6 +++---
 arch/powerpc/kvm/powerpc.c |4 ++--
 arch/s390/kvm/kvm-s390.c   |4 ++--
 arch/x86/kvm/vmx.c |6 +++---
 arch/x86/kvm/x86.c |4 ++--
 include/linux/kvm_host.h   |   12 ++--
 virt/kvm/kvm_main.c|9 +
 7 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index f1a46bd..a8b4022 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -955,7 +955,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		kvm_userspace_mem.guest_phys_addr =
 				kvm_mem.guest_phys_addr;
 		kvm_userspace_mem.memory_size = kvm_mem.memory_size;
 		r = kvm_vm_ioctl_set_memory_region(kvm,
-				&kvm_userspace_mem, 0);
+				&kvm_userspace_mem, false);
if (r)
goto out;
break;
@@ -1577,7 +1577,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
struct kvm_memory_slot *memslot,
struct kvm_memory_slot old,
struct kvm_userspace_memory_region *mem,
-   int user_alloc)
+   bool user_alloc)
 {
unsigned long i;
unsigned long pfn;
@@ -1608,7 +1608,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 void kvm_arch_commit_memory_region(struct kvm *kvm,
struct kvm_userspace_memory_region *mem,
struct kvm_memory_slot old,
-   int user_alloc)
+   bool user_alloc)
 {
return;
 }
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 4d213b8..da606a9 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -321,7 +321,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
struct kvm_memory_slot *memslot,
struct kvm_memory_slot old,
struct kvm_userspace_memory_region *mem,
-   int user_alloc)
+   bool user_alloc)
 {
return kvmppc_core_prepare_memory_region(kvm, mem);
 }
@@ -329,7 +329,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 void kvm_arch_commit_memory_region(struct kvm *kvm,
struct kvm_userspace_memory_region *mem,
struct kvm_memory_slot old,
-   int user_alloc)
+   bool user_alloc)
 {
kvmppc_core_commit_memory_region(kvm, mem);
 }
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index ecced9d..37646cb 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -927,7 +927,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
   struct kvm_memory_slot *memslot,
   struct kvm_memory_slot old,
   struct kvm_userspace_memory_region *mem,
-  int user_alloc)
+  bool user_alloc)
 {
/* A few sanity checks. We can have exactly one memory slot which has
   to start at guest virtual zero and which has to be located at a
@@ -957,7 +957,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 void kvm_arch_commit_memory_region(struct kvm *kvm,
struct kvm_userspace_memory_region *mem,
struct kvm_memory_slot old,
-   int user_alloc)
+   bool user_alloc)
 {
int rc;
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f858159..108becc 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3597,7 +3597,7 @@ static int alloc_apic_access_page(struct kvm *kvm)
 	kvm_userspace_mem.flags = 0;
 	kvm_userspace_mem.guest_phys_addr = 0xfee00000ULL;
 	kvm_userspace_mem.memory_size = PAGE_SIZE;
-	r = __kvm_set_memory_region(kvm, &kvm_userspace_mem, 0);
+	r = __kvm_set_memory_region(kvm, &kvm_userspace_mem, false);
 	if (r)
 		goto out;
 
@@ -3627,7 +3627,7 @@ static int alloc_identity_pagetable(struct kvm *kvm)
 	kvm_userspace_mem.guest_phys_addr =
 			kvm->arch.ept_identity_map_addr;
 	kvm_userspace_mem.memory_size = PAGE_SIZE;
-	r = __kvm_set_memory_region(kvm, &kvm_userspace_mem, 0);
+	r = __kvm_set_memory_region(kvm, &kvm_userspace_mem, false);
 	if (r)
 		goto out;
 
@@ -4191,7 +4191,7 @@ static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
 		.flags = 0,
 	};
 
-	ret = kvm_set_memory_region(kvm, &tss_mem, 0);
+	ret = kvm_set_memory_region(kvm, &tss_mem, false);
 

[PATCH 03/10] kvm: Fix iommu map/unmap to handle memory slot moves

2012-12-06 Thread Alex Williamson
The iommu integration into memory slots expects memory slots to be
added or removed and doesn't handle the move case.  We can unmap
slots from the iommu after we mark them invalid and map them before
installing the final memslot array.  Also re-order the kmemdup vs
map so we don't leave iommu mappings if we get ENOMEM.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 virt/kvm/kvm_main.c |   19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 93213e1..d27c135 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -817,6 +817,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
old_memslots = kvm-memslots;
rcu_assign_pointer(kvm-memslots, slots);
synchronize_srcu_expedited(kvm-srcu);
+   /* slot was deleted or moved, clear iommu mapping */
+   kvm_iommu_unmap_pages(kvm, old);
/* From this point no new shadow pages pointing to a deleted,
 * or moved, memslot will be created.
 *
@@ -832,20 +834,19 @@ int __kvm_set_memory_region(struct kvm *kvm,
if (r)
goto out_free;
 
-   /* map/unmap the pages in iommu page table */
-   if (npages) {
-   r = kvm_iommu_map_pages(kvm, new);
-   if (r)
-   goto out_free;
-   } else
-   kvm_iommu_unmap_pages(kvm, old);
-
r = -ENOMEM;
slots = kmemdup(kvm-memslots, sizeof(struct kvm_memslots),
GFP_KERNEL);
if (!slots)
goto out_free;
 
+   /* map new memory slot into the iommu */
+   if (npages) {
+   r = kvm_iommu_map_pages(kvm, new);
+   if (r)
+   goto out_slots;
+   }
+
/* actual memory is freed via old in kvm_free_physmem_slot below */
if (!npages) {
new.dirty_bitmap = NULL;
@@ -864,6 +865,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
return 0;
 
+out_slots:
+   kfree(slots);
 out_free:
kvm_free_physmem_slot(&new, &old);
 out:
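The reordering above is an instance of a general rule: do every fallible allocation before any side effect that would need undoing. A minimal userspace sketch of the pattern (hypothetical names, not the kernel API):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-ins for the iommu map/unmap side effects. */
static int mapped;                        /* live "iommu mappings" */
static int map_pages(void)    { mapped++; return 0; }
static void unmap_pages(void) { if (mapped) mapped--; }

/*
 * Perform every fallible step (the kmemdup analogue) before the side
 * effect (the iommu map), so an allocation failure unwinds cleanly and
 * leaves no mapping behind.
 */
static int install_slot(const int *cur, size_t n, int fail_alloc)
{
    int *slots = fail_alloc ? NULL : malloc(n * sizeof(*slots));
    if (!slots)
        return -1;                        /* -ENOMEM: nothing to undo */
    memcpy(slots, cur, n * sizeof(*slots));

    if (map_pages() < 0) {                /* side effect only after alloc */
        free(slots);
        return -1;
    }
    /* ... commit the new slot array, free the old one ... */
    free(slots);
    return 0;
}
```

Forcing the allocation to fail shows that no mapping leaks, which is exactly what moving the kmemdup ahead of the map buys.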

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/10] kvm: Minor memory slot optimization

2012-12-06 Thread Alex Williamson
If a slot is removed or moved in the guest physical address space, we
first allocate and install a new slot array with the invalidated
entry.  The old array is then freed.  We then proceed to allocate yet
another slot array to install the permanent replacement.  Re-use the
original array when this occurs and avoid the extra kfree/kmalloc.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 virt/kvm/kvm_main.c |   21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d27c135..24a67f0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -711,7 +711,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
unsigned long npages;
struct kvm_memory_slot *memslot, *slot;
struct kvm_memory_slot old, new;
-   struct kvm_memslots *slots, *old_memslots;
+   struct kvm_memslots *slots = NULL, *old_memslots;
 
r = check_memory_region_flags(mem);
if (r)
@@ -827,18 +827,25 @@ int __kvm_set_memory_region(struct kvm *kvm,
 *  - kvm_is_visible_gfn (mmu_check_roots)
 */
kvm_arch_flush_shadow_memslot(kvm, slot);
-   kfree(old_memslots);
+   slots = old_memslots;
}
 
r = kvm_arch_prepare_memory_region(kvm, &new, &old, mem, user_alloc);
if (r)
-   goto out_free;
+   goto out_slots;
 
r = -ENOMEM;
-   slots = kmemdup(kvm->memslots, sizeof(struct kvm_memslots),
-   GFP_KERNEL);
-   if (!slots)
-   goto out_free;
+   /*
+* We can re-use the old_memslots from above, the only difference
+* from the currently installed memslots is the invalid flag.  This
+* will get overwritten by update_memslots anyway.
+*/
+   if (!slots) {
+   slots = kmemdup(kvm->memslots, sizeof(struct kvm_memslots),
+   GFP_KERNEL);
+   if (!slots)
+   goto out_free;
+   }
 
/* map new memory slot into the iommu */
if (npages) {
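The optimization is generic: if the caller already holds a transitional copy of the array, overwrite it in place rather than freeing it and allocating a fresh one. A small sketch under that assumption (hypothetical names, not the memslot structures):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int allocs;   /* count heap allocations to show the saving */

/*
 * Install 'data' as the current array.  If the caller already owns a
 * transitional copy ('scratch', the old_memslots analogue), reuse it in
 * place; allocate only when no such copy exists.
 */
static int *install(int *scratch, const int *data, size_t n)
{
    int *slots = scratch;
    if (!slots) {
        slots = malloc(n * sizeof(*slots));
        if (!slots)
            return NULL;
        allocs++;
    }
    memcpy(slots, data, n * sizeof(*slots)); /* stale flags overwritten */
    return slots;
}
```

The second call below performs no allocation at all, mirroring the kfree/kmemdup pair the patch removes.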



[PATCH 05/10] kvm: Rename KVM_MEMORY_SLOTS -> KVM_USER_MEM_SLOTS

2012-12-06 Thread Alex Williamson
It's easy to confuse KVM_MEMORY_SLOTS and KVM_MEM_SLOTS_NUM.  One is
the user accessible slots and the other is user + private.  Make this
more obvious.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 arch/ia64/include/asm/kvm_host.h|2 +-
 arch/ia64/kvm/kvm-ia64.c|2 +-
 arch/powerpc/include/asm/kvm_host.h |4 ++--
 arch/powerpc/kvm/book3s_hv.c|2 +-
 arch/s390/include/asm/kvm_host.h|2 +-
 arch/x86/include/asm/kvm_host.h |4 ++--
 arch/x86/include/asm/vmx.h  |6 +++---
 arch/x86/kvm/x86.c  |6 +++---
 include/linux/kvm_host.h|2 +-
 virt/kvm/kvm_main.c |8 
 10 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/ia64/include/asm/kvm_host.h b/arch/ia64/include/asm/kvm_host.h
index 6d6a5ac..48d7b0e 100644
--- a/arch/ia64/include/asm/kvm_host.h
+++ b/arch/ia64/include/asm/kvm_host.h
@@ -23,7 +23,7 @@
 #ifndef __ASM_KVM_HOST_H
 #define __ASM_KVM_HOST_H
 
-#define KVM_MEMORY_SLOTS 32
+#define KVM_USER_MEM_SLOTS 32
 /* memory slots that does not exposed to userspace */
 #define KVM_PRIVATE_MEM_SLOTS 4
 
diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 8b3a9c0..f1a46bd 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -1831,7 +1831,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
mutex_lock(kvm-slots_lock);
 
r = -EINVAL;
-   if (log->slot >= KVM_MEMORY_SLOTS)
+   if (log->slot >= KVM_USER_MEM_SLOTS)
goto out;
 
memslot = id_to_memslot(kvm->memslots, log->slot);
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 28e8f5e..5eb1dd8 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -37,10 +37,10 @@
 
 #define KVM_MAX_VCPUS  NR_CPUS
 #define KVM_MAX_VCORES NR_CPUS
-#define KVM_MEMORY_SLOTS 32
+#define KVM_USER_MEM_SLOTS 32
 /* memory slots that does not exposed to userspace */
 #define KVM_PRIVATE_MEM_SLOTS 4
-#define KVM_MEM_SLOTS_NUM (KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
+#define KVM_MEM_SLOTS_NUM (KVM_USER_MEM_SLOTS + KVM_PRIVATE_MEM_SLOTS)
 
 #ifdef CONFIG_KVM_MMIO
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 721d460..75ce80e 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1262,7 +1262,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
mutex_lock(kvm-slots_lock);
 
r = -EINVAL;
-   if (log->slot >= KVM_MEMORY_SLOTS)
+   if (log->slot >= KVM_USER_MEM_SLOTS)
goto out;
 
memslot = id_to_memslot(kvm->memslots, log->slot);
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index b784154..ac33432 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -20,7 +20,7 @@
 #include <asm/cpu.h>
 
 #define KVM_MAX_VCPUS 64
-#define KVM_MEMORY_SLOTS 32
+#define KVM_USER_MEM_SLOTS 32
 /* memory slots that does not exposed to userspace */
 #define KVM_PRIVATE_MEM_SLOTS 4
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b2e11f4..e619519 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -31,10 +31,10 @@
 
 #define KVM_MAX_VCPUS 254
 #define KVM_SOFT_MAX_VCPUS 160
-#define KVM_MEMORY_SLOTS 32
+#define KVM_USER_MEM_SLOTS 32
 /* memory slots that does not exposed to userspace */
 #define KVM_PRIVATE_MEM_SLOTS 4
-#define KVM_MEM_SLOTS_NUM (KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
+#define KVM_MEM_SLOTS_NUM (KVM_USER_MEM_SLOTS + KVM_PRIVATE_MEM_SLOTS)
 
 #define KVM_MMIO_SIZE 16
 
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 36ec21c..72932d2 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -427,9 +427,9 @@ enum vmcs_field {
 
 #define AR_RESERVD_MASK 0xfffe0f00
 
-#define TSS_PRIVATE_MEMSLOT(KVM_MEMORY_SLOTS + 0)
-#define APIC_ACCESS_PAGE_PRIVATE_MEMSLOT   (KVM_MEMORY_SLOTS + 1)
-#define IDENTITY_PAGETABLE_PRIVATE_MEMSLOT (KVM_MEMORY_SLOTS + 2)
+#define TSS_PRIVATE_MEMSLOT(KVM_USER_MEM_SLOTS + 0)
+#define APIC_ACCESS_PAGE_PRIVATE_MEMSLOT   (KVM_USER_MEM_SLOTS + 1)
+#define IDENTITY_PAGETABLE_PRIVATE_MEMSLOT (KVM_USER_MEM_SLOTS + 2)
 
 #define VMX_NR_VPIDS   (1 << 16)
 #define VMX_VPID_EXTENT_SINGLE_CONTEXT 1
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4f76417..1aa3fae 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2204,7 +2204,7 @@ int kvm_dev_ioctl_check_extension(long ext)
r = KVM_MAX_VCPUS;
break;
case KVM_CAP_NR_MEMSLOTS:
-   r = KVM_MEMORY_SLOTS;
+   r = KVM_USER_MEM_SLOTS;
break;
case KVM_CAP_PV_MMU:/* obsolete */
 

[PATCH 06/10] kvm: Make KVM_PRIVATE_MEM_SLOTS optional

2012-12-06 Thread Alex Williamson
Seems like everyone copied x86 and defined 4 private memory slots
that never actually get used.  Even x86 only uses 3 of the 4.  These
aren't exposed so there's no need to add padding.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 arch/ia64/include/asm/kvm_host.h|2 --
 arch/powerpc/include/asm/kvm_host.h |4 +---
 arch/s390/include/asm/kvm_host.h|2 --
 arch/x86/include/asm/kvm_host.h |4 ++--
 include/linux/kvm_host.h|4 
 5 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/ia64/include/asm/kvm_host.h b/arch/ia64/include/asm/kvm_host.h
index 48d7b0e..cfa7498 100644
--- a/arch/ia64/include/asm/kvm_host.h
+++ b/arch/ia64/include/asm/kvm_host.h
@@ -24,8 +24,6 @@
 #define __ASM_KVM_HOST_H
 
 #define KVM_USER_MEM_SLOTS 32
-/* memory slots that does not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS 4
 
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
 
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 5eb1dd8..23ca70d 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -38,9 +38,7 @@
 #define KVM_MAX_VCPUS  NR_CPUS
 #define KVM_MAX_VCORES NR_CPUS
 #define KVM_USER_MEM_SLOTS 32
-/* memory slots that does not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS 4
-#define KVM_MEM_SLOTS_NUM (KVM_USER_MEM_SLOTS + KVM_PRIVATE_MEM_SLOTS)
+#define KVM_MEM_SLOTS_NUM KVM_USER_MEM_SLOTS
 
 #ifdef CONFIG_KVM_MMIO
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index ac33432..711c5ab 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -21,8 +21,6 @@
 
 #define KVM_MAX_VCPUS 64
 #define KVM_USER_MEM_SLOTS 32
-/* memory slots that does not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS 4
 
 struct sca_entry {
atomic_t scn;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e619519..ce8b037 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -32,8 +32,8 @@
 #define KVM_MAX_VCPUS 254
 #define KVM_SOFT_MAX_VCPUS 160
 #define KVM_USER_MEM_SLOTS 32
-/* memory slots that does not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS 4
+/* memory slots that are not exposed to userspace */
+#define KVM_PRIVATE_MEM_SLOTS 3
 #define KVM_MEM_SLOTS_NUM (KVM_USER_MEM_SLOTS + KVM_PRIVATE_MEM_SLOTS)
 
 #define KVM_MMIO_SIZE 16
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fb9354d..bf8380f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -298,6 +298,10 @@ struct kvm_irq_routing_table {};
 
 #endif
 
+#ifndef KVM_PRIVATE_MEM_SLOTS
+#define KVM_PRIVATE_MEM_SLOTS 0
+#endif
+
 #ifndef KVM_MEM_SLOTS_NUM
 #define KVM_MEM_SLOTS_NUM (KVM_USER_MEM_SLOTS + KVM_PRIVATE_MEM_SLOTS)
 #endif



[PATCH 01/10] kvm: Restrict non-existing slot state transitions

2012-12-06 Thread Alex Williamson
The API documentation states:

When changing an existing slot, it may be moved in the guest
physical memory space, or its flags may be modified.

An existing slot requires a non-zero npages (memory_size).  The only
transition we should therefore allow for a non-existing slot should be
to create the slot, which includes setting a non-zero memory_size.  We
currently allow calls to modify non-existing slots, which is pointless,
confusing, and possibly wrong.

With this we know that the invalidation path of __kvm_set_memory_region
is always for a delete or move and never for adding a zero size slot.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 virt/kvm/kvm_main.c |9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6e8fa7e..e426704 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -753,10 +753,15 @@ int __kvm_set_memory_region(struct kvm *kvm,
new.npages = npages;
new.flags = mem->flags;
 
-   /* Disallow changing a memory slot's size. */
+   /*
+* Disallow changing a memory slot's size or changing anything about
+* zero sized slots that doesn't involve making them non-zero.
+*/
r = -EINVAL;
if (npages && old.npages && npages != old.npages)
goto out_free;
+   if (!npages && !old.npages)
+   goto out_free;
 
/* Check for overlaps */
r = -EEXIST;
@@ -775,7 +780,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
r = -ENOMEM;
 
/* Allocate if a slot is being created */
-   if (npages && !old.npages) {
+   if (!old.npages) {
new.user_alloc = user_alloc;
new.userspace_addr = mem->userspace_addr;
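The two checks reduce to a small transition table over (old.npages, npages): create, delete, and same-size modify are allowed; resizing and touching a non-existing slot are not. A hedged standalone sketch of the same validation (not the kernel function):

```c
#include <assert.h>
#include <errno.h>

/*
 * Allowed (old_npages -> npages) transitions: create (0 -> N),
 * delete (N -> 0), and same-size modify (N -> N).  Resizing and
 * "modifying" a non-existing slot are rejected.
 */
static int check_transition(unsigned long old_npages, unsigned long npages)
{
    if (npages && old_npages && npages != old_npages)
        return -EINVAL;    /* resize of an existing slot */
    if (!npages && !old_npages)
        return -EINVAL;    /* operation on a non-existing slot */
    return 0;
}
```

With this table in mind, the invalidation path is guaranteed to see only deletes and moves, as the commit message notes.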
 



[PATCH 00/10] kvm: memory slot cleanups, fix, and increase

2012-12-06 Thread Alex Williamson
This series does away with any kind of complicated resizing of the
slot array and simply does a one time increase.  I do compact struct
kvm_memory_slot a bit to take better advantage of the space we are
using.  This reduces each slot from 64 bytes (x86_64) to 56 bytes.
By enforcing the API around valid operations for an unused slot and
fields that can be modified runtime, I found and was able to fix a
bug in iommu mapping for slots.  The renames enabled me to find the
previously posted bug fix for catching slot overlaps.

As mentioned in the series, the primary motivation for increasing
memory slots is assigned devices.  With this, I've been able to
assign 30 devices to a single VM and could have gone further, but
ran out of SRIOV VFs.  Typical devices use anywhere from 2-4 slots
and max out at 8 slots.  125 user slots (3 private slots) allows
us to support between 28 and 56 typical devices per VM.

Tested on x86_64, compiled on ia64, powerpc, and s390.

Thanks,
Alex

---

Alex Williamson (10):
  kvm: Restrict non-existing slot state transitions
  kvm: Check userspace_addr when modifying a memory slot
  kvm: Fix iommu map/unmap to handle memory slot moves
  kvm: Minor memory slot optimization
  kvm: Rename KVM_MEMORY_SLOTS -> KVM_USER_MEM_SLOTS
  kvm: Make KVM_PRIVATE_MEM_SLOTS optional
  kvm: struct kvm_memory_slot.user_alloc -> bool
  kvm: struct kvm_memory_slot.flags -> u32
  kvm: struct kvm_memory_slot.id -> short
  kvm: Increase user memory slots on x86 to 125


 arch/ia64/include/asm/kvm_host.h|4 --
 arch/ia64/kvm/kvm-ia64.c|8 ++--
 arch/powerpc/include/asm/kvm_host.h |6 +--
 arch/powerpc/kvm/book3s_hv.c|2 -
 arch/powerpc/kvm/powerpc.c  |4 +-
 arch/s390/include/asm/kvm_host.h|4 --
 arch/s390/kvm/kvm-s390.c|4 +-
 arch/x86/include/asm/kvm_host.h |8 ++--
 arch/x86/include/asm/vmx.h  |6 +--
 arch/x86/kvm/vmx.c  |6 +--
 arch/x86/kvm/x86.c  |   10 ++---
 include/linux/kvm_host.h|   24 +++-
 virt/kvm/kvm_main.c |   72 +++
 13 files changed, 90 insertions(+), 68 deletions(-)


[PATCH 02/10] kvm: Check userspace_addr when modifying a memory slot

2012-12-06 Thread Alex Williamson
The API documents that only flags and guest physical memory space can
be modified on an existing slot, but we don't enforce that the
userspace address cannot be modified.  Instead we just ignore it.
This means that a user may think they've successfully moved both the
guest and user addresses, when in fact only the guest address changed.
Check and error instead.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 virt/kvm/kvm_main.c |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e426704..93213e1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -779,13 +779,19 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
r = -ENOMEM;
 
-   /* Allocate if a slot is being created */
+   /*
+* Allocate if a slot is being created.  If modifying a slot,
+* the userspace_addr cannot change.
+*/
if (!old.npages) {
new.user_alloc = user_alloc;
new.userspace_addr = mem->userspace_addr;

if (kvm_arch_create_memslot(&new, npages))
goto out_free;
+   } else if (mem->userspace_addr != old.userspace_addr) {
+   r = -EINVAL;
+   goto out_free;
}
 
/* Allocate page dirty bitmap if needed */
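The enforced rule: creation records the userspace address; any later modification must present the same address or fail, instead of being silently ignored. A freestanding sketch (illustrative types, not the kernel structures):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

struct slot { unsigned long npages; uint64_t userspace_addr; };

/*
 * On creation, record the userspace address in the new slot; on a later
 * modification, require the caller to present the same address rather
 * than silently ignoring a different one.
 */
static int set_slot(const struct slot *old, struct slot *new_slot,
                    uint64_t userspace_addr)
{
    if (!old->npages)
        new_slot->userspace_addr = userspace_addr;  /* create: record */
    else if (userspace_addr != old->userspace_addr)
        return -EINVAL;                             /* move: rejected */
    return 0;
}
```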



KVM: x86: fix mov immediate emulation for 64-bit operands

2012-12-06 Thread Marcelo Tosatti

From: Nadav Amit nadav.a...@gmail.com

MOV immediate instruction (opcodes 0xB8-0xBF) may take 64-bit operand.
The previous emulation implementation assumes the operand is no longer than 32 bits.
Adding OpImm64 for this matter.

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=881579

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 39171cb..6fec09c 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -43,7 +43,7 @@
 #define OpCL   9ull  /* CL register (for shifts) */
 #define OpImmByte 10ull  /* 8-bit sign extended immediate */
 #define OpOne 11ull  /* Implied 1 */
-#define OpImm 12ull  /* Sign extended immediate */
+#define OpImm 12ull  /* Sign extended up to 32-bit immediate */
 #define OpMem16   13ull  /* Memory operand (16-bit). */
 #define OpMem32   14ull  /* Memory operand (32-bit). */
 #define OpImmU    15ull  /* Immediate operand, zero extended */
@@ -58,6 +58,7 @@
 #define OpFS  24ull  /* FS */
 #define OpGS  25ull  /* GS */
 #define OpMem8    26ull  /* 8-bit zero extended memory operand */
+#define OpImm64   27ull  /* Sign extended 16/32/64-bit immediate */
 
 #define OpBits 5  /* Width of operand field */
 #define OpMask ((1ull << OpBits) - 1)
@@ -101,6 +102,7 @@
 #define SrcMemFAddr (OpMemFAddr << SrcShift)
 #define SrcAcc      (OpAcc << SrcShift)
 #define SrcImmU16   (OpImmU16 << SrcShift)
+#define SrcImm64    (OpImm64 << SrcShift)
 #define SrcDX       (OpDX << SrcShift)
 #define SrcMem8     (OpMem8 << SrcShift)
 #define SrcMask     (OpMask << SrcShift)
@@ -3786,7 +3788,7 @@ static const struct opcode opcode_table[256] = {
/* 0xB0 - 0xB7 */
X8(I(ByteOp | DstReg | SrcImm | Mov, em_mov)),
/* 0xB8 - 0xBF */
-   X8(I(DstReg | SrcImm | Mov, em_mov)),
+   X8(I(DstReg | SrcImm64 | Mov, em_mov)),
/* 0xC0 - 0xC7 */
D2bv(DstMem | SrcImmByte | ModRM),
I(ImplicitOps | Stack | SrcImmU16, em_ret_near_imm),
@@ -3950,6 +3952,9 @@ static int decode_imm(struct x86_emulate_ctxt *ctxt, struct operand *op,
case 4:
op->val = insn_fetch(s32, ctxt);
break;
+   case 8:
+   op->val = insn_fetch(s64, ctxt);
+   break;
}
if (!sign_extension) {
switch (op->bytes) {
@@ -4028,6 +4033,9 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, struct operand *op,
case OpImm:
rc = decode_imm(ctxt, op, imm_size(ctxt), true);
break;
+   case OpImm64:
+   rc = decode_imm(ctxt, op, ctxt->op_bytes, true);
+   break;
case OpMem8:
ctxt->memop.bytes = 1;
goto mem_common;
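The underlying bug is easy to reproduce in plain C: fetching a 64-bit immediate through a 32-bit signed read truncates the value. A sketch of the two fetch widths decode_imm() must distinguish (hypothetical helper, not the emulator's API):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Fetch an immediate of 'bytes' length from an instruction stream,
 * sign-extending to 64 bits -- the distinction between a 4-byte
 * immediate and an 8-byte (movabs-style) immediate.
 */
static int64_t fetch_imm(const uint8_t *p, int bytes)
{
    int64_t val = 0;
    switch (bytes) {
    case 4: { int32_t v; memcpy(&v, p, 4); val = v; break; }
    case 8: { int64_t v; memcpy(&v, p, 8); val = v; break; }
    }
    return val;
}
```

With the immediate bytes of a `movabs rax, 0x1122334455667788` laid out little-endian, the 4-byte path only ever sees the low half, which is the failure OpImm64 fixes.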


Re: [RFC PATCH 5/6] kvm: Re-introduce memslots->nmemslots

2012-12-06 Thread Marcelo Tosatti
On Wed, Dec 05, 2012 at 08:51:37PM -0700, Alex Williamson wrote:
   id_to_memslot seems like a good place to catch all the users since
   that's the only way to get a slot from a slot id after the array is
   sorted.  We need to check both is the slot in bounds (EINVAL), but also
   is it allocated (ENOENT).  id_to_memslot could both of these if we
   wanted to switch it to ERR_PTR.  Thanks,
   
   Alex
  
  There should never be a reference to a slot out of bounds by KVM itself
  (BUG_ON). Only userspace can attempt a reference to such slot.
 
 If I understand correctly, you're saying this last chunk is unique
 because kvm_get_dirty_log() is an internal interface and the test should
 be restricted to callers from userspace interfaces, namely
 kvm_vm_ioctl_get_dirty_log().  That sounds reasonable; book3s_pr seems
 to be the only caller that relies on kvm_get_dirty_log() validating the
 slot.  Thanks,
 
 Alex

Yep - so you can move the check to such userspace interfaces, and bug on 
on WARN otherwise (in id_to_memslot).

Does that make sense??



Re: [RFC PATCH 5/6] kvm: Re-introduce memslots->nmemslots

2012-12-06 Thread Marcelo Tosatti
On Thu, Dec 06, 2012 at 09:58:48PM -0200, Marcelo Tosatti wrote:
 On Wed, Dec 05, 2012 at 08:51:37PM -0700, Alex Williamson wrote:
id_to_memslot seems like a good place to catch all the users since
that's the only way to get a slot from a slot id after the array is
sorted.  We need to check both is the slot in bounds (EINVAL), but also
is it allocated (ENOENT).  id_to_memslot could both of these if we
wanted to switch it to ERR_PTR.  Thanks,

Alex
   
   There should never be a reference to a slot out of bounds by KVM itself
   (BUG_ON). Only userspace can attempt a reference to such slot.
  
  If I understand correctly, you're saying this last chunk is unique
  because kvm_get_dirty_log() is an internal interface and the test should
  be restricted to callers from userspace interfaces, namely
  kvm_vm_ioctl_get_dirty_log().  That sounds reasonable; book3s_pr seems
  to be the only caller that relies on kvm_get_dirty_log() validating the
  slot.  Thanks,
  
  Alex
 
 Yep - so you can move the check to such userspace interfaces, and bug on 
 on WARN otherwise (in id_to_memslot).

WARN_ON. The point is, if its not a valid condition, it should be
explicitly so.

 Does that make sense??
 


Re: [RFC PATCH 5/6] kvm: Re-introduce memslots->nmemslots

2012-12-06 Thread Alex Williamson
On Thu, 2012-12-06 at 21:59 -0200, Marcelo Tosatti wrote:
 On Thu, Dec 06, 2012 at 09:58:48PM -0200, Marcelo Tosatti wrote:
  On Wed, Dec 05, 2012 at 08:51:37PM -0700, Alex Williamson wrote:
 id_to_memslot seems like a good place to catch all the users since
 that's the only way to get a slot from a slot id after the array is
 sorted.  We need to check both is the slot in bounds (EINVAL), but 
 also
 is it allocated (ENOENT).  id_to_memslot could both of these if we
 wanted to switch it to ERR_PTR.  Thanks,
 
 Alex

There should never be a reference to a slot out of bounds by KVM itself
(BUG_ON). Only userspace can attempt a reference to such slot.
   
   If I understand correctly, you're saying this last chunk is unique
   because kvm_get_dirty_log() is an internal interface and the test should
   be restricted to callers from userspace interfaces, namely
   kvm_vm_ioctl_get_dirty_log().  That sounds reasonable; book3s_pr seems
   to be the only caller that relies on kvm_get_dirty_log() validating the
   slot.  Thanks,
   
   Alex
  
  Yep - so you can move the check to such userspace interfaces, and bug on 
  on WARN otherwise (in id_to_memslot).
 
 WARN_ON. The point is, if its not a valid condition, it should be
 explicitly so.
 
  Does that make sense??

Yep, I'll add that if we decide to go that route.  This patch isn't
necessary with the series I just posted since the array is still static.
Thanks,

Alex




RE: [PATCH v3 3/4] x86, apicv: add virtual interrupt delivery support

2012-12-06 Thread Zhang, Yang Z
Marcelo Tosatti wrote on 2012-12-07:
 On Thu, Dec 06, 2012 at 08:36:52AM +0200, Gleb Natapov wrote:
 On Thu, Dec 06, 2012 at 05:02:15AM +, Zhang, Yang Z wrote:
 Zhang, Yang Z wrote on 2012-12-06:
 Marcelo Tosatti wrote on 2012-12-06:
 On Mon, Dec 03, 2012 at 03:01:03PM +0800, Yang Zhang wrote:
Virtual interrupt delivery avoids the need for KVM to inject vAPIC
interrupts manually; this is fully taken care of by the hardware. It needs
some special awareness in the existing interrupt injection path:
 
 - For a pending interrupt, instead of direct injection, we may need to
   update architecture-specific indicators before resuming to the guest.
 - A pending interrupt which is masked by the ISR should also be
   considered in the above update action, since hardware will decide when
   to inject it at the right time. The current has_interrupt and
   get_interrupt only return a valid vector from the injection p.o.v.
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 Signed-off-by: Kevin Tian kevin.t...@intel.com
 ---
  arch/x86/include/asm/kvm_host.h |    4 +
  arch/x86/include/asm/vmx.h      |   11 +++
  arch/x86/kvm/irq.c              |   53 ++-
  arch/x86/kvm/lapic.c            |   56 +---
  arch/x86/kvm/lapic.h            |    6 ++
  arch/x86/kvm/svm.c              |   19 +
  arch/x86/kvm/vmx.c              |  140 ++-
  arch/x86/kvm/x86.c              |   34 --
  virt/kvm/ioapic.c               |    1 +
  9 files changed, 291 insertions(+), 33 deletions(-)
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index dc87b65..e5352c8 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -697,6 +697,10 @@ struct kvm_x86_ops {
   void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
   void (*enable_irq_window)(struct kvm_vcpu *vcpu);
   void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
 + int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
 + void (*update_irq)(struct kvm_vcpu *vcpu);
 + void (*set_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector,
 +     int trig_mode, int always_set);
   int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
   int (*get_tdp_level)(void);
   u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
 diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
 index 21101b6..1003341 100644
 --- a/arch/x86/include/asm/vmx.h
 +++ b/arch/x86/include/asm/vmx.h
 @@ -62,6 +62,7 @@
  #define EXIT_REASON_MCE_DURING_VMENTRY  41
  #define EXIT_REASON_TPR_BELOW_THRESHOLD 43
  #define EXIT_REASON_APIC_ACCESS         44
 +#define EXIT_REASON_EOI_INDUCED         45
  #define EXIT_REASON_EPT_VIOLATION       48
  #define EXIT_REASON_EPT_MISCONFIG       49
  #define EXIT_REASON_WBINVD              54
 @@ -143,6 +144,7 @@
  #define SECONDARY_EXEC_WBINVD_EXITING           0x0040
  #define SECONDARY_EXEC_UNRESTRICTED_GUEST       0x0080
  #define SECONDARY_EXEC_APIC_REGISTER_VIRT       0x0100
 +#define SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY    0x0200
  #define SECONDARY_EXEC_PAUSE_LOOP_EXITING       0x0400
  #define SECONDARY_EXEC_ENABLE_INVPCID           0x1000
 @@ -180,6 +182,7 @@ enum vmcs_field {
   GUEST_GS_SELECTOR               = 0x080a,
   GUEST_LDTR_SELECTOR             = 0x080c,
   GUEST_TR_SELECTOR               = 0x080e,
 + GUEST_INTR_STATUS               = 0x0810,
   HOST_ES_SELECTOR                = 0x0c00,
   HOST_CS_SELECTOR                = 0x0c02,
   HOST_SS_SELECTOR                = 0x0c04,
 @@ -207,6 +210,14 @@ enum vmcs_field {
   APIC_ACCESS_ADDR_HIGH           = 0x2015,
   EPT_POINTER                     = 0x201a,
   EPT_POINTER_HIGH                = 0x201b,
 + EOI_EXIT_BITMAP0                = 0x201c,
 + EOI_EXIT_BITMAP0_HIGH           = 0x201d,
 + EOI_EXIT_BITMAP1                = 0x201e,
 + EOI_EXIT_BITMAP1_HIGH           = 0x201f,
 + EOI_EXIT_BITMAP2                = 0x2020,
 + EOI_EXIT_BITMAP2_HIGH           = 0x2021,
 + EOI_EXIT_BITMAP3                = 0x2022,
 + EOI_EXIT_BITMAP3_HIGH           = 0x2023,
   GUEST_PHYSICAL_ADDRESS          = 0x2400,
   GUEST_PHYSICAL_ADDRESS_HIGH     = 0x2401,
   VMCS_LINK_POINTER               = 0x2800,
 diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
 index 7e06ba1..f782788 100644
 --- a/arch/x86/kvm/irq.c
 +++ b/arch/x86/kvm/irq.c
 @@ -43,45 +43,64 @@ EXPORT_SYMBOL(kvm_cpu_has_pending_timer);
   */
  int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
  {
-       struct kvm_pic *s;
-
        if (!irqchip_in_kernel(v->kvm))
                return v->arch.interrupt.pending;
-       if (kvm_apic_has_interrupt(v) == -1) {  /* LAPIC */
-               if (kvm_apic_accept_pic_intr(v)) {
-                       s = pic_irqchip(v->kvm);

Re: [PATCHv5] virtio-spec: virtio network device RFS support

2012-12-06 Thread Rusty Russell
Ben Hutchings bhutchi...@solarflare.com writes:
  If you want a name for the whole set of
 features involved, I don't see any better name than 'multiqueue'/'MQ'.

OK, let's go back to multiqueue then, and perhaps refer to the current
receive steering as 'automatic'.

Cheers,
Rusty.


[PATCH] x86/kexec: crash_vmclear_local_vmcss needs __rcu

2012-12-06 Thread Zhang Yanfei
This removes the sparse warning:
arch/x86/kernel/crash.c:49:32: sparse: incompatible types in comparison expression (different address spaces)

Reported-by: kbuild test robot fengguang...@intel.com
Signed-off-by: Zhang Yanfei zhangyan...@cn.fujitsu.com
---
 arch/x86/include/asm/kexec.h |4 +++-
 arch/x86/kernel/crash.c  |4 ++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 28feeba..16882cd 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -163,7 +163,9 @@ struct kimage_arch {
 };
 #endif
 
-extern void (*crash_vmclear_loaded_vmcss)(void);
+extern void __rcu (*crash_vmclear_loaded_vmcss)(void);
+#define vmclear_func_rcu(vmclear_func) \
+   ((void (*)(void)) rcu_dereference(vmclear_func))
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 2f6b8e8..50ce1d6 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -38,7 +38,7 @@ int in_crash_kexec;
  *
  * protected by rcu.
  */
-void (*crash_vmclear_loaded_vmcss)(void) = NULL;
+void __rcu (*crash_vmclear_loaded_vmcss)(void) = NULL;
 EXPORT_SYMBOL_GPL(crash_vmclear_loaded_vmcss);
 
 static inline void cpu_crash_vmclear_loaded_vmcss(void)
@@ -46,7 +46,7 @@ static inline void cpu_crash_vmclear_loaded_vmcss(void)
void (*do_vmclear_operation)(void) = NULL;
 
rcu_read_lock();
-   do_vmclear_operation = rcu_dereference(crash_vmclear_loaded_vmcss);
+   do_vmclear_operation = vmclear_func_rcu(crash_vmclear_loaded_vmcss);
if (do_vmclear_operation)
do_vmclear_operation();
rcu_read_unlock();
-- 
1.7.1


Re: [PATCH] vhost-blk: Add vhost-blk support v6

2012-12-06 Thread Asias He
On 12/06/2012 09:00 PM, Michael S. Tsirkin wrote:
 On Sun, Dec 02, 2012 at 09:33:53AM +0800, Asias He wrote:
 diff --git a/drivers/vhost/Kconfig.blk b/drivers/vhost/Kconfig.blk
 new file mode 100644
 index 000..ff8ab76
 --- /dev/null
 +++ b/drivers/vhost/Kconfig.blk
 @@ -0,0 +1,10 @@
 +config VHOST_BLK
 +tristate "Host kernel accelerator for virtio blk (EXPERIMENTAL)"
 +depends on BLOCK && EXPERIMENTAL && m
 
 
 should depend on eventfd as well.

Okay, added.

-- 
Asias


Re: [PATCH v3 3/4] x86, apicv: add virtual interrupt delivery support

2012-12-06 Thread Gleb Natapov
On Fri, Dec 07, 2012 at 01:00:18AM +, Zhang, Yang Z wrote:
 Marcelo Tosatti wrote on 2012-12-07:
 How about recalculating irr_pending according to the VIRR on each vmexit?
  
  No need really. Since HW can only clear VIRR the only situation that may
  happen is that irr_pending will be true but VIRR is empty and
  apic_find_highest_irr() will return correct result in this case.
  
  If we will see a lot of unneeded irr scans because of stale irr_pending
  value we can do irr_pending = rvi != 0 on vmexit.
  
  --
 Gleb.
  
 Yes, it is harmless ATM. But it's clearer if irr_pending is not used when
  virtual interrupt delivery is in effect (that is, just skip irr_pending
  if virtual interrupt delivery is enabled).
 irr_pending is still useful in the virtual interrupt delivery case. Or else, as
 Gleb said, there may be lots of unneeded IRR scans.
 
Actually, looking at it closely, irr_pending will always be true (and
thus effectively disabled without any additional checks) since it is
cleared only in kvm_get_apic_interrupt(), which will not be called with
vid enabled. Doing irr_pending = rvi != 0 on vmexit is racy as well.
The code should be something like that:
 irr_pending = (rvi != 0) ? : apic_search_irr(apic) != -1;
But we do not want to do that on each exit since rvi will be mostly
zero and irr is, more often than not, empty.
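Spelled out, the proposed recomputation checks RVI first and falls back to a VIRR scan only when RVI is zero. A standalone sketch, assuming a VIRR laid out as eight 32-bit words (illustrative only, not the KVM data structures):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* irr_pending = (rvi != 0) ?: apic_search_irr(apic) != -1, spelled out. */
static bool recompute_irr_pending(uint8_t rvi, const uint32_t virr[8])
{
    if (rvi)
        return true;            /* hardware still reports a pending vector */
    for (int i = 0; i < 8; i++) /* otherwise fall back to a VIRR scan */
        if (virr[i])
            return true;
    return false;
}
```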

--
Gleb.


Re: KVM: x86: fix mov immediate emulation for 64-bit operands

2012-12-06 Thread Gleb Natapov
On Thu, Dec 06, 2012 at 09:55:10PM -0200, Marcelo Tosatti wrote:
 
 From: Nadav Amit nadav.a...@gmail.com
 
 MOV immediate instruction (opcodes 0xB8-0xBF) may take 64-bit operand.
 The previous emulation implementation assumed the operand was no longer than
 32 bits.
 Adding OpImm64 for this matter.
 
 Fixes https://bugzilla.redhat.com/show_bug.cgi?id=881579
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
Needs author's sign-off and test case.
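For reference, the decode widths involved can be sketched in userspace C
(a hypothetical helper mirroring decode_imm()'s switch, not the emulator's
actual insn_fetch() path; assumes a little-endian host, as on x86):

```c
#include <stdint.h>
#include <string.h>

/* Fetch a sign-extended immediate of 1, 2, 4 or 8 bytes from an
 * instruction stream. With REX.W, opcodes 0xB8-0xBF carry a full
 * 8-byte immediate, which the old 4-byte fetch silently truncated. */
static int64_t fetch_simm(const uint8_t *p, int bytes)
{
    switch (bytes) {
    case 1: { int8_t  v; memcpy(&v, p, 1); return v; }
    case 2: { int16_t v; memcpy(&v, p, 2); return v; }
    case 4: { int32_t v; memcpy(&v, p, 4); return v; }
    case 8: { int64_t v; memcpy(&v, p, 8); return v; }
    }
    return 0;
}
```

For example, the immediate bytes of movabs $0x0123456789abcdef,%rax read
correctly at 8 bytes, while a 4-byte sign-extended fetch yields the wrong
value 0xffffffff89abcdef — the bug this patch fixes.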

 diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
 index 39171cb..6fec09c 100644
 --- a/arch/x86/kvm/emulate.c
 +++ b/arch/x86/kvm/emulate.c
 @@ -43,7 +43,7 @@
  #define OpCL   9ull  /* CL register (for shifts) */
  #define OpImmByte 10ull  /* 8-bit sign extended immediate */
  #define OpOne 11ull  /* Implied 1 */
 -#define OpImm 12ull  /* Sign extended immediate */
 +#define OpImm 12ull  /* Sign extended up to 32-bit immediate */
  #define OpMem16   13ull  /* Memory operand (16-bit). */
  #define OpMem32   14ull  /* Memory operand (32-bit). */
  #define OpImmU15ull  /* Immediate operand, zero extended */
 @@ -58,6 +58,7 @@
  #define OpFS  24ull  /* FS */
  #define OpGS  25ull  /* GS */
  #define OpMem826ull  /* 8-bit zero extended memory operand */
 +#define OpImm64   27ull  /* Sign extended 16/32/64-bit immediate */
  
  #define OpBits 5  /* Width of operand field */
   #define OpMask ((1ull << OpBits) - 1)
 @@ -101,6 +102,7 @@
   #define SrcMemFAddr (OpMemFAddr << SrcShift)
   #define SrcAcc  (OpAcc << SrcShift)
   #define SrcImmU16   (OpImmU16 << SrcShift)
  +#define SrcImm64    (OpImm64 << SrcShift)
   #define SrcDX   (OpDX << SrcShift)
   #define SrcMem8 (OpMem8 << SrcShift)
   #define SrcMask (OpMask << SrcShift)
 @@ -3786,7 +3788,7 @@ static const struct opcode opcode_table[256] = {
   /* 0xB0 - 0xB7 */
   X8(I(ByteOp | DstReg | SrcImm | Mov, em_mov)),
   /* 0xB8 - 0xBF */
 - X8(I(DstReg | SrcImm | Mov, em_mov)),
 + X8(I(DstReg | SrcImm64 | Mov, em_mov)),
   /* 0xC0 - 0xC7 */
   D2bv(DstMem | SrcImmByte | ModRM),
   I(ImplicitOps | Stack | SrcImmU16, em_ret_near_imm),
  @@ -3950,6 +3952,9 @@ static int decode_imm(struct x86_emulate_ctxt *ctxt, struct operand *op,
   case 4:
    op->val = insn_fetch(s32, ctxt);
    break;
  + case 8:
  + op->val = insn_fetch(s64, ctxt);
  + break;
   }
   if (!sign_extension) {
    switch (op->bytes) {
  @@ -4028,6 +4033,9 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, struct operand *op,
   case OpImm:
   rc = decode_imm(ctxt, op, imm_size(ctxt), true);
   break;
 + case OpImm64:
  + rc = decode_imm(ctxt, op, ctxt->op_bytes, true);
 + break;
   case OpMem8:
    ctxt->memop.bytes = 1;
   goto mem_common;

--
Gleb.


[PATCH] vfio powerpc: implemented IOMMU driver for VFIO

2012-12-06 Thread Alexey Kardashevskiy
VFIO implements the platform-independent parts such as
the PCI driver, BAR access (via read/write on a file descriptor
or direct mapping when possible) and IRQ signaling.

The platform-dependent part includes IOMMU initialization
and handling. This patch implements an IOMMU driver for VFIO
which maps/unmaps pages for guest IO and
provides information about the DMA window (required by a POWERPC
guest).

The counterpart in QEMU is required to support this functionality.

Cc: David Gibson da...@gibson.dropbear.id.au
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 drivers/vfio/Kconfig|6 +
 drivers/vfio/Makefile   |1 +
 drivers/vfio/vfio_iommu_spapr_tce.c |  348 +++
 include/linux/vfio.h|   30 +++
 4 files changed, 385 insertions(+)
 create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 7cd5dec..b464687 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1
depends on VFIO
default n
 
+config VFIO_IOMMU_SPAPR_TCE
+   tristate
+   depends on VFIO && SPAPR_TCE_IOMMU
+   default n
+
 menuconfig VFIO
	tristate "VFIO Non-Privileged userspace driver framework"
depends on IOMMU_API
select VFIO_IOMMU_TYPE1 if X86
+   select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV
help
  VFIO provides a framework for secure userspace device drivers.
  See Documentation/vfio.txt for more details.
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 2398d4a..72bfabc 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,3 +1,4 @@
 obj-$(CONFIG_VFIO) += vfio.o
 obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
+obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_PCI) += pci/
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
new file mode 100644
index 000..b0f81fe
--- /dev/null
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -0,0 +1,348 @@
+/*
+ * VFIO: IOMMU DMA mapping support for TCE on POWER
+ *
+ * Copyright (C) 2012 IBM Corp.  All rights reserved.
+ * Author: Alexey Kardashevskiy a...@ozlabs.ru
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio_iommu_type1.c:
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ * Author: Alex Williamson alex.william...@redhat.com
+ */
+
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/err.h>
+#include <linux/vfio.h>
+#include <asm/iommu.h>
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "a...@ozlabs.ru"
+#define DRIVER_DESC     "VFIO IOMMU SPAPR TCE"
+
+static void tce_iommu_detach_group(void *iommu_data,
+   struct iommu_group *iommu_group);
+
+/*
+ * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
+ */
+
+/*
+ * This code handles mapping and unmapping of user data buffers
+ * into DMA'ble space using the IOMMU
+ */
+
+#define NPAGE_TO_SIZE(npage)   ((size_t)(npage) << PAGE_SHIFT)
+
+struct vwork {
+   struct mm_struct*mm;
+   longnpage;
+   struct work_struct  work;
+};
+
+/* delayed decrement/increment for locked_vm */
+static void lock_acct_bg(struct work_struct *work)
+{
+   struct vwork *vwork = container_of(work, struct vwork, work);
+   struct mm_struct *mm;
+
+   mm = vwork->mm;
+   down_write(&mm->mmap_sem);
+   mm->locked_vm += vwork->npage;
+   up_write(&mm->mmap_sem);
+   mmput(mm);
+   kfree(vwork);
+}
+
+static void lock_acct(long npage)
+{
+   struct vwork *vwork;
+   struct mm_struct *mm;
+
+   if (!current->mm)
+   return; /* process exited */
+
+   if (down_write_trylock(&current->mm->mmap_sem)) {
+   current->mm->locked_vm += npage;
+   up_write(&current->mm->mmap_sem);
+   return;
+   }
+
+   /*
+* Couldn't get mmap_sem lock, so must setup to update
+* mm->locked_vm later. If locked_vm were atomic, we
+* wouldn't need this silliness
+*/
+   vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
+   if (!vwork)
+   return;
+   mm = get_task_mm(current);
+   if (!mm) {
+   kfree(vwork);
+   return;
+   }
+   INIT_WORK(&vwork->work, lock_acct_bg);
+   vwork->mm = mm;
+   vwork->npage = npage;
+   schedule_work(&vwork->work);
+}
+
+/*
+ * The container descriptor supports only a single group per container.
+ * Required by the API as the container is not supplied with the IOMMU group
+ * at the moment of initialization.
+ */
+struct tce_container {
+   struct mutex lock;
+   struct iommu_table *tbl;
+};
+
+static void 
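The trylock-or-defer accounting pattern used by lock_acct() above can be
sketched in userspace pthreads (illustrative only; the kernel defers via a
workqueue, here the deferred delta is drained by a hand-rolled worker):

```c
#include <pthread.h>

/* Userspace analogue of lock_acct(): update the counter inline when the
 * lock is free; if it is contended, remember the delta so a background
 * worker can apply it later. */
static pthread_mutex_t acct_lock = PTHREAD_MUTEX_INITIALIZER;
static long locked_vm;   /* protected by acct_lock */
static long deferred;    /* deltas waiting for the background worker */

static void lock_acct(long npage)
{
    if (pthread_mutex_trylock(&acct_lock) == 0) {
        locked_vm += npage;
        pthread_mutex_unlock(&acct_lock);
        return;
    }
    /* contended: defer, as the kernel does with schedule_work() */
    __atomic_add_fetch(&deferred, npage, __ATOMIC_RELAXED);
}

static void lock_acct_bg(void)   /* the deferred worker */
{
    long delta = __atomic_exchange_n(&deferred, 0, __ATOMIC_RELAXED);

    pthread_mutex_lock(&acct_lock);
    locked_vm += delta;
    pthread_mutex_unlock(&acct_lock);
}
```

As the in-code comment in the patch notes, the whole dance exists only
because locked_vm is protected by mmap_sem rather than being atomic.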

[PATCH] vfio powerpc: enabled on powernv platform

2012-12-06 Thread Alexey Kardashevskiy
This patch initializes IOMMU groups based on the IOMMU
configuration discovered during the PCI scan on POWERNV
(POWER non virtualized) platform. The IOMMU groups are
to be used later by the VFIO driver (PCI passthrough).

It also implements an API for mapping/unmapping pages for
guest PCI drivers and providing DMA window properties.
This API is going to be used later by QEMU-VFIO to handle
h_put_tce hypercalls from the KVM guest.

Although this driver has been tested only on the POWERNV
platform, it should work on any platform which supports
TCE tables.

To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config
option and configure VFIO as required.

Cc: David Gibson da...@gibson.dropbear.id.au
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/include/asm/iommu.h |   10 ++
 arch/powerpc/kernel/iommu.c  |  214 ++
 arch/powerpc/platforms/powernv/pci.c |  134 +
 drivers/iommu/Kconfig|8 ++
 4 files changed, 366 insertions(+)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index cbfe678..be3b11b 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -76,6 +76,9 @@ struct iommu_table {
struct iommu_pool large_pool;
struct iommu_pool pools[IOMMU_NR_POOLS];
unsigned long *it_map;   /* A simple allocation bitmap for now */
+#ifdef CONFIG_IOMMU_API
+   struct iommu_group *it_group;
+#endif
 };
 
 struct scatterlist;
@@ -147,5 +150,12 @@ static inline void iommu_restore(void)
 }
 #endif
 
+extern void iommu_reset_table(struct iommu_table *tbl, bool release);
+extern long iommu_clear_tces(struct iommu_table *tbl, unsigned long entry,
+   unsigned long pages);
+extern long iommu_put_tces(struct iommu_table *tbl, unsigned long entry,
+   uint64_t tce, enum dma_data_direction direction,
+   unsigned long pages);
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_IOMMU_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ff5a6ce..123431a 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -44,6 +44,7 @@
 #include <asm/kdump.h>
 #include <asm/fadump.h>
 #include <asm/vio.h>
+#include <asm/tce.h>
 
 #define DBG(...)
 
@@ -856,3 +857,216 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
free_pages((unsigned long)vaddr, get_order(size));
}
 }
+
+#ifdef CONFIG_IOMMU_API
+/*
+ * SPAPR TCE API
+ */
+
+/*
+ * iommu_reset_table is called when the table starts/stops being used
+ */
+void iommu_reset_table(struct iommu_table *tbl, bool release)
+{
+   /*
+* Page at 0 is marked as used in iommu_init_table,
+* so here we clear it when called with release=false...
+*/
+   if (!release && (tbl->it_offset == 0))
+   clear_bit(0, tbl->it_map);
+
+   iommu_clear_tces(tbl, tbl->it_offset, tbl->it_size);
+
+   memset(tbl->it_map, 0, (tbl->it_size + 7) >> 3);
+
+   /*
+* ... or restore when release=true
+*/
+   if (release && (tbl->it_offset == 0))
+   set_bit(0, tbl->it_map);
+}
+EXPORT_SYMBOL_GPL(iommu_reset_table);
+
+/*
+ * Returns the number of used IOMMU pages (4K) within
+ * the same system page (4K or 64K).
+ * bitmap_weight is not used as it does not support bigendian maps.
+ * offset is an IOMMU page number relative to DMA window start.
+ */
+static int syspage_weight(unsigned long *map, unsigned long offset)
+{
+   int ret = 0, nbits = PAGE_SIZE/IOMMU_PAGE_SIZE;
+
+   /* Aligns TCE entry number to system page boundary */
+   offset &= PAGE_MASK >> IOMMU_PAGE_SHIFT;
+
+   /* Count used 4K pages */
+   while (nbits) {
+   if (test_bit(offset, map))
+   ++ret;
+   --nbits;
+   ++offset;
+   }
+
+   return ret;
+}
+
+static void tce_flush(struct iommu_table *tbl)
+{
+   /* Flush/invalidate TLB caches if necessary */
+   if (ppc_md.tce_flush)
+   ppc_md.tce_flush(tbl);
+
+   /* Make sure updates are seen by hardware */
+   mb();
+}
+
+/*
+ * iommu_clear_tces clears TCEs and returns the number of system pages
+ * which it called put_page() on
+ */
+static long clear_tces_nolock(struct iommu_table *tbl, unsigned long entry,
+   unsigned long pages)
+{
+   int i, retpages = 0, clr;
+   unsigned long oldtce, oldweight;
+   struct page *page;
+
+   for (i = 0; i < pages; ++i) {
+   if (!test_bit(entry + i - tbl->it_offset, tbl->it_map))
+   continue;
+
+   oldtce = ppc_md.tce_get(tbl, entry + i);
+   ppc_md.tce_free(tbl, entry + i, 1);
+
+   oldweight = syspage_weight(tbl->it_map,
+   entry + i - tbl->it_offset);
+   clr = __test_and_clear_bit(entry + i - tbl->it_offset,
+   tbl->it_map);
+
+ 
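The per-system-page counting done by syspage_weight() above can be sketched
as a userspace model (a byte-wise LSB-first bitmap, assuming 64K system pages
over 4K IOMMU pages; the kernel's bitmap layout and endianness handling differ,
which is exactly why the patch avoids bitmap_weight()):

```c
/* Count used 4K IOMMU pages within the 64K system page containing
 * `offset` (an IOMMU page number relative to the DMA window start). */
#define IOMMU_PAGES_PER_SYSPAGE 16   /* 64K / 4K */

static int syspage_weight(const unsigned char *map, unsigned long offset)
{
    int used = 0;

    /* align to the start of the enclosing system page */
    offset &= ~(unsigned long)(IOMMU_PAGES_PER_SYSPAGE - 1);

    for (int i = 0; i < IOMMU_PAGES_PER_SYSPAGE; i++, offset++)
        if (map[offset / 8] & (1u << (offset % 8)))
            used++;
    return used;
}
```

The weight tells the caller whether clearing one 4K TCE releases the whole
system page (weight dropping to zero) and so whether put_page() accounting
applies.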

[PATCH v4 1/2] x86, apicv: add APICv register virtualization support

2012-12-06 Thread Yang Zhang
- APIC read doesn't cause VM-Exit
- APIC write becomes trap-like

Signed-off-by: Kevin Tian kevin.t...@intel.com
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
---
 arch/x86/include/asm/vmx.h |2 ++
 arch/x86/kvm/lapic.c   |   15 +++
 arch/x86/kvm/lapic.h   |2 ++
 arch/x86/kvm/vmx.c |   33 -
 4 files changed, 51 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 36ec21c..21101b6 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -66,6 +66,7 @@
 #define EXIT_REASON_EPT_MISCONFIG   49
 #define EXIT_REASON_WBINVD  54
 #define EXIT_REASON_XSETBV  55
+#define EXIT_REASON_APIC_WRITE  56
 #define EXIT_REASON_INVPCID 58
 
 #define VMX_EXIT_REASONS \
@@ -141,6 +142,7 @@
 #define SECONDARY_EXEC_ENABLE_VPID  0x0020
 #define SECONDARY_EXEC_WBINVD_EXITING  0x0040
 #define SECONDARY_EXEC_UNRESTRICTED_GUEST  0x0080
+#define SECONDARY_EXEC_APIC_REGISTER_VIRT   0x0100
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING  0x0400
 #define SECONDARY_EXEC_ENABLE_INVPCID  0x1000
 
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 9392f52..0664c13 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1212,6 +1212,21 @@ void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
 
+/* emulate APIC access in a trap manner */
+void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
+{
+   u32 val = 0;
+
+   /* hw has done the conditional check and inst decode */
+   offset &= 0xff0;
+
+   apic_reg_read(vcpu->arch.apic, offset, 4, &val);
+
+   /* TODO: optimize to just emulate side effect w/o one more write */
+   apic_reg_write(vcpu->arch.apic, offset, val);
+}
+EXPORT_SYMBOL_GPL(kvm_apic_write_nodecode);
+
 void kvm_free_lapic(struct kvm_vcpu *vcpu)
 {
	struct kvm_lapic *apic = vcpu->arch.apic;
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index e5ebf9f..9a8ee22 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -64,6 +64,8 @@ int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu);
 u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
 void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
 
+void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
+
 void kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr);
 void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu);
 void kvm_lapic_sync_to_vapic(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2fd2046..4838e4f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -83,6 +83,9 @@ module_param(vmm_exclusive, bool, S_IRUGO);
 static bool __read_mostly fasteoi = 1;
 module_param(fasteoi, bool, S_IRUGO);
 
+static bool __read_mostly enable_apicv_reg_vid;
+module_param(enable_apicv_reg_vid, bool, S_IRUGO);
+
 /*
  * If nested=1, nested virtualization is supported, i.e., guests may use
  * VMX and be a hypervisor for its own guests. If nested=0, guests may not
@@ -761,6 +764,12 @@ static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
 }
 
+static inline bool cpu_has_vmx_apic_register_virt(void)
+{
+   return vmcs_config.cpu_based_2nd_exec_ctrl &
+   SECONDARY_EXEC_APIC_REGISTER_VIRT;
+}
+
 static inline bool cpu_has_vmx_flexpriority(void)
 {
	return cpu_has_vmx_tpr_shadow() &&
@@ -2498,7 +2507,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
SECONDARY_EXEC_UNRESTRICTED_GUEST |
SECONDARY_EXEC_PAUSE_LOOP_EXITING |
SECONDARY_EXEC_RDTSCP |
-   SECONDARY_EXEC_ENABLE_INVPCID;
+   SECONDARY_EXEC_ENABLE_INVPCID |
+   SECONDARY_EXEC_APIC_REGISTER_VIRT;
if (adjust_vmx_controls(min2, opt2,
MSR_IA32_VMX_PROCBASED_CTLS2,
	&_cpu_based_2nd_exec_control) < 0)
@@ -2509,6 +2519,11 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
	_cpu_based_exec_control &= ~CPU_BASED_TPR_SHADOW;
 #endif
+
+   if (!(_cpu_based_exec_control & CPU_BASED_TPR_SHADOW))
+   _cpu_based_2nd_exec_control &= ~(
+   SECONDARY_EXEC_APIC_REGISTER_VIRT);
+
	if (_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_EPT) {
/* CR3 accesses and invlpg don't need to cause VM Exits when EPT
   enabled */
@@ -2706,6 +2721,9 @@ static __init int hardware_setup(void)
if (!cpu_has_vmx_ple())
ple_gap = 0;
 
+   if (!cpu_has_vmx_apic_register_virt())
+   

[PATCH v4 2/2] x86, apicv: add virtual interrupt delivery support

2012-12-06 Thread Yang Zhang
Virtual interrupt delivery avoids having KVM inject vAPIC interrupts
manually; this is fully taken care of by the hardware. It needs
some special awareness in the existing interrupt injection path:

- for a pending interrupt, instead of direct injection, we may need to
  update architecture-specific indicators before resuming the guest.

- A pending interrupt which is masked by the ISR should also be
  considered in the above update action, since hardware will decide
  when to inject it at the right time. Currently has_interrupt and
  get_interrupt only return a valid vector from the injection p.o.v.

Signed-off-by: Kevin Tian kevin.t...@intel.com
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
---
 arch/x86/include/asm/kvm_host.h |4 +
 arch/x86/include/asm/vmx.h  |   11 
 arch/x86/kvm/irq.c  |   79 +++-
 arch/x86/kvm/lapic.c|  101 +---
 arch/x86/kvm/lapic.h|   11 
 arch/x86/kvm/svm.c  |   19 ++
 arch/x86/kvm/vmx.c  |  124 +-
 arch/x86/kvm/x86.c  |   18 --
 virt/kvm/ioapic.c   |   35 +++
 virt/kvm/ioapic.h   |1 +
 10 files changed, 366 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dc87b65..7e26d1a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -697,6 +697,9 @@ struct kvm_x86_ops {
void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
void (*enable_irq_window)(struct kvm_vcpu *vcpu);
void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
+   int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
+   void (*update_irq)(struct kvm_vcpu *vcpu);
+   void (*update_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector, bool set);
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
int (*get_tdp_level)(void);
u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
@@ -991,6 +994,7 @@ int kvm_age_hva(struct kvm *kvm, unsigned long hva);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
+int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu);
 int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 21101b6..1003341 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -62,6 +62,7 @@
 #define EXIT_REASON_MCE_DURING_VMENTRY  41
 #define EXIT_REASON_TPR_BELOW_THRESHOLD 43
 #define EXIT_REASON_APIC_ACCESS 44
+#define EXIT_REASON_EOI_INDUCED 45
 #define EXIT_REASON_EPT_VIOLATION   48
 #define EXIT_REASON_EPT_MISCONFIG   49
 #define EXIT_REASON_WBINVD  54
@@ -143,6 +144,7 @@
 #define SECONDARY_EXEC_WBINVD_EXITING  0x0040
 #define SECONDARY_EXEC_UNRESTRICTED_GUEST  0x0080
 #define SECONDARY_EXEC_APIC_REGISTER_VIRT   0x0100
+#define SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY0x0200
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING  0x0400
 #define SECONDARY_EXEC_ENABLE_INVPCID  0x1000
 
@@ -180,6 +182,7 @@ enum vmcs_field {
GUEST_GS_SELECTOR   = 0x080a,
GUEST_LDTR_SELECTOR = 0x080c,
GUEST_TR_SELECTOR   = 0x080e,
+   GUEST_INTR_STATUS   = 0x0810,
HOST_ES_SELECTOR= 0x0c00,
HOST_CS_SELECTOR= 0x0c02,
HOST_SS_SELECTOR= 0x0c04,
@@ -207,6 +210,14 @@ enum vmcs_field {
APIC_ACCESS_ADDR_HIGH   = 0x2015,
EPT_POINTER = 0x201a,
EPT_POINTER_HIGH= 0x201b,
+   EOI_EXIT_BITMAP0= 0x201c,
+   EOI_EXIT_BITMAP0_HIGH   = 0x201d,
+   EOI_EXIT_BITMAP1= 0x201e,
+   EOI_EXIT_BITMAP1_HIGH   = 0x201f,
+   EOI_EXIT_BITMAP2= 0x2020,
+   EOI_EXIT_BITMAP2_HIGH   = 0x2021,
+   EOI_EXIT_BITMAP3= 0x2022,
+   EOI_EXIT_BITMAP3_HIGH   = 0x2023,
GUEST_PHYSICAL_ADDRESS  = 0x2400,
GUEST_PHYSICAL_ADDRESS_HIGH = 0x2401,
VMCS_LINK_POINTER   = 0x2800,
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 7e06ba1..5cbc631 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -38,50 +38,95 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
 EXPORT_SYMBOL(kvm_cpu_has_pending_timer);
 
 /*
+ * check if there is injectable interrupt:
+ * a. when virtual interrupt delivery enabled,
+ * interrupt from apic will handled 

[PATCH v4 0/2] x86, apicv: Add APIC virtualization support

2012-12-06 Thread Yang Zhang
APIC virtualization is a new feature which can eliminate most VM exits
when a vcpu handles an interrupt:

APIC register virtualization:
APIC read access doesn't cause APIC-access VM exits.
APIC write becomes trap-like.

Virtual interrupt delivery:
Virtual interrupt delivery avoids having KVM inject vAPIC interrupts
manually; this is fully taken care of by the hardware.

Please refer to Intel SDM volume 3, chapter 29 for more details.
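For intuition, the 256-vector EOI-exit bitmap spread across the four
EOI_EXIT_BITMAP VMCS fields can be modeled like this (a userspace sketch,
not KVM's actual update_eoi_exitmap path):

```c
#include <stdint.h>
#include <stdbool.h>

/* 256 interrupt vectors spread over four 64-bit VMCS fields
 * (EOI_EXIT_BITMAP0..3); vector N lives in word N/64, bit N%64.
 * A set bit forces an EOI-induced VM exit for that vector. */
static uint64_t eoi_exit_bitmap[4];

static void update_eoi_exitmap(int vector, bool set)
{
    uint64_t mask = 1ull << (vector & 63);

    if (set)
        eoi_exit_bitmap[vector >> 6] |= mask;
    else
        eoi_exit_bitmap[vector >> 6] &= ~mask;
}
```

Per the changelog above, bits are set for level-triggered interrupts (via the
TMR) and for vectors with EOI notifiers, so those EOIs still trap to KVM while
edge-triggered EOIs complete entirely in hardware.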
Changes v3 to v4:
 * Use one option to control both register virtualization and virtual interrupt
   delivery.
 * Update the EOI exit bitmap when programming the ioapic or the apic's
   id/dfr/ldr.
 * Rebased on top of KVM upstream.

Changes v2 to v3:
 * Drop Posted Interrupt patch from v3.
   According to Gleb's suggestion, we will use a global vector for all VCPUs as
   the notification event vector, so we will rewrite the Posted Interrupt patch
   and resend it later.
 * Use TMR to set the EOI exiting bitmap. We only want to set the EOI exiting
   bitmap for those interrupts which are level-triggered or have a notifier in
   the EOI write path, so the TMR is enough to distinguish the interrupt
   trigger mode.
 * Simplify some code according to Gleb's comments.
 * Rebased on top of KVM upstream.

Changes v1 to v2:
 * Add Posted Interrupt support in this patch series.
 * Since there is a notifier hook in vAPIC EOI for the PIT interrupt, always set
   the PIT interrupt in the EOI exit bitmap to force a vmexit on EOI.
 * Rebased on top of KVM upstream

Yang Zhang (2):
  x86, apicv: add APICv register virtualization support
  x86, apicv: add virtual interrupt delivery support

 arch/x86/include/asm/kvm_host.h |4 +
 arch/x86/include/asm/vmx.h  |   13 
 arch/x86/kvm/irq.c  |   79 -
 arch/x86/kvm/lapic.c|  116 ---
 arch/x86/kvm/lapic.h|   13 
 arch/x86/kvm/svm.c  |   19 +
 arch/x86/kvm/vmx.c  |  149 ++-
 arch/x86/kvm/x86.c  |   18 +++--
 virt/kvm/ioapic.c   |   35 +
 virt/kvm/ioapic.h   |1 +
 10 files changed, 413 insertions(+), 34 deletions(-)
